frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.

Home Page: https://datapackage.org

License: The Unlicense

JavaScript 16.05% TypeScript 7.82% CSS 3.16% Astro 16.37% MDX 56.60%
csv data-science json metadata schema validation

datapackage's People

Contributors

akariv, amercader, benoitc, cpina, danfowler, dependabot[bot], domoritz, georgiana-b, lauragift21, ldodds, lwinfree, max-mapper, michaelamadi, micimize, mk270, monikappv, nichtich, nirabpudasaini, orihoch, paulfitz, peterdesmet, pwalsh, rgieseke, roll, rufuspollock, serahkiburu, spatchcock, stephen-gates, trestletech, vitorbaptista


datapackage's Issues

Discussion of Catalogs re Data Packages

We need to think further about this. The material below was removed from the current spec since it is not finalized.

Current Primary Proposal

Make your registry itself into a (tabular) Data Package. A real-life example is here:

https://github.com/datasets/registry

Here's the rough structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of the GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually
    the same as the name of the repository)
  • owner: the username of the owner of the package. For datasets on GitHub
    this will be the GitHub username

name and owner are both optional.
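A minimal sketch of consuming the proposed catalog.csv with Python's standard csv module, treating the optional name and owner columns as absent when empty. The sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical catalog.csv content following the proposed structure;
# name and owner are optional and may be left empty.
CATALOG_CSV = """url,name,owner
https://github.com/datasets/gdp,gdp,datasets
https://github.com/datasets/country-codes,,
"""

def read_catalog(text):
    """Parse a catalog.csv, mapping empty name/owner cells to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (value or None) for key, value in row.items()})
    return rows

entries = read_catalog(CATALOG_CSV)
```

Because only url is required, consumers should be prepared for missing name and owner values, as the None mapping above makes explicit.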


# OLD

Options

Option 1

[ 
   { data-package },
   { data-package }
]

Option 2

{ 
   dp-id: { data-package },
   dp-id: { data-package }
}

Option 3

 {
    dataPackageCatalogVersion: [an integer indicating version of the spec this corresponds to]
    dataPackages: 
      like option 1 or 2 ...
    ...
 }

Existing material

Catalogs and Discovery

In order to find Data Packages, tools may make use of a "consolidated" catalog,
either online or local.

A general specification for (online) Data Catalogs can be found at
http://spec.datacatalogs.org/.

For local catalogs on disk we suggest locating the file at "HOME/.dpm/catalog.json"
and giving it the following structure:

 {
    version: ...,
    datasets: {
      {name}: {
        {version}: {
          metadata: {metadata},
          bundles: [
            { url: ..., type: file | url | ckan | zip | tgz }
          ]
        }
      }
    }
 }

When Package metadata is added to the catalog, a field called bundle is added
pointing to a bundle for this item (see below for more on bundles).
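A minimal sketch, assuming the nested structure above, of how a tool might read the suggested local catalog. The dataset name, version, and bundle URL in the sample are invented for illustration.

```python
import os

# Invented sample matching the suggested catalog layout:
# version at the top, then datasets keyed by name, then by version.
SAMPLE_CATALOG = {
    "version": "1.0",
    "datasets": {
        "gdp": {
            "0.1.0": {
                "metadata": {"title": "Country GDP"},
                "bundles": [
                    {"url": "https://example.org/gdp.zip", "type": "zip"}
                ],
            }
        }
    },
}

def list_datasets(catalog):
    """Return (name, version) pairs for every packaged dataset."""
    return [
        (name, version)
        for name, versions in catalog.get("datasets", {}).items()
        for version in versions
    ]

def catalog_path(home=None):
    """The on-disk location suggested by the draft above."""
    return os.path.join(home or os.path.expanduser("~"), ".dpm", "catalog.json")
```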

Hyper Text Query Language

In February (sorry for postponing this) I will write a page on hypertext query languages such as HTSQL.

Grammar issues in CouchDB replication doc

The CouchDB replication protocol documentation has a lot of grammatical errors and typos throughout. It needs review by a native English speaker, or should just be replaced with a link to the source document it's based on.

Since the document was clearly based on my TouchDB replication document, I wish that you had simply copied the text wholesale (after asking me) instead of rewriting it. That way it would at least be grammatical. Even better would be if you simply linked to my document — as it is, as I improve or fix my document, yours becomes out of date. I'm not sure why you felt you needed to have your own, since it doesn't add anything to mine.

At a higher level, I don't think it makes sense to try to standardize this protocol on its own. It's really just a specific usage of the entire CouchDB API and data model, and doesn't stand on its own without them. (Not to mention that the idea of CouchDB specs being written by people not associated with CouchDB is rather weird.)

Clarify DataPackage File Info Section

I'm unclear on the File Info Section on the Data Package documentation. Specifically:

  1. Is a name/ID not a required/suggested attribute for a file? #23 seems to imply that this information would be required for each file. Of course, if files were a hash, each file element could be named explicitly in a pretty natural way. Currently, it's listed as an array.
  2. What combination of data is expected/required? Would the schema key be used only if the file were in the JSON Table Schema format, while the dialect key would be used for CSVs? Or should they both always be supplied?

I'm working on the ability to retrieve data files from a specified Data Package JSON file in the R client and need to figure these two things out in order to proceed.

SLEEP: multiple sequences with the same id

I think the SLEEP spec needs some clarification around what to do in the case of multiple sequences with the same id.

The spec makes reference to them being unique, but in the case of CouchDB's _changes?feed=continuous that is actually not the case: as the database is updated, later sequences will appear with the same ids as previous sequences when those documents are updated.

This is particularly important to me as mikeal/couchup keeps prior sequences around until compaction. The sparse sequence CouchDB maintains on write is really just an optimization.

Move site to gh-pages

ReadTheDocs does not buy us much here and results in /en/latest in every URL (and we aren't going to version anyway ...)

The `name` field SHOULD NOT include version/time information

The name field of a data package is specified to be unique, but not to be permanent. If it may change between versions/updates of a dataset, that complicates maintenance of links and synchronization/replication. This happens not just by explicit versioning in the name, but also by generating it from a dataset title, since titles very commonly include a year range: world-gdp-1944-to-2012.

So I suggest that the spec state that:

  • name SHOULD be permanent, perhaps with some elaboration like “unless the new package version should be considered a distinct package, e.g. due to significant changes in structure or interpretation”.
  • version distinction SHOULD be left solely to the version field
  • information about time range coverage SHOULD also be left out of the name. (Perhaps also suggest a specific attribute for that information, like time-coverage ... but that's a separate issue)
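A hypothetical lint following the suggestion above: flag package names that appear to embed a year range or a version number, which should live in the version field or a separate coverage attribute instead. The regexes and warning wording are illustrative, not part of any spec.

```python
import re

# Heuristics only: a 4-digit year segment (1800-2099) or a dotted
# version segment inside a hyphenated name.
YEAR = re.compile(r"(?:^|-)(1[89]\d{2}|20\d{2})(?:-|$)")
VERSION = re.compile(r"(?:^|-)v?\d+\.\d+(?:\.\d+)?(?:-|$)")

def name_warnings(name):
    """Return warning strings for names that embed time or version info."""
    warnings = []
    if YEAR.search(name):
        warnings.append("name appears to embed a year; keep time coverage out of `name`")
    if VERSION.search(name):
        warnings.append("name appears to embed a version; use the `version` field")
    return warnings
```

A tool could surface these as non-fatal warnings, since the suggestion is a SHOULD rather than a MUST.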

Closer alignment with JSON Schema

It seems to me like you can use JSON Schema (or something very close) to describe CSVs just as well as JSON files, where instead of:

{
"fields": [{
    "id": "foo",
    "label": "bar",
    "description": "...",
    "type": "date",
    "format": "YYYY.MM.DD"
  }, {
    ...
  }]
}

You'd have:

{
"$schema": "http://link-to-dataprotocols-dialect-of-json-schema.tld/path",
"properties": {
    "foo": {
      "title": "bar",
      "description": "...",
      "type": "date",
      "format": "YYYY.MM.DD"
    },
    "baz": {
      ...
    },
    ...
  }
}

format has a different meaning in JSON Schema, so maybe mint a scheme property.

JSON Schema is already well supported, so I figure it would be a benefit to try to be as close to it as possible.

JTS - Method for describing units for a field

Say I have real GDP in 2009 £m (i.e. in millions of £, in 2009 prices); I have no way to specify this.

Propose two new fields:

  • "scaler" attribute. Value is a number. The default value is 1, but for £m it would be 1m, i.e. 1000000
  • "unit" attribute whose value is a hash with following properties:
{
   type: "currency",
   value: "GBP",
   # base date in iso 8601 format
   date: "2009"
}   
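A sketch of applying the proposed "scaler" and "unit" attributes when reading values. The field descriptor shape follows the proposal above; none of these names are part of the current spec.

```python
# Invented field descriptor using the proposed attributes:
# stored values are in millions of 2009 GBP.
FIELD = {
    "id": "real_gdp",
    "type": "number",
    "scaler": 1_000_000,
    "unit": {"type": "currency", "value": "GBP", "date": "2009"},
}

def to_base_units(value, field):
    """Convert a stored cell value to base units using the scaler."""
    return value * field.get("scaler", 1)

def unit_label(field):
    """Render a human-readable unit label from the unit hash."""
    unit = field.get("unit", {})
    return f'{unit.get("value", "")} ({unit.get("date", "")})'.strip()
```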

Concerns

  • This has the potential for massively increased complexity
  • Would this not be better as part of a proper dimension description approach (keeping JTS simple)?
  • Further research is needed into existing work, e.g. SDMX

Flesh out on Refining / Reconciliation protocol section

@frabcus has stubbed a refining / reconciliation section named refining.rst (maybe better named reconciliation).

@pudo, would you be up for fleshing out this section based on your knowledge of Refine, Helmut, etc.? It could start with a summary of existing work, e.g. Google Refine, Helmut (and any references to algorithms), and then provide a proposal.

Support Categorical Variables in JSON Schema

We're currently evaluating how to support categorical "factors" (as named in R) in JSON Table Schema and would love to have an official structure for this in the spec.

I noticed that #23 didn't seem to go over very well, so maybe this similar concept is outside of the spec.

But we're thinking about something along the lines of:

"fields": [
  {
    "id": "color",
    "factor": {
      "1": "Blue",
      "2": "Black",
      "3": "Red"
    }
  }
]

I realize this is already a valid schema, I'm just wishing there were an officially sanctioned way to do this so that our R client implementation doesn't end up forking the spec by implementing this in a proprietary/unexpected way.
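A sketch of how a client might decode coded values using the "factor" mapping proposed above. The descriptor shape is the proposal's, not an official part of the spec.

```python
# Invented field descriptor following the proposed "factor" shape.
FIELD = {
    "id": "color",
    "factor": {"1": "Blue", "2": "Black", "3": "Red"},
}

def decode_cell(raw, field):
    """Map a raw coded value to its factor label, if the field has one.

    Unmapped values and fields without a factor pass through unchanged.
    """
    factor = field.get("factor")
    if factor is None:
        return raw
    return factor.get(str(raw), raw)
```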

Add predominant existing data packaging specifications

In the section describing the relationship to other data packaging specifications, the predominant specifications that are in widespread use in the science community are missing. For an overview of these, see the DataONE Data Packaging description. Here is a list of existing data package specifications that are mature and widely adopted:

[DataPackage] URL and Path Exclusivity on Resource

I'm finding that many packages on data.okfn.org (for example, this one) have both a path and a url specified for a single resource. http://github.com/QBRC/RODProt throws an error in such an instance, as it's unclear whether it should be using the path element (relative to the original URL of the datapackage.json file), or the absolute URL.

I suppose we could code in a check to see if they're equivalent, in which case there should be no error, but it seems like this is representing a more fundamental misunderstanding.

How do you resolve conflicts between URL and path for a single resource, and why aren't these mutually exclusive?

SLEEP / The Cut-Out

This protocol might be of interest in defining SLEEP: http://thecutout.org/protocol.html – it bears a lot of similarities, but with some extra pieces that I believe are necessary for completeness. It's not particularly dissimilar to CouchDB's protocol.

(Note: I'm the author of the protocol and project that implements both client and server.)

String Encoding

How should I specify in a schema that my string field is in an unusual character encoding?
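One way an implementation might honour a hypothetical per-field "encoding" property (not currently in the spec): decode raw bytes with Python's codec machinery, defaulting to UTF-8.

```python
# Invented field descriptor with a hypothetical "encoding" property.
FIELD = {"id": "city", "type": "string", "encoding": "latin-1"}

def decode_field(raw_bytes, field):
    """Decode a raw byte string using the field's declared encoding.

    Fields without a declared encoding are assumed to be UTF-8.
    """
    return raw_bytes.decode(field.get("encoding", "utf-8"))
```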

Simple data format: metadata file format?

The SDF document references the "Data Package specification" for its metadata format. In my mind, this raises two slightly pedantic issues that might be worth clarifying:

  • What's the metadata filename called in SDF? The DP spec says "datapackage.json" for this file, which sounds too DP-specific. "metadata.json"?
  • Is the structure of this file literally single-level JSON key/value pairs i.e:
{
  "version": "1.0",
  "license": "CC BY-NC-SA 2.0",
  ...
}

? If so, it might be worth putting a sample of it on the DP page.

(As a separate and much less important note, clicking on the Data Package specification loses menu context: should it be somewhere on the right-hand menu hierarchy? As it is, it's difficult to find without clicking through from e.g. the SDF page.)

Sort out hosting so we do not have /en/latest in all the urls

We host at ReadTheDocs at the moment, hence this issue. We could solve it by:

  • nginx proxy
  • Build and deploy w/o readthedocs
  • Get readthedocs fixed - see readthedocs/readthedocs.org#293
  • Splitting into multiple repos and host via jekyll or similar

Current Plan (as of August 2013)

Move to use open repo per spec + use github pages for hosting

  • Split each spec into a separate repo
    • Note: for some things that don't yet justify their own repos it is fine to keep them in the main repo (just recommend using a URL structure that allows easy migration)
  • Handle redirects
    • given how gh-pages works (no redirects) we have to do this either via an nginx proxy or by putting in manual pages with meta redirects. I opt for the latter ...

Why?

  • Easier to maintain (just push to gh-pages; Markdown is probably easier than reStructuredText)
  • Each spec can have its own tags/branches
  • Each spec gets its own set of issues

Repo changes:

  • dataprotocols => dataprotocols.github.com (if we want it as base repo)
    • Set up a reusable theme which each subrepo uses
  • Split out data-package (or data-package_s_)?
  • Split out json-table-schema
  • Split out simple-data-format
  • (?) Split out versioning (how?)

Hosted Set of Examples

It would be great to have a collection of sample Data Packages hosted online to:

  1. Provide a unified set of tests for any client implementations, and
  2. Provide a set of examples to help clarify some of the simple issues (#30) until they are clarified in the docs.

I18N and Metadata Translations for Data Package

How should the standard support titles, descriptions and data fields in languages other than English?

Proposal (Nov 2016)

An internationalised field:

# i18n
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

Summary:

Each localizable string in datapackage.json could take two forms:

  • A simple string (for backward compatibility)
  • An object, mapping from ISO Locale codes (with or without the region specification, e.g. 'en', or 'es-ES') to their representations.
  • In this object, you could have an empty key "" which denotes the 'default' representation

Not all properties would be localizable for now. For the sake of simplicity, we limit this to the following properties:

  • title (at package and resource level)
  • description (at package and resource level)

Default Language

You can define the default language for a data package using a lang attribute:

"lang": "en"

The default language, if none is specified, is English (?).
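A sketch of resolving the proposed localized-string form: a property may be a plain string (backward compatible) or a mapping from locale codes to translations, with "" as the default representation. The region-stripping fallback ('es-ES' to 'es') is an assumption about how clients would behave, not something the proposal mandates.

```python
def resolve(value, locale=None, default_lang="en"):
    """Pick the best available translation for a localizable property."""
    if isinstance(value, str):
        return value  # simple string: backward-compatible form
    if locale:
        if locale in value:
            return value[locale]
        base = locale.split("-")[0]  # assumed fallback: 'es-ES' -> 'es'
        if base in value:
            return value[base]
    # "" denotes the default representation in the proposal
    return value.get("", value.get(default_lang, ""))

TITLE = {
    "": "Israel's 2015 budget",
    "he-IL": "\u05ea\u05e7\u05e6\u05d9\u05d1 \u05dc\u05e9\u05e0\u05ea 2015",
    "es": "Presupuestos Generales del Estado para el a\u00f1o 2015 de Israel",
}
```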

Data Packages - Inline Data

Allow inlining data directly on "resources".

What

datapackage.json looks like:

{
   ...
   resources: [
     {
        "format": "json",
        # some json data e.g. 
        "data": [
           { "a": 1, "b": 2 },
           { .... }
        ]
     }
   ]
}

OR

{
   ...
   resources: [
     {
        "format": "csv",
        "data": "A,B,C\n1,2,3\n4,5,6"
     }
   ]
}

Why?

This is attractive for small datasets and for creating all-in-one items.
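A sketch of reading both inline-data forms shown above: JSON rows are used directly, while a CSV string is parsed with the standard csv module.

```python
import csv
import io

def read_inline(resource):
    """Return rows from an inline-data resource in either proposed form."""
    data = resource["data"]
    if resource.get("format") == "csv":
        # CSV form: the data value is a CSV string with a header row.
        return list(csv.DictReader(io.StringIO(data)))
    # JSON form: the data value is already a list of row objects.
    return data

json_resource = {"format": "json", "data": [{"a": 1, "b": 2}]}
csv_resource = {"format": "csv", "data": "A,B,C\n1,2,3\n4,5,6"}
```

Note that the CSV form yields string values; applying any declared schema types would be a separate step.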

Table schema for geojson feature properties

The table schema (or a closely related schema) should be able to describe the properties of features in a geojson file.

Currently, the table schema allows a field to be of type geojson, or other fields to be of type geopoint. This makes the file unreadable in a geojson viewer or map application, and dissociates the fields from their geographical representation.
On the other hand, geojson requires fields that are associated with a geographical feature to be in its "properties" dictionary, which makes them un-describable using the table schema.

I think the table schema could perfectly be used to describe the properties of features in a geojson file, possibly without modification. It just has to be mentioned in the standard, and possibly in the schema itself.
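The idea above can be sketched simply: pull each feature's "properties" dict out of a GeoJSON FeatureCollection as plain rows, which an ordinary table schema could then describe. The sample features and their values are invented for illustration.

```python
# Invented GeoJSON FeatureCollection with tabular properties.
FEATURE_COLLECTION = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [2.35, 48.85]},
            "properties": {"name": "Paris", "population": 2148000},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [13.4, 52.52]},
            "properties": {"name": "Berlin", "population": 3645000},
        },
    ],
}

def properties_as_rows(collection):
    """Extract feature properties as rows for table-schema validation."""
    return [feature["properties"] for feature in collection["features"]]
```

The geometry stays untouched, so the file remains readable by GeoJSON viewers while the properties become describable as a table.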

Foreign Key attribute in JSON schema

Suggest in a type field:

foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...,
  // id / name of the dataset
  file: ...,
  // id of the field in the referenced table
  field: ...
}

[JTS] Primary Key / ID attribute support in JSON table schema

Ability to specify a field(s) as primary key / id field.

Proposal

{
   fields: [
     {
        id: ...,
        type: ...,
        primarykey: true
     }
   ]
}

Questions

  • Do we allow multiple fields or insist on a single field?
  • Naming of attribute: "primarykey"

Relationship to a possible distinct attribute "unique"

TODO: research / compare with other specs e.g. SQL, bigquery etc
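A sketch of enforcing the proposed primarykey attribute when loading rows: values of every flagged field must be present and, taken together, unique. The error strings and function name are illustrative.

```python
# Invented schema following the proposal above.
SCHEMA = {
    "fields": [
        {"id": "id", "type": "integer", "primarykey": True},
        {"id": "name", "type": "string"},
    ]
}

def check_primary_key(rows, schema):
    """Return error strings for missing or duplicate primary key values.

    Collects all primarykey fields, so the sketch also covers the
    multiple-field (composite key) variant discussed above.
    """
    keys = [f["id"] for f in schema["fields"] if f.get("primarykey")]
    errors, seen = [], set()
    for i, row in enumerate(rows):
        value = tuple(row.get(k) for k in keys)
        if None in value:
            errors.append(f"row {i}: missing primary key value")
        elif value in seen:
            errors.append(f"row {i}: duplicate primary key {value}")
        seen.add(value)
    return errors
```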

DP - field "hash" in resource information is unclear.

Is it a checksum?
Should it include the algorithm? Should it allow for several algorithms? Should it require certain algorithms?

"hash":"28cb0cb25701a242d84c2857fdf52775" 

or

"hash":{
  "md5":"28cb0cb25701a242d84c2857fdf52775",
  "sha256":"..."
}

?
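A sketch of how a client might accept both forms discussed above: a bare digest string (assumed here to be md5, which is itself one of the open questions) or a mapping from algorithm name to digest, verified with the standard hashlib module.

```python
import hashlib

def verify_hash(data: bytes, declared):
    """Check resource bytes against a declared hash in either form.

    A bare string is treated as an md5 digest (an assumption, not a
    spec rule); a dict maps algorithm names to hex digests, and every
    listed algorithm must match.
    """
    if isinstance(declared, str):
        declared = {"md5": declared}
    for algorithm, digest in declared.items():
        if hashlib.new(algorithm, data).hexdigest() != digest:
            return False
    return True
```

The dict form answers the "several algorithms" question naturally, since a producer can declare as many digests as it likes and consumers verify whichever they support.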

Files should have a required name (or id) attribute

Files attribute should have a name (or id).

Discussion

Originally, files was a hash keyed by a name/id (so a name was required). However, that adds complexity for creators and does not seem to be a hard requirement, so it was removed.

Pros / cons

  • (+) Makes it easier to "address" files and can be useful if presenting in a web interface
  • (-) Simplicity - not essential, and it means a creator has to generate a name or id for each file. Why bother?

On hash versus array: an array is needed if there is no id. Also, order matters for users (UX of presentation etc.).
