frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.

Home Page: https://datapackage.org

License: The Unlicense

JavaScript 16.05% TypeScript 7.82% CSS 3.16% Astro 16.37% MDX 56.60%
csv data-science json metadata schema validation

datapackage's People

Contributors

akariv, amercader, benoitc, cpina, danfowler, dependabot[bot], domoritz, georgiana-b, lauragift21, ldodds, lwinfree, max-mapper, michaelamadi, micimize, mk270, monikappv, nichtich, nirabpudasaini, orihoch, paulfitz, peterdesmet, pwalsh, rgieseke, roll, rufuspollock, serahkiburu, spatchcock, stephen-gates, trestletech, vitorbaptista


datapackage's Issues

Discussion of Catalogs re Data Packages

We need to think further about this. The material below was removed from the current spec since it is not finalized.

Current Primary Proposal

Make your registry itself into a (tabular) Data Package. A real-life example is here:

https://github.com/datasets/registry

Here's the rough structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of the GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually
    the same as the name of the repository)
  • owner: the username of the owner of the package. For datasets on GitHub
    this will be the GitHub username

name and owner are both optional.
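A minimal sketch of consuming the proposed catalog.csv with Python's standard csv module, treating the optional name and owner columns as absent when empty. The sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical catalog.csv content following the proposed structure;
# name and owner are optional and may be left empty.
CATALOG_CSV = """url,name,owner
https://github.com/datasets/gdp,gdp,datasets
https://github.com/datasets/country-codes,,
"""

def read_catalog(text):
    """Parse a catalog.csv, mapping empty name/owner cells to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (value or None) for key, value in row.items()})
    return rows

entries = read_catalog(CATALOG_CSV)
```

Because only url is required, consumers should be prepared for missing name and owner values, as the None mapping above makes explicit.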


# OLD

Options

Option 1

[ 
   { data-package },
   { data-package }
]

Option 2

{ 
   dp-id: { data-package },
   dp-id: { data-package }
}

Option 3

 {
    dataPackageCatalogVersion: [an integer indicating version of the spec this corresponds to]
    dataPackages: 
      like option 1 or 2 ...
    ...
 }

Existing material

Catalogs and Discovery

In order to find Data Packages, tools may make use of a "consolidated" catalog,
either online or local.

A general specification for (online) Data Catalogs can be found at
http://spec.datacatalogs.org/.

For local catalogs on disk we suggest locating the file at "HOME/.dpm/catalog.json"
and giving it the following structure:

 {
    version: ...,
    datasets: {
      {name}: {
        {version}: {
          metadata: {metadata},
          bundles: [
            { url: ..., type: file | url | ckan | zip | tgz }
          ]
        }
      }
    }
 }

When Package metadata is added to the catalog, a field called bundle is added
pointing to a bundle for this item (see below for more on bundles).
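A minimal sketch, assuming the nested structure above, of how a tool might read the suggested local catalog. The dataset name, version, and bundle URL in the sample are invented for illustration.

```python
import os

# Invented sample matching the suggested catalog layout:
# version at the top, then datasets keyed by name, then by version.
SAMPLE_CATALOG = {
    "version": "1.0",
    "datasets": {
        "gdp": {
            "0.1.0": {
                "metadata": {"title": "Country GDP"},
                "bundles": [
                    {"url": "https://example.org/gdp.zip", "type": "zip"}
                ],
            }
        }
    },
}

def list_datasets(catalog):
    """Return (name, version) pairs for every packaged dataset."""
    return [
        (name, version)
        for name, versions in catalog.get("datasets", {}).items()
        for version in versions
    ]

def catalog_path(home=None):
    """The on-disk location suggested by the draft above."""
    return os.path.join(home or os.path.expanduser("~"), ".dpm", "catalog.json")
```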

Hyper Text Query Language

In February (sorry for postponing this) I will write a page on hypertext query languages such as HTSQL.

Grammar issues in CouchDB replication doc

The CouchDB replication protocol documentation has a lot of grammatical errors and typos throughout. It needs review by a native English speaker, or should just be replaced with a link to the source document it's based on.

Since the document was clearly based on my TouchDB replication document, I wish that you had simply copied the text wholesale (after asking me) instead of rewriting it. That way it would at least be grammatical. Even better would be if you simply linked to my document — as it is, as I improve or fix my document, yours becomes out of date. I'm not sure why you felt you needed to have your own, since it doesn't add anything to mine.

At a higher level, I don't think it makes sense to try to standardize this protocol on its own. It's really just a specific usage of the entire CouchDB API and data model, and doesn't stand on its own without them. (Not to mention that the idea of CouchDB specs being written by people not associated with CouchDB is rather weird.)

Clarify DataPackage File Info Section

I'm unclear on the File Info Section on the Data Package documentation. Specifically:

  1. Is a name/ID not a required/suggested attribute for a file? #23 seems to imply that this information would be required for each file. Of course, if files were a hash, each file element could be named explicitly in a pretty natural way. Currently, it's listed as an array.
  2. What combination of data is expected/required? Would the schema key be used only if the file were in the JSON Table Schema format, while the dialect key would be used for CSVs? Or should they both always be supplied?

I'm working on the ability to retrieve data files from a specified Data Package JSON file in the R client and need to figure these two things out in order to proceed.

SLEEP: multiple sequences with the same id

I think the SLEEP spec needs some clarification around what to do in the case of multiple sequences with the same id.

The spec makes reference to them being unique, but in the case of CouchDB's _changes?feed=continuous that is actually not the case: as the database is updated, later sequences will appear with the same ids as previous sequences when those documents are updated.

This is particularly important to me as mikeal/couchup keeps prior sequences around until compaction. The sparse sequence CouchDB maintains on write is really just an optimization.

Move site to gh-pages

ReadTheDocs does not buy us much here and results in /en/latest in every URL (and we aren't going to version anyway ...)

The `name` field SHOULD NOT include version/time information

The name field of a data package is specified to be unique, but not to be permanent. If it may change between versions/updates of a dataset, that complicates maintenance of links and synchronization/replication. This happens not just by explicit versioning in the name, but also by generating it from a dataset title, since titles very commonly include a year range: world-gdp-1944-to-2012.

So I suggest that the spec state that:

  • name SHOULD be permanent, perhaps with some elaboration like “unless the new package version should be considered a distinct package, e.g. due to significant changes in structure or interpretation”.
  • version distinction SHOULD be left solely to the version field
  • information about time range coverage SHOULD also be left out of the name. (Perhaps also suggest a specific attribute for that information, like time-coverage ... but that's a separate issue)
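A hypothetical lint following the suggestion above: flag package names that appear to embed a year range or a version number, which should live in the version field or a separate coverage attribute instead. The regexes and warning wording are illustrative, not part of any spec.

```python
import re

# Heuristics only: a 4-digit year segment (1800-2099) or a dotted
# version segment inside a hyphenated name.
YEAR = re.compile(r"(?:^|-)(1[89]\d{2}|20\d{2})(?:-|$)")
VERSION = re.compile(r"(?:^|-)v?\d+\.\d+(?:\.\d+)?(?:-|$)")

def name_warnings(name):
    """Return warning strings for names that embed time or version info."""
    warnings = []
    if YEAR.search(name):
        warnings.append("name appears to embed a year; keep time coverage out of `name`")
    if VERSION.search(name):
        warnings.append("name appears to embed a version; use the `version` field")
    return warnings
```

A tool could surface these as non-fatal warnings, since the suggestion is a SHOULD rather than a MUST.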

Closer alignment with JSON Schema

It seems to me like you can use JSON Schema (or something very close) to describe CSVs just as well as JSON files, where instead of:

{
"fields": [{
    "id": "foo",
    "label": "bar",
    "description": "...",
    "type": "date",
    "format": "YYYY.MM.DD"
  }, {
    ...
  }]
}

You'd have:

{
"$schema": "http://link-to-dataprotocols-dialect-of-json-schema.tld/path",
"properties": {
    "foo": {
      "title": "bar",
      "description": "...",
      "type": "date",
      "format": "YYYY.MM.DD"
    },
    "baz": {
      ...
    },
    ...
  }
}

format has a different meaning in JSON Schema, so maybe mint a scheme property.

JSON Schema is already well supported, so I figure it would be a benefit to try to be as close to it as possible.

JTS - Method for describing units for a field

Say I have real GDP in 2009 £m (i.e. in millions of £, in 2009 prices); I have no way to specify this.

Propose two new fields:

  • "scaler" attribute. Value is a number. The default value is 1, but for £m it would be 1m, i.e. 1000000
  • "unit" attribute whose value is a hash with following properties:
{
   type: "currency",
   value: "GBP",
   # base date in iso 8601 format
   date: "2009"
}   
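A sketch of applying the proposed "scaler" and "unit" attributes when reading values. The field descriptor shape follows the proposal above; none of these names are part of the current spec.

```python
# Invented field descriptor using the proposed attributes:
# stored values are in millions of 2009 GBP.
FIELD = {
    "id": "real_gdp",
    "type": "number",
    "scaler": 1_000_000,
    "unit": {"type": "currency", "value": "GBP", "date": "2009"},
}

def to_base_units(value, field):
    """Convert a stored cell value to base units using the scaler."""
    return value * field.get("scaler", 1)

def unit_label(field):
    """Render a human-readable unit label from the unit hash."""
    unit = field.get("unit", {})
    return f'{unit.get("value", "")} ({unit.get("date", "")})'.strip()
```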

Concerns

  • This has the potential for massively increased complexity
  • Would this not be better as part of a proper dimension description approach (keeping JTS simple)?
  • Further research is needed into existing work, e.g. SDMX

Flesh out on Refining / Reconciliation protocol section

@frabcus has stubbed a refining / reconciliation section named refining.rst (maybe better named reconciliation).

@pudo, would you be up for fleshing out this section based on your knowledge of Refine, Helmut, etc.? It could start with a summary of existing work, e.g. Google Refine, Helmut (and any references to algorithms), and then provide a proposal.

Support Categorical Variables in JSON Schema

We're currently evaluating how to support categorical "factors" (as named in R) in JSON Table Schema and would love to have an official structure for this in the spec.

I noticed that #23 didn't seem to go over very well, so maybe this similar concept is outside of the spec.

But we're thinking about something along the lines of:

"fields": [
  {
    "id": "color",
    "factor": {
      "1": "Blue",
      "2": "Black",
      "3": "Red"
    }
  }
]

I realize this is already a valid schema, I'm just wishing there were an officially sanctioned way to do this so that our R client implementation doesn't end up forking the spec by implementing this in a proprietary/unexpected way.
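A sketch of how a client might decode coded values using the "factor" mapping proposed above. The descriptor shape is the proposal's, not an official part of the spec.

```python
# Invented field descriptor following the proposed "factor" shape.
FIELD = {
    "id": "color",
    "factor": {"1": "Blue", "2": "Black", "3": "Red"},
}

def decode_cell(raw, field):
    """Map a raw coded value to its factor label, if the field has one.

    Unmapped values and fields without a factor pass through unchanged.
    """
    factor = field.get("factor")
    if factor is None:
        return raw
    return factor.get(str(raw), raw)
```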

Add predominant existing data packaging specifications

In the section describing the relationship to other data packaging specifications, the predominant specifications that are in widespread use in the science community are missing. For an overview of these, see the DataONE Data Packaging description. Here is a list of existing data package specifications that are mature and widely adopted:

[DataPackage] URL and Path Exclusivity on Resource

I'm finding that many packages on data.okfn.org (for example, this one) have both a path and a url specified for a single resource. http://github.com/QBRC/RODProt throws an error in such an instance, as it's unclear whether it should be using the path element (relative to the original URL of the datapackage.json file), or the absolute URL.

I suppose we could code in a check to see if they're equivalent, in which case there should be no error, but it seems like this is representing a more fundamental misunderstanding.

How do you resolve conflicts between URL and path for a single resource, and why aren't these mutually exclusive?

SLEEP / The Cut-Out

This protocol might be of interest in defining SLEEP: http://thecutout.org/protocol.html – it bears a lot of similarities, but with some extra pieces that I believe are necessary for completeness. It's not particularly dissimilar to CouchDB's protocol.

(Note: I'm the author of the protocol and project that implements both client and server.)

String Encoding

How should I specify in a schema that my string field is in an unusual character encoding?
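One way an implementation might honour a hypothetical per-field "encoding" property (not currently in the spec): decode raw bytes with Python's codec machinery, defaulting to UTF-8.

```python
# Invented field descriptor with a hypothetical "encoding" property.
FIELD = {"id": "city", "type": "string", "encoding": "latin-1"}

def decode_field(raw_bytes, field):
    """Decode a raw byte string using the field's declared encoding.

    Fields without a declared encoding are assumed to be UTF-8.
    """
    return raw_bytes.decode(field.get("encoding", "utf-8"))
```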

Simple data format: metadata file format?

The SDF document references the "Data Package specification" for its metadata format. In my mind, this raises two slightly pedantic issues that might be worth clarifying:

  • What's the metadata filename called in SDF? The DP spec says "datapackage.json" for this file, which sounds too DP-specific. "metadata.json"?
  • Is the structure of this file literally single-level JSON key/value pairs i.e:
{
  "version": "1.0",
  "license": "CC BY-NC-SA 2.0",
  ...
}

? If so, it might be worth putting a sample of it on the DP page.

(As a separate and much less important note, clicking on the Data Package specification loses menu context: should it be somewhere on the right-hand menu hierarchy? As it is, it's difficult to find without clicking through from e.g. the SDF page.)

Sort out hosting so we do not have /en/latest in all the urls

We host at ReadTheDocs at the moment, hence this issue. We could solve it by:

  • nginx proxy
  • Build and deploy w/o readthedocs
  • Get readthedocs fixed - see readthedocs/readthedocs.org#293
  • Splitting into multiple repos and host via jekyll or similar

Current Plan (as of August 2013)

Move to use open repo per spec + use github pages for hosting

  • Split each spec into a separate repo
    • Note: for some things that don't yet justify their own repos it is fine to keep them in the main repo (just recommend using a URL structure that allows easy migration)
  • Handle redirects
    • given how gh-pages works (no redirects) we have to do this either via an nginx proxy or by putting in manual pages with meta redirects. I opt for the latter ...

Why?

  • Easier to maintain (just push to gh-pages; Markdown is probably easier than reStructuredText)
  • Each spec can have its own tags/branches
  • Each spec gets its own set of issues

Repo changes:

  • dataprotocols => dataprotocols.github.com (if we want it as base repo)
    • Set up a reusable theme which each subrepo uses
  • Split out data-package (or data-package_s_)?
  • Split out json-table-schema
  • Split out simple-data-format
  • (?) Split out versioning (how?)

Hosted Set of Examples

It would be great to have a collection of sample Data Packages hosted online to:

  1. Provide a unified set of tests for any client implementations, and
  2. Provide a set of examples to help clarify some of the simple issues (#30) until they are clarified in the docs.

I18N and Metadata Translations for Data Package

How should the standard support titles, descriptions and data fields in languages other than English?

Proposal (Nov 2016)

An internationalised field:

# i18n
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

Summary:

Each localizable string in datapackage.json could take two forms:

  • A simple string (for backward compatibility)
  • An object, mapping from ISO Locale codes (with or without the region specification, e.g. 'en', or 'es-ES') to their representations.
  • In this object, you could have an empty key "" which denotes the 'default' representation

Not all properties would be localizable for now. For the sake of simplicity, we limit this to the following properties:

  • title (at package and resource level)
  • description (at package and resource level)

Default Language

You can define the default language for a data package using a lang attribute:

"lang": "en"

The default language, if none is specified, is English (?).
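A sketch of resolving the proposed localized-string form: a property may be a plain string (backward compatible) or a mapping from locale codes to translations, with "" as the default representation. The region-stripping fallback ('es-ES' to 'es') is an assumption about how clients would behave, not something the proposal mandates.

```python
def resolve(value, locale=None, default_lang="en"):
    """Pick the best available translation for a localizable property."""
    if isinstance(value, str):
        return value  # simple string: backward-compatible form
    if locale:
        if locale in value:
            return value[locale]
        base = locale.split("-")[0]  # assumed fallback: 'es-ES' -> 'es'
        if base in value:
            return value[base]
    # "" denotes the default representation in the proposal
    return value.get("", value.get(default_lang, ""))

TITLE = {
    "": "Israel's 2015 budget",
    "he-IL": "\u05ea\u05e7\u05e6\u05d9\u05d1 \u05dc\u05e9\u05e0\u05ea 2015",
    "es": "Presupuestos Generales del Estado para el a\u00f1o 2015 de Israel",
}
```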

Data Packages - Inline Data

Allow inlining data directly on "resources".

What

datapackage.json looks like:

{
   ...
   resources: [
     {
        "format": "json",
        # some json data e.g. 
        "data": [
           { "a": 1, "b": 2 },
           { .... }
        ]
     }
   ]
}

OR

{
   ...
   resources: [
     {
        "format": "csv",
        "data": "A,B,C\n1,2,3\n4,5,6"
     }
   ]
}

Why?

This is attractive for small datasets and for creating all-in-one items.
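A sketch of reading both inline-data forms shown above: JSON rows are used directly, while a CSV string is parsed with the standard csv module.

```python
import csv
import io

def read_inline(resource):
    """Return rows from an inline-data resource in either proposed form."""
    data = resource["data"]
    if resource.get("format") == "csv":
        # CSV form: the data value is a CSV string with a header row.
        return list(csv.DictReader(io.StringIO(data)))
    # JSON form: the data value is already a list of row objects.
    return data

json_resource = {"format": "json", "data": [{"a": 1, "b": 2}]}
csv_resource = {"format": "csv", "data": "A,B,C\n1,2,3\n4,5,6"}
```

Note that the CSV form yields string values; applying any declared schema types would be a separate step.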

Table schema for geojson feature properties

The table schema (or a closely related schema) should be able to describe the properties of features in a geojson file.

Currently, the table schema allows a field to be of type geojson, or other fields to be of type geopoint. This makes the file unreadable in a geojson viewer or map application, and dissociates the fields from their geographical representation.
On the other hand, geojson requires fields that are associated with a geographical feature to be in its "properties" dictionary, which makes them un-describable using the table schema.

I think the table schema could perfectly be used to describe the properties of features in a geojson file, possibly without modification. It just has to be mentioned in the standard, and possibly in the schema itself.
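The idea above can be sketched simply: pull each feature's "properties" dict out of a GeoJSON FeatureCollection as plain rows, which an ordinary table schema could then describe. The sample features and their values are invented for illustration.

```python
# Invented GeoJSON FeatureCollection with tabular properties.
FEATURE_COLLECTION = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [2.35, 48.85]},
            "properties": {"name": "Paris", "population": 2148000},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [13.4, 52.52]},
            "properties": {"name": "Berlin", "population": 3645000},
        },
    ],
}

def properties_as_rows(collection):
    """Extract feature properties as rows for table-schema validation."""
    return [feature["properties"] for feature in collection["features"]]
```

The geometry stays untouched, so the file remains readable by GeoJSON viewers while the properties become describable as a table.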

Foreign Key attribute in JSON schema

Suggest in a type field:

foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...,
  // id / name of the dataset
  file: ...,
  // id of the field in the referenced table
  field: ...
}

[JTS] Primary Key / ID attribute support in JSON table schema

Ability to specify a field(s) as primary key / id field.

Proposal

{
   fields: [
     {
        id: ...,
        type: ...,
        primarykey: true
     }
   ]
}

Questions

  • Do we allow multiple fields or insist on a single field?
  • Naming of attribute: "primarykey"

Relationship to a possible distinct attribute "unique"

TODO: research / compare with other specs e.g. SQL, bigquery etc
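A sketch of enforcing the proposed primarykey attribute when loading rows: values of every flagged field must be present and, taken together, unique. The error strings and function name are illustrative.

```python
# Invented schema following the proposal above.
SCHEMA = {
    "fields": [
        {"id": "id", "type": "integer", "primarykey": True},
        {"id": "name", "type": "string"},
    ]
}

def check_primary_key(rows, schema):
    """Return error strings for missing or duplicate primary key values.

    Collects all primarykey fields, so the sketch also covers the
    multiple-field (composite key) variant discussed above.
    """
    keys = [f["id"] for f in schema["fields"] if f.get("primarykey")]
    errors, seen = [], set()
    for i, row in enumerate(rows):
        value = tuple(row.get(k) for k in keys)
        if None in value:
            errors.append(f"row {i}: missing primary key value")
        elif value in seen:
            errors.append(f"row {i}: duplicate primary key {value}")
        seen.add(value)
    return errors
```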

DP - field "hash" in resource information is unclear.

Is it a checksum?
Should it include the algorithm? Should it allow for several algorithms? Should it require certain algorithms?

"hash":"28cb0cb25701a242d84c2857fdf52775" 

or

"hash":{
  "md5":"28cb0cb25701a242d84c2857fdf52775",
  "sha256":"..."
}

?
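A sketch of how a client might accept both forms discussed above: a bare digest string (assumed here to be md5, which is itself one of the open questions) or a mapping from algorithm name to digest, verified with the standard hashlib module.

```python
import hashlib

def verify_hash(data: bytes, declared):
    """Check resource bytes against a declared hash in either form.

    A bare string is treated as an md5 digest (an assumption, not a
    spec rule); a dict maps algorithm names to hex digests, and every
    listed algorithm must match.
    """
    if isinstance(declared, str):
        declared = {"md5": declared}
    for algorithm, digest in declared.items():
        if hashlib.new(algorithm, data).hexdigest() != digest:
            return False
    return True
```

The dict form answers the "several algorithms" question naturally, since a producer can declare as many digests as it likes and consumers verify whichever they support.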

Files should have a required name (or id) attribute

Files attribute should have a name (or id).

Discussion

Originally, files was a hash keyed by a name/id (so a name was required). However, that adds complexity for creators and does not seem to be a hard requirement, so it was removed.

Pros / cons

  • (+) Makes it easier to "address" files and can be useful if presenting in a web interface
  • (-) Simplicity - not essential, and it means a creator has to generate a name or id for each file. Why bother?

On hash versus array: an array is needed if there is no id. Also, order matters for users (UX of presentation etc.).
