sql-unifier's Introduction

datasets-br

Describes the datasets-br directives and uses this project as a point for general discussions.

Dataset-BR directives

  1. To post qualified datasets on Datahub.io;
  2. To unify, through a curation process, sets of Wikidata fragments of items, or common instances of an item;
  3. To unify the terminology used for CSV column names and for table and column semantics (SchemaOrg conventions when possible);
  4. Digital preservation (CSV files and data dumps from the original sources) of the curated datasets;
  5. Monitoring/auditing Wikidata and OpenStreetMap changes, in the context of the curated datasets.

Use as an ecosystem of datasets

Example of use with two Brazilian (BR) datasets, state-codes and city-codes.

Operating with pure SQL or SQL-unifier makes it easy to merge with other datasets... With PostgreSQL you can offer datasets through a standard API with PostgREST (or its descendants pREST and PostGraphile), or plug-and-play with SchemaOrg standards, Frictionless Data standards (and tools), etc.

Documentation

... under construction

Conventions for data provenance and preparation.


  Contents and data of this project are dedicated to

sql-unifier's People

Contributors

ppkrauss

sql-unifier's Issues

Merging 2 temporary datasets before inserting into dataset.big

Without special tools, for example to merge many standard spreadsheets into one dataset, it is possible to do it with plain SQL (e.g. with SELECT ... UNION).

There are two ways:

  • Merging within the dataset.big representation, as suggested in issue #4. Perhaps the simplest and ideal approach.

  • Merging before inserting into dataset.big, while the data is still a temporary file with a temporary FOREIGN TABLE representation.

This issue is about checking strategies for the second approach... Is there any advantage in writing specific scripts? (perhaps non-automatic work)
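A minimal sketch of the second strategy, assuming two CSV-backed foreign tables with the same layout (the table names, columns and source id here are illustrative, not the project's generated ones):

-- Merge two temporary foreign tables with UNION ALL and aggregate the result
-- into one JSONb array, inserted as a single dataset.big row.
INSERT INTO dataset.big(source, j)
SELECT 99,  -- illustrative source id
       jsonb_agg(jsonb_build_array(ddd, municipio, uf))
FROM (
  SELECT ddd, municipio, uf FROM tmpcsv_sheet1
  UNION ALL
  SELECT ddd, municipio, uf FROM tmpcsv_sheet2
) t;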

Abandon use of SQL-rows representation, expand datasets to rows on demand

(this issue was for v0.1.0, which changed from v0.2 onwards, so use the v0.1 benchmark as reference)

The benchmark (see the Wiki) shows that:

  • dataset expansion with the jsonb_array_elements() function is about as fast as per-row JSONs;

  • more speed is only needed when there are a lot of rows (more than ~9000)... When a really fast SELECT is needed, the best option (~2 to 5 times faster, not that much) is a table or MATERIALIZED VIEW; an indexed one is faster still.

On the other hand, there is some demand to add "any other free JSON" dataset into dataset.big.

Conclusion: the best is "1 dataset per row" in dataset.big, a free-JSONb dataset with "some JSON schema" controlled by dataset.meta. We call this a "JSON TYPE DEFINITION" (JTD): a set of labels indicating standard structures. Only some JTDs are tabular; others need a rule to join/split into rows... The best is to keep everything as "1 dataset per row".
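A minimal sketch of the on-demand expansion for a tabular JTD, assuming the "1 dataset per row" layout (the view name, column names/types and source id are illustrative; the real ones come from dataset.meta):

-- Expand a dataset stored as one JSONb array-of-arrays into one SQL row per data row.
CREATE VIEW dataset.vw_example AS
  SELECT (t.j->>0)::text    AS col_a,
         (t.j->>1)::numeric AS col_b
  FROM (
    SELECT jsonb_array_elements(j) AS j
    FROM dataset.big
    WHERE source = 1   -- illustrative dataset id
  ) t;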


Reference Benchmark of v0.1.0

15 datasets, loaded as "many rows per dataset". In terms of disk usage, the whole dataset schema sums to a table_size of 6080 kB. If all the datasets of dataset.big were translated to usual SQL tables, the cost would be less (!), ~75% or less.

 id  urn                            pkey                     lang  n_cols  n_rows
 15  datasets-br:br_city_codes      state/lexLabel           pt    9       5570
 14  datasets-br:br_city_synonyms   state/lexLabel/synonym   pt    5       26
 ... (see more at the Wiki)

 nspname  n_tables  total_bytes  table_bytes  table_size
 dataset  4         8552448      6225920      6080 kB

Comparing JSON and JSONb

A test with fewer rows (before "datasets:world_cities"), but with good results, for total dataset.big usage:

  • JSONb with 16138 rows, table disk-usage 3496 kB.
  • JSON with 16138 rows, same datasets, table disk-usage 3432 kB.

So there is no significant advantage in using JSON, with only ~2% of gain. Better to stay with JSONb, a "first-class citizen".

Disk-usage reduction when all rows in one JSONb-array

Checking table_usage with "datasets-br:br_city_codes":

  • All rows in one JSONb array: 8192 bytes ~ 8kB
  • One JSONb row per SQL row: 736 kB

Conclusion: "all rows in one" is ~10 times better!

Checking performance

No loss of performance according to PostgreSQL's EXPLAIN ANALYZE: as shown in the Wiki, the jsonb_array_elements() function is very fast.
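A sketch of how this kind of check can be reproduced (the query only measures the expansion cost; adapt it to the views you care about):

-- Measure the cost of expanding the JSONb-array representation into rows.
EXPLAIN ANALYZE
SELECT count(*)
FROM (
  SELECT jsonb_array_elements(j) AS j
  FROM dataset.big
) t;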

SQL export functions

Export to JSON, JSONb and file, using SQL COPY and JSON functions.

COPY and Frictionless Data Tabular Resources:

  • CSV: function dataset.export_as_csv(dataset_name,filename,tmp=true) implemented with COPY (SELECT * FROM dataset.vw_$name) TO filename.

  • Any JSON: function dataset.export_as_json(json,filename,tmp=true) implemented with COPY (SELECT $1) TO $2.

  • JSON Tabular Data:

    • Row Arrays: function dataset.as_jsontab_arrays(dataset_name) implemented with SELECT * FROM meta.keys where dataset.meta_id($1) UNION ALL (SELECT big.c ...) ORDER By 1.

    • Row Objects: SELECT jsonb_object(meta.keys,big.c) FROM big, meta WHERE dataset=$1

... Casts, conversions, etc. of datasets.
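A minimal sketch of how dataset.export_as_csv(dataset_name,filename,tmp) could be implemented, assuming the per-dataset view already exists; this is not the project's actual function body:

CREATE OR REPLACE FUNCTION dataset.export_as_csv(
  p_name text, p_file text, p_tmp boolean DEFAULT true
) RETURNS text AS $$
BEGIN
  -- COPY ... TO writes on the server side and needs an absolute path
  -- (and enough privileges); the '/tmp/' prefix is just the tmp=true convention.
  EXECUTE format(
    'COPY (SELECT * FROM dataset.%I) TO %L WITH (FORMAT csv, HEADER true)',
    'vw_' || p_name,
    CASE WHEN p_tmp THEN '/tmp/' || p_file ELSE p_file END
  );
  RETURN p_file;
END;
$$ LANGUAGE plpgsql;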

Fix make failures

Failures that appear while running sh src/cache/make.sh:

ERROR:  cannot change name of input parameter "p_no_null"
HINT:  Use DROP FUNCTION array_distinct(anyarray,boolean) first.
...
CREATE FUNCTION
ERROR:  mismatched parentheses at or near ")"
LINE 41:           );
                   ^
NOTICE:  extension "file_fdw" already exists, skipping
CREATE EXTENSION
NOTICE:  drop cascades to 15 other objects
DETAIL:  drop cascades to foreign table tmpcsv6_autoridade
drop cascades to foreign table tmpcsv6_localidade
...
...
ERROR:  syntax error at or near "68"
LINE 1: ...->>2)::text AS assis_brasil,  (j->>3)::numeric AS 68 FROM (S...
                                                             ^
QUERY:  CREATE VIEW dataset.vw8_anatel2013_abrevs3letras AS SELECT  (j->>0)::text AS AC,  (j->>1)::text AS ABL,  (j->>2)::text AS assis_brasil,  (j->>3)::numeric AS 68 FROM (SELECT jsonb_array_elements(j) FROM dataset.big WHERE source=16) t(j)
CONTEXT:  PL/pgSQL function dataset."create"(integer,text,boolean,text,text) line 18 at EXECUTE
INSERT 0 1
ERROR:  missing data for column "municipio"
CONTEXT:  COPY tmpcsv8_anatel_ddd_chave, line 69: ""
SQL statement "INSERT INTO dataset.big(source,j)  SELECT 17, jsonb_agg(jsonb_build_array(ddd, municipio, uf)) FROM tmpcsv8_anatel_ddd_chave"
PL/pgSQL function dataset."create"(integer,text,boolean,text,text) line 36 at EXECUTE
INSERT 0 1

The problem appeared with the following conf:

{
  "db": "postgresql://postgres:postgres@localhost:5432/trydatasets",
  "github.com":{
    "lexml/lexml-vocabulary":null,
    "datasets/language-codes":null,
    "datasets/country-codes":null,
    "datasets/world-cities":{
      "_corrections_":{"resources":[{"primaryKey": "geonameid"}]}
    },
    "datasets-br/state-codes":"br-state-codes",
    "datasets-br/city-codes":null
  },
  "local-csv":{
     "anatel":{
       "separator":",",
       "folder":"/home/user/gits/datasets-br/city-codes/data/dump_etc"
     }
  },
  "useBig":true,  "useIDX":false,        "useRename":true,
  "useYUml":true, "useAllNsAsDft":false
}

When local-csv was changed, it also produced an error:

  "local-csv":{
    "citAux":{"folder":"/home/user/gits/datasets-br/city-codes/data"}
  },

Dataset primary-key hash options

The key column of dataset.big can be NULL, but the ideal is to keep a primary key even when it is long and complex. Adding SHA1, MD5 or CRC32 hashes as an option is proposed here for long-text keys, or when no key exists. Options:

  • hash(key): when a primary key exists and the option is used;
  • hash(c::text): when the option is used but there is no primary key;
  • hash(other_key): "overwrite" with a new primary key.

Ideally the hash is converted to base64, to optimize comparisons and reduce space, as sketched below.
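A minimal sketch of the base64 idea using the built-in md5() function (SHA1 would need the pgcrypto extension); the helper name is hypothetical:

-- Hypothetical helper: hash a (possibly long) key and return it base64-encoded.
CREATE OR REPLACE FUNCTION dataset.key_hash(p_key text) RETURNS text AS $$
  SELECT encode(decode(md5(p_key), 'hex'), 'base64');
$$ LANGUAGE sql IMMUTABLE;

-- Usage, following the options above:
--   dataset.key_hash(key)      when a primary key exists;
--   dataset.key_hash(c::text)  when it does not.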

Merging 2 datasets of dataset.big

Automatic loading of several spreadsheets can be relevant for process automation, but the process is only finished when the data has been aggregated.
There are two situations:

  1. the spreadsheet name is itself a datum: it needs to be aggregated, for example as the first field.

    1.1. moreover, the name is part of the primary key.

  2. the name carries no information for the dataset.

Algorithms that can solve the problem:

  • for case 2, it is enough to rename the dataset while checking at the same time that no duplication occurred, or triggering its handling (e.g. deleting duplicates).

  • for case 1, besides renaming, carry the name (or a transformation of it) into an initial or final field (see the jsonb_insert() function, sketched below). This also requires an addition to the metadata. For case 1.1 the primary-key metadata must also be changed.
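A minimal sketch of case 1, using jsonb_insert() to prepend the spreadsheet name to every row of a tab-aoa dataset (the name literal and source id are illustrative):

-- Prepend the spreadsheet name as the first field of each row array.
UPDATE dataset.big b
SET j = (
  SELECT jsonb_agg(jsonb_insert(r.j, '{0}', to_jsonb('sheet_name'::text)))
  FROM jsonb_array_elements(b.j) AS r(j)
)
WHERE b.source = 99;  -- illustrative dataset id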

Adopt an integer hash of the namespace name as its ns_id

As proposed and implemented after issue #7 with namespace control, the automatic SQL view names and other automatic labels can use the ns_id. So the view of a dataset named "abcd_ok" in a namespace named "xptoabc" (ns_id=3) will be "vw3_abcd_ok" instead of a long and confusing name that embeds the namespace name (e.g. "vw_xptoabc_abcd_ok").

A little problem... For tutorials and user projects that change the datasets and namespaces but come back with the same names, the ideal is to preserve, when possible, the name-to-ID correspondence (namespace name with namespace ns_id).

The solution is a hash: an integer (1 or 2 digits) hash obtained from the SHA1 of the name... It looks complex, but some existing libraries can be reused...
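A minimal sketch of the idea, assuming md5() instead of SHA1 (SHA1 would require the pgcrypto extension) and a two-digit range; the function name and range are illustrative, not the project's final choice:

-- Hypothetical helper: derive a small, stable integer id from a namespace name.
CREATE OR REPLACE FUNCTION dataset.ns_hash(p_name text) RETURNS int AS $$
  -- take 32 bits of the hash, then reduce it to the two-digit range 10..99
  SELECT abs(('x' || substr(md5(p_name), 1, 8))::bit(32)::int) % 90 + 10;
$$ LANGUAGE sql IMMUTABLE;

-- SELECT dataset.ns_hash('xptoabc');  -- always the same value for the same name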

Namespace and VIEW conventions

Ideally one more table is added to control namespaces, something like:

CREATE TABLE dataset.ns(
  -- Namespace
  nsid  serial NOT NULL PRIMARY KEY, 
  name text NOT NULL,  -- namespace label
  dft_lang text,  -- default lang of datasets
  jinfo JSONB,    -- any metadata as description, etc.
  created date DEFAULT now(),
  UNIQUE(name)
);

The view names are in the form vw_$uname, but that is only for the "default" namespace... when it is not the default, use vw{$nsid}_$uname.
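A sketch of the naming rule (assuming nsid=1 is the default namespace and "abcd_ok" is the normalized dataset name; both are illustrative):

SELECT CASE
         WHEN nsid = 1 THEN format('vw_%s', 'abcd_ok')          -- default namespace
         ELSE               format('vw%s_%s', nsid, 'abcd_ok')  -- e.g. vw3_abcd_ok
       END AS view_name
FROM dataset.ns;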

Include a routine to load all dataset repositories of a project

GitHub offers an access API. For example, for every repository under https://github.com/datasets, just use the project name, datasets, in the API URL: https://api.github.com/users/datasets/repos

Do the first tests with https://github.com/datasets-br


Other examples

curl -s https://api.github.com/orgs/twitter/repos?per_page=200 \
  | ruby -rubygems -e 'require "json"; JSON.load(STDIN.read).each { |repo| %x[git clone #{repo["ssh_url"]} ]}'

See the Python code at https://stackoverflow.com/a/8713716/287948

Add namespace control, and refactoring meta names

Each dataset name makes sense in the context of a project or, generalizing, of a namespace. Namespaces can be IDs or URNs. The dataset name must generate a consistent URN in the form namespace:datasetName.

The "canonic datasets" also need some control over "derived" ones. Example: language-codes is derived (as projection) from language-codes-full.

So the new dataset.meta will look like:

CREATE TABLE dataset.meta (
  id serial PRIMARY KEY,
  namespace text NOT NULL DEFAULT '',  -- empty namespace is the main one
  name text NOT NULL,        -- original dataset name or filename of the CSV
  is_canonic boolean DEFAULT false, -- for canonic or "reference" datasets
  sametypes_as text,   -- pointing to the kx_urn of its is_canonic dataset
  projection_of text,  -- pointing to the kx_urn of its is_canonic dataset

  -- Cache fields generated by UPDATE or trigger.
  kx_uname text, -- the normalized name, used by dataset.meta_id()
  ..., -- other caches

  info JSONb, -- all metadata (information) here!

  UNIQUE(namespace,kx_uname), -- not needed, but same as kx_urn
  CHECK( lib.normalizeterm(namespace)=namespace AND lower(namespace)=namespace ),
  CHECK( ... )
);

And a trigger can populate the kx_* fields.
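A minimal sketch of such a trigger, assuming the project's lib.normalizeterm() is the right normalization and kx_uname is the only cache field shown:

CREATE OR REPLACE FUNCTION dataset.meta_cache_tg() RETURNS trigger AS $$
BEGIN
  -- populate the cache fields from the raw name; other kx_* caches would go here
  NEW.kx_uname := lib.normalizeterm(NEW.name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER meta_cache
  BEFORE INSERT OR UPDATE ON dataset.meta
  FOR EACH ROW EXECUTE PROCEDURE dataset.meta_cache_tg();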

Assert, as validation and constraint strategies

As in the Schematron validation philosophy... To validate any kind of constraint, but mainly to validate frictionlessdata.io/table-schema/constraints:

  • some SQL constraints can be obtained with a MATERIALIZED VIEW, e.g. CREATE UNIQUE INDEX... But all the usual SQL constraints only work on real tables. ... So, do we need tables? Datasets are "read only" in general. The most important constraint is FOREIGN KEY/REFERENCES. The advantage of using tables: not reinventing the wheel.

  • standard asserts on a JTD (JSON type definition). Example: a tab-aoa JSON needs to simulate CHECK(jsonb_array_length(j)>0) on insert (see the sketch after this list)...

  • non-SQL but standard: all implemented by SQL or PL/pgSQL functions. See "not-null", "primary-key", etc. of the frictionlessdata constraints... These need standard asserts.

  • non-SQL and non-standard: ... free constraints. See the Schematron-validation-like use.
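A minimal sketch of a standard assert for the tab-aoa example, implemented as a trigger on dataset.big (a real version would first look up the JTD in dataset.meta and only check tabular datasets):

CREATE OR REPLACE FUNCTION dataset.assert_tab_aoa() RETURNS trigger AS $$
BEGIN
  -- simulate CHECK(jsonb_array_length(j) > 0) for array-of-arrays datasets
  IF jsonb_typeof(NEW.j) <> 'array' OR jsonb_array_length(NEW.j) = 0 THEN
    RAISE EXCEPTION 'tab-aoa dataset must be a non-empty JSON array (source=%)',
                    NEW.source;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_assert_tab_aoa
  BEFORE INSERT OR UPDATE ON dataset.big
  FOR EACH ROW EXECUTE PROCEDURE dataset.assert_tab_aoa();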


3-Layer architecture:

  1. Git with datapackage, CSVs, etc. The main reference and data origin... The single source of truth in the community context.
  2. The datasets on dataset.big, as single source of truth in the database context.
  3. Views, materialized views and tables for performance (fast queries) or for constraint-control in maintenance operations (validating new datasets or consolidating the ecosystem).
