sql-unifier's Introduction

datasets-br

Describes the datasets-br directives and uses this project as a point for general discussions.

Dataset-BR directives

  1. To post qualified datasets on Datahub.io;
  2. To unify, through a curation process, sets of Wikidata fragments of items, or common instances of an item;
  3. To unify the terminology used for CSV column names and for table and column semantics (SchemaOrg conventions when possible);
  4. Digital preservation (CSV files and data dumps from the original sources) of the curated datasets;
  5. Monitoring/auditing Wikidata and OpenStreetMap changes, in the context of the curated datasets.

Use as an ecosystem of datasets

Example of use with two Brazilian (BR) datasets, state-codes and city-codes.

Operating with pure SQL or SQL-unifier makes it easy to merge with other datasets... With PostgreSQL you can offer datasets through a standard API with PostgREST (or its descendants pREST and PostGraphile), or plug-and-play with SchemaOrg standards, Frictionless Data standards (and tools), etc.

Documentation

... under construction

Conventions for data provenance and preparation.


  Contents and data of this project are dedicated to

sql-unifier's People

Contributors

ppkrauss

sql-unifier's Issues

Merging 2 temporary datasets before inserting into dataset.big

Without special tools, for example to merge many standard spreadsheets into one dataset, it is possible to do it with plain SQL (e.g. with SELECT ... UNION).

There are two ways:

  • Merging within the dataset.big representation, as suggested in issue #4. Perhaps the simplest and ideal approach.

  • Merging before inserting into dataset.big, while the data is still a temporary file with a temporary FOREIGN TABLE representation.

This issue is about checking strategies for the second approach... Is there any advantage in writing specific scripts? (perhaps non-automatic work)
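A minimal sketch of the second strategy, assuming two CSV-backed foreign tables with the same layout (the table names, columns and source id here are illustrative, not the project's generated ones):

-- Merge two temporary foreign tables with UNION ALL and aggregate the result
-- into one JSONb array, inserted as a single dataset.big row.
INSERT INTO dataset.big(source, j)
SELECT 99,  -- illustrative source id
       jsonb_agg(jsonb_build_array(ddd, municipio, uf))
FROM (
  SELECT ddd, municipio, uf FROM tmpcsv_sheet1
  UNION ALL
  SELECT ddd, municipio, uf FROM tmpcsv_sheet2
) t;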

Abandon use of SQL-rows representation, expand datasets to rows on demand

(this issue was for v0.1.0, which changed from v0.2 onwards, so use the v0.1 benchmark as reference)

The benchmark (see the Wiki) shows that:

  • dataset expansion with the jsonb_array_elements() function is about as fast as per-row JSONs;

  • more speed is only needed when there are a lot of rows (more than ~9000)... When a really fast SELECT is needed, the best option (~2 to 5 times faster, not that much) is a table or MATERIALIZED VIEW; an indexed one is faster still.

On the other hand, there is some demand to add "any other free JSON" dataset into dataset.big.

Conclusion: the best is "1 dataset per row" in dataset.big, a free-JSONb dataset with "some JSON schema" controlled by dataset.meta. We call this a "JSON TYPE DEFINITION" (JTD): a set of labels indicating standard structures. Only some JTDs are tabular; others need a rule to join/split into rows... The best is to keep everything as "1 dataset per row".
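A minimal sketch of the on-demand expansion for a tabular JTD, assuming the "1 dataset per row" layout (the view name, column names/types and source id are illustrative; the real ones come from dataset.meta):

-- Expand a dataset stored as one JSONb array-of-arrays into one SQL row per data row.
CREATE VIEW dataset.vw_example AS
  SELECT (t.j->>0)::text    AS col_a,
         (t.j->>1)::numeric AS col_b
  FROM (
    SELECT jsonb_array_elements(j) AS j
    FROM dataset.big
    WHERE source = 1   -- illustrative dataset id
  ) t;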


Reference Benchmark of v0.1.0

15 datasets, loaded as "many rows per dataset". In terms of disk usage, the whole dataset schema sums to a table_size of 6080 kB. If all the datasets of dataset.big were translated to usual SQL tables, the cost would be less (!), ~75% or less.

 id  urn                            pkey                     lang  n_cols  n_rows
 15  datasets-br:br_city_codes      state/lexLabel           pt    9       5570
 14  datasets-br:br_city_synonyms   state/lexLabel/synonym   pt    5       26
 ... (see more at the Wiki)

 nspname  n_tables  total_bytes  table_bytes  table_size
 dataset  4         8552448      6225920      6080 kB

Comparing JSON and JSONb

A test with fewer rows (before "datasets:world_cities"), but with good results, for total dataset.big usage:

  • JSONb with 16138 rows, table disk-usage 3496 kB.
  • JSON with 16138 rows, same datasets, table disk-usage 3432 kB.

So there is no significant advantage in using JSON, with only ~2% of gain. Better to stay with JSONb, a "first-class citizen".

Disk-usage reduction when all rows in one JSONb-array

Checking table_usage with "datasets-br:br_city_codes":

  • All rows in one JSONb array: 8192 bytes ~ 8kB
  • One JSONb row per SQL row: 736 kB

Conclusion: "all rows in one" is ~10 times better!

Checking performance

No loss of performance according to PostgreSQL's EXPLAIN ANALYZE: as shown in the Wiki, the jsonb_array_elements() function is very fast.
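A sketch of how this kind of check can be reproduced (the query only measures the expansion cost; adapt it to the views you care about):

-- Measure the cost of expanding the JSONb-array representation into rows.
EXPLAIN ANALYZE
SELECT count(*)
FROM (
  SELECT jsonb_array_elements(j) AS j
  FROM dataset.big
) t;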

SQL export functions

Export to JSON, JSONb and file, using SQL COPY and JSON functions.

COPY and Frictionless Data Tabular Resources:

  • CSV: function dataset.export_as_csv(dataset_name,filename,tmp=true) implemented with COPY (SELECT * FROM dataset.vw_$name) TO filename.

  • Any JSON: function dataset.export_as_json(json,filename,tmp=true) implemented with COPY (SELECT $1) TO $2.

  • JSON Tabular Data:

    • Row Arrays: function dataset.as_jsontab_arrays(dataset_name) implemented with SELECT * FROM meta.keys where dataset.meta_id($1) UNION ALL (SELECT big.c ...) ORDER By 1.

    • Row Objects: SELECT jsonb_object(meta.keys,big.c) FROM big, meta WHERE dataset=$1

... Casts, conversions, etc. of datasets.
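A minimal sketch of how dataset.export_as_csv(dataset_name,filename,tmp) could be implemented, assuming the per-dataset view already exists; this is not the project's actual function body:

CREATE OR REPLACE FUNCTION dataset.export_as_csv(
  p_name text, p_file text, p_tmp boolean DEFAULT true
) RETURNS text AS $$
BEGIN
  -- COPY ... TO writes on the server side and needs an absolute path
  -- (and enough privileges); the '/tmp/' prefix is just the tmp=true convention.
  EXECUTE format(
    'COPY (SELECT * FROM dataset.%I) TO %L WITH (FORMAT csv, HEADER true)',
    'vw_' || p_name,
    CASE WHEN p_tmp THEN '/tmp/' || p_file ELSE p_file END
  );
  RETURN p_file;
END;
$$ LANGUAGE plpgsql;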

Fix make failures

Failures that appear while running sh src/cache/make.sh:

ERROR:  cannot change name of input parameter "p_no_null"
HINT:  Use DROP FUNCTION array_distinct(anyarray,boolean) first.
...
CREATE FUNCTION
ERROR:  mismatched parentheses at or near ")"
LINE 41:           );
                   ^
NOTICE:  extension "file_fdw" already exists, skipping
CREATE EXTENSION
NOTICE:  drop cascades to 15 other objects
DETAIL:  drop cascades to foreign table tmpcsv6_autoridade
drop cascades to foreign table tmpcsv6_localidade
...
...
ERROR:  syntax error at or near "68"
LINE 1: ...->>2)::text AS assis_brasil,  (j->>3)::numeric AS 68 FROM (S...
                                                             ^
QUERY:  CREATE VIEW dataset.vw8_anatel2013_abrevs3letras AS SELECT  (j->>0)::text AS AC,  (j->>1)::text AS ABL,  (j->>2)::text AS assis_brasil,  (j->>3)::numeric AS 68 FROM (SELECT jsonb_array_elements(j) FROM dataset.big WHERE source=16) t(j)
CONTEXT:  PL/pgSQL function dataset."create"(integer,text,boolean,text,text) line 18 at EXECUTE
INSERT 0 1
ERROR:  missing data for column "municipio"
CONTEXT:  COPY tmpcsv8_anatel_ddd_chave, line 69: ""
SQL statement "INSERT INTO dataset.big(source,j)  SELECT 17, jsonb_agg(jsonb_build_array(ddd, municipio, uf)) FROM tmpcsv8_anatel_ddd_chave"
PL/pgSQL function dataset."create"(integer,text,boolean,text,text) line 36 at EXECUTE
INSERT 0 1

The problem appeared with the following conf:

{
  "db": "postgresql://postgres:postgres@localhost:5432/trydatasets",
  "github.com":{
    "lexml/lexml-vocabulary":null,
    "datasets/language-codes":null,
    "datasets/country-codes":null,
    "datasets/world-cities":{
      "_corrections_":{"resources":[{"primaryKey": "geonameid"}]}
    },
    "datasets-br/state-codes":"br-state-codes",
    "datasets-br/city-codes":null
  },
  "local-csv":{
     "anatel":{
       "separator":",",
       "folder":"/home/user/gits/datasets-br/city-codes/data/dump_etc"
     }
  },
  "useBig":true,  "useIDX":false,        "useRename":true,
  "useYUml":true, "useAllNsAsDft":false
}

When local-csv was changed, it also produced an error:

  "local-csv":{
    "citAux":{"folder":"/home/user/gits/datasets-br/city-codes/data"}
  },

Dataset primary-key hash options

The key column of dataset.big can be NULL, but the ideal is to keep a primary key even when it is long and complex. Adding SHA1, MD5 or CRC32 hashes as an option is proposed here for long-text keys, or when no key exists. Options:

  • hash(key): when a primary key exists and the option is used;
  • hash(c::text): when the option is used but there is no primary key;
  • hash(other_key): "overwrite" with a new primary key.

Ideally the hash is converted to base64, to optimize comparisons and reduce space, as sketched below.
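A minimal sketch of the base64 idea using the built-in md5() function (SHA1 would need the pgcrypto extension); the helper name is hypothetical:

-- Hypothetical helper: hash a (possibly long) key and return it base64-encoded.
CREATE OR REPLACE FUNCTION dataset.key_hash(p_key text) RETURNS text AS $$
  SELECT encode(decode(md5(p_key), 'hex'), 'base64');
$$ LANGUAGE sql IMMUTABLE;

-- Usage, following the options above:
--   dataset.key_hash(key)      when a primary key exists;
--   dataset.key_hash(c::text)  when it does not.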

Merging 2 datasets of dataset.big

Automatic loading of several spreadsheets can be relevant for process automation, but the process is only finished when the data has been aggregated.
There are two situations:

  1. the spreadsheet name is itself a datum: it needs to be aggregated, for example as the first field.

    1.1. moreover, the name is part of the primary key.

  2. the name carries no information for the dataset.

Algorithms that can solve the problem:

  • for case 2, it is enough to rename the dataset while checking at the same time that no duplication occurred, or triggering its handling (e.g. deleting duplicates).

  • for case 1, besides renaming, carry the name (or a transformation of it) into an initial or final field (see the jsonb_insert() function, sketched below). This also requires an addition to the metadata. For case 1.1 the primary-key metadata must also be changed.
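A minimal sketch of case 1, using jsonb_insert() to prepend the spreadsheet name to every row of a tab-aoa dataset (the name literal and source id are illustrative):

-- Prepend the spreadsheet name as the first field of each row array.
UPDATE dataset.big b
SET j = (
  SELECT jsonb_agg(jsonb_insert(r.j, '{0}', to_jsonb('sheet_name'::text)))
  FROM jsonb_array_elements(b.j) AS r(j)
)
WHERE b.source = 99;  -- illustrative dataset id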

Adopt an integer hash of the namespace name as its ns_id

As proposed and implemented after issue #7 with namespace control, the automatic SQL view names and other automatic labels can use the ns_id. So the view of a dataset named "abcd_ok" in a namespace named "xptoabc" (ns_id=3) will be "vw3_abcd_ok" instead of a long and confusing name that embeds the namespace name (e.g. "vw_xptoabc_abcd_ok").

A little problem... For tutorials and user projects that change the datasets and namespaces but come back with the same names, the ideal is to preserve, when possible, the name-to-ID correspondence (namespace name with namespace ns_id).

The solution is a hash: an integer (1 or 2 digits) hash obtained from the SHA1 of the name... It looks complex, but some existing libraries can be reused...
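A minimal sketch of the idea, assuming md5() instead of SHA1 (SHA1 would require the pgcrypto extension) and a two-digit range; the function name and range are illustrative, not the project's final choice:

-- Hypothetical helper: derive a small, stable integer id from a namespace name.
CREATE OR REPLACE FUNCTION dataset.ns_hash(p_name text) RETURNS int AS $$
  -- take 32 bits of the hash, then reduce it to the two-digit range 10..99
  SELECT abs(('x' || substr(md5(p_name), 1, 8))::bit(32)::int) % 90 + 10;
$$ LANGUAGE sql IMMUTABLE;

-- SELECT dataset.ns_hash('xptoabc');  -- always the same value for the same name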

Namespace and VIEW conventions

Ideally one more table is added to control namespaces, something like:

CREATE TABLE dataset.ns(
  -- Namespace
  nsid  serial NOT NULL PRIMARY KEY, 
  name text NOT NULL,  -- namespace label
  dft_lang text,  -- default lang of datasets
  jinfo JSONB,    -- any metadata as description, etc.
  created date DEFAULT now(),
  UNIQUE(name)
);

The view names are in the form vw_$uname, but that is only for the "default" namespace... when it is not the default, use vw{$nsid}_$uname.
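A sketch of the naming rule (assuming nsid=1 is the default namespace and "abcd_ok" is the normalized dataset name; both are illustrative):

SELECT CASE
         WHEN nsid = 1 THEN format('vw_%s', 'abcd_ok')          -- default namespace
         ELSE               format('vw%s_%s', nsid, 'abcd_ok')  -- e.g. vw3_abcd_ok
       END AS view_name
FROM dataset.ns;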

Include a routine to load all dataset repositories of a project

GitHub offers an access API. For example, for every repository under https://github.com/datasets, just use the project name, datasets, in the API URL: https://api.github.com/users/datasets/repos

Do the first tests with https://github.com/datasets-br


Other examples

curl -s https://api.github.com/orgs/twitter/repos?per_page=200 \
  | ruby -rubygems -e 'require "json"; JSON.load(STDIN.read).each { |repo| %x[git clone #{repo["ssh_url"]} ]}'

See the Python code at https://stackoverflow.com/a/8713716/287948

Add namespace control, and refactoring meta names

Each dataset name makes sense in the context of a project or, generalizing, of a namespace. Namespaces can be IDs or URNs. The dataset name must generate a consistent URN in the form namespace:datasetName.

The "canonic datasets" also need some control over "derived" ones. Example: language-codes is derived (as projection) from language-codes-full.

So the new dataset.meta will look like:

CREATE TABLE dataset.meta (
  id serial PRIMARY KEY,
  namespace text NOT NULL DEFAULT '',  -- empty namespace is the main one
  name text NOT NULL,        -- original dataset name or filename of the CSV
  is_canonic boolean DEFAULT false, -- for canonic or "reference" datasets
  sametypes_as text,   -- pointing to the kx_urn of its is_canonic dataset
  projection_of text,  -- pointing to the kx_urn of its is_canonic dataset

  -- Cache fields generated by UPDATE or trigger.
  kx_uname text, -- the normalized name, used by dataset.meta_id()
  ..., -- other caches

  info JSONb, -- all metadata (information) here!

  UNIQUE(namespace,kx_uname), -- not needed, but same as kx_urn
  CHECK( lib.normalizeterm(namespace)=namespace AND lower(namespace)=namespace ),
  CHECK( ... )
);

And a trigger can populate the kx_* fields.
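A minimal sketch of such a trigger, assuming the project's lib.normalizeterm() is the right normalization and kx_uname is the only cache field shown:

CREATE OR REPLACE FUNCTION dataset.meta_cache_tg() RETURNS trigger AS $$
BEGIN
  -- populate the cache fields from the raw name; other kx_* caches would go here
  NEW.kx_uname := lib.normalizeterm(NEW.name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER meta_cache
  BEFORE INSERT OR UPDATE ON dataset.meta
  FOR EACH ROW EXECUTE PROCEDURE dataset.meta_cache_tg();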

Assert, as validation and constraint strategies

As in the Schematron validation philosophy... To validate any kind of constraint, but mainly to validate frictionlessdata.io/table-schema/constraints:

  • some SQL constraints can be obtained with a MATERIALIZED VIEW, e.g. CREATE UNIQUE INDEX... But all the usual SQL constraints only work on real tables. ... So, do we need tables? Datasets are "read only" in general. The most important constraint is FOREIGN KEY/REFERENCES. The advantage of using tables: not reinventing the wheel.

  • standard asserts on a JTD (JSON type definition). Example: a tab-aoa JSON needs to simulate CHECK(jsonb_array_length(j)>0) on insert (see the sketch after this list)...

  • non-SQL but standard: all implemented by SQL or PL/pgSQL functions. See "not-null", "primary-key", etc. of the frictionlessdata constraints... These need standard asserts.

  • non-SQL and non-standard: ... free constraints. See the Schematron-validation-like use.
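A minimal sketch of a standard assert for the tab-aoa example, implemented as a trigger on dataset.big (a real version would first look up the JTD in dataset.meta and only check tabular datasets):

CREATE OR REPLACE FUNCTION dataset.assert_tab_aoa() RETURNS trigger AS $$
BEGIN
  -- simulate CHECK(jsonb_array_length(j) > 0) for array-of-arrays datasets
  IF jsonb_typeof(NEW.j) <> 'array' OR jsonb_array_length(NEW.j) = 0 THEN
    RAISE EXCEPTION 'tab-aoa dataset must be a non-empty JSON array (source=%)',
                    NEW.source;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_assert_tab_aoa
  BEFORE INSERT OR UPDATE ON dataset.big
  FOR EACH ROW EXECUTE PROCEDURE dataset.assert_tab_aoa();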


3-Layer architecture:

  1. Git with datapackage, CSVs, etc. The main reference and data origin... The single source of truth in the community context.
  2. The datasets on dataset.big, as single source of truth in the database context.
  3. Views, materialized views and tables for performance (fast queries) or for constraint-control in maintenance operations (validating new datasets or consolidating the ecosystem).
