@rgrp @tryggvib (and of course anyone else)
In my work so far on API design for the validation pipeline, an important issue has been (a) how to parse a stream into some kind of object-oriented interface, and (b) which backend to choose for this (and why). There are many ways to go about this, of course, and I won't go over them all; some earlier discussion took place here.
Anyway, the bottom line is that I decided to depend on Pandas to read data (CSV, and also JSON), and I currently work with that data as a Pandas DataFrame.
However, I do not want to expose all the DataFrame API directly, for the following reasons:
- Pandas is really just a backend here: end users and clients mostly just need to know about headers, rows and columns
- Future changes to this backend should have minimal impact on public APIs
- Pandas provides different interfaces for parsing CSV and JSON; I would like to provide a single interface and pass off to the appropriate parser as required
So, locally I'm now working with a DataTable class that wraps a Pandas DataFrame (here is a WIP of the code), and only exposes the properties we need for the validation pipeline (all DataFrame properties can still be accessed via self._frame).
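For anyone who hasn't looked at the WIP yet, the shape of the idea is roughly this. Note this is my own sketch, not the actual WIP code, and the property names are assumptions:

```python
import pandas as pd


class DataTable(object):
    """Thin wrapper over a pandas DataFrame.

    Only exposes what the validation pipeline needs (headers, rows,
    columns); the full DataFrame API remains reachable via self._frame.
    """

    def __init__(self, frame):
        self._frame = frame

    @property
    def headers(self):
        return list(self._frame.columns)

    @property
    def values(self):
        # Rows as plain tuples, hiding the pandas index from clients
        return [tuple(row) for row in self._frame.itertuples(index=False)]

    def get_column(self, name):
        return self._frame[name].tolist()
```

The public surface stays small, so a later switch away from Pandas would only touch the internals of this class.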
Each validator package will depend on this DataTable class, so it will become another package in the suite.
Any thoughts/comments before I commit to this pattern?
One reason I'm even writing this up as an issue is the number of "DataTable"-like interfaces already around in Python (csv.DictReader, Tablib, Dataset, Pandas' DataFrame), so I'm a bit wary of my own need to package up yet another one.