Giter Club home page Giter Club logo

cutplace's Introduction

PyPI Documentation Build Status Test coverage Black

Cutplace is a tool and API to validate that tabular data stored in CSV, Excel, ODS and PRN files conform to a cutplace interface definition (CID).

As an example, consider the following customers.csv file that stores data about customers:

customer_id,surname,first_name,born,gender
1,Beck,Tyler,1995-11-15,male
2,Gibson,Martin,1969-08-18,male
3,Hopkins,Chester,1982-12-19,male
4,Lopez,Tyler,1930-10-13,male
5,James,Ana,1943-08-10,female
6,Martin,Jon,1932-09-27,male
7,Knight,Carolyn,1977-05-25,female
8,Rose,Tammy,2004-01-12,female
9,Gutierrez,Reginald,2010-05-18,male
10,Phillips,Pauline,1960-11-09,female

A CID can describe such a file in an easy to read way. It consists of three sections. First, there is the general data format:

  Property Value
D Format Delimited
D Encoding UTF-8
D Header 1
D Line delimiter LF
D Item delimiter ,

Next there are the fields stored in the data file:

  Name Example Empty Length Type Rule
F customer_id 3798     Integer 0...99999
F surname Miller   ...60    
F first_name John X ...60    
F date_of_birth 1978-11-27     DateTime YYYY-MM-DD
F gender male X   Choice female, male

Optionally you can describe conditions that must be met across the whole file:

  Description Type Rule
C customer must be unique IsUnique customer_id

The CID can be stored in common spreadsheet formats, in particular Excel and ODS, for example cid_customers.ods.

Cutplace can validate that the data file conforms to the CID:

$ cutplace cid_customers.ods customers.csv

Now add a new line with a broken date_of_birth:

73921,Harris,Diana,04.08.1953,female

Cutplace rejects this file with the error message:

customers.csv (R12C4): cannot accept field 'date_of_birth': date must match format YYYY-MM-DD (%Y-%m-%d) but is: '04.08.1953'

Additionally, cutplace provides an easy to use API to read and write tabular data files using a common interface without having to deal with the intrinsic of data format specific modules. To read and validate the above example:

import cutplace
import cutplace.errors

cid_path = 'cid_customers.ods'
data_path = 'customers.csv'
try:
    for row in cutplace.rows(cid_path, data_path):
        pass  # We could also do something useful with the data in ``row`` here.
except cutplace.errors.DataError as error:
    print(error)

For more information, read the documentation at http://cutplace.readthedocs.org/ or visit the project at https://github.com/roskakori/cutplace.

cutplace's People

Contributors

butterhirsch avatar dependabot[bot] avatar heubi95 avatar patrick26041996 avatar ropable avatar roskakori avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

cutplace's Issues

Report line delimiter violations in CSV data

In the ICD you can specify the line delimiter to use (for example "LF"). However, currently cutplace does not report if the data use other line delimiters.

The reason for that is because Python's standard csv.reader is broken in this respect.

Add Decimal field type

Currently there is only a field type "Integer" to specify numbers, which is not appropriate for monetary values like "17.30".

The syntax for the Decimal type has to be considered, but as a first stab (expressed as CSV):

,Name,Example,Type,Empty,Length,Rule
F,amount,17.30,Decimal,,,12.2

The rule "12.2" could mean "12 digits with 2 of them after to dot".

Add example as property to class InterfaceControlDocument

Original request from SourceForge user nougat98:

i would like the fieldExamples to be attached to the icd objects so I can retrieve them to display upon errors.

presently the field examples are validated against the formats then ignorned

Fails when last char of field is escaped

OpenSUSE 12.1, Python 2.7.2, cutplace-0.7.1

Using the following ICD:

d,format,delimited
d,line delimiter,lf
d,encoding,ascii
d,quote character,""""
d,escape character,""
f,TEST1,,,,Text,

This CSV passes...

""A"

... but this one fails: [[[ looks like it processes an additional char (the final "), because "A""" passes ]]]

"A""

ERROR:cutplace:cannot handle unexpected error: newline inside string
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_cutplace.py", line 308, in main
result = process(argv)
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_cutplace.py", line 285, in process
cutPlace.validate(path)
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_cutplace.py", line 215, in validate
self.icd.validate(dataFilePath)
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/interface.py", line 709, in validate
for row in reader:
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_parsers.py", line 174, in delimitedReader
producer.join()
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_parsers.py", line 88, in join
raise self._error
Error: newline inside string

Add rule for Decimal field format

Currently the Decimal field format demands the rule to to empty.

Naturally it would be nice to address:

  • The format (leading zeros, number of digits before or after the decimal separator)
  • localization issues (decimal separator ("." or ","), optional thousands separator)
  • Value range (greater than 0, between -100.00 and +999.99, ...)

Concerning conversion to SQL and deciding on a default, the maximum ranges for a decimal(precision, scale) are:

Database Maximum precision and scale Source
IBM DB2 31 http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008469.html?cp=SSEPGG_9.7.0%2F2-10-2-3-0-0
Micosoft 38 https://msdn.microsoft.com/en-us/library/ms187746.aspx
MySQL 64 http://dev.mysql.com/doc/refman/5.0/en/precision-math-decimal-characteristics.html
Oracle 38 http://www.techonthenet.com/oracle/datatypes.php
Postgre precision 131072, scale 16383 http://www.postgresql.org/docs/9.1/static/datatype-numeric.html
Python positive integer https://docs.python.org/3/library/decimal.html

Auto detection of CSV dialect does not recognize ";" as item delimiter

Actually, the whole dialect auto detection broke with change [160]. While !DelimitedParser still detects the dialect properly, it does not pass it to csv.reader().

For instance, if you have an ICD which uses ";" as item delimiter, cutplace will reject it with a message like:

cannot process Excel format: first item in row 1 is ';some heading..." but must be empty or one of: ['c', 'd', 'f']

Handle numbers and dates in Excel properly

Excel internally uses float to represent any number or date. Xlrd's cell.value yields float values, which currently confuse cutplace field validation which expects strings.

Add source distribution

Add a source distribution based on

python setup.py sdist --formats=zip"

which would simplify installation for Jython and platforms without setuptools. With this, the following should be possible:

python setup.py install

Add sniffing of numeric fields

Goals:

  • fields always containing integer values get a sniffed type Integer.
  • fields always containing decimal values get a sniffed type Decimal with data formats properties decimal separator and thousands separator set accordingly.
  • If there are both fields with "." and "," as decimal separator, change one of them to Text by majority vote. In case counts are equal, prefer ".".

Handle encoding errors in ICD and data properly

Currently character encoding errors in the ICD or data result in a cryptic stack dump and program exit, for example:

!UnicodeDecodeError: 'ascii' codec can't decode byte 0x8a in position 1: ordinal not in range(128)

Instead, cutplace should stop processing the ICD/data and show a compact error message.

check errors get reported as a non-existent column

Original request by SourceForge user nougat98:

let's say I have 17 fields, and checks at the end of my icd like:

C,filename must be unique,IsUnique,Filename,,,

These checks get reported for "column 18", which doesn't actually exist.

Row check failed: filename must be unique: tmpaCF0y7 (R4C18): unique [Filename] has already occurred: [dir2/test2-1.dat](see also: tmpaCF0y7 %28R3C18%29: location of previous occurrence) (row 4)

This is misleading if we are trying to coalesce repeated errors or otherwise use those Row-Column units. If the error could be simply R3C* or something it would be easier to handle.

Reduce memory foot print of CSV reading

Currently the CSV reader stores the whole file in memory before parsing it. This is a relict of refactoring the whole parser infrastructure several times during the early days of cutplace.

The proper solution would be to use a consumer/producer like for ODS, where the producer would read the CSV in a separate thread.

Treat Python keywords in field names as error

When declaring a field in the ICD using a Python keyword as name (for example "if"), cutplace should reject the ICD and report an error.

Currently such names work find unless the show up in Checks that internally attempt to assign the field to a Python variable.

Add CID format version

The ICD should detect if the format version does not match the cutplace version and quickly exit with a constructive error message.

This could be implemented by having to specify the format version before anything else, for example:

V,ICD format version, 1

Add simple homepage

Currently the cutplace web site shows the user guide. It would be more useful to have a simple page that links to the relevant resources related to cutplace, in particular:

  • forums
  • user guide
  • PyPI page
  • Trac
  • ...

Add support for plugins for own field formats and checks

Add command line option --plugins to specify folder which should be scanned for python modules to import. These modules can define classes derived from AbstractFieldFormat and AbstractCheck which are available for ICDs to use.

Add a proper tutorial

The current tutorial in the user guide shows a huge ICD and explains the various aspects of it. It would be more useful to start with a very simple ICD containing only text fields and start refining the field format and adding checks with multiple steps.

This would allow to outline the various features and uses in a way that is easier to understand for first time users.

The examples from the tutorial should be stored in examples/tutorial, which should have to positive side effect of increasing the cheesecake score.

Move developer documentation to Wiki

Currently the developer documentation is part of the user guide, see section "Development".

It might me useful to integrate it in the Trac project site.

Impossible to set range on Integers:"must be within range: '-2147483648:2147483647'"

Original request by SourceForge user rufus-devrel:

It is impossible to change the range (length) of Integer fields: length is ignored.

Please change fields.py:IntegerFieldFormat:init from
self.rangeRule = ranges.Range(rule, length)
to:
if not length:
length = IntegerFieldFormat._DEFAULT_RANGE
self.rangeRule = ranges.Range(rule, length)

Version information:
cutplace 0.6.8 (2011-07-26, r541)
Python 2.6.5, Linux-2.6.38.8-gg683-x86_64-with-Ubuntu-10.04-lucid

Specific error:
error: 100-data.csv (R2C1): field u'customer_id' must match format: value is 9999125515L but must be within range: '-2147483648:2147483647'

Move project to github

Advantages:

  • easier to fork and contribute
  • faster and cleaner user interface than !SourceForge

Considerations:

  • How to migrate existing Trac tickets?
  • Forum will be gone; this should be no real loss given the past activity.

Add support for "setup.py install"

Installtion should be possible using

python setup.py install

which would be nice for systems without setuptools installed. This would also simplify installation for Jython, which currently requires the usage of the repository to obtain a working cutplace.

Move field type before field rule

Currently a field is defined using:

name,example, type, empty, length, rule

From now on this should be:

name, example, empty, length, type rule

This has the advantage that domain experts get confused a little later in the line, and have more information the they can relate to before this happens.

Reset checks when validating multiple data files

Currently checks maintain their state after validating the first data file, and consequently easily start to choke on data in the second file.

For instance, !IsUniqueCheck remembers the unique values from the first file, and complains if a value shows up again in the second file although this is perfectly valid.

ImportError: No module named xlrd

A coworker complained about this, so I tried it myself. When I try to pip-install cutplace in a fresh virtual environment, I get "ImportError: No module named xlrd". I see xlrd specified as a requirement in cutplace's setup.py, however, so I don't understand what's going on. Apologies if this is some sort of configuration error on my side. Disclaimer: I'm not a Python guru.

Versions: Python 2.7.2, setuptools 0.6c11, pip 1.3.1 (from running python in the activated environment, importing setuptools and pip and printing them). Mac OS X 10.7, BTW.

pip install -r requirements.txt 
Downloading/unpacking cutplace (from -r requirements.txt (line 1))
  Downloading cutplace-0.7.1.zip (580kB): 580kB downloaded
  Running setup.py egg_info for package cutplace
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/Users/Me/Documents/code/python/test-env/build/cutplace/setup.py", line 33, in <module>
        from cutplace import _cutplace
      File "cutplace/_cutplace.py", line 26, in <module>
        import xlrd
    ImportError: No module named xlrd
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 16, in <module>

  File "/Users/Me/Documents/code/python/test-env/build/cutplace/setup.py", line 33, in <module>

    from cutplace import _cutplace

  File "cutplace/_cutplace.py", line 26, in <module>

    import xlrd

ImportError: No module named xlrd

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.