roskakori / cutplace Goto Github PK

View Code? Open in Web Editor NEW

18.0 8.0 20.0 8.42 MB

validate data stored in CSV, PRN, ODS or Excel files

Home Page: http://cutplace.readthedocs.org/

License: GNU Lesser General Public License v3.0

Python 99.75% Shell 0.25%

validator csv excel prn ods

cutplace's Introduction

Cutplace is a tool and API to validate that tabular data stored in CSV, Excel, ODS and PRN files conform to a cutplace interface definition (CID).

As an example, consider the following customers.csv file that stores data about customers:

customer_id,surname,first_name,born,gender
1,Beck,Tyler,1995-11-15,male
2,Gibson,Martin,1969-08-18,male
3,Hopkins,Chester,1982-12-19,male
4,Lopez,Tyler,1930-10-13,male
5,James,Ana,1943-08-10,female
6,Martin,Jon,1932-09-27,male
7,Knight,Carolyn,1977-05-25,female
8,Rose,Tammy,2004-01-12,female
9,Gutierrez,Reginald,2010-05-18,male
10,Phillips,Pauline,1960-11-09,female

A CID can describe such a file in an easy to read way. It consists of three sections. First, there is the general data format:

	Property	Value
D	Format	Delimited
D	Encoding	UTF-8
D	Header	1
D	Line delimiter	LF
D	Item delimiter	,

Next there are the fields stored in the data file:

	Name	Example	Empty	Length	Type	Rule
F	customer_id	3798			Integer	0...99999
F	surname	Miller		...60
F	first_name	John	X	...60
F	date_of_birth	1978-11-27			DateTime	YYYY-MM-DD
F	gender	male	X		Choice	female, male

Optionally you can describe conditions that must be met across the whole file:

	Description	Type	Rule
C	customer must be unique	IsUnique	customer_id

The CID can be stored in common spreadsheet formats, in particular Excel and ODS, for example cid_customers.ods.

Cutplace can validate that the data file conforms to the CID:

$ cutplace cid_customers.ods customers.csv

Now add a new line with a broken date_of_birth:

73921,Harris,Diana,04.08.1953,female

Cutplace rejects this file with the error message:

customers.csv (R12C4): cannot accept field 'date_of_birth': date must match format YYYY-MM-DD (%Y-%m-%d) but is: '04.08.1953'

Additionally, cutplace provides an easy to use API to read and write tabular data files using a common interface without having to deal with the intrinsic of data format specific modules. To read and validate the above example:

import cutplace
import cutplace.errors

cid_path = 'cid_customers.ods'
data_path = 'customers.csv'
try:
    for row in cutplace.rows(cid_path, data_path):
        pass  # We could also do something useful with the data in ``row`` here.
except cutplace.errors.DataError as error:
    print(error)

For more information, read the documentation at http://cutplace.readthedocs.org/ or visit the project at https://github.com/roskakori/cutplace.

cutplace's People

Contributors

Stargazers

Watchers

Forkers

zebuline pombredanne heubi95 patrick26041996 butterhirsch chrmorais ropable bobquest33 vishakhabhasin bigsmoke quapka mjroach ytec-nl suisse mms-analytics damiendurant sushrutgirdhari mohanprabhus

cutplace's Issues

Simplify build process (ant, setup.py)

This includes:

get rid of build.xml
get rid of !DocBook, see ticket #11
get rid of XSL transformation of version.xml

Allow quoted and escaped values for choices in ChoiceFieldFormat

Currently choices can only be single words separated by a comma. It should be possible to specify strings using all of Pythons capabilities, for example:

F;color;;;red,"green","bl,ue","yell\tow"

Add rule to TextFieldFormat to specify allowed characters

Use fnmatch in PatternFieldFormat

...instead of carefully handmade regex converter.

Report line delimiter violations in CSV data

In the ICD you can specify the line delimiter to use (for example "LF"). However, currently cutplace does not report if the data use other line delimiters.

The reason for that is because Python's standard csv.reader is broken in this respect.

Include input location in errors detected in ODS data

When detecting error while reading ODS data, the exact location in the input (file, line, ...) should be included in the information presented to the user.

Add Decimal field type

Currently there is only a field type "Integer" to specify numbers, which is not appropriate for monetary values like "17.30".

The syntax for the Decimal type has to be considered, but as a first stab (expressed as CSV):

,Name,Example,Type,Empty,Length,Rule
F,amount,17.30,Decimal,,,12.2

The rule "12.2" could mean "12 digits with 2 of them after to dot".

Add example as property to class InterfaceControlDocument

Original request from SourceForge user nougat98:

i would like the fieldExamples to be attached to the icd objects so I can retrieve them to display upon errors.

presently the field examples are validated against the formats then ignorned

ValidationListener does not receive checks failed at end

When !InterfaceControlDocument.validate() calls check.checkAtEnd(), the validation listener has already been removed and won't get any events.

Fails when last char of field is escaped

OpenSUSE 12.1, Python 2.7.2, cutplace-0.7.1

Using the following ICD:

d,format,delimited
d,line delimiter,lf
d,encoding,ascii
d,quote character,""""
d,escape character,""
f,TEST1,,,,Text,

This CSV passes...

""A"

... but this one fails: [[[ looks like it processes an additional char (the final "), because "A""" passes ]]]

"A""

ERROR:cutplace:cannot handle unexpected error: newline inside string
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_cutplace.py", line 308, in main
result = process(argv)
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_cutplace.py", line 285, in process
cutPlace.validate(path)
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_cutplace.py", line 215, in validate
self.icd.validate(dataFilePath)
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/interface.py", line 709, in validate
for row in reader:
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_parsers.py", line 174, in delimitedReader
producer.join()
File "/usr/local/lib/python2.7/site-packages/cutplace-0.7.1-py2.7.egg/cutplace/_parsers.py", line 88, in join
raise self._error
Error: newline inside string

Change InterfaceControlDocument.checkNames to return check names in order defined

Currently it uses whatever order the dictionary's keys() returns.

Note: turns out checkNames was horribly broken and actually returned the field format names.

Add rule for Decimal field format

Currently the Decimal field format demands the rule to to empty.

Naturally it would be nice to address:

The format (leading zeros, number of digits before or after the decimal separator)
localization issues (decimal separator ("." or ","), optional thousands separator)
Value range (greater than 0, between -100.00 and +999.99, ...)

Concerning conversion to SQL and deciding on a default, the maximum ranges for a decimal(precision, scale) are:

Database	Maximum precision and scale	Source
IBM DB2	31	http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008469.html?cp=SSEPGG_9.7.0%2F2-10-2-3-0-0
Micosoft	38	https://msdn.microsoft.com/en-us/library/ms187746.aspx
MySQL	64	http://dev.mysql.com/doc/refman/5.0/en/precision-math-decimal-characteristics.html
Oracle	38	http://www.techonthenet.com/oracle/datatypes.php
Postgre	precision 131072, scale 16383	http://www.postgresql.org/docs/9.1/static/datatype-numeric.html
Python	positive integer	https://docs.python.org/3/library/decimal.html

Auto detection of CSV dialect does not recognize ";" as item delimiter

Actually, the whole dialect auto detection broke with change [160]. While !DelimitedParser still detects the dialect properly, it does not pass it to csv.reader().

For instance, if you have an ICD which uses ";" as item delimiter, cutplace will reject it with a message like:

cannot process Excel format: first item in row 1 is ';some heading..." but must be empty or one of: ['c', 'd', 'f']

Add test case for web validation

Make test_web.py work.

The poster module could be useful for this: http://pypi.python.org/pypi/poster/.

Handle numbers and dates in Excel properly

Excel internally uses float to represent any number or date. Xlrd's cell.value yields float values, which currently confuse cutplace field validation which expects strings.

Add source distribution

Add a source distribution based on

python setup.py sdist --formats=zip"

which would simplify installation for Jython and platforms without setuptools. With this, the following should be possible:

python setup.py install

Add sniffing of numeric fields

Goals:

fields always containing integer values get a sniffed type Integer.
fields always containing decimal values get a sniffed type Decimal with data formats properties decimal separator and thousands separator set accordingly.
If there are both fields with "." and "," as decimal separator, change one of them to Text by majority vote. In case counts are equal, prefer ".".

Replace test and coverage reports with Jenkins

Get rid of dev_*.py and use a Jenkins job instead.

--split should actually write files

--split does not work as described, no *_accepted.csv and *.rejected.txt are written.

Handle encoding errors in ICD and data properly

Currently character encoding errors in the ICD or data result in a cryptic stack dump and program exit, for example:

!UnicodeDecodeError: 'ascii' codec can't decode byte 0x8a in position 1: ordinal not in range(128)

Instead, cutplace should stop processing the ICD/data and show a compact error message.

check errors get reported as a non-existent column

Original request by SourceForge user nougat98:

let's say I have 17 fields, and checks at the end of my icd like:

C,filename must be unique,IsUnique,Filename,,,

These checks get reported for "column 18", which doesn't actually exist.

Row check failed: filename must be unique: tmpaCF0y7 (R4C18): unique [Filename] has already occurred: [dir2/test2-1.dat](see also: tmpaCF0y7 %28R3C18%29: location of previous occurrence) (row 4)

This is misleading if we are trying to coalesce repeated errors or otherwise use those Row-Column units. If the error could be simply R3C* or something it would be easier to handle.

Add example for own field and check to API tutorial

Added emptyValue parameter to constructors of all field formats

Reduce memory foot print of CSV reading

Currently the CSV reader stores the whole file in memory before parsing it. This is a relict of refactoring the whole parser infrastructure several times during the early days of cutplace.

The proper solution would be to use a consumer/producer like for ODS, where the producer would read the CSV in a separate thread.

Treat Python keywords in field names as error

When declaring a field in the ICD using a Python keyword as name (for example "if"), cutplace should reject the ICD and report an error.

Currently such names work find unless the show up in Checks that internally attempt to assign the field to a Python variable.

Include input location in errors detected in ICD

When detecting error while reading the ICD, the exact location in the input (file, line, ...) should be included in the information presented to the user.

Add CID format version

The ICD should detect if the format version does not match the cutplace version and quickly exit with a constructive error message.

This could be implemented by having to specify the format version before anything else, for example:

V,ICD format version, 1

Add simple homepage

Currently the cutplace web site shows the user guide. It would be more useful to have a simple page that links to the relevant resources related to cutplace, in particular:

forums
user guide
PyPI page
Trac
...

Add support for plugins for own field formats and checks

Add command line option --plugins to specify folder which should be scanned for python modules to import. These modules can define classes derived from AbstractFieldFormat and AbstractCheck which are available for ICDs to use.

Add a proper tutorial

The current tutorial in the user guide shows a huge ICD and explains the various aspects of it. It would be more useful to start with a very simple ICD containing only text fields and start refining the field format and adding checks with multiple steps.

This would allow to outline the various features and uses in a way that is easier to understand for first time users.

The examples from the tutorial should be stored in examples/tutorial, which should have to positive side effect of increasing the cheesecake score.

Support Excel and ODS as ICD format for web server

Currently a ICD passed to the cutplace web server has to be in CSV format. The server should be able to auto detect the format of the ICD and handle Excel and ODS properly.

Move developer documentation to Wiki

Currently the developer documentation is part of the user guide, see section "Development".

It might me useful to integrate it in the Trac project site.

Impossible to set range on Integers:"must be within range: '-2147483648:2147483647'"

Original request by SourceForge user rufus-devrel:

It is impossible to change the range (length) of Integer fields: length is ignored.

Please change fields.py:IntegerFieldFormat:init from
self.rangeRule = ranges.Range(rule, length)
to:
if not length:
length = IntegerFieldFormat._DEFAULT_RANGE
self.rangeRule = ranges.Range(rule, length)

Version information:
cutplace 0.6.8 (2011-07-26, r541)
Python 2.6.5, Linux-2.6.38.8-gg683-x86_64-with-Ubuntu-10.04-lucid

Specific error:
error: 100-data.csv (R2C1): field u'customer_id' must match format: value is 9999125515L but must be within range: '-2147483648:2147483647'

Change public instance variables to read only properties

...so they show up in the API documentation properly.

Include input location in errors detected in CSV data

When detecting error while reading CSV data, the exact location in the input (file, line, ...) should be included in the information presented to the user.

Move project to github

Advantages:

easier to fork and contribute
faster and cleaner user interface than !SourceForge

Considerations:

How to migrate existing Trac tickets?
Forum will be gone; this should be no real loss given the past activity.

Add support for "setup.py install"

Installtion should be possible using

python setup.py install

which would be nice for systems without setuptools installed. This would also simplify installation for Jython, which currently requires the usage of the repository to obtain a working cutplace.

Add option to split data in accepted and rejected rows

If the option --split is set, accepted rows will be stored in *_accepted.csv and rejected ones in *_rejected.csv

Add tool to generate draft ICD by analyzing example files

Usage should be along the line of:

{{{
cutsniff ICD_customers.csv sample_customers1.ods
}}}

Move field type before field rule

Currently a field is defined using:

name,example, type, empty, length, rule

From now on this should be:

name, example, empty, length, type rule

This has the advantage that domain experts get confused a little later in the line, and have more information the they can relate to before this happens.

Fix that command line client gets stuck on ICD in ODS format with syntax error

Steps to reproduce:

$ cutplace tests/input/icds/broken_syntax_error.ods 
ERROR:cutplace:broken_syntax_error.ods (R13C3): cannot validate example for field u'gender': value is u'some' but must be one of: u'male' or u'female'

Add proper test command to setup.py

The test suite in test_all.py resp. dev_test.py should be called when running

python setup.py test

Include input location in errors detected in fixed data

When detecting error while reading fixed data, the exact location in the input (file, line, ...) should be included in the information presented to the user.

Migrate user guide from DocBook to Restructured Text

...mostly in order to be more Pythonic. Still, will look nicer out of the box and there will be a "Search" field without any effort. The tool of choice is sphinx.

Reset checks when validating multiple data files

Currently checks maintain their state after validating the first data file, and consequently easily start to choke on data in the second file.

For instance, !IsUniqueCheck remembers the unique values from the first file, and complains if a value shows up again in the second file although this is perfectly valid.

Add possibility to specify decimal separator

It should be possible to specify the decimal separator used by DecimalFieldFormat. Possible values should be "." and ",".

Add validating CSV writer

Add a writer similar to csv.writer that validates the written data.

ImportError: No module named xlrd

A coworker complained about this, so I tried it myself. When I try to pip-install cutplace in a fresh virtual environment, I get "ImportError: No module named xlrd". I see xlrd specified as a requirement in cutplace's setup.py, however, so I don't understand what's going on. Apologies if this is some sort of configuration error on my side. Disclaimer: I'm not a Python guru.

Versions: Python 2.7.2, setuptools 0.6c11, pip 1.3.1 (from running python in the activated environment, importing setuptools and pip and printing them). Mac OS X 10.7, BTW.

pip install -r requirements.txt 
Downloading/unpacking cutplace (from -r requirements.txt (line 1))
  Downloading cutplace-0.7.1.zip (580kB): 580kB downloaded
  Running setup.py egg_info for package cutplace
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/Users/Me/Documents/code/python/test-env/build/cutplace/setup.py", line 33, in <module>
        from cutplace import _cutplace
      File "cutplace/_cutplace.py", line 26, in <module>
        import xlrd
    ImportError: No module named xlrd
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 16, in <module>

  File "/Users/Me/Documents/code/python/test-env/build/cutplace/setup.py", line 33, in <module>

    from cutplace import _cutplace

  File "cutplace/_cutplace.py", line 26, in <module>

    import xlrd

ImportError: No module named xlrd

Web interface should be able to use ICDs in Excel format

Make field type optional and use "Text" as default

This allows to quickly prototype the ICD fields section by simply enumerating the field names.