
opentabulate's Issues

Pre/post processing and compressed datasets

Wanted features:

Pre/post processing source file tag

Some datasets require manual intervention that was not handled by the original maintainer, such as reformatting and correcting format errors. Similarly, there may be post-processing edits that one would like the production system to apply automatically, given a script.

  • Add pre and post tags to a source file, each accepting a list of strings containing paths to scripts to run for that specific dataset (see the sketch after this list)
  • The scripts are executed in order
  • pre corresponds to scripts to run prior to general data processing and post refers to scripts to run afterwards
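
For illustration, a source file carrying these tags might look like the following sketch (the pre and post keys are as proposed above; the file and script paths are invented):

    {
        "localfile": "dataset.csv",
        "pre": ["scripts/fix_header.py", "scripts/correct_encoding.py"],
        "post": ["scripts/strip_empty_rows.py"]
    }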

Handling compressed datasets

Some datasets are packaged in a zip or tar archive, perhaps with a specific compression algorithm.

  • Write a method to handle archived files
  • Implement a source file feature to specify whether or not a dataset is compressed, and in which format
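
A minimal sketch of such a method, assuming Python's standard zipfile and tarfile modules cover the archive formats encountered:

    import tarfile
    import zipfile

    def extract_archive(path, dest):
        # Extract a zip or tar archive (with any tar compression) into dest.
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                zf.extractall(dest)
        elif tarfile.is_tarfile(path):
            # tarfile.open() autodetects gzip, bzip2, and xz compression.
            with tarfile.open(path) as tf:
                tf.extractall(dest)
        else:
            raise ValueError("unsupported archive format: %s" % path)

The localarchive and compression source file keys mentioned in the syntax checking notes below would tell OpenTabulate when to invoke such a method.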

Filter by type in input files

Some open data sources are "Points of Interest", which may contain objects from one or several data themes (e.g., locations of libraries, hospitals, police stations, etc). A way to filter within OpenTabulate would reduce the need to pre-process these files into separate entries for each theme.
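
A hypothetical source file fragment illustrating the idea (the filter key and column name are invented for illustration; the regex value follows the entry filtering convention described later on this page):

    {
        "filter": {
            "facility_type": "library|hospital"
        }
    }

Only records whose facility_type column matches the pattern would be kept for tabulation.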

Source file syntax checking limitations

Part of the data processing software checks syntax for every source file before processing. This is done to make it easier to debug issues with processing input, output, and functionality. It does not cover all corner cases, some of which are listed below.

Missing checks (for future version 2.0)

  • Empty strings, empty lists, or empty objects for values
  • Handling values which are not JSON arrays, strings, or objects
  • Duplicate key names
  • Containerizing or securing execution of pre/post-processing scripts

Missing checks (for version 1.0)

  • Requirement of localarchive and compression if url downloads an archive file
  • Empty strings, empty lists, or empty objects for values
  • Handling values which are not JSON arrays, strings, or objects
  • Duplicate key names
  • Handling force:* values (does not check if ":" or "force" occurs as a substring)
  • Executability of pre/post-processing scripts
  • The colon syntax for localfile

Any of the above checks that is implemented in a future version will be removed from this list.
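
As an example of one missing check, duplicate key names in a JSON source file could be caught with the json module's object_pairs_hook; a minimal sketch, not OpenTabulate's actual code:

    import json

    def reject_duplicate_keys(pairs):
        # json.loads() normally resolves duplicate keys by keeping the last
        # value; intercepting the raw key-value pairs lets us flag them.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                raise ValueError("duplicate key in source file: %r" % key)
            seen.add(key)
        return dict(pairs)

    source = json.loads('{"localfile": "a.csv"}', object_pairs_hook=reject_duplicate_keys)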

'county' key

Should the county key be moved into the general tags list? It is currently duplicated (in code and documentation) for both the library and hospital database types.

preprocessing character encoding bug

In the experimental version of OpenTabulate (and likely the older versions), OpenTabulate always tries to process pre-processed data using the original encoding. So if a pre-processing script does not preserve the original character encoding, decoding errors are likely during processing.

Change time in which importing of postal occurs

To avoid loading the postal module (which takes a few seconds and loads 2 GB into memory) unnecessarily (for instance, when running $ python pdctl.py --ignore-proc), it should be imported only when needed, i.e. when data processing begins.

This will require the data processing to be broken into separate stages, all of which can still support parallelization. The downside is that more processes would have to be spawned over the runtime of the program.
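
A minimal sketch of the deferred import (the surrounding function and row layout are invented; parse_address from postal.parser is the import named elsewhere on this page):

    def run_data_processing(rows):
        # Import postal only once processing actually starts, so commands
        # like --ignore-proc never pay the multi-second, 2 GB load cost.
        from postal.parser import parse_address
        for row in rows:
            row["parsed_address"] = parse_address(row["address"])
        return rows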

Dataset flaw

Many of the given datasets contain column entries with commas embedded in quotes (e.g. "NEYRA-NESTERENKO, MAX"). The CSV parser simply splits a row on commas without accounting for commas embedded in quotation marks, which yields incorrect output CSV.

Because obrparser.py makes use of csv.DictReader, the proposed fix is to clean the datasets via stream manipulation before processing.
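
To illustrate the difference, a short sketch comparing a naive comma split with Python's quote-aware csv module (the row contents are invented):

    import csv
    import io

    row = '"NEYRA-NESTERENKO, MAX",Ontario'
    print(row.split(','))
    # ['"NEYRA-NESTERENKO', ' MAX"', 'Ontario']  <- quoted field broken in two
    print(next(csv.reader(io.StringIO(row))))
    # ['NEYRA-NESTERENKO, MAX', 'Ontario']       <- quoting respected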

Error in Configuration Error message

After running the following command on the command line:

$ opentab -s <file_name>.json

I got the following error message:

Configuration error: Value of label '%s' is not a tuple

This is because in the opentabulate.conf file, under the [labels] section, I had the following key-value pair:

metadata = ('localfile')

which Python recognizes as just a string, instead of a tuple (as required by opentab). But you can reproduce this error with any value that is not a tuple under the [labels] section such as:

testing = 7

I understand what this error message means and how to fix it, but it would've been more helpful and saved me a bit of time if '%s' had been properly formatted to show the key-value pair causing the problem.
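
For reference, the trailing comma is what turns a parenthesized expression into a one-element tuple in Python:

    >>> type(('localfile'))
    <class 'str'>
    >>> type(('localfile',))  # trailing comma makes this a tuple
    <class 'tuple'>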

JSON data parsing problems

Several sample datasets that are stored in JSON format have incorrect JSON syntax. Moreover, they run into the same encoding/decoding issues as presented with the CSV parser.

Proposed (temporary) solutions/workarounds:

  • Write specialized scripts for these datasets to correct syntax (e.g. using sed)
  • Follow the same encoding/decoding handling as the CSV parser
  • Write specialized scripts to handle problematic characters that are not valid UTF-8 (see the sketch below)
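
A minimal sketch of the last workaround, assuming the problematic files are in a legacy encoding such as cp1252 (the actual source encoding would need to be confirmed per dataset):

    # Re-encode a dataset to UTF-8 before it reaches the JSON parser,
    # substituting any bytes that cannot be decoded.
    with open("dataset.json", "rb") as f:
        raw = f.read()
    with open("dataset_utf8.json", "w", encoding="utf-8") as f:
        f.write(raw.decode("cp1252", errors="replace"))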

Character encode checking and entry filtering

Character Encoding Checking: OpenTabulate performs character encoding checks twice on raw data that comes in CSV format. There is no noticeable performance hit, although it would be preferable to eliminate the redundancy where it is unnecessary.

JSON lists for entry filtering: Currently, lists of regular expressions are accepted in a source/metadata file for entry filtering, where the list acts like a logical OR. This is not needed since regular expressions already support such functionality using |.
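
That is, a list of patterns behaving as a logical OR matches exactly the same entries as a single alternation pattern (illustrative values):

    import re

    patterns = [re.compile(p) for p in ("library", "hospital")]
    combined = re.compile("library|hospital")

    entry = "public library branch"
    assert any(p.search(entry) for p in patterns) == bool(combined.search(entry))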

Unintended behaviour in use of multiple pre-processing scripts

Supplying a list of at least two pre-processing scripts in a source file produces unintended behaviour.

Currently, OpenTabulate makes use of a pre-processing script by making a new copy of the corresponding raw dataset and then applying the script's custom reformatting to the copy. If more than one script is used, this methodology is applied to each script, so scripts that run later overwrite the reformatting of all previously run scripts.
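
One possible fix, sketched below under the assumption that each script takes the path of the file to edit in place as its sole argument: copy the raw dataset once, then run every script against the same working copy so edits accumulate instead of being overwritten.

    import shutil
    import subprocess

    def run_preprocessing(raw_path, work_path, scripts):
        # Copy the raw dataset a single time...
        shutil.copyfile(raw_path, work_path)
        # ...then let each script edit the same working copy in order,
        # so later scripts see the output of earlier ones.
        for script in scripts:
            subprocess.run([script, work_path], check=True)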

Documentation Error: lowercase_entries

In the configuration section of the documentation, it states that a valid key-value pair for the [general] section is lowercase_entries. However, as of v2.1.1, if you actually include the key lowercase_entries in the opentabulate.conf file, it raises the following error:

Configuration error: 'lowercase_entries' is not a valid option in 'general' section

After some searching through other projects' .conf files, it appears that the correct key is actually lowercase_output.

Error: "Failed to load address parser module"

Hi,
I'm trying to use OpenTabulate to process the "Federal corporations" dataset in

LODE-ECDO/sources/Business/.

I've downloaded the following source file:

misc/open_data_cfrci.json

I use this command line:
opentab --verbose .opentabulate/sources/open_data_cfrc1.json

Everything is fine until line 237 of the opentab.py script. This line tries to import the function parse_address from postal.parser. Here is the problem: the postal module doesn't exist.

How can I solve this problem in order to parse "Federal corporations" dataset?

Thanks for helping me!

How-to and FAQ

It would be nice to have a single one-pager that links to all the excellently prepared resources here, so as to serve at least three functions:

  1. transporting/cloning the OBR to a new Linux workstation/server environment (and all that it requires);
  2. updating the OBR with new municipalities which are not already covered in the existing database; and
  3. updating existing municipalities/data sources with new links/sources and potentially new formats.

Also, an FAQ would be helpful, addressing questions such as:

  • What is the OBR?
  • What type of data does it use?
  • How do we use the OBR?
  • Why was the OBR constructed?
  • What limitations are present in the OBR?
  • What is the future of the OBR?

Note: issues with debugging errors in Algorithm.parse()

The process method in tabctl.py will silently fail (as in, produce no output to the terminal) if any exception is caught, excluding the special case when the wrong column names are defined for a CSV dataset.

This is because the try-except block that wraps parse in process does not re-raise or print any exceptions and simply returns the integer 1.
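
A sketch of how process could surface the exception instead of swallowing it (the function names approximate the code described above):

    import logging

    def process(source, parse):
        try:
            parse(source)
        except Exception:
            # Log the full traceback before signalling failure,
            # rather than silently returning 1.
            logging.exception("parse() raised an exception for source %r", source)
            return 1
        return 0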

Detailed logs

Request: add a 'debug' mode for obrpdctl.py, which creates more verbose logs.

Remarks:

  • requires io module to order outputs from multiprocessing pool
