csbp-cpse / opentabulate

This is an exploratory and experimental open project.
License: Other
Should the county key be moved into the general tags list? It is currently duplicated (in both code and documentation) for the library and hospital database types.
To avoid loading the postal module unnecessarily (it takes a few seconds and loads about 2 GB into memory), for instance when running $ python pdctl.py --ignore-proc, it should be loaded lazily, only once data processing begins.
This will require the data processing to be broken up into different stages, all of which can still support parallelization. The downside is that more processes would have to be spawned over the program's run time.
Supplying a list of two or more pre-processing scripts in a source file has unintended behaviour.
Currently, OpenTabulate makes use of a pre-processing script by creating a new copy of the corresponding raw dataset and then applying the script's custom reformatting to the copy. If more than one script is used, this methodology is applied to each script, which means scripts that run later overwrite the reformatting of all previously run scripts.
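One possible fix, sketched below under the assumption that each script takes the working file path as its only argument (the helper name and calling convention are hypothetical), is to copy the raw dataset once and let every script edit the same working copy in sequence:

```python
import shutil
import subprocess
import sys

def run_preprocessing(raw_path, work_path, scripts):
    """Copy the raw dataset once, then run each script in order on the
    same working copy so later scripts see earlier scripts' changes."""
    shutil.copyfile(raw_path, work_path)
    for script in scripts:
        subprocess.run([sys.executable, script, work_path], check=True)
```

The key difference from the behaviour reported above is that the copy happens once, outside the per-script loop.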
It would be nice to have a single one-pager that links to all the resources excellently prepared here, so as to achieve at least three functions:
Also, a FAQ would be helpful addressing questions such as:
What is the OBR?
What type of data does it use?
How do we use the OBR?
Why was the OBR constructed?
What limitations are present in the OBR?
What is the future of the OBR?
Combine and assemble the documentation in a GitHub wiki.
The process method in tabctl.py will silently fail (that is, produce no output to the terminal) if any exception is caught, excluding the special case where the wrong column names are defined for a CSV dataset.
This is because the try-except block that wraps parse in process neither re-raises nor prints any exceptions and simply returns the integer 1.
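A sketch of the kind of fix this implies, with the signature simplified (parse is passed in as a parameter purely so the example is self-contained; the real method's interface may differ):

```python
import sys
import traceback

def process(source, parse):
    """Run parse on a source, reporting rather than swallowing errors."""
    try:
        parse(source)
    except Exception as err:
        # Previously this branch returned 1 silently; now it says why.
        print("Error while processing %r: %s" % (source, err), file=sys.stderr)
        traceback.print_exc()
        return 1
    return 0
```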
Part of the data processing software checks the syntax of every source file before processing. This is done to make it easier to debug issues with processing input, output, and functionality. It does not cover all corner cases, some of which are listed below:
- localarchive and compression, when url downloads an archive file
- force:* values (it does not check whether ":" or "force" occurs only as a substring)
- localfile
Any of the above that gets a check in a future implementation will be removed from this list.
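As an illustration of the force:* case, a stricter check could anchor the prefix instead of looking for substrings (this helper is hypothetical, not part of OpenTabulate):

```python
import re

def check_force_value(value):
    """Hypothetical check for 'force:*' values: the string must start
    with exactly 'force:' followed by something, rather than merely
    containing 'force' or ':' somewhere."""
    return re.match(r"^force:.+$", value) is not None
```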
Many of the given datasets contain column entries with embedded commas (e.g. "NEYRA-NESTERENKO, MAX"). The CSV parser simply splits a row on commas, ignoring that a comma may be embedded in quotations, which gives incorrect output CSV.
Since obrparser.py makes use of csv.DictReader, the proposed fix is to clean the datasets via stream manipulation before processing.
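For comparison, Python's standard csv module already treats a quoted comma as part of the field rather than as a delimiter (the second column below is made up for illustration):

```python
import csv
import io

# The embedded comma stays inside the first column because it is quoted.
row = next(csv.reader(io.StringIO('"NEYRA-NESTERENKO, MAX",librarian\n')))
# row == ["NEYRA-NESTERENKO, MAX", "librarian"]
```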
In the experimental version of OpenTabulate (and likely in older versions), OpenTabulate always tries to process pre-processed data using the original character encoding. So if a pre-processing script does not preserve the original encoding, decoding errors are likely during processing.
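The failure mode can be reproduced in isolation: decoding bytes with the wrong codec either raises UnicodeDecodeError or silently produces mojibake.

```python
text = "Montréal"
utf8_bytes = text.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 does not raise, but mangles the text.
assert utf8_bytes.decode("latin-1") != text  # yields "MontrÃ©al"

# Decoding as ASCII fails outright on the non-ASCII byte.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError:
    pass
```

This is why the processing stage must know which encoding the pre-processing script actually wrote.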
After running the following command on the command line:
$ opentab -s <file_name>.json
I got the following error message:
Configuration error: Value of label '%s' is not a tuple
This is because in the opentabulate.conf file, under the [labels] section, I had the following key-value pair:
metadata = ('localfile')
which Python recognizes as just a string, not a tuple (as required by opentab). You can reproduce this error with any non-tuple value under the [labels] section, such as:
testing = 7
I understand what this error message means and how to fix it, but it would have been more helpful, and would have saved me some time, if '%s' had been formatted properly to output the key-value pair causing the problem.
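Both points are easy to demonstrate in plain Python (the label name below is taken from the report above):

```python
# Parentheses alone do not make a tuple; the trailing comma does.
assert ("localfile") == "localfile"            # still just a string
assert ("localfile",) == tuple(["localfile"])  # a one-element tuple

# Properly interpolating the offending label into the error message:
label = "metadata"
msg = "Configuration error: Value of label '%s' is not a tuple" % label
# msg == "Configuration error: Value of label 'metadata' is not a tuple"
```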
Hi,
I'm trying to use OpenTabulate to process the "Federal corporations" dataset in
LODE-ECDO/sources/Business/.
I've downloaded the following source file:
misc/open_data_cfrci.json
I use this command line:
opentab --verbose .opentabulate/sources/open_data_cfrc1.json
Everything is fine until line 237 of the "opentab.py" script. On this line we want to import a function parse_address from postal.parser. Here is the problem: postal doesn't exist.
How can I solve this problem in order to parse the "Federal corporations" dataset?
Thanks for helping me!
Several sample datasets that are stored in JSON format have incorrect JSON syntax. Moreover, they run into the same encoding/decoding issues as presented with the CSV parser.
Proposed (temporary) solutions/workarounds:
- stream manipulation (e.g. sed)
Wanted features:
Some specific datasets require manual intervention that was not handled by the original maintainer, such as reformatting and format error correction. Similarly, there may be desired post-processing edits that one would like the production system to handle automatically, given a script.
- a pre and a post tag in a source file, each accepting a list of strings containing paths to scripts to run for that specific dataset
- pre corresponds to scripts run prior to general data processing; post refers to scripts run afterwards
Some datasets are packaged in a zip or tar archive, perhaps with a specific compression algorithm.
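A sketch of handling both archive types with the standard library (the helper name and error handling are illustrative, not an existing OpenTabulate API):

```python
import tarfile
import zipfile

def extract_archive(path, dest):
    """Hypothetical helper: extract a zip or tar archive into dest."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(path):
        # tarfile transparently handles gz/bz2/xz compression.
        with tarfile.open(path) as tf:
            tf.extractall(dest)
    else:
        raise ValueError("unsupported archive: %s" % path)
```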
Character Encoding Checking: OpenTabulate performs character encoding checking twice on raw data that comes in CSV format. There is no noticeable performance hit, although it would be preferable to eliminate the redundancy where it is not necessary.
JSON lists for entry filtering: Currently, lists of regular expressions are accepted in a source/metadata file for entry filtering, where the list acts as a logical OR. This is not needed, since regular expressions already support such functionality using the | alternation operator.
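For example, a list of patterns can be collapsed into a single alternation (the pattern list here is made up for illustration):

```python
import re

patterns = ["library", "hospital"]         # illustrative filter list
combined = re.compile("|".join(patterns))  # equivalent single pattern

assert combined.search("public library") is not None
assert combined.search("police station") is None
```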
In the configuration section of the documentation, it states that a valid key-value pair for the [general] section is lowercase_entries. However, as of v2.1.1, if you actually include the key lowercase_entries in the opentabulate.conf file, it raises the following error:
Configuration error: 'lowercase_entries' is not a valid option in 'general' section
After some searching through other projects' .conf files, it appears that the correct key is actually lowercase_output.
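Assuming the v2.1.1 behaviour described above, the working fragment would use the lowercase_output key (the value shown is illustrative):

```ini
[general]
lowercase_output = true
```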
Some open data sources are "Points of Interest", which may contain objects from one or several data themes (e.g., locations of libraries, hospitals, police stations, etc). A way to filter within OpenTabulate would reduce the need to pre-process these files into separate entries for each theme.
Request: add a 'debug' mode for obrpdctl.py, which creates more verbose logs.
Remarks:
- use the io module to order outputs from the multiprocessing pool
Need to add "license" in the source file, with a path to the open data license. This should be done for all source files.