
DataPusher

DataPusher is a standalone web service that automatically downloads tabular data files (such as CSV or Excel) from a CKAN site's resources when they are added, parses them to extract the actual data, and then uses the DataStore API to push the data into the CKAN site's DataStore.

This makes the data from the resource files available via CKAN's DataStore API. In particular, many of CKAN's data preview and visualization plugins will only work (or will work much better) with files whose contents are in the DataStore.

To get it working you have to:

  1. Deploy a DataPusher instance to a server (or use an existing DataPusher instance)
  2. Enable and configure the datastore plugin on your CKAN site.
  3. Enable and configure the datapusher plugin on your CKAN site.

Note that if you installed CKAN using the package install option then a DataPusher instance should be automatically installed and configured to work with your CKAN site.

DataPusher is built using CKAN Service Provider and Messytables.

The original author of DataPusher was Dominik Moritz [email protected]. For the current list of contributors see github.com/ckan/datapusher/contributors

Development installation

Install the required packages::

sudo apt-get install python-dev python-virtualenv build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev

Get the code::

git clone https://github.com/ckan/datapusher
cd datapusher

Install the dependencies::

pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .

Run the DataPusher::

python datapusher/main.py deployment/datapusher_settings.py

By default, DataPusher runs at the following address:

http://localhost:8800/
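
To check that the service is up, here is a quick smoke test using the Python requests library (assuming the development server above is running on the default port):

    import requests

    r = requests.get('http://localhost:8800/')
    r.raise_for_status()  # raises if the service returned an HTTP error
    print(r.status_code, r.text[:200])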

If you need to change the host or port, copy deployment/datapusher_settings.py to deployment/datapusher_local_settings.py and modify the file to suit your needs. Also, if running a production setup, make sure that the host and port match the http settings in the uWSGI configuration.
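
A minimal sketch of such an override file, assuming only the host and port need to change (HOST and PORT are the setting names listed in the configuration table further below):

    # deployment/datapusher_local_settings.py
    HOST = '127.0.0.1'  # bind to localhost only
    PORT = 8801         # use a non-default port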

To run the tests:

pytest

Production deployment

Note: If you installed CKAN via a package install, the DataPusher has already been installed and deployed for you. You can skip directly to the Configuring section.

These instructions assume you already have CKAN installed on this server in the default location described in the CKAN install documentation (/usr/lib/ckan/default). If this is correct, you should be able to run the following commands directly; if not, adapt the paths below to your setup.

These instructions set up the DataPusher web service on uWSGI running on port 8800, but can be easily adapted to other WSGI servers like Gunicorn. You'll probably need to set up Nginx as a reverse proxy in front of it and something like Supervisor to keep the process up.

 # Install requirements for the DataPusher
 sudo apt install python3-venv python3-dev build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev

 # Create a virtualenv for datapusher
 sudo python3 -m venv /usr/lib/ckan/datapusher

 # Create a source directory and switch to it
 sudo mkdir /usr/lib/ckan/datapusher/src
 cd /usr/lib/ckan/datapusher/src

 # Clone the source (you should target the latest tagged version)
 sudo git clone -b 0.0.17 https://github.com/ckan/datapusher.git

 # Install the DataPusher and its requirements
 cd datapusher
 sudo /usr/lib/ckan/datapusher/bin/pip install -r requirements.txt
 sudo /usr/lib/ckan/datapusher/bin/python setup.py develop

 # Create a user to run the web service (if necessary)
 sudo addgroup www-data
 sudo adduser -G www-data www-data

 # Install uWSGI
 sudo /usr/lib/ckan/datapusher/bin/pip install uwsgi

At this point you can run DataPusher with the following command:

/usr/lib/ckan/datapusher/bin/uwsgi -i /usr/lib/ckan/datapusher/src/datapusher/deployment/datapusher-uwsgi.ini

Note: If you are installing the DataPusher in a location other than the default one, you need to adapt the relevant paths in datapusher-uwsgi.ini to the ones you are using. You might also need to change the uid and gid settings when using a different user.

High Availability Setup

The default DataPusher configuration uses SQLite as the backend for the jobs database and a single uWSGI thread. To increase performance and concurrency you can configure DataPusher in the following way:

  1. Use Postgres as the database backend, which will allow concurrent writes (and provide a more reliable backend in any case). To use Postgres, create a user and a database and update the SQLALCHEMY_DATABASE_URI setting accordingly (see the connectivity check sketch after this list):

    # This assumes DataPusher is already installed
    sudo apt-get install postgresql libpq-dev
    sudo -u postgres createuser -S -D -R -P datapusher_jobs
    sudo -u postgres createdb -O datapusher_jobs datapusher_jobs -E utf-8
    
    # Run this in the virtualenv where DataPusher is installed
    pip install psycopg2
    
    # Edit SQLALCHEMY_DATABASE_URI in datapusher_settings.py accordingly
    # eg SQLALCHEMY_DATABASE_URI=postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs
    
  2. Start more uWSGI workers and threads. In the deployment/datapusher-uwsgi.ini file, set workers and threads to values that suit your needs, and add the lazy-apps=true setting to avoid concurrency issues with SQLAlchemy, e.g.:

    # ... rest of datapusher-uwsgi.ini
    workers         =  3
    threads         =  3
    lazy-apps       =  true
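
To confirm the Postgres jobs database from step 1 is reachable with the URI configured above, here is a hedged connectivity check using SQLAlchemy (run it inside the DataPusher virtualenv, substituting your real password):

    from sqlalchemy import create_engine, text

    engine = create_engine(
        'postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs')
    with engine.connect() as conn:
        print(conn.execute(text('SELECT 1')).scalar())  # prints 1 if the DB is up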
    

Configuring

CKAN Configuration

Add datapusher to the plugins in your CKAN configuration file (generally located at /etc/ckan/default/production.ini or /etc/ckan/default/ckan.ini):

ckan.plugins = <other plugins> datapusher

In order to tell CKAN where this web service is located, the following must be added to the [app:main] section of your CKAN configuration file:

ckan.datapusher.url = http://127.0.0.1:8800/

Starting from CKAN 2.10, DataPusher requires a valid API token to operate (see the documentation on API tokens), and will fail to start if the following option is not set:

ckan.datapusher.api_token = <api_token>

There are other CKAN configuration options that allow you to customize the integration between CKAN and DataPusher. Please refer to the DataPusher Settings section in the CKAN documentation for more details.

DataPusher Configuration

The DataPusher instance is configured in the deployment/datapusher_settings.py file. Here's a summary of the options available.

Name (default): Description

HOST ('0.0.0.0'): Web server host
PORT (8800): Web server port
SQLALCHEMY_DATABASE_URI ('sqlite:////tmp/job_store.db'): SQLAlchemy database URL. See the note about the database backend below.
MAX_CONTENT_LENGTH ('1024000'): Maximum size, in bytes, of files to process
CHUNK_SIZE ('16384'): Chunk size when processing the data file
CHUNK_INSERT_ROWS ('250'): Number of records to send per request to the DataStore
DOWNLOAD_TIMEOUT ('30'): Timeout, in seconds, when downloading the data file
SSL_VERIFY (False): When False, SSL certificates are not validated when requesting the data file (warning: do not disable validation in production)
TYPES ([messytables.StringType, messytables.DecimalType, messytables.IntegerType, messytables.DateUtilType]): Messytables types used internally; can be modified to customize type guessing
TYPE_MAPPING ({'String': 'text', 'Integer': 'numeric', 'Decimal': 'numeric', 'DateUtil': 'timestamp'}): Mapping from internal Messytables types to DataStore column types
LOG_FILE ('/tmp/ckan_service.log'): Where to write the logs. Use an empty string to disable.
STDERR (True): Whether to log to stderr

Most of the configuration options above can also be provided as environment variables by prefixing the name with DATAPUSHER_, e.g. DATAPUSHER_SQLALCHEMY_DATABASE_URI, DATAPUSHER_PORT, etc. In the specific case of DATAPUSHER_STDERR, the possible values are 1 and 0.
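
For illustration only, a minimal sketch of this override convention (the setting() helper below is hypothetical, not part of DataPusher's code):

    import os

    def setting(name, default):
        # Hypothetical helper: return DATAPUSHER_<NAME> from the
        # environment, falling back to the given default.
        return os.environ.get('DATAPUSHER_' + name, default)

    PORT = int(setting('PORT', 8800))
    SQLALCHEMY_DATABASE_URI = setting('SQLALCHEMY_DATABASE_URI',
                                      'sqlite:////tmp/job_store.db')
    STDERR = setting('STDERR', '1') == '1'  # env values are 1 or 0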

By default, DataPusher uses SQLite as the database backend for job information. This is fine for local development and sites with low activity, but for sites that need more performance, Postgres should be used as the backend for the jobs database (e.g. SQLALCHEMY_DATABASE_URI=postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs; see also High Availability Setup). If SQLite is used, it's probably a good idea to store the database in a location other than /tmp. This prevents the database from being deleted, which would cause out-of-sync errors on the CKAN side. A good place to store it is the CKAN storage folder (if DataPusher is installed on the same server), generally /var/lib/ckan/.

Usage

DataPusher will attempt to load any file that has one of the supported formats (defined in ckan.datapusher.formats) into the DataStore.

You can also manually trigger resources to be resubmitted. When editing a resource in CKAN (clicking the "Manage" button on a resource page), a new tab named "DataStore" will appear. This will contain a log of the last attempted upload and a button to retry the upload.

[Image: DataPusher UI]
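
The same retry can also be triggered programmatically. Here is a hedged sketch using the CKAN API action datapusher_submit (the site URL, token and resource id are placeholders to adapt):

    import requests

    response = requests.post(
        'http://localhost:5000/api/3/action/datapusher_submit',
        json={'resource_id': 'RESOURCE-ID-HERE'},     # placeholder resource id
        headers={'Authorization': 'YOUR-API-TOKEN'},  # placeholder API token
    )
    response.raise_for_status()
    print(response.json().get('success'))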

Command line

Run the following command to submit all resources to the DataPusher. Note that it will skip files whose data-file hash has not changed:

ckan -c /etc/ckan/default/ckan.ini datapusher resubmit

On CKAN<=2.8:

paster --plugin=ckan datapusher resubmit -c /etc/ckan/default/ckan.ini

To resubmit a specific resource, whether or not the hash of the data file has changed:

ckan -c /etc/ckan/default/ckan.ini datapusher submit {dataset_id}

On CKAN<=2.8:

paster --plugin=ckan datapusher submit <pkgname> -c /etc/ckan/default/ckan.ini

License

This material is copyright (c) 2020 Open Knowledge Foundation and other contributors

It is open and licensed under the GNU Affero General Public License (AGPL) v3.0 whose full text may be found at:

http://www.fsf.org/licensing/licenses/agpl-3.0.html


datapusher's Issues

error while adding csv-files with duplicate column names

Currently, it is not possible to add CSV files with duplicate column names, since these are also used as table columns in Postgres.
In the web UI, the datapusher returns an "internal server error 500" at /api/3/action/datastore_create.

Wrong schema from CKAN

I have hundreds of those error emails

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/APScheduler-2.1.0-py2.7.egg/apscheduler/scheduler.py", line 510, in _run_job
   retval = job.func(*job.args, **job.kwargs)
 File "/var/www/datapusher/datapusher/datapusher/jobs.py", line 250, in push_to_datastore
   resource = get_resource(resource_id, ckan_url)
 File "/var/www/datapusher/datapusher/datapusher/jobs.py", line 208, in get_resource
   headers={'Content-type': 'application/json'})
[...]
   raise MissingSchema("Invalid URL %r: No schema supplied" % url)
MissingSchema: Invalid URL u'http:/api/3/action/resource_show': No schema supplied

Support generic data urls rather than just CKAN resources as input

At the moment it seems datapusher gets passed a resource id and then looks up the data URL.

That is, the metadata pushed to a job would be:

        'metadata': {
            'url': 'path-to-data-csv-or-xls',
            'resource_id': res_id,  # resource to push to (a bonus would be to have this auto-created)
            'set_url_type': data_dict.get('set_url_type', False)
        }

Why do this?

  • It seems to be simpler - no need to spend time looking up the resource URL from CKAN
  • More general - can support data not yet in CKAN

Add CloudFlare CDN notes in documentation

When using CloudFlare default settings, datapusher will start returning an HTTP 403 Forbidden error.

This is because of the Browser Integrity Check feature of CloudFlare.

To correct this, either turn off Browser Integrity Check for the whole CKAN instance, or add a page rule for the pattern below with Browser Integrity Check off.

*.YOURCKANDOMAINHERE/en/dataset/*     # e.g. *.ckan.org/en/dataset/*

This caused a lot of head-scratching on our end before we finally pinned it down, so we wanted to pass the info along.

Log events

Use the new logging from CKAN service provider

Size of backlog is opaque

It doesn't seem to be possible to see the current datapusher backlog, because there is no way to inspect its internal queue.

Fix documentation for running locally

The datapusher needs to be run locally with the JOB_CONFIG environment variable set, otherwise it fails with the nasty traceback below.

The correct way to run it is something like: JOB_CONFIG='/home/foo/datapusher/deployment/datapusher_settings.py' python wsgi.py

Exception in thread APScheduler:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/apscheduler/scheduler.py", line 581, in _main_loop
    next_wakeup_time = self._process_jobs(now)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/apscheduler/scheduler.py", line 560, in _process_jobs
    self._remove_job(job, alias, jobstore)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/apscheduler/scheduler.py", line 294, in _remove_job
    jobstore.remove_job(job)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/apscheduler/jobstores/sqlalchemy_store.py", line 64, in remove_job
    self.engine.execute(delete)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2447, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1449, in execute
    params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context
    context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context
    context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 331, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (OperationalError) no such table: apscheduler_jobs u'DELETE FROM apscheduler_jobs WHERE apscheduler_jobs.id = ?' (3,)

Add support for running locally

At the moment:

python wsgi.py

does nothing.

python datapusher/main.py

requires a config file (there are no docs describing what that config file should look like).

DataPusher should not deliberately maintain uppercase of field names

I've noticed that if you upload a dataset with uppercase field names (which is almost always the case when pulling data from Esri/Arc databases), they get added to the postgres database with uppercase names. By default, postgres converts all field names to lowercase unless you explicitly double quote the field names. If you do double quote them, then you always need to use double quotes to interact with them.

As a result, you cannot query or use the DataStore API as the documentation suggests:

?sql=SELECT * from "2c0d8231-e6b8-4598-8089-d1bcaf1bdaa6" WHERE ZIP_CODE = '19146' (example)
will fail with the error column "zip_code" does not exist (note the error includes a lowercase field name - postgres converts field names to lowercase unless they're double quoted)

You must wrap the field name in double quotes to query it, i.e.
?sql=SELECT * from "2c0d8231-e6b8-4598-8089-d1bcaf1bdaa6" WHERE "ZIP_CODE" = '19146' (example)

My hypothesis is that somewhere in the DataPusher extension, field names are being wrapped in double quotes, and my proposal is that they not be wrapped in double quotes. If there's some other reason they need to be (e.g. to support special characters), then the field names should be converted to lowercase first to provide the expected behaviour. Thoughts?
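
For illustration, a minimal sketch of this proposal; the headers shape mirrors the "Determined headers and types" logs elsewhere on this page, and this is not DataPusher's actual code:

    # Lowercase guessed field names before sending them to datastore_create,
    # so unquoted SQL queries work as users expect.
    headers = [{'id': 'ZIP_CODE', 'type': 'text'}]         # example input
    headers = [{'id': h['id'].lower(), 'type': h['type']}  # -> 'zip_code'
               for h in headers]
    print(headers)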

datapusher issue to upload a CSV into datastore

I have CKAN 2.3. I installed datapusher (latest stable) and it seems to work, but when I load a CSV file and click "Upload to DataStore", I get a "Process completed but unable to post to result_url" error.

in /var/log/apache2/datapusher.error.log:

[Wed Jun 17 09:03:15 2015] [error] Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
[Wed Jun 17 09:03:15 2015] [error] Traceback (most recent call last):
[Wed Jun 17 09:03:15 2015] [error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
[Wed Jun 17 09:03:15 2015] [error] retval = job.func(*job.args, **job.kwargs)
[Wed Jun 17 09:03:15 2015] [error] File "/root/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 226, in push_to_datastore
[Wed Jun 17 09:03:15 2015] [error] resource = get_resource(resource_id, ckan_url, api_key)
[Wed Jun 17 09:03:15 2015] [error] File "/root/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 180, in get_resource
[Wed Jun 17 09:03:15 2015] [error] check_response(r, url, 'CKAN')
[Wed Jun 17 09:03:15 2015] [error] File "/root/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 91, in check_response
[Wed Jun 17 09:03:15 2015] [error] resp=response.text[:200]))
[Wed Jun 17 09:03:15 2015] [error] JobError: CKAN bad response. Status code: 404 Not Found. At: http://127.0.0.1/api/3/action/resource_show. Response:
[Wed Jun 17 09:03:15 2015] [error] <title>404 Not Found</title> Not Found: The requested URL /api/3/action/resource_show was not found on this server.

Any ideas?

Strip column names

It is very easy to create column names that include spaces, in particular when hand-typing CSV data:

my field, my other field, a third field
value one, value two, value three

This will create fields named " my other field" and " a third field". Having field names that start/end with spaces causes a lot of complications, because some code strips the field names while other code doesn't. This is particularly true when using third-party libraries, and when parsing text representations such as the sort string "my field ASC, my other field DESC", etc.

See for instance: ckan/ckan#1970

Given that there is no practical use for having field names that start or end with spaces (these will be just as confusing to humans as they are to machines), I suggest the datapusher should strip field names. (And, accordingly, CKAN datastore API should refuse field names that start/end with spaces).

I will submit a PR for this.
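
A minimal sketch of the proposed stripping (illustrative only, not the actual PR):

    # Strip surrounding whitespace from parsed header names and drop any
    # header that is empty after stripping.
    headers = ['my field', ' my other field', ' a third field']  # example input
    headers = [h.strip() for h in headers if h.strip()]
    print(headers)  # ['my field', 'my other field', 'a third field']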

Missing ckan url

I'm getting many emails:

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/APScheduler-2.1.0-py2.7.egg/apscheduler/scheduler.py", line 510, in _run_job
   retval = job.func(*job.args, **job.kwargs)
 File "/var/www/datapusher/datapusher/datapusher/jobs.py", line 250, in push_to_datastore
   resource = get_resource(resource_id, ckan_url)
 File "/var/www/datapusher/datapusher/datapusher/jobs.py", line 208, in get_resource
   headers={'Content-type': 'application/json'})
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/requests-1.2.0-py2.7.egg/requests/api.py", line 88, in post
   return request('post', url, data=data, **kwargs)
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/requests-1.2.0-py2.7.egg/requests/api.py", line 44, in request
   return session.request(method=method, url=url, **kwargs)
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/requests-1.2.0-py2.7.egg/requests/sessions.py", line 354, in request
   resp = self.send(prep, **send_kwargs)
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/requests-1.2.0-py2.7.egg/requests/sessions.py", line 460, in send
   r = adapter.send(request, **kwargs)
 File "/var/www/datapusher/venv/lib/python2.7/site-packages/requests-1.2.0-py2.7.egg/requests/adapters.py", line 246, in send
   raise ConnectionError(e)
ConnectionError: HTTPConnectionPool(host='172.16.0.120', port=5000): Max retries exceeded with url: /api/3/action/resource_show (Caused by <class 'socket.error'>: [Errno 110] Connection timed out)

Just by looking at the last error, I assume the CKAN URL is missing, so the request URL is not complete. Maybe add an assertion for that.

Install fails

ckanserviceprovider is not on PyPI

You have a GitHub dep in requirements-dev but setup.py install_requires does not have it (I note it is in dependency_links, but at least for me that had no effect when I did pip install -e .).

BTW: any reason you don't use requirements.txt (instead of setup.py stuff ...)?

Is it possible to configure a proxy server for datapusher?

If I understand correctly, when adding a resource (either as a link or by uploading a file), datapusher will try to fetch the contents of the file and then add it to the datastore using the data API.

I wonder if there's some way to configure a proxy server for datapusher to use when fetching files. I already have a global configuration in /etc/environment, but Python doesn't seem to take it into account.

Decimal fields get truncated to int

I have geographic data stored in .xls format. The number fields all get truncated to int when this data is uploaded. Example data:

Lattitude, Logitude
-27.44314448,127.00122

This data gets truncated to -27 and 127. I tried changing the basic type mapping to double precision, but that does not seem to help. I'm using CKAN 2.2 on Ubuntu with Postgres in the development environment [paster, python].

Datapusher inferring column type incorrectly.

Received the following errors while pushing data via the API. The dataset in question has a column whose initial values are numeric (the first row's entry is 1) but whose later values are strings ("K"). It appears that the inference routine infers the type from the initial rows and therefore never sees the first "K" value, which doesn't show up until ~1000 rows in.

Is it possible to either a) generate the schema from all unique column values rather than the first/initial rows, or b) allow a user-defined schema to be passed in when adding a resource?

[Thu Oct 16 15:50:30 2014] [error] Job "push_to_datastore (trigger: RunTriggerNow, run = True,    next run at: None)" raised an exception
[Thu Oct 16 15:50:30 2014] [error] Traceback (most recent call last):
[Thu Oct 16 15:50:30 2014] [error]   File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
[Thu Oct 16 15:50:30 2014] [error]     retval = job.func(*job.args, **job.kwargs)
[Thu Oct 16 15:50:30 2014] [error]   File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 322, in push_to_datastore
[Thu Oct 16 15:50:30 2014] [error]     records, api_key, ckan_url)
[Thu Oct 16 15:50:30 2014] [error]   File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 150, in send_resource_to_datastore
[Thu Oct 16 15:50:30 2014] [error]     check_response(r, url, 'CKAN DataStore')
[Thu Oct 16 15:50:30 2014] [error]   File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 84, in check_response
[Thu Oct 16 15:50:30 2014] [error]     resp=pprint.pformat(json_response)))
[Thu Oct 16 15:50:30 2014] [error] JobError: CKAN DataStore bad response. Status code: 409 Conflict. At: http://68.169.49.221:8088/api/3/action/datastore_create. Response: {u'error': {u'__type':  u'Validation Error',
[Thu Oct 16 15:50:30 2014] [error]             u'data': u'(DataError) invalid input syntax for type numeric: "K"\\nLINE 2:             VALUES (\\'Andover\\', 901301080, 2008, \\'K\\', \\'Percen...\\n                                                        ^\\n',
[Thu Oct 16 15:50:30 2014] [error]             u'info': {u'orig': [u'invalid input syntax for type numeric: "K"\\nLINE 2:             VALUES (\\'Andover\\', 901301080, 2008, \\'K\\', \\'Percen...\\n                                                        ^\\n']}},
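
A hedged workaround for option b) above: create the DataStore table yourself with an explicit schema via the datastore_create API before pushing data, so type guessing is bypassed for that resource (the resource id, token and field names below are placeholders):

    import requests

    response = requests.post(
        'http://localhost:5000/api/3/action/datastore_create',
        json={
            'resource_id': 'RESOURCE-ID-HERE',            # placeholder
            'fields': [{'id': 'town', 'type': 'text'},
                       {'id': 'grade', 'type': 'text'}],  # force text, not numeric
            'force': True,
        },
        headers={'Authorization': 'YOUR-API-TOKEN'},      # placeholder
    )
    response.raise_for_status()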

do not strip header values of type datetime

A bit of an edge case, but thought I'd put it here anyway.

When an .xls file has a column header that is a properly formatted date, it is imported as datetime.datetime. Datapusher will then try to strip the value, causing the exception below:

Offending file: https://dl.dropboxusercontent.com/u/6526141/cap180201.xls (second sheet in the excel file)

Trace:

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/Users/dz/Envs/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/Users/dz/source/datapusher/datapusher/jobs.py", line 346, in push_to_datastore
    headers = [header.strip() for header in headers if header.strip()]
AttributeError: 'datetime.datetime' object has no attribute 'strip'
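
A minimal sketch of a defensive fix for this traceback, assuming it is acceptable to stringify non-string headers (illustrative, not the committed fix):

    import datetime

    headers = ['Name', datetime.datetime(2014, 2, 1), ' Total ']  # example input
    # Stringify each header cell before stripping, so datetime headers
    # no longer raise AttributeError.
    headers = [str(h).strip() for h in headers if str(h).strip()]
    print(headers)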

Error when uploading data

Hi all.
I get the following errors when uploading a file with DataPusher:

Invalid HTTP response: HTTP Error 403: Forbidden

or
[u' File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job\n retval = job.func(_job.args, *_job.kwargs)\n',
u' File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 238, in push_to_datastore\n response = urllib2.urlopen(request, timeout=DOWNLOAD_TIMEOUT)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen\n return _opener.open(url, data, timeout)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 400, in open\n response = self._open(req, data)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 418, in _open\n '_open', req)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain\n result = func(*args)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open\n return self.do_open(httplib.HTTPConnection, req)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open\n raise URLError(err)\n',
u"URLError(error(111, 'Connection refused'),)"]

or
[u' File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job\n retval = job.func(_job.args, *_job.kwargs)\n',
u' File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 238, in push_to_datastore\n response = urllib2.urlopen(request, timeout=DOWNLOAD_TIMEOUT)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen\n return _opener.open(url, data, timeout)\n'
u' File "/usr/lib/python2.7/urllib2.py", line 400, in open\n response = self._open(req, data)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 418, in _open\n '_open', req)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain\n result = func(*args)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open\n return self.do_open(httplib.HTTPConnection, req)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open\n raise URLError(err)\n',
u"URLError(gaierror(-3, 'Temporary failure in name resolution'),)"]

or

[u' File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job\n retval = job.func(_job.args, *_job.kwargs)\n',
u' File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 238, in push_to_datastore\n response = urllib2.urlopen(request, timeout=DOWNLOAD_TIMEOUT)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen\n return _opener.open(url, data, timeout)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 400, in open\n response = self._open(req, data)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 418, in _open\n '_open', req)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain\n result = func(*args)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open\n return self.do_open(httplib.HTTPSConnection, req)\n',
u' File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open\n raise URLError(err)\n', u"URLError(error(101, 'Network is unreachable'),)"]

It seems to me that the problem is that the server is running behind a proxy, and that the opener is not set up to pick up the proxy options from the system, i.e.:

import urllib2

proxy = urllib2.ProxyHandler()  # reads proxy settings from the environment
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

nosetests do not use virtualenv and fail with ImportError

With datapusher's virtualenv activated, this virtualenv's nosetests does not actually use the virtualenv.
Naturally it fails with:

======================================================================
ERROR: Failure: ImportError (No module named httpretty)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/loader.py", line 413, in loadTestsFromName
    addr.filename, addr.module)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/mnt/ckan/datapusher/src/datapusher/tests/test_unit.py", line 12, in <module>
    import httpretty
ImportError: No module named httpretty

Pip freeze disagrees - both httpretty and ckanserviceprovider are installed:

APScheduler==2.1.2
Flask==0.9
Flask-Admin==1.0.8
Flask-Login==0.2.11
Jinja2==2.7.3
MarkupSafe==0.23
Pygments==2.0.1
SQLAlchemy==0.7.8
Sphinx==1.2.3
Unidecode==0.04.16
WTForms==2.0.1
Werkzeug==0.9.6
argparse==1.2.1
chardet==2.1.1
ckanserviceprovider==0.0.1
datapusher==1.0
docutils==0.12
httpretty==0.6.2
itsdangerous==0.24
json-table-schema==0.1
lxml==3.4.0
messytables==0.14.1
nose==1.2.1
python-dateutil==1.5
python-magic==0.4.6
python-slugify==0.1.0
requests==2.4.3
six==1.8.0
sphinxcontrib-httpdomain==1.3.0
urllib3==1.7.1
wsgiref==0.1.2
xlrd==0.9.3

python -c "import httpretty" run inside datapusher's src/datapusher and src/datapusher/tests works fine (no importerror), and which nosetests confirms that nosetests is from the datapusher virtualenv.

Could a section on testing be included in the docs, specifying how to run nosetests using the virtualenv?

DataPusher fails when getting private CKAN resources

For uploaded files in CKAN that belong to private datasets, we must send the API key, otherwise we won't be able to get them.
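
For illustration, a hedged sketch of fetching a resource file with the API key attached; the URL and key are placeholders, and the Authorization header is the one CKAN expects:

    import requests

    resource_url = ('http://localhost:5000/dataset/DATASET-ID/resource/'
                    'RESOURCE-ID/download/file.csv')  # placeholder URL
    api_key = 'YOUR-API-KEY'                          # placeholder key

    response = requests.get(resource_url,
                            headers={'Authorization': api_key},
                            timeout=30)
    response.raise_for_status()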

Also the code that handles messytables parsing of the file could be improved, otherwise you get a nasty exception:

Fetching from: http://localhost:5000/dataset/097d4f94-3642-4f7a-a7ac-992bec6d5627/resource/9d00a104-1773-4422-81d8-f67a74b70f74/download/districtcenterpoints.csv
Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/home/adria/dev/pyenvs/ckan_datastore/lib/python2.7/site-packages/APScheduler-2.1.1-py2.7.egg/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/home/adria/dev/pyenvs/ckan_datastore/src/datapusher/datapusher/jobs.py", line 278, in push_to_datastore
    row_set = table_set.tables.pop()
IndexError: pop from empty list

Options to avoid creating stale DataStore content for external data resources

Because data resources added as a path to an external data source (i.e. a file that is not uploaded to the FileStore) are subject to update, the content of the DataStore can go stale. If services are built off the DataStore API they will be using out-of-date data, as will the previews.

We are considering turning off default/automatic DataPusher additions for data resources added as a link to an externally hosted file. Manual DataPusher additions would still be allowed.

We are also considering developing a mechanism to regularly monitor the currency of external data resources, to determine when a DataStore update via DataPusher is required.

Datastore upload error

Hi,

I'm using CKAN 2.2 and the latest DataPusher master.
I'm trying to push a CSV file with the following content:

name;id;title;description;image_url;state
Group_1;;Erster Gruppe;;https://pbs.twimg.com/media/BxkP0jHIYAA8iFK.jpg;deleted
Group_2;;Zweite Gruppe;;https://pbs.twimg.com/media/BxkP0jHIYAA8iFK.jpg;deleted
Group_3;;Dritte Gruppe;;https://pbs.twimg.com/media/BxkP0jHIYAA8iFK.jpg;active
Group_4;;Vierte Gruppe;;https://pbs.twimg.com/media/BxkP0jHIYAA8iFK.jpg;deleted
Group_5;;Funfte Gruppe;;https://pbs.twimg.com/media/BxkP0jHIYAA8iFK.jpg;active

In the datapusher tab I only receive the following error:

Traceback (most recent call last):
   File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
     retval = job.func(*job.args, **job.kwargs)
   File "/home/vagrant/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 305, in push_to_datastore
     delete_datastore_resource(resource_id, api_key, ckan_url)
   File "/home/vagrant/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 129, in delete_datastore_resource
     good_status=(201, 200, 404), ignore_no_success=True)
   File "/home/vagrant/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 78, in check_response
     if not ignore_no_success or json_response.get('success'):
 AttributeError: 'unicode' object has no attribute 'get'

Any ideas how to fix this?

Connection timed out when uploading using datapusher...

Hi, I get an error when trying to upload a file:

[error] /usr/lib/ckan/datapusher/lib/python2.7/site-packages/sqlalchemy/engine/default.py:585: SAWarning: Unicode type received non-unicode bind param value.
[error] processorskey
[error] Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
[error] Traceback (most recent call last):
[error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
[error] retval = job.func(*job.args, **job.kwargs)
[error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 222, in push_to_datastore
[error] resource = get_resource(resource_id, ckan_url, api_key)
[error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 178, in get_resource
[error] 'Authorization': api_key}
[error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/requests/api.py", line 99, in post
[error] return request('post', url, data=data, json=json, **kwargs)
[error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/requests/api.py", line 49, in request
[error] response = session.request(method=method, url=url, **kwargs)
[error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
[error] resp = self.send(prep, **send_kwargs)
[error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
[error] r = adapter.send(request, **kwargs)
[error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
[error] raise ConnectionError(err, request=request)
[error] ConnectionError: ('Connection aborted.', error(110, 'Connection timed out'))

Here is an extract of my configuration:
in /etc/apache2/sites-available/datapusher

<VirtualHost 0.0.0.0:8800>
WSGIScriptAlias / /etc/ckan/datapusher.wsgi

in /etc/apache2/ports.conf

NameVirtualHost *:8800
Listen 8800

in /etc/apache2/sites-available/ckan

WSGIScriptAlias /ckan /etc/ckan/default/apache.wsgi

in /etc/ckan/datapusher_settings.py

SQLALCHEMY_DATABASE_URI = 'sqlite:////tmp/job_store.db'
HOST = '0.0.0.0'
PORT = 8800

This file is not well documented. Is there something else to configure, like SQLALCHEMY_DATABASE_URI in order to use Postgres, for example?
I also tried to create another database for datapusher (SQLALCHEMY_DATABASE_URI = 'postgresql://user:pass@localhost/ckan_dp_default'), but it doesn't work any better.

in /etc/ckan/default/development.ini / production.ini

ckan.site_url = http://www.domain.com/ckan
ckan.plugins = stats text_preview recline_preview datastore pdf_preview resource_proxy datapusher spatial_metadata spatial_query geojson_preview
ckan.datapusher.url = http://0.0.0.0:8800/

The datastore is installed too.

I really don't understand why I get this error, and I haven't found anyone else reporting it.
Do you have any idea? If not, how could I debug this?

Thank you,
Regards,
AD

`[Errno 32] Broken pipe` in integration tests

The integration tests (which only run when CKAN is running) fail most of the time. CKAN shows the following error:

Traceback (most recent call last):
  File "/Users/dominik/.virtualenvs/okfn/lib/python2.7/site-packages/paste/httpserver.py", line 1068, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py", line 710, in finish
    self.wfile.close()
  File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 279, in close
    self.flush()
  File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe

Upload CSV / GeoJSON

Hi all!

I am using CKAN 2.3 and the latest DataPusher version. I am trying to upload a CSV and GeoJSON and I am getting some errors:

CSV (the first time it was OK, but the second time I get this error):

[Mon Mar 30 16:10:03 2015] [error] Determined headers and types: [{'type': u'numeric', 'id': u'CODI_CARRER'}, {'type': u'text', 'id': u'CODI_CARRER_INE'}, {'type': u'text', 'id': u'SIGLA'}, {'type': u'text', 'id': u'NOM_OFICIAL'}, {'type': u'text', 'id': u'NOM_CURT'}, {'type': u'text', 'id': u'NRE_MIN'}, {'type': u'text', 'id': u'NRE_MAX'}]
[Mon Mar 30 16:10:03 2015] [error] Saving chunk 0
[Mon Mar 30 16:10:03 2015] [error] Saving chunk 1
[Mon Mar 30 16:10:03 2015] [error] Saving chunk 2
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 3
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 4
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 5
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 6
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 7
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 8
[Mon Mar 30 16:10:04 2015] [error] Saving chunk 9
[Mon Mar 30 16:10:05 2015] [error] Saving chunk 10
[Mon Mar 30 16:10:05 2015] [error] Saving chunk 11
[Mon Mar 30 16:10:05 2015] [error] Saving chunk 12
[Mon Mar 30 16:10:05 2015] [error] Saving chunk 13
[Mon Mar 30 16:10:05 2015] [error] Saving chunk 14
[Mon Mar 30 16:10:05 2015] [error] Fetching from: http://ckan.cat:5000/dataset/097cb1d9-8227-4e70-b90b-0b7b3ba1a8cf/resource/f20f4931-3cee-414f-991d-00a95b0d6a2a/download/carrerer0opendata2.csv
[Mon Mar 30 16:10:05 2015] [error] Saving chunk 15
[Mon Mar 30 16:10:05 2015] [error] Deleting "f20f4931-3cee-414f-991d-00a95b0d6a2a" from datastore.
[Mon Mar 30 16:10:06 2015] [error] Saving chunk 16
[Mon Mar 30 16:10:06 2015] [error] Determined headers and types: [{'type': u'numeric', 'id': u'CODI_CARRER'}, {'type': u'text', 'id': u'CODI_CARRER_INE'}, {'type': u'text', 'id': u'SIGLA'}, {'type': u'text', 'id': u'NOM_OFICIAL'}, {'type': u'text', 'id': u'NOM_CURT'}, {'type': u'text', 'id': u'NRE_MIN'}, {'type': u'text', 'id': u'NRE_MAX'}]
[Mon Mar 30 16:10:06 2015] [error] Saving chunk 0
[Mon Mar 30 16:10:06 2015] [error] Saving chunk 17
[Mon Mar 30 16:10:06 2015] [error] Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
[Mon Mar 30 16:10:06 2015] [error] Traceback (most recent call last):
[Mon Mar 30 16:10:06 2015] [error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
[Mon Mar 30 16:10:06 2015] [error] retval = job.func(*job.args, **job.kwargs)
[Mon Mar 30 16:10:06 2015] [error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 321, in push_to_datastore
[Mon Mar 30 16:10:06 2015] [error] records, api_key, ckan_url)
[Mon Mar 30 16:10:06 2015] [error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 150, in send_resource_to_datastore
[Mon Mar 30 16:10:06 2015] [error] check_response(r, url, 'CKAN DataStore')
[Mon Mar 30 16:10:06 2015] [error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 84, in check_response
[Mon Mar 30 16:10:06 2015] [error] resp=pprint.pformat(json_response)))
[Mon Mar 30 16:10:06 2015] [error] JobError: CKAN DataStore bad response. Status code: 409 Conflict. At: http://localhost:5000/api/3/action/datastore_create. Response: {u'error': {u'__type': u'Validation Error',
[Mon Mar 30 16:10:06 2015] [error] u'constraints': [u'Cannot insert records or create index because of uniqueness constraint'],
[Mon Mar 30 16:10:06 2015] [error] u'info': {u'orig': u'duplicate key value violates unique constraint "pg_type_typname_nsp_index"\nDETAIL: Key (typname, typnamespace)=(f20f4931-3cee-414f-991d-00a95b0d6a2a__id_seq, 2200) already exists.\n',
[Mon Mar 30 16:10:06 2015] [error] u'pgcode': u'23505'}},
[Mon Mar 30 16:10:06 2015] [error] u'help': u'http://localhost:5000/api/3/action/help_show?name=datastore_create',
[Mon Mar 30 16:10:06 2015] [error] u'success': False}

GeoJSON (the error occurs every time):

[Mon Mar 30 15:56:11 2015] [error] Fetching from: http://ckan.cat:5000/dataset/06b02e6a-3a96-494f-8405-12e36ee5030d/resource/42460ac7-9dda-4b91-99f5-2ffbe9636d52/download/estacions.geojson
[Mon Mar 30 15:56:12 2015] [error] Deleting "42460ac7-9dda-4b91-99f5-2ffbe9636d52" from datastore.
[Mon Mar 30 15:56:12 2015] [error] Determined headers and types: [{'type': u'text', 'id': u'{ "type": "Feature"'}, {'type': u'text', 'id': u' "properties": { "LINIA": "LINIES_CONVENCIONALS"'}, {'type': u'text', 'id': u' "ESTAT": "Estat actual"'}, {'type': u'text', 'id': u' "INTERCANVIADOR": null'}, {'type': u'text', 'id': u' "NOM_ESTACIO": "Vimbod\xed"'}, {'type': u'text', 'id': u' "XARXA": "ADIF" }'}, {'type': u'text', 'id': u' "geometry": { "type": "Point"'}, {'type': u'text', 'id': u' "coordinates": [ 336671.35572722001'}, {'type': u'numeric', 'id': u' 4585317.9943092503'}, {'type': u'text', 'id': u' 0.0 ] } }'}]
[Mon Mar 30 15:56:12 2015] [error] Saving chunk 0
[Mon Mar 30 15:56:12 2015] [error] Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
[Mon Mar 30 15:56:12 2015] [error] Traceback (most recent call last):
[Mon Mar 30 15:56:12 2015] [error] File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
[Mon Mar 30 15:56:12 2015] [error] retval = job.func(*job.args, **job.kwargs)
[Mon Mar 30 15:56:12 2015] [error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 321, in push_to_datastore
[Mon Mar 30 15:56:12 2015] [error] records, api_key, ckan_url)
[Mon Mar 30 15:56:12 2015] [error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 150, in send_resource_to_datastore
[Mon Mar 30 15:56:12 2015] [error] check_response(r, url, 'CKAN DataStore')
[Mon Mar 30 15:56:12 2015] [error] File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 84, in check_response
[Mon Mar 30 15:56:12 2015] [error] resp=pprint.pformat(json_response)))
[Mon Mar 30 15:56:12 2015] [error] JobError: CKAN DataStore bad response. Status code: 409 Conflict. At: http://localhost:5000/api/3/action/datastore_create. Response: {u'error': {u'__type': u'Validation Error',
[Mon Mar 30 15:56:12 2015] [error] u'fields': [u'"{ "type": "Feature"" is not a valid field name']},
[Mon Mar 30 15:56:12 2015] [error] u'help': u'http://localhost:5000/api/3/action/help_show?name=datastore_create',
[Mon Mar 30 15:56:12 2015] [error] u'success': False}
[Mon Mar 30 16:08:47 2015] [error] Fetching from: http://localhost:5000/dataset/097cb1d9-8227-4e70-b90b-0b7b3ba1a8cf/resource/a66548ff-681f-448b-9387-6157ed9a27b7/download/carrerer0opendata.csv
[Mon Mar 30 16:08:47 2015] [error] Deleting "a66548ff-681f-448b-9387-6157ed9a27b7" from datastore.
[Mon Mar 30 16:08:47 2015] [error] Determined headers and types: [{'type': u'numeric', 'id': u'CODI_CARRER'}, {'type': u'text', 'id': u'CODI_CARRER_INE'}, {'type': u'text', 'id': u'SIGLA'}, {'type': u'text', 'id': u'NOM_OFICIAL'}, {'type': u'text', 'id': u'NOM_CURT'}, {'type': u'text', 'id': u'NRE_MIN'}, {'type': u'text', 'id': u'NRE_MAX'}]

What can I do to solve these errors?

Support private datasets

CKAN supports private datasets that cannot be downloaded easily. There has to be a way to import them into the datastore, though.

Explain how datapusher works and add API documentation

Need to add the following to the documentation:

  • How does the datapusher push to CKAN - is it via a direct DB connection or via the CKAN DataStore API?
  • Does datapusher expose its own API so I can POST a file, or is it tightly integrated with CKAN? If it does have an API, where is that documented?
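
For reference, a hedged sketch of posting a job directly to DataPusher's /job endpoint. The payload shape is inferred from the job metadata shown in the issues above, so treat the field names as assumptions to verify against the code:

    import requests

    requests.post(
        'http://localhost:8800/job',
        json={
            'api_key': 'YOUR-CKAN-API-KEY',  # assumed field, placeholder value
            'job_type': 'push_to_datastore',
            'result_url': 'http://localhost:5000/api/3/action/datapusher_hook',
            'metadata': {
                'ckan_url': 'http://localhost:5000/',
                'resource_id': 'RESOURCE-ID-HERE',
            },
        },
    ).raise_for_status()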

Automatically import from file upload/update

I'm testing out CKAN 2.2 with the DataPusher, instead of the old DataStorer.

The old DataStorer ran a cronjob every X hours to check for updates to the DataStore.

It seems as if the DataPusher does not do this. Is there a way to have the DataPusher check resources every X hours for any updates?

Incomplete Read exception

I loaded a large (28 MB) file and got an exception after several minutes:

--------------------------------------------------------------------------------
[pid: 20171|app: 0|req: 5/5] 172.18.100.106 () {42 vars in 705 bytes} [Wed Feb  5 17:29:30 2014] GET /job/d0394bb1-1b72-4c3a-8c4c-7a7bc37ca784 => generated 446 bytes in 16 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 20171|app: 0|req: 6/6] 172.18.100.106 () {42 vars in 705 bytes} [Wed Feb  5 17:29:35 2014] GET /job/d0394bb1-1b72-4c3a-8c4c-7a7bc37ca784 => generated 735 bytes in 19 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 20171|app: 0|req: 7/7] 172.18.100.106 () {40 vars in 578 bytes} [Wed Feb  5 17:44:58 2014] POST /job => generated 497 bytes in 47 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
[pid: 20171|app: 0|req: 8/8] 172.18.100.106 () {40 vars in 578 bytes} [Wed Feb  5 17:45:04 2014] POST /job => generated 497 bytes in 26 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
Wed Feb  5 17:53:17 2014 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /job/d0394bb1-1b72-4c3a-8c4c-7a7bc37ca784 (ip 172.18.100.106) !!!
--------------------------------------------------------------------------------
ERROR in scheduler [/var/www/ckan/devenv/lib/python2.6/site-packages/apscheduler/scheduler.py:520]:
Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/www/ckan/devenv/lib/python2.6/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/var/www/ckan/devenv/src/datapusher/datapusher/jobs.py", line 261, in push_to_datastore
    f = cStringIO.StringIO(response.read())
  File "/usr/lib64/python2.6/socket.py", line 354, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.6/httplib.py", line 522, in read
    return self._read_chunked(amt)
  File "/usr/lib64/python2.6/httplib.py", line 571, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib64/python2.6/httplib.py", line 621, in _safe_read
    raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(7483 bytes read, 709 more expected)

All timeouts on nginx are set to half an hour, and max_file_size is set to 1G in both datapusher and CKAN.

Set URL type and webstore_last_updated upon (first) push

I've noticed that when the datapusher pushes data to the datastore for the first time, it doesn't set webstore_last_updated or even the url_type as datastore on the resource. The latter problem(?) seems to come from the "set_url_type" setting, which means update_resource doesn't get called*. The bigger problem is that even though the comments in the update_resource method state that webstore_last_updated will be updated**, we then rely on the CKAN controller, which does nothing with that field (one could also wonder why it should)***. I find it logical that both fields, url_type and webstore_last_updated, should get updated, even on the first push to the datastore. Thoughts?

*https://github.com/ckan/datapusher/blob/master/datapusher/jobs.py#L392:
I assume the logic is that if the resource is /already/ datastore, we are not setting the url_type field, and therefore this is an update, which also seems a bit fragile... but I haven't looked at the datapusher plugin code to confirm
** https://github.com/ckan/datapusher/blob/master/datapusher/jobs.py#L208
*** https://github.com/ckan/ckan/blob/9ab53fd540869570275547c9aa2741d013a8ae97/ckan/logic/action/update.py#L105

Generic HTTPError in push_to_datastore

At the national Bulgarian data portal we are getting this:

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/ckan/virtualenv/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 387, in push_to_datastore
    records, api_key, ckan_url)
  File "/ckan/src/datapusher/datapusher/jobs.py", line 203, in send_resource_to_datastore
    check_response(r, url, 'CKAN DataStore')
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 137, in check_response
    request_url=request_url, response=response.text)
HTTPError

I can't find any other related issues. Is this a known bug?

Day of week (& other date fields) recast as timestamps

Our client is uploading CSV data to CKAN which includes a column containing a day of the week.

Datapusher (via messytables) appears to decide that this is a date column and converts it to a timestamp. "Wednesday" appears to be interpreted as "this Wednesday" and is replaced by a timestamp of the form 2015-06-03T00:00:00, which is useless to anyone reading the data, and frequently misleading.

Since the messytables code seemed to say that DateType includes format detection, I tried altering the configuration to use that instead of the DateUtilType that Datapusher uses normally, by adding:

TYPES = [messytables.StringType, messytables.DecimalType, messytables.IntegerType, messytables.DateType, messytables.BoolType]
TYPE_MAPPING = {
    'String': 'text',
    'Integer': 'numeric',
    'Decimal': 'numeric',
    'Date': 'timestamp'
}

into datapusher_settings.py. But it made no difference.

Is there any way of ensuring that dates get recognised as dates (and don't get given a spurious 00:00:00 timestamp), and days get left as days (strings)?
