A Scrapy pipeline module to persist items to a Postgres table automatically.

Here's an example showing an automatic item pipeline with a custom JSONB field:
```python
# settings.py
from sqlalchemy.dialects.postgresql import JSONB

ITEM_PIPELINES = {
    'pgpipeline.PgPipeline': 300,
}

PG_PIPELINE = {
    'connection': 'postgresql://localhost:5432/scrapy_db',
    'table_name': 'demo_items',
    'pkey': 'item_id',
    'ignore_identical': ['item_id', 'job_id'],
    'types': {
        'some_data': JSONB
    },
    'onconflict': 'upsert'
}
```
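With this configuration, every item a spider yields is written to the `demo_items` table. A minimal sketch of what such an item might look like (the callback and values are hypothetical; the field names simply mirror the config above, and Scrapy accepts plain dicts as items):

```python
def parse(response_url):
    """Hypothetical parse callback; Scrapy accepts plain dicts as items.

    Field names mirror the PG_PIPELINE config above: 'item_id' is the
    primary key, and 'some_data' is stored as JSONB.
    """
    yield {
        'item_id': response_url,        # used as 'pkey'
        'job_id': 'run-42',             # part of 'ignore_identical'
        'some_data': {'nested': True},  # typed as JSONB via 'types'
    }

item = next(parse('https://example.com/page/1'))
print(item['item_id'])
```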
All columns, tables, and indices are automatically created.
`pkey`
: a primary key for this item (other than the database-generated `id`)

`ignore_identical`
: a set of fields by which we identify duplicates and skip the insert

`types`
: keys specified here will use the type given; otherwise types are guessed

`onconflict`
: `upsert`|`ignore`|`non-null`. `ignore` will skip inserting on conflict and `upsert` will update. `non-null` will upsert only values that are not `None`, and thus avoid removing existing values.
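The `non-null` semantics can be illustrated with a small sketch. This is not the pipeline's actual code, just the merge behavior it describes, assuming a conflicting row already exists:

```python
def non_null_merge(existing, incoming):
    """Sketch of 'non-null' conflict handling: overwrite a column only
    when the incoming value is not None, so existing values survive."""
    merged = dict(existing)
    for column, value in incoming.items():
        if value is not None:
            merged[column] = value
    return merged

row = {'item_id': 1, 'title': 'old title', 'price': 10}
update = {'item_id': 1, 'title': 'new title', 'price': None}
print(non_null_merge(row, update))  # 'price' keeps its existing value
```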
Set up a development environment:

```shell
$ pip install -r requirements.txt
```
- Dependencies: list them in `requirements.txt`
- Dependencies: also list them in `setup.py` under `install_requires`:

```python
install_requires=['peppercorn'],
```
Then:

```shell
$ make dist && make release
```
Fork, implement, add tests, open a pull request, and get my everlasting thanks and a respectable place here :).
To all Contributors - you make this happen, thanks!
Copyright (c) 2017 Dotan Nahum @jondot. See LICENSE for further details.