DCAT Catalog Downloader

TL;DR

Download and merge DCAT RDF data from a DCAT catalog endpoint to query it locally.

Description

The ckanext-dcat extension implements a DCAT metadata export for CKAN. Among other things, it provides a catalog endpoint for downloading a CKAN instance's dataset metadata as RDF. The output of this endpoint is paged, and there is no way to download the complete metadata dump at once. In addition, neither CKAN nor ckanext-dcat offers a SPARQL endpoint for querying the metadata online.

The DCAT Catalog Downloader makes it easier to query the DCAT data of a CKAN instance by paging through the catalog endpoint, downloading each page and merging the pages into a single file that can be queried locally.
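
The basic idea is a simple loop over the page numbers. The following is a minimal sketch (not the actual download_catalog implementation), assuming the endpoint accepts a ?page=N query parameter and that the number of pages is already known; the real tool reads it from the first catalog page, as shown in the log output further down.

import pathlib
import urllib.request

ENDPOINT = "https://datenregister.berlin.de/catalog.ttl"
OUTPUT = pathlib.Path("data/temp/parts")
LAST_PAGE = 33  # the real tool derives this from the first page; hard-coded here for the sketch

OUTPUT.mkdir(parents=True, exist_ok=True)
for page in range(1, LAST_PAGE + 1):
    url = f"{ENDPOINT}?page={page}"
    with urllib.request.urlopen(url) as response:
        # save each page as its own Turtle file; a separate merge step combines them later
        (OUTPUT / f"catalog.ttl?page={page}").write_bytes(response.read())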

The repository contains:

  • a Python-based command line tool download_catalog that pages through the DCAT catalog endpoint to download the complete catalog
  • a Makefile to orchestrate downloading and merging the data
  • a handful of SPARQL queries that you can run on the downloaded data

Requirements

  • Python 3 for the downloader
  • To query RDF data locally, you need an RDF framework with SPARQL capabilities. The excellent rdflib library, which is installed as a dependency of the downloader, lets you run SPARQL queries on local data files, so you could just use that (see the sketch after this list). I prefer the Java-based Apache Jena, because it is very fast and comes with a range of useful command line tools. Some package managers offer packaged versions of Jena; e.g., on macOS you can install it with Homebrew: brew install jena.
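
As a minimal sketch of the rdflib route (assuming you have already produced the merged data/output/catalog.ttl as described below), counting the datasets in the catalog could look like this:

from rdflib import Graph

graph = Graph()
graph.parse("data/output/catalog.ttl", format="turtle")

# count all dcat:Dataset instances in the merged catalog
query = """
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    SELECT (COUNT(?dataset) AS ?count) WHERE { ?dataset a dcat:Dataset . }
"""
for row in graph.query(query):
    print(row[0])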

Installing the Downloader

  • Clone the repository:
$ git clone https://github.com/berlinonline/dcat-catalog-downloader
  • Create and activate a Python virtual environment:
$ cd dcat-catalog-downloader
$ python -m venv venv
$ . venv/bin/activate
  • Install the required Python libraries:
(venv) $ pip install -r requirements.txt
  • Install the command line tool:
(venv) $ python setup.py develop

Running the Downloader

You now have the downloader available as a command line tool.

(venv) $ download_catalog --help                                                             
usage: download_catalog [-h] --endpoint ENDPOINT [--output OUTPUT]

Page through the DCAT-catalog endpoint of CKAN instance and save all pages into an output folder.

options:
  -h, --help           show this help message and exit
  --endpoint ENDPOINT  endpoint URL of the CKAN's DCAT catalog, e.g. https://datenregister.berlin.de/catalog.ttl
  --output OUTPUT      path to the output folder containing the catalog pages. Default is data/temp.

You can either use the downloader directly or use the Makefile to create a folder structure, download the data and merge it into a single file at a specific location.
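
To run the downloader directly, pass the catalog endpoint (and, optionally, an output folder):

(venv) $ download_catalog --endpoint https://datenregister.berlin.de/catalog.ttl --output data/temp/parts

The Makefile route looks like this: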

(venv) $ make data/output/catalog.ttl                                                                
creating data/temp directory ...
creating data/temp/parts ...
downloading all catalog pages to data/temp/parts ...
INFO:root:Reading first catalog page from https://datenregister.berlin.de/catalog.ttl ...
INFO:root:Last page is 33 ...
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=2 to data/temp/parts/catalog.ttl?page=2
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=3 to data/temp/parts/catalog.ttl?page=3
...
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=32 to data/temp/parts/catalog.ttl?page=32
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=33 to data/temp/parts/catalog.ttl?page=33
creating data/output directory ...
merging all catalog pages in data/temp/parts to data/output/catalog.ttl ...

The Makefile downloads from https://datenregister.berlin.de. If you want to download from a different CKAN instance, you need to change the ckan_base variable at the top of the file.

Querying the Data

How you run a query depends on which SPARQL engine you have installed. If you have Jena, you can use the sparql command line tool. The repository includes a few queries in the queries folder that you can use as they are, or you can of course write your own.

$ sparql --data data/output/catalog.ttl --query queries/distinct_authors.rq --results=csv
publisher
 Bezirksamt Friedrichshain-Kreuzberg von Berlin - Vermessung
Abt. III C - Freiraumplanung und Stadtgrün
Abteilung II - Integrativer Umweltschutz
Amt für Statistik Berlin-Brandenburg
Berlin.de
BerlinOnline Stadtportal GmbH & Co KG
BerlinOnline Stadtportal GmbH & Co. KG
Berliner Feuerwehr
Berliner Forsten
Berliner Stadtreinigung (BSR)
Berliner Umweltportal
...
Statistisches Landesamt Bremen
Stromnetz Berlin GmbH
Tegel Projekt GmbH
Tegel Projekt GmbH - Urban Data & Plattform
VBB - Verkehrsverbund Berlin-Brandenburg GmbH
Verkehrslenkung Berlin
WISTA Management GmbH
openstreetmap.org
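
If you do not have Jena installed, the same query file can also be run with rdflib. A minimal sketch, assuming the merged catalog and the query file exist at the paths used above:

from rdflib import Graph

graph = Graph()
graph.parse("data/output/catalog.ttl", format="turtle")

with open("queries/distinct_authors.rq") as query_file:
    query = query_file.read()

for row in graph.query(query):
    # print the first (and, judging by the CSV output above, only) variable of each result row
    print(row[0])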

License

This material is copyright © BerlinOnline Stadtportal GmbH & Co. KG.

All software in this repository is published under the MIT License.
