BNI – Bibliografia Nazionale Italiana

Download Unimarc XML files from BNIweb and convert them to Parquet with DuckDB.

The Parquet dump is available at https://atomotic.github.io/bni/bni.parquet (about 70 MB) and can be queried with DuckDB Shell.

Steps

The following steps are also available as recipes in the Justfile.

Scrape all XML URLs (tools needed: pup and sd)

curl -s "http://bni.bncf.firenze.sbn.it/bniweb/menu.jsp" \
    | pup 'a attr{href}' \
    | grep elenco_fasc \
    | sd "&amp;" "&" \
    | sd "elenco_fasc" "scaricaxml" \
    > links.txt
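
To make the two substitutions concrete, here is a minimal sketch that applies the same rewrites to a single sample href. It uses sed as a stand-in for sd and assumes the scraped attribute still carries the HTML entity for the ampersand. The helper name rewrite_link, the sample href, and its query parameters are hypothetical; the real BNIweb links may look different.

```shell
# rewrite_link is a hypothetical helper: a sed stand-in for the two sd calls.
# It decodes the escaped ampersand and swaps the listing page for the download page.
rewrite_link() {
    sed -e 's/&amp;/\&/g' -e 's/elenco_fasc/scaricaxml/g'
}

# Hypothetical href; the parameter names are assumptions for illustration only.
printf '%s\n' 'elenco_fasc.jsp?anno=2015&amp;fasc=03' | rewrite_link
```

Each line written to links.txt is then a relative URL that the download step appends to the BNIweb base address.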

Download all XML files (tools needed: wcurl and GNU parallel)

parallel wcurl --curl-options="--remote-header-name" "http://bni.bncf.firenze.sbn.it/bniweb/{}" :::: links.txt
mkdir -p xml
mv *.xml xml/
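
Before loading, it can help to confirm that every link produced a file. The sketch below compares the number of non-empty lines in the link list with the number of XML files in the target directory. check_downloads is a hypothetical helper, not part of the Justfile, and it assumes each link yields exactly one distinctly named XML file.

```shell
# check_downloads is a hypothetical helper, not part of the Justfile.
# It assumes one downloaded XML file per link.
check_downloads() {
    links_file=$1
    xml_dir=$2
    # grep -c . counts non-empty lines; "|| true" keeps a zero count from aborting under set -e
    expected=$(grep -c . "$links_file" || true)
    # find avoids a glob error when the directory holds no XML files yet
    actual=$(find "$xml_dir" -maxdepth 1 -type f -name '*.xml' | wc -l | tr -d ' ')
    if [ "$expected" -eq "$actual" ]; then
        echo "ok: $actual files"
    else
        echo "mismatch: $expected links, $actual files"
        return 1
    fi
}
```

With the layout used in this README, it would be called as check_downloads links.txt xml.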

Load all XML files into DuckDB (tools needed: Go and GNU parallel)

go build
parallel -j1 ./bni {} ::: xml/*.xml

Export from DuckDB to Parquet

duckdb bni.ddb "copy bni to bni.parquet (format parquet);"

Size comparison

du -h bni.ddb bni.parquet
1.2G    bni.ddb
67M     bni.parquet

Example query

duckdb

The schema: the data column contains the full Unimarc record converted to JSON.

D DESCRIBE SELECT * FROM 'https://atomotic.github.io/bni/bni.parquet';
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ VARCHAR     │ YES     │         │         │         │
│ isbn        │ VARCHAR     │ YES     │         │         │         │
│ title       │ VARCHAR     │ YES     │         │         │         │
│ data        │ VARCHAR     │ YES     │         │         │         │
│ source      │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
D .mode line
D SELECT id,title,isbn,source FROM 'https://atomotic.github.io/bni/bni.parquet' WHERE title LIKE '%biblioteco%' LIMIT 5;

    id = USM1959877
 title = Biblioteche e biblioteconomia
  isbn = 9788843075294
source = xml/Monografie201503.xml

    id = PAV0095007
 title = I fondamenti della biblioteconomia
  isbn = 9788870758474
source = xml/Monografie201601.xml

    id = SBT0014568
 title = Conferimento della laurea magistrale ad honorem in scienze archivistiche e biblioteconomiche a Michele Casalini
  isbn = 9788864538822
source = xml/Monografie201904.xml

    id = MOD1738924
 title = Guida alla biblioteconomia moderna
  isbn = 9788893574013
source = xml/Monografie202204.xml

    id = SBT0045209
 title = Principi, approcci e applicazioni della biblioteconomia comparata
  isbn = 9788855186063
source = xml/Monografie202301.xml
