openlibrary-search's Introduction

Open Library database

Open Library is an online library of bibliographic data. The library publishes full data dumps of all authors, works, and editions.

This project provides instructions and scripts for importing this data into a PostgreSQL database, and some sample queries to test the database.

The database is primarily aimed at querying by ISBN and includes tables specifically for these identifiers. It could be extended to use other identifiers, such as the Open Library ID, or to query by title or author.

Getting started

The following steps should get you up and running with a working database.

  1. Install the required prerequisites so that you have a database server.
  2. Download the data from Open Library.
  3. Run the data-processing scripts to make the dumps easier to import.
  4. Import the data into the database.

Prerequisites

  • Python 3 - Tested with 3.10
  • PostgreSQL - Version 15 is tested but all recent versions should work
  • Disk space - The data files are large, and the uncompressed editions file is 45GB. You will need at least 250GB of free space to import all the data.

Downloading the data

Open Library offers bulk downloads on its website, available from the data dumps page.

These are updated every month. The downloads available include (with compressed size):

  • Editions (~9GB)
  • Works (~2.5GB)
  • Authors (~0.5GB)
  • All types (~10GB)

Download the Editions, Works, and Authors data dumps.

wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz -P ~/downloads

Move the downloaded dumps from your downloads folder into the project's data/unprocessed folder.

mv ~/downloads/ol_dump_authors_*txt.gz ./data/unprocessed/ol_dump_authors.txt.gz
mv ~/downloads/ol_dump_works_*txt.gz ./data/unprocessed/ol_dump_works.txt.gz
mv ~/downloads/ol_dump_editions_*txt.gz ./data/unprocessed/ol_dump_editions.txt.gz

Then uncompress the data files.

gzip -d -c data/unprocessed/ol_dump_editions.txt.gz > data/unprocessed/ol_dump_editions.txt
gzip -d -c data/unprocessed/ol_dump_works.txt.gz > data/unprocessed/ol_dump_works.txt
gzip -d -c data/unprocessed/ol_dump_authors.txt.gz > data/unprocessed/ol_dump_authors.txt

Processing the data

Unfortunately the downloads don't play nicely with direct import into PostgreSQL: the dump files error on import because the number of columns varies from row to row. Cleaning them up by hand is impractical, as the uncompressed editions file alone is 45GB.

Note: check whether this is still the case; if so, standard Linux tools might be able to do the clean-up - maybe try sed and awk.

This can be tackled with a Python script. The file openlibrary_data_process.py reads the text file and writes each row back out, but only where the row has exactly 5 columns.

python openlibrary_data_process.py
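
For reference, the core of the script is roughly as follows. This is a minimal sketch rather than the exact contents of openlibrary_data_process.py, the file paths are illustrative, and it assumes the dump rows are tab-separated with the JSON record as the last column:

import csv
import sys

# The JSON column can be huge, so raise the CSV field size limit. Taking the
# minimum with C's LONG_MAX avoids an OverflowError on Windows, where
# sys.maxsize does not fit in a C long.
csv.field_size_limit(min(sys.maxsize, 2147483647))

def filter_rows(in_path, out_path):
    # Copy across only the rows that have exactly 5 tab-separated columns.
    with open(in_path, newline='', encoding='utf-8') as infile, \
         open(out_path, 'w', newline='', encoding='utf-8') as outfile:
        # quotechar '|' matches the QUOTE '|' used by the COPY commands later on.
        writer = csv.writer(outfile, delimiter='\t', quotechar='|')
        for row in csv.reader(infile, delimiter='\t'):
            if len(row) == 5:
                writer.writerow(row)

filter_rows('data/unprocessed/ol_dump_works.txt',
            'data/processed/ol_dump_works_processed.csv')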

Because the files are huge and only going to grow (editions is now 45GB+), you can use openlibrary_data_process_chunked.py to split the data into smaller files that are loaded sequentially. You can change the number of lines in each chunk; the default is 2 million.

Once the files are split, you can delete the three .txt files in the data/unprocessed folder: you will need around 250GB of free space to load all three files into the database without running out of space. If you have plenty of space, you can keep them!

python openlibrary_data_process_chunked.py

This generates multiple files in the data/processed directory.

One of those files is used to reference the rest when loading the data.
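
The chunked script follows the same pattern but starts a new output file every couple of million rows. Roughly, and again only as a sketch with illustrative file names:

import csv
import sys

csv.field_size_limit(min(sys.maxsize, 2147483647))

def split_into_chunks(in_path, out_pattern, lines_per_chunk=2_000_000):
    chunk, written, outfile, writer = 0, 0, None, None
    with open(in_path, newline='', encoding='utf-8') as infile:
        for row in csv.reader(infile, delimiter='\t'):
            if len(row) != 5:
                continue
            if writer is None or written >= lines_per_chunk:
                if outfile:
                    outfile.close()
                chunk += 1
                written = 0
                outfile = open(out_pattern.format(chunk), 'w',
                               newline='', encoding='utf-8')
                writer = csv.writer(outfile, delimiter='\t', quotechar='|')
            writer.writerow(row)
            written += 1
    if outfile:
        outfile.close()

# The real script also records which chunk files it wrote, so the load step
# knows what to import; the output pattern here is made up for the example.
split_into_chunks('data/unprocessed/ol_dump_editions.txt',
                  'data/processed/ol_dump_editions_{:03d}.csv')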

Import into database

It is then possible to import the data directly into PostgreSQL tables and do complex searches with SQL.

There are a series of database scripts which will create the database and tables, and then import the data. These are in the database folder. The data files (created in the previous process) need to be within the data/processed folder for this to work.

The PostgreSQL database command line tool psql is used to run the scripts. The following command will create the database and tables:

psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres
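
The data import itself is handled by the same set of scripts. Should you need to load a single processed file by hand, a client-side \copy from psql along these lines should work; it mirrors the COPY command that appears in the issues below, and the file name is only an example:

\copy editions FROM 'data/processed/ol_dump_editions_processed.csv' DELIMITER E'\t' QUOTE '|' CSV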

Database details

The database is split into five main tables:

Table          Description
Authors        The individuals who write the works
Works          The works created by the authors, with titles and subtitles
Author Works   A linking table between works and authors
Editions       The particular editions of the works, including ISBNs
Edition ISBNs  The ISBNs for the editions
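
The authoritative definitions are in openlibrary-db.sql. Inferred from the sample query below, the shape of the schema is roughly as follows; this is a sketch only, showing just the columns the examples use, and the real types and constraints may differ:

-- Sketch of the schema implied by the sample queries; see openlibrary-db.sql
-- for the real definitions.
CREATE TABLE authors (
    key  text,   -- Open Library author key
    data jsonb   -- full author record from the dump
);

CREATE TABLE works (
    key  text,   -- e.g. '/works/OL20655218W'
    data jsonb
);

CREATE TABLE author_works (
    author_key text,  -- matches authors.key
    work_key   text   -- matches works.key
);

CREATE TABLE editions (
    key      text,    -- e.g. '/books/OL10001035M'
    work_key text,    -- matches works.key
    data     jsonb
);

CREATE TABLE edition_isbns (
    edition_key text,  -- matches editions.key
    isbn        text   -- ISBN-10 or ISBN-13
);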

Query the data

That's the database set up - it can now be queried using SQL.

Get details for a single item using the ISBN13 9781551922461 (Harry Potter and the Prisoner of Azkaban):

select
    e.data->>'title' "EditionTitle",
    w.data->>'title' "WorkTitle",
    a.data->>'name' "Name",
    e.data->>'subtitle' "EditionSubtitle",
    w.data->>'subtitle' "WorkSubtitle",
    e.data->>'subjects' "Subjects",
    e.data->'description'->>'value' "EditionDescription",
    w.data->'description'->>'value' "WorkDescription",
    e.data->'notes'->>'value' "EditionNotes",
    w.data->'notes'->>'value' "WorkNotes"
from editions e
join edition_isbns ei
    on ei.edition_key = e.key
join works w
    on w.key = e.work_key
join author_works a_w
    on a_w.work_key = w.key
join authors a
    on a_w.author_key = a.key
where ei.isbn = '9781551922461';
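
The same joins work in the other direction too. For example, to list every ISBN recorded against editions of a work with a given title:

select
    w.data->>'title' "WorkTitle",
    e.data->>'title' "EditionTitle",
    ei.isbn "ISBN"
from works w
join editions e
    on e.work_key = w.key
join edition_isbns ei
    on ei.edition_key = e.key
where w.data->>'title' = 'Harry Potter and the Prisoner of Azkaban';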

openlibrary-search's People

Contributors

chloe-meinshausen, coreygirard, danmichaelo, davebathnes


openlibrary-search's Issues

Issue running openlibrary-data-process.py

Using Python 3.10.8 on Windows 10.

Had to remove csv.field_size_limit(sys.maxsize) from the original script because it returned:
OverflowError: Python int too large to convert to C long

So I set the size for csv.field_size_limit to C's LONG_MAX, which is 2147483647, but I still get the error shown in the screenshots attached to the issue.


Useful code for checking progress during indexing

Hi all,
I just wanted to share this very helpful query that checks the status of a CREATE INDEX process. I have one that's been running for almost 24 hours, and seeing the progress creep up has been a stress relief.

Running it in pgAdmin gives a result table showing the progress.

      SELECT
        now()::TIME(0),
        a.query,
        p.phase,
        round(p.blocks_done / p.blocks_total::numeric * 100, 2) AS "% done",
        p.blocks_total,
        p.blocks_done,
        p.tuples_total,
        p.tuples_done,
        ai.schemaname,
        ai.relname,
        ai.indexrelname
      FROM pg_stat_progress_create_index p
      JOIN pg_stat_activity a ON p.pid = a.pid
      LEFT JOIN pg_stat_all_indexes ai ON ai.relid = p.relid AND ai.indexrelid = p.index_relid;

source:
https://dba.stackexchange.com/questions/11329/monitoring-progress-of-index-construction-in-postgresql/249784#249784?newreg=7a3d794dcb154b6782ca390ead375050

Postgres tables only loading 2000 rows a table

Using the batch-process script openlibrary-data-chunk-process.py and following it with openlibrary-db.sql only loads 2,000 rows per created table and seems to stop after that. Postgres version is 15.2; everything else is fine and the processed files are in the data/processed folder.

null value in author_key when populating the authorship table

Using the 2022/11/15 data dumps, and using

insert into authorship
select distinct
    jsonb_array_elements(data->'authors')->'author'->>'key',
    key
from works
where
    key is not null
    and data->'authors'->0->'author' is not null

I'm getting the following error:

ERROR:  null value in column "author_key" of relation "authorship" violates not-null constraint
DETAIL:  Failing row contains (null, /works/OL20655218W).
SQL state: 23502

PostgreSQL 14.6 (Homebrew)

Error loading editions

I'm getting this error when trying to copy editions. The next step in the readme is to update editions and set the work key, but I'm getting an error about missing data for work_key even when just trying to do the copy.

Do I need to add another column to the end of each line of the editions dump?

postgres=# COPY editions FROM '/Users/thedug/Documents/Workspace/ol_dump/ol_dump_editions_2021-11-30_processed.csv' DELIMITER E'\t' QUOTE '|' CSV;
ERROR: missing data for column "work_key"
CONTEXT: COPY editions, line 1: "/type/edition /books/OL10001035M 2 2010-03-11T23:52:40.542344 {"publishers": ["Stationery Office Boo..."

I am experiencing slow query performance in my PostgreSQL database

Even after optimizing my Python code and database configuration, queries are taking over 2 minutes and 30 seconds to execute.

I am running the following SQL queries:

  • Query 1: SELECT data FROM authors WHERE data ->> 'name' ILIKE %s
  • Extract the author key from the result of Query 1.
  • Use the author key in Query 2: SELECT * FROM works WHERE data-> 'authors' @> '[{{\"author\": {{\"key\": \"{author_key}\"}}}}]'

Environment Details

  • PostgreSQL Version: 14
  • Operating System: macOS
  • Hardware: M2, 24GB RAM

I have already tried the following:

  • Optimizing Python code
  • Adjusting PostgreSQL configurations (shared_buffers, work_mem, etc.)
  • Adding appropriate indexes

I expected both queries to execute much faster, ideally within a few seconds.

Thank you 🙏
