manuscript-dl

Collection of scripts to download digitized manuscripts from different online libraries.

Some online libraries provide a convenient way to download a complete manuscript as a PDF file. Some don't. Mad scripting skills to the rescue!

Supported libraries

To download a book from nb.no (National Library of Norway):

  1. Find out its ID, e.g.: https://www.nb.no/items/URN:NBN:no-nb_digibok_2008091504048?page=1
  2. (optional) Copy the request as a curl command from your browser so that you preserve its cookies, then adapt the cookie header for the -H option.
  3. Run:
$ python ./nb.no.py -H 'cookie: something' URN:NBN:no-nb_digibok_2008091504048
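If you prefer to pull the ID out of the item URL programmatically, here is a minimal sketch. It assumes the ID is simply the last path segment of the URL, as in the example above; `extract_urn` is a hypothetical helper and is not part of nb.no.py.

```python
from urllib.parse import urlparse


def extract_urn(item_url: str) -> str:
    """Return the last path segment of an item URL, ignoring the query string."""
    path = urlparse(item_url).path
    return path.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    url = "https://www.nb.no/items/URN:NBN:no-nb_digibok_2008091504048?page=1"
    print(extract_urn(url))
```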

To download a book from e-codices:

  1. Go to the book description page, e.g.: http://www.e-codices.unifr.ch/en/list/one/csg/0369
  2. Right-click the "IIIF Manifest URL" link and save it to a file, e.g. manifest.json
  3. Run:
$ e-codices.sh manifest.json [size]

size is an optional argument. The original size of manuscripts on e-codices is usually far too large and needs to be reduced.
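To illustrate what a saved manifest contains, here is a rough sketch that lists one image URL per page. It assumes the manifest follows the IIIF Presentation API 2.x layout (`sequences` / `canvases` / `images`) and that each image exposes an IIIF Image API service. The function names are hypothetical, and e-codices.sh may process the manifest and interpret `size` differently.

```python
import json


def load_manifest(path: str) -> dict:
    """Read a saved IIIF manifest from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def image_urls(manifest: dict, size: str = "full") -> list[str]:
    """Build one IIIF Image API URL per canvas (assumes Presentation API 2.x)."""
    urls = []
    for canvas in manifest["sequences"][0]["canvases"]:
        service_id = canvas["images"][0]["resource"]["service"]["@id"]
        # Image API 2.x pattern: {id}/{region}/{size}/{rotation}/{quality}.{format}
        urls.append(f"{service_id.rstrip('/')}/full/{size}/0/default.jpg")
    return urls
```

For example, `image_urls(load_manifest("manifest.json"), size="1000,")` would request every page scaled to a width of 1000 pixels, since `w,` is the Image API syntax for "this width, proportional height".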

This downloader uses the montage program (part of the ImageMagick suite) to convert images to PDFs and pdftk to concatenate the PDFs. Both pdftk and montage must be installed on your system.

Ubuntu:

sudo apt-get install pdftk imagemagick
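A quick way to confirm both tools are on your PATH before starting a long download is sketched below. `missing_tools` and `pdftk_concat_command` are hypothetical helpers, not part of this repository; the second only shows the standard `pdftk ... cat output ...` form for joining PDFs and does not reproduce the exact commands the script runs.

```python
import shutil


def missing_tools(tools=("montage", "pdftk")):
    """Return the names of required programs that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def pdftk_concat_command(page_pdfs, output_pdf):
    """Build the argument list for joining per-page PDFs with pdftk."""
    return ["pdftk", *page_pdfs, "cat", "output", output_pdf]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("montage and pdftk are available")
```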

To download a book from bl.uk (British Library), you first need to find out its short name:

  1. Open the manuscript description, e.g.: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Add_MS_24686
  2. In this case the name is "add_ms_24686" (note the lower case). To double-check, click any of the pictures below; the page that opens has a URL such as http://www.bl.uk/manuscripts/Viewer.aspx?ref=add_ms_24686_f002r
  3. Here, add_ms_24686_f002r is the manuscript name followed by the page name. You only need the manuscript name.
  4. Run bl.uk.py with the manuscript name:
$ python3 bl.uk.py add_ms_24686 --resolution 12
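The name lookup in the steps above can be sketched in a few lines. This assumes the `ref` query parameter carries the name and that, on viewer pages, the page name is the final underscore-separated part (as in `add_ms_24686_f002r`). Both helpers are hypothetical and not part of bl.uk.py.

```python
from urllib.parse import parse_qs, urlparse


def ref_from_url(url: str) -> str:
    """Return the lower-cased value of the 'ref' query parameter."""
    return parse_qs(urlparse(url).query)["ref"][0].lower()


def strip_page_name(viewer_ref: str) -> str:
    """Drop the trailing page name (assumed to follow the last underscore)."""
    return viewer_ref.rsplit("_", 1)[0]


if __name__ == "__main__":
    description = "http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Add_MS_24686"
    viewer = "http://www.bl.uk/manuscripts/Viewer.aspx?ref=add_ms_24686_f002r"
    print(ref_from_url(description))
    print(strip_page_name(ref_from_url(viewer)))
```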

This will grab all available pages at resolution 12. If you want specific pages, set a page range with the --pages A:B argument.
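How bl.uk.py interprets A:B is not spelled out here. As an illustration only, the sketch below parses such a value under the assumption that A and B are integers and that both ends are inclusive; `parse_page_range` is a hypothetical helper.

```python
def parse_page_range(spec: str) -> range:
    """Parse 'A:B' into a range of page numbers (assumes inclusive bounds)."""
    start, sep, end = spec.partition(":")
    if not sep:
        raise ValueError(f"expected A:B, got {spec!r}")
    first, last = int(start), int(end)
    if first > last:
        raise ValueError(f"start {first} is after end {last}")
    return range(first, last + 1)


if __name__ == "__main__":
    print(list(parse_page_range("3:5")))
```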

At some point the Library started replying with HTTP 429 (Too Many Requests). Faking the user agent helped. If the default user agent does not work for you, you can replace it with the --user-agent option like this:

$ python3 bl.uk.py add_ms_24686 --user-agent 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'
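For readers curious what a custom user agent looks like at the HTTP level, here is a minimal standard-library sketch. It is not the code used by bl.uk.py; `build_request` and `fetch` are hypothetical names, and the retry-with-pause behaviour on HTTP 429 is an added assumption rather than something the script is known to do.

```python
import time
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str, retries: int = 3, delay: float = 5.0) -> bytes:
    """Download a URL, pausing and retrying when the server answers HTTP 429."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(build_request(url, user_agent)) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything other than 429, and 429 on the final attempt.
            if err.code != 429 or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))
    raise ValueError("retries must be at least 1")
```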

Author

(c) 2015-2018 Yuri Bochkarev
