OBVILCorpusImporter

This project is intended to ease the mass import of the OBVIL Library into the OBVIL OAI-PMH repository.

What is this script doing

Once launched with the proper command, for instance

python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json

the script crawls the specified [1] OBVIL corpora available in the OBVIL Library.

It will:

  • save the XML/TEI version of the texts in the specified directory (i.e. "crawled_data");
  • extract the relevant header metadata to be exposed in the OAI-PMH repository (e.g. dc:creator, dc:relation, dc:rights, dc:format, dc:identifier, dc:title, dc:contributor...);
  • create a thumbnail ("vignette") for each document. All the thumbnails have been generated once and are stored here. In case some are missing, you may consider copying them over directly with scp, using your admin privileges;
  • build one Omeka CSV import file per specified project, with all the necessary information, in the specified directory (i.e. "crawled_data").

TL;DR:
  • python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
  • All you need is in the folder crawled_data.
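
The metadata-extraction and CSV-building steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the TEI paths in DC_PATHS, the function names extract_dc and records_to_csv, the " | " separator for repeated values, and the sample document are all assumptions, and the real column layout expected by the Omeka CSV import may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping from Dublin Core fields to TEI header paths.
DC_PATHS = {
    "dc:title": ".//tei:titleStmt/tei:title",
    "dc:creator": ".//tei:titleStmt/tei:author",
    "dc:rights": ".//tei:publicationStmt/tei:availability/tei:licence",
    "dc:identifier": ".//tei:publicationStmt/tei:idno",
}

# Made-up TEI document used only to exercise the functions below.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Title</title>
        <author>Jane Doe</author>
      </titleStmt>
      <publicationStmt>
        <idno>http://example.org/sample</idno>
        <availability><licence>CC BY-NC-ND</licence></availability>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_dc(tei_xml: str) -> dict:
    """Return one Dublin Core record (field -> text) from a TEI document."""
    root = ET.fromstring(tei_xml)
    record = {}
    for field, path in DC_PATHS.items():
        # Collect the text of every matching element, normalising whitespace.
        values = [
            " ".join("".join(el.itertext()).split())
            for el in root.findall(path, TEI_NS)
        ]
        # Repeated values are joined with " | " (an arbitrary choice).
        record[field] = " | ".join(v for v in values if v)
    return record


def records_to_csv(records) -> str:
    """Serialise records as CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(DC_PATHS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Calling records_to_csv([extract_dc(SAMPLE_TEI)]) yields a header line followed by one row for the sample document; a real run would loop over every XML/TEI file saved in the output directory.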

What it does not do (i.e. DIY)

To successfully import the documents into the OAI-PMH repository, you will need to:

  • Run this script with the right options and configuration.
  • Put the generated vignettes in the right place on the server if they are missing.
  • Manually import the generated CSV file into Omeka, with proper rights and mappings.
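
Before copying vignettes to the server, it can help to know which documents lack one. The sketch below compares two local directories; it assumes, hypothetically, that each thumbnail shares its file stem with the corresponding XML/TEI file and uses a ".png" extension. Neither the naming scheme nor the function name missing_vignettes comes from this project.

```python
from pathlib import Path


def missing_vignettes(xml_dir, vignette_dir, suffix=".png"):
    """List document stems that have an XML file but no thumbnail.

    Assumes (hypothetically) that "foo.xml" pairs with "foo" + suffix.
    """
    documents = {p.stem for p in Path(xml_dir).glob("*.xml")}
    thumbnails = {p.stem for p in Path(vignette_dir).glob(f"*{suffix}")}
    return sorted(documents - thumbnails)
```

For instance, missing_vignettes("crawled_data", "vignettes") would return the sorted stems of the crawled documents that still need a thumbnail, where "vignettes" stands for whatever local directory holds the images.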

Disclaimer

  • Should you run these spiders, you are going to scrape A LOT of data. Use at your own risk!

  • The texts provided by OBVIL are copyrighted.

[1] To specify which corpora should be imported, you will need to customize a configuration file. See the "configs" directory of this repo.

Contributors

valerie-hanoka
