Giter Club home page Giter Club logo

arxiv-fulltext's Introduction

arXiv plain text extraction

This service extracts plain text from arXiv PDFs. This is used during the submission process as part of routine quality assurance checks. This is also used after announcement, to update quality assurance tools (e.g. overlap detection) and to make plain text content available to authorized API consumers.

Bulk access to arXiv e-prints

See the arXiv help pages for instructions about how to obtain access to arXiv e-prints in bulk.

TODO

Quick start

Dependencies

We use Pipenv for managing Python dependencies. You can install all of the dependencies for this project like this:

pipenv install --dev

Docker

The minimal working service cluster requires the API application, the worker application, a Docker host (e.g. a dind container), and a Redis for the task queue/result backend. The easiest way to get these all running together is to use the docker-compose.yml configuration in the root of this repository.

Note that the DinD container has its own image storage, so it will need to pull down the extractor image before the API will start handling requests.

docker-compose build    # Will build the API and worker images.
docker-compose up

This will also start a mock arXiv server that yields a PDF for requests to /pdf/<arxiv id>. Note that the same PDF is returned no matter what ID is requested.

Authn/z

To use the API you will need an auth token with scopes fulltext:read and fulltext:write. The easiest way to generate one of these is to use the helper script generate-token included with arxiv.auth package.

JWT_SECRET=foosecret pipenv run generate-token --scope="fulltext:read fulltext:create"

Make sure that you use the same JWT_SECRET that is used in docker-compose.yml.

To extract text from announced e-prints, you need only fulltext:create and fulltext:read.

To extract text from submissions, you will also need compile:read and compile:create. Your user ID will also need to match the owner stated by the mock preview endpoint in mock_arxiv.py.

You should pass this token as the value of the Authorization header in all requests to the API. For example:

curl -XPOST -H "Authorization: [auth token]" http://127.0.0.1:8000/arxiv/1802.00125

This should return status 202 (Accepted) with the following JSON response: {"reason": "fulltext extraction in process"}.

You can retrieve the result with:

curl -H "Authorization: [auth token]" http://127.0.0.1:8000/arxiv/1802.00125

or

curl -H "Accept: text/plain" -H "Authorization: [auth token]" http://127.0.0.1:8000/arxiv/1802.00125

Notes

When you first start the cluster, the worker may take a little while to start up. This is because it is waiting for the dind container to pull down the extractor image. You may notice that the dind container chews up a lot of CPU during this process (e.g. if you use docker stats). Once the worker starts (i.e. you see log output from the worker) it should be ready to begin processing texts. Any requests made prior to that point will simply be enqueued.

Also note that running things with docker compose is a bit fragile if it feels squeezed for resources. If you have trouble getting/keeping the dind container running, try giving Docker a bit more juice. If your computer goes to sleep and comes back up, and/or the dind container starts complaining about containerd not starting, you may want to start fresh with:

docker-compose rm -v
DIND_LIB_DIR=/tmp/dind-var-lib-docker docker-compose up

Tests

You can run all of the tests with:

CELERY_ALWAYS_EAGER=1 pipenv run nose2 -s extractor -s ./ --with-coverage --coverage-report=term-missing

Note that the app tests in tests/ require Docker to be running and available at /var/run/docker.sock. If you do not already have the image arxiv/fulltext-extractor:0.3 on your system, the first run may take a little longer than usual.

Documentation

The latest documentation can be found at https://arxiv.github.io/arxiv-fulltext.

Building

sphinx-apidoc -o docs/source/fulltext/ -e -f -M fulltext *test*/*
cd docs/
make html SPHINXBUILD=$(pipenv --venv)/bin/sphinx-build

License

See LICENSE.

arxiv-fulltext's People

Contributors

erickpeirson avatar mattbierbaum avatar mhl10 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.