
DLCS - Composite Handler

The DLCS Composite Handler is an implementation of DLCS RFC011.

About

The component is written in Python and utilises Django with the following extensions:

  • django-q - the task queue used by the engine (run via manage.py qcluster)
  • django-environ - environment-based configuration (e.g. DATABASE_URL, CACHE_URL)

Additionally, the project uses:

  • boto3 - S3 uploads and the optional SQS broker
  • pdf2image / Poppler - PDF rasterization
  • gunicorn (fronted by nginx) - serving the API

Getting Started

The project ships with a docker-compose.yml that can be used to get a local version of the component running:

docker compose up

Note that for the Composite Handler to be able to interact with the target S3 bucket, the Docker Compose assumes that the AWS_PROFILE environment variable has been set and a valid AWS session is available.

This will create a PostgreSQL instance, bootstrap it with the required tables, deploy a single instance of the API, and three instances of the engine. Requests can then be targeted at localhost:8000.

The component can also be run directly, either in an IDE or from the CLI. The component must first be configured either via the creation of a .env file (see .env.dist for an example configuration), or via a set of environment variables (see the Configuration section).
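The precedence between the two configuration sources can be sketched as follows. This is an illustrative stdlib-only model (the project itself uses django-environ, per the Configuration section): a .env file supplies defaults, and real environment variables override them. The helper name and the sample values are hypothetical.

```python
import os

# Hypothetical helper mirroring the configuration precedence described
# above: parse KEY=value lines from a .env file, then let process
# environment variables win over the file's values.
def read_config(env_lines, environ):
    config = {}
    for line in env_lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    config.update(environ)  # real environment variables take precedence
    return config

dotenv = [
    "# Illustrative .env - see .env.dist for the real example",
    "DATABASE_URL=postgresql://dlcs:password@postgres:5432/compositedb",
    "CACHE_URL=dbcache://app_cache",
    "SCRATCH_DIRECTORY=/tmp/scratch",
]
settings = read_config(dotenv, {"DJANGO_DEBUG": "False"})
```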

Once configuration is in place, the following commands will start the API and / or engine:

  • API: python manage.py runserver 0.0.0.0:8000
  • Engine: python manage.py qcluster

Should the required tables not exist in the target database, the following commands should be run first:

python manage.py migrate
python manage.py createcachetable

Once the API is running, an administrator interface can be accessed via the browser at http://localhost:8000/admin. To create an administrator login, run the following command:

python manage.py createsuperuser

The administrator user can be used to browse the database and manage the queue (including deleting tasks and resubmitting failed tasks into the queue).

Entrypoints

There are 3 entrypoint scripts that make the above easier:

  • entrypoint.sh - waits for Postgres to be available, then runs manage.py migrate and manage.py createcachetable if MIGRATE=True. It will run manage.py createsuperuser if INIT_SUPERUSER=True (this also requires the DJANGO_SUPERUSER_* environment variables to be set).
  • entrypoint-api.sh - runs the above, then starts an nginx instance fronting a gunicorn process.
  • entrypoint-worker.sh - runs the above, then python manage.py qcluster.

Configuration

The following environment variables are supported:

| Environment Variable | Default Value | Component(s) | Description |
| --- | --- | --- | --- |
| DJANGO_DEBUG | True | API, Engine | Whether Django should run in debug mode. Useful for development purposes but should be set to False in production. |
| DJANGO_SECRET_KEY | None | API, Engine | The secret key used by Django when generating sensitive tokens. This should be a randomly generated 50-character string. |
| SCRATCH_DIRECTORY | /tmp/scratch | Engine | A locally accessible filesystem path where work-in-progress files are written during rasterization. |
| WEB_SERVER_SCHEME | http | API | The HTTP scheme used when generating URIs. |
| WEB_SERVER_HOSTNAME | localhost:8000 | API | The hostname (and optional port) used when generating URIs. |
| ORIGIN_CHUNK_SIZE | 8192 | Engine | The chunk size, in bytes, used when retrieving objects from origins. Tuning this value can theoretically improve download speeds. |
| DATABASE_URL | None | API, Engine | The URL of the target PostgreSQL database, in a format acceptable to django-environ, e.g. postgresql://dlcs:password@postgres:5432/compositedb. |
| CACHE_URL | None | API, Engine | The URL of the target cache, in a format acceptable to django-environ, e.g. dbcache://app_cache. |
| PDF_RASTERIZER_THREAD_COUNT | 3 | Engine | The number of concurrent Poppler threads spawned when a worker is rasterizing a PDF. Each thread typically consumes 100% of a CPU core. |
| PDF_RASTERIZER_DPI | 500 | Engine | The DPI of images generated during rasterization. For JPEGs, the default value of 500 typically produces images approximately 1.5 MiB to 2 MiB in size. |
| PDF_RASTERIZER_FALLBACK_DPI | 200 | Engine | The DPI to use for images that exceed the pdftoppm memory limit and produce a 1x1 pixel image (see Belval/pdf2image#34). |
| PDF_RASTERIZER_FORMAT | jpg | Engine | The format in which rasterized images are generated. Supported values are ppm, jpeg / jpg, png and tiff. |
| PDF_RASTERIZER_MAX_LENGTH | 0 | Engine | Optional. The maximum size, in pixels, of the longest edge that will be saved. If a rasterized image exceeds this, it will be resized, maintaining aspect ratio. |
| DLCS_API_ROOT | https://api.dlcs.digirati.io | Engine | The root URI of the API of the target DLCS deployment, without the trailing slash. |
| DLCS_S3_BUCKET_NAME | dlcs-composite-images | Engine | The S3 bucket that the Composite Handler will push rasterized images to, for consumption by the wider DLCS. Both the Composite Handler and the DLCS must have access to this bucket. |
| DLCS_S3_OBJECT_KEY_PREFIX | composites | Engine | The S3 key prefix used when pushing images to DLCS_S3_BUCKET_NAME - in other words, the folder within the S3 bucket into which images are stored. |
| DLCS_S3_UPLOAD_THREADS | 8 | Engine | The number of concurrent threads used when pushing images to the S3 bucket. More threads significantly lower the time spent pushing images to S3, but too high a value will cause issues with Boto3. 8 is a tested and sensible value. |
| ENGINE_WORKER_COUNT | 2 | Engine | The number of workers a single instance of the engine will spawn. Each worker handles the processing of a single PDF, so the total number of PDFs that can be processed concurrently is engine_count * worker_count. |
| ENGINE_WORKER_TIMEOUT | 3600 | Engine | The number of seconds a task (i.e. the processing of a single PDF) can run before being terminated and treated as a failure. This is useful for purging "stuck" tasks which haven't technically failed but are occupying a worker. |
| ENGINE_WORKER_RETRY | 4500 | Engine | The number of seconds after a task is presented for processing before a worker will re-run it, regardless of whether it is still running or has failed. As such, this value must be higher than ENGINE_WORKER_TIMEOUT. |
| ENGINE_WORKER_MAX_ATTEMPTS | 0 | Engine | The number of processing attempts a single task will undergo before it is abandoned. Setting this value to 0 will cause a task to be retried forever. |
| MIGRATE | None | API, Engine | If "True", migrations and createcachetable are run on startup (when an entrypoint script is used). |
| INIT_SUPERUSER | None | API, Engine | If "True", an attempt is made to create a superuser on startup (when an entrypoint script is used). Requires the standard Django environment variables to be set (e.g. DJANGO_SUPERUSER_USERNAME, DJANGO_SUPERUSER_EMAIL, DJANGO_SUPERUSER_PASSWORD). |
| GUNICORN_WORKERS | 2 | API | The value of the --workers argument when running gunicorn. |
| SQS_BROKER_QUEUE_NAME | None | API, Engine | If set, the django-q SQS broker is used, and the queue is created if it doesn't exist. If empty, the default Django ORM broker is used. |
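To illustrate how the PDF_RASTERIZER_* variables map onto a rasterization call: pdf2image's convert_from_path drives Poppler's pdftoppm, and it accepts dpi, fmt and thread_count parameters matching the settings above. The helper below is an illustrative sketch (the function name and structure are not from the project's source):

```python
import os

# Hypothetical helper: derive pdf2image keyword arguments from the
# PDF_RASTERIZER_* environment variables, using the documented defaults.
def rasterizer_kwargs(environ=os.environ):
    return {
        "dpi": int(environ.get("PDF_RASTERIZER_DPI", "500")),
        "fmt": environ.get("PDF_RASTERIZER_FORMAT", "jpg"),
        "thread_count": int(environ.get("PDF_RASTERIZER_THREAD_COUNT", "3")),
    }

# Usage (requires pdf2image and Poppler to be installed):
#   from pdf2image import convert_from_path
#   pages = convert_from_path("input.pdf", output_folder=scratch_dir,
#                             **rasterizer_kwargs())
```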

Note that in order to access the S3 bucket, the Composite Handler assumes that valid AWS credentials are available in the environment - either as environment variables or as ambient credentials.
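A quick way to check that ambient credentials are resolvable is to ask boto3 directly; its credential chain covers environment variables, shared config/credential files (honouring AWS_PROFILE), and instance or task roles. The helper name below is illustrative:

```python
# Hypothetical pre-flight check: returns True only if boto3 can resolve
# credentials from the environment, profile, or an ambient role.
def have_aws_credentials():
    try:
        import boto3  # already a project dependency (used for S3 uploads)
        return boto3.Session().get_credentials() is not None
    except Exception:
        return False
```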

Django Q Broker

By default Django Q will use the default Django ORM broker.

The SQS broker can be configured by specifying the SQS_BROKER_QUEUE_NAME environment variable. Default SQS broker behaviour is to create this queue if it is not found.
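In Django settings terms, the broker switch implied above might look like the sketch below. The key names follow django-q's documented Q_CLUSTER options (queue_name, sqs, orm); the function and the exact values are illustrative, not the project's actual settings module:

```python
import os

# Hypothetical settings logic: use the SQS broker only when
# SQS_BROKER_QUEUE_NAME is set, otherwise fall back to the default
# Django ORM broker.
def q_cluster(environ=os.environ):
    cluster = {
        "name": "composite_handler",
        "workers": int(environ.get("ENGINE_WORKER_COUNT", "2")),
        "timeout": int(environ.get("ENGINE_WORKER_TIMEOUT", "3600")),
        "retry": int(environ.get("ENGINE_WORKER_RETRY", "4500")),
    }
    queue_name = environ.get("SQS_BROKER_QUEUE_NAME")
    if queue_name:
        cluster["queue_name"] = queue_name
        # region is an assumption; django-q also accepts explicit keys here
        cluster["sqs"] = {"aws_region": environ.get("AWS_REGION", "eu-west-1")}
    else:
        cluster["orm"] = "default"  # default Django ORM broker
    return cluster
```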

As with S3, above, Composite Handler assumes that valid AWS credentials are available in the environment.

Building

The project ships with a Dockerfile:

docker build -t dlcs/composite-handler:local .

This will produce a single image that can be used to execute any of the supported Django commands, including running the API and the engine:

docker run dlcs/composite-handler:local python manage.py migrate # Apply any pending DB schema changes
docker run dlcs/composite-handler:local python manage.py createcachetable # Create the cache table (if it doesn't exist)
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-api.sh # Run the API
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-worker.sh # Run the engine
docker run dlcs/composite-handler:local python manage.py qmonitor # Monitor the workers


Issues

Detecting failures

We'll need to confirm whether this is still ongoing, but it seems there may be an issue where task failure isn't detected correctly.

I initially thought this could be related to running out of resources (due to the memory issue resolved in #71). When using the default django-q broker, max_retries seemed to be ignored: items were continuously picked up and aborted, and the endpoint for checking status wasn't returning an error.

#68 added support for the SQS broker; it looks like max_retries is ignored in this case too. I tested using dead-letter queues via SQS to detect failures, but in that case we need to work out how to determine which message has failed. The tasks are pickled and signed, so we'd need to investigate whether that is possible. Would the API / worker need to listen to the DLQ and mark the task as failed? Is there a standard pattern for this?

This may be related to how the API has been implemented and something may have been overlooked.

Tasks should synchronously scavenge scratch disk

After rasterizing pdf/uploading images/creating DLCS batch etc an async_task call is made to scavenge disk space.

finally:
    if folder_path:
        async_task(
            "app.engine.tasks.cleanup_scratch",
            folder_path,
            task_name="Scavenger: [{0}]".format(args["id"]),
        )

This is causing instances to run out of disk space when horizontally scaled without a shared scratch space. The cleanup_scratch task can be picked up by workers that do not have the working folder present in their ephemeral storage, so the folder is never deleted. The more instances running, the more likely this is to occur.

The fix is to run the cleanup as a synchronous step after processing completes.
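The proposed fix can be sketched as follows. This is an illustrative outline, not the project's actual code: the function name and arguments are hypothetical, and the point is simply that the scratch folder is removed in the same process that created it rather than via a queued task:

```python
import shutil

# Hypothetical worker function: clean up the scratch folder synchronously
# in a finally block, on the instance whose local disk actually holds it,
# instead of dispatching a separate cleanup_scratch async_task.
def process_composite(args, folder_path):
    try:
        pass  # rasterize the PDF, upload images, create the DLCS batch, etc.
    finally:
        if folder_path:
            # ignore_errors guards against a folder that was never created
            shutil.rmtree(folder_path, ignore_errors=True)
```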

Use alternative Broker

Look at using an alternative broker. We are currently running the default Django ORM broker. This is handy for running locally, but it caused issues with the database when scaled up, due to the volume of incoming requests. The docs recommend not using it unless we have a dedicated database instance, which we generally won't.

See https://django-q.readthedocs.io/en/latest/brokers.html for details of alternative brokers. SQS seems like the best candidate as we already use boto3 for S3, but there may be a better alternative.
