
DLCS - Composite Handler

The DLCS Composite Handler is an implementation of DLCS RFC011.

About

The component is written in Python and utilises Django with the following extensions:

  • django-q - the task queue used by the engine (run via manage.py qcluster)
  • django-environ - environment-based configuration (e.g. DATABASE_URL, CACHE_URL)

Additionally, the project uses:

  • boto3 - S3 uploads and the optional SQS broker
  • pdf2image / Poppler - PDF rasterization
  • gunicorn (fronted by nginx) - serving the API

Getting Started

The project ships with a docker-compose.yml that can be used to get a local version of the component running:

docker compose up

Note that for the Composite Handler to be able to interact with the target S3 bucket, the Docker Compose assumes that the AWS_PROFILE environment variable has been set and a valid AWS session is available.

This will create a PostgreSQL instance, bootstrap it with the required tables, deploy a single instance of the API, and three instances of the engine. Requests can then be targeted at localhost:8000.

The component can also be run directly, either in an IDE or from the CLI. The component must first be configured either via the creation of a .env file (see .env.dist for an example configuration), or via a set of environment variables (see the Configuration section).
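The precedence between the two configuration sources can be sketched as follows. This is an illustrative stdlib-only model (the project itself uses django-environ, per the Configuration section): a .env file supplies defaults, and real environment variables override them. The helper name and the sample values are hypothetical.

```python
import os

# Hypothetical helper mirroring the configuration precedence described
# above: parse KEY=value lines from a .env file, then let process
# environment variables win over the file's values.
def read_config(env_lines, environ):
    config = {}
    for line in env_lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    config.update(environ)  # real environment variables take precedence
    return config

dotenv = [
    "# Illustrative .env - see .env.dist for the real example",
    "DATABASE_URL=postgresql://dlcs:password@postgres:5432/compositedb",
    "CACHE_URL=dbcache://app_cache",
    "SCRATCH_DIRECTORY=/tmp/scratch",
]
settings = read_config(dotenv, {"DJANGO_DEBUG": "False"})
```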

Once configuration is in place, the following commands will start the API and / or engine:

  • API: python manage.py runserver 0.0.0.0:8000
  • Engine: python manage.py qcluster

Should the required tables not exist in the target database, the following commands should be run first:

python manage.py migrate
python manage.py createcachetable

Once the API is running, an administrator interface can be accessed via the browser at http://localhost:8000/admin. To create an administrator login, run the following command:

python manage.py createsuperuser

The administrator user can be used to browse the database and manage the queue (including deleting tasks and resubmitting failed tasks into the queue).

Entrypoints

There are 3 entrypoint scripts that make the above easier:

  • entrypoint.sh - waits for Postgres to be available, then runs manage.py migrate and manage.py createcachetable if MIGRATE=True. It will run manage.py createsuperuser if INIT_SUPERUSER=True (this also requires the DJANGO_SUPERUSER_* environment variables to be set).
  • entrypoint-api.sh - runs the above, then starts an nginx instance fronting a gunicorn process.
  • entrypoint-worker.sh - runs the above, then python manage.py qcluster.

Configuration

The following environment variables are supported:

| Environment Variable | Default Value | Component(s) | Description |
| --- | --- | --- | --- |
| DJANGO_DEBUG | True | API, Engine | Whether Django should run in debug mode. Useful for development purposes but should be set to False in production. |
| DJANGO_SECRET_KEY | None | API, Engine | The secret key used by Django when generating sensitive tokens. This should be a randomly generated 50-character string. |
| SCRATCH_DIRECTORY | /tmp/scratch | Engine | A locally accessible filesystem path where work-in-progress files are written during rasterization. |
| WEB_SERVER_SCHEME | http | API | The HTTP scheme used when generating URIs. |
| WEB_SERVER_HOSTNAME | localhost:8000 | API | The hostname (and optional port) used when generating URIs. |
| ORIGIN_CHUNK_SIZE | 8192 | Engine | The chunk size, in bytes, used when retrieving objects from origins. Tuning this value can theoretically improve download speeds. |
| DATABASE_URL | None | API, Engine | The URL of the target PostgreSQL database, in a format acceptable to django-environ, e.g. postgresql://dlcs:password@postgres:5432/compositedb. |
| CACHE_URL | None | API, Engine | The URL of the target cache, in a format acceptable to django-environ, e.g. dbcache://app_cache. |
| PDF_RASTERIZER_THREAD_COUNT | 3 | Engine | The number of concurrent Poppler threads spawned when a worker is rasterizing a PDF. Each thread typically consumes 100% of a CPU core. |
| PDF_RASTERIZER_DPI | 500 | Engine | The DPI of images generated during rasterization. For JPEGs, the default value of 500 typically produces images approximately 1.5 MiB to 2 MiB in size. |
| PDF_RASTERIZER_FALLBACK_DPI | 200 | Engine | The DPI to use for images that exceed the pdftoppm memory limit and produce a 1x1 pixel image (see Belval/pdf2image#34). |
| PDF_RASTERIZER_FORMAT | jpg | Engine | The format in which rasterized images are generated. Supported values are ppm, jpeg / jpg, png and tiff. |
| PDF_RASTERIZER_MAX_LENGTH | 0 | Engine | Optional. The maximum size, in pixels, of the longest edge that will be saved. If a rasterized image exceeds this, it will be resized, maintaining aspect ratio. |
| DLCS_API_ROOT | https://api.dlcs.digirati.io | Engine | The root URI of the API of the target DLCS deployment, without the trailing slash. |
| DLCS_S3_BUCKET_NAME | dlcs-composite-images | Engine | The S3 bucket that the Composite Handler will push rasterized images to, for consumption by the wider DLCS. Both the Composite Handler and the DLCS must have access to this bucket. |
| DLCS_S3_OBJECT_KEY_PREFIX | composites | Engine | The S3 key prefix used when pushing images to DLCS_S3_BUCKET_NAME - in other words, the folder within the S3 bucket into which images are stored. |
| DLCS_S3_UPLOAD_THREADS | 8 | Engine | The number of concurrent threads used when pushing images to the S3 bucket. More threads significantly lower the time spent pushing images to S3, but too high a value will cause issues with Boto3. 8 is a tested and sensible value. |
| ENGINE_WORKER_COUNT | 2 | Engine | The number of workers a single instance of the engine will spawn. Each worker handles the processing of a single PDF, so the total number of PDFs that can be processed concurrently is engine_count * worker_count. |
| ENGINE_WORKER_TIMEOUT | 3600 | Engine | The number of seconds a task (i.e. the processing of a single PDF) can run before being terminated and treated as a failure. This is useful for purging "stuck" tasks which haven't technically failed but are occupying a worker. |
| ENGINE_WORKER_RETRY | 4500 | Engine | The number of seconds after a task is presented for processing before a worker will re-run it, regardless of whether it is still running or has failed. As such, this value must be higher than ENGINE_WORKER_TIMEOUT. |
| ENGINE_WORKER_MAX_ATTEMPTS | 0 | Engine | The number of processing attempts a single task will undergo before it is abandoned. Setting this value to 0 will cause a task to be retried forever. |
| MIGRATE | None | API, Engine | If "True", migrations and createcachetable are run on startup (when an entrypoint script is used). |
| INIT_SUPERUSER | None | API, Engine | If "True", an attempt is made to create a superuser on startup (when an entrypoint script is used). Requires the standard Django environment variables to be set (e.g. DJANGO_SUPERUSER_USERNAME, DJANGO_SUPERUSER_EMAIL, DJANGO_SUPERUSER_PASSWORD). |
| GUNICORN_WORKERS | 2 | API | The value of the --workers argument when running gunicorn. |
| SQS_BROKER_QUEUE_NAME | None | API, Engine | If set, the django-q SQS broker is used, and the queue is created if it doesn't exist. If empty, the default Django ORM broker is used. |
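To illustrate how the PDF_RASTERIZER_* variables map onto a rasterization call: pdf2image's convert_from_path drives Poppler's pdftoppm, and it accepts dpi, fmt and thread_count parameters matching the settings above. The helper below is an illustrative sketch (the function name and structure are not from the project's source):

```python
import os

# Hypothetical helper: derive pdf2image keyword arguments from the
# PDF_RASTERIZER_* environment variables, using the documented defaults.
def rasterizer_kwargs(environ=os.environ):
    return {
        "dpi": int(environ.get("PDF_RASTERIZER_DPI", "500")),
        "fmt": environ.get("PDF_RASTERIZER_FORMAT", "jpg"),
        "thread_count": int(environ.get("PDF_RASTERIZER_THREAD_COUNT", "3")),
    }

# Usage (requires pdf2image and Poppler to be installed):
#   from pdf2image import convert_from_path
#   pages = convert_from_path("input.pdf", output_folder=scratch_dir,
#                             **rasterizer_kwargs())
```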

Note that in order to access the S3 bucket, the Composite Handler assumes that valid AWS credentials are available in the environment - either as environment variables or as ambient credentials.
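A quick way to check that ambient credentials are resolvable is to ask boto3 directly; its credential chain covers environment variables, shared config/credential files (honouring AWS_PROFILE), and instance or task roles. The helper name below is illustrative:

```python
# Hypothetical pre-flight check: returns True only if boto3 can resolve
# credentials from the environment, profile, or an ambient role.
def have_aws_credentials():
    try:
        import boto3  # already a project dependency (used for S3 uploads)
        return boto3.Session().get_credentials() is not None
    except Exception:
        return False
```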

Django Q Broker

By default Django Q will use the default Django ORM broker.

The SQS broker can be configured by specifying the SQS_BROKER_QUEUE_NAME environment variable. Default SQS broker behaviour is to create this queue if it is not found.
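In Django settings terms, the broker switch implied above might look like the sketch below. The key names follow django-q's documented Q_CLUSTER options (queue_name, sqs, orm); the function and the exact values are illustrative, not the project's actual settings module:

```python
import os

# Hypothetical settings logic: use the SQS broker only when
# SQS_BROKER_QUEUE_NAME is set, otherwise fall back to the default
# Django ORM broker.
def q_cluster(environ=os.environ):
    cluster = {
        "name": "composite_handler",
        "workers": int(environ.get("ENGINE_WORKER_COUNT", "2")),
        "timeout": int(environ.get("ENGINE_WORKER_TIMEOUT", "3600")),
        "retry": int(environ.get("ENGINE_WORKER_RETRY", "4500")),
    }
    queue_name = environ.get("SQS_BROKER_QUEUE_NAME")
    if queue_name:
        cluster["queue_name"] = queue_name
        # region is an assumption; django-q also accepts explicit keys here
        cluster["sqs"] = {"aws_region": environ.get("AWS_REGION", "eu-west-1")}
    else:
        cluster["orm"] = "default"  # default Django ORM broker
    return cluster
```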

As with S3, above, Composite Handler assumes that valid AWS credentials are available in the environment.

Building

The project ships with a Dockerfile:

docker build -t dlcs/composite-handler:local .

This will produce a single image that can be used to execute any of the supported Django commands, including running the API and the engine:

docker run dlcs/composite-handler:local python manage.py migrate # Apply any pending DB schema changes
docker run dlcs/composite-handler:local python manage.py createcachetable # Create the cache table (if it doesn't exist)
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-api.sh # Run the API
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-worker.sh # Run the engine
docker run dlcs/composite-handler:local python manage.py qmonitor # Monitor the workers


Issues

Detecting failures

We'll need to confirm whether this is still ongoing, but it seems there may be an issue where task failure isn't detected correctly.

I initially thought this could be related to running out of resources (due to the memory issue resolved in #71). When using the default django-q broker, max_retries seemed to be ignored: items were continuously picked up and aborted, and the endpoint for checking status wasn't returning an error.

#68 added support for the SQS broker; it looks like max_retries is ignored in this case too. I tested using dead-letter queues via SQS to detect failures, but in that case we need to work out how to determine which message has failed. The tasks are pickled and signed, so we'd need to investigate whether that is possible. Would the API / worker need to listen to the DLQ and mark the task as failed? Is there a standard pattern for this?

This may be related to how the API has been implemented and something may have been overlooked.

Tasks should synchronously scavenge scratch disk

After rasterizing pdf/uploading images/creating DLCS batch etc an async_task call is made to scavenge disk space.

finally:
    if folder_path:
        async_task(
            "app.engine.tasks.cleanup_scratch",
            folder_path,
            task_name="Scavenger: [{0}]".format(args["id"]),
        )

This is causing instances to run out of disk space when horizontally scaled without a shared scratch space. The cleanup_scratch task can be picked up by workers that do not have the working folder present in their ephemeral storage, so the folder is never deleted. The more instances running, the more likely this is to occur.

The fix is to run the cleanup as a synchronous step after processing completes.
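The proposed fix can be sketched as follows. This is an illustrative outline, not the project's actual code: the function name and arguments are hypothetical, and the point is simply that the scratch folder is removed in the same process that created it rather than via a queued task:

```python
import shutil

# Hypothetical worker function: clean up the scratch folder synchronously
# in a finally block, on the instance whose local disk actually holds it,
# instead of dispatching a separate cleanup_scratch async_task.
def process_composite(args, folder_path):
    try:
        pass  # rasterize the PDF, upload images, create the DLCS batch, etc.
    finally:
        if folder_path:
            # ignore_errors guards against a folder that was never created
            shutil.rmtree(folder_path, ignore_errors=True)
```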

Use alternative Broker

Look at using an alternative broker. We are currently running the default Django ORM broker. This is handy for running locally, but it caused issues with the database when scaled up, due to the volume of incoming requests. The docs recommend not using it unless we have a dedicated database instance, which we generally won't.

See https://django-q.readthedocs.io/en/latest/brokers.html for details of alternative brokers. SQS seems like the best candidate as we already use boto3 for S3, but there may be a better alternative.
