PDF Paragraphs Extraction

A Docker-powered service for extracting paragraphs from PDFs


This service provides one endpoint to get paragraphs from PDFs. Each paragraph includes its page number, its position on the page, its size, and its text. The service can also be used asynchronously through message queues on Redis.

Quick Start

Start the service:

make start

Get the paragraphs from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5051

To stop the server:

make stop

Dependencies

Requirements

  • 2 GB RAM
  • Single core

Docker containers

A Redis server is needed to use the service asynchronously. For that purpose, the command make start:testing starts the service together with a built-in Redis server.

Containers with make start

Containers with make start:testing

How to use it asynchronously

  1. Send PDF to extract

    curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5051/async_extraction/[tenant_name]

  2. Add extraction task

To add an extraction task, a message should be sent to a queue.

Python code:

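from rsmq import RedisSMQ  # provided by the PyRSMQ package

# Replace [redis host] and [redis port] with your Redis connection details
queue = RedisSMQ(host=[redis host], port=[redis port], qname='segmentation_tasks', quiet=True)
message_json = '{"tenant": "tenant_name", "task": "segmentation", "params": {"filename": "pdf_file_name.pdf"}}'
message = queue.sendMessage().message(message_json).exceptions(False).execute()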
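Writing the JSON by hand becomes error-prone once tenant or file names contain quotes or other special characters. The following is a minimal sketch, assuming the message fields shown above (tenant, task, params.filename); the helper name build_task_message is hypothetical and not part of the service.

```python
import json


def build_task_message(tenant: str, filename: str) -> str:
    """Serialize a segmentation task using the fields shown above."""
    return json.dumps(
        {
            "tenant": tenant,
            "task": "segmentation",
            "params": {"filename": filename},
        }
    )


# Placeholder values; use your own tenant and file name.
message_json = build_task_message("tenant_name", "pdf_file_name.pdf")
print(message_json)
```

The resulting string can be sent to the segmentation_tasks queue in place of the hand-written message.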

  3. Get paragraphs

When the segmentation task is done, a message is placed in the results queue:

queue = RedisSMQ(host=[redis host], port=[redis port], qname='segmentation_results', quiet=True)
results_message = queue.receiveMessage().exceptions(False).execute()

# results_message['message'] is a JSON string with the following information:
# {"tenant": "tenant_name",
# "task": "pdf_name.pdf",
# "success": true,
# "error_message": "",
# "data_url": "http://localhost:5051/get_paragraphs/[tenant_name]/[pdf_name]",
# "file_url": "http://localhost:5051/get_xml/[tenant_name]/[pdf_name]"
# }


curl -X GET http://localhost:5051/get_paragraphs/[tenant_name]/[pdf_name]
curl -X GET http://localhost:5051/get_xml/[tenant_name]/[pdf_name]

Or in Python:

results = json.loads(results_message['message'])  # needs: import json, requests
requests.get(results['data_url'])
requests.get(results['file_url'])
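The results message carries a success flag and an error_message, so it is worth checking them before requesting data_url. The sketch below assumes the message body is a JSON string with the fields listed above; the function name paragraphs_url and the sample values are illustrative only.

```python
import json
from typing import Optional


def paragraphs_url(raw_message: str) -> Optional[str]:
    """Return data_url from a results message, or None if the task failed."""
    result = json.loads(raw_message)
    if not result.get("success"):
        print(f"Extraction failed: {result.get('error_message')}")
        return None
    return result["data_url"]


# Sample payload with the documented fields; the values are placeholders.
sample = json.dumps({
    "tenant": "tenant_name",
    "task": "pdf_name.pdf",
    "success": True,
    "error_message": "",
    "data_url": "http://localhost:5051/get_paragraphs/tenant_name/pdf_name.pdf",
    "file_url": "http://localhost:5051/get_xml/tenant_name/pdf_name.pdf",
})
print(paragraphs_url(sample))
```

Returning None on failure keeps the caller from requesting a URL for a document that was never segmented.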

HTTP server

The HTTP server container is written in Python 3.9 and uses the FastAPI web framework.

If the service is running, the endpoint definitions can be found at the following URL:

http://localhost:5051/docs
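FastAPI also serves a machine-readable OpenAPI schema, by default at /openapi.json, next to the /docs page. Assuming this service keeps that default (not confirmed by this README), the endpoints could be listed programmatically as sketched below; list_endpoints and fetch_schema are hypothetical helper names.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema: dict) -> list:
    """Return sorted 'METHOD /path' strings from an OpenAPI schema dict."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return sorted(endpoints)


def fetch_schema(base_url: str = "http://localhost:5051") -> dict:
    """Download the schema; assumes FastAPI's default /openapi.json path."""
    with urllib.request.urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)
```

With the service running, list_endpoints(fetch_schema()) returns one entry per route and method.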

The endpoint code is in the file app.py.

Errors are logged to the file docker_volume/service.log unless the configuration is changed (see Get service logs).

Queue processor

The Queue processor container is written in Python 3.9 and is in charge of the communication with Redis.

Its code is in the file QueueProcessor.py, and it uses the RedisSMQ library to interact with the Redis queues.

Service configuration

Some parameters can be configured using environment variables. If a parameter is not set, its default value is used.

Default parameters:

REDIS_HOST=redis_paragraphs
REDIS_PORT=6379
MONGO_HOST=mongo_paragraphs
MONGO_PORT=28017
SERVICE_HOST=http://127.0.0.1
SERVICE_PORT=5051
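A client or test script may want to resolve the same settings. The sketch below shows one way to read these variables with the defaults listed above; how the service itself loads them is not shown in this README, and load_config is a hypothetical helper.

```python
import os
from typing import Mapping

# Defaults copied from the list above.
DEFAULTS = {
    "REDIS_HOST": "redis_paragraphs",
    "REDIS_PORT": "6379",
    "MONGO_HOST": "mongo_paragraphs",
    "MONGO_PORT": "28017",
    "SERVICE_HOST": "http://127.0.0.1",
    "SERVICE_PORT": "5051",
}


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Apply environment overrides on top of the documented defaults."""
    config = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Environment values are strings; convert the ports to int.
    for name in ("REDIS_PORT", "MONGO_PORT", "SERVICE_PORT"):
        config[name] = int(config[name])
    return config


# Override only the Redis host; everything else falls back to the defaults.
config = load_config({"REDIS_HOST": "localhost"})
print(config["REDIS_HOST"], config["REDIS_PORT"])
```

Passing a plain dict instead of os.environ makes the override behavior easy to check without touching the real environment.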

Set up environment for development

It works with Python 3.9 (install: https://runnable.com/docker/getting-started/)

make install_venv

Train the paragraph extraction model

NOTE: The model training was only tested using Python 3.11

Get the labeled data

  git clone https://github.com/huridocs/pdf-labeled-data.git

Place the pdf-labeled-data project in the same folder as this repository

.
├── pdf_paragraphs_extraction       
├── pdf-labeled-data                 

Install the virtual environment and initialize it

  make install_venv
  source venv/bin/activate

Create the paragraph extraction model

  python src/create_paragraph_extractor_model.py

The trained model is in the following path

  model/paragraph_extraction_model.model

Execute tests

make test

Troubleshooting

Issue: Error downloading pip wheel

Solution: Increase the RAM available to the Docker containers to 3 GB or 4 GB
