lambda-text-extractor's Introduction

Extracting Text from Binary Document Formats using AWS Lambda

lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.

Features

Some of its key features are:

  • out of the box support for many common binary document formats (see section on Supported Formats),
  • scalable PDF parsing with OCR, run in parallel using AWS Lambda and asyncio,
  • creation of text searchable PDFs after OCR,
  • serverless architecture makes deployment quick and easy,
  • detailed instructions for preparing the libraries and dependencies needed to process binary documents, and
  • sensible Unicode handling

Supported Formats

lambda-text-extractor supports many common and legacy document formats:

  • Portable Document Format (.pdf),
  • Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
  • Microsoft Word 2007 OpenXML files (.docx) using python-docx,
  • Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
  • Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
  • OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
  • Rich Text Format (.rtf) using UnRTF v0.21.9,
  • XML files and HTML web pages (.html, .htm, .xml) using lxml,
  • CSV files (.csv) using Python csv module,
  • Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
  • Plain text files (.txt)

Setup

Due to the size of the code and dependencies (and AWS Lambda's 50 MB package limit), the extraction system is split into two Lambda functions: simple and ocr. ocr supports extracting text from images and "image" PDFs, while simple handles text extraction from the remaining formats. A side benefit of splitting into two functions is that the memory requirements of each can be configured independently.

We use apex as our development toolchain to deploy the AWS Lambda functions; the code for the two functions is found in the functions directory. To deploy to AWS (the -D flag performs a dry run):

apex -D deploy

You need to ensure your IAM role has lambda:InvokeAsync permissions, and s3:PutObject permissions on the output bucket. Generally, we would advise using a dedicated bucket with auto-delete lifecycle rules for temporary storage. You can set the IAM role and other configuration options in project.json.

The speed of parsing depends on CPU, which in AWS Lambda is controlled by the amount of memory allocated to your functions. For our needs, we find that 512 MB for simple and 1024 MB for ocr is a good balance between performance and cost.

Usage

Non OCR Text Extraction

The simple function expects an event with

  • document_uri: A URI pointing to the document to extract text from, e.g., s3://bucket/key.pdf.
  • temp_uri_prefix (optional): A URI prefix where temporary files can be stored. Defaults to <document_uri>-temp if not set.
  • text_uri (optional): A URI where the extracted text will be stored, e.g., s3://bucket/key.txt. Defaults to <document_uri>.txt if not set.
  • disable_ocr (optional): Whether to disable the OCR fallback. Defaults to False.
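The defaults above can be sketched in plain Python. This is an illustrative helper, not the project's actual code; resolve_event is a hypothetical name:

```python
# Sketch (assumption: not the project's real code) of how the documented
# defaults for a `simple` event could be resolved. Field names and default
# rules come from the README above.

def resolve_event(event):
    document_uri = event["document_uri"]  # required field
    return {
        "document_uri": document_uri,
        # temp_uri_prefix defaults to <document_uri>-temp
        "temp_uri_prefix": event.get("temp_uri_prefix", document_uri + "-temp"),
        # text_uri defaults to <document_uri>.txt
        "text_uri": event.get("text_uri", document_uri + ".txt"),
        # OCR fallback is enabled unless explicitly disabled
        "disable_ocr": event.get("disable_ocr", False),
    }

resolved = resolve_event({"document_uri": "s3://bucket/key.pdf"})
print(resolved["text_uri"])         # s3://bucket/key.pdf.txt
print(resolved["temp_uri_prefix"])  # s3://bucket/key.pdf-temp
```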

Example

aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://bucket/", "text_uri": "s3://bucket/tracemonkey.txt"}' -

aws s3 cp s3://bucket/tracemonkey.txt -

simple automatically falls back to the ocr function when all of the following hold:

  • file is a PDF (i.e., ends with .pdf),
  • text content is shorter than 32 characters, and
  • disable_ocr is False.
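The three conditions above can be expressed as a simple predicate. This is a sketch; should_fallback_to_ocr is a hypothetical name, and the 32-character threshold is the one stated above:

```python
def should_fallback_to_ocr(document_uri, extracted_text, disable_ocr=False):
    """Hypothetical predicate mirroring the three documented fallback conditions."""
    return (
        document_uri.lower().endswith(".pdf")   # file is a PDF
        and len(extracted_text) < 32            # too little text was extracted
        and not disable_ocr                     # caller did not opt out of OCR
    )

print(should_fallback_to_ocr("s3://bucket/scan.pdf", ""))   # True
print(should_fallback_to_ocr("s3://bucket/doc.docx", ""))   # False
```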

OCR Text Extraction

The ocr function expects the same event as simple, with the following additional fields:

  • searchable_pdf_uri: A URI where the searchable version of the PDF is stored. Defaults to <document_uri>.searchable.pdf.
  • create_searchable_pdf: Whether to create a searchable PDF. Defaults to True.
  • page: The page number on which to perform PDF OCR extraction. Defaults to all pages.

Searchable PDF creation may take significantly longer than text extraction alone. Because OCR PDF extraction involves multiple steps, several additional variables (set through environment variables) configure its behavior.

  • MERGE_SEARCHABLE_PDF_DURATION: The maximum number of seconds to take for searchable PDF merging. Defaults to 90 seconds.
  • RETURN_RESULTS_DURATION: The number of seconds to reserve at the end for compiling results and returning them. Defaults to 3 seconds.
  • TEXTRACT_OUTPUT_WAIT_BUFFER_TIME: The number of seconds to reserve for the overhead in async wait of each page's OCR Lambda functions to return. Defaults to 5 seconds.

For more details about how PDF OCR extraction works, see the section on PDF OCR Extraction.

Example

aws lambda invoke --function-name textractor_ocr --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://bucket/", "text_uri": "s3://bucket/tracemonkey.txt", "searchable_pdf_uri": "s3://bucket/tracemonkey.searchable.pdf"}' -

aws s3 cp s3://bucket/tracemonkey.txt -

PDF OCR Extraction

Due to the slow nature of OCR on images and AWS Lambda's 300-second execution limit, we use a workaround (i.e., additional Lambda invocations) to OCR the pages of a PDF in parallel, using S3 as our temporary store.

When we determine that a PDF needs to be processed using OCR (i.e., simple text extraction yields fewer than 512 bytes), we automatically invoke ocr for each page of the PDF and wait for the results asynchronously (using asyncio and aiobotocore). The page field in the event determines which page that function call should OCR.

Basically, the steps for OCR extraction are as follows:

  1. Determine the number of pages in the PDF using pdfinfo. We find that this subprocess call is faster (and more robust) than using a Python PDF library like PyPDF2.
  2. Invoke ocr on each page of the document by passing in the page field, and wait for these Lambda invocations to complete using await. The intermediate output (i.e., the extracted text and searchable PDF for each page) is stored under the temp_uri_prefix folder.
  3. Download the intermediate outputs to the Lambda function's local filesystem.
  4. Combine the intermediate text and searchable PDFs, ignoring missing pages and files. Any missing pages are recorded in the metadata of the final text_uri and searchable_pdf_uri objects as missing_text_pages and missing_searchable_pdf_pages respectively.
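The per-page fan-out and fan-in follows the standard asyncio gather pattern. Here is a runnable sketch with the aiobotocore Lambda invocation replaced by a stub coroutine; ocr_page and ocr_all_pages are hypothetical names, and asyncio.run requires Python 3.7+:

```python
import asyncio

async def ocr_page(page):
    """Stub standing in for an async invocation of the ocr Lambda on one page."""
    await asyncio.sleep(0)  # placeholder for the real async Lambda call
    return page, f"text of page {page}"

async def ocr_all_pages(num_pages, timeout):
    # Fan out one task per page, then wait for all of them under a single
    # deadline derived from the remaining Lambda execution time.
    tasks = [ocr_page(p) for p in range(1, num_pages + 1)]
    results = await asyncio.wait_for(asyncio.gather(*tasks), timeout)
    return dict(results)

pages = asyncio.run(ocr_all_pages(3, timeout=10))
print(sorted(pages))  # [1, 2, 3]
```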

Steps 2 and 3 are performed concurrently and asynchronously, with a timeout computed as

REMAINING_TIME - MERGE_SEARCHABLE_PDF_DURATION - RETURN_RESULTS_DURATION - TEXTRACT_OUTPUT_WAIT_BUFFER_TIME

where REMAINING_TIME is the amount of time remaining after step 1.
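Putting the formula into code (a sketch; page_wait_timeout is a hypothetical helper, and inside Lambda the remaining time would come from context.get_remaining_time_in_millis()):

```python
def page_wait_timeout(remaining_time_s,
                      merge_searchable_pdf_duration=90.0,
                      return_results_duration=3.0,
                      textract_output_wait_buffer_time=5.0):
    """Seconds left for waiting on per-page OCR invocations.

    Reserves the documented durations for searchable-PDF merging, result
    return, and per-page async wait overhead. Defaults match the README.
    """
    return (remaining_time_s
            - merge_searchable_pdf_duration
            - return_results_duration
            - textract_output_wait_buffer_time)

print(page_wait_timeout(300))  # 202.0
```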

Based on our experience, merging searchable PDFs takes quite a while (and depends on the number of pages). On average, merging 100 pages of searchable PDFs takes about 60 seconds. If this is an issue for you, you may want to modify the code to fix the paths of the intermediate outputs and combine them yourself outside the Lambda infrastructure; currently, we use a random UUID for the filename of each intermediate output page. The relevant code is in the _invoke_textract_ocr_tasks method.

For OCR extraction on individual pages, we use Ghostscript to render the page into an image with basic image processing, then run Tesseract to extract the text. If create_searchable_pdf is enabled, Tesseract is instead used to directly create a searchable PDF, after which we use pdftotext for regular text extraction from the searchable PDF (instead of running Tesseract twice).
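A sketch of the per-page pipeline as subprocess command lines. The binaries and flags below are illustrative of the Ghostscript-then-Tesseract approach described above, not necessarily the exact flags the project uses, and page_commands is a hypothetical helper:

```python
def page_commands(pdf_path, page, out_base, searchable=True):
    """Build the (illustrative) commands to rasterize and OCR one PDF page."""
    # Ghostscript: render a single page to a 300-dpi PNG.
    gs_cmd = [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-r300",
        "-sDEVICE=png16m",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-sOutputFile={out_base}.png", pdf_path,
    ]
    # Tesseract: the "pdf" configfile emits a searchable PDF, "txt" plain text.
    tess_cmd = ["tesseract", f"{out_base}.png", out_base,
                "pdf" if searchable else "txt"]
    return gs_cmd, tess_cmd

gs_cmd, tess_cmd = page_commands("/tmp/in.pdf", 5, "/tmp/page-5")
# Each command list could then be run with subprocess.check_output(...).
```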

If anybody knows of a better pattern for processing PDFs, do feel free to submit a pull request!

Building Binaries

For more information on how we prepped the Lambda execution environment to run all these external software and libraries, see Building Binaries.

lambda-text-extractor's People

Contributors

skylander86

lambda-text-extractor's Issues

Failing to Extract Text on Lambda

Hi, I've just deployed the new version of your code, but I'm getting errors. In particular, when I try to run the example given in the Readme:

aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://text-extractor/", "text_uri": "s3://text-extractor/tracemonkey.txt"}' -

I get a:

{
    "StatusCode": 200
}

And no errors on the Lambda, but when I go to see the extracted text file, it has 0 bytes, and CloudWatch says this:

[ERROR] 2017-11-09T20:32:36.918Z 1e9cea26-c58d-11e7-9503-b7e3017ab9c2 Subprocess ['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt'] returned 127:
Traceback (most recent call last):
  File "/var/task/utils.py", line 8, in get_subprocess_output
    output = subprocess.check_output(cmdline, **kwargs)
  File "/var/lang/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/var/lang/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt']' returned non-zero exit status 127.

And I'm a little puzzled. I believe the pdftotext binary should be in the bin/ directory of the function. Maybe the libraries are having a problem? Is it working for you?

Thanks!

Antiword and UnRTF failing

Hi,

First of all, I'd like to thank you for your awesome repo!

However, I was testing it and ran into some errors. The PDF extractor Lambda works well. However, when I tried the office extractor Lambda, it failed with both an RTF and a DOC file.

These are the messages:

For UnRTF:
"reason": "Exception while executing ['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']: Command '['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']' returned non-zero exit status 1 (output=No config directories. Searched: /var/task/lib/unrtf\n)"

For Antiword:
"reason": "Exception while executing ['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']: Command '['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']' returned non-zero exit status 1 (output=I can't find the name of your HOME directory\nI can't open your mapping file (UTF-8.txt)\nIt is not in '/.antiword' nor in '/usr/share/antiword'.\n\tName: antiword\n\tPurpose: Display MS-Word files\n\tAuthor: (C) 1998-2005 Adri van Os\n\tVersion: 0.37 (21 Oct 2005)\n\tStatus: GNU General Public License\n\tUsage: antiword [switches] wordfile1 [wordfile2 ...]\n\tSwitches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]\n\t\t-f formatted text output\n\t\t-t text output (default)\n\t\t-a <paper size name> Adobe PDF output\n\t\t-p <paper size name> PostScript output\n\t\t paper size like: a4, letter or legal\n\t\t-x <dtd> XML output\n\t\t like: db (DocBook)\n\t\t-m <mapping> character mapping file\n\t\t-w <width> in characters of text output\n\t\t-i <level> image level (PostScript only)\n\t\t-L use landscape mode (PostScript only)\n\t\t-r Show removed text\n\t\t-s Show hidden (by Word) text\n)"

Do you know what the reason might be? I just used apex deploy from a cloned version of your repo, with my IAM role. From what I can see, it looks like it is searching for a lib folder where there is instead a lib-linux_x86 folder, although I'm not sure and it might have nothing to do with it.

Please, any pointers would be very welcome. I can do more testing if you point me in the right direction.

Thanks!

Santiago.

Received error when trying to parse .jpg file.

Following is the error response:

{"errorMessage": "local variable 'textractor_results' referenced before assignment", "errorType": "UnboundLocalError", "stackTrace": [["/var/task/main.py", 128, "handle", "payload['results']['textractor'] = textractor_results"]]}

Source Bucket Lambda Trigger

Currently, the way this is set up requires a manual invoke.

What would be the best steps to use a source bucket and a destination bucket?
