skylander86 / lambda-text-extractor


AWS Lambda functions to extract text from various binary formats.

License: Apache License 2.0

Python 87.47% C 8.67% XSLT 2.47% C++ 0.94% Objective-C 0.46%

text-extraction aws-lambda searchable-pdfs ocr lambda-functions pdf pdf-ocr-extraction tesseract

lambda-text-extractor's Issues

Source Bucket Lambda Trigger

Currently, the function is set up to be invoked manually.

What would be the best steps to use a source bucket and a destination bucket?
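One possible approach, sketched here under assumptions, is a small trigger function subscribed to the source bucket's object-created notifications that turns each new object into the payload shape used in the README example (`document_uri`, `temp_uri_prefix`, `text_uri`). The function name `payloads_from_s3_event`, the `tmp/` prefix, and the `.txt` naming scheme are hypothetical, and the sketch assumes the extractor accepts `s3://` URIs for `document_uri` as it does for `text_uri`. Forwarding each payload to the extractor (for example with the boto3 Lambda client's `invoke`) is left out.

```python
from urllib.parse import unquote_plus


def payloads_from_s3_event(event, dest_bucket):
    """Build one extractor payload per object in an S3 event notification."""
    payloads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        payloads.append({
            "document_uri": "s3://{}/{}".format(bucket, key),
            # Hypothetical layout: scratch files and output go to the destination bucket.
            "temp_uri_prefix": "s3://{}/tmp/".format(dest_bucket),
            "text_uri": "s3://{}/{}.txt".format(dest_bucket, key),
        })
    return payloads
```

Writing the output to a different bucket than the one that triggers the function avoids the function re-triggering itself on its own output.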

Add to awesome-functions

It's a good candidate for https://github.com/plutov/awesome-functions

Failing to Extract Text on Lambda

Hi, I've just deployed the new version of your code, but I'm getting errors. In particular, when I run the example given in the README:

aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://text-extractor/", "text_uri": "s3://text-extractor/tracemonkey.txt"}' -

I get a:

{
    "StatusCode": 200
}

The Lambda reports no errors, but the extracted text file is 0 bytes, and CloudWatch logs this:

[ERROR] 2017-11-09T20:32:36.918Z 1e9cea26-c58d-11e7-9503-b7e3017ab9c2 Subprocess ['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt'] returned 127:
Traceback (most recent call last):
  File "/var/task/utils.py", line 8, in get_subprocess_output
    output = subprocess.check_output(cmdline, **kwargs)
  File "/var/lang/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/var/lang/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt']' returned non-zero exit status 127.

I'm a little puzzled. I believe the pdftotext binary should be in the bin/ directory of the function. Could the shared libraries be the problem? Is it working for you?

Thanks!
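One common cause of exit status 127 from a binary that does exist is the dynamic loader failing to find a shared library. A minimal diagnostic sketch for checking this from inside the function follows. The helper names `parse_missing_libs` and `diagnose_binary` are hypothetical, and the sketch assumes `ldd` is available in the runtime environment, which may not hold.

```python
import os
import subprocess


def parse_missing_libs(ldd_output):
    """Return names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def diagnose_binary(path):
    """Collect basic facts about a bundled binary for logging."""
    report = {"exists": os.path.isfile(path), "executable": False, "missing_libs": []}
    if not report["exists"]:
        return report
    report["executable"] = os.access(path, os.X_OK)
    try:
        result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, universal_newlines=True)
        report["missing_libs"] = parse_missing_libs(result.stdout)
    except OSError:
        # ldd itself is not available in this environment.
        report["missing_libs"] = None
    return report
```

Logging `diagnose_binary('/var/task/bin/pdftotext')` at the start of a failing invocation would show whether the file is present, executable, and fully linked.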

Received error when trying to parse .jpg file.

The following is the error response:

{"errorMessage": "local variable 'textractor_results' referenced before assignment", "errorType": "UnboundLocalError", "stackTrace": [["/var/task/main.py", 128, "handle", "payload['results']['textractor'] = textractor_results"]]}
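An `UnboundLocalError` of this kind typically means a variable is assigned only on some code paths, for instance only when a known file type is matched. The contents of `main.py` are not shown here, so the following is a hypothetical illustration of the general guard pattern rather than the repository's code; `extract` and `handlers` are invented names.

```python
def extract(extension, handlers):
    """Dispatch to a handler by file extension; fail loudly if none matches."""
    textractor_results = None  # bind the name on every code path
    handler = handlers.get(extension.lower())
    if handler is not None:
        textractor_results = handler()
    if textractor_results is None:
        # An explicit error is easier to diagnose than an UnboundLocalError.
        raise ValueError("no extractor available for {!r}".format(extension))
    return textractor_results
```

With this pattern, an unsupported input such as a `.jpg` sent to a function without an image handler produces a descriptive error instead of failing at the later assignment.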

Antiword and UnRTF failing

Hi,

First of all, I'd like to thank you for your awesome repo!

However, while testing it I ran into some errors. The PDF extractor lambda works well, but the office extractor lambda failed with both an RTF file and a DOC file.

These are the messages:

For UnRTF:
"reason": "Exception while executing ['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']: Command '['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']' returned non-zero exit status 1 (output=No config directories. Searched: /var/task/lib/unrtf\n)"

For Antiword:
"reason": "Exception while executing ['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']: Command '['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']' returned non-zero exit status 1 (output=I can't find the name of your HOME directory\nI can't open your mapping file (UTF-8.txt)\nIt is not in '/.antiword' nor in '/usr/share/antiword'.\n\tName: antiword\n\tPurpose: Display MS-Word files\n\tAuthor: (C) 1998-2005 Adri van Os\n\tVersion: 0.37 (21 Oct 2005)\n\tStatus: GNU General Public License\n\tUsage: antiword [switches] wordfile1 [wordfile2 ...]\n\tSwitches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]\n\t\t-f formatted text output\n\t\t-t text output (default)\n\t\t-a <paper size name> Adobe PDF output\n\t\t-p <paper size name> PostScript output\n\t\t paper size like: a4, letter or legal\n\t\t-x <dtd> XML output\n\t\t like: db (DocBook)\n\t\t-m <mapping> character mapping file\n\t\t-w <width> in characters of text output\n\t\t-i <level> image level (PostScript only)\n\t\t-L use landscape mode (PostScript only)\n\t\t-r Show removed text\n\t\t-s Show hidden (by Word) text\n)"

Do you know what the reason might be? I just used apex deploy from a cloned version of your repo, with my IAM role. From what I can see, it looks like the code is looking for a lib folder where there is instead a lib-linux_x86 folder, although I'm not sure and it might have nothing to do with it.

Any pointers would be very welcome. I can do more testing if you point me in the right direction.

Thanks!

Santiago.
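Both messages point to resource directories that the binaries cannot locate. Antiword consults the `ANTIWORDHOME` environment variable before falling back to the home directory and `/usr/share/antiword`, and UnRTF reads its config from the directory passed to `-P`. The sketch below shows one way to resolve these at runtime. The helper names are hypothetical, and so is any candidate path other than `/var/task/lib/unrtf`, which comes from the log above.

```python
import os


def first_existing_dir(candidates):
    """Return the first candidate path that is a directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


def antiword_env(resource_dir, base_env=None):
    """Copy the environment and point antiword at its mapping files."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTIWORDHOME"] = resource_dir
    # Give antiword a HOME if the runtime does not define one.
    env.setdefault("HOME", "/tmp")
    return env
```

For example, `first_existing_dir(['/var/task/lib/unrtf', '/var/task/lib-linux_x86/unrtf'])` could supply the `-P` argument, and the dictionary from `antiword_env` could be passed as `env=` to the subprocess call that runs antiword.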
