skylander86 / lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
License: Apache License 2.0
Currently this is set up to be invoked manually.
What would be the best way to wire it to a source bucket and a destination bucket?
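One possible approach, sketched below, is a small forwarding Lambda subscribed to the source bucket's object-created notifications. It builds the same payload keys used in the README example (document_uri, temp_uri_prefix, text_uri) and invokes the extractor asynchronously. The bucket name, the temp prefix, and the assumption that the extractor accepts an s3:// document_uri are all hypothetical and would need checking against your deployment.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical names: replace with the buckets from your own deployment.
DEST_BUCKET = "my-extracted-text"
TEMP_PREFIX = "s3://my-extracted-text/tmp/"


def build_payloads(event, dest_bucket=DEST_BUCKET, temp_prefix=TEMP_PREFIX):
    """Turn an S3 object-created event into one extractor payload per object."""
    payloads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        payloads.append({
            # Assumption: the extractor can read s3:// URIs as input.
            "document_uri": f"s3://{bucket}/{key}",
            "temp_uri_prefix": temp_prefix,
            "text_uri": f"s3://{dest_bucket}/{key}.txt",
        })
    return payloads


def handler(event, context):
    """Entry point for the forwarding Lambda attached to the source bucket."""
    import boto3  # imported lazily so build_payloads can be tested without AWS

    client = boto3.client("lambda")
    for payload in build_payloads(event):
        client.invoke(
            FunctionName="textractor_simple",
            InvocationType="Event",  # asynchronous invocation
            Payload=json.dumps(payload).encode("utf-8"),
        )
```

Keeping the payload construction separate from the invoke call makes it easy to test the key and bucket mapping locally before attaching the trigger.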
It's a good candidate for https://github.com/plutov/awesome-functions
Hi, I've just deployed the new version of your code, but I'm getting errors. In particular, when I try to run the example given in the README:
aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://text-extractor/", "text_uri": "s3://text-extractor/tracemonkey.txt"}' -
I get a:
{
"StatusCode": 200
}
There are no errors on the Lambda, but when I look at the extracted text file it is 0 bytes, and CloudWatch shows this:
[ERROR] 2017-11-09T20:32:36.918Z 1e9cea26-c58d-11e7-9503-b7e3017ab9c2 Subprocess ['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt'] returned 127:
Traceback (most recent call last):
File "/var/task/utils.py", line 8, in get_subprocess_output
output = subprocess.check_output(cmdline, **kwargs)
File "/var/lang/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/var/lang/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt']' returned non-zero exit status 127.
And I'm a little puzzled. I believe the pdftotext binary should be in the bin/ directory of the function. Maybe the libraries are having a problem? Is it working for you?
Thanks!
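The log line above ends right after "returned 127:" with no output, which suggests stderr was not captured. Exit status 127 often means either that the executable was not found or that the dynamic loader could not resolve a shared library, and the stderr text usually says which. A minimal diagnostic sketch, assuming you can add a temporary helper to the function (run_and_report is a hypothetical name, not part of the repo):

```python
import subprocess


def run_and_report(cmdline):
    """Run a command and return (exit_code, combined stdout and stderr text)."""
    try:
        proc = subprocess.run(
            cmdline,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
        )
    except FileNotFoundError as exc:
        # The executable itself is missing from the deployment package.
        return 127, str(exc)
    except PermissionError as exc:
        # The file exists but is not marked executable.
        return 126, str(exc)
    return proc.returncode, proc.stdout.decode("utf-8", errors="replace")
```

Calling run_and_report(['/var/task/bin/pdftotext', '-v']) from the handler and logging the result should show whether the binary is missing, not executable, or failing to load a library.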
The following is the error response:
{"errorMessage": "local variable 'textractor_results' referenced before assignment", "errorType": "UnboundLocalError", "stackTrace": [["/var/task/main.py", 128, "handle", "payload['results']['textractor'] = textractor_results"]]}
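An UnboundLocalError like this one typically arises when a local variable is assigned only on some code paths (inside an if, try, or loop) and then read unconditionally. The sketch below reproduces the pattern with hypothetical functions; it is not the actual code from main.py, only an illustration of the failure mode and one common fix.

```python
def handle_buggy(document_ok):
    payload = {"results": {}}
    if document_ok:
        textractor_results = {"text": "..."}  # only bound on this path
    # Raises UnboundLocalError when document_ok is False.
    payload["results"]["textractor"] = textractor_results
    return payload


def handle_fixed(document_ok):
    payload = {"results": {}}
    textractor_results = None  # default so the name is always bound
    if document_ok:
        textractor_results = {"text": "..."}
    payload["results"]["textractor"] = textractor_results
    return payload
```

If the real handler assigns textractor_results inside a try block, the underlying exception from the extraction step is likely the more useful thing to log.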
Hi,
First of all, I'd like to thank you for your awesome repo!
However, while testing it I ran into some errors. The PDF extractor Lambda works well, but the office extractor Lambda failed with both an RTF file and a DOC file.
These are the messages:
For UnRTF:
"reason": "Exception while executing ['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']: Command '['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']' returned non-zero exit status 1 (output=No config directories. Searched: /var/task/lib/unrtf\n)"
For Antiword:
"reason": "Exception while executing ['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']: Command '['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']' returned non-zero exit status 1 (output=I can't find the name of your HOME directory\nI can't open your mapping file (UTF-8.txt)\nIt is not in '/.antiword' nor in '/usr/share/antiword'.\n\tName: antiword\n\tPurpose: Display MS-Word files\n\tAuthor: (C) 1998-2005 Adri van Os\n\tVersion: 0.37 (21 Oct 2005)\n\tStatus: GNU General Public License\n\tUsage: antiword [switches] wordfile1 [wordfile2 ...]\n\tSwitches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]\n\t\t-f formatted text output\n\t\t-t text output (default)\n\t\t-a <paper size name> Adobe PDF output\n\t\t-p <paper size name> PostScript output\n\t\t paper size like: a4, letter or legal\n\t\t-x <dtd> XML output\n\t\t like: db (DocBook)\n\t\t-m <mapping> character mapping file\n\t\t-w <width> in characters of text output\n\t\t-i <level> image level (PostScript only)\n\t\t-L use landscape mode (PostScript only)\n\t\t-r Show removed text\n\t\t-s Show hidden (by Word) text\n)"
Do you know what the reason might be? I just used apex deploy from a cloned version of your repo, with my IAM role. From what I can see, it looks like it is searching for a lib folder, whereas the repo seems to contain a lib-linux_x86 folder instead. I'm not sure, though, and it might have nothing to do with it.
Any pointers would be very welcome. I can do more testing if you point me in the right direction.
Thanks!
Santiago.
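Both messages point at files that were not found where the tools searched: unrtf looked in /var/task/lib/unrtf, and antiword reported no HOME directory and no mapping file. A quick way to test the folder-name suspicion is to log which candidate paths actually exist inside the deployed function. The sketch below is a hypothetical helper; the path list is taken from the error messages and from the reporter's observation, and the location of lib-linux_x86 under /var/task is an assumption.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Candidate paths: the first three appear in the error messages above;
# the last is a guess based on the folder name seen in the repository.
EXPECTED = [
    "/var/task/bin/unrtf",
    "/var/task/lib/unrtf",
    "/var/task/bin/antiword",
    "/var/task/lib-linux_x86",
]

if __name__ == "__main__":
    for path in missing_paths(EXPECTED):
        print("missing:", path)
```

If /var/task/lib/unrtf is reported missing while the lib-linux_x86 variant exists, that would support the idea that the packaged folder name does not match the path passed to the binaries.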