lambda-text-extractor's Issues

Source Bucket Lambda Trigger

Currently the function is set up to run only through a manual invoke.

What would be the best steps to use a source bucket and a destination bucket?
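
One possible approach, sketched under assumptions: attach an S3 "object created" notification on the source bucket to a small trigger function, and have it build one extractor payload per uploaded object. The payload keys (`document_uri`, `temp_uri_prefix`, `text_uri`) come from the README invoke example quoted in the next issue; whether `document_uri` accepts an `s3://` URI is an assumption (that example uses an https URL), and `build_payloads` and the bucket names are hypothetical.

```python
from urllib.parse import unquote_plus


def build_payloads(event, dest_bucket, temp_prefix):
    """Turn an S3 event notification into one extractor payload per object.

    Hypothetical helper: the bucket names and the s3:// form of
    document_uri are assumptions, not confirmed by the project.
    """
    payloads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 events are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        payloads.append({
            "document_uri": "s3://{}/{}".format(bucket, key),
            "temp_uri_prefix": temp_prefix,
            "text_uri": "s3://{}/{}.txt".format(dest_bucket, key),
        })
    return payloads
```

Each payload could then be passed to the extractor asynchronously, for example with boto3's Lambda client: `invoke(FunctionName="textractor_simple", InvocationType="Event", Payload=json.dumps(payload))`. Keeping the destination bucket separate from the source bucket avoids the extracted `.txt` files re-triggering the function.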

Failing to Extract Text on Lambda

Hi, I've just deployed the new version of your code, but I'm getting errors. In particular, when I try to run the example given on the Readme:

aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://text-extractor/", "text_uri": "s3://text-extractor/tracemonkey.txt"}' -

I get a:

{
    "StatusCode": 200
}

There are no errors reported on the Lambda, but the extracted text file has 0 bytes, and CloudWatch logs this:

[ERROR] 2017-11-09T20:32:36.918Z 1e9cea26-c58d-11e7-9503-b7e3017ab9c2 Subprocess ['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt'] returned 127:
Traceback (most recent call last):
  File "/var/task/utils.py", line 8, in get_subprocess_output
    output = subprocess.check_output(cmdline, **kwargs)
  File "/var/lang/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/var/lang/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt']' returned non-zero exit status 127.

I'm a little puzzled. I believe the pdftotext binary should be in the bin/ directory of the function. Could the shared libraries be the problem? Is it working for you?

Thanks!
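
A diagnostic sketch that may help narrow this down. Since the traceback shows the binary being run directly (no shell), a missing file would normally raise `FileNotFoundError` in Python rather than return a status; an exit status of 127 from a binary that does exist is commonly produced by the dynamic loader when a shared library cannot be found, though that is a guess here. The helper below (`run_and_report` is a hypothetical name) captures stderr together with stdout so the actual message shows up in the logs.

```python
import subprocess


def run_and_report(cmdline):
    """Run a command and return (exit_status, combined stdout/stderr text).

    Returns (None, message) when the process could not be started at all,
    e.g. because the binary is missing or not executable.
    """
    try:
        proc = subprocess.run(
            cmdline,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into stdout so nothing is lost
        )
    except OSError as exc:
        return None, str(exc)
    return proc.returncode, proc.stdout.decode("utf-8", errors="replace")
```

Calling `run_and_report(['/var/task/bin/pdftotext', '-v'])` inside the function and logging the result should show whether the loader is complaining about a specific `.so` file.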

Received error when trying to parse .jpg file.

Following is the error response:

{"errorMessage": "local variable 'textractor_results' referenced before assignment", "errorType": "UnboundLocalError", "stackTrace": [["/var/task/main.py", 128, "handle", "payload['results']['textractor'] = textractor_results"]]}
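
The code in main.py is not shown here, so the following is only a minimal sketch of the kind of pattern that produces this error: a local variable assigned on some branches only, then read unconditionally. The function names, the extension list, and the result dictionaries are all hypothetical.

```python
SUPPORTED = ("pdf", "doc", "rtf")  # hypothetical list of handled extensions


def handle_buggy(extension):
    # textractor_results is only bound when the extension is supported...
    if extension in SUPPORTED:
        textractor_results = {"method": "hypothetical"}
    # ...so this read raises UnboundLocalError for anything else (e.g. "jpg").
    return {"results": {"textractor": textractor_results}}


def handle_guarded(extension):
    # Bind the name up front and report unsupported input explicitly.
    textractor_results = None
    if extension in SUPPORTED:
        textractor_results = {"method": "hypothetical"}
    if textractor_results is None:
        return {"success": False, "reason": "unsupported extension: " + extension}
    return {"success": True, "results": {"textractor": textractor_results}}
```

With a guard like this, an unsupported file type would produce a readable failure message instead of a stack trace.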

Antiword and UnRTF failing

Hi,

First of all, I'd like to thank you for your awesome repo!

However, while testing it I ran into some errors. The PDF extractor lambda works well, but the office extractor lambda failed with both an RTF file and a DOC file.

These are the messages:

For UnRTF:
"reason": "Exception while executing ['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']: Command '['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']' returned non-zero exit status 1 (output=No config directories. Searched: /var/task/lib/unrtf\n)"

For Antiword:
"reason": "Exception while executing ['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']: Command '['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']' returned non-zero exit status 1 (output=I can't find the name of your HOME directory\nI can't open your mapping file (UTF-8.txt)\nIt is not in '/.antiword' nor in '/usr/share/antiword'.\n\tName: antiword\n\tPurpose: Display MS-Word files\n\tAuthor: (C) 1998-2005 Adri van Os\n\tVersion: 0.37 (21 Oct 2005)\n\tStatus: GNU General Public License\n\tUsage: antiword [switches] wordfile1 [wordfile2 ...]\n\tSwitches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]\n\t\t-f formatted text output\n\t\t-t text output (default)\n\t\t-a <paper size name> Adobe PDF output\n\t\t-p <paper size name> PostScript output\n\t\t paper size like: a4, letter or legal\n\t\t-x <dtd> XML output\n\t\t like: db (DocBook)\n\t\t-m <mapping> character mapping file\n\t\t-w <width> in characters of text output\n\t\t-i <level> image level (PostScript only)\n\t\t-L use landscape mode (PostScript only)\n\t\t-r Show removed text\n\t\t-s Show hidden (by Word) text\n)"

Do you know what the reason might be? I just ran apex deploy from a cloned version of your repo, with my IAM role. From what I can see, the code seems to be looking for a lib folder, whereas the deployed package seems to contain a lib-linux_x86 folder instead. I'm not sure, though, and it might have nothing to do with it.

Any pointers would be very welcome. I can do more testing if you point me in the right direction.

Thanks!

Santiago.
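
A sketch of how the lib versus lib-linux_x86 guess could be tested, under stated assumptions. It looks for each tool's resource directory under both folder names and builds an environment for the subprocess. The per-tool subdirectory layout is an assumption based only on the paths in the error messages; `find_resource_dir` and `build_tool_env` are hypothetical helpers; and, to the best of my knowledge, antiword consults the `ANTIWORDHOME` environment variable before `$HOME/.antiword`, which is worth confirming against its man page.

```python
import os


def find_resource_dir(task_root, tool, lib_names=("lib", "lib-linux_x86")):
    """Return the first existing <task_root>/<lib_name>/<tool> directory, or None."""
    for lib_name in lib_names:
        candidate = os.path.join(task_root, lib_name, tool)
        if os.path.isdir(candidate):
            return candidate
    return None


def build_tool_env(task_root, base_env=None):
    """Build an environment dict for running antiword as a subprocess."""
    env = dict(os.environ if base_env is None else base_env)
    # antiword complains when HOME is unset; /tmp is an arbitrary writable choice.
    env.setdefault("HOME", "/tmp")
    antiword_dir = find_resource_dir(task_root, "antiword")
    if antiword_dir is not None:
        # Assumption: antiword reads its mapping files (UTF-8.txt) from here.
        env["ANTIWORDHOME"] = antiword_dir
    return env
```

The same lookup could supply the directory passed to unrtf's `-P` option, e.g. `find_resource_dir('/var/task', 'unrtf')`, and the environment can be handed to `subprocess.check_output(..., env=build_tool_env('/var/task'))`.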
