nifi-extracttext-processor

An Apache NiFi custom processor that extracts text from files using Apache Tika.

See my article and example here:

https://community.hortonworks.com/articles/163776/parsing-any-document-with-apache-nifi-15-with-apac.html

Try this setup: https://community.hortonworks.com/storage/attachments/56409-tika.xml

https://community.hortonworks.com/articles/81694/extracttext-nifi-custom-processor-powered-by-apach.html

For the latest version see here:

https://community.hortonworks.com/articles/177370/extracting-html-from-pdf-excel-and-word-documents.html

nifi-extracttext-processor's Issues

ZeroByteFileException from Tika?

@tspannhw When using this code, the unit tests work fine and return data after the enqueue/run methods are called. But once I deploy to NiFi, I keep getting this Tika ZeroByteFileException message: "InputStream must have > 0 bytes." This happens when sending in the same PDF file used for the unit tests. I can't seem to find any information about this...

I have confirmed from a post by Bryan Bende that the NAR packages up all required libraries, and I have even unzipped the NAR to verify that the Tika libraries were included. NiFi starts up fine, so I really don't think it's a missing-library issue. The processor is accessible in the NiFi UI and can be configured. It just doesn't seem to receive the input properly.

Were there any additional installation tasks for your processor other than dropping the NAR in the /nifi/lib/ dir? I think Tika allows custom configuration through XML files; did you have to specify a custom config at all? I can't make sense of this exception and figure it must be an install issue. Any thoughts?

I'm using NiFi 1.5.0, Tika 1.17, and JDK 8. PDFBox 2.0.8 is also present.

Note: I also have a simple PDFBox-based custom processor hooked up in parallel in the NiFi flow. That processor receives the same PDF input file, reads it fine, and parses the output. So I suppose that eliminates any potential issue with NiFi not properly "delivering" the input file as a Java IO InputStream.
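One way to narrow this down is to check whether the stream is already empty at the point where it is handed to the parser. The following is a minimal diagnostic sketch using only the JDK; `StreamProbe` and `requireNonEmpty` are hypothetical names and not part of this processor, NiFi, or Tika. It does not identify the cause of the exception. It only separates "the stream arrived empty" from "something went wrong later during parsing".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not part of the processor): peeks at the first byte
// of a stream and fails early, with a clear message, if there is none.
final class StreamProbe {

    private StreamProbe() {
    }

    // Returns a stream that yields exactly the same bytes as the input,
    // or throws if the input has no bytes at all.
    static InputStream requireNonEmpty(InputStream in) throws IOException {
        PushbackInputStream probe = new PushbackInputStream(in, 1);
        int first = probe.read();
        if (first == -1) {
            throw new IOException("input stream is empty before parsing");
        }
        // Put the byte back so the caller sees the stream from its start.
        probe.unread(first);
        return probe;
    }

    public static void main(String[] args) throws IOException {
        InputStream ok = requireNonEmpty(new ByteArrayInputStream(new byte[] {37, 80, 68, 70}));
        System.out.println("first byte: " + ok.read());
    }
}
```

If the probe throws inside the deployed processor but not in the unit test, the content was missing or already consumed before the parser saw it, which points at how the processor reads the flow file rather than at the Tika installation.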

Payload lost on failure

I have a PDF document that has the "prevent extraction" flag set.

The processor (very reasonably) fails, but a zero-byte flow file is routed to failure.

I would expect the original flow file to be routed to failure.
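The expected behavior can be sketched in plain Java, independent of the NiFi API. Everything here (`ExtractionRouting`, `Route`, `Result`, `extractOrKeepOriginal`) is a hypothetical illustration, not the processor's actual code: the original bytes are kept until extraction has succeeded, so a failure can still hand back the untouched input.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical model of the routing decision (not the processor's real code).
final class ExtractionRouting {

    enum Route { SUCCESS, FAILURE }

    // The route chosen plus the content that travels along it.
    static final class Result {
        final Route route;
        final byte[] content;

        Result(Route route, byte[] content) {
            this.route = route;
            this.content = content;
        }
    }

    private ExtractionRouting() {
    }

    // Runs the extractor; on success returns the extracted text,
    // on any runtime failure returns the original bytes unchanged.
    static Result extractOrKeepOriginal(byte[] original, Function<byte[], String> extractor) {
        try {
            String text = extractor.apply(original);
            return new Result(Route.SUCCESS, text.getBytes(StandardCharsets.UTF_8));
        } catch (RuntimeException e) {
            // Nothing has been overwritten yet, so the input is still intact.
            return new Result(Route.FAILURE, original);
        }
    }

    public static void main(String[] args) {
        byte[] input = "protected document".getBytes(StandardCharsets.UTF_8);
        Result r = extractOrKeepOriginal(input, b -> {
            throw new IllegalStateException("extraction not permitted");
        });
        System.out.println(r.route + " with " + r.content.length + " bytes");
    }
}
```

In a NiFi processor the analogous choice would be to transfer the incoming flow file itself to the failure relationship, rather than a flow file whose content was replaced before the extraction failed.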

Option to extract HTML
