Apache NiFi Custom Processor Extracting Text From Files with Apache Tika
See my article and example here:
Try this setup https://community.hortonworks.com/storage/attachments/56409-tika.xml
For the latest version see here:
Apache NiFi Custom Processor Extracting Text From Files with Apache Tika
License: Apache License 2.0
@tspannhw When using this code, I am able to get the unit tests to work just fine and return data after the enqueue/run methods are called. But once I deploy to NiFi, I keep getting this Tika ZeroByteFileException message: "InputStream must have > 0 bytes." This happens when sending in the same PDF file used for the unit tests. I can't seem to find any information about this...
I have confirmed from a post by Brian Bende that the NAR packages up all required libraries, and I have even unzipped the NAR to verify that the Tika libraries were included. NiFi starts up fine, so I really don't think it's a missing library issue. The processor is accessible in the NiFi UI and can be configured. It just doesn't seem to get the input properly.
Were there any additional installation tasks for your processor other than dropping the NAR in the /nifi/lib/ dir? I think Tika does allow custom configurations through XML files. Did you have to specify a custom config at all? I can't make any sense of this exception and figure it must be an install issue. Any thoughts?
I'm using NiFi 1.5.0, Tika 1.17, JDK 8. I also have PDFBox 2.0.8 there.
*Note: I also have a simple PDFBox-based custom processor hooked up in parallel in the NiFi flow. This processor gets the PDF input file, reads it just fine, and parses the output. So I suppose that eliminates any potential issue with NiFi not "delivering" the input file as a Java IO InputStream properly.
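One way to narrow this down is to check whether the stream still has bytes at the point where it is handed to Tika. The sketch below uses only the JDK; the class `StreamCheck` and its methods are hypothetical helpers, not part of this processor, NiFi, or Tika. It also shows one possible cause (an assumption, not something confirmed in this thread): a stream that has already been read once looks empty to the next reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

public class StreamCheck {

    /**
     * Peeks one byte and pushes it back, so the caller still sees the full content.
     * Returns null when the stream has no bytes left.
     */
    public static InputStream nonEmptyOrNull(InputStream in) {
        try {
            PushbackInputStream pushback = new PushbackInputStream(in, 1);
            int first = pushback.read();
            if (first == -1) {
                return null; // nothing to parse; a parser would see zero bytes here
            }
            pushback.unread(first);
            return pushback;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the stream to the end and returns the number of bytes seen. */
    public static int drain(InputStream in) {
        try {
            byte[] buffer = new byte[4096];
            int total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});

        // First consumer: the content is intact after the peek.
        InputStream checked = nonEmptyOrNull(raw);
        System.out.println("bytes read by first consumer: " + drain(checked));

        // Second consumer on the same underlying stream: nothing is left.
        System.out.println("second consumer sees data: " + (nonEmptyOrNull(raw) != null));
    }
}
```

If a check like this reports an empty stream inside the deployed processor but not in the unit test, the difference is likely in how the content is read before parsing rather than in the NAR packaging.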
I have a PDF document that has the prevent extraction flag checked.
The processor (very reasonably) fails, but a zero-byte flow file is routed to failure.
I would expect the original flow file to be routed to failure.
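The expected behavior can be modeled in plain Java as follows. This is a sketch only: `FailureRouting`, `Routed`, and `extract` are hypothetical names, and the model uses byte arrays in place of NiFi flow files. The idea is that the extracted text is only used as the outgoing content when extraction succeeds; on failure the untouched original goes out instead. In a real processor this would presumably correspond to transferring the original flow file to the failure relationship rather than the partially written one.

```java
import java.util.function.Function;

public class FailureRouting {

    /** Pairs a relationship name with the content that would be routed to it. */
    public static final class Routed {
        public final String relationship;
        public final byte[] content;

        Routed(String relationship, byte[] content) {
            this.relationship = relationship;
            this.content = content;
        }
    }

    /**
     * Runs the extractor against the original content.
     * Success: the extracted bytes are routed to "success".
     * Failure: the original bytes are routed to "failure", unchanged.
     */
    public static Routed extract(byte[] original, Function<byte[], byte[]> extractor) {
        try {
            byte[] text = extractor.apply(original);
            return new Routed("success", text);
        } catch (RuntimeException e) {
            // Do not emit an empty or half-written result; keep the input as it was.
            return new Routed("failure", original);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = {0x25, 0x50, 0x44, 0x46}; // "%PDF", stand-in for a protected document

        Routed result = extract(pdf, bytes -> {
            throw new IllegalStateException("extraction not permitted");
        });

        System.out.println(result.relationship + " with " + result.content.length + " bytes");
    }
}
```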
Hi,
Great job on this processor!
Since Tika supports HTML extraction for multiple document formats, do you think it would be feasible to add an option to output HTML in this processor?