
tika-python's Introduction


tika-python

A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

This makes Apache Tika available as a Python library, installable via Setuptools, Pip and Easy Install.

To use this library, you need to have Java 7+ installed on your system, because tika-python starts up the Tika REST server in the background.

Inspired by Aptivate Tika.

Installation (with pip)

  1. pip install tika

Installation (without pip)

  1. python setup.py build
  2. python setup.py install

Airgap Environment Setup

To run in a disconnected environment, download a Tika server jar and its checksum (both tika-server.jar and tika-server.jar.md5, which can be found here) and set the TIKA_SERVER_JAR environment variable to a file URL, e.g. TIKA_SERVER_JAR="file:////tika-server-standard.jar". This tells tika-python to "download" the file from that location, move it to /tmp/tika-server-standard.jar, and run it as a background process.

This is the only way to run tika-python without internet access. If the variable is not set, the default is to check the Tika version and pull the latest jar from Apache every time.
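
As a minimal sketch of the setup above, the variable can also be set from Python before the library is first imported. The jar location /opt/tika/tika-server-standard.jar and the helper name use_local_tika_jar are hypothetical, and the three-slash file:// form that pathlib produces is an assumption about what the library accepts.

```python
import os
from pathlib import Path


def use_local_tika_jar(jar_path):
    """Point TIKA_SERVER_JAR at a local jar via a file:// URL.

    Must run before `import tika`, because the environment variables are
    read once when the library is first loaded.
    """
    # as_uri() requires an absolute path, so make the path absolute first.
    jar_url = Path(jar_path).absolute().as_uri()
    os.environ["TIKA_SERVER_JAR"] = jar_url
    return jar_url


# Hypothetical location of the pre-downloaded jar (its .md5 file sits next to it).
jar_url = use_local_tika_jar("/opt/tika/tika-server-standard.jar")
print(jar_url)
```

Only after this call would the application import tika and call the parser as usual.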

Environment Variables

These are read once, when tika/tika.py is first loaded, and are used throughout after that.

  1. TIKA_VERSION - set to the version string, e.g., 1.12 or default to current Tika version.
  2. TIKA_SERVER_JAR - set to the full URL to the remote Tika server jar to download and cache.
  3. TIKA_SERVER_ENDPOINT - set to the host (local or remote) for the running Tika server jar.
  4. TIKA_CLIENT_ONLY - if set to True, TIKA_SERVER_JAR is ignored; the library relies on the value of TIKA_SERVER_ENDPOINT and acts purely as a REST client.
  5. TIKA_TRANSLATOR - set to the fully qualified class name (defaults to Lingo24) for the Tika translator implementation.
  6. TIKA_SERVER_CLASSPATH - set to a string (delimited by ':' for each additional path) to prepend to the Tika server jar path.
  7. TIKA_LOG_PATH - set to a directory with write permissions and the tika.log and tika-server.log files will be placed in this directory.
  8. TIKA_PATH - set to a directory with write permissions and the tika_server.jar file will be placed in this directory.
  9. TIKA_JAVA - set the Java runtime name, e.g., java or java9.
  10. TIKA_STARTUP_SLEEP - number of seconds (float) to wait per check if Tika server is launched at runtime.
  11. TIKA_STARTUP_MAX_RETRY - number of checks (int) to attempt for Tika server startup if launched at runtime.
  12. TIKA_JAVA_ARGS - set Java runtime arguments, e.g., -Xmx4g.
  13. TIKA_LOG_FILE - set the filename for the log file. Default: tika.log. If it is an empty string (''), no log file is created.
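
Because the variables are read once at load time, one way to set them from Python is to write them into os.environ before the first import of tika. The sketch below assumes that approach; the helper names apply_tika_settings and max_startup_wait and the log directory are hypothetical, and the wait estimate assumes the library sleeps once per startup check.

```python
import os


def apply_tika_settings(settings):
    """Copy TIKA_* settings into os.environ as strings.

    Call this before `import tika`; the values are read once at load time.
    """
    for name, value in settings.items():
        if not name.startswith("TIKA_"):
            raise ValueError(f"not a Tika setting: {name}")
        os.environ[name] = str(value)


def max_startup_wait():
    """Rough upper bound, in seconds, on the startup wait (sleep x retries)."""
    sleep_seconds = float(os.environ["TIKA_STARTUP_SLEEP"])
    max_retries = int(os.environ["TIKA_STARTUP_MAX_RETRY"])
    return sleep_seconds * max_retries


apply_tika_settings({
    "TIKA_LOG_PATH": "/tmp/tika-logs",   # hypothetical writable directory
    "TIKA_JAVA_ARGS": "-Xmx4g",
    "TIKA_STARTUP_SLEEP": 2.5,
    "TIKA_STARTUP_MAX_RETRY": 4,
})
print(max_startup_wait())
```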

Testing it out

Parser Interface (backwards compat prior to REST)

#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])

Parser Interface

The parser interface extracts text and metadata using the /rmeta interface. This is one of the better ways to get the internal XHTML content extracted.

Note: The parser interface needs the following environment variable set on the console for printing of the extracted content: export PYTHONIOENCODING=utf8

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])
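
The result is a dictionary with "metadata" and "content" entries, as the example above prints. A small sketch of consuming it defensively follows; the helper name summarize is hypothetical, and both the "Content-Type" metadata key and the possibility that "content" is None are assumptions rather than documented guarantees. A stand-in dictionary is used so the snippet runs without a Tika server.

```python
def summarize(parsed):
    """Return (content_type, character_count) for a parser result dict."""
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""   # assumption: content may be None
    content_type = metadata.get("Content-Type", "unknown")
    return content_type, len(content.strip())


# Stand-in for the value returned by parser.from_file('/path/to/file').
sample = {
    "metadata": {"Content-Type": "application/pdf"},
    "content": "\n\nGood evening, Dave\n",
}
print(summarize(sample))
```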

Optionally, you can pass the Tika server URL along with the call, which is useful for multi-instance execution or when Tika is dockerized/linked.

parsed = parser.from_file('/path/to/file', 'http://tika:9998/tika')
string_parsed = parser.from_buffer('Good evening, Dave', 'http://tika:9998/tika')

You can also pass a binary stream:

from tika import parser
with open('/path/to/file', 'rb') as file_obj:
    response = parser.from_file(file_obj)

Gzip compression

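Since Tika 1.24.1, gzip compression of input and output streams is allowed.

Input compression can be achieved with gzip or zlib:

    import zlib
    from tika import parser

    # Compress the file contents with zlib before sending them to Tika.
    with open('/path/to/file', 'rb') as file_obj:
        parsed = parser.from_buffer(zlib.compress(file_obj.read()))

...

    import gzip
    from tika import parser

    # Same idea, using the gzip container format instead.
    with open('/path/to/file', 'rb') as file_obj:
        parsed = parser.from_buffer(gzip.compress(file_obj.read()))

And output with the header:

    from tika import parser

    # Ask the server to compress its response.
    with open('/path/to/file', 'rb') as file_obj:
        parsed = parser.from_file(file_obj, headers={'Accept-Encoding': 'gzip, deflate'})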
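
To see what the two input options produce, the following standard-library sketch compresses a sample payload both ways and checks that each round-trips. It does not contact a Tika server; the payload is made up for illustration.

```python
import gzip
import zlib

payload = b"Good evening, Dave. " * 200   # repetitive sample text

gz = gzip.compress(payload)   # gzip container: starts with magic bytes 1f 8b
zl = zlib.compress(payload)   # zlib container: short header plus checksum trailer

# Both formats decompress back to the original bytes.
assert gzip.decompress(gz) == payload
assert zlib.decompress(zl) == payload

print(len(payload), len(gz), len(zl))
```

The gain depends heavily on the input: already-compressed formats such as JPEG or ZIP-based documents usually shrink very little.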

Specify Output Format To XHTML

The parser interface is optionally able to output the content as XHTML rather than plain text.

Note: The parser interface needs the following environment variable set on the console for printing of the extracted content: export PYTHONIOENCODING=utf8

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file', xmlContent=True)
print(parsed["metadata"])
print(parsed["content"])

# Note: This is also available when parsing from the buffer.

Unpack Interface

The unpack interface handles both metadata and text extraction in a single call. The server returns a tarball of metadata and text entries, which the library unpacks internally, reducing the wire load for extraction.

#!/usr/bin/env python
import tika
from tika import unpack
parsed = unpack.from_file('/path/to/file')

Detect Interface

The detect interface provides an IANA MIME type classification for the provided file.

#!/usr/bin/env python
import tika
from tika import detector
print(detector.from_file('/path/to/file'))

Config Interface

The config interface allows you to inspect the Tika Server environment's configuration including what parsers, mime types, and detectors the server has been configured with.

#!/usr/bin/env python
import tika
from tika import config
print(config.getParsers())
print(config.getMimeTypes())
print(config.getDetectors())

Language Detection Interface

The language detection interface provides a 2-character language code detected from the text in the provided file.

#!/usr/bin/env python
from tika import language
print(language.from_file('/path/to/file'))

Translate Interface

The translate interface translates the text automatically extracted by Tika from the source language to the destination language.

#!/usr/bin/env python
from tika import translate
print(translate.from_file('/path/to/spanish', 'es', 'en'))

Using a Buffer

Note that you can also use the Parser and Detector .from_buffer(string|BufferedIOBase) method to dynamically parse a string or bytes buffer in Python and/or detect its MIME type. This is useful if you've already loaded the content into memory.

import io
from tika import parser

string_parsed = parser.from_buffer('Good evening, Dave')
byte_data: bytes = b'B\xc3\xa4ume'
parsed = parser.from_buffer(io.BytesIO(byte_data))

Using Client Only Mode

You can set Tika to use Client only mode by setting

import tika
tika.TikaClientOnly = True

Then you can run any of the methods and it will fully omit the check to see if the service on localhost is running and omit printing the check messages.
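
Since client-only mode skips the library's own check, an application may want to verify reachability itself. The sketch below does so with a plain TCP connection; the helper name endpoint_reachable is hypothetical, the endpoint URL is an example value, and a successful connection only shows that something is listening, not that it is a Tika server.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parts = urlsplit(endpoint)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical endpoint; substitute the value used for TIKA_SERVER_ENDPOINT.
if endpoint_reachable("http://localhost:9998"):
    print("Tika server is accepting connections")
else:
    print("Tika server is not reachable")
```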

Changing the Tika Classpath

You can update the classpath that Tika server uses by setting the classpath as a set of ':' delimited strings. For example, to get Tika-Python working with GeoTopicParsing, you can do the following. Replace the paths below with your own paths, as identified here, and make sure that you have done this:

Kill the Tika server (if already running):

ps aux | grep java | grep Tika
kill -9 PID

Then set the classpath and parse:

import tika.tika
import os
from tika import parser
home = os.getenv('HOME')
tika.tika.TikaServerClasspath = home + '/git/geotopicparser-utils/mime:'+home+'/git/geotopicparser-utils/models/polar'
parsed = parser.from_file(home + '/git/geotopicparser-utils/geotopics/polar.geot')
print(parsed["metadata"])

Customizing the Tika Server Request

You may customize the outgoing HTTP request to Tika server by setting requestOptions on the .from_file and .from_buffer methods (Parser, Unpack, Detect, Config, Language, Translate). It should be a dictionary of arguments that will be passed to the request method. The request method documentation specifies valid arguments. This will override any defaults except for url and params/data.

from tika import parser
parsed = parser.from_file('/path/to/file', requestOptions={'timeout': 120})

New Command Line Client Tool

When you install Tika-Python you also get a new command line client tool, tika-python installed in your /path/to/python/bin directory.

The options and help for the command line tool can be seen by typing tika-python without any arguments. This will also download a copy of the tika-server jar and start it if you haven't done so already.

tika.py [-v] [-o <outputDir>] [--server <TikaServerEndpoint>] [--install <UrlToTikaServerJar>] [--port <portNumber>] <command> <option> <urlOrPathToFile>

tika.py parse all test.pdf test2.pdf                   (write output JSON metadata files for test.pdf_meta.json and test2.pdf_meta.json)
tika.py detect type test.pdf                           (returns mime-type as text/plain)
tika.py language file french.txt                       (returns language e.g., fr as text/plain)
tika.py translate fr:en french.txt                     (translates the file french.txt from french to english)
tika.py config mime-types                              (see what mime-types the Tika Server can handle)

A simple python and command-line client for Tika using the standalone Tika server (JAR file).
All commands return results in JSON format by default (except text in text/plain).

To parse docs, use:
tika.py parse <meta | text | all> <path>

To check the configuration of the Tika server, use:
tika.py config <mime-types | detectors | parsers>

Commands:
  parse  = parse the input file and write a JSON doc file.ext_meta.json containing the extracted metadata, text, or both
  detect type = parse the stream and 'detect' the MIME/media type, return in text/plain
  language file = parse the file stream and identify the language of the text, return its 2 character code in text/plain
  translate src:dest = parse and extract text and then translate the text from source language to destination language
  config = return a JSON doc describing the configuration of the Tika server (i.e. mime-types it
             can handle, or installed detectors or parsers)

Arguments:
  urlOrPathToFile = file to be parsed, if URL it will first be retrieved and then passed to Tika

Switches:
  --verbose, -v                  = verbose mode
  --encode, -e                   = encode response in UTF-8
  --csv, -c                      = report detect output in comma-delimited format
  --server <TikaServerEndpoint>  = use a remote Tika Server at this endpoint, otherwise use local server
  --install <UrlToTikaServerJar> = download and exec Tika Server (JAR file), starting server on default port 9998

Example usage as python client:
-- from tika import runCommand, parse1
-- jsonOutput = runCommand('parse', 'all', filename)
 or
-- jsonOutput = parse1('all', filename)

Questions, comments?

Send them to Chris A. Mattmann.

Contributors

  • Chris A. Mattmann, JPL
  • Brian D. Wilson, JPL
  • Dongni Zhao, USC
  • Kenneth Durri, University of Maryland
  • Tyler Palsulich, New York University & Google
  • Joe Germuska, Northwestern University
  • Vlad Shvedov, Profinda.com
  • Diogo Vieira, Globo.com
  • Aron Ahmadia, Continuum Analytics
  • Karanjeet Singh, USC
  • Renat Nasyrov, Yandex
  • James Brooking, Blackbeard
  • Yash Tanna, USC
  • Igor Tokarev, Freelance
  • Imraan Parker, Freelance
  • Annie K. Didier, JPL
  • Juan Elosua, TEGRA Cybersecurity Center
  • Carina de Oliveira Antunes, CERN
  • Ana Mensikova, JPL

Thanks

Thanks to the DARPA MEMEX program for funding most of the original portions of this work.

License

Apache License, version 2

tika-python's People

Contributors

acsc-cyberlab, ahmadia, bitsgalore, carantunes, chrismattmann, cymox1, dongnizh, ekeydar, frennkie, gabriel-v, harsham05, igormp, imraanparker, jacknashg, jjelosua, karanjeets, kdurril, lvieirajr, matthewdavislee, mjbommar, pehat, prough21, sheldonreiff, smadha, strayer, thammegowda, tigorc, tooa, yarongon, yashtanna93

tika-python's Issues

Connection refused with Docker python:2.7 image

Installed tika-python into python:2.7 base image, but getting connection refused.

Are there any other dependencies I need to know about?

Output:

>>> from tika import parser
>>> parser.from_buffer('init')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/tika/parser.py", line 29, in from_buffer
    {'Accept': 'application/json'}, False)
  File "/usr/local/lib/python2.7/site-packages/tika/tika.py", line 245, in callServer
    resp = verbFn(serviceUrl, data=data, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 122, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))
>>> parser.from_buffer('init')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/tika/parser.py", line 29, in from_buffer
    {'Accept': 'application/json'}, False)
  File "/usr/local/lib/python2.7/site-packages/tika/tika.py", line 245, in callServer
    resp = verbFn(serviceUrl, data=data, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 122, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))

tika-python stuck in converting some doc

When I use tika-python in my project, it sometimes gets stuck while converting certain docs, without any error info. Would you help me, please?

I use tika-python on a Mac Pro. However, converting works fine using tika-app. Let me know if you need any extra information.

py3?

Hi,
was just playing around a bit to see if I could get this working.

Anyway, below is what happened when I tried with py3 and win8.1 64.
The stop point puzzled me, so I thought I just would let you know.
I guess that you only do py2, so this is just if you ever think in terms of py3...

// ahed

(pjava) C:\VIRENVS\pjava\Scripts>pip install tika
Collecting tika
Downloading tika-1.10.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in c:\viren
vs\pjava\lib\site-packages (from tika)
Collecting requests (from tika)
Using cached requests-2.7.0-py2.py3-none-any.whl
Building wheels for collected packages: tika
Running setup.py bdist_wheel for tika
Stored in directory: C:\Users...\AppData\Local\pip\Cache\wheels\f2\68\5
b\3fa52886820037cfb90525f5c44d0788308face211328020d8
Successfully built tika
Installing collected packages: requests, tika
Successfully installed requests-2.7.0 tika-1.10

(pjava) C:\VIRENVS\pjava\Scripts>python
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import tika
>>> tika.initVM()
>>> from tika import parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\VIRENVS\pjava\lib\site-packages\tika\parser.py", line 19, in <module>
    from tika import parse1, callServer, ServerEndpoint
ImportError: cannot import name 'parse1'
>>> from tika import parse1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'parse1'
>>> dir(tika)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__',
'__package__', '__path__', '__spec__', 'initVM']

Running server API on localhost

If we run the tika rest server on localhost, a call such as the following will not try to use the tika rest server, but will instead attempt to download the tika jar file from http://search.maven.org ...:

parser.from_buffer(fileString, 'http://localhost:9998/tika')

The following works as expected (i.e. accesses the tika rest server on localhost):

parser.from_buffer(fileString, 'http://127.0.0.1:9998/tika')

I recommend that you either update the documentation to make this difference clear, or update the code to attempt to call the rest server on localhost before falling back to downloading the tika jar file.

Thanks!

CLI plaintext output misbehaviour when using installer

I installed tika-python using pip, and then ran it in detect mode on all files in a directory using the following command:

tika-python/tika $ tika-python detect type /home/johan/epub-CB/20150505/* >~/stdout.txt 2> ~/stderr.txt

Result stdout.txt:

('server endpoint:', 'http://localhost:9998')
('server endpoint:', 'http://localhost:9998')
('server endpoint:', 'http://localhost:9998')
('server endpoint:', 'http://localhost:9998')
... etc ..

Result stderr.txt:

[(200, u'application/epub+zip'), (200, u'application/epub+zip'), (200, u'application/epub+zip'), (200, u'application/epub+zip'), ...

So instead of printing the mimetype strings as plain text, it actually prints a dump of the internal Python objects (moreover they're printed to stderr, where I would expect stdout)! I then cloned the repo and ran tika.py directly without using the installer (same command line options):

python tika.py detect type /home/johan/epub-CB/20150505/* > ~/stdout.txt 2> ~/stderr.txt

Result stdout.txt:

application/epub+zip
application/epub+zip
application/epub+zip
application/epub+zip
application/epub+zip
application/epub+zip
application/epub+zip
 ...etc.

Result stderr.txt: empty file.

Which is the expected behaviour. So it seems something gets broken in the install process.

Support for HTML output

I know Apache Tika supports HTML output. But I am only getting text output from tika-python.
Could you please provide support for the HTML output format?

Thanks
Mahesh

New commits are breaking Python 3 compatibility

I've put some effort to make tika-python compatible with Python 3, but in new commits I see something like this:
print resp
So if you want to keep tika-python compatible with Python 3, all the contributors should come to an agreement about coding style. Besides, modules should not print anything: the user of the library may not want this output in their program. Debugging info should be properly logged (with the help of the logging module, for example). Please, @chrismattmann, @yashtanna93, @karanjeets, @kdurril, @dongnizh, think about it.
It would also be great to use automated testing with Tox and Travis CI. The former allows you to set up multiple Python environments for testing, while the latter allows you to see which new commits/pull requests break existing tests. Let's build better software!
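
A minimal sketch of the logging suggestion, written so that it runs on Python 3: the library code logs at DEBUG level and leaves it to the application to decide whether the message is shown. The logger name tika.example and the function report_response are hypothetical and are not taken from the tika-python code base.

```python
import logging

log = logging.getLogger("tika.example")   # hypothetical logger name


def report_response(status, body):
    """Log a server response at DEBUG level instead of printing it."""
    log.debug("Tika server returned status %s (%d bytes)", status, len(body))
    return status, body


# The application, not the library, decides whether the messages are shown.
logging.basicConfig(level=logging.DEBUG)
report_response(200, b"application/epub+zip")
```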

parser.from_buffer breaks on unicode

Platform: Ubuntu 14.04
Python version: 2.7.x

The following breaks:

#!/usr/bin/env python2.7                                                                                                                                                                      
import tika
from tika import parser
import requests

r = requests.get("http://www.hyperiongray.com/")
#stringed = r.text.encode('ascii','ignore')                                                                                                                                                   
string_parsed = parser.from_buffer(r.text)

with the exception:

punk@punk-controller:~/memex-dev/the-headless-horseman$ python tika-test.py 
Traceback (most recent call last):
  File "tika-test.py", line 10, in <module>
    string_parsed = parser.from_buffer(r.text)
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 29, in from_buffer
    {'Accept': 'application/json'}, False)
  File "/usr/local/lib/python2.7/dist-packages/tika/tika.py", line 245, in callServer
    resp = verbFn(serviceUrl, data=data, headers=headers)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 99, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 330, in send
    timeout=timeout
  File "/usr/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 542, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 367, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 833, in _send_output
    self.send(message_body)
  File "/usr/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5132-5134: ordinal not in range(128)

A workaround is something like the following:

#!/usr/bin/env python2.7                                                                                                                                                                      
import tika
from tika import parser
import requests

r = requests.get("http://www.hyperiongray.com/")
mangled_str = r.text.encode('ascii','ignore')
string_parsed = parser.from_buffer(mangled_str)

which works as expected. However, this won't be acceptable for any non-html content and most html content.
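
The loss caused by the ASCII workaround can be shown without a Tika server. The sketch below, written for Python 3 with a made-up sample string, compares it with UTF-8 encoding, which keeps every character. Whether parser.from_buffer in the affected version accepts UTF-8 bytes is an assumption that this thread does not confirm.

```python
text = "B\u00e4ume und Stra\u00dfe"   # sample text with non-ASCII characters

# The workaround above silently drops every non-ASCII character.
ascii_bytes = text.encode("ascii", "ignore")

# UTF-8 keeps them, and the bytes decode back to the original text.
utf8_bytes = text.encode("utf-8")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8") == text)
```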

Cannot parse docx document containing more than one element

Hi, first of all thanks for the great tool.
When I parse a docx document that contains several components, for example document itself and an image, for subsequent ones the parser tries to append values to already existing keys in metadata dict. But in some cases the key contains string/unicode and append operation results in AttributeError.

Example:

In [1]: from tika import parser

In [2]: parsed = parser.from_file('https://dl.dropboxusercontent.com/u/3408905/197.docx')
tika.py: Retrieving https://dl.dropboxusercontent.com/u/3408905/197.docx to /var/folders/nb/y4j12k7d6tq3_vkx118_602w0000gn/T/197.docx.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-0599c0dd20fe> in <module>()
----> 1 parsed = parser.from_file('https://dl.dropboxusercontent.com/u/3408905/197.docx')

/private/tmp/tikagit/tika-python/tika/parser.py in from_file(filename, serverEndpoint)
     23 def from_file(filename, serverEndpoint=ServerEndpoint):
     24     jsonOutput = parse1('all', filename, serverEndpoint)
---> 25     return _parse(jsonOutput)
     26
     27 def from_buffer(string, serverEndpoint=ServerEndpoint):

/private/tmp/tikagit/tika-python/tika/parser.py in _parse(jsonOutput)
     51             if n != "X-TIKA:content":
     52                 if n in parsed["metadata"]:
---> 53                     parsed["metadata"][n].append(js[n])
     54                 else:
     55                     parsed["metadata"][n] = js[n]

AttributeError: 'unicode' object has no attribute 'append'

I think that if the same key is encountered a second time, the existing value should be converted to a list (if it's not already one). Something like this should probably fix the issue:

diff --git a/tika/parser.py b/tika/parser.py
index c6ce684..9053bca 100644
--- a/tika/parser.py
+++ b/tika/parser.py
@@ -50,6 +50,8 @@ def _parse(jsonOutput):
         for n in js:
             if n != "X-TIKA:content":
                 if n in parsed["metadata"]:
+                    if not isinstance(parsed["metadata"][n], list):
+                        parsed["metadata"][n] = [parsed["metadata"][n]]
                     parsed["metadata"][n].append(js[n])
                 else:
                     parsed["metadata"][n] = js[n]
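
The same idea as the patch, as a standalone sketch that can be run without a Tika server. The function name merge_metadata and the sample entries are hypothetical; the merging logic mirrors the diff above.

```python
def merge_metadata(target, entry, skip_key="X-TIKA:content"):
    """Merge one metadata entry into target, promoting repeated keys to lists."""
    for key, value in entry.items():
        if key == skip_key:
            continue
        if key in target:
            # Second occurrence of a key: wrap the existing scalar in a list.
            if not isinstance(target[key], list):
                target[key] = [target[key]]
            target[key].append(value)
        else:
            target[key] = value
    return target


metadata = {}
merge_metadata(metadata, {"Content-Type": "application/vnd.openxmlformats", "X-TIKA:content": "..."})
merge_metadata(metadata, {"Content-Type": "image/png"})
print(metadata)
```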

Blocks port of web service when using tika (detector)

After running an app (flask) that's communicating via some port, if I close down that app and relaunch it, I consistently get a "port in use" error when starting flask from the second time onwards. I spent some time in htop killing potential java processes (lsof -i reported that only firefox/mongod/some-java-thing was using port sockets), and killing specifically the tika-server.jar process allowed me to run flask again on the original port. Any ideas?

ValueError when parsing unsupported type

When I try to parse a file that is not supported by tika, for example binary, the parser tries to perform json.loads on returned content even if it's empty.

Example:

In [1]: from tika import parser
In [2]: parsed = parser.from_file('https://dl.dropboxusercontent.com/u/3408905/binary')
tika.py: Retrieving https://dl.dropboxusercontent.com/u/3408905/binary to /var/folders/nb/y4j12k7d6tq3_vkx118_602w0000gn/T/binary.
tika.py: Warn: Tika server returned status: 415
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-f52d6c546d59> in <module>()
----> 1 parsed = parser.from_file('https://dl.dropboxusercontent.com/u/3408905/binary')

/private/tmp/tikagit/tika-python/tika/parser.py in from_file(filename, serverEndpoint)
     23 def from_file(filename, serverEndpoint=ServerEndpoint):
     24     jsonOutput = parse1('all', filename, serverEndpoint)
---> 25     return _parse(jsonOutput)
     26
     27 def from_buffer(string, serverEndpoint=ServerEndpoint):

/private/tmp/tikagit/tika-python/tika/parser.py in _parse(jsonOutput)
     34     if not jsonOutput:
     35         return parsed
---> 36     realJson = json.loads(jsonOutput[1])
     37
     38     content = ""

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    363
    364         """
--> 365         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    366         end = _w(s, end).end()
    367         if end != len(s):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    381             obj, end = self.scan_once(s, idx)
    382         except StopIteration:
--> 383             raise ValueError("No JSON object could be decoded")
    384         return obj, end

ValueError: No JSON object could be decoded

The issue seems to be that parser._parse is always provided with a tuple containing the response code and the parsed content, so even if the content is empty, the following condition will not be met and it will always try to decode the input.

def _parse(jsonOutput):
    parsed={}
    if not jsonOutput:   <----- always a tuple (int, str)
        return parsed
    realJson = json.loads(jsonOutput[1])

A possible solution is to test against the content itself, not the entire tuple, or to pass only the contents, since the response code does not seem to be used in this function anyway.
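
A minimal sketch of the first option, testing the body rather than the whole tuple. The function name parse_response and the shape of the returned dictionary are hypothetical and do not match the library's actual _parse output.

```python
import json


def parse_response(json_output):
    """Decode a (status, body) tuple, returning {} when the body is empty."""
    parsed = {}
    if not json_output:
        return parsed
    status, body = json_output
    if not body:            # test the content itself, not the whole tuple
        return parsed
    return {"status": status, "entries": json.loads(body)}


print(parse_response((415, "")))
print(parse_response((200, '[{"Content-Type": "text/plain"}]')))
```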

Tika-Python errors out when trying to parse a very large file.

Reported by Christopher Stout:


Hey Chris,

Could you please update the following code in your parser wrapper?:

content = tika.BodyContentHandler()

to

content = tika.BodyContentHandler(-1)

Pass in the argument -1 to handle unlimited size; as of now, it errors out when trying to parse a very large file, returning the error message:

tika.JavaError: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters

Great work.

extract formatted data from doc

I need a new feature: extracting formatted data (XML, HTML) from a doc. I also need the images embedded in the doc.
I noticed that Tika's official app (tika-app-1.5.jar) can extract plain text along with formatting data and XML data. However, the code is obscure, and it seems to use some classes that are not included in Tika (like SAXTransformerFactory).

Python 2.7.9 on windows fails to download Tika server

Running on latest Python 2.7.9 on windows Vista and I get this:

>>> parser.from_buffer(str)
tika.py: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar to c:\users\chrisa~1\appdata\local\temp\tika-server.jar.

Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    parser.from_buffer(str)
  File "C:\Python27\lib\site-packages\tika\parser.py", line 29, in from_buffer
    {'Accept': 'application/json'}, False)
  File "C:\Python27\lib\site-packages\tika\tika.py", line 239, in callServer
    serverEndpoint = checkTikaServer(serverHost, port, tikaServerJar)
  File "C:\Python27\lib\site-packages\tika\tika.py", line 266, in checkTikaServer
    tikaServerJar = getRemoteJar(tikaServerJar, jarPath)
  File "C:\Python27\lib\site-packages\tika\tika.py", line 300, in getRemoteJar
    urlretrieve(urlOrPath, destPath)
  File "C:\Python27\lib\urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "C:\Python27\lib\urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 364, in open_http
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "C:\Python27\lib\urllib.py", line 377, in http_error
    result = method(url, fp, errcode, errmsg, headers)
  File "C:\Python27\lib\urllib.py", line 641, in http_error_302
    data)
  File "C:\Python27\lib\urllib.py", line 667, in redirect_internal
    return self.open(newurl)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 855, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1274, in connect
    server_hostname=server_hostname)
  File "C:\Python27\lib\ssl.py", line 352, in wrap_socket
    _context=self)
  File "C:\Python27\lib\ssl.py", line 579, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 808, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
>>> 

I've tried monkey patching as suggested in:

http://stackoverflow.com/questions/27835619/ssl-certificate-verify-failed-error

Still get same error.

And also read:

http://bugs.python.org/issue23052

Tika parser can't handle non-ASCII filenames

Just copying my log here:

2016-02-04 21:41:27,286 root         INFO     Parsing asset /home/renat/data/pdf/Светозвук_в_природе_и_световая_симфония_Скрябина.pdf...
server endpoint: http://10.250.176.36:9998
2016-02-04 21:41:27,294 requests.packages.urllib3.connectionpool INFO     Starting new HTTP connection (1): 10.250.176.36
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/renat/search/text_extractor/text_extractor.py", line 74, in <module>
    text_extractor.parse_text()
  File "/home/renat/search/text_extractor/text_extractor.py", line 35, in parse_text
    parsed_data = tika_parser.from_file(parse_task.filename, self.tika_address)
  File "/home/renat/searchenv/lib/python3.4/site-packages/tika-1.12-py3.4.egg/tika/parser.py", line 25, in from_file
  File "/home/renat/searchenv/lib/python3.4/site-packages/tika-1.12-py3.4.egg/tika/tika.py", line 168, in parse1
  File "/home/renat/searchenv/lib/python3.4/site-packages/tika-1.12-py3.4.egg/tika/tika.py", line 275, in callServer
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/api.py", line 120, in put
    return request('put', url, data=data, **kwargs)
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/home/renat/searchenv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 353, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.4/http/client.py", line 1137, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/lib/python3.4/http/client.py", line 1177, in _send_request
    self.putheader(hdr, value)
  File "/usr/local/lib/python3.4/http/client.py", line 1109, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 21-29: ordinal not in range(256)

I think the reason is that urlopen tries to open a URL before quoting is applied for local files. Is there a proper way to fix it?
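
The traceback above ends in http.client's putheader, which encodes header values as latin-1. One thing worth trying is therefore to make any header value that carries the file name ASCII-safe before the request is sent. A minimal sketch, assuming the file name ends up in a request header; header_safe_filename is a hypothetical helper, not part of tika-python:

```python
import os
from urllib.parse import quote


def header_safe_filename(path):
    """Return the base name of `path`, percent-encoded so it fits in an HTTP header."""
    name = os.path.basename(path)
    # quote() encodes str input as UTF-8 and percent-escapes every byte
    # outside the unreserved ASCII set, so the result is plain ASCII.
    return quote(name)


if __name__ == "__main__":
    safe = header_safe_filename("/home/renat/data/pdf/Светозвук.pdf")
    # Plain ASCII text can always be encoded as latin-1.
    safe.encode("latin-1")
    print(safe)
```

The receiving side would need to unquote the name again (urllib.parse.unquote) if the original file name matters.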

command 'g++' failed at the last two steps installing jcc

Hi Professor,

I encountered the mentioned linking error at the last two steps installing jcc, see below

Yis-MacBook-Pro:JCC-2.19 Yehudi$ sudo ../../bin/python2.7 setup.py build
found JAVAHOME = /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home
Loading source files for package org.apache.jcc...
Constructing Javadoc information...
Standard Doclet version 1.7.0_45
Building tree for all the packages and classes...
Generating javadoc/org/apache/jcc/PythonException.html...
Generating javadoc/org/apache/jcc/PythonVM.html...
Generating javadoc/org/apache/jcc/package-frame.html...
Generating javadoc/org/apache/jcc/package-summary.html...
Generating javadoc/org/apache/jcc/package-tree.html...
Generating javadoc/constant-values.html...
Generating javadoc/serialized-form.html...
Building index for all the packages and classes...
Generating javadoc/overview-tree.html...
Generating javadoc/index-all.html...
Generating javadoc/deprecated-list.html...
Building index for all classes...
Generating javadoc/allclasses-frame.html...
Generating javadoc/allclasses-noframe.html...
Generating javadoc/index.html...
Generating javadoc/help-doc.html...
running build
running build_py
writing /Users/Yehudi/MyUSC/CSCI-572-Information-Retrieval/python-tika/src/JCC-2.19/jcc/config.py
copying jcc/config.py -> build/lib.macosx-10.9.5-x86_64-2.7/jcc
copying jcc/classes/org/apache/jcc/PythonVM.class -> build/lib.macosx-10.9.5-x86_64-2.7/jcc/classes/org/apache/jcc
copying jcc/classes/org/apache/jcc/PythonException.class -> build/lib.macosx-10.9.5-x86_64-2.7/jcc/classes/org/apache/jcc
running build_ext
building 'jcc' extension
gcc -fno-strict-aliasing -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -dynamiclib -D_jcc_lib -DJCC_VER="2.19" -I/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/include/darwin -I_jcc -Ijcc/sources -I/Users/Yehudi/MyUSC/CSCI-572-Information-Retrieval/buildout/buildout.python/parts/opt/include/python2.7 -c jcc/sources/jcc.cpp -o build/temp.macosx-10.9.5-x86_64-2.7/jcc/sources/jcc.o -DPYTHON -fno-strict-aliasing -Wno-write-strings
clang: warning: argument unused during compilation: '-dynamiclib'
gcc -fno-strict-aliasing -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -dynamiclib -D_jcc_lib -DJCC_VER="2.19" -I/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/include/darwin -I_jcc -Ijcc/sources -I/Users/Yehudi/MyUSC/CSCI-572-Information-Retrieval/buildout/buildout.python/parts/opt/include/python2.7 -c jcc/sources/JCCEnv.cpp -o build/temp.macosx-10.9.5-x86_64-2.7/jcc/sources/JCCEnv.o -DPYTHON -fno-strict-aliasing -Wno-write-strings
clang: warning: argument unused during compilation: '-dynamiclib'
g++ -Wl,-x -dynamiclib -undefined dynamic_lookup build/temp.macosx-10.9.5-x86_64-2.7/jcc/sources/jcc.o build/temp.macosx-10.9.5-x86_64-2.7/jcc/sources/JCCEnv.o -o build/lib.macosx-10.9.5-x86_64-2.7/libjcc.dylib -L/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib -ljava -L/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/server -ljvm -Wl,-rpath -Wl,/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib -Wl,-rpath -Wl,/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/server -Wl,-S
ld: internal error: atom not found in symbolIndex(__ZN7JNIEnv_13CallIntMethodEP8_jobjectP10_jmethodIDz) for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

I followed the article (http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201403.mbox/%[email protected]%3E), installed a custom gcc, version 4.9.1 (brew install gcc), and set the environment variable CC=/usr/local/Cellar/gcc/4.9.1 (where brew put gcc). I also modified setup.py because the JDK version was hard-coded as '1.7.0_25' but mine is 1.7.0_45. However, I'm still blocked here. Any suggestions? Thanks!

Failed to establish a new connection: [Errno 61] Connection refused

Tried using Tika-Python earlier today. Got the following error. Seems to be exceeding some maximum connections.

It also seems to be a problem with Tika Server.

Traceback (most recent call last):
  File "yao_file_detector.py", line 40, in <module>
    main()
  File "yao_file_detector.py", line 36, in main
    detect_files('/Users/Frank/working-directory/fulldump/file-type.txt')
  File "yao_file_detector.py", line 23, in detect_files
    file_type = detector.from_file(''.join([base_directory, val]))
  File "/Library/Python/2.7/site-packages/tika/detector.py", line 22, in from_file
    jsonOutput = detectType1('type', filename)
  File "/Library/Python/2.7/site-packages/tika/tika.py", line 223, in detectType1
    verbose, tikaServerJar)
  File "/Library/Python/2.7/site-packages/tika/tika.py", line 256, in callServer
    resp = verbFn(serviceUrl, encodedData, headers=headers)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 120, in put
    return request('put', url, data=data, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=9998): Max retries exceeded with url: /detect/stream (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x118ec4590>: Failed to establish a new connection: [Errno 61] Connection refused',))
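
"Connection refused" means nothing accepted the TCP connection on that host and port, so a quick check before calling the detector can tell a stopped server apart from other failures. A minimal sketch using only the standard library; the host and port defaults are taken from the error message above, and server_is_listening is a hypothetical helper name:

```python
import socket


def server_is_listening(host="localhost", port=9998, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError as well as timeouts.
        return False


if __name__ == "__main__":
    print("Tika server reachable:", server_is_listening())
```

This only shows that something is listening on the port, not that it is a healthy Tika server.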

double list embedded

Hi, @chrismattmann look at this:

(screenshot)

After parsing by parser in tika-python, this one shows:

(screenshot)

with double list symbol on it.
Can we keep just one list symbol and drop the other, so that it is consistent with the other metadata values? Or do you want to keep it for some other reason?

Support using the library in client-only mode

I think it would be useful to be able to use the library in client-only mode. In this mode the library would call a remote server and, if the server is not reachable, raise an exception rather than download Tika and run it.
Thanks,
P.
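
For reference, the environment variables listed in the README (TIKA_CLIENT_ONLY and TIKA_SERVER_ENDPOINT) describe this mode. A minimal sketch, assuming the installed version honours those variables as documented; the endpoint URL is a placeholder and both function names are hypothetical:

```python
import os


def configure_client_only(endpoint):
    """Set the variables the README documents for client-only use."""
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint


def parse_with_remote_server(path, endpoint):
    """Parse `path` against an already running Tika server at `endpoint`."""
    configure_client_only(endpoint)
    # Imported late on purpose: the README says the variables are read once,
    # when tika/tika.py is first loaded.
    from tika import parser
    return parser.from_file(path)


if __name__ == "__main__":
    configure_client_only("http://tika.example.org:9998")
    print(os.environ["TIKA_CLIENT_ONLY"], os.environ["TIKA_SERVER_ENDPOINT"])
```

If tika has already been imported elsewhere in the process, setting the variables afterwards is unlikely to have any effect.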

call to requests.put is blocking on Windows when using from_file

{parser|detector|language|translate}.from_file all use requests.put on an open file handle to stream the file to the Tika JAX-RS server. On Windows this call freezes and does not return, and eventually (correctly) surfaces a BadStatusLine error from the underlying Python httplib. We need to find out why requests.put blocks on Windows when given a streaming file handle.

problems in setting up new environment and installing python

Hi Professor,
I'm following the steps for installing JCC on Mac from here (https://github.com/chrismattmann/tika-python).

  1. As mentioned in this step ("Create a the file local.cfg with the following contents, then edit /some/directory at the bottom to be the directory you want to house your new python installation."), I edited "/some/directory/" to point to a local directory.
  2. I ran these commands:
     env MACOSX_DEPLOYMENT_TARGET=10.9 bin/buildout -c local.cfg
     env MACOSX_DEPLOYMENT_TARGET=10.9 bin/buildout -c local.cfg install install-links
     but I don't see the Python installation directory ("/some/directory") getting populated with binary files,

so I couldn't execute the last steps, as there are no binary files:
[ ../../bin/python2.7 setup.py build
../../bin/python2.7 setup.py install ]

Am I missing something?
Thanks in advance for your help.

Tests don't run with setup.py test

Tests seem to only run with:

python2.7 tika/tests/tests_params.py

When trying to run through setuptools (note first run python2.7 setup.py develop) I get:

test_content (tika.tests.tests_params.RemoteTest) ... ERROR
test_meta (tika.tests.tests_params.RemoteTest) ... ERROR
test_true (tika.tests.tests_params.RemoteTest) ... ERROR

======================================================================
ERROR: test_content (tika.tests.tests_params.RemoteTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattmann/git/tika-python/tika/tests/tests_params.py", line 48, in setUp
    self.param1 = tika.parser.from_file(self.param1)
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/parser.py", line 24, in from_file
    jsonOutput = parse1('all', filename, serverEndpoint)
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/tika.py", line 140, in parse1
    path, type = getRemoteFile(urlOrPath, '/tmp')
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/tika.py", line 280, in getRemoteFile
    urlp = urlparse(urlOrPath)
  File "/Users/mattmann/git/buildout.python/parts/opt/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/Users/mattmann/git/buildout.python/parts/opt/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'

======================================================================
ERROR: test_meta (tika.tests.tests_params.RemoteTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattmann/git/tika-python/tika/tests/tests_params.py", line 48, in setUp
    self.param1 = tika.parser.from_file(self.param1)
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/parser.py", line 24, in from_file
    jsonOutput = parse1('all', filename, serverEndpoint)
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/tika.py", line 140, in parse1
    path, type = getRemoteFile(urlOrPath, '/tmp')
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/tika.py", line 280, in getRemoteFile
    urlp = urlparse(urlOrPath)
  File "/Users/mattmann/git/buildout.python/parts/opt/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/Users/mattmann/git/buildout.python/parts/opt/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'

======================================================================
ERROR: test_true (tika.tests.tests_params.RemoteTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattmann/git/tika-python/tika/tests/tests_params.py", line 48, in setUp
    self.param1 = tika.parser.from_file(self.param1)
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/parser.py", line 24, in from_file
    jsonOutput = parse1('all', filename, serverEndpoint)
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/tika.py", line 140, in parse1
    path, type = getRemoteFile(urlOrPath, '/tmp')
  File "/usr/local/xdata-employment/python2.7/lib/python2.7/site-packages/tika/tika.py", line 280, in getRemoteFile
    urlp = urlparse(urlOrPath)
  File "/Users/mattmann/git/buildout.python/parts/opt/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/Users/mattmann/git/buildout.python/parts/opt/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'

----------------------------------------------------------------------
Ran 7 tests in 6.722s

FAILED (errors=3)
[chipotle:~/git/tika-python] mattmann% 

Any ideas @kdurril and/or @nutjob4life?

ubuntu running error

(screenshot)
@chrismattmann I tried to run it on Ubuntu, but it fails with what looks like a "bad fd number" syntax error.
It probably happens because of the ">&" redirection symbol.
Try to figure out.
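
If the ">&" redirection is indeed the cause, note that it is a bash/csh form that dash (the default /bin/sh on Ubuntu) rejects. One way to sidestep shell differences is to let Python do the redirection itself. A minimal sketch, assuming the server is launched as a child process; start_with_log is a hypothetical helper and the command and log path are placeholders:

```python
import subprocess
import sys


def start_with_log(cmd, log_path):
    """Start `cmd` with stdout and stderr both appended to `log_path`."""
    log_file = open(log_path, "ab")
    try:
        # No shell is involved, so no shell-specific redirection syntax is needed.
        proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    finally:
        # The child has its own copy of the descriptor; the parent can close its handle.
        log_file.close()
    return proc


if __name__ == "__main__":
    child = start_with_log([sys.executable, "-c", "print('hello')"], "child.log")
    child.wait()
```

For a server jar, the command list might look like ["java", "-jar", "tika-server.jar"], with the jar path adjusted to the local setup.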

Tika-Python doesn't work on windows

I received the following email from a GitHub user:

I tried the Tika library for a project on Windows and saw that it does not recognize Windows paths.
Perhaps you only work on Unices ;-)
In this function:

def getRemoteFile(urlOrPath, destPath):
    """Fetch URL to local path or just return absolute path."""
    #import pdb; pdb.set_trace()
    urlp = urlparse(urlOrPath)
    if urlp.scheme == '':
        return (os.path.abspath(urlOrPath), 'local')
    else:
        filename = urlOrPath.rsplit('/',1)[1]
        destPath = destPath + '/' +filename
        echo2('Retrieving %s to %s.' % (urlOrPath, destPath))
        urlretrieve(urlOrPath, destPath)
        return (destPath, 'remote')

urlp.scheme returns the drive letter (c, d, etc.):

>> o=urlparse('c:\temp\toto.pdf')
>> print o
ParseResult(scheme='c', netloc='', path='\temp\toto.pdf', params='', query='', fragment='')

I corrected my installation of tika by changing it to:

if urlp.scheme not in ('http','https'):
    return(urlOrPath,'local')
and I changed the 2nd line because
>> os.path.abspath('c:\temp\toto.pdf')
'C:\\Python27\\\temp\toto.pdf', which is wrong

best regards,
stéphane
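
The change described in the email can be sketched as a standalone function. This is an illustration rather than the code that ships in tika-python: classify_source is a hypothetical name, and the import is the Python 3 location of urlparse. Raw strings are used in the demo because in an ordinary string literal "\t" is a tab character, which would distort a path such as c:\temp\toto.pdf.

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = ("http", "https")


def classify_source(url_or_path):
    """Return (location, 'remote') for http(s) URLs and (path, 'local') otherwise."""
    scheme = urlparse(url_or_path).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return url_or_path, "remote"
    # Anything else, including a Windows drive letter parsed as a scheme,
    # is treated as a local path and returned unchanged.
    return url_or_path, "local"


if __name__ == "__main__":
    print(classify_source(r"c:\temp\toto.pdf"))
    print(classify_source("http://example.com/toto.pdf"))
```

Treating only http and https as remote is a whitelist, so other URL schemes would also be handled as local paths under this sketch.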

Support Tika's `fileUrl` option

tika-server supports the fileUrl option, which lets you extract data from a file available via a URL. This is very useful if the Tika server runs on host A, the file is located on host B, and the program using Tika runs on host C. Without fileUrl, one first has to transfer the file from host B to host C and then transfer it to host A. With fileUrl, one can tell Tika to transfer the file directly from host B to host A.

tika-python should support this feature, for example like this:

parsed = tika.parser.from_url('http://example.com/my_document')
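
Until such a function exists, the request could be built by hand. A rough sketch, assuming the option is passed as a request header named fileUrl and that the server accepts PUT on /tika; both assumptions should be checked against the tika-server version in use, and build_file_url_request is a hypothetical helper:

```python
from urllib.request import Request


def build_file_url_request(server, document_url):
    """Build (without sending) a PUT request asking the server to fetch document_url."""
    req = Request(server.rstrip("/") + "/tika", data=b"", method="PUT")
    # Assumption: the server reads the document location from this header.
    req.add_header("fileUrl", document_url)
    req.add_header("Accept", "text/plain")
    return req


if __name__ == "__main__":
    request = build_file_url_request("http://localhost:9998", "http://example.com/my_document")
    print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a server.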

Tika Python not working for Java 1.8

As part of my work I need to call the Tika parser to get the contents of a file in XHTML format and then feed it to the Grobid Quantities tool to get measurement details for each file. I found that Grobid Quantities works on Java 1.8, so I updated my Java version to 1.8. At that point tika-python stopped working. Can someone please check whether tika-python works with Java 1.8? As soon as I reverted to 1.7 it started working again.

Upgrade to Tika 1.9

Tika 1.9 is now released, so upgrade to use it. This will natively support the translate interface and language interface since those required a 1.9 server.
