
tika-app-python

Overview

tika-app-python is a Python wrapper for Apache Tika App. With this library you can analyze:

  • a file on disk
  • a base64-encoded payload
  • a file object (such as standard input)

Analyzing file objects requires Apache Tika 1.17 or later.

Apache 2 Open Source License

tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.

Authors

Main Author

Fedele Mantuano (Twitter: @fedelemantuano)

Installation

Clone the repository:

git clone https://github.com/fedelemantuano/tika-app-python.git

and install tika-app-python with setup.py:

cd tika-app-python

python setup.py install

or use pip:

pip install tika-app

Usage in a project

Import the TikaApp class and create a client that points to your Apache Tika App JAR:

from tikapp import TikaApp

tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")

To get the content type:

tika_client.detect_content_type("your_file")

To detect the language:

tika_client.detect_language("your_file")

To extract all metadata and content:

tika_client.extract_all_content("your_file")

To extract only the content:

tika_client.extract_only_content("your_file")

To extract only the metadata:

tika_client.extract_only_metadata("your_file")
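The calls above can be collected into a single helper. This is only a minimal sketch: `analyze_file` is a hypothetical name that is not part of tika-app-python, and it assumes nothing beyond a client object that exposes the five methods listed above and accepts a file path as the first argument.

```python
def analyze_file(client, path):
    """Run each analysis method shown above on one file.

    `client` is expected to be a TikaApp instance. `analyze_file` is a
    hypothetical helper, not part of tika-app-python.
    """
    return {
        "content_type": client.detect_content_type(path),
        "language": client.detect_language(path),
        "all": client.extract_all_content(path),
        "content": client.extract_only_content(path),
        "metadata": client.extract_only_metadata(path),
    }


# Usage (assumes tika_client was created as shown earlier):
# report = analyze_file(tika_client, "your_file")
# print(report["content_type"])
```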

You can analyze a base64 payload with the same methods by passing the payload argument:

tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
tika_client.extract_only_metadata(payload="base64_payload")
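If the data starts out as a file or raw bytes, it has to be base64-encoded before it is passed as the payload. The sketch below uses only the standard library. `file_to_base64` is a hypothetical helper, and returning a text string rather than bytes is an assumption, since the README does not say which type the payload argument expects.

```python
import base64


def file_to_base64(path):
    """Read a file and return its content as a base64 text string."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


# Usage (assumes tika_client was created as shown earlier):
# payload = file_to_base64("your_file")
# tika_client.detect_content_type(payload=payload)
```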

You can also analyze a file object (such as standard input) with the same methods by passing the objectInput argument:

tika_client.detect_language(objectInput="objectInput")
tika_client.extract_all_content(objectInput="objectInput")
tika_client.extract_only_content(objectInput="objectInput")
tika_client.extract_only_metadata(objectInput="objectInput")
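In practice the objectInput value would be an open file object rather than a string. The following is a hedged sketch: `extract_from_stream` is a hypothetical helper, and opening the file in binary mode is an assumption that the README does not confirm.

```python
def extract_from_stream(client, stream):
    """Extract only the text content from an open file object.

    `client` is expected to be a TikaApp instance backed by Apache Tika
    1.17 or later. `extract_from_stream` is a hypothetical helper.
    """
    return client.extract_only_content(objectInput=stream)


# Usage (assumes tika_client was created as shown earlier):
# with open("your_file", "rb") as handle:
#     text = extract_from_stream(tika_client, handle)
```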

Usage from command-line

If you installed tika-app-python with pip or setup.py, you can also use it from the command line through the tikapp command. The command needs the path to the Apache Tika App JAR, which you can provide in two ways:

  • set the environment variable TIKA_APP_JAR
  • use the --jar switch

The --jar switch takes precedence over the environment variable.
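The precedence described above can be illustrated with a short sketch. This is not the tool's actual code: `resolve_jar` is a hypothetical function, and raising `ValueError` when no JAR is configured is an assumption made for the example.

```python
import os


def resolve_jar(cli_jar=None, environ=None):
    """Pick the Tika JAR path: the --jar value first, then TIKA_APP_JAR."""
    environ = os.environ if environ is None else environ
    jar = cli_jar or environ.get("TIKA_APP_JAR")
    if not jar:
        # Assumed behaviour for this sketch; the real tool may differ.
        raise ValueError("No Tika JAR: set TIKA_APP_JAR or pass --jar")
    return jar
```

Passing the environment as a parameter keeps the function easy to check without modifying the real process environment.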

These are all the switches:

usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l]
                   [-m] [-a] [-v]

Wrapper for Apache Tika App.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to submit (default: None)
  -p PAYLOAD, --payload PAYLOAD
                        Base64 payload to submit (default: None)
  -k, --stdin           Enable parsing from stdin (default: False)
  -j JAR, --jar JAR     Apache Tika app JAR (default: None)
  -d, --detect          Detect document type (default: False)
  -t, --text            Output plain text content (default: False)
  -l, --language        Output only language (default: False)
  -m, --metadata        Output only metadata (default: False)
  -a, --all             Output metadata and content from all embedded files
                        (default: False)
  -v, --version         show program's version number and exit

Example from file on disk:

$ tikapp -f example_file -a

Example from standard input:

$ tikapp -a -k < example_file

Performance tests

These are the results of the performance tests in the tests folder:

(Python 2)
tika_content_type()             0.704840 sec
tika_detect_language()          1.592066 sec
magic_content_type()            0.000215 sec
tika_extract_all_content()      0.816366 sec
tika_extract_only_content()     0.788667 sec

(Python 3)
tika_content_type()             0.698357 sec
tika_detect_language()          1.593452 sec
magic_content_type()            0.000226 sec
tika_extract_all_content()      0.785915 sec
tika_extract_only_content()     0.766517 sec
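Timings like these can be collected with the standard timeit module. The sketch below is not the script from the tests folder: `average_seconds` is a hypothetical helper, and the number of repetitions is an arbitrary choice.

```python
import timeit


def average_seconds(func, args=(), repeat=5):
    """Return the mean wall-clock time of func(*args) over `repeat` runs."""
    total = timeit.timeit(lambda: func(*args), number=repeat)
    return total / repeat


# Usage (assumes tika_client was created as shown earlier):
# elapsed = average_seconds(tika_client.detect_language, ("your_file",))
# print("tika_detect_language()  %.6f sec" % elapsed)
```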

tika-app-python's People

Contributors

antoinefraboul, fedelemantuano


tika-app-python's Issues

Metadata function

Would a function, e.g. tika_extract_only_metadata, be accepted? I.e. calling Tika with the -m switch.

OCR?

Tika as a standalone will OCR an image. Is there a way to access that functionality through your code? Thanks very much.

Change output format

Can we change the output of extract_only_content("File_Name")?

The code I used works fine:

from tikapp import TikaApp

client = TikaApp('File_Path of Tika-app.jar file')
content = client.extract_only_content("File_Name")
print(content)

The content is returned as plain text. Can we get the output as an HTML string (with tags) instead?

Messed order of parsed texts

Hi Guys,

The bulk paragraphs are wrongly placed at the bottom of the parsed text. Is there any configuration to correct the order? I found something here, but I am not sure whether it can be applied to this library.

Thanks so much. Luke

[Screenshot: input PDF file on the left, output text on the right]
