Giter Club home page Giter Club logo

simple-tika-server's Introduction

Simple Tika Server
https://github.com/gselva
==========================

Tika as a HTTP service to parse and get metadata and the textual content of documents (as json) such as below.

Sample Invocation:
------------------
curl -T tika.pdf http://localhost:8080/tika/fulldata

Sample Output:
-------------
{
    "metadata": { "Author":"Apache",
            "Content-Length":243166,
            "Content-Type":"application/pdf",
            "Creation-Date":"2010-05-11T13:39:32Z",
            "Keywords":"",
            "Last-Modified":"2010-05-11T13:39:32Z",
            "created":"Wed May 11 14:39:32 BST 2010",
            "creator":"Google Chrome",
            "producer":"Mac OS X 10.6.7 Quartz PDFContext",
            "resourceName":"tika.pdf",
            "subject":"",
            "title":"Apache Tika - Apache Tika",
            "xmpTPg:NPages":4 },

    "text":"Apache Tika\nApache Tika - a content analysis toolkit"
}

Based on https://github.com/maxcom/tikaserver-ex   (thanks Max!)

More info
---------
Tika as server	    : https://issues.apache.org/jira/browse/TIKA-593
Json output		: https://issues.apache.org/jira/browse/TIKA-213

Build and run
-------------
Install Maven if you haven't, then, under project root folder do:
mvn jetty:run

Deploy
------
mvn war:war

Then drop the war into the webapps folder of a servlet container like Jetty or Apache Tomcat. 
If using Simple Tika Server for parsing local files with HTTP GET, you'll need to setup JNDI entries, see below.


Usage
-----
HTTP PUT a document to parse with Tika

  PUT document to get metadata
	curl -T pom.xml http://localhost:8080/tika/metadata
		returns metadata as JSON
  			
  PUT document to get text
	curl -T pom.xml http://localhost:8080/tika/text
		returns textual content, if any, as json
  
  PUT document to get metadata and text
	curl -T pom.xml http://localhost:8080/tika/fulldata
		returns metadata and textual content as json


HTTP GET to parse a locally available document with Tika

  GET Metadata for files local to server ('metadata' option)
	  http://localhost:8080/tika/metadata/localfiles/tika.pdf
		returns metadata as JSON
		'localfiles' is a JNDI value that returns a URL (see jetty-env.xml for example) 
		tika.pdf should be made available there before calling
		
  	http://localhost:8080/tika/metadata/wikipedia/wiki/Douglas_Adams  
		returns metadata as JSON
		'wikipedia' is a JNDI value that returns a URL (see jetty-env.xml for example) 
		"wiki/Douglas_Adams" is the resource you want to parse 
		
  	http://localhost:8080/tika/metadata/wikipedia/w/index.php?title=Talk:Douglas_Adams&action=history  		
		document part can have query params as well 
		"w/index.php?title=Talk:Douglas_Adams&action=history" is treated as the source document here

  
  GET Text for files local to server ('text' option)
  	http://localhost:8080/tika/text/localfiles/tika.pdf
		returns textual content, if any, as json
		'localfiles' is a JNDI value that returns a URL (see jetty-env.xml for example) 
		tika.pdf should be made available there before calling
  			
  GET Metadata and Text for files local to server ('fulldata' option)
  	http://localhost:8080/tika/fulldata/localfiles/tika.pdf
		returns metadata and textual content as json
		'localfiles' is a JNDI value that returns a URL (see jetty-env.xml for example) 
		tika.pdf should be made available there before calling


HTTP Codes returned
-------------------
200 - Ok
404 - Document not found
415 - Unknown file type
422 - Unparsable document of known type (password protected documents and unsupported versions like Biff5 Excel)
500 - Internal error

simple-tika-server's People

Watchers

James Cloos avatar dani avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.