chrismattmann / imagecat

ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files in place (images, but it could be extended to other file types), and to extract metadata and OCR information from those files using Tika and Tesseract OCR.

Shell 1.12% Python 0.06% HTML 19.78% JavaScript 10.01% CSS 1.26% XSLT 1.08% Java 66.36% Batchfile 0.33% Mathematica 0.01% Dockerfile 0.01%
memex oodt solr oodt-radix tika apache

imagecat's Introduction

ImageCatalog

This is an OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files in place (images, but it could be extended to other file types), and to extract metadata and OCR information from those files using Tika and Tesseract OCR.

See the wiki for more information on installing and running ImageCat:

You can clone the wiki by running
git clone https://github.com/chrismattmann/imagecat.wiki.git

Questions, comments?

Send them to Chris A. Mattmann.

License

Apache License, version 2

imagecat's People

Contributors

arathisunder, atzimdars, chrismattmann, coldsp33d, danlamanna, harsham05, quasiben, szlwzl

imagecat's Issues

Update Documentation step 9

In step 9, the command is ambiguous during installation because you are already in the deploy directory. Update the instructions to use a relative path, but still state that one should be in the deploy directory.

Install Instructions

(memex)ubuntu@ip-10-47-164-122:/mnt/deploy/tomcat7/bin$ cd $OODT_HOME/resmgr/bin/ && ./start_memex_stubs
-bash: ./start_memex_stubs: No such file or directory

Change this to start-memex-stubs (hyphens, not underscores).

But then another issue:

(memex)ubuntu@ip-10-47-164-122:/mnt/deploy/resmgr/bin$ ./start-memex-stubs
./start-memex-stubs: ./batch_stub: /bin/tcsh: bad interpreter: No such file or directory
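
The "bad interpreter" message above means the first line of batch_stub names /bin/tcsh, which does not exist on this host. Below is a minimal sketch of a check for that situation. The function name check_shebang is hypothetical and is not part of ImageCat or OODT.

```shell
#!/bin/sh
# check_shebang FILE
# Hypothetical helper (not part of ImageCat or OODT): report whether the
# interpreter named on FILE's "#!" line exists and is executable.
check_shebang() {
    file=$1
    # Only the first line of the script matters here.
    first_line=$(head -n 1 "$file")
    case $first_line in
        '#!'*) ;;
        *) echo "no shebang: $file"; return 2 ;;
    esac
    # Strip "#!" and any leading spaces, then keep the first word,
    # which is the interpreter path.
    interp=${first_line#'#!'}
    interp=${interp#"${interp%%[! ]*}"}
    interp=${interp%% *}
    if [ -x "$interp" ]; then
        echo "ok: $interp"
    else
        echo "missing interpreter: $interp"
        return 1
    fi
}
```

Run from the resmgr/bin directory, `check_shebang ./batch_stub` should flag the missing /bin/tcsh before the stubs are started. Installing tcsh, or changing the script's first line to a shell that is present, are two possible remedies. Which one the project prefers is not stated in this issue.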

Step 11 Specify Jar Locations

In step 11 we should specify where to get each of the jar files from. I propose that we grab them from their component directory (e.g. cas-filemgr-VERSION.jar from filemgr/lib/cas-filemgr-VERSION.jar).
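
As a rough illustration of that proposal, the sketch below looks up a jar in its component's lib directory. It assumes the layout ROOT/COMPONENT/lib/ARTIFACT-VERSION.jar described above. The function name find_component_jar is hypothetical, and the actual list of jars needed in step 11 is not given here.

```shell
#!/bin/sh
# find_component_jar ROOT COMPONENT ARTIFACT
# Hypothetical helper: print the path of ARTIFACT-<version>.jar found
# directly inside ROOT/COMPONENT/lib. Prints nothing if there is no match.
find_component_jar() {
    root=$1 component=$2 artifact=$3
    # -maxdepth keeps the search to the lib directory itself; it is
    # supported by GNU and BSD find, though not required by POSIX.
    find "$root/$component/lib" -maxdepth 1 -type f \
        -name "$artifact-*.jar" 2>/dev/null
}
```

For example, `find_component_jar "$OODT_HOME" filemgr cas-filemgr` would print the file manager jar if it is present under filemgr/lib.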

Missing XMLs

These were missing after unpackaging

root@7dc72cda3f49:/deploy/pge/extractors# tree
.
|-- filename
|   `-- chunkfile_extractor.xml
`-- metout
    `-- chunkfile_metout.xml

deploy/bin/env.sh sets spurious environment variable on line 48

Professor,

While setting up imagecat, I found a bug. It occurs once you've run ./install.sh and are running ./start.sh inside deploy.

On line 47 of /bin/env.sh, there is

export OODT_HOME=--OODT_HOME--

This interferes with the startup process. We're already setting OODT_HOME in bin/imagecatenv.sh, so this line can either be removed or edited to take the value from imagecatenv.sh.

Similarly, another file contains the same substring: tomcat7/conf/Catalina/localhost/solr.xml.

For reference, I found these files by running the command

egrep -ril '\-\-OODT_HOME\-\-' .  

Inside the deploy folder.
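
The search above can be generalized to any leftover token of the form --NAME--, and the "edit" option can be done by substitution. The following is a minimal sketch under those assumptions. The function names list_placeholders and fill_placeholder are hypothetical, and whether install.sh is supposed to fill these tokens itself is not stated in this issue.

```shell
#!/bin/sh
# list_placeholders DIR
# Hypothetical helper: list files under DIR that still contain an
# unreplaced token such as --OODT_HOME--.
list_placeholders() {
    # -e is needed because the pattern itself starts with "--".
    grep -rlE -e '--[A-Z_]+--' "$1"
}

# fill_placeholder NAME VALUE FILE
# Hypothetical helper: replace every --NAME-- token in FILE with VALUE.
# A temp file is used so the sketch does not rely on "sed -i".
fill_placeholder() {
    name=$1 value=$2 file=$3
    tmp_out=$(mktemp) || return 1
    # "|" is the sed delimiter, so VALUE must not contain "|", "&" or "\".
    sed "s|--${name}--|${value}|g" "$file" > "$tmp_out" &&
        cat "$tmp_out" > "$file"
    status=$?
    rm -f "$tmp_out"
    return $status
}
```

Inside the deploy folder, `list_placeholders .` reports the affected files, and `fill_placeholder OODT_HOME "$OODT_HOME" bin/env.sh` rewrites the token in one of them. Removing the line from env.sh, as suggested above, remains the simpler option if imagecatenv.sh is always sourced first.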

Null pointer Exception

I am receiving the error below in oodt.out after running $OODT_HOME/bin/chunker. Tomcat and Solr both appear to be starting correctly and the path to images is correct in the roxy file. I have also set permissions liberally for the image files but am unable to get any results. Images are all JPEGs with no spaces or special characters in their names.

Sep 16, 2015 1:25:25 PM org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread checkTaskRequiredMetadata
INFO: Task: [Chunker] has no required metadata fields
Exception in thread "Thread-4" java.lang.NullPointerException: value cannot be null
    at org.apache.lucene.document.Field.<init>(Field.java:240)
    at org.apache.lucene.document.Field.<init>(Field.java:216)
    at org.apache.oodt.cas.workflow.instrepo.LuceneWorkflowInstanceRepository.addInstanceMetadataToDoc(LuceneWorkflowInstanceRepository.java:576)
    at org.apache.oodt.cas.workflow.instrepo.LuceneWorkflowInstanceRepository.toDoc(LuceneWorkflowInstanceRepository.java:543)
    at org.apache.oodt.cas.workflow.instrepo.LuceneWorkflowInstanceRepository.addWorkflowInstanceToCatalog(LuceneWorkflowInstanceRepository.java:459)
    at org.apache.oodt.cas.workflow.instrepo.LuceneWorkflowInstanceRepository.updateWorkflowInstance(LuceneWorkflowInstanceRepository.java:200)
    at org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread.persistWorkflowInstance(IterativeWorkflowProcessorThread.java:563)
    at org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread.run(IterativeWorkflowProcessorThread.java:258)
    at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:745)

Some Images fail

Images are partially processed but not parsed correctly:

INFO: on.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.jpeg.JpegParser@5c0bae4a
OUTPUT:         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
OUTPUT:         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
OUTPUT:         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
OUTPUT:         at org.apache.solr.core.RequestHandler
Apr 15, 2015 9:18:29 PM org.apache.oodt.commons.io.LoggerOutputStream flush

Should we warn users that this happens, or fix it?

ImageCat Streaming

For integrating ImageCat into the Memex crawler pipeline, we would like to run the integration code in Storm for real-time processing. Is it possible to provide a streaming API for ImageCat?

Thanks,
Steve

XMLRPC Prereq Documentation

The prerequisites list xmlrpc. I believe this should be xmlrpclib, which comes standard in Python 2.7.

Step 12 with OODT 0.9 can be removed

Removing all of the slf jars made the opsui stop functioning. Either step 12 of the installation needs to be removed, or the slf jar files need to be put into tomcat/common/lib so they are available across all webapps.
