Giter Club home page Giter Club logo

ehri-indexer's Introduction

Indexer

This is a convenience tool for indexing the EHRI database. The basic idea is to read some JSON from a web service (EHRI REST), convert it to another format (Solr Doc), and POST it to another web service (Solr). The traditional way to do this would be something like:

curl <WS-URL> | convert-json | curl -X POST "Content-type: application/json" <SOLR-UPDATE-URL> --data @-

Here, we just bundle the downloading and uploading bits as well, with some shortcut syntax. There are ways to accomplish the shell pipeline approach using certain options detailed below.

Notes: To build a jar, use mvn clean compile assembly:single. The compile phase must be present. See: http://stackoverflow.com/a/574650/285374. There should then be a Jar called ehri-indexer-1 .0-SNAPSHOT-jar-with-dependencies.jar inside the target folder, which you can execute with the usual java -jar <file.jar> [OPTIONS] ... [ARGS].

Current options:

usage: indexer  [OPTIONS] <spec> ... <specN>
 -c,--clear-id <arg>     Clear an individual id. Can be used multiple
                         times.
 -C,--clear-type <arg>   Clear an item type. Can be used multiple times.
 -D,--clear-all          Clear entire index first (use with caution.)
 -f,--file <arg>         Read input from a file instead of the REST
                         service. Use '-' for stdin.
 -h,--help               Print this message.
 -i,--index              Index the data. This is NOT the default for safety
                         reasons.
 -n,--noconvert          Don't convert data to index format.
 -P,--pretty             Pretty print out JSON given by --print.
 -p,--print              Print converted JSON to stdout. Also implied by
                         --noindex.
 -r,--rest <arg>         Base URL for EHRI REST service.
 -s,--solr <arg>         Base URL for Solr service (minus the action
                         segment).
 -v,--verbose            Print index stats.
 -V,--veryverbose        Print individual item ids

Each <spec> should consist of:
* an item type (all items of that type)
* an item id prefixed with '@' (individual items)
* a type|id (bar separated - all children of an item)
The default URIs for Solr and the REST service are:
* http://localhost:7474/ehri
* http://localhost:8983/solr/portal

Examples:

Index documentary unit and repository types from default service endpoints:

java -jar indexer.jar --index documentaryUnit repository

Index individual item us-005578:

java -jar indexer.jar --index @us-005578

Pretty print (to stdout) the converted JSON output for all documentary units, but don't index:

java -jar indexer.jar --pretty documentaryUnit

Pretty print (to stdout) the raw REST service output:

java -jar indexer.jar --pretty --noconvert documentaryUnit

Clear the entire index:

java -jar indexer.jar --clear-all

Index data read from a JSON file instead of the REST service, outputting some stats:

java -jar indexer.jar --index -f data.json -v

Same as above, but piping the data through stdin (use '-' as the file name):

cat data.json | java -jar indexer.jar --index -f - -v

Read data from stdin, convert it, and pipe it to a Curl upload for indexing:

cat orig.json | java -jar indexer.jar -f - | curl -X POST -H "Content-type: application/json"
"http://localhost:8983/solr/ehri/update?commit=true" --data @-

TODO:

  • Add proper logging
  • Add proper error handling
  • Ensure all resources are properly cleaned up
  • Add more tests!

ehri-indexer's People

Contributors

mikesname avatar

Watchers

James Cloos avatar Junte Zhang avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.