Primo Endpoint

A configurable metadata aggregator and crosswalk for NYU Libraries collections, designed to populate Primo. It can run as a web server and update its document cache dynamically.

Production

> docker build -t primo-endpoint .
> docker run -p 80 primo-endpoint

Logs to stdout by default. Persisting the /cache volume between runs speeds up startup.

Development

Installation

> curl -sSL https://get.haskellstack.org/ | sh
> stack install

Usage:

Usage: primo-endpoint [OPTION...]
  -c FILE   --config=FILE        Load configuration from FILE [config.yml]
  -a FILE   --auth=FILE          Load auth rules from FILE [auth.yml]
  -C DIR    --cache=DIR          Use DIR for cache files [$XDG_CACHE_HOME/primo-endpoint]
  -f        --force              Force an initial update of all collections
  -o[DEST]  --output[=DEST]      Write JSON output to file [-]
  -w[PORT]  --web-server[=PORT]  Run a web server on PORT [80] to serve the result
  -l        --log-access         Log access to stdout
  -v        --verbose            Log collection refreshes to stdout

Config

The configuration is read from a YAML (or JSON) file with the following structure:

  • interval: number of seconds for which to cache collections before reloading (by default)
  • fda: FDA-specific configuration options:
    • collections: maximum number of collections to load from the index, used to translate handles (hdl) to ids
  • generators: a set of named generator "macro" functions that can be used as generator keys, substituting passed object arguments for input fields
  • templates: a set of named field generator templates, each of which contains a set of field generators
  • collections: a set of named collections, each with the following fields:
    • source: a source type (see below), which may also take additional arguments on the collection object
    • template: optional string or array of 0 or more templates (referencing names in the templates object), which are all unioned together
    • fields: additional local "custom" generator fields for this collection

See config.yml for an example.
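
As a rough illustration of the structure above, here is a minimal sketch that builds a configuration as a Python dictionary and serializes it to JSON, which the section says is accepted alongside YAML. Every name and value in it (the template name `common`, the collection key `example_fda`, the handle suffix, the interval) is hypothetical and not taken from the project's config.yml. The layout of a template as a mapping from output field name to definition is also an assumption.

```python
import json

# Hypothetical configuration: all keys under "templates" and "collections",
# and all values, are made up for illustration.
config = {
    "interval": 3600,             # default cache lifetime in seconds
    "fda": {"collections": 500},  # cap on collections loaded from the FDA index
    "templates": {
        # Assumed layout: output field name -> field definition.
        "common": {
            "desc_metadata__title_tesim": "title",  # bare name: copy source field
            "desc_metadata__repo_tesim": {"string": "Example Repository"},
        },
    },
    "collections": {
        "example_fda": {
            "source": "FDA",
            "hdl": 12345,          # source-specific argument (made-up suffix)
            "template": "common",  # a string or an array of template names
            "fields": {
                "collection_ssm": {"string": "Example Collection"},
            },
        },
    },
}


def dangling_templates(cfg):
    """Return (collection, template) pairs that reference an undefined template."""
    defined = set(cfg.get("templates", {}))
    missing = []
    for key, coll in cfg.get("collections", {}).items():
        names = coll.get("template") or []
        if isinstance(names, str):
            names = [names]
        missing += [(key, name) for name in names if name not in defined]
    return missing


# Serialize for use as a JSON config file.
config_json = json.dumps(config, indent=2)
```

The `dangling_templates` helper is not part of primo-endpoint. It is only a convenience check that each `template` entry names something defined under `templates`.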

Sources

Each collection can have one of the following source values to specify the endpoint to pull from:

  • FDA: https://archive.nyu.edu/rest/collections/$id requires id (internal) or hdl (suffix)
  • DLTS: http://discovery.dlib.nyu.edu:8080/solr3_discovery/$core/select requires core (core (none), viewer, or nyupress) and code (collection code)
  • DLib: http://dlib.nyu.edu/$path requires path
  • SDR: https://geo.nyu.edu/catalog (filtered on dct_provenance_s=NYU)
  • SpecialCollections: https://specialcollections.library.nyu.edu/search/catalog.json requires filters object mapping field to value
  • ISAW: http://isaw.nyu.edu/publications/awol-index/awol-index-json.zip (filtered on is_part_of=null)
  • JSON: raw JSON file with array of documents in native key-value format; requires file or url; mainly for testing purposes

Fields

Field definitions are made up of the following:

  • Object with one or more key-value pairs, applied in the following order (highest to lowest precedence):
    • Single fields, which are processed independently and then combined (as if in an array):
      • field: name of source field to copy
      • string: string literal to create single value
      • paste: list of definitions, or string with $field or ${field} placeholders to substitute ($$ for a literal $); the resulting strings are pasted together (no delimiter) as a cross-product (so the number of resulting values is the product of the number of values from each element)
      • handle: definition. Convert a string of the form "http://hdl.handle.net/XXX/YYY.ZZZ" to "hdl-handle-net-XXX-YYY-ZZZ". Any non-matching input is discarded.
      • value: any definition (for convenient nesting)
      • generator name: key-definition arguments as object. Substitutes a generator "macro" from the generator section, assigning the given keys to their corresponding values as input fields to the macro. The generator can also see any other input fields.
    • Post-processors that first process the rest of the definition, and then apply a transformation on the result:
      • date: string strptime format. Tries to parse each value in the result with the given format and produces a timestamp in standard format (relevant prefix of "%Y-%m-%dT%H:%M:%S%QZ") as output. Any inputs that cannot be parsed are discarded.
      • match: match input against regular expressions
        • A string regular expression, which filters input values against the regular expression, passing only those which match
        • An object "lookup table" mapping regular expressions to substitutions: each input value is matched against each regular expression, and the right-hand value substituted for each matching value. Within the substitution, the following additional field values are available:
          • ` (backtick): the input string before the (first) match
          • ' (apostrophe): the input string after the (first) match
          • &: the matching segment of the input string
          • 0: same as &
          • 1...N: the string matched by each parenthesized group in the regular expression
      • limit: integer. Take only the first n values from the input, discarding the rest.
      • default: definition. If there are no produced input values, provide the definition instead.
      • join: string literal delimiter. Paste all the inputs together, separated by the given delimiter. Always produces exactly one output.
  • Array: each element is evaluated as a definition and all produced values are merged, so the result is the concatenation of every element's values.
  • String literal containing only ., _, and alphanumerics: passed to field
  • Any other string literal: passed to paste
  • Null: same as empty array (produces 0 values)
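
Three of the rules above can be sketched in a few lines of Python: how a bare string is routed to `field` or `paste`, how `paste` forms a cross-product, and how `handle` rewrites a handle URL. This is an illustrative model and not the project's implementation. The function names are hypothetical, values are modeled as plain lists of strings, and the handle rule (replace every `/` and `.` after the host with `-`) is an assumption inferred from the single example given.

```python
import re
from itertools import product

# A bare string made only of '.', '_' and alphanumerics is shorthand for `field`.
FIELD_NAME = re.compile(r"[A-Za-z0-9._]+")
HANDLE_PREFIX = "http://hdl.handle.net/"


def classify(literal):
    """Return which generator a bare string literal is passed to."""
    return "field" if FIELD_NAME.fullmatch(literal) else "paste"


def paste(parts):
    """Concatenate one value from each part, for every combination of values.

    `parts` is a list of already-evaluated value lists. If any part has no
    values, the cross-product, and therefore the result, is empty.
    """
    return ["".join(combo) for combo in product(*parts)]


def handle(values):
    """Rewrite handle URLs as hyphenated ids; drop anything that does not match."""
    out = []
    for value in values:
        if value.startswith(HANDLE_PREFIX) and len(value) > len(HANDLE_PREFIX):
            rest = value[len(HANDLE_PREFIX):]
            out.append("hdl-handle-net-" + re.sub(r"[/.]", "-", rest))
    return out


print(classify("ref_ssi"))   # field
print(classify(".html"))     # field: only '.', and alphanumerics
print(classify(".html#"))    # paste: '#' is outside the allowed set
print(paste([["a", "b"], ["-"], ["1", "2"]]))  # ['a-1', 'a-2', 'b-1', 'b-2']
print(handle(["http://hdl.handle.net/XXX/YYY.ZZZ", "not a handle"]))
```

One consequence of the shorthand rule is that a literal such as ".html" is read as a field name. Writing it as `string: ".html"` states the intent explicitly.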

There are two special input fields added to every source document:

  • _key: The collection key
  • _name: The collection name field
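
The post-processors listed under Fields can be pictured as functions from a list of values to a list of values. The sketch below models the string form of `match`, plus `limit`, `default`, and `join`. It is not the project's code and the function names are hypothetical. `default` takes an already-evaluated list instead of a definition. Whether `match` must cover the whole value or only part of it is not stated, so an unanchored search is assumed. The order in which the real tool combines several post-processors is not modeled, so the example composes them explicitly.

```python
import re


def match_filter(pattern, values):
    """Keep only the values in which the regular expression matches (unanchored)."""
    rx = re.compile(pattern)
    return [v for v in values if rx.search(v)]


def limit(n, values):
    """Take the first n values and discard the rest."""
    return values[:n]


def default(fallback, values):
    """Use the fallback values only when nothing was produced."""
    return values if values else fallback


def join(delimiter, values):
    """Paste all values together; always yields exactly one output."""
    return [delimiter.join(values)]


languages = ["eng", "fre", "123", "ger"]
kept = limit(2, match_filter(r"^[a-z]+$", languages))
print(join("; ", kept))           # ['eng; fre']
print(default(["unknown"], []))   # ['unknown']
print(join(", ", []))             # [''] (one empty string, not zero values)
```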

Reference data:

Required fields for Primo

  • "id": for FDA, "fda:hdl-handle-net-2451-XXXX"
  • "desc_metadata__addinfolink_tesim"
  • "desc_metadata__addinfotext_tesim"
  • "desc_metadata__available_tesim"
  • "desc_metadata__citation_tesim"
  • "desc_metadata__creator_tesim"
  • "desc_metadata__data_provider_tesim"
  • "desc_metadata__date_tesim"
  • "desc_metadata__description_tesim"
  • "desc_metadata__edition_tesim"
  • "desc_metadata__format_tesim"
  • "desc_metadata__isbn_tesim"
  • "desc_metadata__language_tesim"
  • "desc_metadata__location_tesim"
  • "desc_metadata__publisher_tesim"
  • "desc_metadata__relation_tesim"
  • "desc_metadata__repo_tesim"
  • "desc_metadata__resource_set_tesim"
  • "desc_metadata__restrictions_tesim"
  • "desc_metadata__rights_tesim"
  • "desc_metadata__series_tesim"
  • "desc_metadata__subject_tesim"
  • "desc_metadata__subject_spatial_tesim"
  • "desc_metadata__subject_temporal_tesim"
  • "desc_metadata__title_tesim"
  • "desc_metadata__type_tesim"
  • "desc_metadata__version_tesim"
  • "collection_ssm"

primo-endpoint's People

Contributors

dylex, ekate

primo-endpoint's Issues

Retrieve FDA records in consistent order

For all the other sources we use a ?sort=id parameter or something like it to get the records in order. This is important for pagination, but also helpful in testing when comparing the output. Is there something we can pass to the DSpace REST API to sort the items within each collection? Should we just sort them ourselves after the fact? Or just let them be in whatever order?

problem with "paste" in config

For a new collection I need to paste several fields and strings together to build a URL for the "available" field in Primo:

  available:
    paste:
      - "http://dlib.nyu.edu/findingaids/html/fales/mss_496/dsc"
      - ref_ssi
      - ".html"

The URL is not constructed, i.e. the value of the "available" field is empty. If I use ".html#" instead, everything works as expected. I think html is interpreted as a variable name. Is that possible? Could you please take a look? I added a branch called mss_496 which has the new config.

Proper language translation for MODS XML

Currently the translation layer puts everything in ISO 639-2 long names, but MODS wants 3-character codes. Maybe the Primo output layer should do this? Or the XML output layer should undo it? Or there should be a field generator config for it?

headers added for authentication behave weirdly

Documents from the FDA private collection are not returned, although I've added an authentication header to the request through the auth module.
When I print the header in the FDA Apache log I get:
"GET /rest/collections/ HTTP/1.1" 200 441 ", " "application/json" common
i.e. the token is repeated twice in the header, hence authentication is not working.
There seem to be no errors in reading the values and forming the header. The request headers list looks normal, so I can't figure out where this second value comes from.
When manually sending the same request with curl, the log entry has only one token:
"GET /rest/collections/ HTTP/1.1" 200 9883 "-" "curl/7.29.0" "" combined
@dylex if you have time to look at it, I can provide more details.
