Giter Club home page Giter Club logo

pure.py's Introduction

pure.py

A set of Python tools to make help transform and data for Pure.

Disclaimer: This code is provided "as-is" for educational purposes under the MIT license and should in no way be considered production-ready code. The author is not obligated to maintain, fix or otherwise provide technical support to end-users. Please refer to the license for more details.

Tools:

  • patents.py Python script to harvest Inteum patents and convert them to XML according to the publications.xsd used in the Pure bulk XML import wizard.
  • pureapi.py A simple wrapper for the Pure API. Will default to the latest Pure API version if nothing else is specified.

Harvesting patents from Inteum:

  1. Determine your Inteum RSS 2.0 URL.
  2. Create a Pure API key (required for resolving persons).
  3. Run pip install -r requirements.txt to add packages.
  4. Initialize sites.json with your institution's values (see below for details).
  5. Run python patents.py (Output will be saved to current directory).
  6. <YOUR_PURE_NAME>\_out.xml can be imported into Pure.

sites.json

Each site is a Site object that contains a Pure API key, name, Pure URL, root org ID and source URL (to pull XML data from).

The script supports harvesting multiple sites. Simply add these to the sites.json file to instantiate additional Site objects.

[
    {
        "py/object": "puresite.Site",
        "api_key": "<YOUR_API_KEY",
        "name": "<YOUR_PURE_NAME>",
        "pure_url": "<YOUR_PURE_URL>",
        "root_org": "<YOUR_ROOT_ORG_ID>",
        "url": "<YOUR_DATA_PURL>"
    }
]

Example:

[
    {
        "py/object": "puresite.Site",
        "api_key": "xxxxxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
        "name": "my_name",
        "pure_url": "https://mypure.elsevierpure.com",
        "root_org": "my_root_org_source_id",
        "url": "http://my_inteum_site.technologypublisher.com/RssDataFeed.aspx?UpdateOnOrAfter=1/1/2010"
    }
]

Transforming Inteum RSS 2.0 to Publications

See rss-to-pubs.xsl for a stylesheet that converts the XML to be ingested by Pure. Note that parameters are passed from Python, hence the transformation will not be complete when run outside of patents.py.

  • For reference, the schema files publications.xsd and commons.xsd files can be downloaded from Pure as an Administrator.
  • Default visibility of converted patents is set to Restricted.

Resolving Persons

The RSS feed does not contain unique identifiers for inventors. Pure will only match inventors to internal persons if there is an exact match on the name, which cannot always be guaranteed. To get around this, the script will attempt to map inventors to persons in Pure.

See puresite.py for details on how to match internal Pure persons with a Lucene query against the Pure API. Successful matches are mapped to the externalId (Pure source ID) of a person with a closely matching name.

Note: If multiple persons match on the same name, only the last match will be saved. Additional functionality can be added for more sophisticated matching logic.

Each match will be added to a dictionary and saved to a JSON file "<YOUR_PURE_NAME>.json" for faster re-run, e.g.

{"John;Doe": "7213cfd7abac1973c9d018a3fb1022f3"}

Matches are used by the XSLT to enrich persons with an internal ID using an extension function python:lookup_person.

Any matches will be saved to a file called <YOUR_PURE_NAME>.json. To force re-matching, simply delete the file and re-run the script.

Contributors

  • Lars K Oestergaard

pure.py's People

Contributors

larsko avatar ostergaardl-els avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.