adsaffilpipeline's Introduction

ADSAffil

Affiliation augmentation for augment_pipeline. The pipeline relies on a set of predefined affiliation strings and affiliation data to assign canonical affiliations to affiliation metadata; it uses both human-curated affiliation data and data derived from machine learning.

Queues

There are two queues:

  • augment-affiliation: take a record's affiliation and (try to) augment it

  • output-record: send the results of augmentation to master-pipeline via msg

Interactive operation

Maintenance: Creating pickle files

The pickle file contains two dicts: affil_dict, the dictionary of curated affiliation strings and their IDs; and canon_dict, the dictionary of canonical IDs, abbreviations, and parent-child relationships.
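As a rough illustration of the two-dictionary layout described above, the following sketch writes and reads a pickle holding affil_dict and canon_dict. The function names, the 2-tuple storage layout, and the field names inside canon_dict are assumptions made for this example; the actual file produced by run.py -mp may be organized differently.

```python
import pickle


def save_affil_pickle(path, affil_dict, canon_dict):
    """Write both dictionaries to one pickle file (assumed layout: a 2-tuple)."""
    with open(path, "wb") as fh:
        pickle.dump((affil_dict, canon_dict), fh)


def load_affil_pickle(path):
    """Read the two dictionaries back in the order they were written."""
    with open(path, "rb") as fh:
        affil_dict, canon_dict = pickle.load(fh)
    return affil_dict, canon_dict


# Illustrative data only. The ID and name come from the test record
# mentioned in this README; the field names and the abbreviation value
# are placeholders, not the pipeline's real schema.
affil_dict = {"Yale University Astronomy Department": "A00928"}
canon_dict = {
    "A00928": {"abbrev": "(placeholder)", "parents": [], "children": []},
}
```

Keeping both dictionaries in a single file means a lookup from curated string to ID, and from ID to canonical information, can be restored with one load call.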

To generate the pickle file from the command line, run the pipeline with:

run.py -mp

This assumes that the file paths and names set in config.py point to the actual locations of all of the input files.


Maintenance: Send a test record for debugging purposes

To send a single record through the augmentation pipeline to master, run the pipeline with:

run.py -d

This assumes that config.PICKLE_FILE already exists. If the pipeline is successful, the record in master_pipeline.records for 2002ApJ...576..963T should have been augmented with the affiliation A00928 (Yale University Astronomy Department) for all three authors.


Production: Augment a list of records in a JSON file (e.g. exported by Solr)

To augment a set of records in file FILEPATH, run the pipeline with:

run.py -f FILEPATH

Maintainer

Matthew Templeton, ADS

adsaffilpipeline's People

Contributors

marblestation, nemanjamart, seasidesparrow, spacemansteve

adsaffilpipeline's Issues

Strings with HTML entities are being split improperly during the matching process

Individual author-affiliation strings can contain HTML entities (e.g. &amp;). These can be fixed by calling self._norm_affil() after self.instring has been defined (see https://github.com/seasidesparrow/ADSAffilPipeline/blob/0253522469e4c04edf18b38c165a5b287999eebd/ADSAffil/app.py#L112), but this normalization currently happens after the affiliation string is split, so the entities are still present when the split is made. Instead of calling it on each individual "v", call it on "s" in this code block.
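The ordering problem can be sketched in isolation. The example below is not the pipeline's code: it assumes the separator is a semicolon and uses the standard-library html.unescape as a stand-in for self._norm_affil(), whose real behavior may differ. Both function names are hypothetical.

```python
import html


def split_then_normalize(instring, sep=";"):
    """Order described in the issue: split first, normalize each piece."""
    return [html.unescape(v.strip()) for v in instring.split(sep) if v.strip()]


def normalize_then_split(instring, sep=";"):
    """Proposed order: normalize the whole string, then split it."""
    s = html.unescape(instring)
    return [v.strip() for v in s.split(sep) if v.strip()]


raw = (
    "Dept. of Physics &amp; Astronomy, Yale University; "
    "Harvard-Smithsonian Center for Astrophysics"
)

# Splitting first cuts the entity at its trailing semicolon: 3 pieces.
print(split_then_normalize(raw))
# Normalizing first keeps the two affiliations intact: 2 pieces.
print(normalize_then_split(raw))
```

With the normalize-first order, the semicolon inside the entity is gone before the split runs, so only genuine separators remain.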

Problem with unclosed XML tags while building a new pickle file (-mp option)

If the input contains any unclosed tags (especially an opening tag without a corresponding closing tag), the -mp process will fail in https://github.com/seasidesparrow/ADSAffilPipeline/blob/faa9b2170ecd503414ce2d4fc89ae6e001d4b17d/ADSAffil/utils.py#L120

This happens because the bs4 step that searches for HTML markup concatenates lines until it finds a corresponding closing tag. As a result, splitting such a "line" on tabs yields far more than 2 columns.

At a minimum, the split needs a try-except, but it would be best to either remove or modify the current dependency on bs4. If it is modified, something other than .find_all('p') would be best.
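A minimal sketch of the try-except approach, assuming each well-formed line holds exactly two tab-separated columns. The function name, the column names, and the column order are hypothetical, and this does not reproduce the code in utils.py.

```python
def parse_tab_lines(lines):
    """Split each line on tabs, collecting malformed lines instead of failing."""
    good, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        try:
            # Unpacking raises ValueError unless there are exactly 2 fields.
            affil_id, affil_string = fields
        except ValueError:
            bad.append((lineno, line))
            continue
        good.append((affil_id, affil_string))
    return good, bad


sample = [
    "A00928\tYale University Astronomy Department\n",
    # Simulates two rows concatenated by an unclosed tag: 4 columns.
    "A00001\tFirst affiliation\tA00002\tSecond affiliation\n",
]
good, bad = parse_tab_lines(sample)
```

Collecting the rejected lines with their line numbers, rather than silently skipping them, makes it possible to locate and repair the unclosed tags in the source file.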
