
adscitationcapture's Introduction

ADSAbs 2.0

This is the Flask application for the new ADS website.

After making sure that you have the Python development files installed (package python-dev on Debian-based systems), the basic installation is:

$ virtualenv some-python
$ source some-python/bin/activate
$ pip install -U pip
$ pip install -U distribute
$ CFLAGS= pip install -r requirements.txt

You will need a running MongoDB instance. Assuming you are just testing things, you can do:

$ cat <<EOF> ./mongo_auth.js
use admin
db.addUser('foo','bar')
db.auth('foo','bar')
use adsabs
db.addUser('adsabs','adsabs')
use adsdata
db.addUser('adsdata','adsdata')
use adsgut
db.addUser('adsgut','adsgut')
EOF

$ mongo < ./mongo_auth.js

Then, edit config/local_config.py and add:

class LocalConfig(object):
    MONGOALCHEMY_USER = 'adsabs'
    MONGOALCHEMY_PASSWORD = 'adsabs'
    ADSDATA_MONGO_USER = 'adsdata'
    ADSDATA_MONGO_PASSWORD = 'adsdata'
    MONGODB_SETTINGS = {'HOST': 'mongodb://adsgut:adsgut@localhost/adsgut', 'DB': 'adsgut'}
    THUMBNAIL_MONGO_PASSWORD = None

For more details, see http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/BeerInstallation or look into the Jenkins task, where we test the setup: http://adswhy:9090/view/BEER/job/BEER-05-live-service/configure

Jenkins

You can have Jenkins automatically test your repository/branch:

  1. go to adswhy:9090 (log in)
  2. click on 'create a new job'
  3. select 'copy from' = BEER-02-adsabs
  4. change some values:
    • the git url (and optionally the name of the branch you want to test)
    • the port of the MongoDB instance (a MongoDB is created for each test, so you just want to avoid using a port that other tests use)
    • email (to notify you of build problems)

adscitationcapture's People

Contributors

dependabot[bot], ehenneken, marblestation, nemanjamart, tjacovich


adscitationcapture's Issues

Software Source bibcode/publication date mismatch

Software source/concept DOI records do not have metadata independent of the most recent version, even though they have a unique DOI. As the metadata updates, we keep the bibcode year in these records frozen. This ultimately results in the bibcode year and the publication date being out of sync, sometimes by several years. To address this:

  1. Modify maintenance_metadata to retain the publication year for concept dois
  2. Write a new maintenance task that extracts the publication year from the citation_target_version table and updates the current metadata accordingly.
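Step 2 could look roughly like the following. This is a hypothetical sketch only: the real citation_target_version schema and the maintenance task API in ADSCitationCapture may differ, so the row format here is an assumption.

```python
def concept_publication_year(version_rows):
    """Given the version records associated with a concept DOI, return the
    publication year of the earliest version, to be written back into the
    concept record's metadata.

    version_rows: iterable of dicts with a 'pubdate' key ('YYYY-MM-DD'),
    standing in for rows from the citation_target_version table (assumed
    shape, not the actual schema).
    """
    pubdates = sorted(row['pubdate'] for row in version_rows if row.get('pubdate'))
    if not pubdates:
        return None  # no dated versions known; leave metadata untouched
    return int(pubdates[0].split('-')[0])
```

A periodic maintenance task would then compare this year against the frozen bibcode year and update the stored metadata when they diverge.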

Normalize author names

We need to implement a function that guesses first and last names across the different possible formats (e.g., "Firstname Lastname", "Lastname, Firstname"), including multiple last names and middle names. It should then be applied to address this:

'author_norm': authors, # TODO: This should be a list of normalized authors

This feature could be implemented in pyingest.
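A minimal sketch of the heuristic (not pyingest's actual API; the function name and the simple "last token is the surname" rule are assumptions, and the rule is known to fail for compound surnames like "van der Berg"):

```python
def guess_last_first(author):
    """Return the author name normalized to 'Lastname, Firstname'.

    If the input already contains a comma, assume 'Lastname, Firstname'.
    Otherwise assume the final token is the last name, which is a naive
    heuristic that mishandles multi-word surnames.
    """
    author = author.strip()
    if ',' in author:
        last, first = [p.strip() for p in author.split(',', 1)]
    else:
        parts = author.split()
        if len(parts) == 1:
            return parts[0]  # single token: nothing to reorder
        last, first = parts[-1], ' '.join(parts[:-1])
    return '{0}, {1}'.format(last, first)
```

A production version would need a list of surname particles (van, de, der, ...) and handling for initials, which is why implementing it in pyingest, where other parsers could share it, makes sense.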

Expand unittests to cover delta_computation and reader_import

Unit tests do not currently cover the file import portion of CitationCapture. It would be good to add tests for both DeltaComputation and ReaderImport when the latter is merged into the mainline branch, so that we can confirm that the behavior of the two imports is consistent.

DataCite parser fails with affiliations that are None

{"asctime": "2019-01-07T22:01:45.260Z", 
"levelname": "ERROR", 
"processName": "ForkPoolWorker-3", 
"message": "Failed parsing", 
"exc_info": "Traceback (most recent call last):\n
  File \"/app/ADSCitationCapture/doi.py\", line 208, in _parse_metadata_zenodo_doi\n
    parsed_metadata = dc.parse(raw_metadata)\n  File \"/usr/local/lib/python2.7/dist-packages/pyingest/parsers/datacite.py\", line 168, in parse\n
    aaffils.append(aff.strip())\n
AttributeError: 'NoneType' object has no attribute 'strip'", 
"timestamp": "2019-01-07T22:01:45.260Z", 
"hostname": "adsvm05"}

Deal with duplicated citations where not all are resolved

ADS Classic can output a list of citations coming from the same citing bibcode but different sources. Sometimes the resolver did not work for one of the sources, so it provides an empty cited bibcode. We must ensure that when we discard duplicates, we keep those that are resolved.
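A sketch of the selection rule, assuming citations arrive as (citing, cited) pairs with an empty string for an unresolved cited bibcode (this pair representation is an assumption, not the pipeline's actual data model):

```python
def dedupe_citations(citations):
    """Keep one entry per citing bibcode, preferring a resolved one.

    citations: list of (citing_bibcode, cited_bibcode) pairs, where an
    unresolved source yields an empty cited bibcode. A resolved entry
    always wins over an unresolved duplicate, regardless of order.
    """
    best = {}
    for citing, cited in citations:
        if citing not in best or (cited and not best[citing]):
            best[citing] = cited
    return list(best.items())
```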

Validate/unique list of citations

In order to remove duplicates and invalid citing bibcodes from a list of citations, two options are available:

  1. Use metadata from ADS classic at the time of citation ingest / processing
  2. Use metadata from our API when software records are serialized and sent to the pipeline

If we go with option 1, we would use the mapping between all known bibcodes and canonical bibcodes, which is found in the file /proj/ads/abstracts/config/bibcodes.list.all2can. The first column contains all the known bibcodes in ADS, and the second column contains their canonical version. Given a set of citing bibcodes, one would look each one up in this list. If an entry exists, replace the bibcode with its canonical version; if it does not, the bibcode is invalid and should be ignored. At the end of the process, the resulting list of bibcodes is de-duplicated.
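The option-1 lookup can be sketched as follows, assuming the all2can file has already been loaded into a dict (the loading step and the function name are illustrative, not existing pipeline code):

```python
def canonicalize(citing_bibcodes, all2can):
    """Map citing bibcodes to canonical form, dropping invalid ones.

    all2can: dict mapping every known bibcode to its canonical version,
    as built from the two columns of bibcodes.list.all2can. Bibcodes
    absent from the mapping are considered invalid and ignored; the
    result is de-duplicated while preserving input order.
    """
    seen = set()
    result = []
    for bib in citing_bibcodes:
        canonical = all2can.get(bib)
        if canonical and canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```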

If we go with option 2, we would use an API call to bigquery and submit our full list of citing bibcodes at once. The search engine resolves each bibcode to its canonical form and returns a list of unique records; the bibcodes of these records are the ones to use as our citation list.

Since bibcodes can be remapped at any point, one would periodically need to perform this operation and update the relevant records on a regular basis (at least weekly). ADS classic does this via updates to its nonbib pipeline, so a similar model could be used by the citation capture pipeline.

Date field should contain pubdate

Pubdate and year are both filled from the same source:

pubdate = parsed_metadata.get('pubdate', get_date().strftime("%Y-%m-%d"))

For instance, 2016zndo.....55143A in solr has "year": "2016" and "pubdate": "2016-06-21". But date is filled using the current timestamp:

'date': (citation_change.timestamp.ToDatetime()+datetime.timedelta(minutes=30)).strftime('%Y-%m-%dT%H:%M:%S.%fZ'), # TODO: Why this date has to be 30 minutes in advance? This is based on ADSImportPipeline SolrAdapter

This means that a query like year:2016 NOT pubdate:[2016-01 TO 2016-12] returns results, because solr uses the date field, not pubdate, when the user queries with pubdate:.
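One possible fix is to derive date from the record's pubdate rather than the ingest timestamp. This is a suggested sketch, not the pipeline's current behavior, and it assumes pubdate is always a full YYYY-MM-DD value (records with only a year would need separate handling):

```python
from datetime import datetime

def solr_date_from_pubdate(pubdate):
    """Build the solr 'date' field from pubdate ('YYYY-MM-DD') so that
    year/pubdate range queries and the date field stay in sync."""
    return datetime.strptime(pubdate, '%Y-%m-%d').strftime('%Y-%m-%dT%H:%M:%S.%fZ')
```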
