
crossref's Introduction

Store and process the Crossref Database

This repository downloads Crossref metadata using the Crossref API. The items retrieved are stored in MongoDB to preserve their raw structure. This design allows for flexible downstream analyses.

MongoDB

MongoDB is run via Docker. It's available on the host machine at mongodb://localhost:27017/.

docker run \
  --name=mongo-crossref \
  --publish=27017:27017 \
  --volume=`pwd`/mongo.db:/data/db \
  --rm \
  mongo:3.4.2
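
Once the container is running, a quick sanity check is to query the database through the mongo shell bundled in the container. This is a sketch; the crossref database and works collection names match the commands used below, and the count will be zero until a download has run.

# Count the cached works via the mongo shell inside the container
docker exec mongo-crossref mongo crossref --quiet --eval 'db.works.count()'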

Execution

works

With MongoDB running, execute the following commands:

# Download all works
# To start fresh, use `--cursor=*`
# If querying fails midway, you can extract the cursor of the
# last successful query from the tail of query-works.log.
# Then rerun download.py, passing the intermediate cursor
# to --cursor instead of *.
python download.py \
  --component=works \
  --batch-size=550 \
  --log=logs/query-works.log \
  --cursor=*
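
For reference, the deep paging that download.py performs can be sketched directly against the public Crossref REST API. This is only a sketch, not the script itself: it assumes curl and jq are installed, uses rows in place of the --batch-size option above, and prints batch sizes instead of inserting the items into MongoDB.

# Walk the /works endpoint with cursor-based deep paging
CURSOR='*'
while true; do
  RESPONSE=$(curl --silent --get 'https://api.crossref.org/works' \
    --data-urlencode "cursor=$CURSOR" --data 'rows=550')
  COUNT=$(echo "$RESPONSE" | jq '.message.items | length')
  echo "retrieved $COUNT works"
  # An empty batch means the result set is exhausted
  [ "$COUNT" -eq 0 ] && break
  # The next-cursor token resumes the query where this batch left off
  CURSOR=$(echo "$RESPONSE" | jq --raw-output '.message."next-cursor"')
done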

# Export mongodb works collection to JSON
mongoexport \
  --db=crossref \
  --collection=works \
  | xz > data/mongo-export/crossref-works.json.xz

See data/mongo-export for more information on crossref-works.json.xz. Note that creating this file from the Crossref API takes several weeks. Users are encouraged to use the cached version available on figshare (see also Other resources below).
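
Because mongoexport writes one JSON document per line, the compressed export can be inspected without importing it anywhere. A minimal sketch, assuming xz-utils and Python are available:

# Pretty-print the first exported work to see the record structure
xzcat data/mongo-export/crossref-works.json.xz \
  | head --lines=1 \
  | python -m json.tool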

1.works-to-dataframe.ipynb is a Jupyter notebook that extracts tabular datasets of works (TSVs), which are tracked using Git LFS:

  • doi.tsv.xz: a table where each row is a work, with columns for the DOI, type, and issued date.
  • doi-to-issn.tsv.xz: a table where each row is a work (DOI) to journal (ISSN) mapping.
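
The TSVs can similarly be previewed straight from their xz archives. The paths below are assumptions based on the file names above; adjust them to wherever Git LFS places the files in your checkout.

# Preview the first few rows of each extracted table
xzcat data/doi.tsv.xz | head --lines=5
xzcat data/doi-to-issn.tsv.xz | head --lines=5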

types

With MongoDB running, execute the following command:

python download.py \
  --component=types \
  --log=logs/query-types.log

Environment

This repository uses conda to manage its environment as specified in environment.yml. Install the environment with:

conda env create --file=environment.yml

Then use source activate crossref and source deactivate to activate or deactivate the environment. On Windows, use activate crossref and deactivate instead.

Other resources

Ideally, Crossref would provide a complete database dump, rather than requiring users to go through the inefficient process of querying all works via the API: see CrossRef/rest-api-doc#271. Until then, users should check out the Crossref data currently hosted by this repository, whose query date is 2017-03-21, and its corresponding figshare deposit.

For users who need more recent data, Bryan Newbold used this codebase to create a MongoDB dump dated January 2018 (query date of approximately 2018-01-10), which he uploaded to the Internet Archive. His output file crossref-works.2018-01-21.json.xz contains 93,585,242 DOIs and consumes 28.9 GB compared to 87,542,370 DOIs and 7.0 GB for the crossref-works.json.xz dated 2017-03-21. This increased size is presumably due to the addition of I4OC references to Crossref work records.

Bryan Newbold has also created a September 2018 release, which is uploaded to the Internet Archive. This repository is currently seeking contributions to update the convenient TSV outputs based on the more recent database dumps.

Daniel Ecer also downloaded the Crossref work metadata in January 2018, using the codebase at elifesciences/datacapsule-crossref. His database dump is available on figshare. While the multi-part format of this dump is likely less convenient than the dumps produced by this repository, Daniel Ecer's analysis also exports a DOI-to-DOI table of citations/references available here. This citation catalog contains 314,785,303 citations (summarized here) and is thus more comprehensive than the catalog available from greenelab/opencitations.

Acknowledgements

This work is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4552 to @cgreene.

crossref's People

Contributors

bnewbold, de-code, dhimmel


crossref's Issues

Export MongoDB database from Docker

The goal is to export the MongoDB database containing the cached Crossref items. We could then archive this file, say on figshare, so others don't have to submit enormous numbers of API queries.
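
One possible approach (a sketch, using the container, database, and collection names from the README) is to run mongoexport through docker exec and compress the stream on the host:

# Export the works collection from the running container and compress it
docker exec mongo-crossref mongoexport --db=crossref --collection=works \
  | xz > crossref-works.json.xz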

Storing at zenodo?

I am not sure about the legal status of the downloaded data,
but if it can be worked out properly, then Zenodo.org could be a good place to store the data.
First, it becomes easily citable; second, records can be refreshed with new versions (while the old ones remain available); finally, they have good and fast infrastructure.
If needed, the data could be split into smaller parts.
(There is a 50 GB limit per upload, but no limit on the number of uploads.)

Unicode issue - Non-English characters as question marks

I am able to download, decompress, load, and run everything successfully. But there is one big issue: non-English characters in crossref-works.json are replaced by question marks ('?').

Below is an example of one such record.

{ "_id" : { "$oid" : "58d96fec0c62134f84023a29" }, "indexed" : { "date-parts" : [ [ 2016, 10, 25 ] ], "date-time" : "2016-10-25T11:26:19Z", "timestamp" : { "$numberLong" : "1477394779953" } }, "reference-count" : 97, "publisher" : "SAGE Publications", "issue" : "5", "content-domain" : { "domain" : [], "crossmark-restriction" : false }, "short-container-title" : [ "European Journal of Cardiovascular Prevention & Rehabilitation" ], "cited-count" : 0, "published-print" : { "date-parts" : [ [ 2006, 10 ] ] }, "DOI" : "10.1097/01.hjr.0000224482.95597.7a", "type" : "journal-article", "created" : { "date-parts" : [ [ 2006, 9, 22 ] ], "date-time" : "2006-09-22T08:19:38Z", "timestamp" : { "$numberLong" : "1158913178000" } }, "page" : "687-694", "source" : "CrossRef", "title" : [ "ESC Study Group of Sports Cardiology Position Paper on adverse cardiovascular effects of doping in athletes" ], "prefix" : "http://id.crossref.org/prefix/10.1177", "volume" : "13", "author" : [ { "given" : "Asterios", "family" : "Deligiannis", "affiliation" : [] }, { "given" : "Hans", "family" : "Bj??rnstad", "affiliation" : [] }, { "given" : "Francois", "family" : "Carre", "affiliation" : [] }, { "given" : "Hein", "family" : "Heidb??chel", "affiliation" : [] }, { "given" : "Evangelia", "family" : "Kouidi", "affiliation" : [] }, { "given" : "Nicole M.", "family" : "Panhuyzen-Goedkoop", "affiliation" : [] }, { "given" : "Fabio", "family" : "Pigozzi", "affiliation" : [] }, { "given" : "Wilhelm", "family" : "Sch??nzer", "affiliation" : [] }, { "given" : "Luc", "family" : "Vanhees", "affiliation" : [] } ], "member" : "http://id.crossref.org/member/179", "container-title" : [ "European Journal of Cardiovascular Prevention & Rehabilitation" ], "original-title" : [], "deposited" : { "date-parts" : [ [ 2011, 7, 28 ] ], "date-time" : "2011-07-28T15:46:48Z", "timestamp" : { "$numberLong" : "1311868008000" } }, "score" : 1, "subtitle" : [ "" ], "short-title" : [], "issued" : { "date-parts" : [ [ 2006, 10 ] ] }, "URL" : "http://dx.doi.org/10.1097/01.hjr.0000224482.95597.7a", "ISSN" : [ "1741-8267" ], "citing-count" : 97, "subject" : [ "Medicine(all)" ] }

What percent of scholarly publication DOIs are registered with Crossref?

Let's use this issue to jot down notes related to what percent of scholarly publication DOIs are registered with Crossref.

As a refresher, there are many DOI Registration Agencies (RAs). For example, EIDR is a DOI RA for entertainment, so you can actually get DOI metadata for some pornography! There is even discussion regarding DOIs for construction products. Shoutout to @jenniferlin15, who helped me understand these intricacies of the DOI system.

For our analyses, we're most interested in DOIs for scholarly content. There are RAs other than Crossref that engage with scholarly content. Some examples are:

  1. mEDRA which "provides DOI registration services to publishers, academic institutions, research centres and intermediaries in Italy, in the EU market and internationally."
  2. DataCite which "provides persistent identifiers (DOIs) for research data."

We're mostly interested in cataloging all DOIs for scholarly publications in relation to our Sci-Hub coverage project.

2019-09 IA Bulk Dump Update

I ran another bulk dump using the scripts in this repository. The scrape started on 2019-09-09 and ended around 2019-10-05, yielding 107,151,607 DOIs. The xz-compressed file is 46 GB.

Available at: https://archive.org/details/crossref_doi_dump_201909

SHA-256:

338e01b613f34624a3408f781fecf746e7fc1ce6e4636186e57775d18d2a6ebc  crossref-works.2019-09-09.json.xz
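
After downloading, the archive can be verified against this checksum with sha256sum from GNU coreutils:

# Verify the archive against the published SHA-256
echo "338e01b613f34624a3408f781fecf746e7fc1ce6e4636186e57775d18d2a6ebc  crossref-works.2019-09-09.json.xz" \
  | sha256sum --check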

Updating the README with a link might make sense, as this repository is the top hit for "Crossref Metadata Bulk Dump" (at least with my search filters).

I would encourage anyone using these dumps in the future to switch over to Daniel Ecer's dumps, which are posted to figshare more frequently:
https://figshare.com/articles/Crossref_Works_Dump_-_August_2019/9751865

Question on keyword coverage

Many thanks for this work - we used it for a time before Crossref started offering their own data dumps.

We did notice a small discrepancy in the keywords between this dump and the Crossref dump. The dump here appears to have much better keyword coverage for papers (over 90%), while the current Crossref dump has less than 5% coverage.

Where could I look to understand the methodology you used to gather keywords for this dump? We may need to use that methodology to augment what the crossref dump offers.

Thanks
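
For comparing dumps, coverage can be measured directly from any newline-delimited export. A sketch, assuming the keywords in question correspond to the per-work subject field and that jq is installed (this is slow on a full dump):

# Tally works with and without a non-empty subject list
xzcat crossref-works.json.xz \
  | jq 'has("subject") and (.subject | length > 0)' \
  | sort | uniq --count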

Future updates of bulk Crossref metadata corpus

In April 2017 @dhimmel uploaded a bulk snapshot of Crossref metadata to figshare (where it was assigned DOI 10.6084/m9.figshare.4816720.v1).

While this metadata can be scraped from the Crossref API by anybody (eg, using the scripts in this repository), I found it really helpful to grab it in bulk form.

I'm curious whether this dump could be updated on an annual or quarterly basis. I don't have a particular need for the data to be versioned (eg, assigned sequential .v2, .v3 DOIs at figshare), but that would probably help with discovery for other folks and generally be a best practice.

If nobody has time to do such an update I will probably run the scripts from this repository and push to archive.org at: https://archive.org/details/ia_biblio_metadata.

Schema of the datasets

Hello and thanks for the data, much appreciated!

I am currently trying to assign a unique author ID to each author of each paper (from arXiv) given the paper's DOI. The task is trivial given the dataset, but I am just not sure which one to download since I have no clue about what information each dataset contains.

I think it would be great if you could share the schema of the datasets presented. For instance, if it's one JSON object per line, it would be great to share at least the first line, or even better, document it properly.

Since I am trying to solve the specific problem I mentioned in the first paragraph, I would also appreciate it if you could direct me to the right dataset (if one exists) while working on this issue.

Much thanks again,
Bora

Invalid character error when using mongoimport

Hello, I was looking for ways to get a complete dump of the metadata in CrossRef and came across this useful project.

I've tried importing your dump file as well as the one archived by Bryan Newbold, but I've come across the following error:

After running this command to import the file (using Windows 10 Pro):

mongoimport --db crossref-mongo --collection works --file "C:\path\to\file\crossrefworks.json.xz"

I get the error:

Failed: error processing document #1: invalid character 'ý' looking for beginning of value
imported 0 documents

I get the same error when I try importing Bryan Newbold's dump file.

It's the first time I'm using MongoDB, so I don't know if I'm doing something wrong. I realise this is not strictly an error with the code in this project, but I thought you might know what's going on with these dump files. Any help would be much appreciated.
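
For what it's worth, the error is consistent with the compressed bytes being fed to mongoimport directly: the xz container format starts with the byte 0xFD, which renders as 'ý'. A sketch of decompressing first and then importing, assuming the xz command-line tool is available:

# Decompress the archive (keeping the original), then import the plain JSON lines
xz --decompress --keep crossref-works.json.xz
mongoimport --db=crossref --collection=works --file=crossref-works.json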
