kermitt2 / biblio-glutton Goto Github PK

View Code? Open in Web Editor NEW

112.0 7.0 14.0 7.44 MB

A high performance bibliographic information service

Java 94.55% JavaScript 4.36% Python 1.08%

openaccess metadata-api bibliographical-references doi pubmed disambiguation hal reference-matching

biblio-glutton's Introduction

biblio-glutton

A framework dedicated to scientific bibliographic information. It includes:

a bibliographical reference matching service: from an input such as a raw bibliographical reference and/or a combination of key metadata, the service will return the disambiguated bibliographical object with in particular its DOI and a set of metadata aggregated from Crossref and other sources,
a fast metadata look-up service: from a "strong" identifier such as DOI, PMID, etc. the service will return a set of metadata aggregated from Crossref and other sources,
various mapping between DOI, PMID, PMC, ISTEX ID and ark, integrated in the bibliographical service,
Open Access resolver: Integration of Open Access links via the Unpaywall dataset from Impactstory,
Gap and daily update for Crossref resources (via the Crossref REST API), so that your glutton data service stays always in sync with Crossref,
MeSH classes mapping for PubMed articles.

biblio-glutton should be very handy if you need to run and scale a local full "Crossref" database and API, to aggregate Crossref, Pubmed and other common bibliographical records and to match a large amount of bibliographical records or raw bibliographical reference strings.

The framework is designed both for speed (with several thousands request per second for look-up) and matching accuracy. It can be scaled horizontally as needed and can provide high availability.

Benchmarking against the Crossref REST API is presented below.

In the Glutton family, the following complementary tools are available for taking advantage of Open Access resources:

biblio-glutton-extension: A browser extension (Firefox & Chrome) for providing bibliographical services, like identifying dynamically Open Access resources on web pages and providing contextual citation services.
biblio-glutton-harvester: A robust, fault tolerant, Python utility for harvesting efficiently (multi-threaded) a large Open Access collection of PDF (Unpaywall, PubMed Central), with the possibility to upload content on Amazon S3,

Current stable version of biblio-glutton is 0.3. Working version is 0.4-SNAPSHOT.

Documentation

The full documentation is available here, including an evaluation of the bibliographical reference matching and some expected runtime information.

How to cite

If you want to cite this work, please refer to the present GitHub project, together with the Software Heritage project-level permanent identifier and do please indicate any author name. For example, with BibTeX:

@misc{biblio-glutton,
    title = {biblio-glutton},
    url = {https://github.com/kermitt2/biblio-glutton},
    publisher = {GitHub},
    year = {2018--2024},
    archivePrefix = {swh},
    eprint = {1:dir:a5a4585625424d7c7428654dbe863837aeda8fa7}
}

Main authors and contact

Patrice Lopez (@kermitt2, [email protected])
Luca Foppiano (@lfoppiano)

License

Distributed under Apache 2.0 license.

If you contribute to this project, you agree to share your contribution following this license.

biblio-glutton's People

Contributors

Stargazers

Watchers

Forkers

scitedotai bfirsh bnewbold grace-reed hkocoglu karatekaneen aazhar nguyendoanquyet mkardas anowak steppo83 marktab tskippervold ianna-ding

biblio-glutton's Issues

Better selection step

Currently (it's an early version!), we simply use the first best result of the search step as camdidate for a matching. Obviously, we should consider the top n best candidate, and select the best one based on the parsed metadata (these parsed metadata can come as additional arguments of the query, see #12 or after a parsing of the reference string with GROBID citation model).

The matching service would then look much more like a traditional record matching service with a blocking step (search-based) and a fine-grained matching step where metadata are considered to select the best candidate in the block.

Centrality of DOI/Crossref

In my spare time I am starting to add support for fatcat (https://fatcat.wiki) metadata to biblio-glutton. To start this would probably be a branch, then see if it make sense to upstream.

As i'm poking through the code I have some conceptual questions about how Crossref metadata and DOIs are currently used. My goal is to be able to match against fatcat releases which do not have DOIs (but always have a fatcat internal identifer, and may have other external identifiers like arxiv ID or PMID).

is it currently possible to do bibliographic lookups for works that don't have a DOI? Eg, could pubmed metadata (title, authors, etc) for works that have a PMID but no DOI be included in the elasticsearch index? I think not, but want to confirm
are arxiv ids supported anywhere? cc: @bfirsh who has great matching and I assumed had lookups by arxiv id
for the code path where a search match against elasticsearch has been made, are all the "enrichments" of other metadata done on a DOI basis? Or are istex/pii/pmid lookups chained together to find additional identifiers and do key/value lookups with those non-DOI identifiers? I think only DOIs are used, but want to confirm
are lookups by PMID/ISTEX/PII only performed for API lookups that supply that identifier directly? I think this is the case but want to confirm.

I'd be happy to submit a README update clarifying some of these once I understand it myself. Or maybe a new file as README is getting long!

Extension: Extend Biblio-glutton to DBLP

Hello @here,

Thanks a lot for the great tool.

Some months ago, we (with @ste210) started investigating the idea of extending Crossref with the DBLP dataset as part of the PatCit project.

After some time exchanging, here are the main findings (see full discussion thread here):

the DBLP dataset (w/o theses) contains 4,777,622 docs
3,900,859 of these docs have a DOI (81.7%)
3,520,018 of these DOIs are also in the CrossRef Database (90%)

It leaves a good number of relevant publications (based on conference rank) which are not covered by CrossRef but which have high quality bibliographical references from DBLP (see breakdown here)

At this point, my idea was to:

take the subset of documents which are in DBLP but not in CrossRef
map the DBLP xml objects to the crossref jsonl format - for the restriction of attributes used by biblio-glutton in the matching process
append the DBLP data (properly formated) to the Crossref database
there we go

I know that biblio-glutton was thought to be DOI-centric. That being said, the DOI is mainly used to harvest extra data from PubMed, Unpaywall, etc right? So, for the bibliographical references in the DBLP which have no DOI, we could replace the DOI value by the DBLP unique identifier. This is not very pretty but it could do the work right?

I might miss the complexity due to the internal functioning of biblio-glutton, so, let me know if you think that this is unrealistic ;)

If it sounds reasonable, I'll be happy to share the code/feedback on the hack here and on PatCit.

Thanks in advance,

Cyril

Use Openalex over Crossref?

We've started looking into Openalex which is an aggregate of multiple data sources (including Crossref) and I was wondering what it would take for switching to that dataset instead of the one from Crossref?

Regarding the indexing it should be relatively simple to just rewrite the logic to reconstruct the same format as the one being used today.

How about the internal db which (I assume) contains all the data that's being returned? Would that work to keep the same key but change the content completely or would you have to change something else for that as well?

Edit:

Forgot to add link: https://openalex.org/about

Add an asbtract and MeSH classes look-up service

Some abstract are present in crossref metadata, but of course many are available via MEDLINE data with the nice MeSH classes. The sub-package pubmed-glutton parse all MEDLINE data and map PMID with this complete metadata. So, as everything is already available, it would be nice to offer, as complementary service in the API, the look-up of abstracts (from CrossRef and PubMed) and MeSH classes.

Mixed query: case DOI + first author last name

The following query is rejected:

http://localhost:8080/service/lookup?doi=10.3389%2Ffnhum.2011.00050&firstAuthor=Keller

We have a DOI and a first author last name.

{
  "code": "400",
  "message": "The supplied parameters were not sufficient to select the query"
}

Of course if we send a DOI only it works:

http://localhost:8080/service/lookup?doi=10.3389%2Ffnhum.2011.00050

So this is not consistent (more arguments is always better):

we should accept the query with these parameters
we perform a normal DOI lookup with the provided DOI
by default "postValidate" is true, so we check if the provided author name soft matches the author name of the record, if it matches, return the result, otherwise the not found due to post-validation message / in the above example, post-validation fails, because the author for this DOI is not "Keller"
if "postValidate" is false, we ignore first author last name, and simply return the result

Add REST service for mixed matching approach

From GROBID we usually have both the raw full reference string and the parsed extracted fields (authors and title).

As addition to the present stuff, it would be nice to have as input to the matching service:

arguments "raw full reference string" plus author list (all authors separated by something, not just the first author), plus the title
this would allow to add a post validation (optional) with all the authors+title
this would allow to support the following approach: match first with first author last name + title (3-4 times faster) and only if it fails or author+title metadata are not available, try a matching with the raw full reference string (more expensive)
this could be exploited for more precise matching after a search-based blocking (sometimes two version of the same paper -one from a conference and one more complete in a journal issue- have the same title, but one more author, so the full list of authors is useful to select the best candidate)
finally we could integrate to glutton for raw reference string only-query, as an option, the parsing of the reference with GROBID, and add the post-validation in all the case. So even with raw reference, we would have integrate reliable post-validation/selection.

Post-validation is necessary to avoid false positive due to the search-based step (which is only the block step normally).

Input data preprocessing to remove noise

I just found the following problem, although since the data is extracted from a PDF I'm not sure it's the right place where to fix the issue.

The following DOI: 10.1063/1.1905789͔ comes out with a nasty 9͔ ...

Crossref find the record: https://search.crossref.org/?from_ui=&q=10.1063%2F1.1905789%CD%94
The data is extracted from the publisher version of the manuscript: https://aip.scitation.org/doi/pdf/10.1063/1.1905789

Although I think this is not glutton lookup's responsibility, I think having a small pre-processing that removes crap could be nice anyway .

Update: I've checked and since we lookup by DOI directly from LMDB it's a rather strict matching (we lowercase already)

Missing OA URL in aggregated metadata

This is so obvious that I didn't check earlier... the OA URL is missing in the "aggregated" record:

curl "http://localhost:8080/service/lookup?doi=10.1038/nature12373"

{
	"reference-count": 30,
	"publisher": "Springer Nature",
	"issue": "7460",
	"license": [{
		"URL": "http://www.springer.com/tdm",
		"start": {
			"date-parts": [
				[2013, 8, 1]
			],
			"date-time": "2013-08-01T00:00:00Z",
			"timestamp": {
				"$numberLong": "1375315200000"
			}
		},
		"delay-in-days": 0,
		"content-version": "unspecified"
	}],
	"content-domain": {
		"domain": [],
		"crossmark-restriction": false
	},
	"short-container-title": ["Nature"],
	"published-print": {
		"date-parts": [
			[2013, 8]
		]
	},
	"DOI": "10.1038/nature12373",
	"type": "journal-article",
	"created": {
		"date-parts": [
			[2013, 7, 30]
		],
		"date-time": "2013-07-30T12:59:50Z",
		"timestamp": {
			"$numberLong": "1375189190000"
		}
	},
	"page": "54-58",
	"source": "Crossref",
	"is-referenced-by-count": 514,
	"title": ["Nanometre-scale thermometry in a living cell"],
	"prefix": "10.1038",
	"volume": "500",
	"author": [{
		"given": "G.",
		"family": "Kucsko",
		"sequence": "first",
		"affiliation": []
	}, {
		"given": "P. C.",
		"family": "Maurer",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "N. Y.",
		"family": "Yao",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "M.",
		"family": "Kubo",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "H. J.",
		"family": "Noh",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "P. K.",
		"family": "Lo",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "H.",
		"family": "Park",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "M. D.",
		"family": "Lukin",
		"sequence": "additional",
		"affiliation": []
	}],
	"member": "297",
	"published-online": {
		"date-parts": [
			[2013, 8, 1]
		]
	},
	"container-title": ["Nature"],
	"language": "en",
	"link": [{
		"URL": "http://www.nature.com/articles/nature12373.pdf",
		"content-type": "application/pdf",
		"content-version": "vor",
		"intended-application": "text-mining"
	}, {
		"URL": "http://www.nature.com/articles/nature12373",
		"content-type": "text/html",
		"content-version": "vor",
		"intended-application": "text-mining"
	}, {
		"URL": "http://www.nature.com/articles/nature12373.pdf",
		"content-type": "application/pdf",
		"content-version": "vor",
		"intended-application": "similarity-checking"
	}],
	"deposited": {
		"date-parts": [
			[2017, 12, 29]
		],
		"date-time": "2017-12-29T09:39:40Z",
		"timestamp": {
			"$numberLong": "1514540380000"
		}
	},
	"score": 1,
	"issued": {
		"date-parts": [
			[2013, 8]
		]
	},
	"references-count": 30,
	"journal-issue": {
		"published-print": {
			"date-parts": [
				[2013, 8]
			]
		},
		"issue": "7460"
	},
	"alternative-id": ["BFnature12373"],
	"URL": "http://dx.doi.org/10.1038/nature12373",
	"relation": {
		"cites": []
	},
	"ISSN": ["0028-0836", "1476-4687"],
	"issn-type": [{
		"value": "0028-0836",
		"type": "print"
	}, {
		"value": "1476-4687",
		"type": "electronic"
	}],
	"pmid": "23903748",
	"pmcid": "PMC4221854"
}

We see that we don't see the OA url:

curl "http://localhost:8080/service/oa?doi=10.1038/nature12373"
https://dash.harvard.edu/bitstream/handle/1/12285462/Nanometer-Scale%20Thermometry.pdf?sequence=1

Feature request/Question: Keep data up to date, incremental appending

First of all, love the project and really appreciate the work that you are doing!

We have an issue where we want to keep the data as fresh as possible and update it often. By the standard way that are described in the documentation with the bulk datasets this can be done by appending to the huge files and then re-running the indexing. So my question is if there is any way currently to simply add new items (crossref, pubmed etc) or if you have any pointers of where to start if I was to add that API for you?

Best regards, Robin

Result from best OA URL rather in json

Currently results for /service/oa? is just the string of the OA URL.

Finally I think it would be better to have a json answer, for example:

curl "http://localhost:8080/service/oa?doi=10.1107/S0907444911008754"

{
   "oaLink": "http://journals.iucr.org/d/issues/2011/05/00/bw5391/bw5391.pdf"
}

It will be easier to integrate the result in the web extension, and it will make possible to easily add optionally more information in the response (for instance if someone want also the URL of the ISTEX full text, or if we want to integrate custom repo).

Experiment with alternative compression

We use snappy right now for LMDB stored records. There are other compression methods that might be relevant to small objects to get higher compression ratio and faster decompression (in principle we are less interested in compression speed).

See for instance https://morotti.github.io/lzbench-web as a benchmark study. zstd (which has a training mode) and lz4 might be more interesting than snappy for our usage.

Crossref Index: [parse_exception] request body is required

Hello,

Thanks for the great work !

I am trying to build the ES index on a AWS EC2 instance.

After ingesting ~29 millions records, the programme raised a [parse_exception] request body is required.

Loaded 29485000 records in 8941.782 s (12.003649109329235 record/s)
Loaded 29486000 records in 8941.819 s (12.090436464756378 record/s)
Loaded 29487000 records in 8942.022 s (12.126944858781727 record/s)
Loaded 29488000 records in 8942.14 s (12.175967076185024 record/s)
Bulk is rejected... let's medidate 10 seconds about the illusion of time and consciousness
Waiting for 10 seconds
Bulk is rejected... let's medidate 10 seconds about the illusion of time and consciousness
Waiting for 10 seconds
bulk is finally ingested...
Loaded 29489000 records in 8987.688 s (21.52018593440647 record/s)
bulk is finally ingested...
Loaded 29490000 records in 8988.32 s (21.2630236019562 record/s)
Bulk is rejected... let's medidate 10 seconds about the illusion of time and consciousness
Waiting for 10 seconds
Bulk is rejected... let's medidate 10 seconds about the illusion of time and consciousness
Waiting for 10 seconds
[parse_exception] request body is required

/home/ubuntu/biblio-glutton/matching/main.js:357
                                            throw err;
                                            ^
Error: [parse_exception] request body is required
    at respond (/home/ubuntu/biblio-glutton/matching/node_modules/elasticsearch/src/lib/transport.js:308:15)
    at checkRespForFailure (/home/ubuntu/biblio-glutton/matching/node_modules/elasticsearch/src/lib/transport.js:267:7)
    at HttpConnector.<anonymous> (/home/ubuntu/biblio-glutton/matching/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)
    at Unzip.wrapper (/home/ubuntu/biblio-glutton/matching/node_modules/lodash/lodash.js:4929:19)
    at emitNone (events.js:111:20)
    at Unzip.emit (events.js:208:7)
    at endReadableNT (_stream_readable.js:1064:12)
    at _combinedTickCallback (internal/process/next_tick.js:138:11)
    at process._tickCallback (internal/process/next_tick.js:180:9)

After that, I see that the index has been partially built.

Do you have any idea of how I can fix the issue ? If yes, I would also like to know if it is possible not to restart from scratch (e.g, start indexing directly the remaining records) ?

Reproduce issue

$ cd matching/
$ npm install # host='localhost:9200' in my_connection.json
$ node main -dump ~/data/2017-03-21crossref-works.json.xz index

System

AWS EC2 t2.medium
Elastic search latest
java:
- openjdk version "1.8.0_222"
- OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
- OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
elastic search:
- "number" : "7.3.1",
- "build_flavor" : "default",
- "build_type" : "deb",
- "build_hash" : "4749ba6",
- "lucene_version" : "8.1.0",
- "minimum_wire_compatibility_version" : "6.8.0",
- "minimum_index_compatibility_version" : "6.0.0-beta1"

Thanks !

using datafeed to update unpaywall doi OA mapping

after downloading the latest snapshot OA links are obsolete, so it would be nice to be able to use the changefiles from datafeed (unpaywall product for accessing changefiles)

http://unpaywall.org/products/data-feed/

http://unpaywall.org/products/data-feed/changefiles

Move doc to readthedocs

Because it's more readable !

Revisited result format for aggregated sources

As we are moving to more heterogeneous sources, crossref is one bibliographical record among others. To keep everything well separated and avoid destructive merging, the headache of unified representations and the mixture of automated, rule-based and original mapping/merging, we can define the following result format for an aggregated record:

{
  "doi": "10.1028/ijijij".
  "pmid": 52627,
  "pmcid": PMC7828282,
  "crossref": {},
  "pubmed": {},
  "hal": {},
  "dblp": {},
  "unpaywall": {}
}

We would have all strong identifiers are all present in the root of the JSON response. Then each full record from the original source is added, converted into Crossref format (which is like the unixref format).

API would be extended to select sub-set of source-specific records (e.g. source=['crossref','hal']), with default covering all available sources for the bibliographical object.

Finally in case of a matching response, where a disambiguation decision is taken, we can add a matching score at the root of the response.

URI encoded parameters are not working

Trying to integrate glutton in grobid (as a plugable CrossRef API implementation), the Apache HttpClient encodes request parameters, but then the server is not able to manage them:

Example:

failing:

INFO  [2018-12-13 01:21:47,923] org.grobid.core.utilities.glutton.GluttonClient:  (,doi=10.1093/aob/mci237): New request in the pool
INFO  [2018-12-13 01:21:47,924] org.grobid.core.utilities.crossref.CrossrefRequestTask: timedSemaphore acquire... current total: 0, still available: 100
INFO  [2018-12-13 01:21:47,924] org.grobid.core.utilities.glutton.GluttonClient:  (,doi=10.1093/aob/mci237): .. executing
http://localhost:8080/service/lookup?doi=10.1093%2Faob%2Fmci237
WARN  [2018-12-13 01:21:48,097] org.grobid.core.utilities.Consolidation: CrossRef returns error (-1) : org.apache.http.client.ClientProtocolException thrown during request execution :  (,doi=10.1093/aob/mci237)
No message found in json result.

127.0.0.1 - - [13/Dec/2018:01:22:00 +0000] "GET /service/lookup?doi=10.1093%2Faob%2Fmci237 HTTP/1.1" 200 2067 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_111)" 2

working:

curl http://localhost:8080/service/lookup?doi=10.1093/aob/mci237
127.0.0.1 - - [13/Dec/2018:01:11:38 +0000] "GET /service/lookup?doi=10.1093/aob/mci237 HTTP/1.1" 200 2067 "-" "curl/7.47.0" 4

{
	"reference-count": 33,
	"publisher": "Oxford University Press (OUP)",
	"issue": "5",
	"content-domain": {
		"domain": [],
		"crossmark-restriction": false
	},
	"published-print": {
		"date-parts": [
			[2005, 10, 1]
		]
	},
	"DOI": "10.1093/aob/mci237",
	"type": "journal-article",
	"created": {
		"date-parts": [
			[2005, 8, 16]
		],
		"date-time": "2005-08-16T00:13:55Z",
		"timestamp": {
			"$numberLong": "1124151235000"
		}
	},
	"page": "853-861",
	"source": "Crossref",
	"is-referenced-by-count": 46,
	"title": ["Phylogeographical Variation of Chloroplast DNA in Cork Oak (Quercus suber)"],
	"prefix": "10.1093",
	"volume": "96",
	"author": [{
		"given": "ROSELYNE",
		"family": "LUMARET",
		"sequence": "first",
		"affiliation": []
	}, {
		"given": "MATHIEU",
		"family": "TRYPHON-DIONNET",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "HENRI",
		"family": "MICHAUD",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "AURÉLIE",
		"family": "SANUY",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "EMILIE",
		"family": "IPOTESI",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "CÉLINE",
		"family": "BORN",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "CÉLINE",
		"family": "MIR",
		"sequence": "additional",
		"affiliation": []
	}],
	"member": "286",
	"published-online": {
		"date-parts": [
			[2005, 8, 15]
		]
	},
	"container-title": ["Annals of Botany"],
	"language": "en",
	"link": [{
		"URL": "http://academic.oup.com/aob/article-pdf/96/5/853/435617/mci237.pdf",
		"content-type": "unspecified",
		"content-version": "vor",
		"intended-application": "similarity-checking"
	}],
	"deposited": {
		"date-parts": [
			[2017, 10, 11]
		],
		"date-time": "2017-10-11T06:18:36Z",
		"timestamp": {
			"$numberLong": "1507702716000"
		}
	},
	"score": 1,
	"issued": {
		"date-parts": [
			[2005, 8, 15]
		]
	},
	"references-count": 33,
	"journal-issue": {
		"published-online": {
			"date-parts": [
				[2005, 8, 15]
			]
		},
		"published-print": {
			"date-parts": [
				[2005, 10, 1]
			]
		},
		"issue": "5"
	},
	"URL": "http://dx.doi.org/10.1093/aob/mci237",
	"relation": {
		"cites": []
	},
	"ISSN": ["1095-8290", "0305-7364"],
	"issn-type": [{
		"value": "0305-7364",
		"type": "print"
	}, {
		"value": "1095-8290",
		"type": "electronic"
	}],
	"istexId": "17B513E0AC506300197444BB1105AF946504F162",
	"ark": "ark:/67375/HXZ-PBH2VH9H-P",
	"pmid": "16103038",
	"pmid": "16103038",
	"pmcid": "PMC4247051"
}

Maximum number of requests and request/second

After several weeks of investigation, I finally found a rather clean solution for implementing a mechanism for limiting the number of requests, using directly dropwizard:

QoSFilter: Example here allows setting the maximum parallel connections
DoSFilter: allow to set request/seconds and other mechanisms to protect the service.

Both return 503 by default, so no need to implement such a mechanism by hand.

health check errors after a while

Hello

We try to use the healthcheck endpoint, but after a dozen of requests , this error appears :

{
    "HealthCheck": {
        "error": {
            "message": "Platform constant error code: ENOMEM Cannot allocate memory (12)",
            "stack": [
                "org.lmdbjava.ResultCodeMapper.checkRc(ResultCodeMapper.java:114)",
                "org.lmdbjava.Env$Builder.open(Env.java:458)",
                "org.lmdbjava.Env$Builder.open(Env.java:474)",
                "com.scienceminer.lookup.storage.StorageEnvFactory.getEnv(StorageEnvFactory.java:36)",
                "com.scienceminer.lookup.storage.lookup.OALookup.<init>(OALookup.java:43)",
                "com.scienceminer.lookup.web.healthcheck.LookupHealthCheck.check(LookupHealthCheck.java:38)",
                "com.codahale.metrics.health.HealthCheck.execute(HealthCheck.java:320)",
                "com.codahale.metrics.health.HealthCheckRegistry.runHealthChecks(HealthCheckRegistry.java:185)",
                "com.codahale.metrics.servlets.HealthCheckServlet.runHealthChecks(HealthCheckServlet.java:149)",
                "com.codahale.metrics.servlets.HealthCheckServlet.doGet(HealthCheckServlet.java:121)",
                "javax.servlet.http.HttpServlet.service(HttpServlet.java:687)",
                "javax.servlet.http.HttpServlet.service(HttpServlet.java:790)",
                "com.codahale.metrics.servlets.AdminServlet.service(AdminServlet.java:108)",
                "javax.servlet.http.HttpServlet.service(HttpServlet.java:790)",
                "io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:49)",
                "org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)",
                "io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:45)",
                "io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:39)",
                "org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)",
                "org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)",
                "org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)",
                "org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)",
                "org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)",
                "org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)",
                "org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)",
                "org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)",
                "org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)",
                "io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)",
                "org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)",
                "io.dropwizard.jetty.BiDiGzipHandler.handle(BiDiGzipHandler.java:67)",
                "org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)",
                "org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)",
                "org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)",
                "org.eclipse.jetty.server.Server.handle(Server.java:502)",
                "org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)",
                "org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)",
                "org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)",
                "org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)",
                "org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)",
                "org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)",
                "org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)",
                "org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)",
                "org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)",
                "org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)",
                "org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)",
                "org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)",
                "java.lang.Thread.run(Thread.java:748)"
            ]
        },
        "healthy": false,
        "message": "Platform constant error code: ENOMEM Cannot allocate memory (12)"
    },
    "deadlocks": {
        "healthy": true
    }
}

even though this error appears, the API still work .

Cheers

Invalid response JSON because PMID is repeated 2 times

For example:

curl http://localhost:8080/service/lookup?doi=10.1093/aob/mci237

See the problem at the end of the JSON:

{
	"reference-count": 33,
	"publisher": "Oxford University Press (OUP)",
	"issue": "5",
	"content-domain": {
		"domain": [],
		"crossmark-restriction": false
	},
	"published-print": {
		"date-parts": [
			[2005, 10, 1]
		]
	},
	"DOI": "10.1093/aob/mci237",
	"type": "journal-article",
	"created": {
		"date-parts": [
			[2005, 8, 16]
		],
		"date-time": "2005-08-16T00:13:55Z",
		"timestamp": {
			"$numberLong": "1124151235000"
		}
	},
	"page": "853-861",
	"source": "Crossref",
	"is-referenced-by-count": 46,
	"title": ["Phylogeographical Variation of Chloroplast DNA in Cork Oak (Quercus suber)"],
	"prefix": "10.1093",
	"volume": "96",
	"author": [{
		"given": "ROSELYNE",
		"family": "LUMARET",
		"sequence": "first",
		"affiliation": []
	}, {
		"given": "MATHIEU",
		"family": "TRYPHON-DIONNET",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "HENRI",
		"family": "MICHAUD",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "AURÉLIE",
		"family": "SANUY",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "EMILIE",
		"family": "IPOTESI",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "CÉLINE",
		"family": "BORN",
		"sequence": "additional",
		"affiliation": []
	}, {
		"given": "CÉLINE",
		"family": "MIR",
		"sequence": "additional",
		"affiliation": []
	}],
	"member": "286",
	"published-online": {
		"date-parts": [
			[2005, 8, 15]
		]
	},
	"container-title": ["Annals of Botany"],
	"language": "en",
	"link": [{
		"URL": "http://academic.oup.com/aob/article-pdf/96/5/853/435617/mci237.pdf",
		"content-type": "unspecified",
		"content-version": "vor",
		"intended-application": "similarity-checking"
	}],
	"deposited": {
		"date-parts": [
			[2017, 10, 11]
		],
		"date-time": "2017-10-11T06:18:36Z",
		"timestamp": {
			"$numberLong": "1507702716000"
		}
	},
	"score": 1,
	"issued": {
		"date-parts": [
			[2005, 8, 15]
		]
	},
	"references-count": 33,
	"journal-issue": {
		"published-online": {
			"date-parts": [
				[2005, 8, 15]
			]
		},
		"published-print": {
			"date-parts": [
				[2005, 10, 1]
			]
		},
		"issue": "5"
	},
	"URL": "http://dx.doi.org/10.1093/aob/mci237",
	"relation": {
		"cites": []
	},
	"ISSN": ["1095-8290", "0305-7364"],
	"issn-type": [{
		"value": "0305-7364",
		"type": "print"
	}, {
		"value": "1095-8290",
		"type": "electronic"
	}],
	"istexId": "17B513E0AC506300197444BB1105AF946504F162",
	"ark": "ark:/67375/HXZ-PBH2VH9H-P",
	"pmid": "16103038",
	"pmid": "16103038",
	"pmcid": "PMC4247051"
}

PMC ID lookup considers PMC ID as a PMID

curl http://localhost:8080/service/lookup?pmc=1017419
returns:

{..., "pmid":"1017419", "pmcid":"PMC1475233"}

(same as curl http://localhost:8080/service/lookup?pmid=1017419)

log settings not working

Logging appender settings are not working

ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

When adding log4j-core to the classpath

ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2

when adding a Log4j config, it also breaks at build time...

I remember having a similar problem with grobid + gradle + dropwizard, and it relies on some gradle I think.

LMDB env scaling issue

For simple lookup, we have quite a lot of failures due to:

! org.lmdbjava.Env$ReadersFullException: Environment maxreaders reached (-30790)

Apparently, LMDB cannot support more than 126 simultaneous readers? This looks like a weird hard coded limitation.

To reproduce these errors:

cd biblio-glutton/script

node oa_coverage -pmc ../data/pmc/PMID_PMCID_DOI.csv.gz > out.json

where PMID_PMCID_DOI.csv.gz is the usual PMID/DOI mapping from ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/

I observe 1 failure like that for every [1,000-10,000] requests on a normal working station.

Possible fix (apart from slowing down the rate of queries at the client): pooling of several LMDB environments (at least 2) for each database?

Curious IndexOutOfBound for /service/lookup?pmc=PMC1017419

I've just got this exception. I note down to avoid forgetting:

WARN  [2019-05-23 01:29:39,698] com.scienceminer.lookup.web.resource.LookupController: PMC ID did not matched, move to additional metadata
144.213.182.3 - - [23/May/2019:01:29:39 +0000] "GET /service/lookup?pmc=PMC1017419 HTTP/1.1" 404 27 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0" 132
unable to decode:&com.scienceminer.lookup.data.IstexDataüark:/67375/6H6-1RX6NZDF-Sü10.1016/0001-4575(92)900
ERROR [2019-05-23 01:30:08,393] com.scienceminer.lookup.storage.lookup.IstexIdsLookup: Cannot retrieve ISTEX identifiers by doi:  10.1016/0001-4575(92)90047-m
! java.lang.ArrayIndexOutOfBoundsException: 171
! at org.nustaq.serialization.coders.FSTStreamDecoder.readFByte(FSTStreamDecoder.java:296)
! at org.nustaq.serialization.coders.FSTStreamDecoder.readFShort(FSTStreamDecoder.java:366)
! at org.nustaq.serialization.FSTClazzNameRegistry.decodeClass(FSTClazzNameRegistry.java:163)
! at org.nustaq.serialization.coders.FSTStreamDecoder.readClass(FSTStreamDecoder.java:478)
! at org.nustaq.serialization.FSTObjectInput.readClass(FSTObjectInput.java:939)
! at org.nustaq.serialization.FSTObjectInput.readObjectWithHeader(FSTObjectInput.java:347)
! at org.nustaq.serialization.FSTObjectInput.readObjectFields(FSTObjectInput.java:713)
! at org.nustaq.serialization.FSTObjectInput.instantiateAndReadNoSer(FSTObjectInput.java:566)
! at org.nustaq.serialization.FSTObjectInput.readObjectWithHeader(FSTObjectInput.java:374)
! at org.nustaq.serialization.FSTObjectInput.readObjectInternal(FSTObjectInput.java:331)
! at org.nustaq.serialization.FSTObjectInput.readObject(FSTObjectInput.java:311)
! at org.nustaq.serialization.FSTObjectInput.readObject(FSTObjectInput.java:245)
! ... 75 common frames omitted
! Causing: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 171
! at org.nustaq.serialization.FSTObjectInput.readObject(FSTObjectInput.java:247)
! at org.nustaq.serialization.FSTConfiguration.asObject(FSTConfiguration.java:1150)
! at com.scienceminer.lookup.utils.BinarySerialiser.deserialize(BinarySerialiser.java:27)
! at com.scienceminer.lookup.utils.BinarySerialiser.deserialize(BinarySerialiser.java:33)
! at com.scienceminer.lookup.storage.lookup.IstexIdsLookup.retrieveByDoi(IstexIdsLookup.java:174)
! at com.scienceminer.lookup.storage.LookupEngine.injectIdsByDoi(LookupEngine.java:426)
! at com.scienceminer.lookup.storage.LookupEngine.retrieveByDoi(LookupEngine.java:128)
! at com.scienceminer.lookup.storage.LookupEngine.retrieveByPmid(LookupEngine.java:178)
! at com.scienceminer.lookup.web.resource.LookupController.getByQuery(LookupController.java:143)
! at com.scienceminer.lookup.web.resource.LookupController.getByQueryAsync(LookupController.java:99)
! at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
! at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.lang.reflect.Method.invoke(Method.java:498)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:143)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
! at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
! at io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:49)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:45)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:39)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:239)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
! at io.dropwizard.jetty.BiDiGzipHandler.handle(BiDiGzipHandler.java:67)
! at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at org.eclipse.jetty.server.Server.handle(Server.java:502)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
! at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
! at java.lang.Thread.run(Thread.java:748)

Managing invalid PMID used in non-pubmed sources

http://localhost:8080/service/lookup?pmid=16262981 results in 404

However it is the PMID indicated by this record, so it exists !
http://localhost:8080/service/lookup?pii=S0266462305050762

Run path for gap/daily sync with Crossref

from @Aazhar
Currently the service has to be launched from the biblio-glutton/lookup repertory to solve the path to the indexing for daily update, otherwise:

ERROR [2022-04-27 03:14:12,161] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: IO error when executing external command: [node, main, -dump, /opt/biblio-glutton/crossref_increments/2022-04-26/D1000310.json.gz, extend]
! java.io.IOException: error=2, No such file or directory
! at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
! at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
! at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
! at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
! ... 7 common frames omitted
! Causing: java.io.IOException: Cannot run program "node" (in directory "../indexing"): error=2, No such file or directory

We could improve the resolution of the path, probably with an optional install path parameter.

PMC ID lookup with number only

Currently PMC is expected as prefix for the PMC ID, while not in the case of PMID (because the mapping file is like this). It could be nice to have the service working with the PMC ID number only too:

curl http://localhost:8080/service/lookup?pmc=PMC1017419 -> works

curl http://localhost:8080/service/lookup?pmc=1017419 -> does not work

Sanity check for field request

Looking at months of logs, I only found one catched error.

It seems that a complete google scholar query was send as DOI field, resulting in this exception:

ERROR [2021-08-16 20:41:08,464] com.scienceminer.lookup.storage.lookup.MetadataLookup: Cannot retrieve Crossref document by DOI:  https://scholar.google.com/scholar_lookup?title=nepro+study+investigators+analysis+of+docetaxel+therapy+in+elderly+(%e2%89%a570years)+castration+resistant+prostate+cancer+patients+enrolled+in+the+netherlands+prostate+study&author=gerritse,+f.l.&author=meulenbeld,+h.j.&author=roodhart,+j.m.l.&author=van+der+velden,+a.m.t.&author=blaisse,+r.j.b.&author=smilde,+t.j.&author=erjavec,+z.&author=de+wit,+r.&author=los,+m.&publication_year=2013&journal=eur.+j.+cancer&volume=49&pages=3176%e2%80%933183&doi=10.1016/j.ejca.2013.06.008
! java.nio.BufferOverflowException: null
! at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:363)
! at java.nio.ByteBuffer.put(ByteBuffer.java:859)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.retrieveJsonDocument(MetadataLookup.java:110)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.retrieveByMetadata(MetadataLookup.java:132)
! at com.scienceminer.lookup.storage.LookupEngine.retrieveByDoi(LookupEngine.java:128)
! at com.scienceminer.lookup.web.resource.LookupController.getByQuery(LookupController.java:126)
! at com.scienceminer.lookup.web.resource.LookupController.getByQueryAsync(LookupController.java:99)
! at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
! at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.lang.reflect.Method.invoke(Method.java:498)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
...

Just checking the fields before processing them might be useful for avoiding BufferOverflowException, which could be a vulnerability.

Better matching step

Currently the matching is simply view as a post-validation using reduced metadata (basically author first name, in addition possibly to the title). This is a source of matching error when the title or author names are not present in the bibliographical reference.

To improve matching accuracy, we need a more comprehensive matching step, using more available metadata (provided by the client via the query or via GROBID citation parsing otherwise), while still maintaining robustness given that all these metadata are uncertain.

Example of failing queries:

Tallquist, M.D., Weismann, K.E., Hellstro ̈m, M.,and Soriano, P. (2000). Development 127, 5059–5070.
->return 10.1038/sj.onc.1203216 while it has no DOI apparently

support matching a bulk of identifiers

Hello
for some use cases , It would be good for performance if we can send a certain amount of references and get back the results for each item, at least when matching using a list of dois, istexIDs, pmids , either for the lookup endpoint , the OA endpoint or the oa_istex.

Enable caching

Keeping in mind that caching queries/results in LMDB make sense for matching queries only.

Add the ability to update lmdb data from crossref

At present, updating the lmdb database recompiling the entire db from scratch. It would be nice if incremental updates could be made using the crossref metadata plus API and/or OAI-PMH service.

PMC ID look-up

Still some confusing aspects with the PMC ID identifier arguments:

http://cloud.science-miner.com/glutton/service/lookup?pmc=1017419

returns record for { pmid : "1017419" }

For querying a PMC article with a PMC ID, we need the prefix PMC, e.g.
http://cloud.science-miner.com/glutton/service/lookup?pmc=PMC1017419
however, the lookup is failing (404), while it should normally because it's an old article:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1017419/

PII as additional "strong" identifier

Just realize now, the lookup can also be made with the PII (PII/DOI mapping being provided by the public ISTEX metadata). Useful in particular for getting the OA version of Elsevier publications.

GROBID call too async :)

The call to grobid in the case of biblio only query (without author) is async, and the response comes after that the non-validated result is returned... so there is no post-validation in this case.

biblio-glutton/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java

Line 281 in fb46c95

grobidClient.processCitation(biblio, "0", response -> {

on this line the call to grobid is sent, but the response will come after that we reach this part:

biblio-glutton/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java

Line 300 in fb46c95

 final String s = injectIdsByDoi(matchingDocument.getJsonObject(), matchingDocument.getDOI()); 

and the non validated DOI match is returned first whatever GROBID will return as author (or non author).

example query: ?biblio=Reporting Hospital Quality Data for Annual Payment Update. Avail- able at: http://www.cms.gov/Medicare/Quality-Initiatives-Patient- Assessment-Instruments/HospitalQualityInits/Downloads/Hospital- RHQDAPU200808. Accessed December 18, 2013

-> match DOI 10.1037/e556322006-027 (wrong one)

-> no author in this DOI, no author found by GROBID, but glutton returns this DOI record as result

-> the call to grobid should be sync, or it is necessary to find a way to wait for its answer to go on with the injectIdsByDoi() or grobid should be integrated as a library.

Docker image fixes

Few things to do still with docker:

fix docker build / gradle build to correctly replace urls of elastic and grobid
push the image on hub.docker

Add the ability to compile lmdb from crossref premium

It would be nice if we could compile the lookup database using crossref metadata plus, which is in a different format the the greenelab metadata dump. Specifically, the data from crossref are split across multiple files, each with 10,002 lines.

Example files from crossref plus attached.

examples.zip

Experiment with vespa

Matching of full raw reference string provides the best accuracy but is also the most expensive, so scaling with this kind of queries supposes to add many elasticsearch nodes.

It could be interesting to experiment with vespa as alternative to elasticsearch to see if it speeds up the search query rate.

Results seem to be vastly improved by first parsing citation with GROBID

I think this is what you're getting at in #13 and #21, but I figure it'd be useful to share my specific experience with using biblio-glutton.

I'm using biblio-glutton to add citation links to https://www.arxiv-vanity.com/. It works really well -- thank you!

The implementation is a bit weird though. I would have thought I would just be able to do /service/lookup?biblio=... and be done with it. That gave me very few positive results though -- I think I tried several papers and only got perhaps 5 positive matches.

I found I got much better results (most citations working, almost all on some papers) by first parsing the citation with GROBID then passing atitle and firstAuthor to biblio-glutton. I am getting almost perfect results -- I haven't seen a false positive yet.

Here's the lookup code I'm using, if you're interested. You can see the high-level logic at the bottom where it first does a grobid call, then a biblio-glutton call.

This simple improvement makes me wonder -- why doesn't biblio-glutton do this internally? Am I doing something stupid?

Error during import of gz files

Hello,
I downloaded via torrent all gz files that are located here https://academictorrents.com/details/4dcfdf804775f2d92b7a030305fa0350ebef6f3e
I tried to import them to biblio-glutton db via docker compose via this command at the end of Docker file:
CMD java -jar lib/lookup-service-0.2-onejar.jar crossref --input /app/data/crossref-data/April2022 /app/config/config.yml

I have errors for all files, same error for all and the error says:
ERROR [2022-09-22 09:42:10,531] com.scienceminer.lookup.reader.CrossrefJsonlReader: Some serious error when deserialize the JSON object:
biblio-glutton-biblio-1 | },
biblio-glutton-biblio-1 | ! com.fasterxml.jackson.core.JsonParseException: Unexpected close marker '}': expected ']' (for root starting at [Source: (String)" },"; line: 1, column: 0])
biblio-glutton-biblio-1 | ! at [Source: (String)" },"; line: 1, column: 10]
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.base.ParserBase._reportMismatchedEndMarker(ParserBase.java:1016)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._closeScope(ReaderBasedJsonParser.java:2888)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4247)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2720)
biblio-glutton-biblio-1 | ! at com.scienceminer.lookup.reader.CrossrefJsonlReader.fromJson(CrossrefJsonlReader.java:53)
biblio-glutton-biblio-1 | ! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:34)
biblio-glutton-biblio-1 | ! at java.util.Iterator.forEachRemaining(Iterator.java:116)

Can you help me?
Thanks

Crossref gap update command might not stop

from @Aazhar
It appears that the gap update command does not stop running occasionally and we should explore why it happens. It could be related to temporary non-availability of the Crossref REST API.

crossrefGapUpdate
             count = 1129675
         mean rate = 6.37 events/second
     1-minute rate = 0.00 events/second
     5-minute rate = 0.00 events/second
    15-minute rate = 0.00 events/second

Journal-based metadata, not working as expected

Query based on journal name, volume and first page:

http://localhost:8080/service/lookup?doi=10.1107%2FS0907444911008754&jtitle=Acta%20Crystallographica%20Section%20D%20Biological%20Crystallography&volume=67&page=463&firstName=Gorrec
-> working

http://localhost:8080/service/lookup?jtitle=Acta%20Crystallographica%20Section%20D%20Biological%20Crystallography&volume=67&page=463&firstName=Gorrec
-> not working

doi: 10.1107/S0907444911008754
jtitle: Acta Crystallographica Section D Biological Crystallography
volume: 67
page: 463
firstName: Gorrec

Slower-than-expected LMDB import; tuning?

When experimenting with import of the fatcat release metadata corpus (about 97 million records, similar in size/scope to crossref corpus when abstracts+references removed), I found the Java LMDB importing slower than expected:

    java -jar lookup/build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar fatcat --input /srv/biblio-glutton/datasets/release_export_expanded.json.gz /srv/biblio-glutton/config/biblio-glutton.yaml

    [...]
    -- Meters ----------------------------------------------------------------------
    fatcatLookup
                 count = 1146817
             mean rate = 8529.29 events/second
         1-minute rate = 8165.53 events/second
         5-minute rate = 6900.17 events/second
        15-minute rate = 6358.60 events/second

    [...] RAN OVER NIGHT

    6/26/19 4:32:11 PM =============================================================

    -- Meters ----------------------------------------------------------------------
    fatcatLookup
                 count = 37252787
             mean rate = 1474.81 events/second
         1-minute rate = 1022.73 events/second
         5-minute rate = 1022.72 events/second
        15-minute rate = 1005.36 events/second
    [...]

I cut it off soon after, when the rate drooped further to ~900/second.

This isn't crazy slow (it would finish in another day or two), but, for instance, the node/elasticsearch ingest of the same corpus completed pretty quickly on the same machine:


    [...]
    Loaded 2131000 records in 296.938 s (8547.008547008547 record/s)
    Loaded 2132000 records in 297.213 s (7142.857142857142 record/s)
    Loaded 2133000 records in 297.219 s (3816.793893129771 record/s)
    Loaded 2134000 records in 297.265 s (12987.012987012988 record/s)
    Loaded 2135000 records in 297.364 s (13513.513513513513 record/s)
    [...]
    Loaded 98076000 records in 22536.231 s (9433.962264150943 record/s)
    Loaded 98077000 records in 22536.495 s (9090.90909090909 record/s)

This is a ~30 thread machine with 50 GByte RAM and a consumer-grade Samsung 2 TByte SSD. I don't seem to have any lmdb libraries installed, I guess they are vendored in. In my config I have (truncated to relevant bits):

storage: /srv/biblio-glutton/data/db
batchSize: 10000
maxAcceptedRequests: -1

server:
  type: custom
  applicationConnectors:
  - type: http
    port: 8080
  adminConnectors:
  - type: http
    port: 8081
  registerDefaultExceptionMappers: false
  maxThreads: 2048
  maxQueuedRequests: 2048
  acceptQueueSize: 2048

Wondering what kind of performance others are seeing by the "end" of a full crossref corpus import, and if there is other tuning I should do.

For my particular use-case (fatcat matching) it is tempting to redirect to a HTTP REST API which can handle at least hundreds of requests/sec at a couple ms or so latency; this would keep the returned data "fresh" without needing a pipeline to rebuild the LMDB snapshots periodically or continuously. Probably not worth it for most users and most cases. I do think there is a "universal bias" towards the most recent published works though: most people read and are processing new papers, and new papers tend to cite recent (or forthcoming) papers, so having the matching corpus even a month or two out of date could be sub-optimal. The same "freshness" issue would exist with elasticsearch anyways though.

Docker and Docker-compose files incorrect + java.net.ConnectException

Hello,
Some points to help users that will user biblio-glutton in Windows env with Docker.

You need to clone the repo using this args --config core.autocrlf=input so the command will be git clone https://github.com/kermitt2/biblio-glutton.git --config core.autocrlf=input otherwise you'll have strange errors during docker-compose up phase
Docker file is not right, it needs the last command to be CMD java -jar lib/lookup-service-0.2-onejar.jar server /app/config/glutton.yml
Docker compose file is not right, needs to have volume for config otherwise the jar file says that it cannot find it, so

volumes:
- .\data:/app/data
- .\config:/app/config

Even with all those changes, the service starts but biblio-glutton has java.net.ConnectException of Connection refused.

WARN [2022-09-26 07:48:02,191] org.eclipse.jetty.server.HttpChannel: handleException /service/data java.net.ConnectException: Connection refused
WARN [2022-09-26 07:48:02,192] org.eclipse.jetty.server.HttpChannelState: unhandled due to prior sendError

Do I have to to a pull request for the changes above?
Can you help me about the error of connection refused?

Thanks

Some lookup should not fail if ES is not started

For instance

curl http://localhost:8080/service/lookup?doi=10.1484/J.QUAESTIO.1.103624
{"code":"502","message":"Cannot fetch data from Elasticsearch. "}

elasticsearch is not used for DOI lookup, so the service should not fail. It's good support simple lookup usage too, without the advanced "matching" functions.

When postValidate is true, article is said to be always found

Having no elasticsearch running, so no possible match, the error message specifies that something is found, which can't be the case:

lopez@work:~/grobid$ curl "http://localhost:8080/service/lookup?atitle=Naturalizing+Intentionality+between+Philosophy+and+Brain+Science.+A+Survey+of+Methodological+and+Metaphysical+Issues&firstAuthor=Pecere&postValidate=true"
{"message":"Article found but it didn't passed the post Validation."}

Stalled request for async branch

With the async branch, the following request runs forever :)

curl "http://localhost:8080/service/lookup?biblio=%EF%9D%A2.+%EF%9D%A4%EF%9D%A5%EF%9D%AC%EF%9D%AC%EF%9D%A9%EF%9D%AE%EF%9D%A7+%EF%9D%A5%EF%9D%B4+%EF%9D%A1%EF%9D%AC."

The string is only made of small capitals: query.bibliographic=.   .
Such queries should normally fail quickly, not rely on the (super long currently) timeout.

The query fails immediately with master branch and a sync. call to ES.

Light response with OA and ISTEX ID

For the glutton web extension, it would be good to have a service that provides both the OA PDF access (as the current service/oa?) and the ISTEX ID when available.

Consider faster REST API / microservice framework

Given that fast data provision is one of the focus of this project, we should consider modern high performance Java microservice framework to replace DropWizard Jetty/Jersey:

GreenLightning https://oci-pronghorn.gitbook.io/greenlightning/
Jooby https://jooby.io/
Vert.x https://vertx.io/

For reference https://www.techempower.com/benchmarks/

Unrecognized field "journal_issn_l" in recent dump of Unpaywall

Hello,

first, thanks for the truly awesome work!

Issue

I am building the embedded LMDB database and was trying to add the Unpaywall LookUp.

The program starts but keeps raising exceptions com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "journal_issn_l" (detailed error message below).

! com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "journal_issn_l" (class com.scienceminer.lookup.data.UnpayWallMetadata), not marked as ignorable (19 known properties: "journal_is_in_doaj", "genre", "oaStatus", "journal_issns", "is_oa", "openAccess", "oa_locations", "data_standard", "journal_name", "title", "updated", "publisher", "year", "doi", "journal_is_oa", "best_oa_location", "doi_url", "published_date", "oa_status"])
!  at [Source: (String)"{"doi": "10.1007/bf03160334", "year": 1914, "genre": "journal-article", "is_oa": false, "title": "Barroisia und die Pharetronenfrage", "doi_url": "https://doi.org/10.1007/bf03160334", "updated": "2018-06-17T04:42:28.895386", "oa_status": "closed", "publisher": "Springer Nature", "z_authors": [{"given": "H.", "family": "Rauff"}], "journal_name": "Paläontologische Zeitschrift", "oa_locations": [], "data_standard": 2, "journal_is_oa": false, "journal_issns": "0031-0220", "journal_issn_l": "0031-022"[truncated 90 chars]; line: 1, column: 493] (through reference chain: com.scienceminer.lookup.data.UnpayWallMetadata["journal_issn_l"])
! at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61)
! at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:823)
! at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1153)
! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1589)
! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1567)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:294)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
! at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4013)
! at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3004)
! at com.scienceminer.lookup.reader.UnpayWallReader.fromJson(UnpayWallReader.java:58)
! at com.scienceminer.lookup.reader.UnpayWallReader.lambda$load$1(UnpayWallReader.java:42)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.UnpayWallReader.load(UnpayWallReader.java:41)
! at com.scienceminer.lookup.storage.lookup.OALookup.loadFromFile(OALookup.java:110)
! at com.scienceminer.lookup.command.LoadUnpayWallCommand.run(LoadUnpayWallCommand.java:66)
! at com.scienceminer.lookup.command.LoadUnpayWallCommand.run(LoadUnpayWallCommand.java:22)
! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)
! at io.dropwizard.cli.Cli.run(Cli.java:78)
! at io.dropwizard.Application.run(Application.java:93)
! at com.scienceminer.lookup.web.LookupServiceApplication.main(LookupServiceApplication.java:68)
ERROR [2019-08-21 01:31:36,510] com.scienceminer.lookup.reader.UnpayWallReader: The input line cannot be processed
 {"doi": "10.3886/icpsr02766", "year": null, "genre": "dataset", "is_oa": false, "title": "Project on Human Development in Chicago Neighborhoods: Community Survey, 1994-1995", "doi_url": "https://doi.org/10.3886/icpsr02766", "updated": "2018-06-18T23:27:05.481519", "oa_status": "closed", "publisher": "Inter-university Consortium for Political and Social Research (ICPSR)", "z_authors": [{"given": "Felton J.", "family": "Earls"}, {"given": "Jeanne", "family": "Brooks-Gunn"}, {"given": "Stephen W.", "family": "Raudenbush"}, {"given": "Robert J.", "family": "Sampson"}], "journal_name": "ICPSR Data Holdings", "oa_locations": [], "data_standard": 2, "journal_is_oa": false, "journal_issns": null, "journal_issn_l": null, "published_date": null, "best_oa_location": null, "journal_is_in_doaj": false, "has_repository_copy": false}

How to reproduce the behaviour

java -jar build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar unpaywall --input ~/data/unpaywall_snapshot_2019-08-16T155437.jsonl.gz data/config/config.yml

Note: as you can see the Unpaywall dataset that I am using is more recent that the one used in the biblio-glutton demo.

Environment

AWS EC2 t2.micro (Ubuntu 18.04)
biblio-glutton (latest commit)
Unpaywall fresh from 16th August