Comments (10)
I am doing a new iteration on this.
The "super" matching service would work like this:
GET host:port/service/lookup?doi=DOI&atitle=ARTICLE_TITLE&firstAuthor=FIRST_AUTHOR_SURNAME&jtitle=JOURNAL_TITLE&volume=VOLUME&firstPage=FIRST_PAGE&biblio=BIBLIO_STRING[&postValidate=true][&parseReference=true]
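As an illustration of how a client might assemble such a request, here is a minimal Python sketch. The helper name `build_lookup_url` and the `http://localhost:8080` base are assumptions made for the example; only the path and the parameter names come from the request above.

```python
from urllib.parse import urlencode


def build_lookup_url(base, *, post_validate=True, parse_reference=True, **fields):
    """Build the lookup URL, leaving out metadata fields that were not provided.

    `build_lookup_url` is a hypothetical helper; the parameter names follow
    the proposed request (doi, atitle, firstAuthor, jtitle, volume,
    firstPage, biblio).
    """
    params = {name: value for name, value in fields.items() if value}
    # Both flags default to true in the proposal; they are sent explicitly here.
    params["postValidate"] = "true" if post_validate else "false"
    params["parseReference"] = "true" if parse_reference else "false"
    return f"{base}/service/lookup?{urlencode(params)}"


# Host and port are placeholders, not values taken from the thread.
url = build_lookup_url(
    "http://localhost:8080",
    atitle="Some title",
    firstAuthor="Smith",
)
```

`urlencode` takes care of escaping, which matters most for the `biblio` parameter since a raw reference string usually contains spaces and punctuation.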
The logic would be the following:
1. Try to match with the DOI, if present. If successful, try the post-validation (if at least the author is present). If the validated DOI lookup fails, go to 2; otherwise go to 5.
2. Try to match with the author and article title metadata, if present. If successful, try the post-validation. If the validated lookup fails, go to 3; otherwise go to 5.
3. Try to match with the journal title, volume and first page, if present. If successful, try the post-validation (if at least the author is present). If the validated lookup fails, go to 4; otherwise go to 5.
4. Try full string matching, if the full string is present. If successful, try the post-validation if at least the author is present; if the author is not present, call the GROBID citation parser to get at least the author (and the article title if possible) and post-validate with that. If this validated match fails too, go to 6; otherwise go to 5.
5. Return 200 with the matched DOI and extended metadata.
6. Return 404.
Validation takes place if postValidate is true (default). GROBID parsing takes place if parseReference is true (default).
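The cascade above can be sketched as a chain of steps where the first validated match wins. This is only an illustration in Python, not the service's implementation: `metadata_step`, `lookup`, and the toy matcher and validator are hypothetical names, and letting a match without any author fall through to the next step in steps 1 to 3 is an assumption of the sketch.

```python
from typing import Callable, Optional

Record = dict
Matcher = Callable[[dict], Optional[Record]]
Validator = Callable[[Record, str, Optional[str]], bool]


def metadata_step(required: tuple, match: Matcher, validate: Validator,
                  post_validate: bool = True):
    """Build one of steps 1-3: match on the required fields, then post-validate."""
    def step(query: dict) -> Optional[Record]:
        if not all(query.get(field) for field in required):
            return None                  # required fields missing: skip this step
        record = match(query)
        if record is None:
            return None                  # no match: move on to the next step
        if not post_validate:
            return record                # validation disabled: accept the match
        author = query.get("firstAuthor")
        if not author:
            return None                  # assumption: no author, fall through
        return record if validate(record, author, query.get("atitle")) else None
    return step


def lookup(query: dict, steps: list) -> tuple:
    """Run the steps in order; the first validated match gives 200, else 404."""
    for step in steps:
        record = step(query)
        if record is not None:
            return 200, record
    return 404, None


# Toy stand-ins for the real index and validator, for demonstration only.
index = {"10.1000/xyz": {"DOI": "10.1000/xyz", "author": "Smith"}}
by_doi = lambda q: index.get(q["doi"])
same_author = lambda rec, author, title: rec["author"].lower() == author.lower()

steps = [metadata_step(("doi",), by_doi, same_author)]
status, record = lookup({"doi": "10.1000/xyz", "firstAuthor": "smith"}, steps)
```

Steps 2 and 3 would be built the same way with `("firstAuthor", "atitle")` and `("jtitle", "volume", "firstPage")` as required fields; the full string step needs the extra parsing branch and would be appended as the last element of `steps`.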
Ideally it should work with PMID, PMCID or ISTEX_ID instead of DOI (any "strong" identifier).
from biblio-glutton.
Point 4 is not clear: what should happen if the post-validation doesn't work? Why should the match be re-validated with the author/title extracted from the full string? Shouldn't we extract them with GROBID and then try to match as in (2)?
Second, are you sure you want to change postValidate and parseReference to true by default?
from biblio-glutton.
About point 4), I think it is clear, but I can rephrase it with more details:
4. Try full string matching, if a full string is present.
4.1 If full string matching is successful:
    if at least the author is present -> try to post-validate the matching result with this author and, if available, the title
    if author/title are not present -> call the GROBID citation parser to get at least the author (and the article title if possible) -> try to post-validate with that
4.2 If the post-validation fails, or if no author/title is available after GROBID parsing, or if full string matching initially failed, go to 6; otherwise go to 5.
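Steps 4.1 and 4.2 can be sketched as follows. The function and its callable parameters (`match_biblio`, `validate`, `parse_reference`) are hypothetical stand-ins for the full string matcher, the post-validation and the GROBID citation parser; returning `None` corresponds to "go to 6". Treating "validation on, parsing off, no author" as a failure is an assumption of the sketch.

```python
from typing import Callable, Optional


def full_string_step(query: dict,
                     match_biblio: Callable[[str], Optional[dict]],
                     validate: Callable[[dict, str, Optional[str]], bool],
                     parse_reference: Callable[[str], dict],
                     post_validate: bool = True,
                     parse: bool = True) -> Optional[dict]:
    """Step 4: full string matching, with parsing as a fallback for validation."""
    biblio = query.get("biblio")
    if not biblio:
        return None                      # no full string provided
    record = match_biblio(biblio)
    if record is None:
        return None                      # full string matching failed
    if not post_validate:
        return record                    # validation disabled: accept as is
    author = query.get("firstAuthor")
    title = query.get("atitle")
    if not author and parse:
        # Costly call, made only when no author was supplied as metadata.
        parsed = parse_reference(biblio)
        author = parsed.get("firstAuthor")
        title = title or parsed.get("atitle")
    if not author:
        return None                      # nothing to validate against
    return record if validate(record, author, title) else None


# Toy stand-ins, for demonstration only.
known = {"DOI": "10.1/abc", "author": "Smith"}
match_biblio = lambda s: known if "Smith" in s else None
parse_reference = lambda s: {"firstAuthor": s.split(",")[0]}
validate = lambda rec, author, title: rec["author"].lower() == author.lower()

result = full_string_step({"biblio": "Smith, J. (2010). A title."},
                          match_biblio, validate, parse_reference)
```

When an author is supplied with the request, the parser is never called, which matches the priority given to provided metadata over parsing.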
In 2) we use the provided metadata only, Grobid is not called. So it's a different setting, with higher priority as it does not involve costly Grobid parsing.
from biblio-glutton.
> Second, so are you sure you want to change postValidate and parseReference to true by default?
Yes, because without post-validation, the false positives due to full reference string matching will kill the accuracy. Parsing the reference is a way to allow this post-validation even when no metadata is provided, so it is something to exploit as much as possible.
from biblio-glutton.
> at least with author is present
What does that mean?
If the title is not present, it is ignored, and if it is present, it is considered for the post-validation, while if the author is not present the post-validation will fail? Is that correct?
from biblio-glutton.
> at least with author present
So this is only for step 4).
If the title is not provided as metadata but the author is, we can post-validate with the author alone. My observation is that in most cases this is enough, though it might require more tests. This is an acceptable trade-off (there is also a good chance that the title is not present in the raw bibliographical reference anyway). Of course, if both title and author are provided, we use both.
If the post-validation with author only passes, we are done, success.
If the post-validation with author only fails, we are also done, 404. There is no additional grobid reference parsing in this case (this is the trade-off).
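To make the author-only versus author-plus-title behaviour concrete, here is a hedged Python sketch of such a post-validation. The thread does not say how names and titles are compared, so the use of `difflib.SequenceMatcher`, the 0.8 threshold and the record field names `author` and `title` are all assumptions of the example.

```python
from difflib import SequenceMatcher
from typing import Optional


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy string comparison; the metric and threshold are illustrative choices."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def post_validate(record: dict, author: str, title: Optional[str] = None) -> bool:
    """Validate a matched record against the author, and the title when given."""
    if not similar(record.get("author", ""), author):
        return False                     # author mismatch: validation fails
    if title and record.get("title"):
        return similar(record["title"], title)
    return True                          # author-only validation passed


record = {"author": "Smith", "title": "Deep learning for citations"}
author_only = post_validate(record, "smith")
wrong_author = post_validate(record, "Jones")
wrong_title = post_validate(record, "Smith", "Completely different")
```

With these inputs, `author_only` passes while `wrong_author` and `wrong_title` fail, mirroring the rule that the title is used only when it is available.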
from biblio-glutton.
Another question, back to point 1 (but valid everywhere):
If the author is not present and postValidate = true, it's a 404, right?
If this is the case, it might be a problem for DOI lookup, because it will return 404 unless post-validation is disabled.
from biblio-glutton.
Hmm, if the author is not present, we don't post-validate in (1); we basically go on to (3), and if we reach (4) we parse the raw reference string to try to get an author. There is no "author is not present and postValidate = true" case possible in the whole process. The availability of at least one author name (provided or extracted) is a condition for post-validating.
from biblio-glutton.
So done with PR #18
from biblio-glutton.
Finally, the mixed matching approach has been removed in version 0.2 because the full matching (which is more accurate) has been made much faster (almost as fast as the previous mixed matching), removing the interest of the mixed matching.
from biblio-glutton.