globalnamesarchitecture / gni
This project forked from dimus/gni
Global Names Index
Home Page: http://wiki.github.com/GlobalNamesArchitecture/gni
Hi Dima,
It occurs to me that if you could determine why the fuzzy matching algorithm goes wrong in this one particular example, it might suggest ways to improve accuracy overall.
The single letter OCR error Euglcna gracilis (for Euglena gracilis) is instead consistently replaced by Egilina gracilis, which requires substitution of four letters. Both names are in GNI. Other OCR errors for this name are also resolved in the same way.
BHL pages with OCR errors in the name Euglena gracilis that are misinterpreted as Egilina gracilis:
http://www.biodiversitylibrary.org/page/929138
http://www.biodiversitylibrary.org/page/1364518
http://www.biodiversitylibrary.org/page/1364524
http://www.biodiversitylibrary.org/page/1364525
http://www.biodiversitylibrary.org/page/1364531
http://www.biodiversitylibrary.org/page/1364532
http://www.biodiversitylibrary.org/page/1488979
http://www.biodiversitylibrary.org/page/1503443
http://www.biodiversitylibrary.org/page/1503451
http://www.biodiversitylibrary.org/page/1547005
http://www.biodiversitylibrary.org/page/1547013
http://www.biodiversitylibrary.org/page/1547015
http://www.biodiversitylibrary.org/page/1539409
http://www.biodiversitylibrary.org/page/1488189
As I pointed out previously, a single letter difference in the Latin name can be the difference between a butterfly and a snail, which makes fuzzy matching inappropriate. There seem to be a variety of algorithms specifically for detecting and correcting OCR errors, however, which are qualitatively different from the sort of phonetic misspelling and QWERTY keyboard typographical errors that fuzzy matching is intended to overcome. OCR-specific algorithms "know" about the nature of OCR errors, which result from similarly shaped characters being confounded or not recognized. In the example case, "c" is more likely to be an OCR error for "e" than for "i". Some OCR errors are obvious (mid-string non-alpha characters and case changes) while others are statistically much more likely than others. A probability table could suggest likely substitutions and eliminate unlikely ones. There seems to be a considerable literature on the subject, for example http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_nlp.pdf. Use of GNI as a dictionary should be done with caution. It is rather "dirty," both significantly incomplete and full of errors. It must be relied on for recognition of genus group names, but a species group name string not being in GNI, particularly in combination with a particular genus, does not mean that it is invalid and must be replaced.
Regards,
Pat
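The probability-table idea in the message above can be sketched as a weighted edit distance, where substituting visually similar characters costs less than substituting unrelated ones. This is only an illustration, not how GNI's fuzzy matcher works: the confusion pairs and their costs below are made-up assumptions, and a real table would have to be estimated from actual OCR output.

```python
# Hypothetical confusion costs: visually similar pairs are cheap to substitute.
# The pairs and values are illustrative assumptions, not measured OCR statistics.
OCR_CONFUSION_COST = {
    frozenset("ce"): 0.2,
    frozenset("il"): 0.2,
    frozenset("un"): 0.3,
}


def substitution_cost(a, b):
    """Return 0 for identical characters, a low cost for look-alikes, else 1."""
    if a == b:
        return 0.0
    return OCR_CONFUSION_COST.get(frozenset((a, b)), 1.0)


def ocr_distance(source, target):
    """Weighted Levenshtein distance; insertions and deletions cost 1 each."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,       # delete s_char
                current[j - 1] + 1.0,    # insert t_char
                previous[j - 1] + substitution_cost(s_char, t_char),
            ))
        previous = current
    return previous[-1]


candidates = ["Euglena gracilis", "Egilina gracilis"]
best = min(candidates, key=lambda name: ocr_distance("Euglcna gracilis", name))
print(best)
```

With these assumed weights, the single "c" for "e" confusion makes Euglena gracilis by far the closest candidate.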
Given url similar to:
.../data_sources/123/name_strings
the user should be able to get a paginated response with all the name strings in the data source and the information associated with them
Taxa at an infraspecific rank can get high matches in GNR, although the result is just at the species rank or is incorrect:
For example, when supplying 'Cirsium creticum s. triumfetti' this result makes sense:
But these high scores are unexpected:
Cirsium creticum (Lam.) d'Urv. [ exact canonical match, Score: 0.988 ]
Catalogue of Life
Cirsium creticum (Lam.) d’Urv. [ exact canonical match, Score: 0.988 ]
GBIF Backbone Taxonomy
Cirsium creticum (Lam.) d’Urv. subsp. creticum [ exact canonical match, Score: 0.988 ]
GBIF Backbone Taxonomy
(note 'creticum' vs 'triumfetti' at infraspecific level).
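Until scoring accounts for this, a client could flag such results itself by checking that every epithet of the submitted name also occurs in the matched name. The sketch below is an assumption-laden illustration: the rank-marker list is hypothetical and incomplete, hyphenated epithets are not handled, and it does not reproduce GNR's own scoring.

```python
import re

# Assumed (incomplete) set of rank markers to ignore in the submitted name.
RANK_MARKERS = {"s.", "ssp.", "subsp.", "var.", "subvar.", "f."}


def epithets(name):
    """Lower-cased words of a submitted name, without rank markers."""
    return [t.lower() for t in name.split() if t.lower() not in RANK_MARKERS]


def covers_all_epithets(submitted, matched):
    """True only if every word of the submitted name occurs in the match."""
    matched_words = set(re.findall(r"[a-z]+", matched.lower()))
    return all(e in matched_words for e in epithets(submitted))


submitted = "Cirsium creticum s. triumfetti"
print(covers_all_epithets(submitted, "Cirsium creticum (Lam.) d’Urv. subsp. creticum"))
print(covers_all_epithets(submitted, "Cirsium creticum subsp. triumfetti"))
```

A result that fails this check matched only at the species level (or on a different infraspecific epithet) and could be shown with a lower confidence.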
Currently the resolver does not report when a particular source was last updated. There are two ways of adding this information: when someone queries the data source itself, or as part of each record. I think the best way would be to return it once for each data source mentioned in the output.
data sources
data_source
id 1
title NCBI
last_updated 2012-12-1
e.g. https://www.wikidata.org/wiki/Q36611. Taxa can be spotted in the wikidata JSON download dump by looking for items with property P31 ("instance of") set to Q16521 ("taxon")
Curious if resolve_once behavior is correct, e.g. this call http://resolver.globalnames.org/name_resolvers.json?names=Plantago+major&resolve_once=true returns more than one match (all from different sources I think, but more than 1 match).
The API docs suggest setting resolve_once=TRUE should just get the first match.
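For anyone reproducing this, the request URL quoted above can be built as follows. The function name is hypothetical, and joining several names with "|" is an assumption about the API that should be checked against its documentation; no request is actually sent here.

```python
from urllib.parse import urlencode

BASE_URL = "http://resolver.globalnames.org/name_resolvers.json"


def build_query(names, resolve_once=True):
    """Build a GET URL for the resolver; names are joined with '|' (assumed)."""
    params = {
        "names": "|".join(names),
        "resolve_once": str(resolve_once).lower(),
    }
    return BASE_URL + "?" + urlencode(params)


print(build_query(["Plantago major"]))
```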
When a list of names of length greater than 1000 is passed via POST request, jobs never seem to finish. Is this just me? Perhaps something going on with the queueing system in the backend?
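As a possible client-side workaround while this is investigated, a long list can be split into smaller batches before submission. The batch size of 500 is an arbitrary assumption, not a documented limit.

```python
def chunked(names, size=500):
    """Yield consecutive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


batches = list(chunked([f"name {i}" for i in range(1200)]))
print([len(batch) for batch in batches])
```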
Right now, GNR produces two outputs of matched names: (1) the entire taxon name incl. author and (2) the canonical form. For example, for taxon 'Stipa pennnata var. anomala', you get: (1) 'Stipa pennata var. anomala (P.A.Smirn.) Tzvelev' and (2) 'Stipa pennata anomala'.
I would like to request a third output, that includes taxonomic abbreviations (e.g. 'var', 'subsp') but not author names. Using the example above, this would be (3) 'Stipa pennata var. anomala'.
Some biological databases use the full taxon name (incl. abbreviations) but not the author name. Having this feature would help facilitate data merging among databases with the taxon name as a key variable.
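Until such an output exists, the requested form can be approximated on the client from the two outputs GNR already returns. The sketch below walks the full name and keeps only the canonical words plus any rank marker that precedes one of them. The marker list is an assumption, and an author abbreviation such as "f." appearing before the last epithet would be kept by mistake.

```python
# Assumed (incomplete) set of rank markers worth keeping.
RANK_MARKERS = {"subsp.", "ssp.", "var.", "subvar.", "f.", "subf."}


def ranked_canonical(name_string, canonical_form):
    """Canonical form with rank markers restored, authorship dropped."""
    remaining = canonical_form.split()
    kept = []
    for token in name_string.split():
        if remaining and token == remaining[0]:
            kept.append(token)          # next canonical word, in order
            remaining.pop(0)
        elif remaining and token in RANK_MARKERS:
            kept.append(token)          # marker before a pending epithet
    return " ".join(kept)


print(ranked_canonical(
    "Stipa pennata var. anomala (P.A.Smirn.) Tzvelev",
    "Stipa pennata anomala",
))
```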
User sends a file with names, user gets back a link to zipped file with all resolved names.
Hard limits are from 1 to 5 data sources;
if fewer or more, return a failure code with a message which explains the problem
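A minimal sketch of that check, assuming the data sources arrive as a list of ids. The function name, the status values and the message wording are all hypothetical, not the resolver's actual response format.

```python
MIN_SOURCES, MAX_SOURCES = 1, 5


def validate_data_source_ids(ids):
    """Return a status dict; duplicate ids are counted once."""
    count = len(set(ids))
    if MIN_SOURCES <= count <= MAX_SOURCES:
        return {"status": "ok"}
    return {
        "status": "failure",
        "message": (
            f"Expected between {MIN_SOURCES} and {MAX_SOURCES} "
            f"data sources, got {count}."
        ),
    }


print(validate_data_source_ids([1, 4, 9]))
print(validate_data_source_ids([]))
```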
Every name_resolver header response has
as discussed with @dimus -
In addition to taxon hierarchies, suggest to include available common names for resolved taxa. This would help me immensely in making the search features in http://globalbioticinteractions.org friendlier for humans.
Currently the resolver returns something like:
...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
...
suggested result (including common names) something like:
...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
common_names: "human @en|Mensch @de|mens @nl",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
taxon_id: "9606",
...
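If the suggested common_names field were adopted in the pipe-delimited "name @lang" form shown above, a client could unpack it roughly like this. The field is only a proposal in this issue, so the format and the "und" fallback for untagged entries are assumptions.

```python
def parse_common_names(field):
    """Group 'name @lang' entries by language code."""
    names = {}
    for entry in field.split("|"):
        name, separator, lang = entry.rpartition(" @")
        if not separator:
            # No language tag found: file the entry under 'und' (undetermined).
            name, lang = entry, "und"
        names.setdefault(lang, []).append(name)
    return names


print(parse_common_names("human @en|Mensch @de|mens @nl"))
```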
Remove old statuses from the code too
When using resolver.globalnames.org with "Homo sapiens", exact canonical matches include both NCBI:9606 (Homo sapiens) and NCBI:741158 (Homo sapiens ssp. Denisova). Is this expected?
Generate a list of all data sources in GNI, and provide them with the ID value and associated metadata to @deepreef for registration as Identifier Domains in BioGUID.
Name resolution is a reasonably private process: I can only access my information by UUID token, and never by the id of the resource, to reduce the risk of unauthorized access
I noticed that fuzzy matching in GNR did not work in the case of "Buphtalmum salicifolium". The typo is in the genus name (correct spelling: Buphthalmum salicifolium; Asteraceae).
Here is an example with the R taxize package and the name (incl. typo in genus name). I just get the genus name from IRMNG (but not the correct spelling):
gnr_resolve("Buphtalmum salicifolium")
  submitted_name           matched_name               data_source_title                                score
1 Buphtalmum salicifolium  Buphtalmum Linnaeus, 1753  Interim Register of Marine and Nonmarine Genera  0.75
When supplying the correctly spelt name, I get a match, so the name is included in the DBs:
gnr_resolve("Buphthalmum salicifolium") %>% head(2)
  submitted_name            matched_name              data_source_title  score
1 Buphthalmum salicifolium  Buphthalmum salicifolium  NCBI               0.988
2 Buphthalmum salicifolium  Buphthalmum salicifolium  Freebase           0.988
Other taxon names with a typo in the genus name work just fine ("Stippa" should be "Stipa"):
gnr_resolve("Stippa pennata") %>% head(2)
  submitted_name  matched_name      data_source_title  score
1 Stippa pennata  Stipa pennata L.  Catalogue of Life  0.75
2 Stippa pennata  Stipa pennata L.  ITIS               0.75
I know this is just a single case but it really puzzled me.
Thanks for your effort.
Right now we search multiple data sources at every step. For example, if we search 5 data sources and a name has an exact string match in only one of them, it drops out of the search pattern and is not checked by canonical or fuzzy search, even if it could be found in the other 4 data sources. We have to figure out whether this is what we want or whether we want a different approach.
Least surprise behavior.
I would imagine that if I submitted names and several data sources, I would expect to get back the sum of the searches for all of these data sources, as if I had resolved a name against each of them separately. However, this significantly increases the load on the system. With limited resources it might make sense to decide on a more economical approach.
Possible solutions:
An independent option for all 3 would be to use the whole GNI data to find names which were not found in the selected databases.
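The "least surprise" behavior discussed in this issue, where each data source runs through the exact, canonical and fuzzy steps independently, could look roughly like the sketch below. The in-memory sources, the matcher functions and the crude "first two words" canonical comparison are all hypothetical stand-ins, not the resolver's implementation.

```python
def resolve_per_source(name, sources, matchers):
    """Run the ordered matchers separately for each data source.

    sources:  {source_id: list of name strings}
    matchers: ordered list of (label, function(name, candidate) -> bool)
    """
    results = {}
    for source_id, candidates in sources.items():
        for label, matches in matchers:
            hits = [c for c in candidates if matches(name, c)]
            if hits:
                # Stop at the first successful step for THIS source only.
                results[source_id] = (label, hits)
                break
    return results


matchers = [
    ("exact", lambda name, candidate: name == candidate),
    # Crude stand-in for canonical matching: compare the first two words.
    ("canonical", lambda name, candidate: name.split()[:2] == candidate.split()[:2]),
]
sources = {1: ["Betula alba L."], 2: ["Betula alba"], 3: ["Stipa pennata L."]}
print(resolve_per_source("Betula alba", sources, matchers))
```

Here an exact hit in source 2 does not prevent source 1 from still being checked by the canonical step, which is the behavior a user resolving against each source separately would see.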
When searching for the name "Anura" in http://resolver.globalnames.org, a result is returned that includes unexpected WoRMS classification ids like gn:I7WB_zNSUlqQP-4KWe8VCA. Expected WoRMS ids would include 448306 (see http://www.marinespecies.org/aphia.php?p=taxdetails&id=448306).
{
"data_source_id": 9,
"data_source_title": "WoRMS",
"gni_uuid": "58305eaf-9617-5566-9bed-45388e44643b",
"name_string": "Anura",
"canonical_form": "Anura",
"classification_path": "Animalia|Chordata|Amphibia|Anura",
"classification_path_ranks": "kingdom|phylum|class|order",
"classification_path_ids": "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA",
"taxon_id": "gn:I7WB_zNSUlqQP-4KWe8VCA",
"match_type": 1,
"prescore": "1|0|0",
"score": 0.75
}
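To spot affected records in bulk, the three pipe-delimited classification fields can be zipped together and filtered for the gn: prefix that this report treats as unexpected. The helper names are hypothetical; the record is the one shown above, trimmed to the fields used.

```python
def lineage(record):
    """Zip the pipe-delimited path fields into (name, rank, id) tuples."""
    names = record["classification_path"].split("|")
    ranks = record["classification_path_ranks"].split("|")
    ids = record["classification_path_ids"].split("|")
    return list(zip(names, ranks, ids))


def gn_prefixed_ids(record):
    """Ids in the classification path that start with 'gn:'."""
    return [taxon_id for _, _, taxon_id in lineage(record)
            if taxon_id.startswith("gn:")]


record = {
    "classification_path": "Animalia|Chordata|Amphibia|Anura",
    "classification_path_ranks": "kingdom|phylum|class|order",
    "classification_path_ids": (
        "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|"
        "gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA"
    ),
}
print(lineage(record)[-1])
print(len(gn_prefixed_ids(record)))
```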
I just broke returning results by changing what name resolver instance is returning. New results have to be sent only with following fields added to the result header:
1 context for all data sources
2 all names first in the following format:
[ {id=>1, :name => 'Betula', :results => .....}, {...}....]
results should have the following fields:
I came across this issue by accident today. If the user enters a list of species names and one of the rows in between is empty, Global Names Resolver will ignore any input after that.
Example input to reproduce this problem:
Danio rerio

Laticauda semifasciata
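One way the input could be handled so that an empty row does not cut the list short is to skip blank lines instead of stopping at them. This is a sketch of the expected behavior with a hypothetical function name, not the resolver's actual parsing code.

```python
def parse_name_list(text):
    """Return the non-empty, whitespace-trimmed lines of the input."""
    return [line.strip() for line in text.splitlines() if line.strip()]


print(parse_name_list("Danio rerio\n\nLaticauda semifasciata\n"))
```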
Hey y'all,
I was hoping to use globalnames.org to do fast name lookups for our project Global Biotic Interactions . . . and it seems that http://resolver.globalnames.org is no longer responding. An integration test that worked earlier is now failing.
Would it be possible to run the resolver locally?
thx,
-jorrit
Hi!
I've seen you cover almost all the taxonomies our users use to report species (NCBI, WORMS, ITIS, ...).
I was wondering if there is a way to programmatically send an id (like WORMS:123) and get the info. We would also need to ask for parents and navigate up the tree.
Nice resource!
Would it be possible to search and return alternative names?
e.g. a search in NCBI via GNR for 'Alopex lagopus' returns nothing even though NCBI records it as a synonym for 'Vulpes lagopus'.
Many thanks,
Dom
following a discussion with @dimus , the following use case would allow users to help classify / annotate bad names and suggest alternative valid names:
@dimus let me know if I captured the use case properly. I do realize that this use case is very specific to GloBI . . .