gni's People

Contributors

ashipunova, dimus, dshorthouse, pdevries

Forkers

colineol poldham

gni's Issues

Problems with fuzzy matching

Hi Dima,

It occurs to me that if you could determine why the fuzzy matching algorithm goes wrong in this one particular example, it might suggest ways to improve accuracy overall.

The single letter OCR error Euglcna gracilis (for Euglena gracilis) is instead consistently replaced by Egilina gracilis, which requires substitution of four letters. Both names are in GNI. Other OCR errors for this name are also resolved in the same way.

BHL pages with OCR errors in the name Euglena gracilis that are misinterpreted as Egilina gracilis:
http://www.biodiversitylibrary.org/page/929138
http://www.biodiversitylibrary.org/page/1364518
http://www.biodiversitylibrary.org/page/1364524
http://www.biodiversitylibrary.org/page/1364525
http://www.biodiversitylibrary.org/page/1364531
http://www.biodiversitylibrary.org/page/1364532
http://www.biodiversitylibrary.org/page/1488979
http://www.biodiversitylibrary.org/page/1503443
http://www.biodiversitylibrary.org/page/1503451
http://www.biodiversitylibrary.org/page/1547005
http://www.biodiversitylibrary.org/page/1547013
http://www.biodiversitylibrary.org/page/1547015
http://www.biodiversitylibrary.org/page/1539409
http://www.biodiversitylibrary.org/page/1488189

As I pointed out previously, a single letter difference in the Latin name can be the difference between a butterfly and a snail, which makes fuzzy matching inappropriate. There seem to be a variety of algorithms specifically for detecting and correcting OCR errors, however, which are qualitatively different from the sort of phonetic misspelling and QWERTY keyboard typographical errors that fuzzy matching is intended to overcome. OCR-specific algorithms "know" about the nature of OCR errors, which result from similarly shaped characters being confounded or not recognized. In the example case, "c" is more likely to be an OCR error for "e" than for "i". Some OCR errors are obvious (mid-string non-alpha characters and case changes), while some substitutions are statistically much more likely than others. A probability table could suggest likely substitutions and eliminate unlikely ones. There seems to be a considerable literature on the subject, for example http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_nlp.pdf. Use of GNI as a dictionary should be done with caution. It is rather "dirty," both significantly incomplete and full of errors. It must be relied on for recognition of genus group names, but a species group name string not being in GNI, particularly in combination with a particular genus, does not mean that it is invalid and must be replaced.

Regards,
Pat
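
The probability-table idea in this issue could be sketched as an edit distance whose substitution cost depends on how similar two characters look. This is only an illustration: the confusion pairs and their costs below are made-up assumptions, not measured OCR statistics, and the function names are hypothetical rather than part of GNI.

```python
# Hypothetical confusion costs: visually similar pairs are cheap to substitute.
# The pairs and values are illustrative assumptions, not measured OCR data.
OCR_CONFUSIONS = {
    frozenset("ce"): 0.2,
    frozenset("il"): 0.2,
    frozenset("un"): 0.3,
    frozenset("hb"): 0.3,
}


def substitution_cost(a, b):
    """Return 0 for identical characters, a low cost for look-alikes, else 1."""
    if a == b:
        return 0.0
    return OCR_CONFUSIONS.get(frozenset((a, b)), 1.0)


def ocr_distance(observed, candidate):
    """Levenshtein distance with shape-aware substitution costs."""
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, a in enumerate(observed, start=1):
        curr = [float(i)]
        for j, b in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # drop a character
                curr[j - 1] + 1.0,                      # add a character
                prev[j - 1] + substitution_cost(a, b),  # substitute
            ))
        prev = curr
    return prev[-1]


def best_candidate(observed, candidates):
    """Pick the dictionary name with the smallest OCR-aware distance."""
    return min(candidates, key=lambda c: ocr_distance(observed, c))


print(best_candidate("Euglcna", ["Egilina", "Euglena"]))
```

With these assumed costs, the "c" for "e" confusion makes Euglena a much closer candidate for Euglcna than Egilina.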

GNR: Unexpected high matching scores despite incomplete match at infraspecific rank

Taxa at an infraspecific rank can get high matches in GNR, although the result is just at the species rank or is incorrect:

For example, when supplying 'Cirsium creticum s. triumfetti' this result makes sense:

  1. Cirsium creticum d'Urv. subsp. triumfetti (Lacaita) Werner [ exact canonical match, Score: 0.988 ]
    GBIF Backbone Taxonomy

But these high scores are unexpected:

  1. Cirsium creticum (Lam.) d'Urv. [ exact canonical match, Score: 0.988 ]
    Catalogue of Life

  2. Cirsium creticum (Lam.) d’Urv. [ exact canonical match, Score: 0.988 ]
    GBIF Backbone Taxonomy

  3. Cirsium creticum (Lam.) d’Urv. subsp. creticum [ exact canonical match, Score: 0.988 ]
    GBIF Backbone Taxonomy
    (note 'creticum' vs 'triumfetti' at infraspecific level).
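
One way a client could tell these cases apart is to compare canonical forms word by word. The sketch below assumes canonical forms are space-separated epithets without rank markers or authors; the function name and the three labels are hypothetical and not something GNR returns.

```python
def compare_canonical(submitted, matched):
    """Classify how completely a matched canonical form covers the submitted one."""
    s, m = submitted.split(), matched.split()
    if s == m:
        return "full"
    # The match stops at a higher rank, e.g. species for a subspecies query.
    if len(m) < len(s) and s[:len(m)] == m:
        return "partial"
    return "mismatch"


query = "Cirsium creticum triumfetti"
for candidate in ("Cirsium creticum triumfetti",
                  "Cirsium creticum",
                  "Cirsium creticum creticum"):
    print(candidate, "->", compare_canonical(query, candidate))
```

A score could then be lowered for "partial" and "mismatch" results instead of reporting all three at the same level.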

As a User I should be aware when data source records were updated last

Currently the resolver does not report when a particular source was last updated. There are two ways of adding this information: when someone queries the data source itself, or as a part of each record. I think the best way would be to return it once for each data source mentioned in the output.

data sources
    data_source
        id 1
        title NCBI
        last_updated 2012-12-1

resolve_once behavior

I am curious whether the resolve_once behavior is correct. For example, this call http://resolver.globalnames.org/name_resolvers.json?names=Plantago+major&resolve_once=true returns more than one match (all from different sources, I think, but still more than 1 match).

The API docs suggest that setting resolve_once=TRUE should return just the first match.
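
For anyone reproducing this, the query string can be built as below. The endpoint and the two parameters are taken from the URL quoted in this issue; the helper name is hypothetical, and the snippet only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "http://resolver.globalnames.org/name_resolvers.json"


def resolver_url(name, resolve_once=False):
    """Build the GET URL for resolving a single name."""
    params = {"names": name}
    if resolve_once:
        params["resolve_once"] = "true"
    return BASE_URL + "?" + urlencode(params)


print(resolver_url("Plantago major", resolve_once=True))
```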

Jobs seem to never finish?

When a list of names of length greater than 1000 is passed via POST request, jobs never seem to finish. Is this just me? Perhaps something going on with the queueing system in the backend?

@dimus @dshorthouse
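
Until the cause is known, a client-side workaround might be to split long lists into smaller requests. This is a sketch only: the batch size of 500 is an arbitrary assumption, and whether smaller batches actually avoid the problem has not been verified here.

```python
def batches(names, size=500):
    """Yield consecutive slices of at most `size` names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


names = ["Name %d" % i for i in range(1200)]
print([len(chunk) for chunk in batches(names)])
```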

Alternative output of result name in GNR [Request]

Right now, GNR produces two outputs of matched names: (1) the entire taxon name incl. author and (2) the canonical form. For example, for taxon 'Stipa pennnata var. anomala', you get: (1) 'Stipa pennata var. anomala (P.A.Smirn.) Tzvelev' and (2) 'Stipa pennata anomala'.

I would like to request a third output, that includes taxonomic abbreviations (e.g. 'var', 'subsp') but not author names. Using the example above, this would be (3) 'Stipa pennata var. anomala'.

Some biological databases use the full taxon name (incl. abbreviations) but not the author name. Having this feature would help facilitate data merging among databases with the taxon name as a key variable.
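
Until such an output exists, the requested form can be approximated on the client side from the two existing outputs. The sketch below is a heuristic under stated assumptions: the list of rank markers is a small illustrative set, the function name is hypothetical, and "f." is deliberately left out because it can also be part of an author abbreviation.

```python
# Illustrative set of rank markers; real data may use others.
RANK_MARKERS = {"subsp.", "ssp.", "var.", "subvar."}


def ranked_canonical(name_string, canonical_form):
    """Keep the canonical words plus any rank marker that precedes one of them."""
    remaining = canonical_form.split()
    kept = []
    for token in name_string.split():
        if remaining and token == remaining[0]:
            kept.append(remaining.pop(0))
        elif token in RANK_MARKERS and remaining:
            kept.append(token)
    return " ".join(kept)


print(ranked_canonical("Stipa pennata var. anomala (P.A.Smirn.) Tzvelev",
                       "Stipa pennata anomala"))
```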

add common names to resolver result

as discussed with @dimus -

In addition to taxon hierarchies, suggest to include available common names for resolved taxa. This would help me immensely in making the search features in http://globalbioticinteractions.org friendlier for humans.

Currently the resolver returns something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
...

suggested result (including common names) something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
common_names: "human @en|Mensch @de|mens @nl",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
taxon_id: "9606",
...
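
If the suggested common_names field were adopted in the pipe-delimited "name @lang" form shown above, a client could unpack it as follows. The field does not exist in the current output, so this only illustrates the proposal; the function name is hypothetical.

```python
def parse_common_names(field):
    """Turn 'human @en|Mensch @de' into {'en': ['human'], 'de': ['Mensch']}."""
    names = {}
    for entry in field.split("|"):
        name, _, lang = entry.rpartition(" @")
        if name:  # skip entries without a language tag
            names.setdefault(lang, []).append(name)
    return names


print(parse_common_names("human @en|Mensch @de|mens @nl"))
```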

Fuzzy matching on genus name

I noticed that fuzzy matching in GNR did not work in the case of "Buphtalmum salicifolium". The typo is in the genus name (correct spelling: Buphthalmum salicifolium; Asteraceae).

Here is an example with the R taxize package and the name (incl. typo in genus name). I just get the genus name from IRMNG (but not the correct spelling):

gnr_resolve("Buphtalmum salicifolium")
           submitted_name              matched_name                               data_source_title score
1 Buphtalmum salicifolium Buphtalmum Linnaeus, 1753 Interim Register of Marine and Nonmarine Genera  0.75

When supplying the correctly spelt name, I get a match, so the name is included in the DBs:

gnr_resolve("Buphthalmum salicifolium") %>% head(2)
        submitted_name             matched_name data_source_title score
1 Buphthalmum salicifolium Buphthalmum salicifolium              NCBI 0.988
2 Buphthalmum salicifolium Buphthalmum salicifolium          Freebase 0.988

Other taxon names with a typo in the genus name work just fine ("Stippa" should be "Stipa"):

gnr_resolve("Stippa pennata") %>% head(2)
  submitted_name     matched_name data_source_title score
1 Stippa pennata Stipa pennata L. Catalogue of Life  0.75
2 Stippa pennata Stipa pennata L.              ITIS  0.75

I know this is just a single case but it really puzzled me.
Thanks for your effort.

Figure out how to do multiple datasource name resolution

Right now we search multiple data sources at every step. For example, if we search 5 data sources and a name has an exact string match in only one of them, it drops out of the search pattern and is not checked by canonical or fuzzy search, even if it could be found in the other 4 data sources. We have to figure out whether this is what we want or whether we want a different approach.

Least surprise behavior.

I would imagine that if I submitted names and several data sources, I would expect to get back the combined results for all of these data sources, as if I had resolved a name against each of them separately. However, this significantly increases the load on the system. With limited resources it might make sense to decide on a more economical approach.

Possible solutions:

  1. Bite the bullet and search thoroughly in each and every data source
  2. Do what we do now
  3. Allow searching only 1 data source at a time

An independent option for all 3 would be to use the whole GNI data to find names which were not found in the selected data sources.
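
The difference between options 1 and 2 can be sketched as follows. Everything here is hypothetical: the stage names, the function names, and the `match` callable (which stands in for whatever lookup the resolver performs) are assumptions made for illustration, not the resolver's actual code.

```python
STAGES = ("exact", "canonical", "fuzzy")


def resolve_current(name, sources, match):
    """Option 2: stop at the first stage where any data source matches."""
    for stage in STAGES:
        hits = {}
        for source in sources:
            hit = match(stage, name, source)
            if hit:
                hits[source] = hit
        if hits:
            return hits
    return {}


def resolve_thorough(name, sources, match):
    """Option 1: run the stages separately for each data source."""
    hits = {}
    for source in sources:
        for stage in STAGES:
            hit = match(stage, name, source)
            if hit:
                hits[source] = hit
                break
    return hits


# Toy lookup: source A has an exact match, source B only a fuzzy one.
toy = {("exact", "A"): "exact hit", ("fuzzy", "B"): "fuzzy hit"}
lookup = lambda stage, name, source: toy.get((stage, source))
print(resolve_current("Betula", ["A", "B"], lookup))
print(resolve_thorough("Betula", ["A", "B"], lookup))
```

In the toy data the current approach returns only the exact hit from source A, while the thorough approach also returns the fuzzy hit from source B, at the cost of up to three lookups per data source.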

search for Anura in WoRMS returns non-WoRMS taxon ids

When searching for the name "Anura" at http://resolver.globalnames.org , a result is returned that includes unexpected WoRMS classification ids like gn:I7WB_zNSUlqQP-4KWe8VCA . Expected WoRMS ids would include 448306 (see http://www.marinespecies.org/aphia.php?p=taxdetails&id=448306).

{
    "data_source_id": 9,
    "data_source_title": "WoRMS",
    "gni_uuid": "58305eaf-9617-5566-9bed-45388e44643b",
    "name_string": "Anura",
    "canonical_form": "Anura",
    "classification_path": "Animalia|Chordata|Amphibia|Anura",
    "classification_path_ranks": "kingdom|phylum|class|order",
    "classification_path_ids": "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA",
    "taxon_id": "gn:I7WB_zNSUlqQP-4KWe8VCA",
    "match_type": 1,
    "prescore": "1|0|0",
    "score": 0.75
}
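
Records like this one can be flagged on the client side. The check below assumes, based only on the example above, that a "gn:" prefix marks an id generated by the resolver rather than one supplied by the data source; the function name is hypothetical.

```python
def generated_ids(result):
    """Return the classification ids that carry the 'gn:' prefix."""
    ids = result.get("classification_path_ids", "").split("|")
    return [i for i in ids if i.startswith("gn:")]


record = {
    "data_source_title": "WoRMS",
    "classification_path_ids": "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:I7WB_zNSUlqQP-4KWe8VCA",
}
print(generated_ids(record))
```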

As a user of API I receive results of the name resolution in corresponding format

I just broke returning results by changing what the name resolver instance returns. New results have to be sent with only the following fields added to the result header:

1 context for all data sources
2 all names first in the following format:
[ {:id => 1, :name => 'Betula', :results => .....}, {...}....]

results should have the following fields:

  1. data_source_id
  2. gni uuid
  3. name_string
  4. canonical_form
  5. classification path (if exists)
  6. classification path ids (if they exist)
  7. ds taxon id
  8. ds. guid (if exists)
  9. urls (if they exist)
  10. current_name taxon id (if exists)
  11. current name string (if exists)
  12. match type id
  13. prescore in pipe delimited format
  14. score

All input is ignored after an empty line

I came across this issue by accident today. If the user enters a list of species names and one of the rows in between is empty, Global Names Resolver will ignore any input after that.

Example input to reproduce this problem:

Danio rerio

Laticauda semifasciata
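
The expected behavior is that blank rows are skipped rather than treated as the end of the input. A minimal sketch of that parsing, with a hypothetical function name and no claim about how the resolver currently reads its input:

```python
def parse_names(text):
    """Return one name per non-empty line, ignoring blank rows."""
    return [line.strip() for line in text.splitlines() if line.strip()]


print(parse_names("Danio rerio\n\nLaticauda semifasciata\n"))
```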

not an issue, more a question: Search by Identifiers and get parent/s

Hi!

I've seen that you have almost all the taxonomies our users use to report species (NCBI, WORMS, ITIS, ...)

I was wondering if there is a way to programmatically send an id (like WORMS:123) and get the info. We would also need to ask for parents and navigate up the tree.

Nice resource!

Search and return recorded synonyms

Would it be possible to search and return alternative names?
e.g. a search in NCBI via GNR for 'Alopex lagopus' returns nothing even though NCBI records it as a synonym for 'Vulpes lagopus'.

Many thanks,
Dom

introduce a method to allow humans to annotate (and potentially correct) bad names

Following a discussion with @dimus , the following use case would allow users to help classify / annotate bad names and suggest alternative valid names:

  1. A checklist contains the species name Airius felis
  2. A program X sends the name to the resolver API at resolver.globalnames.org
  3. Resolver replies that the name Airius felis is bad and includes a feedback url
  4. Program X sends the feedback url to a user Y
  5. User Y clicks on the feedback url
  6. On the name feedback webpage, user Y confirms that the name is bad, indicates that the name contains a typo (e.g. Arius felis), is outdated, and that the valid name is Ariopsis felis .
  7. Next time program X sends the name Airius felis, resolver replies that the name is bad and includes a suggestion to use the name Ariopsis felis suggested by user Y.

@dimus let me know if I captured the use case properly. I do realize that this use case is very specific to GloBI . . .
