gni's Introduction

Global Names Index v.2

Indexes occurrences of biological scientific names in the world, and normalizes and reconciles lexical and taxonomic variants of the name strings.

Citing for Global Names Index v1

DOI

Citing for Global Names Index v2 (Global Names Resolver)

DOI

Dependencies

  • Install Ruby (see version in .ruby-version)
  • Install bundle gem
  • Install Java JDK 6-8
  • Install Redis

Run

bundle install

Testing

rake db:drop:all
rake db:create:all
rake db:migrate
rake db:migrate RAILS_ENV=test
rake db:seed
rake db:seed RAILS_ENV=test
rake solr:start (or rake solr:run in a separate terminal window)
rake solr:build
rspec

Also see .travis.yml org as an example

Working with seed data

Use rake "db:seed RAILS_ENV=your_env" to populate tables in development, test and production environments. Different environments differ in how and which tables are populated. To add more data for testing/development purposes use

rake db:addnames

This command reads data from spec/files/addnames.csv to add more records to the system.

Resolver worker

RAILS_ENV=production RAKE_ENV=production QUEUE=name_resolver rake resque:work

Assets precompiling

bundle exec rake assets:precompile

Rebuilding canonical names index for production

From the machine running Solr, go to the gni directory and run

rake db:solr:build RAILS_ENV=production

Copyright

Authors: Dmitry Mozzherin, David Shorthouse

Copyright (c) 2012-2013 Marine Biological Laboratory. See LICENSE.txt for further details.

gni's Issues

Jobs seem to never finish?

When a list of names of length greater than 1000 is passed via POST request, jobs never seem to finish. Is this just me? Perhaps something going on with the queueing system in the backend?
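One client-side workaround to try while this is investigated is to split long lists into smaller batches before posting them. The sketch below only shows the splitting step. The 1000-name limit comes from the report above, the constant and method names are hypothetical, and whether batching actually avoids the stalled jobs is an assumption.

```ruby
# Hypothetical limit, taken from the report above: lists longer than this stall.
MAX_NAMES_PER_REQUEST = 1000

# Split a list of names into batches no longer than the limit.
def batch_names(names, limit = MAX_NAMES_PER_REQUEST)
  names.each_slice(limit).to_a
end

names = (1..2500).map { |i| "Name #{i}" }
puts batch_names(names).map(&:length).inspect
```

Each batch could then be submitted as its own request and the results concatenated on the client.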

@dimus @dshorthouse

not an issue, more a question: Search by Identifiers and get parent/s

Hi!

I've seen that you have almost all the taxonomies our users use to report species (NCBI, WoRMS, ITIS, ...).

I was wondering if there is a way to programmatically send an id (like WORMS:123) and get the info. We would also need to ask for parents and navigate up the tree.

Nice resource!

Introduce a method to allow humans to annotate (and potentially correct) bad names

Following a discussion with @dimus, the following use case would allow users to help classify / annotate bad names and suggest alternative valid names:

  1. A checklist contains the species name Airius felis.
  2. Program X sends the name to the resolver API at resolver.globalnames.org.
  3. The resolver replies that the name Airius felis is bad and includes a feedback URL.
  4. Program X sends the feedback URL to user Y.
  5. User Y clicks on the feedback URL.
  6. On the name feedback web page, user Y confirms that the name is bad, indicates that the name contains a typo (e.g. Arius felis), is outdated, and that the valid name is Ariopsis felis.
  7. The next time program X sends the name Airius felis, the resolver replies that the name is bad and includes a suggestion to use the name Ariopsis felis, as suggested by user Y.

@dimus let me know if I captured the use case properly. I do realize that this use case is very specific to GloBI . . .

All input is ignored after an empty line

I came across this issue by accident today. If the user enters a list of species names and one of the rows in between is empty, Global Names Resolver will ignore any input after that.

Example input to reproduce this problem:

Danio rerio

Laticauda semifasciata
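Until this is fixed, a caller could filter out blank rows before submitting a list. The following is a minimal sketch of that pre-processing step. The method name is hypothetical and this is not the resolver's own input handling.

```ruby
# Drop blank lines and surrounding whitespace from a pasted list of names.
def clean_name_list(text)
  text.lines.map(&:strip).reject(&:empty?)
end

input = "Danio rerio\n\nLaticauda semifasciata\n"
puts clean_name_list(input).inspect
# prints ["Danio rerio", "Laticauda semifasciata"]
```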

Fuzzy matching on genus name

I noticed that fuzzy matching in GNR did not work in the case of "Buphtalmum salicifolium". The typo is in the genus name (correct spelling: Buphthalmum salicifolium; Asteraceae).

Here is an example with the R taxize package and the name (incl. the typo in the genus name). I just get the genus name from IRMNG (but not the correct spelling):

gnr_resolve("Buphtalmum salicifolium")
           submitted_name              matched_name                               data_source_title score
1 Buphtalmum salicifolium Buphtalmum Linnaeus, 1753 Interim Register of Marine and Nonmarine Genera  0.75

When supplying the correctly spelt name, I get a match, so the name is included in the DBs:

gnr_resolve("Buphthalmum salicifolium") %>% head(2)
        submitted_name             matched_name data_source_title score
1 Buphthalmum salicifolium Buphthalmum salicifolium              NCBI 0.988
2 Buphthalmum salicifolium Buphthalmum salicifolium          Freebase 0.988

Other taxon names with a typo in the genus name work just fine ("Stippa" should be "Stipa"):

gnr_resolve("Stippa pennata") %>% head(2)
  submitted_name     matched_name data_source_title score
1 Stippa pennata Stipa pennata L. Catalogue of Life  0.75
2 Stippa pennata Stipa pennata L.              ITIS  0.75

I know this is just a single case but it really puzzled me.
Thanks for your effort.

Figure out how to do multiple datasource name resolution

Right now we search multiple data sources at every step. For example, if we search 5 data sources and a name has an exact string match in only one of them, it leaves the search pipeline and is not checked by canonical or fuzzy search, even if it could be found in the other 4 data sources. We have to figure out whether this is what we want or whether we want a different approach.

Least surprise behavior.

I would imagine that if I submitted names and several data sources, I would expect to get back the combined results for all of these data sources, as if I had resolved each name against each of them separately. However, this significantly increases the load on the system. With limited resources it might make sense to decide on a more economical approach.

Possible solutions:

  1. Bite the bullet and search thoroughly in each and every data source
  2. Do what we do now
  3. Allow searching only 1 data source at a time

An independent option for all 3 would be to use the whole GNI data to find names which were not found in the selected databases.
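The difference between options 1 and 2 can be illustrated with a toy model. This is a sketch under stated assumptions, not the resolver's code: the two in-memory sources, the step names, and both method names are hypothetical, and only exact and canonical steps are modelled.

```ruby
# Hypothetical sources: for each match step, the names that step can match.
SOURCES = {
  "A" => { exact: ["Betula alba L."], canonical: [] },
  "B" => { exact: [], canonical: ["Betula alba L."] }
}.freeze
STEPS = [:exact, :canonical].freeze

# Option 2 (current behavior): one shared cascade. The name leaves the
# pipeline as soon as any source matches it at some step.
def resolve_shared_cascade(name, sources)
  STEPS.each do |step|
    hits = sources.select { |_, index| index[step].include?(name) }.keys
    return hits.map { |source| [source, step] } unless hits.empty?
  end
  []
end

# Option 1: run the full cascade separately for every source.
def resolve_per_source(name, sources)
  sources.map do |source, index|
    step = STEPS.find { |s| index[s].include?(name) }
    [source, step] if step
  end.compact
end

p resolve_shared_cascade("Betula alba L.", SOURCES)
p resolve_per_source("Betula alba L.", SOURCES)
```

With these toy data the shared cascade reports only source A, while the per-source cascade also reports the canonical match in source B, at the cost of running every step against every source.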

As a User I should be aware when data source records were updated last

Currently the resolver gives no information about when a particular source was last updated. There are two ways of adding this information: when someone queries the data source itself, or as part of each record. I think the best way would be to return it once for each data source mentioned in the output.

data sources
    data_source
        id 1
        title NCBI
        last_updated 2012-12-1

Problems with fuzzy matching

Hi Dima,

It occurs to me that if you could determine why the fuzzy matching algorithm goes wrong in this one particular example, it might suggest ways to improve accuracy overall.

The single letter OCR error Euglcna gracilis (for Euglena gracilis) is instead consistently replaced by Egilina gracilis, which requires substitution of four letters. Both names are in GNI. Other OCR errors for this name are also resolved in the same way.

BHL pages with OCR errors in the name Euglena gracilis that are misinterpreted as Egilina gracilis:
http://www.biodiversitylibrary.org/page/929138
http://www.biodiversitylibrary.org/page/1364518
http://www.biodiversitylibrary.org/page/1364524
http://www.biodiversitylibrary.org/page/1364525
http://www.biodiversitylibrary.org/page/1364531
http://www.biodiversitylibrary.org/page/1364532
http://www.biodiversitylibrary.org/page/1488979
http://www.biodiversitylibrary.org/page/1503443
http://www.biodiversitylibrary.org/page/1503451
http://www.biodiversitylibrary.org/page/1547005
http://www.biodiversitylibrary.org/page/1547013
http://www.biodiversitylibrary.org/page/1547015
http://www.biodiversitylibrary.org/page/1539409
http://www.biodiversitylibrary.org/page/1488189

As I pointed out previously, a single letter difference in the Latin name can be the difference between a butterfly and a snail, which makes fuzzy matching inappropriate. There seem to be a variety of algorithms specifically for detecting and correcting OCR errors, however, which are qualitatively different from the sort of phonetic misspelling and qwerty keyboard typographical errors that fuzzy matching is intended to overcome.

OCR specific algorithms "know" about the nature of OCR errors, which result from similarly shaped characters being confounded or not recognized. In the example case, "c" is more likely to be an OCR error for "e" than for "i". Some OCR errors are obvious (mid-string non-alpha characters and case changes) while others are statistically much more likely than others. A probability table could suggest likely and eliminate unlikely substitutions. There seems to be a considerable literature on the subject, for example http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_nlp.pdf.

Use of GNI as a dictionary should be done with caution. It is rather "dirty," both significantly incomplete and full of errors. It must be relied on for recognition of genus group names, but a species group name string not being in GNI, particularly in combination with a particular genus, does not mean that it is invalid and must be replaced.
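The probability-table idea can be sketched as an edit distance in which substitutions between visually similar characters cost less than other edits. This is a minimal illustration, not the resolver's matching algorithm: the confusion pairs and their costs are invented for the example, and the method names are hypothetical.

```ruby
# Hypothetical confusion costs; real values would come from measured OCR error rates.
OCR_CONFUSIONS = [
  ["c", "e", 0.3],
  ["1", "l", 0.2],
  ["0", "o", 0.2]
].each_with_object({}) { |(a, b, cost), table| table[[a, b].sort] = cost }.freeze

# Substituting a character costs 0 if equal, a reduced cost if the pair is a
# known OCR confusion, and 1 otherwise.
def substitution_cost(a, b)
  return 0.0 if a == b
  OCR_CONFUSIONS.fetch([a, b].sort, 1.0)
end

# Weighted Levenshtein distance: insertions and deletions cost 1,
# substitutions use substitution_cost. Comparison is case-insensitive.
def ocr_distance(source, target)
  s = source.downcase.chars
  t = target.downcase.chars
  previous = (0..t.length).map(&:to_f)
  s.each_with_index do |source_char, i|
    current = [i + 1.0]
    t.each_with_index do |target_char, j|
      current << [
        previous[j + 1] + 1.0,                                  # deletion
        current[j] + 1.0,                                       # insertion
        previous[j] + substitution_cost(source_char, target_char)
      ].min
    end
    previous = current
  end
  previous.last
end

# Pick the candidate with the smallest OCR-weighted distance.
def best_candidate(observed, candidates)
  candidates.min_by { |candidate| ocr_distance(observed, candidate) }
end

puts best_candidate("Euglcna gracilis", ["Egilina gracilis", "Euglena gracilis"])
# prints Euglena gracilis
```

With this weighting, the single "c" for "e" confusion is much cheaper than the edits needed to reach Egilina, so Euglena gracilis ranks first.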

Regards,
Pat

search for Anura in WoRMS returns non-WoRMS taxon ids

When searching for the name "Anura" at http://resolver.globalnames.org , the result includes unexpected WoRMS classification ids like gn:I7WB_zNSUlqQP-4KWe8VCA . Expected WoRMS ids would include 448306 (see http://www.marinespecies.org/aphia.php?p=taxdetails&id=448306).

{
    "data_source_id": 9,
    "data_source_title": "WoRMS",
    "gni_uuid": "58305eaf-9617-5566-9bed-45388e44643b",
    "name_string": "Anura",
    "canonical_form": "Anura",
    "classification_path": "Animalia|Chordata|Amphibia|Anura",
    "classification_path_ranks": "kingdom|phylum|class|order",
    "classification_path_ids": "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA",
    "taxon_id": "gn:I7WB_zNSUlqQP-4KWe8VCA",
    "match_type": 1,
    "prescore": "1|0|0",
    "score": 0.75
}
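A client that needs native WoRMS ids could at least detect records like the one above. The sketch below pairs up the three pipe-delimited classification fields and flags ids with the `gn:` prefix. Treating that prefix as a marker for ids not issued by the data source is an assumption based on this report, and the method names are hypothetical.

```ruby
require "json"

# A trimmed copy of the record shown above.
record = JSON.parse(<<~JSON)
  {
    "data_source_title": "WoRMS",
    "classification_path": "Animalia|Chordata|Amphibia|Anura",
    "classification_path_ranks": "kingdom|phylum|class|order",
    "classification_path_ids": "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA",
    "taxon_id": "gn:I7WB_zNSUlqQP-4KWe8VCA"
  }
JSON

# Assumption: ids with a "gn:" prefix did not come from the data source itself.
def generated_id?(id)
  id.start_with?("gn:")
end

# Combine the three pipe-delimited fields into one entry per classification node.
def classification(record)
  names = record["classification_path"].split("|")
  ranks = record["classification_path_ranks"].split("|")
  ids   = record["classification_path_ids"].split("|")
  names.zip(ranks, ids).map { |name, rank, id| { name: name, rank: rank, id: id } }
end

classification(record).each do |node|
  flag = generated_id?(node[:id]) ? "generated id" : "source id"
  puts "#{node[:rank]}: #{node[:name]} (#{flag})"
end
```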

Search and return recorded synonyms

Would it be possible to search and return alternative names?
e.g. a search in NCBI via GNR for 'Alopex lagopus' returns nothing even though NCBI records it as a synonym for 'Vulpes lagopus'.

Many thanks,
Dom

add common names to resolver result

as discussed with @dimus -

In addition to taxon hierarchies, suggest to include available common names for resolved taxa. This would help me immensely in making the search features in http://globalbioticinteractions.org friendlier for humans.

Currently the resolver returns something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
...

suggested result (including common names) something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
common_names: "human @en|Mensch @de|mens @nl",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
taxon_id: "9606",
...
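If the suggested `common_names` field were adopted in this form, a client could split it into names per language. This is a sketch of parsing the proposed format only. The field does not exist in the current output, and the method name is hypothetical.

```ruby
# Parse the proposed "name @lang|name @lang" string into { "en" => ["human"], ... }.
def parse_common_names(field)
  field.split("|").each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |entry, by_lang|
    name, lang = entry.strip.split(/\s+@/, 2)
    # Skip entries that lack either a name or a language tag.
    by_lang[lang] << name unless name.nil? || lang.nil?
  end
end

p parse_common_names("human @en|Mensch @de|mens @nl")
```

Grouping by language code keeps the door open for several common names in the same language.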

resolve_once behavior

Curious if resolve_once behavior is correct e.g., this call http://resolver.globalnames.org/name_resolvers.json?names=Plantago+major&resolve_once=true returns more than one match (all from different sources I think, but more than 1 match)

The API docs suggest setting resolve_once=TRUE should just get first match.
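For reproducing the call, the request URL can be built rather than typed by hand. The sketch below only constructs the URL quoted above with Ruby's standard `uri` library and does not contact the service. The constant and method names are hypothetical.

```ruby
require "uri"

# Endpoint taken from the call quoted above.
RESOLVER_ENDPOINT = "http://resolver.globalnames.org/name_resolvers.json"

# Build a resolver URL for a single name, optionally with resolve_once set.
def resolver_url(name, resolve_once: false)
  query = URI.encode_www_form(names: name, resolve_once: resolve_once)
  "#{RESOLVER_ENDPOINT}?#{query}"
end

puts resolver_url("Plantago major", resolve_once: true)
```

Comparing the responses for `resolve_once: true` and `resolve_once: false` would show whether the flag changes the number of matches at all.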

GNR: Unexpected high matching scores despite incomplete match at infraspecific rank

Taxa at an infraspecific rank can get high matches in GNR, although the result is just at the species rank or is incorrect:

For example, when supplying 'Cirsium creticum s. triumfetti' this result makes sense:

  1. Cirsium creticum d'Urv. subsp. triumfetti (Lacaita) Werner [ exact canonical match, Score: 0.988 ]
    GBIF Backbone Taxonomy

But these high scores are unexpected:

  1. Cirsium creticum (Lam.) d'Urv. [ exact canonical match, Score: 0.988 ]
    Catalogue of Life

  2. Cirsium creticum (Lam.) d’Urv. [ exact canonical match, Score: 0.988 ]
    GBIF Backbone Taxonomy

  3. Cirsium creticum (Lam.) d’Urv. subsp. creticum [ exact canonical match, Score: 0.988 ]
    GBIF Backbone Taxonomy
    (note 'creticum' vs 'triumfetti' at infraspecific level).
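One way to tell these cases apart is to compare canonical forms word by word and report how far the two names agree. The sketch below is a hypothetical client-side check, not the resolver's scoring. It assumes the submitted name has the canonical form 'Cirsium creticum triumfetti', and the method name and result labels are invented for the example.

```ruby
# Compare two canonical forms word by word (genus, species, infraspecific epithet).
def match_depth(submitted, matched)
  s = submitted.split
  m = matched.split
  shared = s.zip(m).take_while { |a, b| a == b }.length
  if shared == s.length && shared == m.length
    :full      # every word agrees
  elsif shared == m.length
    :partial   # matched name stops at a higher rank than the submitted one
  else
    :differs   # names diverge, or the matched name is more specific
  end
end

submitted = "Cirsium creticum triumfetti"
p match_depth(submitted, "Cirsium creticum triumfetti") # :full
p match_depth(submitted, "Cirsium creticum")            # :partial
p match_depth(submitted, "Cirsium creticum creticum")   # :differs
```

A score could then be lowered for `:partial` and `:differs` results instead of reporting all three as equally strong canonical matches.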

Alternative output of result name in GNR [Request]

Right now, GNR produces two outputs of matched names: (1) the entire taxon name incl. author and (2) the canonical form. For example, for the taxon 'Stipa pennata var. anomala', you get: (1) 'Stipa pennata var. anomala (P.A.Smirn.) Tzvelev' and (2) 'Stipa pennata anomala'.

I would like to request a third output, that includes taxonomic abbreviations (e.g. 'var', 'subsp') but not author names. Using the example above, this would be (3) 'Stipa pennata var. anomala'.

Some biological databases use the full taxon name (incl. abbreviations) but not the author name. Having this feature would help facilitate data merging among databases with the taxon name as a key variable.
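Until such an output exists, the requested form can be approximated on the client from the two outputs GNR already returns. The sketch below walks through the full name string and keeps only the words of the canonical form plus rank markers between them. It is a naive heuristic, not a name parser: the marker list and method name are hypothetical, and form markers such as 'f.' are left out because 'f.' also occurs in author strings.

```ruby
# Hypothetical list of rank markers to keep; 'f.' is omitted on purpose.
RANK_MARKERS = %w[subsp. ssp. var. subvar.].freeze

# Keep canonical words (in order) and any rank marker that appears between them.
def name_with_ranks(name_string, canonical_form)
  remaining = canonical_form.split
  kept = []
  name_string.split.each do |token|
    if token == remaining.first
      kept << remaining.shift
    elsif RANK_MARKERS.include?(token) && !remaining.empty? && !kept.empty?
      kept << token
    end
  end
  kept.join(" ")
end

puts name_with_ranks("Stipa pennata var. anomala (P.A.Smirn.) Tzvelev", "Stipa pennata anomala")
# prints Stipa pennata var. anomala
```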

As a user of the API I receive results of the name resolution in the corresponding format

I just broke the returned results by changing what the name resolver instance returns. New results have to be sent with only the following fields added to the result header:

  1. context for all data sources
  2. all names first, in the following format:
[ {:id => 1, :name => 'Betula', :results => .....}, {...}....]

results should have the following fields:

  1. data_source_id
  2. gni uuid
  3. name_string
  4. canonical_form
  5. classification path (if exists)
  6. classification path ids (if exist)
  7. ds taxon id
  8. ds. guid (if exists)
  9. urls (if exist)
  10. current_name taxon id (if exist)
  11. current name string (if exists)
  12. match type id
  13. prescore in pipe delimited format
  14. score
