globalnamesarchitecture / gni

This project forked from dimus/gni

Global Names Index

Home Page: http://wiki.github.com/GlobalNamesArchitecture/gni

Ruby 42.06% JavaScript 5.30% CoffeeScript 0.27% CSS 1.77% HTML 3.19% XSLT 13.79% Shell 0.03% Java 26.82% Gherkin 0.29% SCSS 0.99% Haml 5.49%

gni's Introduction

Global Names Index v.2

Indexes occurrences of biological scientific names in the world, and normalizes and reconciles lexical and taxonomic variants of the name strings.

Citing for Global Names Index v1

Citing for Global Names Index v2 (Global Names Resolver)

Dependencies

Install Ruby (see version in .ruby-version)
Install the bundler gem
Install Java JDK 6-8
Install Redis

Run

bundle install

Testing

rake db:drop:all
rake db:create:all
rake db:migrate
rake db:migrate RAILS_ENV=test
rake db:seed
rake db:seed RAILS_ENV=test
rake solr:start (or rake solr:run in a separate terminal window)
rake solr:build
rspec

Also see .travis.yml org as an example

Working with seed data

Use rake "db:seed RAILS_ENV=your_env" to populate tables in the development, test and production environments. The environments differ in which tables are populated and how. To add more data for testing/development purposes, use:

rake db:addnames

The command reads data from spec/files/addnames.csv to add more records to the system.

Resolver worker

RAILS_ENV=production RAKE_ENV=production QUEUE=name_resolver rake resque:work

Assets precompiling

bundle exec rake assets:precompile

Rebuilding canonical names index for production

On the machine running Solr, go to the gni directory and run:

rake db:solr:build RAILS_ENV=production

Copyright

Authors: Dmitry Mozzherin, David Shorthouse


gni's Issues

Jobs seem to never finish?

When a list of names of length greater than 1000 is passed via POST request, jobs never seem to finish. Is this just me? Perhaps something going on with the queueing system in the backend?

@dimus @dshorthouse

Provide GNI data sources for BioGUID Identifier Domains

Generate a list of all data sources in GNI, and provide them with the ID value and associated metadata to @deepreef for registration as Identifier Domains in BioGUID.

As a User I should be certain that status message indicates current state of name_resolving API request every time I check my resource

not an issue, more a question: Search by Identifiers and get parent/s

Hi!

I've seen that you have almost all the taxonomies our users report species from (NCBI, WoRMS, ITIS, ...).

I was wondering if there is a way to programmatically send an id (like WORMS:123) and get the info. We would also need to ask for parents and navigate up the tree.

Nice resource!

As a user I want to get information about names I have in 'bad' format (not latin1 or utf-8)

Bug: GBIF, uBio crash the resolution of some lists

Add human readable match status

introduce a method to allow human to annotate (and potentially correct) bad names

Following a discussion with @dimus, the following use case would allow users to help classify and annotate bad names and suggest alternative valid names:

A checklist contains the species name Airius felis.
Program X sends the name to the resolver API at resolver.globalnames.org.
The resolver replies that the name Airius felis is bad and includes a feedback URL.
Program X sends the feedback URL to user Y.
User Y clicks on the feedback URL.
On the name feedback web page, user Y confirms that the name is bad, indicates that it contains a typo (e.g. Arius felis) and is outdated, and that the valid name is Ariopsis felis.
The next time program X sends the name Airius felis, the resolver replies that the name is bad and includes a suggestion to use the name Ariopsis felis, as suggested by user Y.

@dimus let me know if I captured the use case properly. I do realize that this use case is very specific to GloBI . . .

All input is ignored after an empty line

I came across this issue by accident today. If the user enters a list of species names and one of the rows in between is empty, Global Names Resolver will ignore any input after that.

Example input to reproduce this problem:

Danio rerio

Laticauda semifasciata
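
A minimal sketch of the behaviour the report seems to expect, assuming the input arrives as one newline-separated string. The helper name parse_names is hypothetical and is not taken from the gni code base; it only shows blank rows being skipped rather than treated as the end of input.

```ruby
# Hypothetical helper (not from the gni source): split pasted text into
# name strings, skipping blank rows instead of stopping at the first one.
def parse_names(raw_input)
  raw_input.each_line.map(&:strip).reject(&:empty?)
end

names = parse_names("Danio rerio\n\nLaticauda semifasciata\n")
puts names.inspect
```

With the example input above, both names survive the empty row between them.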

Bug: API name_resolver GET request with a uninomial name returns 500

http://resolver.globalnames.org down?

Hey y'all,

I was hoping to use globalnames.org to do fast name lookups for our project Global Biotic Interactions . . . and it seems that http://resolver.globalnames.org is no longer responding. An integration test that worked earlier is now failing.

Would it be possible to run the resolver locally?

thx,
-jorrit

Import iNaturalist data

get inat from http://www.inaturalist.org/observations/gbif-observations-dwca.zip

As a user I should be able to opt name_resolver API for a unique list of names to deduplicate the list I sent

As a user I always access name_resolver resource by uuid, and never by id

Name resolution is a reasonably private process: I can only access my information by uuid token, and never by the id of the resource, to reduce the risk of unauthorized access.

As a user to get my name resolution working I must submit from 1 to 5 data_sources

Hard limits are from 1 to 5 data sources

if less or more -- return failure code with the message which explains the problem
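
The limit described above could be checked roughly as follows. This is only a sketch: the method name, the return shape and the message wording are assumptions, and only the bounds of 1 and 5 come from the issue text.

```ruby
# Bounds taken from the issue text; everything else here is illustrative.
MIN_DATA_SOURCES = 1
MAX_DATA_SOURCES = 5

# Returns [ok, message] for a list of data source ids (hypothetical helper).
def validate_data_sources(ids)
  count = Array(ids).uniq.size
  return [true, "OK"] if count.between?(MIN_DATA_SOURCES, MAX_DATA_SOURCES)

  [false, "Expected #{MIN_DATA_SOURCES} to #{MAX_DATA_SOURCES} data sources, got #{count}"]
end

puts validate_data_sources([1, 4, 9]).inspect
puts validate_data_sources([]).inspect
```

Returning the message alongside the flag keeps the explanation of the failure next to the failure code, as the issue asks.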

As a user I should get expected result from name resolver even if an exception happened during the process

Fuzzy matching on genus name

I noticed that fuzzy matching in GNR did not work in the case of "Buphtalmum salicifolium". The typo is in the genus name (correct spelling: Buphthalmum salicifolium; Asteraceae).

Here is an example with the R taxize package and the name (incl. the typo in the genus name). I just get the genus name from IRMNG (but not the correct spelling):

gnr_resolve("Buphtalmum salicifolium")
       submitted_name              matched_name                               data_source_title score
1 Buphtalmum salicifolium Buphtalmum Linnaeus, 1753 Interim Register of Marine and Nonmarine Genera  0.75

When supplying the correctly spelt name, I get a match, so the name is included in the DBs:

gnr_resolve("Buphthalmum salicifolium") %>% head(2)
        submitted_name             matched_name data_source_title score
1 Buphthalmum salicifolium Buphthalmum salicifolium              NCBI 0.988
2 Buphthalmum salicifolium Buphthalmum salicifolium          Freebase 0.988

Other taxon names with a typo in the genus name work just fine ("Stippa" should be "Stipa"):

gnr_resolve("Stippa pennata") %>% head(2)
  submitted_name     matched_name data_source_title score
1 Stippa pennata Stipa pennata L. Catalogue of Life  0.75
2 Stippa pennata Stipa pennata L.              ITIS  0.75

I know this is just a single case but it really puzzled me.
Thanks for your effort.

As a user I should be able to upload a file with names to the Name Resolver user interface

As a User I should access all name strings belonging to a data source through API

Given url similar to:

.../data_sources/123/name_strings

user should be able to get paginated response with all the name strings in the data source and information associated with them

Add date when matched result was imported to resolver

Figure out how to do multiple datasource name resolution

Right now we search multiple data sources at every step. For example, if we search 5 data sources and a name has an exact string match in only one of them, it drops out of the search pattern and is not checked by canonical or fuzzy search, even if it could be found in the remaining 4 data sources. We have to figure out whether this is what we want or whether we want a different approach.

Least surprise behavior.

I would imagine that if I submitted names and several data sources, I would expect to get back the sum of the searches for all of these data sources, as if I had resolved a name against each of them separately. However, that significantly increases the load on the system. With limited resources it might make sense to settle on a more economical approach.

Possible solutions:

Bite the bullet and search thoroughly in each and every data source
Do what we do now
Allow searching only 1 data source at a time

An independent option for all 3 would be to use the whole GNI data to find names which were not found in the selected databases.
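
To make the difference between the first two options concrete, here is a toy sketch. It assumes a data source is just a hash with a title and a list of name strings, and a matching stage is a lambda; none of these names come from the gni code, and the "canonical" stage is a crude stand-in that compares the first two words only.

```ruby
# Toy matching stages: (name, source) -> array of [title, matched_name, type].
EXACT = lambda do |name, source|
  source[:names].select { |n| n == name }.map { |n| [source[:title], n, :exact] }
end

CANONICAL = lambda do |name, source|
  key = name.split.first(2).join(" ")
  source[:names].select { |n| n.split.first(2).join(" ") == key }
                .map { |n| [source[:title], n, :canonical] }
end

# "Do what we do now": run each stage over all sources together and stop
# at the first stage that yields any hit.
def resolve_stop_at_first_hit(name, sources, stages)
  stages.each do |stage|
    hits = sources.flat_map { |source| stage.call(name, source) }
    return hits unless hits.empty?
  end
  []
end

# "Bite the bullet": run the full stage sequence for every source separately.
def resolve_each_source(name, sources, stages)
  sources.flat_map { |source| resolve_stop_at_first_hit(name, [source], stages) }
end

sources = [
  { title: "A", names: ["Betula alba L."] },
  { title: "B", names: ["Betula alba"] }
]
puts resolve_stop_at_first_hit("Betula alba", sources, [EXACT, CANONICAL]).inspect
puts resolve_each_source("Betula alba", sources, [EXACT, CANONICAL]).inspect
```

In this toy data the first strategy reports only the exact hit in source B, while the second also finds the canonical match in source A, at the cost of running every stage once per source.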

As a user I should be able to submit a file with names and get back a zipped version of request, no matter how many names were in the file.

User sends a file with names, user gets back a link to zipped file with all resolved names.

Name resolver should return :results value back instead of values of everything in the instance

As a user I always get immediate response for name resolution using Get API

As a user I want to see status message with percentage of work done when my request is processed by name_resolver API

As a User I should be aware when data source records were updated last

Currently the resolver does not report when a particular source was last updated. There are two ways of adding this information: when someone queries the data source itself, or as part of each record. I think the best way would be to return it once for each data source mentioned in the output.

data sources
    data_source
        id 1
        title NCBI
        last_updated 2012-12-1

Problems with fuzzy matching

Hi Dima,

It occurs to me that if you could determine why the fuzzy matching algorithm goes wrong in this one particular example, it might suggest ways to improve accuracy overall.

The single letter OCR error Euglcna gracilis (for Euglena gracilis) is instead consistently replaced by Egilina gracilis, which requires substitution of four letters. Both names are in GNI. Other OCR errors for this name are also resolved in the same way.

As I pointed out previously, a single letter difference in the Latin name can be the difference between a butterfly and a snail, which makes fuzzy matching inappropriate. There seem to be a variety of algorithms specifically for detecting and correcting OCR errors, however, which are qualitatively different from the sort of phonetic misspelling and qwerty keyboard typographical errors that fuzzy matching is intended to overcome. OCR-specific algorithms "know" about the nature of OCR errors, which result from similarly shaped characters being confounded or not recognized. In the example case, "c" is more likely to be an OCR error for "e" than for "i". Some OCR errors are obvious (mid-string non-alpha characters and case changes) while others are statistically much more likely than others. A probability table could suggest likely and eliminate unlikely substitutions. There seems to be a considerable literature on the subject, for example http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_nlp.pdf. Use of GNI as a dictionary should be done with caution. It is rather "dirty," both significantly incomplete and full of errors. It must be relied on for recognition of genus group names, but a species group name string not being in GNI, particularly in combination with a particular genus, does not mean that it is invalid and must be replaced.

Regards,
Pat
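
The probability-table idea in the message above can be illustrated with a weighted edit distance in which substitutions between visually similar characters cost less than ordinary ones. This is a sketch only: the confusion costs below are invented for illustration (a real table would be estimated from OCR data), and nothing here reflects how the GNI fuzzy matcher is actually implemented.

```ruby
# Assumed costs for visually confusable pairs; keys are sorted character pairs.
OCR_CONFUSION_COST = {
  %w[c e] => 0.2,
  %w[1 l] => 0.2,
  %w[0 o] => 0.2
}.freeze

def substitution_cost(a, b)
  return 0.0 if a == b

  OCR_CONFUSION_COST.fetch([a, b].sort, 1.0)
end

# Levenshtein distance with OCR-aware substitution costs (row-by-row DP).
def ocr_distance(source, target)
  prev = (0..target.length).map(&:to_f)
  source.each_char.with_index(1) do |s_char, i|
    curr = [i.to_f]
    target.each_char.with_index(1) do |t_char, j|
      curr << [
        prev[j] + 1.0,                                  # deletion
        curr[j - 1] + 1.0,                              # insertion
        prev[j - 1] + substitution_cost(s_char, t_char) # substitution
      ].min
    end
    prev = curr
  end
  prev.last
end

candidates = ["Euglena", "Egilina"]
best = candidates.min_by { |candidate| ocr_distance("Euglcna", candidate) }
puts best
```

Under these assumed costs the single c-for-e confusion makes Euglena by far the closer candidate for Euglcna, which is the outcome the message argues for.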

'Chalicodoma (Steomegachile) chelostomoides' crashes resolver

search for Anura in WoRMS returns non-WoRMS taxon ids

When searching for the name "Anura" at http://resolver.globalnames.org , a result is returned that includes unexpected WoRMS classification ids like gn:I7WB_zNSUlqQP-4KWe8VCA . Expected WoRMS ids would include 448306 (see http://www.marinespecies.org/aphia.php?p=taxdetails&id=448306).

{
    "data_source_id": 9,
    "data_source_title": "WoRMS",
    "gni_uuid": "58305eaf-9617-5566-9bed-45388e44643b",
    "name_string": "Anura",
    "canonical_form": "Anura",
    "classification_path": "Animalia|Chordata|Amphibia|Anura",
    "classification_path_ranks": "kingdom|phylum|class|order",
    "classification_path_ids": "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA",
    "taxon_id": "gn:I7WB_zNSUlqQP-4KWe8VCA",
    "match_type": 1,
    "prescore": "1|0|0",
    "score": 0.75
}
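
A small sketch of how a client might line up the three pipe-delimited classification fields from a result like the one above and flag ids that are not plain numbers. The helper names are hypothetical, and treating native WoRMS ids as purely numeric is an assumption based only on the expected id 448306 mentioned in this report.

```ruby
# Zip the pipe-delimited path, ranks and ids of one result into node hashes.
# The -1 limit keeps empty trailing segments so the three lists stay aligned.
def classification_nodes(result)
  names = result.fetch("classification_path", "").split("|", -1)
  ranks = result.fetch("classification_path_ranks", "").split("|", -1)
  ids   = result.fetch("classification_path_ids", "").split("|", -1)
  names.zip(ranks, ids).map { |name, rank, id| { name: name, rank: rank, id: id } }
end

# Assumption: a native id in this data source consists of digits only.
def non_numeric_ids(result)
  classification_nodes(result).map { |node| node[:id] }.reject { |id| id.to_s.match?(/\A\d+\z/) }
end

result = {
  "classification_path" => "Animalia|Chordata|Amphibia|Anura",
  "classification_path_ranks" => "kingdom|phylum|class|order",
  "classification_path_ids" =>
    "gn:3uL60ncMV0CLLRfxOOBRdQ|gn:0nPgiA8cWdWSAg1iZqIs6Q|gn:w4svaopeWPq3Ub7_uQ1eaw|gn:I7WB_zNSUlqQP-4KWe8VCA"
}
puts classification_nodes(result).last.inspect
puts non_numeric_ids(result).size
```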

Search and return recorded synonyms

Would it be possible to search and return alternative names?
e.g. a search in NCBI via GNR for 'Alopex lagopus' returns nothing even though NCBI records it as a synonym for 'Vulpes lagopus'.

Many thanks,
Dom

import ITIS

As a user I expect every existing name_resolver resource to have the same header signature

Every name_resolver header response has

uuid token
url (with corresponding format)
progress status code (1, 2, 3 for working, success, failure)
status message
submitted options for the search
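
As an illustration of the header fields listed above, a builder might look like the sketch below. The status codes (1, 2, 3 for working, success, failure) come from the list; the method name, the hash keys, the base URL parameter and the URL layout are all assumptions made for the example.

```ruby
# Status codes from the list above: 1 working, 2 success, 3 failure.
STATUS_MESSAGES = { 1 => "working", 2 => "success", 3 => "failure" }.freeze

# Hypothetical builder; key names and the URL layout are assumptions.
def resolver_header(base_url:, token:, format:, status:, options:)
  {
    id: token,
    url: "#{base_url}/name_resolvers/#{token}.#{format}",
    status: status,
    message: STATUS_MESSAGES.fetch(status), # raises KeyError on unknown codes
    parameters: options
  }
end

header = resolver_header(
  base_url: "http://example.org",
  token: "16f235a0-e4a3-529c-9b83-bd15fe722110",
  format: "json",
  status: 2,
  options: { data_source_ids: [4] }
)
puts header.inspect
```

Building every response header through one function is one way to guarantee that all name_resolver resources share the same signature.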

add common names to resolver result

as discussed with @dimus -

In addition to taxon hierarchies, suggest to include available common names for resolved taxa. This would help me immensely in making the search features in http://globalbioticinteractions.org friendlier for humans.

Currently the resolver returns something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
...

suggested result (including common names) something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
common_names: "human @en|Mensch @de|mens @nl",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
taxon_id: "9606",
...
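
If the suggested common_names field were adopted in the form shown above (entries separated by "|", each ending in an "@" language tag), a client could split it as sketched here. The format itself is only a proposal in this issue, and the helper name and the "und" fallback tag for entries without a language are assumptions.

```ruby
# Parse "human @en|Mensch @de|mens @nl" into { "en" => ["human"], ... }.
def parse_common_names(field)
  field.to_s.split("|").each_with_object({}) do |entry, by_lang|
    name, lang = entry.strip.split(/\s+@/, 2)
    next if name.nil? || name.empty?

    (by_lang[lang || "und"] ||= []) << name
  end
end

puts parse_common_names("human @en|Mensch @de|mens @nl").inspect
```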

resolve_once behavior

Curious if resolve_once behavior is correct e.g., this call http://resolver.globalnames.org/name_resolvers.json?names=Plantago+major&resolve_once=true returns more than one match (all from different sources I think, but more than 1 match)

The API docs suggest setting resolve_once=TRUE should just get first match.
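
For reference, the request in question can be reproduced from Ruby by building the same query string with the standard library. The endpoint and the two parameter names are taken from the URL quoted above; the helper name is hypothetical and no assumption is made here about the shape of the response.

```ruby
require "uri"

# Build the GET URL for a single name, mirroring the call quoted in this issue.
def name_resolvers_url(name, resolve_once: false)
  query = URI.encode_www_form(names: name, resolve_once: resolve_once)
  "http://resolver.globalnames.org/name_resolvers.json?#{query}"
end

puts name_resolvers_url("Plantago major", resolve_once: true)
```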

Bug: xml output of name resolver is not valid

exact canonical match of resolver includes NCBI subspecies/strains

when using resolver.globalnames.org with "Homo sapiens", exact canonical matches include both NCBI:9606 (Homo sapiens) and NCBI:741158 (Homo sapiens ssp. Denisova). Is this expected?

New progress statuses

Remove old statuses from the code too

As a user I can use api to see data_sources and their ids

As a user I should be able to see if a name is known by GNI using name reconciling API

GNR: Unexpected high matching scores despite incomplete match at infraspecific rank

Taxa at an infraspecific rank can get high matches in GNR, although the result is just at the species rank or is incorrect:

For example, when supplying 'Cirsium creticum s. triumfetti' this result makes sense:

Cirsium creticum d'Urv. subsp. triumfetti (Lacaita) Werner [ exact canonical match, Score: 0.988 ]
GBIF Backbone Taxonomy

But these high scores are unexpected:

Cirsium creticum (Lam.) d'Urv. [ exact canonical match, Score: 0.988 ]
Catalogue of Life
Cirsium creticum (Lam.) d’Urv. [ exact canonical match, Score: 0.988 ]
GBIF Backbone Taxonomy
Cirsium creticum (Lam.) d’Urv. subsp. creticum [ exact canonical match, Score: 0.988 ]
GBIF Backbone Taxonomy
(note 'creticum' vs 'triumfetti' at infraspecific level).

Alternative output of result name in GNR [Request]

Right now, GNR produces two outputs of matched names: (1) the entire taxon name incl. author and (2) the canonical form. For example, for the taxon 'Stipa pennata var. anomala', you get: (1) 'Stipa pennata var. anomala (P.A.Smirn.) Tzvelev' and (2) 'Stipa pennata anomala'.

I would like to request a third output, that includes taxonomic abbreviations (e.g. 'var', 'subsp') but not author names. Using the example above, this would be (3) 'Stipa pennata var. anomala'.

Some biological databases use the full taxon name (incl. abbreviations) but not the author name. Having this feature would help facilitate data merging among databases with the taxon name as a key variable.

As a user I should not be distracted by empty id field if I don't supply ids to name_resolver API

As a user of API I receive results of the name resolution in corresponding format

I just broke returning results by changing what name resolver instance is returning. New results have to be sent only with following fields added to the result header:

1 context for all data sources
2 all names first in the following format:
[ {id=>1, :name => 'Betula', :results => .....}, {...}....]

results should have the following fields:

data_source_id
gni uuid
name_string
canonical_form
classification path (if exists)
classification path ids (if exist)
ds taxon id
ds guid (if exists)
urls (if exist)
current_name taxon id (if exists)
current name string (if exists)
match type id
prescore in pipe delimited format
score
