poisotlab / ncbitaxonomy.jl

Wrapper around the NCBI taxonomy files

Home Page: https://poisotlab.github.io/NCBITaxonomy.jl/stable/

License: MIT License

Julia 100.00%

ncbi-taxonomy ncbi biodiversity biodiversity-data taxonomy verena ivado

ncbitaxonomy.jl's Introduction

NCBI taxonomy

This package provides an interface to the NCBI Taxonomy. When installed, it will download the latest version of the taxonomy files from the NCBI ftp service. To update the version of the taxonomy you use, you need to build the package again.

Please note that the taxonomy dump is a big download. If the NCBITAXONOMY_PATH environment variable is not set, the files will be stored in the package folder under the homedir() directory, which is a bad idea. We strongly recommend editing your configuration file, or exporting the NCBITAXONOMY_PATH environment variable.
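As a minimal sketch of that recommendation, the variable can be set from Julia before the package is loaded. The helper name `set_taxonomy_path` and the example directory are assumptions made for illustration; only the `NCBITAXONOMY_PATH` name comes from the text above.

```julia
# Hypothetical helper (not part of NCBITaxonomy.jl): make sure the target
# directory exists and expose it through the NCBITAXONOMY_PATH variable.
function set_taxonomy_path(path::AbstractString)
    mkpath(path)                      # create the directory if it is missing
    ENV["NCBITAXONOMY_PATH"] = path   # seen by code that runs after this point
    return path
end

# Example location only; replace it with a persistent directory of your choice.
taxonomy_dir = set_taxonomy_path(joinpath(tempdir(), "ncbitaxonomy-example"))
```

A call like this would typically live in the Julia startup file, so that it runs before the package is built or loaded.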

This package is developed as part of the activities of the Viral Emergence Research Initiative (VERENA) consortium, with financial support from the Institut de Valorisation des Données (IVADO) at Université de Montréal.


ncbitaxonomy.jl's Issues

Issue with lowercase search conflicting with vernaculars

taxon("gorilla") - add a keyword to prefer scientific names

Use Scratch.jl

Is your feature request related to a problem? Please describe.
Currently, the build step is storing data in the package repo - https://github.com/JuliaPackaging/Scratch.jl might be a preferable alternative.

Additional context
This will limit compatibility to 1.5, which is not really an issue.

Standard namefinders

What to do?

Make namefinders for the main divisions.

Why?

This is going to be what users do anyways.

Any ideas how?

Uh, yeah.

rank function for a taxonomic ID

Is your feature request related to a problem? Please describe.
There is no way to know the rank of a taxon, which is problematic when building namefinders.

Describe the solution you'd like

using NCBITaxonomy

rank(ncbi"Vulpes vulpes")

Describe alternatives you've considered
None - the only alternative is to do a lookup in the nodes_table manually, which is what the function would do.

Allow arbitrary distance function

Is your feature request related to a problem? Please describe.
It would be good to allow all distance functions from StringDistances

Describe the solution you'd like
Additional keyword argument to taxid (and namefinder)

Ideally, this can take the form of code you would like to write:

using NCBITaxonomy
using StringDistances
taxid("Box turus"; fuzzy=true, d=StringDistances.JaroWinkler)

Additional context
Work ongoing on main.

Create a namefinder from a list of IDs

Is your feature request related to a problem? Please describe.
Building a custom namefinder would save a lot of time when we know what we are likely to find.

Describe the solution you'd like

using NCBITaxonomy

# ... ids are a list of taxa or IDs

namefinder(ids)("Speciosus sp.")

Additional context
All of the internals are in place, all that is needed is to write a wrapper.

Citing NCBITaxonomy.jl

Hey NCBITaxonomy.jl maintainers 👋 (@tpoisot mostly, I suppose? :))
With a group of people, we're currently reviewing tools for taxonomic name harmonization for ecologists.
We're mostly focused on R packages, as R is the programming language most used by ecologists.

However, we would like to have a section about tools that exist in other languages. As such, we'd like to mention NCBITaxonomy.jl, but I'm not familiar with citation practices for Julia modules.
How should I cite the module? Do you know any other modules that may be relevant for taxonomic name harmonization?

Thanks :)

The unique name title is not populated when writing to the names.arrow file

The _materialize_data seems to be missing a case, as the unique names do not
end up in the names.arrow file. This is going to be a problem rapidly.

Function to disambiguate names

Add a way to get an array of NCBITaxon objects with identical names in response to a multiple-match exception.

pathogen.jl extended to also return species ranks where different than matched names

A lot of virus names are not the same as their matched names! For example:

> classification(get_uid("adeno-associated virus 3b"))
==  1 queries  ===============

Retrieving data for taxon 'adeno-associated virus 3b'

√  Found:  adeno-associated+virus+3b
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`68742`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350
12            Adeno-associated virus 3B      no rank   68742

Technically it's VIRION workflow and not NCBITaxonomy.jl, but, when you have a second - could you also expand pathogen.jl to return species separate from matched name?
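The request above amounts to walking the lineage and picking out the entry whose rank is "species", alongside the matched name. A minimal sketch follows, using plain named tuples copied from the taxize output (abridged to the family level and below). The helper `species_of` is hypothetical and is not the pathogen.jl or NCBITaxonomy.jl implementation.

```julia
# Lineage copied from the taxize output above (abridged): name, rank, NCBI id.
lineage = [
    (name = "Parvoviridae", rank = "family", id = 10780),
    (name = "Parvovirinae", rank = "subfamily", id = 40119),
    (name = "Dependoparvovirus", rank = "genus", id = 10803),
    (name = "Adeno-associated dependoparvovirus A", rank = "species", id = 1511891),
    (name = "Adeno-associated virus - 3", rank = "no rank", id = 46350),
    (name = "Adeno-associated virus 3B", rank = "no rank", id = 68742),
]

# Hypothetical helper: return the first entry ranked "species", or `nothing`
# when the lineage has no species-level entry.
function species_of(lineage)
    i = findfirst(t -> t.rank == "species", lineage)
    return i === nothing ? nothing : lineage[i]
end

matched = last(lineage)        # the taxon that the query name resolved to
species = species_of(lineage)  # the species it belongs to
```

Returning both values lets the caller see when the matched name sits below the species level, as it does here.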

Use a central location for the taxonomy and default to PKG if not set

Need to change the build and load steps.

Add a freshness parameter

What to do?
When the package is loaded, check the date of the taxonomy

Why?
No need to download the taxonomy every few days for minor changes

Any ideas how?
Do something on load and rebuild if needed
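One possible shape for such a check is sketched below, based on the modification time of a local file. The helper name `is_stale`, the 30-day default, and the file used in the demonstration are all assumptions; the package may well track freshness differently.

```julia
using Dates

# Hypothetical helper: report whether a local taxonomy file is older than
# `max_age_days`. A missing file always counts as stale.
function is_stale(path::AbstractString; max_age_days::Integer=30,
                  reference::DateTime=now(UTC))
    isfile(path) || return true
    modified = unix2datetime(mtime(path))   # mtime is in seconds since the epoch
    return (reference - modified) > Day(max_age_days)
end

# Demonstration on a freshly written temporary file.
demo_file = tempname()
write(demo_file, "placeholder")

fresh_now    = is_stale(demo_file)                                 # just written
stale_later  = is_stale(demo_file; reference=now(UTC) + Day(60))   # 60 days on
missing_file = is_stale(demo_file * ".missing")                    # no such file
```

A load-time hook could call a check like this and only trigger a rebuild when it returns `true`.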

Taxonomic tree

Hi!

First, thank you very much for this package! It has been handy for me.

I think that it would be great to have an integration with Phylo.jl to easily get the common taxonomic tree for a set of taxa as in the NCBI site. That could be really useful for visualization and exploration.

Ideally, this can be a single function, similar to lineage and most probably depending on that, that also accepts stop_at, but that takes a list/set of NCBITaxons. For example:

using NCBITaxonomy

common_tree([ncbi"Bos taurus", ncbi"Mus musculus", ncbi"Homo sapiens"], stop_at=ncbi"Mammalia")

Best regards,

In-part names only return the first element

Describe the bug

julia> ncbi"Reptilia"
Testudines (ncbi:8459)

Expected behavior

An array of names? A warning? Both?

> 50% of queries are returning NA; and, separately, exact matches aren't returning where appropriate

Hi! This is gonna be a long one.

Here are three viruses that all have exact matches in the NCBI taxonomy:
Adeno-associated virus - 3
Adeno-associated virus 3B
Adenovirus predict_adv-20

They're an interesting case study for what's going horribly wrong here. In theory, they should all be retrieved as exact matches. Two are, in fact, the same "species". For example, the same NCBI API call through taxize:

> classification(get_uid("Adeno-associated virus - 3"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adeno-associated virus - 3'

√  Found:  Adeno-associated+virus+-+3
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`46350`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adeno-associated virus 3B"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adeno-associated virus 3B'

√  Found:  Adeno-associated+virus+3B
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`68742`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350
12            Adeno-associated virus 3B      no rank   68742

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

> classification(get_uid("Adenovirus predict_adv-20"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adenovirus predict_adv-20'

√  Found:  Adenovirus+predict_adv-20
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`2710954`
                       name         rank      id
1                   Viruses superkingdom   10239
2              Varidnaviria        clade 2732004
3              Bamfordvirae      kingdom 2732005
4         Preplasmiviricota       phylum 2732008
5          Tectiliviricetes        class 2732529
6               Rowavirales        order 2732559
7              Adenoviridae       family   10508
8 unclassified Adenoviridae      no rank  189831
9 Adenovirus PREDICT_AdV-20      species 2710954

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

Everything I'm going to describe is being run through an R script called jncbi() which is included below for convenience:

# Round-trip a vector of names through the Julia cleaning scripts using
# temporary CSV files. Requires the readr and stringr packages.
jncbi <- function(spnames, type = 'host') {
  raw <- data.frame(Name = spnames)
  readr::write_csv(raw, '~/Github/virion/Code_Dev/TaxonomyTempIn.csv', eol = "\n")
  
  # Run the Julia script that matches the requested type
  if(type == 'host') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/host.jl")}
  if(type == 'virus') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/virus.jl")}
  if(type == 'pathogen') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/pathogen.jl")}
  
  # Read the cleaned names back in, then delete both temporary files
  clean <- readr::read_csv("~/Github/virion/Code_Dev/TaxonomyTempOut.csv")
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempIn.csv')
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempOut.csv')
  
  # Normalize capitalization of the input and matched names
  clean$Name <- stringr::str_to_sentence(clean$Name)
  clean$match <- stringr::str_to_sentence(clean$match)
  return(clean)
}

Doesn't really change anything about the attributes. Just outsources a file to clean and brings it back in.

Here are some contrasting results of virus.jl on different kinds of input.

A BIG LIST

When I pass 8,632 viruses through jncbi, 4,968 come back NA (no match) and 273 come back fuzzy matches (3,419 exact matches). (A file to reproduce this is attached. I'm only including these stats because I think they're probably relevant to our understanding of how big this bug is.) The results are concerning:

Name                       matched match                      taxid
adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
adeno-associated virus 3B  NA      NA                         NA
adenovirus PREDICT_AdV-20  NA      NA                         NA

JUST THOSE THREE VALUES

> jncbi(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20"), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:01

-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  matched = col_logical(),
  match = col_character(),
  taxid = col_double()
)

# A tibble: 3 x 4
  Name                       matched match                      taxid
  <chr>                      <lgl>   <chr>                      <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
2 Adeno-associated virus 3b  NA      NA                            NA
3 Adenovirus predict_adv-20  NA      NA                            NA

2B. THOSE THREE VALUES (LOWERCASE)

> jncbi(str_to_lower(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20")), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:02

-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  matched = col_logical(),
  match = col_character(),
  taxid = col_double()
)

# A tibble: 3 x 4
  Name                       matched match                        taxid
  <chr>                      <lgl>   <chr>                        <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3   46350
2 Adeno-associated virus 3b  FALSE   Adeno-associated virus 3a  1406223
3 Adenovirus predict_adv-20  FALSE   Adenovirus predict_adv-20  2710954

==========================
I haven't included it, but if you str_to_lower the virus names for the entire list before they're passed, it significantly reduces the number of no-matches and extends the runtime from 5 mins to about 30 mins, confirming this is, in fact, part (but not all) of the issue.

So there are two separate problems that need to be debugged.

1. Capitalization appears to be making everything wonky. I don't want to do R-end solves to this, given that there are capitalization changes in pathogen.jl - I think you can probably solve this by revisiting that script. (When you do, please do not turn it back into a generic script for both hosts and viruses.)
2. The last two viruses should have exact (match=TRUE) matches in the NCBI taxonomy. Both instead get sent to fuzzy matching. The first fuzzy match is wrong, while the second fuzzy match is actually the correct exact match, and the strings returned are identical (no differences in spacing, as far as I can tell).
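To make the capitalization point concrete, here is a minimal sketch of an exact lookup that ignores case, tried before any fuzzy matching. It uses a plain dictionary built from the three names and ids quoted earlier in this issue; `exact_match` is a hypothetical helper, and this is not how NCBITaxonomy.jl or pathogen.jl perform matching.

```julia
# Names and ids taken from the taxize output quoted earlier in this issue.
reference_names = Dict(
    "Adeno-associated virus - 3" => 46350,
    "Adeno-associated virus 3B" => 68742,
    "Adenovirus PREDICT_AdV-20" => 2710954,
)

# Index the reference names by their lowercased form.
# Assumption: no two reference names collide once lowercased.
lowercase_index = Dict(lowercase(name) => (name, id) for (name, id) in reference_names)

# Hypothetical lookup: exact match ignoring case and surrounding whitespace.
# Returns the (name, id) pair, or `nothing` when there is no exact match.
exact_match(query) = get(lowercase_index, lowercase(strip(query)), nothing)

hit = exact_match("adeno-associated virus 3b")
```

With a step like this ahead of the fuzzy search, a query that differs from the reference only in capitalization would resolve to its own record rather than to a near neighbour.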

RegEx search

In some cases, it might be a good idea to allow regex search - this is pretty easy to do
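A minimal sketch of what such a search could look like over a plain list of names, using only Base Julia. The function name `regex_search` and the name list are assumptions for illustration; the package would search its own names table instead.

```julia
# Example names borrowed from another issue in this tracker.
candidate_names = [
    "Adeno-associated virus - 3",
    "Adeno-associated virus 3B",
    "Adenovirus PREDICT_AdV-20",
]

# Hypothetical helper: keep every name in which the pattern occurs.
regex_search(pattern::Regex, names) = filter(name -> occursin(pattern, name), names)

# The `i` flag makes the pattern case-insensitive.
hits = regex_search(r"^adeno-associated virus"i, candidate_names)
```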

Documentation update for the next release

Check docstrings
Examples

[out there] processing taxon identifiers via NLP

A "moonshot" idea I had for this library would be implementing rudimentary natural-language-processing (NLP) methods for processing taxon identifiers.

As an example, if the input contains ["A. p. aciculatus", "ponderosa pine", "Agelaius phoeniceus", "A. phoeniceus californicus", "red winged blackbird", "Agelaius xanthomus", "Pinus ponderosa", "P. ponderosa"] we would want a cleaning function to return ids in NCBI associated with the coarsest resolution id, e.g. ["Agelaius", "Pinus ponderosa"]

Clearly a false positive here could be analysis-breaking, so reporting some degree of confidence in each resolved species label would also be necessary.

Just something to ruminate on

Return the nearest names

Is your feature request related to a problem? Please describe.
It would be cool to get the closest names rather than a single match.

Describe the solution you'd like

using NCBITaxonomy

similar_names("Box taurus"; ...)
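One way the idea could work is to rank candidate names by edit distance and return the closest few. The sketch below uses a hand-written Levenshtein distance to stay self-contained (a library such as StringDistances could be swapped in). The function `nearest_names`, its `n` keyword, and the candidate list are assumptions for illustration, not the package's API.

```julia
# Plain Levenshtein edit distance (insertions, deletions, substitutions).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    previous = collect(0:length(t))          # distances from the empty prefix
    for i in eachindex(s)
        current = similar(previous)
        current[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            current[j + 1] = min(previous[j + 1] + 1,   # deletion
                                 current[j] + 1,        # insertion
                                 previous[j] + cost)    # substitution
        end
        previous = current
    end
    return previous[end]
end

# Hypothetical helper: the `n` names closest to `query`, ignoring case.
function nearest_names(query, names; n=3)
    ranked = sort(names; by = name -> levenshtein(lowercase(query), lowercase(name)))
    return ranked[1:min(n, length(ranked))]
end

known = ["Mus musculus", "Bos taurus", "Homo sapiens", "Vulpes vulpes"]
closest = nearest_names("Box taurus", known; n=2)
```

Returning the distances alongside the names would let the caller decide how much to trust each suggestion.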

Use Preferences.jl to store a single taxo db

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

New divisions

Phages are PHG, env are ENV
