Global Names Verifier

Try GNverifier online.

GNverifier with OpenRefine

GNverifier API

Feedback

Takes a scientific name or a list of scientific names and verifies them against a variety of biodiversity Data Sources. Includes an advanced search feature.

Citing

If you want to cite GNverifier, use the DOI generated by Zenodo.

Features

  • Small and fast app to verify scientific names against many biodiversity databases. The app is a client to a verifier API.
  • It provides 6 different match levels:
    • Exact: complete match with a canonical form or a full name-string from a data source.
    • Fuzzy: if exact match did not happen, it tries to match name-strings assuming spelling errors.
    • Partial: strips middle or last epithets from bi- or multi-nomial names and tries to match what is left.
    • PartialFuzzy: the same as Partial but assuming spelling mistakes.
    • Virus: verification of virus names.
    • FacetedSearch: marks advanced-search queries.
  • Taxonomic resolution. If a database contains taxonomic information, it returns the currently accepted name for the provided name-string.
  • Best match is returned according to the match score. Data sources with some manual curation have priority over auto-curated and uncurated datasets. For example Catalogue of Life or WoRMS are considered curated, GBIF auto-curated, uBio not curated.
  • Fine-tuning the match score by matching authors, years, ranks etc.
  • It is possible to map any checklist of name-strings to any of the registered Data Sources.
  • If a Data Source provides a classification for a name, it will be returned to the output.
  • The app works for checking just one name-string, or multiple ones written in a file.
  • Advanced search uses a simple but powerful query language to find abbreviated names, search by author, year etc.
  • Supports feeding data via operating system pipes. This feature allows chaining the program together with other tools.
  • GNverifier includes a web-based graphical user interface identical to its "official" web-service.
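The Partial match level described above can be pictured with a small sketch. This is an illustration only: the function name `partial_candidates` is hypothetical, and the order in which GNverifier actually tries shortened names is an assumption, not something this document states.

```python
def partial_candidates(name: str) -> list[str]:
    """Return shortened variants of a canonical name for partial matching.

    Illustrative only: the real stripping rules and their order are
    assumptions, not GNverifier's documented behavior.
    """
    words = name.split()
    if len(words) < 2:
        return []  # a uninomial has no epithets to strip
    genus, epithets = words[0], words[1:]
    candidates = []
    # Variant 1: strip the last epithet.
    candidates.append(" ".join([genus] + epithets[:-1]))
    # Variant 2: strip each middle epithet in turn (multinomials only).
    for i in range(len(epithets) - 1):
        kept = epithets[:i] + epithets[i + 1:]
        candidates.append(" ".join([genus] + kept))
    return candidates
```

For a trinomial such as "Aus bus cus" the sketch yields "Aus bus" and "Aus cus"; for a binomial it yields only the genus.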

Installation

Using Homebrew on Mac OS X, Linux, and Linux on Windows (WSL2)

Homebrew is a popular package manager for Open Source software originally developed for Mac OS X. Now it is also available on Linux, and can easily be used on Windows 10 or 11, if Windows Subsystem for Linux (WSL) is installed.

To use GNverifier with Homebrew:

  1. Install Homebrew

  2. Open terminal and run the following commands:

brew tap gnames/gn
brew install gnverifier

MS Windows

Download the latest release from GitHub and unzip it.

One possible way would be to create a default folder for executables and place GNverifier there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

mkdir C:\Users\your_username\bin
copy path_to\gnverifier.exe C:\Users\your_username\bin

Add C:\Users\your_username\bin directory to your PATH user and/or system environment variable.

Another, simpler way is to run cd C:\Users\your_username\bin in the cmd terminal window. Windows will then find the GNverifier program automatically whenever you run its commands from that directory.

You can also read a more detailed guide for Windows users in a PDF document.

Linux and Mac (without Homebrew)

If Homebrew is not installed, download the latest release from GitHub, untar it, and install the binary somewhere in your PATH.

tar xvf gnverifier-linux-0.1.0.tar.xz
# or tar xvf gnverifier-mac-0.1.0.tar.gz
sudo mv gnverifier /usr/local/bin

Compile from source

Install Go according to installation instructions

go get github.com/gnames/gnverifier/gnverifier

Usage

GNverifier takes one name-string or a text file with one name-string per line as an argument, sends a query with these data to a remote GNames server to match the name-strings against many biodiversity databases and returns results to STDOUT either in JSON, CSV or TSV format.

The app can also take a query string like g:M. sp:galloprovincialis au:Olivier to perform an advanced search if the full scientific name is undetermined.

As a web service

gnverifier -p 8080

After running this command, you should be able to access web-based user interface via a browser at http://localhost:8080

As a RESTful API

Refer to the RESTful API docs to learn how to use the same functionality via scripts.

One name-string

gnverifier "Monohamus galloprovincialis"

Many name-strings in a file

gnverifier /path/to/names.txt

The app assumes that a file contains a simple list of names, one per line.

It is also possible to feed data via STDIN:

cat /path/to/names.txt | gnverifier

Advanced search

Advanced search uses a simple but powerful query language to find names by abbreviated genus, a year or a range of years. See the detailed description in the Advanced Search Query Language section.

gnverifier "g:B. sp:bubo au:Linn. y:1700-"

Options and flags

According to the POSIX standard, flags and options can be given either before or after the name-string or file name.

help

gnverifier -h
# or
gnverifier --help
# or
gnverifier

version

gnverifier -V
# or
gnverifier --version

port

Starts GNverifier as a web service using entered port

gnverifier -p 8080

This command will run user-interface accessible by a browser at http://localhost:8080

all_matches

To see all matches instead of the best one use --all_matches flag.

WARNING: for some names the result will be excessively large.

gnverifier -s '1,12' -M file.txt

This flag is ignored by advanced search.

capitalize

If uninomials or genera in your names are not capitalized according to the rules of nomenclature, you can still verify them using this option. If the capitalize flag is set, the first character of every name-string will be capitalized (when appropriate). This flag is ignored by advanced search.

gnverifier -c "bubo bubo"
# or
gnverifier --capitalize "bubo bubo"

species group

If the species_group flag is on, a search for Aus bus also searches for Aus bus bus and vice versa. The flag expands the search to the species group of a name where applicable, which brings botanical autonyms and coordinated names in zoology into the search.

gnverifier -g "Bubo bubo"
gnverifier  --species_group "Bubo bubo"
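The expansion described above can be sketched as follows. The function `species_group` is hypothetical and covers only the Aus bus / Aus bus bus case named in the text; how the server handles ranks, authors or other edge cases is not covered here.

```python
def species_group(name: str) -> list[str]:
    """Return the name plus its species-group counterpart, if any.

    Sketch of the rule stated in the text: a binomial also searches the
    trinomial with a repeated epithet, and vice versa.
    """
    words = name.split()
    normalized = " ".join(words)
    if len(words) == 2:
        genus, epithet = words
        return [normalized, f"{genus} {epithet} {epithet}"]
    if len(words) == 3 and words[1] == words[2]:
        return [normalized, " ".join(words[:2])]
    return [normalized]  # nothing to expand
```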

fuzzy-match of uninomial names

When fuzzy_uninomial flag is on, uninomials are allowed to go through fuzzy matching, if needed. Normally this flag is off because fuzzy-matched uninomials create a significant amount of false positives.

gnverifier -z "Pomatmus"
gnverifier --fuzzy_uninomial "Pomatmus"
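To see why fuzzy-matched uninomials are prone to false positives, it helps to look at how small the difference between two spellings can be. The sketch below computes a plain Levenshtein edit distance; it is a generic textbook technique, not necessarily the algorithm GNverifier uses, and the assumption that "Pomatmus" is a misspelling of "Pomatomus" is ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]
```

A single-word name that is one edit away from the input may belong to an unrelated genus, which is a plausible reason to keep this flag off by default.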

format

Sets the output format. Supported formats are:

  • compact: one-liner JSON.
  • pretty: prettified JSON with new lines and tabs for easier reading.
  • tsv: returns tab-separated values representation.
  • csv: (DEFAULT) returns comma-separated values representation.

# short form for compact JSON format
gnverifier -f compact file.txt
# or long form for "pretty" JSON format
gnverifier --format="pretty" file.csv
# tsv format
gnverifier -f tsv file.csv

Note that a separate JSON "document" is returned for each separate record, instead of returning one big JSON document for all records. For large lists it significantly speeds up parsing of the JSON on the user side.
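Because every record arrives as its own JSON document, the compact format can be consumed line by line. The sketch below assumes the field names visible in JSON output elsewhere on this page (input, matchType, bestResult, currentName); the helper names `read_results` and `summarize` are hypothetical.

```python
import json


def read_results(lines):
    """Yield one dict per non-empty line of compact JSON output."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record):
    """Return (input, match type, currently accepted name) for a record."""
    best = record.get("bestResult") or {}  # may be absent for a record
    return (
        record.get("input"),
        record.get("matchType"),
        best.get("currentName"),
    )
```

The lines can come from any iterable, for example sys.stdin when the output of gnverifier -f compact is piped into a script.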

jobs

If the list of names is very large, it is possible to tell GNverifier to run requests in parallel. In this example GNverifier will run 8 processes simultaneously. The order of returned names will be somewhat randomized.

gnverifier -j 8 file.txt
# or
gnverifier --jobs=8 file.tsv

Sometimes it is important to return names in exactly the same order as the input. For such cases set the jobs flag to 1.

gnverifier -j 1 file.txt

This option is ignored by advanced search.
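An alternative to running with a single job is to restore the input order after the fact. The sketch below is one way to do that, assuming each JSON record echoes the submitted name-string unchanged in its input field; the function `restore_order` is hypothetical.

```python
def restore_order(names, records):
    """Arrange verification records in the order of the submitted names.

    Names without a record map to None. If a name was submitted more
    than once, the first record seen for it is reused.
    """
    by_input = {}
    for record in records:
        by_input.setdefault(record["input"], record)
    return [by_input.get(name) for name in names]
```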

quiet

Removes log messages from the output. Note that results of verification go to STDOUT, while log messages go to STDERR. So instead of using -q flag STDERR can be redirected to /dev/null:

gnverifier "Puma concolor" -q >verif-results.csv

# or

gnverifier "Puma concolor" 2>/dev/null >verif-results.csv

sources

By default GNverifier returns only one "best" result of a match. If a user has a particular interest in a data set, s/he can set it with this option, and all matches that exist for this source will be returned as well. You need to provide a data source id for a dataset. Ids can be found at the following URL. Some of them are provided in the GNverifier help output as well.

Data from such sources will be returned in preferred_results section of JSON output, or with CSV/TSV rows that start with "PreferredMatch" string.

gnverifier file.csv -s "1,11,172"
# or
gnverifier file.tsv --sources="12"
# or
cat file.txt | gnverifier -s '1,12'

If all matched sources need to be returned, set the flag to "0".

WARNING: the result might be excessively large.

gnverifier "Bubo bubo" -s 0
# potentially even more results get returned by adding --all_matches flag
gnverifier "Bubo bubo" -s 0 -M

The sources option would overwrite ds: settings in case of advanced search.
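When GNverifier is called from a script, the flags described in this section can be assembled programmatically. The sketch below only uses short flags that appear in this document (-s, -f, -j, -M, -q); the helper `build_args` and its parameter names are hypothetical.

```python
def build_args(target, sources=None, fmt=None, jobs=None,
               all_matches=False, quiet=False):
    """Build an argument list for running gnverifier as a subprocess.

    target is a name-string or a path to a file with names.
    """
    args = ["gnverifier"]
    if sources is not None:
        # Data source ids are passed as one comma-separated value.
        args += ["-s", ",".join(str(s) for s in sources)]
    if fmt is not None:
        args += ["-f", fmt]
    if jobs is not None:
        args += ["-j", str(jobs)]
    if all_matches:
        args.append("-M")
    if quiet:
        args.append("-q")
    args.append(target)
    return args
```

The resulting list can be handed to subprocess.run, which avoids shell quoting problems with name-strings that contain spaces.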

web-logs

Requires --port. Enables output of logs for web-services.

gnverifier -p 8777 --web-logs

nsqd-tcp

Requires --port. Allows redirecting web-service log output to NSQ messaging server's TCP-based endpoint. It is handy for aggregations of logs from GNverifier web-services running inside of Docker containers or in Kubernetes pods.

gnverifier -p 8777 --nsqd-tcp=localhost:4150
# with logs printed out
gnverifier -p 8777 --nsqd-tcp=localhost:4150 --web-logs

Configuration file

If you find yourself using the same flags over and over again, it makes sense to edit the configuration file instead. It is located at $HOME/.config/gnverifier.yaml. After that you do not need to use command line options and flags. The configuration file is self-documented; the default gnverifier.yaml is located on GitHub.

gnverifier file.txt

If GNverifier runs as a web-based user interface, it is also possible to use environment variables for configuration.

Env. Var.                 Configuration
GNV_FORMAT                Format
GNV_DATA_SOURCES          DataSources
GNV_WITH_ALL_MATCHES      WithAllMatches
GNV_WITH_CAPITALIZATION   WithCapitalization
GNV_VERIFIER_URL          VerifierURL
GNV_JOBS                  Jobs
GNV_WEB_LOGS_NSQD_TCP     WebLogsNsqdTCP
GNV_WITH_WEB_LOGS         WithWebLogs
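The mapping above can be turned into an environment for a launcher script. In the sketch below the variable names come from the table, while the value encodings (lower-case true/false for booleans, comma-separated lists) are assumptions that should be checked against the default gnverifier.yaml. The helper `to_env` is hypothetical.

```python
ENV_NAMES = {
    "Format": "GNV_FORMAT",
    "DataSources": "GNV_DATA_SOURCES",
    "WithAllMatches": "GNV_WITH_ALL_MATCHES",
    "WithCapitalization": "GNV_WITH_CAPITALIZATION",
    "VerifierURL": "GNV_VERIFIER_URL",
    "Jobs": "GNV_JOBS",
    "WebLogsNsqdTCP": "GNV_WEB_LOGS_NSQD_TCP",
    "WithWebLogs": "GNV_WITH_WEB_LOGS",
}


def to_env(config):
    """Translate configuration keys into environment variable pairs."""
    env = {}
    for key, value in config.items():
        name = ENV_NAMES[key]  # raises KeyError for unknown settings
        if isinstance(value, bool):
            text = "true" if value else "false"  # assumed encoding
        elif isinstance(value, (list, tuple)):
            text = ",".join(str(v) for v in value)  # assumed encoding
        else:
            text = str(value)
        env[name] = text
    return env
```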

Advanced Search Query Language

Example: g:M. sp:gallop. au:Oliv. y:1750-1799 or n:M. gallop. Oliv. 1750-1799

The query language allows searching for scientific names using name components like genus name, specific epithet, infraspecific epithet, author, year. It includes the following operators:

g: : Genus name, can be abbreviated (for example g:Bubo, g:B.).

sp: : specific epithet, can be abbreviated (for example sp:galloprovincialis, sp:gallop.).

isp: : Infraspecific epithet, can be abbreviated (for example isp:auspicalis, isp:ausp.).

asp: : Either specific, or infraspecific epithet (for example asp:bubo).

au: : One of the authors of a name, can be abbreviated (for example au:Linn., au:Linnaeus).

y: : Year. Can be one year, or a year range (for example y:1888, y:1800-1802, y:1756-, y:-1880)

ds: : Limit result to one or more data-sources. Note that command line sources option, if given, will overwrite this setting (ds:1,2,172).

tx: : Parent taxon. Limit results to names that contain a particular higher taxon in their classification. If ds: is given, uses the classification of the first data-source in the setting. If ds: is not given, uses managerial classification of the Catalogue of Life (tx:Hemiptera, tx:Animalia, tx:Magnoliopsida).

all: : If true, GNverifier will show all results, not only the best ones. The setting can be true or false (all:t, all:f). This setting will also become true if sources command line option is set to 0.

n: : A "name" setting. It combines several query components for convenience. Note that it is not a 'real' scientific name, but a shortcut to enter several settings at once, loosely following the rules of nomenclature (n:B. bubo Linn. 1758). For example, in contrast with GNparser results, it is possible to have abbreviated specific epithets or a range of years: n:Mono. gall. Oliv. 1750-1800.
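The four forms of the y: value listed above (single year, closed range, open start, open end) can be made concrete with a small parser. This is a sketch of how a client might validate the value before sending a query; the function `parse_year` is hypothetical and is not part of GNverifier.

```python
def parse_year(value):
    """Parse a y: value into a (start, end) tuple.

    None stands for an open bound: "1756-" gives (1756, None) and
    "-1880" gives (None, 1880). A single year gives equal bounds.
    """
    if "-" not in value:
        year = int(value)
        return (year, year)
    start, end = value.split("-", 1)
    if not start and not end:
        raise ValueError("a year range needs at least one bound")
    return (
        int(start) if start else None,
        int(end) if end else None,
    )
```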

Species epithets often have the wrong grammatical gender. Because of that, the search tries to detect names whose epithet corresponds to the given one in any gender.

The search requires at least one of the sp:, isp: or asp: settings, or their analogs in the n: setting.

Examples of searches

gnverifier "n:Pom. saltator tx:Animalia y:1750-"

gnverifier "g:Plantago asp:major au:Linn."

gnverifier "g:Cara. isp:daurica ds:1,12"
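Queries like the ones above can also be assembled from separate components. The sketch below builds the query string and enforces the rule that an epithet setting is required. The function `build_query` and its parameter names are hypothetical, the n:, all: operators are left out for brevity, and whether operator order matters to the server is not stated in this document.

```python
def build_query(genus=None, species=None, infraspecies=None,
                any_epithet=None, author=None, year=None,
                sources=None, taxon=None):
    """Compose an advanced-search query from name components."""
    if not (species or infraspecies or any_epithet):
        raise ValueError("one of sp:, isp: or asp: is required")
    parts = [
        ("g", genus),
        ("sp", species),
        ("isp", infraspecies),
        ("asp", any_epithet),
        ("au", author),
        ("y", year),
        ("ds", ",".join(str(s) for s in sources) if sources else None),
        ("tx", taxon),
    ]
    # Skip components that were not provided.
    return " ".join(f"{op}:{value}" for op, value in parts if value)
```

The returned string is passed to gnverifier as a single quoted argument.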

Copyright

Authors: Dmitry Mozzherin

Copyright © 2020-2023 Dmitry Mozzherin. See LICENSE for further details.

gnverifier's Issues

Help with error

@dimus I tried following the instructions in use-gnverify-windows.pdf and when I got to this point:

You will see the PowerShell terminal window. Type the following commands:
cd ~\bin
gnverify "Plantago major"
After a short delay, you should see a result printed on the screen as a comma-separated value output.

I received this message:

image

Any insight into what I am doing wrong?

Citation for Global Names Verifier

Hello @dimus 👋
We've already interacted through the Global Names Gitter :)
I couldn't find a proper way to credit Global Names Verifier in an academic paper.
Is there a specific citation to use? If there is, then it could be nice to include it in the tool as well as on the corresponding website.

Many thanks again for building such a powerful tool!

Wrong OTL ids returned

Hi,

I am actually comparing the Open Tree of Life IDs retrieved via GNVerify and via rotl (official Open Tree of Life API).
They give almost identical results (which is good!) but sadly in some cases, they differ.

Here is an example:

echo "Petroselinum crispum" | gnverify -s 179 -f pretty

giving

"recordId": "959097"

where when doing it via rotl (in R):

library(rotl)

name <- "Petroselinum crispum"

tnrs_match_names(
  names = name,
  do_approximate_matching = FALSE,
  include_suppressed = FALSE
)

         search_string          unique_name approximate_match ott_id is_synonym flags
1 petroselinum crispum Petroselinum crispum             FALSE   2485      FALSE
  number_matches
1              1

Which indeed verifies:

https://tree.opentreeoflife.org/taxonomy/browse?name=2485

vs

https://tree.opentreeoflife.org/taxonomy/browse?name=959097

Thank you again for your wonderful work, hope those issues help!

[data sources]

Hi, me again

Don't know if I am posting this at the right place but had recently such kind of issues:

echo "Fusarium verticillioides" | gnverify -s 11,179,5 -f pretty

As it is a Fungi sp. I am used to trusting more Index Fungorum (5) but its entries seem a bit outdated (2012-02-09) when compared to others updated in 2020.
Other DBs return the correct current name where Index Fungorum returns an old one.
This leads to:
http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=417365, which is an old entry instead of http://www.speciesfungorum.org/Names/SynSpecies.asp?RecordID=314213

You told me you would be releasing a lot of new things mid January, so don't know if relevant...in case just ignore! :)

Again, thank you for your impressive work and the responsiveness you put in it! 👍🏼

Given source is not included in the results

If I run the following with v0.3.3, I expect to see results from World Register of Marine Species, but the output only contains Catalogue of Life. Did I miss something in the flags? Thanks.

gnverifier -s '9' -f pretty "Monohamus galloprovincialis"

Output:

{
  "inputId": "addc8d6e-f47f-5291-b970-ac6c96eab940",
  "input": "Monohamus galloprovincialis",
  "matchType": "Exact",
  "bestResult": {
    "dataSourceId": 1,
    "dataSourceTitleShort": "Catalogue of Life",
    "curation": "Curated",
    "recordId": "447M5",
    "entryDate": "2021-06-21",
    "matchedName": "Monohamus galloprovincialis Rüschkamp, 1928",
    "matchedCardinality": 2,
    "matchedCanonicalSimple": "Monohamus galloprovincialis",
    "matchedCanonicalFull": "Monohamus galloprovincialis",
    "currentRecordId": "445TC",
    "currentName": "Monochamus galloprovincialis (Olivier, 1795)",
    "currentCardinality": 2,
    "currentCanonicalSimple": "Monochamus galloprovincialis",
    "currentCanonicalFull": "Monochamus galloprovincialis",
    "isSynonym": true,
    "classificationPath": "Biota|Animalia|Arthropoda|Insecta|Coleoptera|Chrysomeloidea|Cerambycidae|Monochamus|Monochamus galloprovincialis",
    "classificationRanks": "unranked|kingdom|phylum|class|order|superfamily|family|genus|species",
    "classificationIds": "5T6MX|N|RT|H6|C2L|CHR|7VY|63BHG|445TC",
    "editDistance": 0,
    "stemEditDistance": 0,
    "matchType": "Exact"
  },
  "dataSourcesNum": 2,
  "curation": "Curated"
}
