reptiles

Compares the reptile species available in NCBI's Taxonomy Browser with those in the Reptile Database (RDB). Initial results are published in Nature Communications, and final results are published in Zootaxa.

Purpose

The Reptile Database (RDB) primarily classifies species based on morphology, whereas NCBI primarily classifies species based on genetic sequencing. We sought to examine the similarities and differences between these two databases. Where the two disagree, we classify the discrepancies to help resolve them and to prevent similar ones in the future.

Usage

To run the script, you need Python 2.7, the re module (part of the standard library), and the xlsxwriter module (a third-party package that is usually not installed by default).

Follow these steps:

  1. Download the RDB data reptile_database_names.txt from this repository.

  2. Download NCBI's taxonomy data. This can be done via their FTP server. Download taxdmp.zip, unzip it, and keep names.dmp and nodes.dmp.

  3. Run NCBI_reptiles.py in the same folder as the above files.
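Step 2 can also be done from Python. The following is a minimal sketch, not part of NCBI_reptiles.py: the function name `extract_taxonomy_files` is hypothetical, and it assumes taxdmp.zip has already been downloaded and that names.dmp and nodes.dmp sit at the top level of the archive. It is written for Python 3, whereas the script itself targets Python 2.7.

```python
import zipfile

# The only two archive members the comparison script needs.
WANTED = ("names.dmp", "nodes.dmp")


def extract_taxonomy_files(zip_path, dest_dir="."):
    """Extract names.dmp and nodes.dmp from zip_path into dest_dir.

    Returns the list of extracted member names. Raises ValueError if
    either file is missing from the archive.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        members = set(archive.namelist())
        for name in WANTED:
            if name not in members:
                raise ValueError("%s not found in %s" % (name, zip_path))
            archive.extract(name, dest_dir)
            extracted.append(name)
    return extracted
```

Calling `extract_taxonomy_files("taxdmp.zip")` in the folder that holds NCBI_reptiles.py would leave the two .dmp files next to the script and ignore the rest of the archive.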

Input details

There are 3 input files. One is from RDB and the other two are from NCBI.

  • reptile_database_names.txt is a tab-delimited file consisting of two columns. The left column contains reptile synonyms, and the right column contains the associated current reptile names. The header row should be any_name\tcurrent_name\n. The file from July 2018 is included in this repository.
    • "current reptile name" in this context refers to names that RDB considers to be the primary, scientific name.
    • "reptile synonym" in this context refers to names that RDB considers to be a secondary name. Each synonym is associated with a current reptile name.
  • names.dmp maps NCBI taxonomy IDs to their scientific names
  • nodes.dmp maps NCBI taxonomy IDs to their parent taxonomy IDs and their rank (e.g. species, phylum)
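The three inputs can be read with a few small helpers. This is a sketch under stated assumptions rather than the code in NCBI_reptiles.py: the function names are hypothetical, it is written for Python 3, and it relies on the NCBI taxdump layout in which fields are separated by tab, pipe, tab and each row ends with tab, pipe. The first four fields of names.dmp are taken to be tax ID, name, unique name, and name class; the first three fields of nodes.dmp are taken to be tax ID, parent tax ID, and rank.

```python
def parse_dmp_line(line):
    """Split one row of an NCBI .dmp file into its fields."""
    line = line.rstrip("\n")
    # Every row ends with a tab-pipe terminator; drop it before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def read_scientific_names(handle):
    """Map tax ID -> scientific name, skipping other name classes."""
    names = {}
    for line in handle:
        tax_id, name, _unique_name, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[tax_id] = name
    return names


def read_nodes(handle):
    """Map tax ID -> (parent tax ID, rank)."""
    nodes = {}
    for line in handle:
        tax_id, parent_id, rank = parse_dmp_line(line)[:3]
        nodes[tax_id] = (parent_id, rank)
    return nodes


def read_rdb(handle):
    """Map each name in the any_name column to its current_name."""
    mapping = {}
    next(handle, None)  # skip the any_name/current_name header row
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        any_name, current_name = line.split("\t")[:2]
        mapping[any_name] = current_name
    return mapping
```

Each helper takes an open text file, for example `with open("nodes.dmp") as handle: nodes = read_nodes(handle)`. Tax IDs are kept as strings so that they can be used as dictionary keys across both NCBI files without conversion.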

Output details

Console output

The console will output counts of:

  • reptiles in RDB but not NCBI
  • reptiles in both RDB and NCBI
  • reptiles in NCBI but not RDB

Within species that are in NCBI but not RDB, it additionally prints counts of:

  • species labeled "aff.", "cf.", or "sp."
  • hybrid species
  • species that contain numerical digits
  • species that are classified as synonyms in RDB
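The three top-level counts amount to two set differences and one intersection over species names. A minimal sketch, assuming both name lists have already been loaded; the function names are hypothetical, and the group labels are borrowed from the worksheet names described under File output:

```python
def compare_names(rdb_names, ncbi_names):
    """Partition species names into RDB-only, shared, and NCBI-only groups."""
    rdb, ncbi = set(rdb_names), set(ncbi_names)
    return {
        "RDB_only": sorted(rdb - ncbi),   # in RDB but not NCBI
        "common": sorted(rdb & ncbi),     # in both databases
        "NCBI_only": sorted(ncbi - rdb),  # in NCBI but not RDB
    }


def print_counts(groups):
    """Print one count per group, in a fixed order."""
    for label in ("RDB_only", "common", "NCBI_only"):
        print("%s: %d" % (label, len(groups[label])))
```

For example, `print_counts(compare_names(["Genus alpha", "Genus beta"], ["Genus beta", "Genus gamma"]))` reports one name in each group (the names here are placeholders, not real taxa).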

File output

Two new files are created.

  • NCBI_reptile_list.txt: a list of NCBI reptile species names and their taxonomy IDs.
  • reptile_comparison.xlsx: an Excel workbook containing separate worksheets with lists of reptiles. There are three worksheets:
    • RDB_only is a list of reptile names in RDB but not NCBI.
    • common is a list of reptile names in both RDB and NCBI. It includes each reptile's NCBI taxonomy ID.
    • NCBI_only is a list of reptile names in NCBI but not RDB. It includes each reptile's NCBI taxonomy ID. A third column, "kin", records whether the reptile name falls into one of the special categories from the console output (see above). Finally, if the NCBI species matches only a synonym in RDB, the fourth column gives the current name associated with that synonym in RDB. An example of reptile_comparison.txt is included in this repository.

Technical notes

  • The script only looks for NCBI reptile species, not subspecies. When comparing NCBI species with RDB, the script treats an RDB name as a species only if it comes from the current name column and is a binomial (two words). Any other RDB name is classified as a synonym.
  • None of these categories overlap: "aff.", "cf.", "sp.", hybrid, numbered, synonym. If a reptile falls into more than one of them, it is assigned only to the highest-ranking one, in this order: synonym > "aff." = "cf." = "sp." = hybrid > numbered.
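The two rules above can be sketched as follows. This is an illustration, not the logic of NCBI_reptiles.py: the function names are hypothetical, and the sketch assumes that hybrids are written with a lowercase "x" between two names and that qualifiers appear as separate whitespace-delimited tokens. The README ranks "aff.", "cf.", "sp.", and hybrid equally, so the order in which this sketch checks them is an arbitrary choice.

```python
import re

# "aff.", "cf." or "sp." as a standalone token (so "subsp." does not match).
QUALIFIER = re.compile(r"(?:^|\s)(aff\.|cf\.|sp\.)(?:\s|$)")
# Assumption: a hybrid is written as "<name> x <name>".
HYBRID = re.compile(r"\sx\s")
DIGIT = re.compile(r"\d")


def is_binomial(name):
    """True if the name consists of exactly two words."""
    return len(name.split()) == 2


def classify_ncbi_only(name, rdb_synonyms):
    """Return the single highest-priority category for an NCBI-only name.

    Priority: synonym > qualifier or hybrid > numbered. Returns None if
    the name falls into none of the special categories.
    """
    if name in rdb_synonyms:
        return "synonym"
    match = QUALIFIER.search(name)
    if match:
        return match.group(1)
    if HYBRID.search(name):
        return "hybrid"
    if DIGIT.search(name):
        return "numbered"
    return None
```

Because the checks run from highest to lowest priority and return on the first hit, a name such as "Genus sp. AB123" is reported as "sp." and never as numbered, which keeps the categories from overlapping.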

Questions?

For any questions, please email Akhil.

Acknowledgements

  • Peter Uetz, for the idea and for help with fixing errors.
  • Detlef Leipe, for help with sorting through NCBI's database and with fixing errors.
