GenBank Quality Control

Complete documentation lives at genbankqc.readthedocs.io. It is a work in progress.

GenBankQC is an effort to address the quality control problem for public databases such as the National Center for Biotechnology Information's GenBank. The goal is to offer a simple, efficient, and automated solution for assessing the quality of your genomes.

Note

GenBankQC is currently in alpha. It began as a proof of concept for a specific use case, so it has limitations that users should be aware of. If there is interest, we will address these issues to make it more convenient to use. Please see Caveats for more details.

Features

  • Labelling/annotation-independent quality control based on:
    • Simple metrics
    • Genome distance estimation using MASH
  • Flag potential outliers so they can be excluded before they pollute your pipelines

The genbankqc workflow consists of the following steps:

  1. Generate statistics for each genome based on the following metrics:
    • Number of unknown bases
    • Number of contigs
    • Assembly size
    • Average MASH distance compared to other genomes
  2. Flag potential outliers based on these statistics:
    • Flag genomes containing more than a certain number of unknown bases.
    • Flag genomes outside of a range based on the median absolute deviation.
      • Applies to number of contigs and assembly size
    • Flag genomes whose average MASH distance exceeds the upper bound of the range based on the median absolute deviation.
  3. Visualize the results with a color-coded tree
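
The flagging rules in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not GenBankQC's actual implementation: the function names, the default thresholds (200 unknown bases, a tolerance of 3.0), and the exact form of the range (median plus or minus tolerance times the median absolute deviation) are assumptions made for this example.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    med = median(values)
    return median(abs(v - med) for v in values)


def flag_unknown_bases(counts, max_unknowns=200):
    """Flag genomes with more than max_unknowns unknown bases.

    The default of 200 is an assumed threshold for this sketch.
    """
    return [c > max_unknowns for c in counts]


def flag_outside_range(values, tolerance=3.0):
    """Flag values outside median +/- tolerance * MAD.

    Two-sided check, as described for number of contigs and assembly size.
    The tolerance of 3.0 is an assumed default.
    """
    med = median(values)
    spread = tolerance * mad(values)
    return [v < med - spread or v > med + spread for v in values]


def flag_above_upper(values, tolerance=3.0):
    """Flag values above median + tolerance * MAD.

    One-sided check, as described for the average MASH distance.
    """
    upper = median(values) + tolerance * mad(values)
    return [v > upper for v in values]


# Hypothetical statistics for five genomes.
contigs = [10, 12, 11, 13, 90]
unknowns = [0, 5, 250, 3, 0]

print(flag_outside_range(contigs))   # [False, False, False, False, True]
print(flag_unknown_bases(unknowns))  # [False, False, True, False, False]
```

The two-sided check catches genomes that are unusually small as well as unusually large, while the one-sided check only flags genomes that are too distant from the rest, since a low MASH distance is not a sign of a problem.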

Usage

genbankqc /path/to/genomes
open /path/to/genomes/Escherichia_coli/qc/200_3.0_3.0_3.0/tree.svg

Installation

If you don't yet have a functional conda environment, please download and install Miniconda.

conda create -n genbankqc -c etetoolkit -c biocore pip ete3 scikit-bio

source activate genbankqc

pip install genbankqc

Caveats

There are some arbitrary, hard-coded limitations regarding file names. This is because the project originally began as a part of the NCBI Tool Kit (NCBITK), which we use for downloading genomes from NCBI. NCBITK generates a specific directory structure and file naming scheme, which GenBankQC currently expects.

If you'd like to use GenBankQC without using NCBITK, all that is required is that your file names match the Python regular expression re.compile('.*(GCA_\d+\.\d.*)(.fasta)'). You can quickly test this by following my example at pythex.org.
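
You can also check file names locally. The sketch below uses the pattern quoted above, written as a raw string. The helper name `is_valid_name` and the example file names are made up for illustration, and the use of `re.match` is an assumption about how the pattern is applied.

```python
import re

# Pattern quoted from the caveat above. Note that the "." before
# "fasta" is unescaped, so it matches any single character.
GENOME_NAME = re.compile(r'.*(GCA_\d+\.\d.*)(.fasta)')


def is_valid_name(file_name):
    """Return True if file_name matches the expected naming scheme.

    Hypothetical helper for this example, not part of GenBankQC.
    """
    return GENOME_NAME.match(file_name) is not None


# Illustrative file names only.
for name in [
    "GCA_000005845.2_ASM584v2_genomic.fasta",  # accession plus .fasta: matches
    "GCA_000005845.2_ASM584v2_genomic.fna",    # wrong extension: no match
    "my_genome.fasta",                         # no GCA accession: no match
]:
    print(name, is_valid_name(name))
```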
