g-d_algorithm's Introduction

1. Introduction

What's the gamma-delta workflow?

  • It is an automated pipeline for analysing DNA samples that provides a quantitative estimate of the species present in each sample.
  • Given a DNA sample and a set of reference genomes corresponding to the species that may be present in the sample, the algorithm generates an output file listing every species identified in the sample and its relative abundance.
  • The input sample can be a set of single- or paired-end reads in FASTA or FASTQ format. The workflow output is a plain text file in CSV format.
  • The core of the workflow is the gamma-delta algorithm, which classifies the reads of the sample using two thresholds (the gamma and delta parameters) to ensure the accuracy of the quantitative estimate. The process is described in detail in the paper listed in the Citation section.

The gamma-delta algorithm aims to identify reads that provide taxonomic information at the species level, and it retains only those reads. Briefly, for each mapping of a read r against a reference, the algorithm computes a mapping ratio A: the number of nucleotides of the query read r that match the target sequence, divided by the total number of nucleotides involved in the alignment. A read r is assigned to species i when its mapping ratio A against species i is higher than gamma and its mapping ratios against all alternative species are below delta (Garrido-Sanz et al. 2020, MBMG). The algorithm is written in Python 2.7 and runs from the command line under Linux.
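The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this README, not the code of the gamma-delta script itself: the function names, the returned labels, the handling of values exactly equal to a threshold, and the gamma and delta values in the usage example are all assumptions.

```python
def mapping_ratio(matches, aligned_length):
    """Mapping ratio A: matching nucleotides over nucleotides in the alignment."""
    if aligned_length <= 0:
        return 0.0
    return float(matches) / aligned_length


def classify_read(ratios, gamma, delta):
    """Apply the gamma-delta rule to one read.

    ratios: dict mapping a species name to the mapping ratio A of this read
            against that species (hypothetical input structure).
    Returns (species, label); species is None when the read is discarded.
    """
    if not ratios:
        return None, "not-mapping"
    # Rank species from the highest to the lowest mapping ratio.
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    best_species, best_ratio = ranked[0]
    # The best mapping ratio must be higher than gamma.
    if best_ratio <= gamma:
        return None, "A1-below-gamma"
    # Every alternative species must have a mapping ratio below delta.
    if any(ratio >= delta for _, ratio in ranked[1:]):
        return None, "A2-above-delta"
    return best_species, "assigned"


if __name__ == "__main__":
    # Illustrative thresholds only; see the paper for the values used there.
    print(classify_read({"species_a": 0.995, "species_b": 0.90}, 0.99, 0.98))
```

A read that maps almost equally well to two species is discarded by the delta check, which is what keeps ambiguous reads out of the abundance estimate.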

2. Setup

2.1. Tools

The following tools must be installed on the system to run the pipeline: Trimmomatic, the BWA aligner and SAMtools. The gamma-delta algorithm requires Python 2.7 and the csv, argparse, os, operator, decimal and datetime modules, all of which are part of the Python standard library.

2.2. Paths

For the pipeline to run correctly, the following paths have to be set in the “gamma-delta_workflow.sh” script:

TRIMMOMATIC_PATH=/path/to/Trimmomatic
BWA_PATH=/path/to/BWA
ST_PATH=/path/to/SAMtools
gd_PATH=/path/to/gamma-delta_algorithm_script
REF_PATH=/path/to/references

2.3. BWA indexes

The gamma-delta workflow uses BWA as its read mapper. BWA requires an index for each reference against which the sample is compared. Index generation only needs to be performed once, so the workflow script can be modified to skip index computation when new samples are analyzed and the reference indexes already exist.
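One way to decide whether a reference still needs indexing is to look for the files that `bwa index` writes next to the reference FASTA (`.amb`, `.ann`, `.bwt`, `.pac` and `.sa`). The sketch below is not part of the workflow script; the function names are hypothetical, and it assumes the indexes were built with the default prefix, that is, the reference file name itself.

```python
import os

# Extensions of the files written by `bwa index` next to the reference FASTA.
BWA_INDEX_EXTENSIONS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def bwa_index_exists(reference_fasta):
    """Return True when every BWA index file for this reference is present."""
    return all(
        os.path.isfile(reference_fasta + extension)
        for extension in BWA_INDEX_EXTENSIONS
    )


def references_needing_index(reference_fastas):
    """Return the references for which `bwa index` still has to be run."""
    return [ref for ref in reference_fastas if not bwa_index_exists(ref)]
```

Only the references returned by `references_needing_index` would then be passed to `bwa index`, leaving existing indexes untouched.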

3. Command-line and options

cd /path/to/script

For single-end reads:

./gamma-delta-workflow.sh reads.fastq

For paired-end reads:

./gamma-delta-workflow.sh forward_reads_R1.fastq reverse_reads_R2.fastq

4. Output format

Column header: Query name of the sample
Not-mapping-reads: Number of reads that did not map to any reference
A1-below-gamma: Number of reads that were removed by gamma threshold
A2-above-delta: Number of reads that were removed by delta threshold
List of recovered species: Reference name, taken from the name of the SAM file (Number of reads | Relative proportion of reads)

Example:

Sample 1                  Sample 2
Not-mapping-reads (50)    Not-mapping-reads (60)
A1-below-gamma (30)       A1-below-gamma (38)
A2-above-delta (20)       A2-above-delta (2)
Reference 1 (90 | 0.9)    Reference 4 (55 | 0.55)
Reference 2 (5 | 0.05)    Reference 2 (30 | 0.30)
Reference 3 (4 | 0.04)    Reference 1 (10 | 0.10)
Reference 4 (1 | 0.01)    0
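
For downstream analysis, the cells shown in the example can be split into a name, a read count and a proportion. The sketch below assumes that every cell follows the `Name (count)` or `Name (count | proportion)` layout exactly as printed above; the pattern and the function name are hypothetical and are not part of the workflow.

```python
import re

# Matches "Name (count)" or "Name (count | proportion)".
CELL_PATTERN = re.compile(
    r"^\s*(?P<name>.+?)\s*"
    r"\(\s*(?P<count>\d+)\s*"
    r"(?:\|\s*(?P<proportion>\d+(?:\.\d+)?)\s*)?\)\s*$"
)


def parse_cell(cell):
    """Split one output cell into (name, read count, proportion).

    The proportion is None for the counter rows, which carry no proportion.
    Returns None for cells that do not follow the layout, such as the
    padding value "0" in the example.
    """
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None
    proportion = match.group("proportion")
    return (
        match.group("name"),
        int(match.group("count")),
        float(proportion) if proportion is not None else None,
    )


if __name__ == "__main__":
    print(parse_cell("Reference 1 (90 | 0.9)"))
    print(parse_cell("Not-mapping-reads (50)"))
```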

5. Authors

6. Reporting bugs

All reports and feedback are highly appreciated. Please send any suggestions via GitHub or by email to [email protected].

7. Disclaimer

The authors provide the information and software in good faith. Under no circumstances shall the authors or the Universitat Autònoma de Barcelona have any liability for any loss or damage of any kind incurred as a result of the use of the information and software provided. The use of this tool is solely at your own risk.

8. Citation

Garrido-Sanz L, Senar MÀ, Piñol J (2020) Estimation of the relative abundance of species in artificial mixtures of insects using low-coverage shotgun metagenomics. Metabarcoding and Metagenomics 4: e48281. https://doi.org/10.3897/mbmg.4.48281
