
DIVE

The algorithm

DIVE is a purely statistical, completely annotation-free algorithm that takes a new conceptual approach to discovering k-mer sequences associated with high rates of sequence diversification. It efficiently identifies sequences that may mechanistically drive sequence diversification (e.g., a CRISPR repeat or transposon end), together with the variable sequences near them, such as an insertion site. The identified sequences are assigned statistical scores so biologists can prioritize them, and can optionally be blasted against a series of FASTA files. For more details, see [1].

Installation

pip

To install DIVE, run the following pip command in the terminal:

pip install biodive

To install blast within the same environment, use the following command:

conda install -c bioconda blast

github

To install DIVE directly from the repository, run the following commands:

git clone https://github.com/jordiabante/biodive.git
cd biodive
conda create -n biodive python=3.6.8
conda activate biodive
pip install -e .

To install blast within the same environment, use the following command:

conda install -c bioconda blast

Usage

To run a single-sample analysis on a compressed FASTQ/FASTA file, use:

# import bio module
from biodive import bio

# define input file and output dir
outdir = "/path/to/outdir/"
infile = "/path/to/fastq.gz" # or "/path/to/fasta.gz"

# configure run
config = bio.Config(
    outdir=outdir,              # directory where output files will be stored
    kmer_size=25,               # k-mer size used in the analysis
    annot_fasta=[]              # array containing fasta files to use with blast
)

# run analysis
bio.biodive_single_sample_analysis(infile,config)

If len(annot_fasta) > 0, then blast must be available on the PATH (see installation above).
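Because blast must be discoverable on the PATH whenever annot_fasta is non-empty, a quick check before launching a long run can save time. A minimal standard-library sketch (the helper name blast_available is an illustration, not part of biodive):

```python
import shutil

def blast_available(executable="blastn"):
    """Return True if the given BLAST executable is on the PATH."""
    return shutil.which(executable) is not None

# Only pass annot_fasta when blast can actually be called.
annot_fasta = ["/path/to/ref.fasta"] if blast_available() else []
```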

Output files

Anchor sequences table

A table with the suffix _anchors.txt.gz is produced, containing information about the anchors of interest ("keys" in the old naming convention). The file contains the following columns:

    sequence id | assembly of {anchor1,anchor2,...} | max_c_up | max_n_up | max_efct_sz_up | max_efct_sz_qval_up | max_kmer_up | max_c_dn | max_n_dn | max_efct_sz_dn | max_efct_sz_qval_dn | max_kmer_dn | A% | C% | G% | T% | {anchor1,anchor2,...}

where up/dn indicate the position of the hypervariable region (HVR) with respect to the anchor, and:

  • max_c_*: number of clusters formed for the maximizing anchor in the set in * direction.
  • max_n_*: corresponding number of target sequences observed.
  • max_efct_sz_*: corresponding effect size.
  • max_efct_sz_qval_*: corresponding adjusted p-value.
  • max_kmer_*: anchor sequence in the * direction.
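A table like this can be filtered programmatically, e.g. to keep only anchors with a significant adjusted p-value. A minimal standard-library sketch, assuming the table is tab-delimited and headerless; the column indices follow the listing above and should be verified against a real output file:

```python
import csv
import gzip

# Indices of the adjusted p-value columns, following the column
# listing above (an assumption; check against your own output).
QVAL_UP, QVAL_DN = 5, 10

def significant_anchors(path, alpha=0.05):
    """Yield rows whose up- or downstream adjusted p-value is below alpha."""
    with gzip.open(path, "rt") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if float(row[QVAL_UP]) < alpha or float(row[QVAL_DN]) < alpha:
                yield row
```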

If len(annot_fasta) > 0, extra columns are added with annotation information for the anchor sequences that maximize the effect size upstream and downstream (max_kmer_up and max_kmer_dn, respectively). For each direction (upstream, downstream) and for each FASTA file in annot_fasta, two columns are added: the lowest e-value and the corresponding hit (the sequence in that FASTA file producing the lowest e-value). NA is assigned when the e-value is > 1. The result is stored in a new table with the suffix _anchors_annot.txt.gz. For example, passing annot_fasta=[fasta1] adds four extra columns:

    sequence id | ... | {anchor1,anchor2,...} | best_eval_up_fasta1 | best_hit_up_fasta1 | best_eval_dn_fasta1 | best_hit_dn_fasta1 

The intermediate XML files produced by blast are also stored for further analysis.

Re-running annotation

In some cases we might want to update the set of FASTA files to blast the results against. Say, for example, that we want to re-run the annotation with FASTA files f1.fasta, f2.fasta, and f3.fasta on our output file SRRXYZ_anchors.txt.gz (note the _anchors.txt.gz suffix). In that case, we can use the following Python code:

from biodive import bio

anchorfile = "/path/to/SRRXYZ_anchors.txt.gz"
annot_fasta = ["/path/to/annotations/f1.fa", "/path/to/annotations/f2.fa", "/path/to/annotations/f3.fa"]
config = bio.Config(annot_fasta=annot_fasta)

bio.biodive_single_sample_analysis_annotation(anchorfile,config)

Anchor sequences FASTA

Three FASTA files are produced:

  1. FASTA file with suffix _assemb_anchors.fasta: assembled anchor sequences.
  2. FASTA file with suffix _max_anchor_up.fasta: maximizing anchor sequence upstream.
  3. FASTA file with suffix _max_anchor_dn.fasta: maximizing anchor sequence downstream.

Note that not all anchor sequences in files 2 and 3 are necessarily significant.
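These FASTA outputs can be inspected with a few lines of standard-library Python. A minimal record iterator (the function read_fasta is an illustration, not part of biodive):

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)
```

For example, sum(1 for _ in read_fasta("SRRXYZ_max_anchor_up.fasta")) counts the records in the upstream-anchor FASTA.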

Target sequences table

For each anchor in the set {anchor1,anchor2,...}, the target sequences are stored in a file with suffix _targets.txt.gz containing the following columns:

    anchor | upstream/downstream | distance | target | number of instances observed
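The targets table can likewise be summarized per anchor. A minimal standard-library sketch, assuming the table is tab-delimited and headerless with the column order listed above (verify against a real output file):

```python
import csv
import gzip
from collections import defaultdict

def targets_per_anchor(path):
    """Map each anchor to its target sequences and observation counts."""
    counts = defaultdict(dict)
    with gzip.open(path, "rt") as fh:
        for anchor, direction, distance, target, n in csv.reader(fh, delimiter="\t"):
            counts[anchor][target] = int(n)
    return dict(counts)
```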

References

[1] J. Abante, P.L. Wang, J. Salzman. DIVE: a reference-free statistical approach to diversity-generating & mobile genetic element discovery. bioRxiv (2022).

