Giter Club home page Giter Club logo

igda's Introduction

iGDA

Contents

Overview

Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants, which is to determine whether multiple variants are from the same haplotype, play an instrumental role in deciphering cellular genetic heterogeneity, but are still difficult because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an unprecedented opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrated that iGDA can accurately reconstruct haplotypes in closely-related strains of the same species (divergence >= 0.011%) from long-read metagenomic data.

System Requirements

Hardware Requirements

For optimal performance, we recommend a computer with the following specs:

RAM: 16+ GB

CPU: 8+ cores

Software Requirements

OS Requirements

The iGDA package is tested on Linux operating systems.

Linux: CentOS Linux release 7.6.1810 (Core)

Installation Guide

Install via Conda (recommended, need Conda)

conda install -c zhixingfeng igda

Or

conda install -c bioconda -c conda-forge -c zhixingfeng igda

if you don't have biconda or conda-forge in your channel list.

Compile from source code (not recommended)

Install dependencies

  1. xgboost 0.90
  2. bioawk 1.0
  3. samtools 1.10

Compile source code (GCC version >= 5, or other C++ compilers supporting c++14 standard)

Download the source code of iGDA from Release, unzip it, enter the directory and type "Make". Add the "bin" directory to your PATH or create a soft link to ./bin/igda in a directory that the system can find.

Download script

Download https://github.com/zhixingfeng/shell/archive/refs/tags/1.0.1.tar.gz, unzip it and add the folder to your PATH.

Usage

Preprocessing:

Convert lower case letters to upper case in reference fasta file:

fasta2upper infasta outfasta

Convert wildcard letters to N in reference fasta file:

fastaclean fastafile outfafile

Realign reads aligned to the negative strand:

(PacBio data) igda_align_pb infile(bam or sam file) reffile outfile nthread

(PacBio data with no QV, like Sequel or Sequel II) igda_align_pb_fa infile(bam or sam file) reffile outfile nthread

(Nanopore data) igda_align_ont infile(bam or sam file) reffile outfile nthread

Sort and convert realign samfile to bamfile:

sam2bam samfile nthread

Detect minor SNVs:

(PacBio data) igda_pipe_detect -m pb bamfile reffile contextmodel outdir

(Nanopore data) igda_pipe_detect -m ont bamfile reffile contextmodel outdir

Please note:

bamfile is the aligned bam file (sorted and indexed).

reffile is the reference fasta file.

contextmodel is the context effect model trained on independent data. They can be download in https://github.com/zhixingfeng/igda_contextmodel. For PacBio, qvx_NCTC_P6_C4 is the models corresponding to different QV thresholds to mask bases with QV < x as N (Use qv0_NCTC_P6_C4 if your reads have no QV). For ONT, If your data is preprocessed by discarding reads with average QV < x and masking bases with QV < y by "N", use the model named "ont_context_effect_read_qv_x_base_qv_y". "qv_0" means no masking. The "sam_maskqv" command released with iGDA can do the low QV base masking (should be available if you install iGDA using conda), but you could also use any in-house or third-party tools. Many QC tools dealing with FASTQ files can do low average QV reads filtering.

Phase minor SNVs:

(PacBio data) igda_pipe_phase -m pb indir(outdir of igda_pipe_detect) reffile outdir

(Nanopore data) igda_pipe_phase -m ont indir(outdir of igda_pipe_detect) reffile outdir

Output format:

For detecting minor SNVs, detected_snv.vcf in outdir is the final result.

For phasing minor SNVs, contigs.sam, contigs.fa, and contigs.ann are the final results.

In the contigs.ann file, each row is a contig.

Column 1 is chromosome name.

Column 2 is the SNVs of the contig. It is encoded, for each integer x, floor(x/4) = 0-based locus, and x modulo 4 = base (0=A, 1=C, 2=G, 3=T)

Column 3 is start locus (0-based)

Column 4 is end locus (0-based)

The other columns are reserved for internal use

Parameter tuning:

It is difficult to find an universally optimal parameter setting. Here are some tips for parameter tuning if the default one does not have the expected performance:

  • By default, igda_pipe_detect discards reads with aligned length < 1000 because it can reduce the impact read-maping ambiguity. Use igda_pipe_detect -l 0 instead if the data have lots of aligned reads shorter than 1000.

  • In igda_pipe_detect, -r and -c are the two major parameters affecting the accuracy. -r is "Minimal depth for each SNV" and -c is "Minimal maximal conditional substitution rate". By default, igda_pipe_detect uses -r 25 -c 0.65, which is a conservative setting aiming to achieve a low false discover rate (FDR). These parameters might have a low sensitivity if the sequencing depth corresponding to a minor SNV is lower or close to 25. It is possible to increase sensitivity by decreasing -r, but FDR might increase.

Demo

Example data and code can be download from the following link, which includes:

  1. A mixture of 186 Bordetella spp. samples (PacBio sequencing)
  2. A mixture of 155 Escherichia coli samples (PacBio sequencing)
  3. A mixture of 65 Klebsiella pneumoniae samples (Oxford Nanopore sequencing)
  4. A mixture of 11 Borrelia burgdorferi strains and 744 other species to mimic a metagenome (PacBio sequencing)

https://www.dropbox.com/sh/umi03t3eendcktf/AABhXy2cNfQAr0wBike87Kc0a?dl=0

The run time of demo 1~3 is expected to be several minutes with 16 cores, but may vary depending on the specific machines and their working conditions. It might take about several hours to finish demo 4 with 32 cores due to its size.

Citation

Feng, Z., Clemente, J.C., Wong, B. et al. Detecting and phasing minor single-nucleotide variants from long-read sequencing data. Nat Commun 12, 3032 (2021).

DOI: https://doi.org/10.1038/s41467-021-23289-4

Maintainer

Zhixing Feng (冯智星)

For any questions, contact [email protected] or [email protected]

igda's People

Contributors

zhixingfeng avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

igda's Issues

haplotype frequency

Hello
is there an option to store frequency, Avg Coverage, num reads, for each haplotype ... that would be nice to save it in the fasta header for each. thank you and congrats for all the good work!

Context Models

For detecting minor SNVs, what are the differences between the different context models? There seem to be several ONT models, so I wasn't sure which one to select.

ERROR: Unexpected character 'M' found

Hello! Thanks for developing such an amazing tool. However when I apply the IGDA to my mitochondria bam file , it throws an error.
Here are details:
I am running igda_pipe_detect and the parameters are all default.
After it starts the "getbamchrrange" step, there is an error: "ERROR: Unexpected character 'M' found."
I wonder if there is some solution to fix it? like slightly correct my alignment chromosome information 'chrM' or add some code so it can identify the chrM.
Thank you very much!

Final Output - IGV and Reconstructing Final Contigs

Hi Zhixing, I'm sorry for the additional questions, but I have two final ones regarding the output:

  1. How did you group together the outputted contigs to the reference strain contigs in IGV like this:

Screen Shot 2021-12-16 at 1 00 12 PM

  1. In your iGDA output for the metagenome set of Borrelia, you said that you outputted 753 final contigs. If you didn't know what the metagenome set contained, how would you decipher which strains you have based on the 753 final contigs? Did you perhaps run an assembly on the contigs to try to determine the contained strains?

Detecting Minor SNV - Missing $autopar variable

Hi Dr. Feng, when detecting minor SNVs, the program runs but does show a message of "igda_pipe_detect_ont: line 114: [: missing `]'". I looked into line 114 of igda_pipe_detect_ont, and it calls a $auto_par variable that is set to $OPTARG. Do you know what this error message is saying? I thought I inputted all required variables, and the program seems to be running smoothly, but I wanted to make sure. Thank you so much!

Preprocessing not clear enough

Hello,

First we need to "Convert lower case letters to upper case in reference fasta" with fasta2upper infasta outfasta
Second we need to "Convert wildcard letters to N in reference fasta" with fastaclean fastafile outfafile
Third we need to "Realign reads aligned to the negative strand" with igda_align_ont infile(bam or sam file) reffile outfile nthread

Here there is something I don't understand ....
The output of the previous step is a cleaned fastq file but you ask a bam or sam file as input file for the mapping step ....
The infile is fastq not bam or sam right ?
The output of the mapping is sam file right ?

Best

Regarding the maximum length of contigs

Hi @zhixingfeng ,

I am using igda for detecting sublines of one bacteria in a pooled PacBio Sequel II genomic data. The average length of reads is 8kb. I find that igda gives me very few contigs (3-6) for most of datasets. I do not expect 100s of sublines, but I do expect at least 2-3. In my igda results, I get 6 contigs placed very very far apart from each other on a 5Mb genome. From variant analysis, I know that the loci covered by these contigs are either deleted or have a high frequency of mutation (~80%). Do you think that these results could be due to smaller length of reads, thus limiting the maximum achievable length of the contig by igda? Or is it that I am doing something wrong (I followed exactly the commands suggested on the usage page for Sequel II reads)? Any insights will be super useful.

Detect minor SNVs not clear enough

Hello,

Can you please provide more detail on how to chose a contextmodel ?

Also what type of file should we use as contextmodel ?

  • boosting.conf ?
  • boosting.model ?
  • a folder ?
  • What folder ? :
  • train_A
  • train_C
  • train_G
  • train_T
  • or just ont or pacbio ?

Thanks

minimap2 not found

Hello. I'm very excited to try iGDA, but am not able to run the 'idga_align_ont' step. I installed iGDA using a fresh conda environment as described in the GitHub page:
conda install -c bioconda -c conda-forge -c zhixingfeng igda

When I run 'igda_align_ont' I receive the error message "/scicomp/home-pure/ymw8/Software/miniconda3/envs/igda/bin/igda_align_ont: line 24: minimap2: command not found".

I do have several 'minimap2...' executables in the conda environment (such as "minimap2_nanopore"), but none are simply "minimap2".

Please advise. The full error message is below. Thanks -- Adam.

(igda) me> igda_align_ont barcode45.trim.primerclipped.bam iGDA/WuCoV_MN908947_clean.fasta iGDA/barcode45.igda 4
infile=barcode45.trim.primerclipped.bam
reffile=iGDA/WuCoV_MN908947_clean.fasta
outfile=iGDA/barcode45.igda
nthread=4
get forward sequences from sam/bam
run minimap2
~/Software/miniconda3/envs/igda/bin/igda_align_ont: line 24: minimap2: command not found
filter realigned sam
[main_samview] fail to read the header from "iGDA/barcode45.igda.realign.sam".

usability tweaks

Hello. Thank you for developing and sharing iGDA. While testing it out, I noticed a few features that could make it more useful to end-users, in case you decide to do additional development on iGDA.

  1. Report SNVs in a standard human-readable format (e.g. 484K). Right now, they are encoded "for each integer x, floor(x/4) = 0-based locus, and x modulo 4 = base (0=A, 1=C, 2=G, 3=T)"
  2. Report proportions, both for the individual SNV (detected_snv.vcf) and for the contigs if possible.
  3. As I understand the current report format, there is no indication whether or not the 'reference genotype' is present in the sample; genotypes are only reported if they have SNVs.

Thanks
Adam

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.