Giter Club home page Giter Club logo

debreak's Introduction

DeBreak

A SV caller for long-read single-molecular sequencing data.

Author: Maggi Chen

Email: [email protected]

Draft date: May. 4, 2022

Quick Start

conda install -c bioconda debreak

# quick SV calling with sorted bam
debreak --bam merged.sort.bam --outpath debreak_out/

# Accurate SV calling with sorted bam (reference genome needed)
debreak --bam merged.sort.bam --outpath debreak_out/ --rescue_large_ins --rescue_dup --poa --ref hg38.fa 

# SV discovery in cancer/complex genome
debreak --bam merged.sort.bam --outpath debreak_out/ --rescue_large_ins --poa --ref hg38.fa --tumor

Description

DeBreak is a tool for SV discovery with long read data. The input should be a sorted BAM (PacBio CLR, PacBio HiFi, Oxford Nanopore, or mixed platform). The reference genome is also needed when enabling full functions. The output is a standard VCF file containing all SVs that have passed filters. This program was tested on a x86_64 Linux system with a 128GB physical memory.

Depencency

Dependencies for DeBreak:

  • python3
  • pysam (tested with version 0.19.0)
  • minimap2 (tested with version 2.15)
  • wtdbg2 (tested with version 2.5)
  • bsalign (tested with version 1.2.1)

Installation

To simplify the environment setup process, conda is recommended:

conda create --name deb
conda activate deb
conda install -c bioconda debreak

Or use Git to install after installing all the dependencies:

git clone https://github.com/Maggi-Chen/DeBreak.git
export PATH=$PWD/DeBreak/:$PATH

A test simulated dataset (450Kbp genome with 6 SV embedded) is available to verify successful installation:

git clone https://github.com/Maggi-Chen/DeBreak.git
debreak --bam DeBreak/testdata/test_read.bam -o test_out/ --poa --rescue_large_ins \
--rescue_dup --ref DeBreak/testdata/test_ref.fa

The DeBreak SV discovery on test dataset should finish within 1 minute with 4 CPUs and 400MB memory. The output SV callset should be the same with testdata/debreak.vcf.

General usage

debreak [-h] --bam <sort.bam>

SV caller for long-read sequencing data

optional arguments:
  -h, --help                       Show this help message and exit
  -v, --version                    Show program's version number and exit
  --bam BAM                        Input sorted bam. index required
  --samlist SAMLIST                A list of SAM files of same sample
  -o, --outpath OUTPATH            Output directory
  --min_size MIN_SIZE              Minimal size of detected SV
  --max_size MAX_SIZE              Maxminal size of detected SV
  -d, --depth DEPTH                Sequencing depth of this dataset
  -m, --min_support MIN_SUPPORT    Minimal number of supporting reads for one event
  --min_quality MIN_QUALITY        Minimal mapping quality of reads
  --aligner ALIGNER                Aligner used to generate BAM/SAM
  -t, --thread THREAD       Number of threads
  --rescue_dup                     Rescue DUP from INS calls. minimap2,ref required
  --rescue_large_ins               Rescue large INS. wtdbg2,minimap2,ref required
  --poa                            POA for accurate breakpoint. wtdbg2,minimap2,ref required
  --no_genotype                    Disable genotyping
  -r, --ref REF                Reference genome. Should be same with SAM/BAM
  --maxcov MAXCOV                  Maximal coverage for a SV. Suggested maxcov as 2 times mean depth
  --skip_detect                    Skip SV raw signal detection
  --tumor                          Allow clustered SV breakpoints during raw SV signal detection

Use cases

DeBreak requires a input of read alignment results in BAM format. If you start with sequencing reads (Fasta or Fastq format), you may use minimap2 and samtools to map them to a reference genome before you can apply DeBreak:

minimap2 -a reference.fa  movie1.fastq | samtools sort -o movie1.bam
minimap2 -a reference.fa  movie2.fastq | samtools sort -o movie2.bam
...
samtools merge merged.sort.bam  movie1.bam movie2.bam
samtools index merged.sort.bam

DeBreak can be applied with full function (with accurate SV breakpoints):

debreak --bam merged.sort.bam -o debreak_out/ --rescue_large_ins --rescue_dup --poa --ref hg38.fa

Or with only basic functions to quickly call SVs (without accurate SV breakpoints, may miss some large insertions):

debreak --bam merged.sort.bam -o debreak_out/

Options of Debreak

1. --min_support, minimal number of supporting reads

min_support is the most important argument for SV filtering of DeBreak. It should be adjusted according to sequencing depth of input BAM. Without given min_support/depth information, DeBreak estimates the depth of input dataset to assign a resonable min_support.

debreak --bam merged.sort.bam -o debreak_out/ --min_support 5 

Suggested min_support at each depth:

Depth Suggested min_supp
10 3
20 4
30 5
40 6
50 7
60 8
70 9
80 10

If you specify coverage of dataset with --depth, DeBreak will calculate min_support according to the table.

debreak --bam merged.sort.bam -o debreak_out/ --depth 70

2. --poa, partial order alignment for SV breakpoint refinement

This option calls POA module of DeBreak. minimap2, wtdbg2, and reference genome are required. After clustering and filtering SV calls, DeBreak collects all SV-containing reads for each SV candidate and performs POA with wtdbg2 to generate a consensus sequence with higher base accuracy. It then maps consensus sequences to reference genome and infers precise SV breakpoint positions.

debreak --bam merged.sort.bam -o debreak_out/ --poa --ref reference.fa

3. --rescue_large_ins, detect large insertions using local de novo assembly

This option calls rescue_large_insertion module of DeBreak to identify insertions that are longer than the sequencing reads. minimap2, wtdbg2, and reference genome are required. DeBreak scans whole genome for read alignments with clipped end. It identifies candidate insertion breakpoints with enriched clipped alignments, and performs local de novo assembly at each candidate insertion site. minimap2 is used to map all assembled contigs to the reference. DeBreak then detects ultra-large insertions from the contig alignment results.

debreak --bam merged.sort.bam -o debreak_out/ --rescue_large_ins --ref reference.fa

4. --rescue_dup, rescue duplications that are falsely considered as insertions

This option calls rescue_duplication module of DeBreak. minimap2 and reference genome are required. This module maps inserted sequence of each insertion call back to local region near insertion breakpoint. If the inserted sequence can be properly mapped, it suggests that this is a duplication instead of novel insertion.

debreak --bam merged.sort.bam -o debreak_out/ --rescue_dup --ref reference.fa

5. --tumor, tumor mode for cancer genomes

This option sets looser criteria during SV raw signal detection, allowing identification of potentially closer SV breakpoints in complex rearrangements.

debreak --bam tumor.sort.bam -o debreak_out/ --tumor

Output files

The output directory includes:

debreak.vcf                   Standard VCF file of SV calls. The chromosome, coordinates, size, type, number of supporting reads, mapping quality, genotype, and multi-allelic information are recorded for each SV call.
debreak-allsv-merged-final    Tab-delimited SV list, containing the name of reads supporting each SV call.
sv_raw_calls/                 Includes all SV raw signals on each chromosome.
debreak_poa_workspace/        Temporary files during SV breakpoint refinement with POA. For debug purpose.
debreak_ins_workspace/        Temporary files during ultra-large insertion detection. For debug purpose.
map_depth/                    Temporary files during sequencing depth estimation. For debug purpose.

debreak's People

Contributors

maggi-chen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.