CADECT - Concatemer by Amplification DEteCtion Tool

version 1.0.2

Whole-genome amplification (WGA) by multiple displacement amplification (MDA) can introduce artificial concatemer sequences that interfere with whole-genome assembly. CADECT is a tool for detecting these concatemers in reads from WGA experiments.

Figure. Impact of MDA-Generated Concatemers on the Genome Assembly. (A) Concatemers generated by template switching; (B) Graph representation of the effect of concatemers on genome assembly (bubble fragmentation effect). (Agyabeng-Dadzie et al. 2024)

How it works

CADECT splits all reads into separate files and cuts each one into sliding windows of a user-defined size, with a user-defined gap between windows. For ONT amplified reads, we suggest windows >= 500 bp with no overlap (e.g. -w 500).

Reads are then classified as follows. A read that cannot generate more than one window (i.e. shorter than 500 bp when -w 500 is used) is classified as a short read and stored in the short.fasta/fastq output file. Reads with more than two windows are treated as longer sequences: their windows are aligned against each other (global alignment), and if overlaps are found the read is classified as a putative concatemer. Longer sequences with no overlaps are classified as non-concatemers.

A classification table is generated containing the read ID, the classification, the number of windows generated, and the number of alignments found (note: the number of alignments is not equivalent to the number of repeats/copies). Both fastq and fasta formats are supported. The default global alignment coverage threshold is 0.7.
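The windowing and classification logic described here can be sketched in a few lines of Python. This is a minimal illustration, not the CADECT source: the function names are hypothetical, a trailing fragment shorter than the window is assumed to count as a window, any read with at least two windows is assumed to be compared, and `difflib.SequenceMatcher` is swapped in as a simple stand-in for the global aligner so the example runs with the standard library only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def split_windows(seq, window=500):
    """Cut a read into consecutive, non-overlapping windows.

    Assumption: a trailing fragment shorter than `window` still counts
    as a window, so a read shorter than `window` yields exactly one.
    """
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def similarity(a, b):
    """Stand-in for a global alignment score, scaled to the range 0..1."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def classify_read(seq, window=500, min_similarity=0.7):
    """Return (classification, number of windows, number of matching pairs)."""
    windows = split_windows(seq, window)
    if len(windows) <= 1:
        return "short", len(windows), 0
    # Compare every pair of windows and count those above the threshold.
    hits = sum(
        1 for a, b in combinations(windows, 2)
        if similarity(a, b) >= min_similarity
    )
    label = "putative_concatemers" if hits else "non_concatemers"
    return label, len(windows), hits


# Toy reads with an 8 bp window: three copies of one unit, then no repeats.
print(classify_read("ACGTTGCA" * 3, window=8))
print(classify_read("A" * 8 + "C" * 8 + "G" * 8, window=8))
```

Because every pair of windows is compared, the number of matching pairs can exceed the number of windows, which is consistent with the note that alignment counts are not copy counts.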

Workflow

Installation

Requirements:

Python3
Biopython v1.83 (tested)

Easy install using conda/mamba

mamba create -n cadect -c bioconda -c conda-forge biopython 
git clone https://github.com/rpbap/CADECT.git
conda activate cadect

Usage

python CADECT_7.py [OPTIONS] -R <Reads.fastq/fasta> -o <output_dir> -w <window size>

Flag description:

Required:
  -R  --reads       fastq (or fasta) file with reads generated by WGA sequencing using ONT (required)
  -o  --output_dir  Output directory name (required)
Options: 
  -w  --window    length of desired window sequences in bp (default = 500)
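
For readers who want to wrap or script the tool, the documented flags map naturally onto an `argparse` definition. The sketch below is an assumption about how such a parser could look, not a copy of the parser in CADECT_7.py; only the three flags listed above are taken from this README, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser exposing the three flags documented above."""
    parser = argparse.ArgumentParser(
        prog="CADECT_7.py",
        description="Concatemer detection for WGA reads (illustrative parser).",
    )
    parser.add_argument("-R", "--reads", required=True,
                        help="fastq (or fasta) file with ONT WGA reads")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="output directory name")
    parser.add_argument("-w", "--window", type=int, default=500,
                        help="window length in bp (default: 500)")
    return parser


# Parse an example command line; the file names are placeholders.
args = build_parser().parse_args(["-R", "reads.fastq", "-o", "cadect_out"])
print(args.reads, args.output_dir, args.window)
```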

Output Files

Output File	Description
`classification_table.txt`	Per-read classification statistics from the CADECT pipeline
`non_concatemers.fastq`	fastq/fasta file containing non-concatemeric reads
`putative_concatemers.fastq`	fastq/fasta file containing putative concatemeric reads
`short.fastq`	fastq/fasta file containing short reads
`progress.log`	Classification progress report

classification_table.txt output from provided example

Read ID	Classification	Num Windows	Num Overlaps
3e8417bd-1c3d-4209-a2bd-b443822a7c27	short	1	0
1f3c3a56-b6a5-49dc-b9c7-2267440e094d	short	1	0
b7ec9679-37df-42b5-8b4e-00b6fa5fe504	non_concatemers	8	0
d159b5a3-ee3b-4cc4-92ad-1422bf7a5a28	putative_concatemers	24	6
159ffb63-2583-4a7d-88a5-639111d4fe99	putative_concatemers	26	27
6d5ce662-395e-4af2-a68c-37015af5913b	putative_concatemers	18	38
c3974c91-cf3d-4a0e-b7bd-0688ec05ea33	non_concatemers	8	0
b8194fa6-aa7b-4017-bd55-5538b8f31039	putative_concatemers	28	84
a6b76c03-832a-47a1-bb80-0a57b862118a	putative_concatemers	19	7

Important information

The current version uses Bio.pairwise2 for the global alignment, which has been deprecated in Biopython. We plan to switch to a maintained aligner such as Bio.Align.PairwiseAligner in a future version. If the message below appears during your run, don't worry: it is only a warning, and the pipeline is still working.

...python3.12/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
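
If the warning clutters your logs, one option is to filter it by its message text using Python's standard `warnings` module. This is a hedged sketch rather than part of CADECT: `silence_pairwise2_warning` is a hypothetical helper, and it would need to run before `Bio.pairwise2` is imported (for example from a small wrapper script) to have any effect.

```python
import warnings


def silence_pairwise2_warning():
    """Ignore only the Bio.pairwise2 deprecation message quoted above."""
    warnings.filterwarnings(
        "ignore",
        # Regular expression matched against the start of the message.
        message=r"Bio\.pairwise2 has been deprecated",
    )
```

Matching on the message keeps every other warning visible, which is safer than disabling warnings globally.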

Useful command line to get global stats from the classification table: cat classification_table.txt | cut -f 2 | sort | uniq -c
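
The same summary can be produced in Python, which also skips the header row. The sketch assumes the table is tab-separated with the column names shown in the example above; `count_classifications` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def count_classifications(handle):
    """Count reads per class in a tab-separated classification table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Classification"] for row in reader)


# Three rows copied from the example table in this README.
example = io.StringIO(
    "Read ID\tClassification\tNum Windows\tNum Overlaps\n"
    "3e8417bd-1c3d-4209-a2bd-b443822a7c27\tshort\t1\t0\n"
    "b7ec9679-37df-42b5-8b4e-00b6fa5fe504\tnon_concatemers\t8\t0\n"
    "d159b5a3-ee3b-4cc4-92ad-1422bf7a5a28\tputative_concatemers\t24\t6\n"
)
for label, n in count_classifications(example).most_common():
    print(label, n)
```

For a real run, pass an open file instead, e.g. `open("classification_table.txt", newline="")`.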

Compute time test

Total number of reads	Cumulative read length	Processing time	OS tested
1,000 reads	4,099,269 bp	~109 seconds	MacOS Ventura
40,000 reads	47,837,224 bp	~486 seconds	MacOS Ventura
494,419 reads	699,495,625 bp	~4.3 hours (~15,788 seconds)	MacOS Ventura
1,000 reads	6,439,871 bp	~1,106 seconds	Ubuntu 22.04
40,000 reads	261,519,967 bp	~13 hours (~48,614 seconds)	Ubuntu 22.04

Computer specs tested:

OS: Ubuntu 22.04; MacOS Ventura 13.3.1
Memory: 64GiB
Processor: Intel Xeon(R) CPU @ 3.90GHz x 16; Apple M1 Max

We are working on a multithreaded implementation to reduce run time. In the meantime, we provide a fasta/fastq parser script under extras (split_input.py) that splits your input file into subsets, so you can submit multiple jobs in parallel and shorten the overall run time.
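
To illustrate the idea of splitting the input, here is a minimal sketch that writes a FASTQ file out in fixed-size chunks. It is not the split_input.py shipped under extras, whose interface is not described here; `split_fastq`, its parameters, and the chunk file names are hypothetical, and the code assumes the common four-lines-per-record FASTQ layout.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, out_dir, reads_per_chunk=10000):
    """Write consecutive chunks of a FASTQ file; return the chunk paths.

    Assumes four lines per record (no wrapped sequence lines).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(path) as handle:
        while True:
            # Read the next block of records, four lines each.
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk = out_dir / f"chunk_{len(chunks) + 1:04d}.fastq"
            chunk.write_text("".join(lines))
            chunks.append(chunk)
    return chunks
```

Each chunk can then be passed to a separate CADECT job with its own output directory, and the resulting classification tables combined afterwards.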

Cite us

Agyabeng-Dadzie et al. (2024) "Evaluating the benefits and limits of multiple displacement amplification with whole-genome Oxford Nanopore Sequencing." bioRxiv.

Developers

Rodrigo P. Baptista, PhD
