CADECT - Concatemer by Amplification DEteCtion Tool

version 1.0.2

Whole genome amplification (WGA) using multiple displacement amplification (MDA) can introduce false concatemer sequences that affect whole genome assembly. Here we propose a concatemer detection tool for those WGA assays.

Figure. Impact of MDA-Generated Concatemers on the Genome Assembly. (A) Concatemers generated by template switching; (B) Graph representation of the effect of concatemers on genome assembly (bubble fragmentation effect). (Agyabeng-Dadzie et al. 2024)

How does it work?

CADECT splits the reads into separate files and cuts each read into sliding windows, using the window size and the gap between windows preferred by the user. For ONT amplified reads, we suggest windows of at least 500 bp with no overlaps (e.g. -w 500).

If a read cannot generate more than one window (shorter than 500 bp in the 500 bp window example), it is classified as a "short" read and stored in the short.fasta/fastq output file. Reads with more than one window are treated as longer sequences: their windows are aligned with each other (global alignment), and if overlaps are found the read is classified as a putative concatemer. Longer sequences with no overlaps are classified as non-concatemers.

A classification table is generated containing the read IDs, classification, number of windows generated and number of alignments found (note: the number of alignments found is not equivalent to the number of repeats/copies). Both fastq and fasta formats are supported. The default global alignment coverage is 0.7.
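The classification logic described above can be sketched roughly as follows. This is a minimal illustration and not CADECT's actual code: the function names are hypothetical, keeping the trailing partial window is an assumption, and difflib.SequenceMatcher from the Python standard library stands in for the global alignment, with its similarity ratio compared against 0.7 as a loose analogue of the alignment coverage threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations


def split_into_windows(seq, window=500):
    """Cut a read into consecutive, non-overlapping windows of `window` bp.

    The last window may be shorter than `window` (an assumption of this sketch).
    """
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def classify_read(seq, window=500, min_similarity=0.7):
    """Return (classification, number of windows, number of similar window pairs)."""
    windows = split_into_windows(seq, window)
    if len(windows) < 2:
        # Not enough windows to compare against each other
        return "short", len(windows), 0

    hits = 0
    for a, b in combinations(windows, 2):
        # Stand-in for a global alignment: SequenceMatcher similarity ratio
        ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
        if ratio >= min_similarity:
            hits += 1

    label = "putative_concatemers" if hits else "non_concatemers"
    return label, len(windows), hits


# Toy example with a 10 bp window: three copies of the same unit
print(classify_read("ACGTTGCAAG" * 3, window=10))
# ('putative_concatemers', 3, 3)
```

The toy read yields three identical windows, so all three window pairs count as hits. As in the classification table, the hit count is a number of pairwise matches, not a number of repeat copies.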

Workflow

[Image: CADECT workflow diagram]

Installation

Requirements:

  • Python3

  • Biopython v1.83 (tested)

Easy install using conda/mamba

mamba create -n cadect -c bioconda -c conda-forge biopython 
git clone https://github.com/rpbap/CADECT.git
conda activate cadect

Usage

python CADECT_7.py [OPTIONS] -R <Reads.fastq/fasta> -o <output_dir> -w <window size>

Flag description:

Required:
  -R  --reads       fastq (or fasta) file with reads generated by WGA sequencing using ONT (required)
  -o  --output_dir  Output directory name (required)
Options: 
  -w  --window    length of desired window sequences in bp (default = 500)
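
For readers who want to wrap or script these options, the flags listed above map naturally onto an argparse parser. The sketch below is hypothetical and is not taken from CADECT_7.py; only the flag names, the required/optional split and the default window of 500 come from the description above, and build_parser is an illustrative name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CADECT flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="CADECT",
        description="Concatemer by Amplification DEteCtion Tool",
    )
    parser.add_argument("-R", "--reads", required=True,
                        help="fastq (or fasta) file with ONT WGA reads")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="output directory name")
    parser.add_argument("-w", "--window", type=int, default=500,
                        help="window length in bp (default: 500)")
    return parser


# Parse an example command line without a -w flag
args = build_parser().parse_args(["-R", "Reads.fastq", "-o", "results"])
print(args.reads, args.output_dir, args.window)
```

When -w is omitted, args.window falls back to the documented default of 500.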

Output Files

Output File                  Description
classification_table.txt     File statistics of the CADECT pipeline
non_concatemers.fastq        fastq/fasta file containing non-concatemeric reads
putative_concatemers.fastq   fastq/fasta file containing putative concatemeric reads
short.fastq                  fastq/fasta file containing short reads
progress.log                 Classification progress report

classification_table.txt output from provided example

Read ID                               Classification        Num Windows  Num Overlaps
3e8417bd-1c3d-4209-a2bd-b443822a7c27  short                 1            0
1f3c3a56-b6a5-49dc-b9c7-2267440e094d  short                 1            0
b7ec9679-37df-42b5-8b4e-00b6fa5fe504  non_concatemers       8            0
d159b5a3-ee3b-4cc4-92ad-1422bf7a5a28  putative_concatemers  24           6
159ffb63-2583-4a7d-88a5-639111d4fe99  putative_concatemers  26           27
6d5ce662-395e-4af2-a68c-37015af5913b  putative_concatemers  18           38
c3974c91-cf3d-4a0e-b7bd-0688ec05ea33  non_concatemers       8            0
b8194fa6-aa7b-4017-bd55-5538b8f31039  putative_concatemers  28           84
a6b76c03-832a-47a1-bb80-0a57b862118a  putative_concatemers  19           7

Important information

  • The current version uses Bio.pairwise2 for the global alignment, which has been deprecated in Biopython. We plan to replace it with an aligner such as Bio.Align.PairwiseAligner in a future version. If the message below appears during your run, don't worry: it is only a warning and the pipeline is still working.
...python3.12/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
  • Useful command line to get the global stats from the classification table: cat classification_table.txt | cut -f 2 | sort | uniq -c
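
The same per-class tally can be done in Python, which may be convenient inside a larger analysis script. This is a small sketch that assumes the table is tab-separated (as the cut -f 2 command implies) and starts with a header row; tally_classifications is an illustrative name, and the rows are taken from the example table.

```python
import csv
import io
from collections import Counter


def tally_classifications(handle):
    """Count reads per class in a tab-separated classification table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    return Counter(row[1] for row in reader if len(row) > 1)


# In-memory example using three rows from the example table
example = io.StringIO(
    "Read ID\tClassification\tNum Windows\tNum Overlaps\n"
    "3e8417bd-1c3d-4209-a2bd-b443822a7c27\tshort\t1\t0\n"
    "b7ec9679-37df-42b5-8b4e-00b6fa5fe504\tnon_concatemers\t8\t0\n"
    "d159b5a3-ee3b-4cc4-92ad-1422bf7a5a28\tputative_concatemers\t24\t6\n"
)
counts = tally_classifications(example)
for label, n in sorted(counts.items()):
    print(n, label)
```

For a real run, pass an open file instead of the in-memory example, for instance open("classification_table.txt", newline="").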

Compute time test

Total number of reads  Cumulative read length  Processing time              OS tested
1,000 reads            4,099,269 bp            ~109 seconds                 MacOS Ventura
40,000 reads           47,837,224 bp           ~486 seconds                 MacOS Ventura
494,419 reads          699,495,625 bp          ~4.3 hours (~15,788 seconds) MacOS Ventura
1,000 reads            6,439,871 bp            ~1,106 seconds               Ubuntu 22.04
40,000 reads           261,519,967 bp          ~13 hours (~48,614 seconds)  Ubuntu 22.04

Computer specs tested:

  • OS: Ubuntu 22.04; MacOS Ventura 13.3.1
  • Memory: 64GiB
  • Processor: Intel Xeon(R) CPU @ 3.90GHz x 16; Apple M1 Max

We are working on a multithreaded implementation to reduce run time. In the meantime, we provide a fasta/fastq parser script under extras (split_input.py) that splits your input file into subsets, so that you can submit multiple jobs and reduce the overall run time.
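
To illustrate the idea of splitting the input into subsets, here is a minimal sketch of chunking a FASTQ file by read count. It is not the split_input.py script shipped under extras, whose interface is not described here; chunk_fastq is a hypothetical helper, and it assumes strict 4-line FASTQ records with no wrapped sequence lines.

```python
import io
from itertools import islice


def chunk_fastq(handle, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding up to `reads_per_chunk` reads.

    Assumes every record spans exactly four lines.
    """
    lines_per_chunk = 4 * reads_per_chunk
    while True:
        chunk = list(islice(handle, lines_per_chunk))
        if not chunk:
            break
        yield chunk


# In-memory example: five toy reads split into subsets of two reads
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
chunks = list(chunk_fastq(io.StringIO(fastq_text), 2))
print([len(chunk) // 4 for chunk in chunks])
# [2, 2, 1]
```

Each chunk could then be written to its own file and passed to a separate CADECT job with -R, using a different output directory per job.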

Cite us

  • Agyabeng-Dadzie et al. (2024) "Evaluating the benefits and limits of multiple displacement amplification with whole-genome Oxford Nanopore Sequencing." bioRxiv.

Developers

  • Rodrigo P. Baptista, PhD
