back_to_sequences's People

Contributors: a-ba, frankandreace, natir, pierrepeterlongo

back_to_sequences's Issues

Performance improvement

Hello,

It seems to me that using channels for inter-thread communication is not optimal for this kind of use case.
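
For comparison, here is a minimal sketch of the kind of chunked, channel-free parallelism I have in mind. It is not code from either tool: the function and parameter names are made up, reads are assumed to be already parsed in memory, the k-mer set is a plain shared read-only HashSet, reverse complements are ignored, and rayon is used for the thread pool:

use std::collections::{HashMap, HashSet};
use rayon::prelude::*;

/// Count how many indexed k-mers occur in each read, without sending
/// individual records through a channel: each worker folds its chunk
/// into a local map, and the maps are merged once at the end.
fn count_kmers_per_read(
    reads: &[(String, Vec<u8>)], // (read id, sequence), already parsed
    kmers: &HashSet<Vec<u8>>,    // read-only k-mer index shared by all threads
    k: usize,
) -> HashMap<String, usize> {
    reads
        .par_chunks(1024) // coarse work units instead of one message per read
        .fold(HashMap::new, |mut local, chunk| {
            for (id, seq) in chunk {
                let hits = seq.windows(k).filter(|w| kmers.contains(*w)).count();
                local.insert(id.clone(), hits);
            }
            local
        })
        .reduce(HashMap::new, |mut merged, part| {
            merged.extend(part);
            merged
        })
}

The point is only the shape of the parallelism: coarse chunks and a single merge step, rather than one channel message per record.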

At first I wanted to make minimal modifications and propose a pull request, but the changes turned out to be too extensive, so it was simpler to rewrite everything from scratch.

The result can be found in the natir/sequence_back repository. The code isn't perfect, but I wanted to share it with you so we can discuss it.

The principle remains the same, but with some potentially significant modifications:

  • multi-threaded k-mer counter initialisation
  • use of a classic, non-zero-copy fast[a|q] parser, noodles
  • support for more compression formats, via niffler
  • hash function initialised with a stable seed, and a preallocated hash table (see the sketch after this list)
  • read order is not preserved
  • no header modification
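
To make the hash-related point concrete, here is a minimal sketch of what a stable seed plus preallocation can look like, using ahash and the standard HashMap. The crate choice, function name, seed values and capacity are mine, purely illustrative, and not the actual sequence_back code:

use std::collections::HashMap;
use ahash::RandomState;

/// Build the k-mer -> count table with a fixed hash seed and a capacity
/// reserved up front, so runs are reproducible and the table never
/// rehashes while the k-mers are inserted.
fn build_kmer_table(kmers: &[Vec<u8>]) -> HashMap<Vec<u8>, u64, RandomState> {
    // Arbitrary but fixed seeds: the same input always hashes the same way.
    let hasher = RandomState::with_seeds(42, 1969, 2023, 7);

    // Reserve room for every k-mer we are about to insert.
    let mut table = HashMap::with_capacity_and_hasher(kmers.len(), hasher);
    for kmer in kmers {
        table.insert(kmer.clone(), 0u64);
    }
    table
}

With a random per-run seed, timings are a little less reproducible, and without preallocation the table rehashes several times while it grows; neither cost is huge, but both are easy to avoid.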

I've performed a small benchmark on my computer (AMD Ryzen 7 5800X 8-core CPU, NVMe disk), which is not necessarily representative:

> s2b="sequence_back --input-kmers compacted_kmers.fasta --input-sequences reads_1000000.fasta --output-sequences filtered_reads_1000000.fasta -k 31 --output-kmers counted_kmers_1000000.txt
> b2s="back_to_sequences --in-kmers compacted_kmers.fasta --in-sequences reads_1000000.fasta --out-sequences filtered_reads_1000000.fasta -k 31 --out-kmers counted_kmers_1000000.txt

> hyperfine "$b2s" "$s2b" -n back_to_sequence -n sequence_back
Benchmark 1: back_to_sequence
  Time (mean ± σ):      6.790 s ±  0.033 s    [User: 5.049 s, System: 6.117 s]
  Range (min … max):    6.706 s …  6.829 s    10 runs

Benchmark 2: sequence_back
  Time (mean ± σ):      1.242 s ±  0.007 s    [User: 6.969 s, System: 0.548 s]
  Range (min … max):    1.232 s …  1.256 s    10 runs

Summary
  sequence_back ran
    5.47 ± 0.04 times faster than back_to_sequence

When I look at CPU usage, back_to_sequence sits at ~150% while sequence_back reaches ~600%, so the former is far from saturating the 8 available cores.

Maybe you'll be interested in the ideas I've implemented.

Performance could likely be improved further.
