back_to_sequences's Issues
Demultiplexing samples using barcode sequences
How would you suggest using this tool to perform sample demultiplexing using known barcode sequences?
Performance improvement
Hello,
It seems to me that channel-based inter-process communication is not optimal for this type of use case.
At first, I wanted to make minimal modifications and propose a pull request, but the changes turned out to be too extensive. It was simpler to rewrite everything from scratch.
The result of my work can be found in the natir/sequence_back repository. The code isn't perfect, but I wanted to share it with you so we could discuss it.
The principle remains the same, but with some potentially important modifications:
- multi-threaded k-mer counter initialisation
- use of a classic (not zero-copy) fast[a|q] parser, noodles
- support for more compression formats, via niffler
- hash function initialisation with a stable seed, and a preallocated hash table
- read order is not preserved
- no header modification
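To make the hash-table points above concrete, here is a minimal sketch using only the Rust standard library. It is not the code from sequence_back: the names `StableState`, `KmerCounts`, `init_counter` and `count_kmers` are hypothetical, the atomic counters are an assumption about how threads could share the table, and the table is filled on a single thread for brevity. It shows a hasher with a fixed seed, a table preallocated to the number of reference k-mers, and counting spread over threads with no guarantee on read order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hasher builder with a fixed starting state: every `DefaultHasher::default()`
/// begins from the same keys, unlike the randomly seeded `RandomState`.
type StableState = BuildHasherDefault<DefaultHasher>;
type KmerCounts = HashMap<Vec<u8>, AtomicU64, StableState>;

/// Build the table once, preallocated for the number of reference k-mers,
/// so no rehashing happens while it is being filled.
fn init_counter(kmers: &[Vec<u8>]) -> KmerCounts {
    let mut table = HashMap::with_capacity_and_hasher(kmers.len(), StableState::default());
    for kmer in kmers {
        table.insert(kmer.clone(), AtomicU64::new(0));
    }
    table
}

/// Count occurrences of the indexed k-mers in `reads` using up to `n_threads`
/// scoped threads. Reads are split into chunks; processing order is not kept.
fn count_kmers(table: &KmerCounts, reads: &[Vec<u8>], k: usize, n_threads: usize) {
    assert!(k > 0, "k must be positive");
    let n = n_threads.max(1);
    let chunk_len = ((reads.len() + n - 1) / n).max(1);
    thread::scope(|scope| {
        for chunk in reads.chunks(chunk_len) {
            scope.spawn(move || {
                for read in chunk {
                    // Slide a window of length k over the read.
                    for window in read.windows(k) {
                        if let Some(count) = table.get(window) {
                            // The table is read-only; only the counters change.
                            count.fetch_add(1, Ordering::Relaxed);
                        }
                    }
                }
            });
        }
    });
}
```

Because the table itself is never mutated after initialisation, threads can share it by reference without a lock or a channel; only the per-k-mer counters are updated atomically.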
I ran a small benchmark on my computer (AMD Ryzen 7 5800X 8-Core, NVMe disk), which is not necessarily representative:
> s2b="sequence_back --input-kmers compacted_kmers.fasta --input-sequences reads_1000000.fasta --output-sequences filtered_reads_1000000.fasta -k 31 --output-kmers counted_kmers_1000000.txt"
> b2s="back_to_sequences --in-kmers compacted_kmers.fasta --in-sequences reads_1000000.fasta --out-sequences filtered_reads_1000000.fasta -k 31 --out-kmers counted_kmers_1000000.txt"
> hyperfine "$b2s" "$s2b" -n back_to_sequence -n sequence_back
Benchmark 1: back_to_sequence
Time (mean ± σ): 6.790 s ± 0.033 s [User: 5.049 s, System: 6.117 s]
Range (min … max): 6.706 s … 6.829 s 10 runs
Benchmark 2: sequence_back
Time (mean ± σ): 1.242 s ± 0.007 s [User: 6.969 s, System: 0.548 s]
Range (min … max): 1.232 s … 1.256 s 10 runs
Summary
sequence_back ran
5.47 ± 0.04 times faster than back_to_sequence
When I look at CPU usage, back_to_sequences sits at ~150% while sequence_back sits at ~600%.
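As a rough cross-check, average CPU utilisation can be estimated from the hyperfine figures above as (user + system) / wall time. The helper names `cpu_utilisation` and `speedup` below are hypothetical, and this is only an approximation of what a process monitor would report.

```rust
/// Average number of busy cores: CPU time (user + system) divided by wall time.
fn cpu_utilisation(user_s: f64, system_s: f64, wall_s: f64) -> f64 {
    (user_s + system_s) / wall_s
}

/// Ratio of two mean wall-clock times.
fn speedup(slow_s: f64, fast_s: f64) -> f64 {
    slow_s / fast_s
}
```

With the means reported above, `cpu_utilisation(5.049, 6.117, 6.790)` is about 1.64 and `cpu_utilisation(6.969, 0.548, 1.242)` is about 6.05, in line with the ~150% and ~600% observed, and `speedup(6.790, 1.242)` is about 5.47.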
Maybe you'll be interested in the ideas I've implemented.
Performance could be improved further:
- use a 2-bit encoding, at the cost of losing support for non-standard nucleotides: https://www.biorxiv.org/content/10.1101/2023.03.09.531845v1.full.pdf
- build an MPHF (minimal perfect hash function) from the list of k-mers
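The 2-bit idea can be sketched as follows. This is an illustration under assumptions, not the scheme of the linked preprint: the mapping A=00, C=01, G=10, T=11 is an arbitrary choice, and `encode_kmer` / `decode_kmer` are hypothetical names. With k = 31, a k-mer fits in a single `u64`, but any base outside A, C, G, T is rejected, which is the loss of non-standard nucleotide support mentioned above.

```rust
/// Pack a k-mer (k <= 32) into a u64, two bits per base.
/// Returns None if the k-mer is too long or contains a byte other than
/// A, C, G, T (case-insensitive), e.g. N or another IUPAC code.
fn encode_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        let bits = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift earlier bases left and append the new one in the low bits.
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

/// Inverse of `encode_kmer`: unpack `k` bases from the low bits of `packed`.
fn decode_kmer(mut packed: u64, k: usize) -> Vec<u8> {
    let mut bases = vec![0u8; k];
    // The last base sits in the lowest two bits, so fill from the end.
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 3) as usize];
        packed >>= 2;
    }
    bases
}
```

Integer keys like these hash and compare faster than byte strings, and they are also a natural input for an MPHF built over the k-mer list.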