back_to_sequences's People

Contributors: a-ba, frankandreace, natir, pierrepeterlongo

back_to_sequences's Issues

Performance improvement

Hello,

It seems to me that using channels for inter-thread communication is not optimal for this kind of use case.
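
For comparison, here is a minimal sketch of the kind of chunked, channel-free parallelism I have in mind. It is not code from either tool: the function and parameter names are made up, reads are assumed to be already parsed in memory, the k-mer set is a plain shared read-only HashSet, reverse complements are ignored, and rayon is used for the thread pool:

use std::collections::{HashMap, HashSet};
use rayon::prelude::*;

/// Count how many indexed k-mers occur in each read, without sending
/// individual records through a channel: each worker folds its chunk
/// into a local map, and the maps are merged once at the end.
fn count_kmers_per_read(
    reads: &[(String, Vec<u8>)], // (read id, sequence), already parsed
    kmers: &HashSet<Vec<u8>>,    // read-only k-mer index shared by all threads
    k: usize,
) -> HashMap<String, usize> {
    reads
        .par_chunks(1024) // coarse work units instead of one message per read
        .fold(HashMap::new, |mut local, chunk| {
            for (id, seq) in chunk {
                let hits = seq.windows(k).filter(|w| kmers.contains(*w)).count();
                local.insert(id.clone(), hits);
            }
            local
        })
        .reduce(HashMap::new, |mut merged, part| {
            merged.extend(part);
            merged
        })
}

The point is only the shape of the parallelism: coarse chunks and a single merge step, rather than one channel message per record.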

At first I wanted to make minimal modifications and propose a pull request, but the changes turned out to be too extensive, so it was simpler to rewrite everything from scratch.

The result can be found in the natir/sequence_back repository. The code isn't perfect, but I wanted to share it with you so we can discuss it.

The principle remains the same, but with some potentially significant modifications:

  • multi-threaded k-mer counter initialisation
  • use of a classic, non-zero-copy fast[a|q] parser, noodles
  • support for more compression formats, via niffler
  • hash function initialised with a stable seed, and a preallocated hash table (see the sketch after this list)
  • read order is not preserved
  • no header modification
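
To make the hash-related point concrete, here is a minimal sketch of what a stable seed plus preallocation can look like, using ahash and the standard HashMap. The crate choice, function name, seed values and capacity are mine, purely illustrative, and not the actual sequence_back code:

use std::collections::HashMap;
use ahash::RandomState;

/// Build the k-mer -> count table with a fixed hash seed and a capacity
/// reserved up front, so runs are reproducible and the table never
/// rehashes while the k-mers are inserted.
fn build_kmer_table(kmers: &[Vec<u8>]) -> HashMap<Vec<u8>, u64, RandomState> {
    // Arbitrary but fixed seeds: the same input always hashes the same way.
    let hasher = RandomState::with_seeds(42, 1969, 2023, 7);

    // Reserve room for every k-mer we are about to insert.
    let mut table = HashMap::with_capacity_and_hasher(kmers.len(), hasher);
    for kmer in kmers {
        table.insert(kmer.clone(), 0u64);
    }
    table
}

With a random per-run seed, timings are a little less reproducible, and without preallocation the table rehashes several times while it grows; neither cost is huge, but both are easy to avoid.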

I've performed a small benchmark on my computer (AMD Ryzen 7 5800X 8-core CPU, NVMe disk), which is not necessarily representative:

> s2b="sequence_back --input-kmers compacted_kmers.fasta --input-sequences reads_1000000.fasta --output-sequences filtered_reads_1000000.fasta -k 31 --output-kmers counted_kmers_1000000.txt
> b2s="back_to_sequences --in-kmers compacted_kmers.fasta --in-sequences reads_1000000.fasta --out-sequences filtered_reads_1000000.fasta -k 31 --out-kmers counted_kmers_1000000.txt

> hyperfine "$b2s" "$s2b" -n back_to_sequence -n sequence_back
Benchmark 1: back_to_sequence
  Time (mean ± σ):      6.790 s ±  0.033 s    [User: 5.049 s, System: 6.117 s]
  Range (min … max):    6.706 s …  6.829 s    10 runs

Benchmark 2: sequence_back
  Time (mean ± σ):      1.242 s ±  0.007 s    [User: 6.969 s, System: 0.548 s]
  Range (min … max):    1.232 s …  1.256 s    10 runs

Summary
  sequence_back ran
    5.47 ± 0.04 times faster than back_to_sequence

When I look at CPU usage, back_to_sequence sits at ~150% while sequence_back reaches ~600%, so the former is far from saturating the 8 available cores.

Maybe you'll be interested in the ideas I've implemented.

Performance could likely be improved further.
