
Comments (8)

tseemann commented on July 21, 2024

The stk_reg_read() function, which allocates the hash for the sequence IDs, uses the malloc() and realloc() library functions but does not check whether they failed (returned NULL).

Because you loaded a file bigger than your RAM, the allocation failed. The simplest solution is to break your list.txt file into smaller pieces and process them one at a time. You can use the Unix split command to do that.

from seqtk.

ndaniel commented on July 21, 2024

Unfortunately, it is not possible to split 'list.txt', because SeqTK is run as part of an automatic pipeline (https://github.com/ndaniel/fusioncatcher) on servers that run other tasks/jobs in parallel. Splitting 'list.txt' also incurs a time penalty. The free memory available on the server changes all the time, so it is impossible to predict beforehand whether SeqTK will have enough free memory.

The right way to fix this bug is to check the return value of malloc() and realloc() in SeqTK and, on failure, print a clear error message and exit with a non-zero exit code.
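
A minimal sketch of what such a check could look like, assuming nothing about seqtk's actual internals: `xmalloc`, `xrealloc`, `id_list_t`, `id_list_push`, and `id_list_free` are hypothetical names used only for illustration. The idea is that every allocation goes through a wrapper that reports the failure on stderr and exits with a non-zero status, so a calling pipeline can detect the failure and fall back to something else.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper (not a seqtk function): like malloc(), but report
 * the failure and exit with a non-zero status instead of returning NULL. */
static void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL && size != 0) {
        fprintf(stderr, "[E] out of memory: cannot allocate %zu bytes\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Hypothetical wrapper (not a seqtk function): same policy for realloc(). */
static void *xrealloc(void *ptr, size_t size)
{
    void *p = realloc(ptr, size);
    if (p == NULL && size != 0) {
        fprintf(stderr, "[E] out of memory: cannot grow buffer to %zu bytes\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Hypothetical growable list of read IDs, standing in for whatever
 * structure actually holds the IDs. */
typedef struct {
    char **ids;
    size_t n, cap;
} id_list_t;

/* Append a copy of `id`, doubling the capacity when the list is full. */
static void id_list_push(id_list_t *list, const char *id)
{
    assert(list != NULL && id != NULL);
    if (list->n == list->cap) {
        list->cap = list->cap ? list->cap * 2 : 16;
        list->ids = xrealloc(list->ids, list->cap * sizeof *list->ids);
    }
    size_t len = strlen(id) + 1;
    list->ids[list->n] = xmalloc(len);
    memcpy(list->ids[list->n], id, len);
    list->n++;
}

/* Release every stored ID and reset the list to its empty state. */
static void id_list_free(id_list_t *list)
{
    for (size_t i = 0; i < list->n; ++i)
        free(list->ids[i]);
    free(list->ids);
    list->ids = NULL;
    list->n = list->cap = 0;
}
```

With wrappers like these, an out-of-memory condition ends the process immediately with a message and a failing exit status, rather than continuing with a NULL pointer.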


tseemann commented on July 21, 2024

@ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?

Another option is to use grep -F -f list.txt -A 3 file.fastq


ndaniel commented on July 21, 2024

> @ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?

Usually, when SeqTK is not able to allocate memory, it becomes a zombie process and in many cases refuses to die for hours (or even days) afterwards. The pipeline that uses it therefore also hangs in a zombie state for hours or days.

> Another option is to use grep -F -f list.txt -A 3 file.fastq

At first glance, it looks to me like this might not work on all FASTQ files, for example FASTQ files whose read IDs have been compressed with lossy compression (e.g. the read ID is AAA). I already have a replacement in place for when SeqTK fails, extract_short_reads.py, but it is slower than SeqTK. So there is no need for workarounds. It would be nice if the bug were simply fixed!


tseemann commented on July 21, 2024

@ndaniel Yes, it would be good if it failed correctly so you could fall back on your Python script.

But given you know how much RAM you have, can you just fall back when you know the size of the list.txt file is too big?

Also, I don't think making the read IDs shorter will help, because the code uses a fixed-length hashing string to store the IDs.


ndaniel commented on July 21, 2024

> But given you know how much RAM you have, can you just fall back when you know the size of the list.txt file is too big?

No. It is impossible to predict how much free RAM will be available at a given time in the future on a server that starts, runs, and stops several tasks/programs simultaneously.

> Also, I don't think making the read IDs shorter will help, because the code uses a fixed-length hashing string to store the IDs.

Yes, it does help according to my tests.


shenwei356 commented on July 21, 2024

seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq

Since the whole ID list needs to be stored in RAM, a memory-efficient data structure such as a _Bloom filter_ could be used to check whether an ID exists. This probabilistic data structure may have false positives but has no false negatives, so we can use the set difference (IDs not in the sequence file) to ensure correctness.

This needs more steps:

  1. compute the set difference
  2. construct the Bloom filter from the set difference
  3. extract the sequences not in the set difference

_update_: step 1 needs to read the whole ID list too!!! So this may not be a good solution.
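
For reference, a minimal sketch of the Bloom filter membership test mentioned above, assuming two simple string hashes (FNV-1a and djb2) combined by double hashing. All names here (`bloom_t`, `bloom_new`, `bloom_add`, `bloom_maybe_contains`, `bloom_free`) are hypothetical and not part of seqtk. The filter stores only bits, never the IDs themselves, which is where the memory saving would come from.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical Bloom filter: a bit array plus the number of hash probes. */
typedef struct {
    uint8_t *bits;
    uint64_t nbits;
    unsigned nhash;
} bloom_t;

/* 64-bit FNV-1a hash of a NUL-terminated string. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* djb2 hash, used as the second hash for double hashing. */
static uint64_t djb2(const char *s)
{
    uint64_t h = 5381;
    for (; *s; ++s)
        h = h * 33 + (unsigned char)*s;
    return h;
}

/* Create an empty filter; returns NULL if memory cannot be allocated. */
static bloom_t *bloom_new(uint64_t nbits, unsigned nhash)
{
    assert(nbits > 0 && nhash > 0);
    bloom_t *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->bits = calloc((size_t)((nbits + 7) / 8), 1);
    if (b->bits == NULL) {
        free(b);
        return NULL;
    }
    b->nbits = nbits;
    b->nhash = nhash;
    return b;
}

/* Set the nhash bits selected by the probe sequence h1 + i * h2. */
static void bloom_add(bloom_t *b, const char *id)
{
    uint64_t h1 = fnv1a(id), h2 = djb2(id) | 1;
    for (unsigned i = 0; i < b->nhash; ++i) {
        uint64_t bit = (h1 + i * h2) % b->nbits;
        b->bits[bit >> 3] |= (uint8_t)(1u << (bit & 7));
    }
}

/* Returns 0 if `id` was definitely never added, 1 if it may have been
 * added (false positives are possible, false negatives are not). */
static int bloom_maybe_contains(const bloom_t *b, const char *id)
{
    uint64_t h1 = fnv1a(id), h2 = djb2(id) | 1;
    for (unsigned i = 0; i < b->nhash; ++i) {
        uint64_t bit = (h1 + i * h2) % b->nbits;
        if (!(b->bits[bit >> 3] & (1u << (bit & 7))))
            return 0;
    }
    return 1;
}

static void bloom_free(bloom_t *b)
{
    if (b != NULL) {
        free(b->bits);
        free(b);
    }
}
```

As a rough guide, about 10 bits per ID with 7 hash probes gives a false-positive rate near 1%, regardless of how long the IDs are. The sketch does not remove the need for a second exact check if false positives are unacceptable.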


ndaniel commented on July 21, 2024

@shenwei356

For now I am using extract_short_reads.py for cases where the read IDs do not fit in memory. It works well, BUT it is slower than SeqTK.

