It looks like when there is not enough free memory available for "seqtk subseq", then

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

subseq => core dumped about seqtk HOT 8 OPEN

lh3 commented on July 21, 2024

subseq => core dumped

from seqtk.

Comments (8)

tseemann commented on July 21, 2024

The stk_reg_read() function which allocates the hash for the sequence IDs uses the malloc() and realloc() system calls but does not check to see if they failed (return NULL).

Because you loaded a file bigger than your RAM size, it failed. The simplest solution is to break your list.txt file into smaller pieces and do them one at a time. You can use the Unix split command to do that.

from seqtk.

ndaniel commented on July 21, 2024

Unfortunately it is not possible to split the 'list.txt' because SeqTK is run as part of a automatic pipeline (that is https://github.com/ndaniel/fusioncatcher ) on servers which run in parallel other tasks/jobs. Also splitting the 'list.txt' incurs a penalty time. Therefore the free memory available on the server is changing all the time and it makes it impossible to predict it beforehand if SEQTK has enough free memory or not.

The right way to fix this bug is to check the return code of malloc() and realloc() in SeqTK and throw a nice error message with an exit error code.

from seqtk.

tseemann commented on July 21, 2024

@ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?

Another option is to use grep -f -F list.txt -A 3 file.fastq

from seqtk.

ndaniel commented on July 21, 2024

@ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?

Usually SeqTK will become a zombie process and it will refuse in many cases to die for hours (and even days) afterwards when it was not able to allocate memory successfully. Therefore the pipeline which is using it will also hang in a zombie mode for hours and days because of it.

Another option is to use grep -f -F list.txt -A 3 file.fastq

At first glance it looks to me that this might not work on all FASTQ files, like for example the FASTQ files which have had their ids compressed using lossy compression (e.g. read id is AAA). I have had already a replacement in place when SeqTK fails, that is extract_short_reads.py, but this is slower then SeqTK. Therefore there is no need for workarounds. It would be nice if the bug would just be fixed!

from seqtk.

tseemann commented on July 21, 2024

@ndaniel yes it would be good if it failed correctly so you could fallback onto your python script.

but given you know how much RAM you have, can you just fallback when you know the size of the list.txt file is too big?

also, i don't think making the read IDs shorter will help, because the code uses a fixed length hashing string to store the IDs.

from seqtk.

ndaniel commented on July 21, 2024

but given you know how much RAM you have, can you just fallback when you know the size of the list.txt file is too big?

No. It is impossible to predict how much available free RAM is at a given time in the future on a server which runs/starts/stops simultaneously several tasks/programs.

also, i don't think making the read IDs shorter will help, because the code uses a fixed length hashing string to store the IDs.

Yes, it does help according to my tests.

from seqtk.

shenwei356 commented on July 21, 2024

seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq

Since the whole IDs list needs to be stored in RAM, a memory efficient data structure like _BloomFilter_ could be used for checking the existence of a ID. And this probabilistic data may has false positives but does not has false negatives, so we can use the subtract (IDs not in sequence file) to ensure the correctness.

This needs more steps:

compute the substract
construct the bloom filter with substract
extract sequences not in the substract

_update_: the step 1 needs read whole IDs too!!! so this may be not a good solution.

from seqtk.

ndaniel commented on July 21, 2024

@shenwei356

For now I am using this extract_short_reads.py for cases when the reads ids do not fit in the memory and it works well BUT it is slower than SeqTK.

from seqtk.

subseq => core dumped about seqtk HOT 8 OPEN

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent