Comments (8)
The stk_reg_read() function, which allocates the hash for the sequence IDs, calls malloc() and realloc() but does not check whether they failed (returned NULL).
Because you loaded a file bigger than your RAM, the allocation failed. The simplest solution is to break your list.txt file into smaller pieces and process them one at a time. You can use the Unix split command to do that.
from seqtk.
Unfortunately it is not possible to split 'list.txt', because SeqTK is run as part of an automatic pipeline (https://github.com/ndaniel/fusioncatcher) on servers that run other tasks/jobs in parallel. Splitting 'list.txt' also incurs a time penalty. The free memory available on the server changes all the time, which makes it impossible to predict beforehand whether SeqTK will have enough free memory.
The right way to fix this bug is to check the return value of malloc() and realloc() in SeqTK and, on failure, print a clear error message and exit with a non-zero exit code.
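To illustrate the kind of check meant here, a minimal sketch in C follows. The names xmalloc() and xrealloc() are hypothetical helpers chosen for this example; they are not taken from the SeqTK source, and the error message format is also an assumption.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of SeqTK): like malloc(), but on failure it
 * prints a message to stderr and exits with a non-zero status, so a calling
 * pipeline can detect the error and fall back to something else. */
void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL && size != 0) {
        fprintf(stderr, "[E::%s] failed to allocate %zu bytes\n", __func__, size);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Hypothetical helper (not part of SeqTK): like realloc(), with the same
 * fail-fast behaviour. On failure realloc() leaves the old block untouched,
 * but since the process exits immediately there is nothing left to clean up. */
void *xrealloc(void *ptr, size_t size)
{
    void *p = realloc(ptr, size);
    if (p == NULL && size != 0) {
        fprintf(stderr, "[E::%s] failed to resize buffer to %zu bytes\n", __func__, size);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

Replacing the unchecked calls with wrappers like these would turn an allocation failure into an immediate exit with an error message. Note that on systems with memory overcommit enabled, malloc() may still succeed and the process may be killed later, so this check probably does not cover every out-of-memory case.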
from seqtk.
@ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?
Another option is to use grep -F -f list.txt -A 3 file.fastq
from seqtk.
> @ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?
When SeqTK is not able to allocate memory, it usually becomes a zombie process and in many cases refuses to die for hours (or even days) afterwards. The pipeline that uses it therefore also hangs for hours or days.
> Another option is to use grep -F -f list.txt -A 3 file.fastq
At first glance it looks to me like this might not work on all FASTQ files, for example FASTQ files whose read IDs have been compressed using lossy compression (e.g. the read ID is AAA). I already have a replacement in place for when SeqTK fails, namely extract_short_reads.py, but it is slower than SeqTK. So there is no need for workarounds. It would be nice if the bug were simply fixed!
from seqtk.
@ndaniel yes it would be good if it failed correctly so you could fall back onto your Python script.
but given you know how much RAM you have, can you just fall back when you know the size of the list.txt file is too big?
also, I don't think making the read IDs shorter will help, because the code uses a fixed-length hashing string to store the IDs.
from seqtk.
> but given you know how much RAM you have, can you just fall back when you know the size of the list.txt file is too big?
No. It is impossible to predict how much free RAM will be available at a given time in the future on a server that starts, runs, and stops several tasks/programs simultaneously.
> also, I don't think making the read IDs shorter will help, because the code uses a fixed-length hashing string to store the IDs.
Yes, it does help according to my tests.
from seqtk.
seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq
Since the whole ID list needs to be stored in RAM, a memory-efficient data structure like a _Bloom filter_ could be used to check whether an ID is present. This probabilistic data structure may have false positives but has no false negatives, so we can use the difference (IDs not in the sequence file) to ensure correctness.
This needs more steps:
- compute the difference
- construct the Bloom filter from the difference
- extract the sequences not in the difference
_update_: step 1 needs to read all the IDs too!!! So this may not be a good solution.
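For reference, the Bloom filter idea above could be sketched in C roughly as follows. This is only an illustration: the bloom_t type, the bloom_* function names, the FNV-1a style hash, and the double-hashing scheme are all assumptions made for this example, and none of it comes from the SeqTK source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical minimal Bloom filter for read IDs (not SeqTK code). */
typedef struct {
    uint8_t *bits;   /* bit array of (n_bits + 7) / 8 bytes */
    size_t n_bits;   /* number of bits in the filter */
    int n_hashes;    /* number of bit positions probed per key */
} bloom_t;

/* FNV-1a style 64-bit string hash; the seed is XORed into the offset basis. */
static uint64_t fnv1a(const char *s, uint64_t seed)
{
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Allocate an empty filter; returns NULL on bad arguments or allocation failure. */
bloom_t *bloom_init(size_t n_bits, int n_hashes)
{
    bloom_t *b;
    if (n_bits == 0 || n_hashes < 1) return NULL;
    b = malloc(sizeof(*b));
    if (b == NULL) return NULL;
    b->bits = calloc((n_bits + 7) / 8, 1);
    if (b->bits == NULL) { free(b); return NULL; }
    b->n_bits = n_bits;
    b->n_hashes = n_hashes;
    return b;
}

/* Set the n_hashes bit positions for key, derived as (h1 + i * h2) mod n_bits. */
void bloom_add(bloom_t *b, const char *key)
{
    uint64_t h1 = fnv1a(key, 0);
    uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1;
    for (int i = 0; i < b->n_hashes; ++i) {
        size_t pos = (size_t)((h1 + (uint64_t)i * h2) % b->n_bits);
        b->bits[pos >> 3] |= (uint8_t)(1u << (pos & 7));
    }
}

/* Return 0 if key was definitely never added, 1 if it may have been added. */
int bloom_maybe_contains(const bloom_t *b, const char *key)
{
    uint64_t h1 = fnv1a(key, 0);
    uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1;
    for (int i = 0; i < b->n_hashes; ++i) {
        size_t pos = (size_t)((h1 + (uint64_t)i * h2) % b->n_bits);
        if (!(b->bits[pos >> 3] & (1u << (pos & 7)))) return 0;
    }
    return 1;
}

void bloom_free(bloom_t *b)
{
    if (b != NULL) { free(b->bits); free(b); }
}
```

A caller would add every ID with bloom_add() and then test each read name with bloom_maybe_contains(). A result of 0 is always reliable, while a result of 1 can be a false positive, which is why the comment above needs the extra difference step.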
from seqtk.
For now I am using extract_short_reads.py for cases where the read IDs do not fit in memory. It works well, BUT it is slower than SeqTK.
from seqtk.