Comments (27)

compbio avatar compbio commented on August 20, 2024

I got same problem. Any solution?

test_1.txt
test_2.txt

from squeakr.

rtjohnso avatar rtjohnso commented on August 20, 2024

Can you send me the fastq file that caused the segfault? You can either attach it to your issue report, or send it via email. If it's large, you might try deleting lines to cut it down to a "minimal working example", i.e. a shorter file that still causes squeakr to crash.

Best,
Rob

compbio avatar compbio commented on August 20, 2024

I uploaded the first 12 lines of the fastq files.

chelseaju avatar chelseaju commented on August 20, 2024

I am experiencing the same issue. It gives Segmentation fault (core dumped) when processing a fastq file with around 10 million reads. When I tried to run a smaller file (including the provided sample file, test.fastq), it says

Error opening file for serializing
: No such file or directory 

any idea?

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju, which command are you running to count k-mers from the sample fastq file (test.fastq)? I am not able to reproduce the issue.
I am using this command: ./squeakr-count -f -k 28 -s 20 -t 1 -o ./ test.fastq.

Thanks,
Prashant

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @nordhuang, I tried running squeakr-count using the fastq file you provided. But I am not getting any segmentation fault. I am using this command ./squeakr-count -f -k 28 -s 20 -t 1 -o ./ tmp.fastq.

Could you please confirm that you are using the same command?

Thanks,
Prashant

chelseaju avatar chelseaju commented on August 20, 2024

Hi @prashantpandey, thanks for the quick response. I ran the command line you suggested and it resolved the "No such file or directory" issue. Apparently, that error arises when the output directory does not exist. However, the same command line still produces a segmentation fault when processing a large number of reads (in my case, more than 387412 lines).

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju , is there a way I can access your fastq file to reproduce the issue?

Thanks,
Prashant

chelseaju avatar chelseaju commented on August 20, 2024

I am attaching the smaller fastq file (with 387416 lines). I also tried line-by-line debugging. It seemed to me that the error occurs in the qf_serialize() function in threadsafe-gqf/gqf.c (line 2139). Unfortunately, I don't really know how to fix this.

test.fq.gz

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju, just wanted to check: are you seeing the segfault with the smaller fastq file you uploaded?
Also, could you specify the exact command you are using?

Thanks,
Prashant

chelseaju avatar chelseaju commented on August 20, 2024

@prashantpandey I used the command you suggested
./squeakr-count -f -k 28 -s 20 -t 1 -o . test.fq

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju,

I tried reproducing your bug, but I am actually able to count the k-mers in the test.fq file you provided without any crash.

./squeakr-count -f -k 28 -s 20 -t 1 -o . test.fq
Reading from the fastq file and inserting in the QF
Total Time Elapsed: 3.161622seconds
Calc freq distribution: 
Total Time Elapsed: 0.020426seconds
Maximum freq: 129
Num distinct elem: 732988
Total num elems: 4643908

accopeland avatar accopeland commented on August 20, 2024

Hi,
I'm also seeing segfaults. I'm attaching the smallest file that produces an error on my machine (4316 pairs). Machine is

Linux dint01 2.6.32-696.18.7.el6.nersc.x86_64 #1 SMP
 product: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
       vendor: Intel Corp.
       physical id: 1
       bus info: cpu@0
       size: 2601MHz
       capacity: 2601MHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid cpufreq

Compiled with NH=1 (otherwise I see an illegal instruction error).

Commands producing a segfault (any 13 <= k <= 29), with uncompressed or gzipped fastq. Strangely, k=11 works, as do k=29 and various values up to 61, but I did not test exhaustively. The thread count doesn't seem to matter.

squeakr-count -f -k 13 -s 20 -t 22 -o ./ x.fq
Reading from the fastq file and inserting in the QF
Segmentation fault

x.fq.gz

Christina-hshi avatar Christina-hshi commented on August 20, 2024

Found a bug that causes a segmentation fault in the program.
Hi all,
When running the program with the parameters given in the example in the README file,
./squeakr-count -f -k 28 -s 20 -t 1 -o .
I also got a segmentation fault.
Then I used GDB to see where things went wrong inside the code. I found that the code near line 1360 in gqf.c is not safe:

1360  uint64_t empty_slot_index = find_first_empty_slot(qf, runend_index+1);

1362  shift_remainders(qf, insert_index, empty_slot_index);

Line 1360 tries to find the first empty slot at or after slot (runend_index+1). However, there may be no empty slot after it, in which case the returned slot index is larger than the total number of slots and therefore not a valid index. As a result, line 1362 causes a segmentation fault by accessing memory it does not own.
So here we need to check whether "empty_slot_index" is out of bounds.
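A minimal sketch of the kind of guard described above. The stub and function names here are hypothetical stand-ins, not squeakr's actual API; the point is only that the index returned by the search must be range-checked before it is used to shift remainders.

```c
#include <stdint.h>

/* Hypothetical stand-in for find_first_empty_slot(): when the filter
 * is completely full, it returns nslots (one past the end), which is
 * not a valid slot index. */
static uint64_t find_first_empty_slot_stub(uint64_t from, uint64_t nslots) {
    (void)from;
    return nslots; /* simulate "no empty slot found" */
}

/* Guarded insert: range-check the returned index before shifting. */
static int insert_with_guard(uint64_t runend_index, uint64_t nslots) {
    uint64_t empty_slot_index = find_first_empty_slot_stub(runend_index + 1, nslots);
    if (empty_slot_index >= nslots) {
        /* Filter is full: report failure (or trigger a resize) instead
         * of shifting remainders past the end of the allocation. */
        return -1;
    }
    /* shift_remainders(qf, insert_index, empty_slot_index); would go here */
    return 0;
}
```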

More importantly, I think the real cause of the problem is unreasonable parameters.
For example, if we set -k 28 -s 20, then the maximum number of distinct k-mers is 2^(2*28) = 2^56 and the number of slots in the RSQF is ~2^20. If we assume every object stored in the RSQF has frequency at least 3, and the hash_bits is 20 (it should be >= 20 based on the -s parameter), then based on the RSQF's encoding scheme it uses at least 3 slots to store each object. So the RSQF can actually store at most (2^20)/3 objects. Because 2^56 is much larger than 2^20, it is possible that after hashing we get 2^20 distinct objects, which means the RSQF will run out of empty slots after inserting (2^20)/3 unique objects. That is when the segmentation fault happens.
In fact, in the program the hash bits are set to s+8, so there is an even higher chance of a segmentation fault (running out of owned space), since we may need to store up to 2^28 unique objects in memory blocks that can hold at most (2^20)/3 unique objects with frequency >= 3.

How to find reasonable parameters?
The key step is to estimate the number of distinct objects that are going to be inserted into the RSQF.
In theory, the maximum number of distinct objects is min(4^k, 2^hash_bits) if we don't consider the amount of data and its specific properties. If 2^hash_bits <= 4^k, then, since the RSQF can store at most (2^hash_bits)/3 objects under our assumption, it is very likely to run out of space, leading to potential errors. So the solution seems to be choosing k such that 4^k <= (2^hash_bits)/3. For example, if hash_bits is 20, then k should be <= 4. If we want a larger k, we need to increase hash_bits accordingly; for example, if we want k = 28, then hash_bits should be >= 58. Even if each slot used only 1 byte, we would need 2^58 bytes, which is memory prohibitive.
Fortunately, in many real cases the number of distinct objects is much smaller than the theoretical maximum. For example, suppose we want to build the k-mer spectrum of the human genome using an RSQF. Since the human genome is around 3 billion bp long, there will be at most ~2*3 = 6 billion unique k-mers (counting the reverse-complement strand as well), assuming no or extremely low sequencing error. So no matter how large k is, the number of unique k-mers is always bounded by ~6 billion. Therefore, we can set hash_bits >= 35 so that we have a low chance of running out of slots.

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @Christina-hshi, thanks for looking into the segfault. You are right that the segfault happens because the number of slots in the CQF (counting quotient filter) is not enough.
The example command ./squeakr-count -f -k 28 -s 20 -t 1 -o . is for counting k-mers in the test.fastq file, which contains fewer than 2^20 28-mers.

However, for other fastq files, the lognumslots.sh script can be used to estimate the correct size of the CQF (the -s argument). The script takes as input the path to the output file of ntCard (https://github.com/bcgsc/ntCard) and calculates the log of the number of slots Squeakr needs to count k-mers. Please try it; it is also mentioned in the README.

Thanks,
Prashant

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @accopeland and @chelseaju, could you please try the latest release on the master branch? We have made some changes to the API and added auto-resizing. Please read the new README for the new CLI.

Please let me know if you still see the bug.

Thanks,
Prashant

Tgrandis avatar Tgrandis commented on August 20, 2024

Hello:
I also get similar "illegal instruction" problems:

squeakr count -k 33 -t 1 -o Xxyl04.squeakr Xxyl04_R1.fastq Xxyl04_R2.fastq
[2019-03-22 14:27:51.191] [squeakr_console] [info] Reading from the fastq file and inserting in the CQF.
Illegal instruction (core dumped)

Initially I tried with multiple threads, but then I needed to specify the -s parameter. To do that, I installed the ntCard program, got a histogram output from it, and tried the lognumslots script.

./scripts/lognumslots.sh Xxyl04_ntcard_k33.hist
./scripts/lognumslots.sh: line 9: 1166120114 - - : syntax error: operand expected (error token is "- ")
./scripts/lognumslots.sh: line 10: + 2 * + 3 * : syntax error: operand expected (error token is "* ")
(standard_in) 1: syntax error

So I cannot get that to work either.
I'm on an Ubuntu machine (Ubuntu 18.04.2 LTS) and installed squeakr v1.0 only recently.
All I want to do is find a way to quickly (but repeatedly) check the frequency of particular k-mers and to separate my k-mers and corresponding sequences into error/low-copy/repeat groups.

Simply setting a value for -s did not work either:
./squeakr count -e -k 25 -s 20 -t 6 -o Xxyl04.squeakr Xxyl04_seq1.fastq Xxyl04_seq2.fastq
[2019-03-22 14:20:37.588] [squeakr_console] [info] Reading from the fastq file and inserting in the CQF.
Illegal instruction (core dumped)

I tried with a very small dataset (about 1,000 paired reads as fastq), a pair of 3 GB gzipped datasets, and a pair of 8 GB gzipped datasets (the last one is the one that really needs to be analysed).

chelseaju avatar chelseaju commented on August 20, 2024

Hi @prashantpandey, I am still struggling with the number of slots for the CQF argument, which generates a segfault on one of my files. I also tried to run lognumslots.sh and got the same error @Tgrandis observed.

I first ran ntCard, which generated an output of three columns "k", "f", and "n".
k f n
15 1 97756207
15 2 46525201
15 3 22294693
15 4 10887250

It also outputs this information to screen:
k=15 F1 5917728218
k=15 F0 213786893

Looking at the script lognumslots.sh, I could not find any line starting with "F0", "f1", or "f2", and thus lines 6-8 fail to run. Given the information from ntCard, what formula can we use to estimate the number of slots?

Thanks,
Chelsea

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju

The output format has changed in the new version of ntCard. I will update the script according to the new format ASAP.

Thanks,
Prashant

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju ,

We are working to update lognumslots.sh to work with the new ntCard format. In the meantime, you can use the last ntCard release, v1.0.1 (commit fb05b32).

https://github.com/bcgsc/ntCard/releases/tag/1.0.1

This would get you unstuck.

Thanks,
Prashant

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju ,

I have pushed a fix. Please make sure not to specify the output file option -o in the ntCard command; the script expects the F0 value to be in the output file.
For example: ./ntcard -k <kmer-length> -p <prefix> <file>

Thanks,
Prashant

chelseaju avatar chelseaju commented on August 20, 2024

Hi @prashantpandey, thanks for fixing this issue. lognumslots.sh now works well with the output from ntCard. However, even with the suggested slots for the input parameters, I am still getting a segmentation fault. The dataset I am testing contains around 87 million reads with a read length of 180 bp. When counting 15-mers it segfaults, but it seems fine when counting 16-mers and 17-mers. Any idea why?

Thanks,
Chelsea

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju,

Yeah, I guess I understand what's going on.
How many times does it resize before crashing? Also, how many 16-mers/17-mers are there?

Here's what might be going on:
With 15-mers we get 30-bit hashes of k-mers in Squeakr-exact. To insert the hashes into the quotient filter we split each hash into quotient and remainder bits. By default the remainder is 8 bits, which makes the quotient 22 bits, so we create the quotient filter with 2^22 slots, each 8 bits wide.
Every time it resizes, it borrows a bit from the remainder and grows the quotient in order to increase the number of slots in the structure.

With a small k-mer size (and thus a smaller hash value) it can't resize enough times to insert all the k-mers.
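The bit accounting above can be sketched as follows. The struct and function are illustrative only; the default 8-bit remainder and the borrow-one-bit-per-resize behaviour are taken from this comment, not from squeakr's internals:

```c
typedef struct {
    unsigned quotient_bits;  /* log2 of the number of slots */
    unsigned remainder_bits; /* bits of the hash stored per slot */
} hash_split;

/* Start from an h-bit hash with an 8-bit remainder, then apply
 * `resizes` doublings, each moving one bit from remainder to quotient.
 * Once the remainder reaches 0, no further resize is possible. */
static hash_split split_after_resizes(unsigned hash_bits, unsigned resizes) {
    hash_split s = { hash_bits - 8, 8 };
    while (resizes-- > 0 && s.remainder_bits > 0) {
        s.remainder_bits--; /* borrow a bit from the remainder... */
        s.quotient_bits++;  /* ...to double the number of slots */
    }
    return s;
}
```

For 15-mers (30-bit hashes) this starts at a 22-bit quotient with an 8-bit remainder, and after 8 resizes the remainder is exhausted, which is why a small k can run out of room.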

Thanks,
Prashant

chelseaju avatar chelseaju commented on August 20, 2024

Hi @prashantpandey,

Thanks for the quick response. For 15-mers, if I set the slots to 29 (as recommended), it crashes relatively soon (before the first resize). If I set the slots to 28, it crashes after the first resize.

For 16-mers/17-mers, I set the slots to 29 as well, and it resizes once.

In a case like this, do you recommend running the approximate count instead of Squeakr-exact?

Thanks,
Chelsea

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @chelseaju ,

For 15-mers, if slots = 29 and it still resizes, it means the quotient filter has no space left to keep counts of k-mers (because there are no bits left for the remainder).
It also means there are enough k-mers that it would be better to use a counting table instead of the quotient-filter-based hash table used in Squeakr-exact.

We are working on a workaround in Squeakr to handle the case where k is small and the dataset contains almost all 4^k k-mers. However, it might take a few days to get this out.

In the meantime, you can use Squeakr-approximate and try slots = 24 (in approximate mode it uses 8-bit remainders, so the total hash size would be 24 + 8 = 32 bits) and see if it is able to resize and complete.

Since the total number of hash values (2^32) is much larger than the total number of k-mers, there would be very few collisions and you would get (almost) exact counts.

Thanks,
Prashant

kamimrcht avatar kamimrcht commented on August 20, 2024

Hi @prashantpandey
I noticed a segfault using squeakr on large read files with multiple threads.
It stops very early:

[2019-07-11 09:28:25.135] [squeakr_console] [info] Reading from the fastq file and inserting in the CQF.
Segmentation fault (core dumped)

I tried the latest version and the November 18 commit.
I compiled with NH = 1.
I tried with k = 21 and 31, and s = 29 and 33.
Every time, with more than 10 million reads, I end up with the same error.
When I use a sample of less than 10 million reads, squeakr works fine.
This is my command line:
./squeakr/squeakr count -e -k 31 -t 20 -s 29 -o results/tmp.squeakr ERR164480_1.fastq
Then I tried to use a single thread and I could run squeakr on the whole file.

The datasets can be found here:
https://www.ncbi.nlm.nih.gov/sra/ERX140357 (I also tried both fastq)

Thank you!

prashantpandey avatar prashantpandey commented on August 20, 2024

Hi @kamimrcht , I will try and reproduce this locally and will get back to you.

Thanks,
Prashant
