mourisl / lighter
Fast and memory-efficient sequencing error corrector
License: GNU General Public License v3.0
Hi,
I used Lighter recently and found that it sometimes produces invalid fastq files like this:
@ST-E00114:418:HHKV7CCXX:8:1109:26484:40020 2:N:0:GCCAAT
TTCAAGAGGTCTTAAGGAGGCTAACTCATATATAGGTTAGGTCTAGATCTAGAGGATGAGAGCATGGGAGACATGGATGTACATAGACTCGATAAGTACTT
+
AA<F<FAFAA<,FFKKK<FFFKKKKKKKKKKKFKKKAFFF79:40020 2:N:0:GCCAAT
As shown, the quality line is truncated and merged with a segment of the header line.
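A quick way to catch records like this before they reach downstream tools is a structural check of each four-line record. This is a minimal sketch, not part of Lighter; check_fastq is a hypothetical helper that flags records whose sequence and quality-line lengths disagree:

```python
def check_fastq(lines):
    """Yield (record_index, reason) for structurally broken FASTQ records."""
    if len(lines) % 4 != 0:
        yield (-1, "line count is not a multiple of 4")
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header, seq, plus, qual = (l.rstrip("\n") for l in lines[i:i + 4])
        if not header.startswith("@"):
            yield (i // 4, "header does not start with '@'")
        elif not plus.startswith("+"):
            yield (i // 4, "separator line does not start with '+'")
        elif len(seq) != len(qual):
            # catches truncated/merged quality lines like the one above
            yield (i // 4, "sequence and quality lengths differ")
```

Running this over a corrected file before alignment would surface the truncated record shown above.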
Hello,
I'm not sure that a GitHub issue is the best place for this, but it is the suggested channel for support, so I will give it a go.
I had some good initial experiences with Lighter, so ran it on a larger number of samples (n = 2000). The hypothesis of the experiment was that Lighter would help to reduce errors that were causing 'mixed positions', where the consensus base at a position had the support of less than 90% of the reads that mapped there.
However, my initial good experience did not continue. The image below shows 100 samples randomly selected from our 2000. It shows the number of mixed positions obtained when reads that have just been quality trimmed (uncor_trimmed), and reads that have been quality trimmed and Lighter-corrected (cor_trimmed), are mapped against the reference.
As you can see, the general trend is for there to be more mixed positions in the alignments that have been Lightered, rather than those that have been just trimmed. This was not expected!
When I looked more closely at the positions that were mixed after Lighter but not before, I saw something like the following.
I was initially using an alpha of 0.05 and k = 17, changing this to alpha = 0.1 and k = 25 made no difference to this phenomenon. Do you have any insight into what might be causing this?
OS is Red Hat Enterprise Linux Server release 6.4 (Santiago).
I am trying to use a k-mer size greater than the maximum of 32 that lighter specifies, and have not been able to do so. Is there a way to increase lighter's maximum k-mer length from 32 to a higher value?
I am assembling a highly heterozygous, repetitive, 2.5 Gbp plant genome (rooibos). My dataset is around 320 GB (zipped) (Illumina paired-end, and mate-pair data).
Due to time constraints, I would like to know whether Lighter would be suitable for my dataset and biological system before having the tool installed and running my data through it.
Thanks for your time.
Kind regards,
Allison
I'm working with some old GA2 reads and get this error, which I suppose has to do with differences in Phred encoding. It would be awesome if Lighter could auto-detect the encoding, or at least allow me to specify the Phred offset.
[2014-11-17 23:50:15] Bad quality threshold is "-"
Good day,
I am not certain when one would use the following command line options:
-noQual: ignore the quality score (default: false)
-newQual ascii_quality_score: set the quality for the bases corrected to the specified score (default: not used)
My data is from a 2.5 Gbp plant genome, 195x coverage.
Any other advice/suggestions will be appreciated :)
Thank you for your time.
Kind regards,
Allison
Hi @mourisl,
It was exciting to meet you at CSHL and I appreciate that you took some time to implement the feature we discussed.
I gave it a try this morning, and saving the trusted k-mers file seemed to work fine. At least no error was reported.
$ time lighter \
-r mother-reads-1.fq -r mother-reads-2.fq \
-r father-reads-1.fq -r father-reads-2.fq \
-r proband-reads-1.fq -r proband-reads-2.fq \
-saveTrustedKmers trusted-kmers \
-k 25 25000000 0.233 -t 4
[2017-11-06 10:54:06] =============Start====================
[2017-11-06 10:54:07] Bad quality threshold is "7"
[2017-11-06 10:54:54] Finish sampling kmers
[2017-11-06 10:54:54] Bloom filter A's false positive rate: 0.162178
[2017-11-06 10:54:54] The error rate is high. Lighter adjusts -maxcor to 5 and bad quality threshold to "8".
[2017-11-06 10:56:30] Finish storing trusted kmers
[2017-11-06 10:56:30] The trusted kmers are saved in file trusted-kmers.
real 2m23.730s
user 7m45.252s
sys 0m12.727s
There was a problem when I tried to run the error correction though.
$ time lighter -r mother-reads-1.fq -r mother-reads-2.fq -k 25 25000000 0.233 -t 4 -loadTrustedKmers trusted-kmers -od .
[2017-11-06 11:07:27] =============Start====================
[2017-11-06 11:07:28] Bad quality threshold is "7"
[2017-11-06 11:07:28] Finish loading trusted kmers from file trusted-kmers.
Segmentation fault: 11
real 0m1.548s
user 0m1.385s
sys 0m0.288s
I figured I would try without threading, which did prevent the memory handling error, but did not find any errors to correct.
$ time lighter -r mother-reads-1.fq -r mother-reads-2.fq -k 25 25000000 0.233 -loadTrustedKmers trusted-kmers -od .
[2017-11-06 11:07:52] =============Start====================
[2017-11-06 11:07:53] Bad quality threshold is "7"
[2017-11-06 11:07:53] Finish loading trusted kmers from file trusted-kmers.
[2017-11-06 11:17:43] Finish error correction
Processed 6000000 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
real 9m50.509s
user 9m42.973s
sys 0m6.276s
Is it possible I'm invoking the software incorrectly?
I gzipped this data set and uploaded to AWS, in case that helps to reproduce the error. Please let me know when you've downloaded the data files so I can revoke public access.
Hi, I ran the following command on several human samples.
lighter -r proband_1.fastq -r proband_2.fastq -K 25 3100000000 -t 8
Note that I did not invoke the -trim flag. However, the output contains many reads that are reported twice in Lighter's output. It appears that the first appearance is the trimmed version and the second is the untrimmed version.
@E00170:22:H00WVALXX:1:1212:18711:26308/1 cor bad_suffix=9 ak
TCAGATGCTCACCTGGGGGTGTGGGTGCTGCTCCAGGTTGTCAGATGCTCACCAGGGGGTGCAGGGGCTGCTCCC
--
@E00170:22:H00WVALXX:1:1212:18711:26308/1 cor bad_suffix=9 ak
TGCTCGCCTTGGGGTGTGGGTGCTCTTCCAGGCTGTCACGTGCGCACCTGGGGGTGCAGGGTGCTGTTGCAGGCTGTCAGATGCTCACCTGGGGGTGTGGGTGCTGCTCCAGGTTGTCAGATGCTCACCAGGGGGTGCAGGGGCTGCTCCC
--
@E00170:22:H00WVALXX:1:2114:24984:53135/1
TCCATTCCAGTTGATTCCATTCCACTCCGTTTCATTCCATT
--
@E00170:22:H00WVALXX:1:2114:24984:53135/1
TTCCACTCCATTCCATATTATTCCATTACACTCCATTCCATTCTATTACTTTCATTTCCATTCAATTCCATTCCAGTTGATTCCATTCCACTCCGTTTCATTCCATTTGAGTCCATTCCATTCCATTCCATTCCATTCCGTTCCATTCCAT
The duplicated reads aren't adjacent in the output, sometimes there is error correction metadata appended to the record and sometimes there is not. Any idea what the problem might be?
This was run using the latest master (a768570) on Linux.
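For anyone who needs to detect this situation in their own output, a minimal sketch that counts repeated header IDs might look like this (duplicated_ids is a hypothetical helper, not part of Lighter):

```python
from collections import Counter

def duplicated_ids(fastq_lines):
    """Return header IDs (first whitespace-separated token of each @-line)
    that appear more than once in a FASTQ stream."""
    counts = Counter(
        line.split()[0]
        for i, line in enumerate(fastq_lines)
        if i % 4 == 0          # every 4th line is a header in well-formed FASTQ
    )
    return sorted(rid for rid, n in counts.items() if n > 1)
```

Because Lighter appends correction metadata after a space, taking only the first token groups the trimmed and untrimmed copies under the same ID.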
In your experience, what Bloom filter A error rate is achievable: 0.05, more, less? Should I attempt to optimize parameters such that the error rate is minimized? If so, which parameters seem to be most influential?
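For intuition about the trade-off, the classic Bloom filter approximation p ≈ (1 − e^(−kn/m))^k relates the false-positive rate to the number of inserted items n, the filter size in bits m, and the number of hash functions k. This sketch is the textbook formula, not a statement about Lighter's exact internals; in Lighter, m is driven mainly by the genome-size argument, which is roughly why that parameter is so influential:

```python
import math

def bloom_fp_rate(n_items, m_bits, k_hashes):
    """Approximate Bloom filter false-positive rate:
    p ~ (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes
```

Under this model, adding bits for a fixed number of items always lowers p, and undersizing the filter drives p toward 1.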
In the conda and docker section, could you please add the advice to brew install lighter for those using Homebrew or Linuxbrew?
The number of reads is calculated by dividing the total number of lines by 4, but the program doesn't check whether the total number of lines is a multiple of 4:
input file:
@iris:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa]baaaaa_aab]D^^
baYDW]abaa
^
@iris:7:1:17:800#0/1
OUTPUT
./lighter -r test.fq -k 17 50 0.1 -t 10
[2014-12-08 11:50:43] =============Start====================
[2014-12-08 11:50:43] Bad quality threshold is "^"
[2014-12-08 11:50:44] Finish sampling kmers
[2014-12-08 11:50:44] Bloom filter A's error rate: 0.000000
[2014-12-08 11:50:44] Finish storing trusted kmers
[2014-12-08 11:50:44] Finish error correction
Processed 2 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
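The silent miscount above could be avoided by a simple pre-check. This is a minimal sketch (count_reads is a hypothetical helper, not Lighter code) that refuses files whose line count is not a multiple of 4:

```python
def count_reads(path):
    """Count FASTQ reads, refusing files whose line count is not a
    multiple of 4 (a sign of a truncated or malformed file)."""
    with open(path) as f:
        n_lines = sum(1 for _ in f)
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path}: {n_lines} lines is not a multiple of 4; truncated FASTQ?")
    return n_lines // 4
```

With the 9-line input above, this would raise instead of reporting "Processed 2 reads".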
root@ip-172-31-42-250:~/Lighter# make
g++ -Wall -O3 -c -o main.o main.cpp
main.cpp: In function ‘int main(int, char**)’:
main.cpp:219:8: warning: variable ‘readId’ set but not used [-Wunused-but-set-variable]
char *readId/*, *read, *qual*/ ;
^
g++ -Wall -O3 -c -o ErrorCorrection.o ErrorCorrection.cpp
ErrorCorrection.cpp: In function ‘int ErrorCorrection(char*, char*, KmerCode&, int, char, Store*, int&, int&)’:
ErrorCorrection.cpp:44:7: warning: variable ‘trusted’ set but not used [-Wunused-but-set-variable]
bool trusted[MAX_READ_LENGTH] ; // Do not correct these makred positions.
^
g++ -Wall -O3 -c -o KmerCode.o KmerCode.cpp
g++ -Wall -O3 -c -o GetKmers.o GetKmers.cpp
GetKmers.cpp: In function ‘void StoreTrustedKmers(char*, char*, int, char, int*, KmerCode&, Store*, Store*)’:
GetKmers.cpp:261:9: warning: variable ‘k’ set but not used [-Wunused-but-set-variable]
int i, k ;
^
g++ -o lighter -Wall -O3 -lpthread -lz ErrorCorrection.o KmerCode.o GetKmers.o main.o
main.o: In function `GetGoodQuality(Reads&)':
main.cpp:(.text+0x315): undefined reference to `gzgets'
main.cpp:(.text+0x42b): undefined reference to `gzgets'
main.cpp:(.text+0x45a): undefined reference to `gzgets'
main.cpp:(.text+0x477): undefined reference to `gzgets'
main.cpp:(.text+0x4a9): undefined reference to `gzgets'
main.o:main.cpp:(.text+0x70d): more undefined references to `gzgets' follow
main.o: In function `File::~File()':
main.cpp:(.text._ZN4FileD2Ev[_ZN4FileD5Ev]+0x25): undefined reference to `gzclose'
main.o: In function `File::Printf(char const*, ...)':
main.cpp:(.text._ZN4File6PrintfEPKcz[_ZN4File6PrintfEPKcz]+0x10a): undefined reference to `gzputs'
main.o: In function `Reads::~Reads()':
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0x95): undefined reference to `gzclose'
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0xa8): undefined reference to `gzclose'
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0xfa): undefined reference to `gzclose'
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0x13a): undefined reference to `gzclose'
main.o: In function `Reads::Rewind()':
main.cpp:(.text._ZN5Reads6RewindEv[_ZN5Reads6RewindEv]+0x59): undefined reference to `gzrewind'
main.cpp:(.text._ZN5Reads6RewindEv[_ZN5Reads6RewindEv]+0x6e): undefined reference to `gzrewind'
main.o: In function `main':
main.cpp:(.text.startup+0x9fe): undefined reference to `pthread_create'
main.cpp:(.text.startup+0xa14): undefined reference to `pthread_join'
main.cpp:(.text.startup+0xb75): undefined reference to `gzopen'
main.cpp:(.text.startup+0xc95): undefined reference to `gzopen'
main.cpp:(.text.startup+0xcd5): undefined reference to `gzopen'
main.cpp:(.text.startup+0xd0a): undefined reference to `gzgets'
main.cpp:(.text.startup+0x10bd): undefined reference to `gzgets'
main.cpp:(.text.startup+0x10f1): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1396): undefined reference to `gzgets'
main.cpp:(.text.startup+0x13c9): undefined reference to `gzclose'
main.cpp:(.text.startup+0x13e3): undefined reference to `gzgets'
main.cpp:(.text.startup+0x13fd): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1418): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1432): undefined reference to `gzgets'
main.cpp:(.text.startup+0x154d): undefined reference to `pthread_create'
main.cpp:(.text.startup+0x156b): undefined reference to `pthread_join'
main.cpp:(.text.startup+0x176d): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1bf1): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1c18): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1c33): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1c48): undefined reference to `gzgets'
main.o:main.cpp:(.text.startup+0x1d6d): more undefined references to `gzgets' follow
main.o: In function `main':
main.cpp:(.text.startup+0x1f36): undefined reference to `pthread_create'
main.cpp:(.text.startup+0x1f51): undefined reference to `pthread_join'
main.cpp:(.text.startup+0x204d): undefined reference to `gzgets'
main.cpp:(.text.startup+0x2064): undefined reference to `gzgets'
main.cpp:(.text.startup+0x207f): undefined reference to `gzgets'
main.cpp:(.text.startup+0x2114): undefined reference to `gzgets'
main.cpp:(.text.startup+0x212f): undefined reference to `gzgets'
main.o:main.cpp:(.text.startup+0x2149): more undefined references to `gzgets' follow
GetKmers.o: In function `SampleKmers_Thread(void*)':
GetKmers.cpp:(.text+0x3cb): undefined reference to `pthread_create'
GetKmers.cpp:(.text+0x443): undefined reference to `pthread_join'
GetKmers.cpp:(.text+0x4d3): undefined reference to `pthread_create'
GetKmers.cpp:(.text+0x4e2): undefined reference to `pthread_join'
GetKmers.cpp:(.text+0x523): undefined reference to `pthread_join'
GetKmers.o: In function `StoreKmers_Thread(void*)':
GetKmers.cpp:(.text+0x145d): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x14f8): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x1513): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x1529): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x1544): undefined reference to `gzgets'
GetKmers.o:GetKmers.cpp:(.text._ZN5Reads14NextWithBufferEPcS0_S0_bb[_ZN5Reads14NextWithBufferEPcS0_S0_bb]+0x15d): more undefined references to `gzgets' follow
collect2: error: ld returned 1 exit status
make: *** [lighter] Error 1
Hi,
I tried to correct the publicly available SRR001665 dataset with lighter using the following command:
nice -10 lighter -r ../SRR001665_1.fastq.gz -r ../SRR001665_2.fastq.gz -k 13 4600000 0.04 -t 64 -od k13/ 2>&1 | tee k13/lighter.log
The correction runs through without problems, but the resulting fastq files contain 25 and 42 unintentionally trimmed sequences, respectively, like this one:
@SRR001665.72513 071112_SLXA-EAS1_s_4:1:6:808:233 length=36 cor bad_prefix=7 ak
GCGTGCCGAAGTTAGTGGGCCTGGAGAATC
+
IIIIIIIIIIIIIIIIII3?I/_%.IIII_IIC4I'
There are still all 36 quality scores, but the last bases of the sequence (6 in this case) have been trimmed.
The output is:
[2016-03-22 16:44:22] =============Start====================
[2016-03-22 16:44:24] Bad quality threshold is "&"
[2016-03-22 16:45:33] Finish sampling kmers
[2016-03-22 16:45:33] Bloom filter A's false positive rate: 0.001899
[2016-03-22 16:47:13] Finish storing trusted kmers
[2016-03-22 16:52:13] Finish error correction
Processed 20816448 reads:
18328409 are error-free
Corrected 3617298 bases(1.453875 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
Running a small test data set on a chloroplast genome, approx. 40X coverage. I have 190 GB RAM, so that is not the problem:
lighter -r reads-cp.fastq -t 48 -k 17 300000 0.175
[2023-07-28 11:58:45] =============Start====================
[2023-07-28 11:58:45] Bad quality threshold is "*"
[2023-07-28 11:58:46] Finish sampling kmers
[2023-07-28 11:58:46] Bloom filter A's false positive rate: 0.001099
[2023-07-28 11:58:46] Finish storing trusted kmers
/local/spool/pbs/mom_priv/jobs/91431209.gadi-pbs.SC: line 15: 7429 Segmentation fault lighter -r /g/data/dy44/r12.1_doryanthes/reads-cp.fastq -t 48 -k 17 300000 0.175
I'm trying to run this on my computer with a large FASTQ input file, and am running it as a subprocess in Python:
import subprocess

kmer_size = 31
genome_size = 2000000000
error_rate = 0.1  # alpha (sampling rate)
num_threads = 10
lighter_command = [
    lighter_executable_path,            # path to the lighter binary
    '-r', input_reads_path,
    '-k', str(kmer_size),
    str(genome_size), str(error_rate),  # positional arguments after -k
    '-t', str(num_threads)
]
subprocess.run(lighter_command, check=True)
However, if I set the genome size any larger than the above, it won't work, as I get the following error message:
line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['../../Lighter/lighter', '-r', '100000_NG1D7PJA9F_1.fq', '-k', '31', '3200000000', '0.1', '-t', '10']' died with <Signals.SIGKILL: 9>.
The README says to put in at least the size of the genome of the organism in question, which in this case is the human genome. Am I doing something wrong that's a simple fix? Thank you!
It seems like Lighter just checks if the first character of the file is '@' to consider it valid.
For example, this input file (test.fq) was considered correct:
@iris:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
IRIS:7:1:17:394#0/1
aaabaa]baaaaa_aab]D^^
baYDW]abaa
^
IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
IRIS:7:1:17:800#0/1
ababbaaabaaaaa]
ba]
aaaaYD_aXT IRIS:7:1:17:1757#0/1 TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC IRIS:7:1:17:1757#0/1 aaaaaa
aaa`aaaa_^a```]][Z[DY^XYV^_Y
ksdsada
dada
e
e
5
output:
./lighter -r test.fq -k 17 5000000 0.1 -t 10
[2014-12-08 11:35:35] =============Start====================
[2014-12-08 11:35:35] Bad quality threshold is "T"
[2014-12-08 11:35:36] Finish sampling kmers
[2014-12-08 11:35:36] Bloom filter A's error rate: 0.000000
[2014-12-08 11:35:36] Finish storing trusted kmers
[2014-12-08 11:35:36] Finish error correction
Processed 5 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
Would it be possible to update the quality scores of corrected bases? I'm not sure if there is a clever way to do it based on the confidence of the trusted k-mer, but even setting them to a constant, user-specified value would be nice, so that downstream applications don't ignore those bases.
It would be great if Lighter accepted gzip-compressed reads.
Hi,
I sequenced a genome and got 16 paired-end libraries. To test the speed of the program, I picked one library (~11 GB per .gz file) and used the following command:
./lighter -r A-407_1.fq.gz -r A-407_2.fq.gz -K 15 2300000000 -t 48 -od test
I checked the report, and it is:
[2021-02-19 23:57:45] =============Start====================
[2021-02-19 23:57:45] Scanning the input files to infer alpha(sampling rate)
[2021-02-20 00:01:37] Average coverage is 7.923 and alpha is 0.883
[2021-02-20 00:01:41] Bad quality threshold is "5"
[2021-02-20 00:07:35] Finish sampling kmers
[2021-02-20 00:07:35] Bloom filter A's false positive rate: 0.000000
[2021-02-20 00:18:33] Finish storing trusted kmers
[2021-02-20 00:37:52] Finish error correction
Processed 182234180 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
It looks very weird: 0 reads are error-free, and it also looks like none were corrected? Any suggestions about this result?
Thanks;
(Enhancement)
I was wondering if Lighter could output a report detailing what it did.
In particular, running Lighter with paired-end read files ending in the extensions .R1.fastq.gz and .R2.fastq.gz yields output files with the extensions .R1.fastq.cor.fq.gz and .R2.fastq.cor.fq.gz. Could you drop the redundant .fastq extension when naming the output file?
my running command is
lighter -r "$f" -r "$reverse" -K 19 41020000 -od "$OUT_DIR" -trim -stable -t 8 2>> "$OUT_DIR/lighter.out"
Write messages to stderr instead of stdout. The key difference is that data written to stdout is only sent to a redirected file when some buffer fills, whereas data written to stderr updates the redirected file immediately. When I submit a lighter job to LSF/SGE/etc., I do not see the progress until it finishes. Writing messages to stderr would not have this problem.
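As a minimal illustration of the buffering difference (in Python rather than Lighter's C++, purely as a sketch):

```python
import sys

def log_progress(msg):
    # stderr is unbuffered (or line-buffered), so a redirected log file is
    # updated as soon as the message is written; stdout typically sits in a
    # block buffer until it fills, which is why progress appears frozen
    # when a job's stdout is redirected under LSF/SGE.
    print(msg, file=sys.stderr, flush=True)
```

The same effect in C/C++ comes from writing to stderr with fprintf, or flushing stdout explicitly after each message.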
Lighter is slow when it is linked against zlib-1.2.3. Here is part of the timing report when lighter is statically linked against zlib-1.2.8 (4 seconds to read through the input file):
[2015-01-10 19:17:32] Scanning the input files to infer alpha(sampling rate)
[2015-01-10 19:17:36] Average coverage is 43.913 and alpha is 0.159
And here is when zlib-1.2.3 is in use (31 seconds):
[2015-01-10 19:17:49] Scanning the input files to infer alpha(sampling rate)
[2015-01-10 19:18:20] Average coverage is 43.913 and alpha is 0.159
The huge performance difference is probably caused by the use of gzgets(). In zlib-1.2.3, this function is not buffered and performs badly. Zlib-1.2.4 or later uses an internal buffer, which dramatically improves performance. Please see this blog post for details. Note that vcftools also has this issue; it warns the user when an old zlib is on the system. You may do the same. I am also providing a precompiled binary with zlib-1.2.8 statically linked.
BTW, sga by @jts is using a third-party C++ wrapper for zlib. I guess it should be performant with zlib-1.2.3, so you may also consider using that; it is only two additional files.
Given compressed input, lighter only uses one CPU core during the error correction phase even if a large -t is specified. I guess this is because writing compressed fastq is much slower than the multi-threaded error correction. It would be good to let the user specify the zlib compression level of the output. The default is level 6; level 1 is several times faster without much loss in compression ratio.
In the long term, an even better solution is to put file I/O on a separate thread, such that lighter could write the compressed output while correcting reads. Quite a few high-performance tools (e.g. kmc and jellyfish, I believe) are using this trick. It is very effective.
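A minimal sketch of that producer/consumer pattern (write_gz_async is a hypothetical helper, not Lighter code), using a bounded queue and a dedicated gzip writer thread:

```python
import gzip
import queue
import threading

def write_gz_async(records, out_path, compresslevel=1):
    """Compress `records` on a dedicated writer thread so gzip output
    overlaps with whatever produces the records (here: error correction).
    Level 1 trades a little compression ratio for a large speedup."""
    q = queue.Queue(maxsize=1024)  # bounded: the producer cannot run far ahead

    def writer():
        with gzip.open(out_path, "wt", compresslevel=compresslevel) as out:
            while True:
                rec = q.get()
                if rec is None:    # end-of-stream sentinel
                    break
                out.write(rec)

    t = threading.Thread(target=writer)
    t.start()
    for rec in records:            # the "correction" work happens here
        q.put(rec)
    q.put(None)
    t.join()
```

The bounded queue keeps memory flat, and the sentinel cleanly shuts the writer down once correction finishes.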
Hi Friend,
You have a glitch somewhere in the code; see the error message (regardless of the parameters):
ratman@x3650:/usr/local/data/ratman/projects/asm01/data/4u$ lighter -k 25 770000000 0.0875 -t 12 -r pe600_R1.fastq.gz -r pe600_R2.fastq.gz
[2014-06-02 13:14:38] =============Start====================
[2014-06-02 13:16:11] Finish sampling kmers
[2014-06-02 13:16:26] Finish storing trusted kmers
Segmentation fault
Good day,
I am struggling to find sources on whether it is best to perform error correction on raw reads or on quality-trimmed and filtered reads. Can you perhaps give me advice on this, and direct me to a paper?
Kind regards,
Allison
Today I upgraded from lighter 1.0.7 to 1.1.1 and I first noticed a problem when 1.1.1 was outputting different number of reads in the two output files, and then noticed it was also passing far fewer reads.
This is the command line:
lighter -od . -r R1.fq.gz -r R2.fq.gz -K 32 4000000 -t 72 -maxcor 2
This is the difference in read counts:
Files R1.fq.gz
Reads 3747457 # original reads
Files R2.fq.gz
Reads 3747457
Files 1.0.7-R1.cor.fq.gz
Reads 3747457 # none missing
Files 1.0.7-R2.cor.fq.gz
Reads 3747457
Files 1.1.1-R1.cor.fq.gz
Reads 2511489 # lots missing
Files 1.1.1-R2.cor.fq.gz
Reads 2511506 # has 17 more reads!
Any ideas?
Lighter doesn't appear to like small genomes. Here I'm using ~4m reads and a 10kb genome size, with which Lighter does nothing. If I increase the genome size to a megabase, the report looks more as I'd expect. Is this a bug or a feature? Tested on Ubuntu and Yosemite.
10kb genome size:
bede-rmbp:lighter $ ./lighter -k 11 10000 0.1 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq -t 2
[2014-12-04 12:09:31] =============Start====================
[2014-12-04 12:09:32] Bad quality threshold is #
[2014-12-04 12:09:37] Finish sampling kmers
[2014-12-04 12:09:37] Bloom filter A's error rate: 0.998423
[2014-12-04 12:09:50] Finish storing trusted kmers
[2014-12-04 12:10:06] Finish error correction
Processed 3782706 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
1Mb genome size:
bede-rmbp:lighter $ ./lighter -k 11 1000000 0.1 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq -t 2
[2014-12-04 12:34:32] =============Start====================
[2014-12-04 12:34:33] Bad quality threshold is "#"
[2014-12-04 12:34:38] Finish sampling kmers
[2014-12-04 12:34:38] Bloom filter A's error rate: 0.001192
[2014-12-04 12:35:07] Finish storing trusted kmers
[2014-12-04 12:35:26] Finish error correction
Processed 3782706 reads:
3543702 are error-free
Corrected 233998 bases(0.979055 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
lighter -v
Lighter v1.1.0
Hello @mourisl,
I wanted to use Lighter as part of the shovill pipeline (with several threads) but ran into a "Segmentation fault (core dumped)" error message right after the message "Finish storing trusted kmers". I had a look at main.cpp and realized you handle the processing differently depending on the number of threads (1, 2, and more than 2). I tried to pinpoint the problem and finally resolved it by changing lines 806-809 to read:
int batchSize[3] ;
bool init = true, canJoinOutputThread = false ;
int tag = 2, prevTag ;
int fileInd[3] ;
Notice I enlarged the arrays' size to 3, as I think you will use three batches later on.
By trying Linuxbrew, I realized that the provided binary (Bottle) works without problems. If I build the Formula from source (using the system's GCC 7.3.0), I get the above error. If I install Linuxbrew's GCC 5.5.0 and compile the Formula from source, it again works like the Bottle.
I have not provided a pull request, as I am unsure of the nature of this problem. Is it possible that GCC 5.5.0 tolerates the wrong array size (if it is wrong at all), whereas GCC 7.3.0 is stricter about it?
Any comments on this?
Best regards
Stefan
I installed lighter 1.1.1 on MacOS 10.12.4 via Homebrew, and it repeatedly stalled after "storing trusted kmers" until I killed it after 10 minutes of 0% CPU usage. It had started to write the corrected R1 reads before it stalled. I reinstalled from the GitHub repo and it works fine. I'm not sure what the difference is.
[2017-06-30 11:12:18] =============Start====================
[2017-06-30 11:12:18] Scanning the input files to infer alpha(sampling rate)
[2017-06-30 11:12:44] Average coverage is 199.402 and alpha is 0.035
[2017-06-30 11:12:45] Bad quality threshold is "@"
[2017-06-30 11:13:33] Finish sampling kmers
[2017-06-30 11:13:33] Bloom filter A's false positive rate: 0.001737
[2017-06-30 11:17:45] Finish storing trusted kmers
Please add explicit cmdline options for both help and version information.
Good day,
I have 16 .gz files (8 forward, 8 reverse). Can I write the command as follows:
./lighter -r one1.fq.gz -r one2.fq.gz -r two1.fq.gz -r two2.fq.gz -r three1.gz -r three2.gz -k 17 5000000 0.1 -t 32
With all 16 input files in the same command?
Kind regards,
Allison