
lighter's People

Contributors

bede, benlangmead, bgruening, mourisl, slowkow


lighter's Issues

Major behaviour change from 1.0.7 to 1.1.1

Today I upgraded from Lighter 1.0.7 to 1.1.1. I first noticed a problem when 1.1.1 output a different number of reads in the two output files, and then noticed that it was also passing far fewer reads overall.

This is the command line:

lighter -od . -r R1.fq.gz -r R2.fq.gz -K 32 4000000 -t 72 -maxcor 2

This is the difference in read counts:

Files   R1.fq.gz
Reads   3747457  # original reads
Files   R2.fq.gz
Reads   3747457

Files   1.0.7-R1.cor.fq.gz
Reads   3747457  # none missing
Files   1.0.7-R2.cor.fq.gz
Reads   3747457

Files   1.1.1-R1.cor.fq.gz
Reads   2511489  # lots missing
Files   1.1.1-R2.cor.fq.gz
Reads   2511506  # has 17 more reads!

Any ideas?
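
For reference, these per-file read counts can be reproduced with a short script; a minimal sketch (using the corrected output names from above):

import gzip

def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file: one read per 4 lines."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4

for fq in ["1.1.1-R1.cor.fq.gz", "1.1.1-R2.cor.fq.gz"]:
    print(fq, count_fastq_reads(fq))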

Issue with Lighter performance

Hello,

I'm not sure that a GitHub issue is the best place for this, but it is the suggested channel for support, so I will give it a go.

I had some good initial experiences with Lighter, so I ran it on a larger number of samples (n = 2000). The hypothesis of the experiment was that Lighter would help to reduce errors that were causing 'mixed positions', i.e. positions where the consensus base had the support of less than 90% of the reads that mapped there.
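
To make that definition concrete, here is a minimal sketch of the check (pure Python; the per-position base counts are hypothetical, in the actual pipeline they come from mapping reads against the reference):

from collections import Counter

def is_mixed(base_counts, min_support=0.90):
    """A position is 'mixed' if the consensus base is supported by
    fewer than min_support of the reads mapped there."""
    total = sum(base_counts.values())
    if total == 0:
        return False
    return max(base_counts.values()) / total < min_support

# 85 of 100 reads support 'A': 85% < 90%, so the position counts as mixed.
print(is_mixed(Counter(A=85, G=15)))   # True
print(is_mixed(Counter(A=99, G=1)))    # False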

However, my good initial experience did not continue. The image below shows 100 randomly selected samples from our 2000: the number of mixed positions obtained when reads that have just been quality trimmed (uncor_trimmed), and reads that have been quality trimmed and Lighter corrected (cor_trimmed), are mapped against the reference.

[screenshot: mixed positions for 100 randomly selected samples, uncor_trimmed vs cor_trimmed]

As you can see, the general trend is for there to be more mixed positions in the alignments from Lighter-corrected reads than in those from reads that were only trimmed. This was not expected!

When I looked more closely at the positions that were mixed after Lighter but not before, I saw something like the following.

Before:
[screenshot of the pileup before correction]

After:
[screenshot of the pileup after correction]

I was initially using an alpha of 0.05 and k = 17; changing this to alpha = 0.1 and k = 25 made no difference to this phenomenon. Do you have any insight into what might be causing this?

OS is Red Hat Enterprise Linux Server release 6.4 (Santiago).

Getting `died with <Signals.SIGKILL: 9>` when trying to run Lighter with the human genome size (3.2 Gbp)

I'm trying to run this on my computer with a large FASTQ input file, and am running it as a subprocess in Python:

import subprocess

# Set the desired parameters
kmer_size = 31
genome_size = 2000000000
error_rate = 0.1
num_threads = 10

# Construct the Lighter command
# (lighter_executable_path and input_reads_path are defined earlier in the script)
lighter_command = [
    lighter_executable_path,
    '-r', input_reads_path,
    '-k', str(kmer_size),
    str(genome_size), str(error_rate),  # additional positional arguments (genome size and alpha)
    '-t', str(num_threads)
]

# check=True raises CalledProcessError if Lighter exits non-zero or is killed
subprocess.run(lighter_command, check=True)

However, if I set the genome size any larger than the above, it won't work, as I get the following error message:

line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['../../Lighter/lighter', '-r', '100000_NG1D7PJA9F_1.fq', '-k', '31', '3200000000', '0.1', '-t', '10']' died with <Signals.SIGKILL: 9>.

The README says to put in at least the size of the genome of the organism in question, which in this case is the human genome. Am I doing something wrong that's a simple fix? Thank you!

undefined reference to `gzgets'

root@ip-172-31-42-250:~/Lighter# make
g++ -Wall -O3 -c -o main.o main.cpp
main.cpp: In function ‘int main(int, char**)’:
main.cpp:219:8: warning: variable ‘readId’ set but not used [-Wunused-but-set-variable]
 char *readId/**, *read, *qual*/ ;
        ^
g++ -Wall -O3 -c -o ErrorCorrection.o ErrorCorrection.cpp
ErrorCorrection.cpp: In function ‘int ErrorCorrection(char*, char*, KmerCode&, int, char, Store*, int&, int&)’:
ErrorCorrection.cpp:44:7: warning: variable ‘trusted’ set but not used [-Wunused-but-set-variable]
 bool trusted[MAX_READ_LENGTH] ; // Do not correct these makred positions.
      ^
g++ -Wall -O3 -c -o KmerCode.o KmerCode.cpp
g++ -Wall -O3 -c -o GetKmers.o GetKmers.cpp
GetKmers.cpp: In function ‘void StoreTrustedKmers(char*, char*, int, char, int*, KmerCode&, Store*, Store*)’:
GetKmers.cpp:261:9: warning: variable ‘k’ set but not used [-Wunused-but-set-variable]
 int i, k ;
        ^
g++ -o lighter -Wall -O3 -lpthread -lz ErrorCorrection.o KmerCode.o GetKmers.o main.o
main.o: In function `GetGoodQuality(Reads&)':
main.cpp:(.text+0x315): undefined reference to `gzgets'
main.cpp:(.text+0x42b): undefined reference to `gzgets'
main.cpp:(.text+0x45a): undefined reference to `gzgets'
main.cpp:(.text+0x477): undefined reference to `gzgets'
main.cpp:(.text+0x4a9): undefined reference to `gzgets'
main.o:main.cpp:(.text+0x70d): more undefined references to `gzgets' follow
main.o: In function `File::~File()':
main.cpp:(.text._ZN4FileD2Ev[_ZN4FileD5Ev]+0x25): undefined reference to `gzclose'
main.o: In function `File::Printf(char const*, ...)':
main.cpp:(.text._ZN4File6PrintfEPKcz[_ZN4File6PrintfEPKcz]+0x10a): undefined reference to `gzputs'
main.o: In function `Reads::~Reads()':
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0x95): undefined reference to `gzclose'
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0xa8): undefined reference to `gzclose'
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0xfa): undefined reference to `gzclose'
main.cpp:(.text._ZN5ReadsD2Ev[_ZN5ReadsD5Ev]+0x13a): undefined reference to `gzclose'
main.o: In function `Reads::Rewind()':
main.cpp:(.text._ZN5Reads6RewindEv[_ZN5Reads6RewindEv]+0x59): undefined reference to `gzrewind'
main.cpp:(.text._ZN5Reads6RewindEv[_ZN5Reads6RewindEv]+0x6e): undefined reference to `gzrewind'
main.o: In function `main':
main.cpp:(.text.startup+0x9fe): undefined reference to `pthread_create'
main.cpp:(.text.startup+0xa14): undefined reference to `pthread_join'
main.cpp:(.text.startup+0xb75): undefined reference to `gzopen'
main.cpp:(.text.startup+0xc95): undefined reference to `gzopen'
main.cpp:(.text.startup+0xcd5): undefined reference to `gzopen'
main.cpp:(.text.startup+0xd0a): undefined reference to `gzgets'
main.cpp:(.text.startup+0x10bd): undefined reference to `gzgets'
main.cpp:(.text.startup+0x10f1): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1396): undefined reference to `gzgets'
main.cpp:(.text.startup+0x13c9): undefined reference to `gzclose'
main.cpp:(.text.startup+0x13e3): undefined reference to `gzgets'
main.cpp:(.text.startup+0x13fd): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1418): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1432): undefined reference to `gzgets'
main.cpp:(.text.startup+0x154d): undefined reference to `pthread_create'
main.cpp:(.text.startup+0x156b): undefined reference to `pthread_join'
main.cpp:(.text.startup+0x176d): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1bf1): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1c18): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1c33): undefined reference to `gzgets'
main.cpp:(.text.startup+0x1c48): undefined reference to `gzgets'
main.o:main.cpp:(.text.startup+0x1d6d): more undefined references to `gzgets' follow
main.o: In function `main':
main.cpp:(.text.startup+0x1f36): undefined reference to `pthread_create'
main.cpp:(.text.startup+0x1f51): undefined reference to `pthread_join'
main.cpp:(.text.startup+0x204d): undefined reference to `gzgets'
main.cpp:(.text.startup+0x2064): undefined reference to `gzgets'
main.cpp:(.text.startup+0x207f): undefined reference to `gzgets'
main.cpp:(.text.startup+0x2114): undefined reference to `gzgets'
main.cpp:(.text.startup+0x212f): undefined reference to `gzgets'
main.o:main.cpp:(.text.startup+0x2149): more undefined references to `gzgets' follow
GetKmers.o: In function `SampleKmers_Thread(void*)':
GetKmers.cpp:(.text+0x3cb): undefined reference to `pthread_create'
GetKmers.cpp:(.text+0x443): undefined reference to `pthread_join'
GetKmers.cpp:(.text+0x4d3): undefined reference to `pthread_create'
GetKmers.cpp:(.text+0x4e2): undefined reference to `pthread_join'
GetKmers.cpp:(.text+0x523): undefined reference to `pthread_join'
GetKmers.o: In function `StoreKmers_Thread(void*)':
GetKmers.cpp:(.text+0x145d): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x14f8): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x1513): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x1529): undefined reference to `gzgets'
GetKmers.cpp:(.text+0x1544): undefined reference to `gzgets'
GetKmers.o:GetKmers.cpp:(.text._ZN5Reads14NextWithBufferEPcS0_S0_bb[_ZN5Reads14NextWithBufferEPcS0_S0_bb]+0x15d): more undefined references to `gzgets' follow
collect2: error: ld returned 1 exit status
make: *** [lighter] Error 1

When to use noQual and newQual ascii_quality_score

Good day,

I am not certain when one would use the following command line options:

-noQual: ignore the quality score (default: false)
-newQual ascii_quality_score: set the quality for the bases corrected to the specified score (default: not used)

My data is from a 2.5 Gbp plant genome at 195x coverage.

Any other advice/suggestions will be appreciated :)

Thank you for your time.

Kind regards,
Allison

Unintentional trimming of sequences

Hi,

I tried to correct the publicly available SRR001665 dataset with lighter using the following command:
nice -10 lighter -r ../SRR001665_1.fastq.gz -r ../SRR001665_2.fastq.gz -k 13 4600000 0.04 -t 64 -od k13/ 2>&1 | tee k13/lighter.log

The correction runs through without problems, but the resulting fastq files contain 25 and 42 unintentionally trimmed sequences, respectively, like this one:
@SRR001665.72513 071112_SLXA-EAS1_s_4:1:6:808:233 length=36 cor bad_prefix=7 ak
GCGTGCCGAAGTTAGTGGGCCTGGAGAATC
+
IIIIIIIIIIIIIIIIII3?I/_%.IIII_IIC4I'
All 36 quality scores are still present, but the last bases of the sequence (six in this case) have been trimmed.
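
These trimmed records are easy to flag by comparing sequence and quality lengths; a minimal sketch (the output file name is a placeholder):

import gzip

def find_length_mismatches(path):
    """Yield headers of reads whose sequence and quality strings differ in length."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()                      # '+' separator line
            qual = handle.readline().rstrip("\n")
            if len(seq) != len(qual):
                yield header, len(seq), len(qual)

for record in find_length_mismatches("corrected_1.fq.gz"):   # placeholder name
    print(*record)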

The output is:
[2016-03-22 16:44:22] =============Start====================
[2016-03-22 16:44:24] Bad quality threshold is "&"
[2016-03-22 16:45:33] Finish sampling kmers
[2016-03-22 16:45:33] Bloom filter A's false positive rate: 0.001899
[2016-03-22 16:47:13] Finish storing trusted kmers
[2016-03-22 16:52:13] Finish error correction
Processed 20816448 reads:
18328409 are error-free
Corrected 3617298 bases(1.453875 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads

Issue with loading trusted k-mers

Hi @mourisl,

It was exciting to meet you at CSHL and I appreciate that you took some time to implement the feature we discussed.

I gave it a try this morning, and saving the trusted k-mers file seemed to work fine. At least no error was reported.

$ time lighter \
    -r mother-reads-1.fq -r mother-reads-2.fq \
    -r father-reads-1.fq -r father-reads-2.fq \
    -r proband-reads-1.fq -r proband-reads-2.fq \
    -saveTrustedKmers trusted-kmers \
    -k 25 25000000 0.233 -t 4
[2017-11-06 10:54:06] =============Start====================
[2017-11-06 10:54:07] Bad quality threshold is "7"
[2017-11-06 10:54:54] Finish sampling kmers
[2017-11-06 10:54:54] Bloom filter A's false positive rate: 0.162178
[2017-11-06 10:54:54] The error rate is high. Lighter adjusts -maxcor to 5 and bad quality threshold to "8".
[2017-11-06 10:56:30] Finish storing trusted kmers
[2017-11-06 10:56:30] The trusted kmers are saved in file trusted-kmers.

real	2m23.730s
user	7m45.252s
sys	0m12.727s

There was a problem when I tried to run the error correction though.

$ time lighter -r mother-reads-1.fq -r mother-reads-2.fq  -k 25 25000000 0.233 -t 4 -loadTrustedKmers trusted-kmers -od .
[2017-11-06 11:07:27] =============Start====================
[2017-11-06 11:07:28] Bad quality threshold is "7"
[2017-11-06 11:07:28] Finish loading trusted kmers from file trusted-kmers.
Segmentation fault: 11

real	0m1.548s
user	0m1.385s
sys	0m0.288s

I figured I would try without threading, which did prevent the memory-handling error, but the run did not find any errors to correct.

$ time lighter -r mother-reads-1.fq -r mother-reads-2.fq  -k 25 25000000 0.233 -loadTrustedKmers trusted-kmers -od .
[2017-11-06 11:07:52] =============Start====================
[2017-11-06 11:07:53] Bad quality threshold is "7"
[2017-11-06 11:07:53] Finish loading trusted kmers from file trusted-kmers.
[2017-11-06 11:17:43] Finish error correction
Processed 6000000 reads:
	0 are error-free
	Corrected 0 bases(0.000000 corrections for reads with errors)
	Trimmed 0 reads with average trimmed bases 0.000000
	Discard 0 reads

real	9m50.509s
user	9m42.973s
sys	0m6.276s

Is it possible I'm invoking the software incorrectly?


I gzipped this data set and uploaded to AWS, in case that helps to reproduce the error. Please let me know when you've downloaded the data files so I can revoke public access.

Produce a correction report

(Enhancement)

I was wondering if Lighter could output a report detailing what it did. In particular:

  • how many reads were corrected
  • average number of corrections
  • counts for each position in reads that were corrected
  • how many reads were discarded
  • how many reads were trimmed

And this for each -r read file processed (a rough post-hoc sketch follows below).
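
Part of this can be recovered post hoc from the tags Lighter appends to read headers (cor, bad_prefix=..., bad_suffix=..., as seen in other reports here); a rough sketch, assuming the cor tag marks reads that received corrections, run once per -r file:

import gzip
from collections import Counter

def correction_report(path):
    """Tally Lighter's header annotations in one corrected FASTQ file."""
    stats = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 0:                       # headers only
                continue
            stats["reads"] += 1
            fields = line.split()
            if "cor" in fields:
                stats["corrected_reads"] += 1
            if any(f.startswith(("bad_prefix=", "bad_suffix=")) for f in fields):
                stats["bad_prefix_or_suffix"] += 1
    return stats

print(correction_report("sample.cor.fq.gz"))     # placeholder name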

Test - error report - 0 corrected reads?

Hi,

I sequenced a genome and got 16 paired-end libraries. To test the speed of the program, I picked one library (~11 GB per .gz file) and used the following command:
./lighter -r A-407_1.fq.gz -r A-407_2.fq.gz -K 15 2300000000 -t 48 -od test

I checked the report, and it is:
[2021-02-19 23:57:45] =============Start====================
[2021-02-19 23:57:45] Scanning the input files to infer alpha(sampling rate)
[2021-02-20 00:01:37] Average coverage is 7.923 and alpha is 0.883
[2021-02-20 00:01:41] Bad quality threshold is "5"
[2021-02-20 00:07:35] Finish sampling kmers
[2021-02-20 00:07:35] Bloom filter A's false positive rate: 0.000000
[2021-02-20 00:18:33] Finish storing trusted kmers
[2021-02-20 00:37:52] Finish error correction
Processed 182234180 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads

It looks very weird that 0 reads are error-free, and it also looks like none were corrected.

Any suggestions about this result?

Thanks!

Optional level-1 gzip output?

Given compressed input, lighter only uses one CPU core during the error correction phase even if a large -t is specified. I guess this is because writing compressed fastq is much slower than the multi-threaded error correction. It would be good to let the user specify the zlib compression level of the output. The default is level 6; level 1 is much faster without much loss in compression ratio.
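
The level-1 vs level-6 trade-off is easy to measure on an existing corrected file; a minimal sketch using Python's gzip module, which wraps zlib (the file name is a placeholder):

import gzip
import os
import time

def recompress(src, level):
    """Re-compress a gzipped file at the given level; return (seconds, output bytes)."""
    dst = f"{src}.level{level}"
    start = time.time()
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        while True:
            chunk = fin.read(1 << 20)
            if not chunk:
                break
            fout.write(chunk)
    return time.time() - start, os.path.getsize(dst)

for level in (1, 6):
    seconds, size = recompress("sample.cor.fq.gz", level)    # placeholder name
    print(f"level {level}: {seconds:.1f} s, {size} bytes")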

In the long term, an even better solution is to put file I/O on a separate thread, such that lighter could write the compressed output while correcting reads. Quite a few high-performance tools (e.g. kmc and jellyfish, I believe) are using this trick. It is very effective.

Appropriate for 2.5 Gbp Plant Genome?

I am assembling a highly heterozygous, repetitive, 2.5 Gbp plant genome (rooibos). My dataset is around 320 GB zipped (Illumina paired-end and mate-pair data).

Due to time constraints, I would like to know whether Lighter would be suitable for my dataset and biological system before having the tool installed and running my data through it.

Thanks for your time.

Kind regards,
Allison

Wrong number of reads detected

The number of reads is calculated by dividing the total number of lines by 4, but the program doesn't check whether the total number of lines is actually a multiple of 4:

input file:
@iris:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa]baaaaa_aab]D^^baYDW]abaa^
@iris:7:1:17:800#0/1

OUTPUT
./lighter -r test.fq -k 17 50 0.1 -t 10
[2014-12-08 11:50:43] =============Start====================
[2014-12-08 11:50:43] Bad quality threshold is "^"
[2014-12-08 11:50:44] Finish sampling kmers
[2014-12-08 11:50:44] Bloom filter A's error rate: 0.000000
[2014-12-08 11:50:44] Finish storing trusted kmers
[2014-12-08 11:50:44] Finish error correction
Processed 2 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
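
A minimal sketch of the missing sanity check (not Lighter's actual code):

def checked_read_count(path):
    """Return the read count, refusing files whose line count is not a multiple of 4."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4; "
                         "the file looks truncated or malformed")
    return n_lines // 4

print(checked_read_count("test.fq"))   # would raise for the file above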

selecting k-mer size greater than 32

I am trying to use a k-mer size greater than the maximum of 32 specified by Lighter and have not been able to do that. Is there a way to increase Lighter's maximum k-mer length from 32 to a higher value?
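
I am only guessing at the reason, but the limit of 32 presumably comes from packing each k-mer into a single 64-bit integer at 2 bits per base, which runs out of room at k = 32; a sketch of that encoding (not Lighter's actual code):

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base; 32 bases fill 64 bits."""
    if len(kmer) > 32:
        raise ValueError("a 64-bit code holds at most 32 bases")
    code = 0
    for base in kmer:
        code = (code << 2) | BASE_CODE[base]
    return code

print(hex(encode_kmer("ACGT" * 8)))    # exactly 32 bases -> still fits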

lighter installed from homebrew fails to complete

I installed Lighter 1.1.1 on macOS 10.12.4 via Homebrew, and it repeatedly stalled after the 'storing trusted kmers' step until I killed it after 10 minutes of 0% CPU usage. It had started to write the corrected R1 reads before it stalled. I reinstalled from the GitHub repo and it works fine; I'm not sure what the difference is.

[2017-06-30 11:12:18] =============Start====================
[2017-06-30 11:12:18] Scanning the input files to infer alpha(sampling rate)
[2017-06-30 11:12:44] Average coverage is 199.402 and alpha is 0.035
[2017-06-30 11:12:45] Bad quality threshold is "@"
[2017-06-30 11:13:33] Finish sampling kmers
[2017-06-30 11:13:33] Bloom filter A's false positive rate: 0.001737
[2017-06-30 11:17:45] Finish storing trusted kmers

line 15: 7429 Segmentation fault

Running a small test data set on a chloroplast genome, approx. 40x coverage. I have 190 GB of RAM, so that is not the problem:

lighter -r reads-cp.fastq -t 48 -k 17 300000 0.175

[2023-07-28 11:58:45] =============Start====================
[2023-07-28 11:58:45] Bad quality threshold is "*"
[2023-07-28 11:58:46] Finish sampling kmers
[2023-07-28 11:58:46] Bloom filter A's false positive rate: 0.001099
[2023-07-28 11:58:46] Finish storing trusted kmers
/local/spool/pbs/mom_priv/jobs/91431209.gadi-pbs.SC: line 15: 7429 Segmentation fault lighter -r /g/data/dy44/r12.1_doryanthes/reads-cp.fastq -t 48 -k 17 300000 0.175

Segfault - after/during(?) storing trusted kmers

Hi Friend,

You have a glitch somewhere in the code; see the error message below (it occurs regardless of the parameters):

ratman@x3650:/usr/local/data/ratman/projects/asm01/data/4u$ lighter -k 25 770000000 0.0875 -t 12 -r pe600_R1.fastq.gz -r pe600_R2.fastq.gz
[2014-06-02 13:14:38] =============Start====================
[2014-06-02 13:16:11] Finish sampling kmers
[2014-06-02 13:16:26] Finish storing trusted kmers
Segmentation fault

invalid fastq file not being detected

It seems that Lighter just checks whether the first character of the file is '@' to consider it valid.
For example, this input file (test.fq) was considered correct:
@iris:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
IRIS:7:1:17:394#0/1
aaabaa]baaaaa_aab]D^^baYDW]abaa^
IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
IRIS:7:1:17:800#0/1
ababbaaabaaaaa]ba]aaaaYD_aXT
IRIS:7:1:17:1757#0/1
TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC
IRIS:7:1:17:1757#0/1
aaaaaaaaa`aaaa_^a```]][Z[DY^XYV^_Y
ksdsada
dada
e
e
5

output:
./lighter -r test.fq -k 17 5000000 0.1 -t 10
[2014-12-08 11:35:35] =============Start====================
[2014-12-08 11:35:35] Bad quality threshold is "T"
[2014-12-08 11:35:36] Finish sampling kmers
[2014-12-08 11:35:36] Bloom filter A's error rate: 0.000000
[2014-12-08 11:35:36] Finish storing trusted kmers
[2014-12-08 11:35:36] Finish error correction
Processed 5 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads
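
A stricter per-record check would catch this; a minimal sketch (not Lighter's code) that validates every 4-line record instead of only the first character:

def validate_fastq(path):
    """Check each record for an '@' header, a '+' separator line,
    and matching sequence/quality lengths."""
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    if not lines or len(lines) % 4 != 0:
        return False
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            return False
        if len(seq) != len(qual):
            return False
    return True

print(validate_fastq("test.fq"))   # False for the file above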

EC on Raw or Trimmed?

Good day,

I am struggling to find sources on whether it is best to perform error correction on raw reads or on quality-trimmed and filtered reads. Can you perhaps give me advice on this, and perhaps direct me to a paper?

Kind regards,
Allison

A couple of suggestions

  1. Write messages to stderr instead of stdout. The key difference is that data written to stdout is only sent to a redirected file when some buffer is full, whereas data written to stderr immediately updates the redirected file. When I submit a lighter job to LSF/SGE/etc., I do not see the progress until it finishes. Writing messages to stderr won't have this problem (see the sketch after this list).

  2. Lighter is slow when it is linked against zlib-1.2.3. Here is part of the timing report when lighter is statically linked against zlib-1.2.8 (4 seconds to read through the input file):

    [2015-01-10 19:17:32] Scanning the input files to infer alpha(sampling rate)
    [2015-01-10 19:17:36] Average coverage is 43.913 and alpha is 0.159
    

    And here is when zlib-1.2.3 is in use (31 seconds):

    [2015-01-10 19:17:49] Scanning the input files to infer alpha(sampling rate)
    [2015-01-10 19:18:20] Average coverage is 43.913 and alpha is 0.159
    

    The huge performance difference is probably caused by the use of gzgets(). In zlib-1.2.3, this function is not buffered and performs badly; zlib-1.2.4 or later uses an internal buffer, which dramatically improves performance. Please see this blog post for details. Note that vcftools also has this issue: it warns the user when an old zlib is on the system, and you may do the same. I am also providing a precompiled binary with zlib-1.2.8 statically linked.

    BTW, sga (@jts) uses a third-party C++ wrapper for zlib. I guess it should be performant even with zlib-1.2.3. You may also consider using that; it adds only two files.
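
To illustrate suggestion 1: a minimal sketch of the stderr pattern (in Python rather than Lighter's C++, purely to show the behaviour), so progress lines appear immediately even when stdout is redirected to a file:

import sys
import time

def log(message):
    # Unlike a redirected stdout, stderr is not block-buffered,
    # so each progress line shows up as soon as it is written.
    print(message, file=sys.stderr, flush=True)

log("[step 1] sampling kmers")
time.sleep(1)                              # stand-in for real work
log("[step 2] storing trusted kmers")
print("final summary goes to stdout")      # only this lands in '> out.txt'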

Multiple input files

Good day,

I have 16 .gz files (8 forward, 8 reverse). Can I write the command as follows:

./lighter -r one1.fq.gz -r one2.fq.gz -r two1.fq.gz -r two2.fq.gz -r three1.gz -r three2.gz -k 17 5000000 0.1 -t 32

With all 16 input files in the same command?
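
For what it's worth, the repeated -r arguments can be generated rather than typed by hand; a minimal sketch (the glob pattern is a placeholder, the other parameters are taken from the command above):

import glob

read_files = sorted(glob.glob("*.fq.gz"))       # placeholder pattern for the 16 files
command = ["./lighter"]
for path in read_files:
    command += ["-r", path]                     # one -r per input file
command += ["-k", "17", "5000000", "0.1", "-t", "32"]
print(" ".join(command))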

Kind regards,
Allison

modify extension of output files

Running Lighter with paired-end read files ending in the extensions .R1.fastq.gz and .R2.fastq.gz yields output files with the extensions .R1.fastq.cor.fq.gz and .R2.fastq.cor.fq.gz.

Could you drop the redundant .fastq extension when naming the output files?

My command is:
lighter -r "$f" -r "$reverse" -K 19 41020000 -od "$OUT_DIR" -trim -stable -t 8 2>> "$OUT_DIR/lighter.out"
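
For comparison, a name-derivation rule that drops the redundant extension might look like this sketch (a hypothetical helper, not Lighter's current behaviour):

def corrected_name(read_file):
    """Strip .gz/.fastq/.fq before appending .cor.fq.gz."""
    name = read_file
    for suffix in (".gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    return name + ".cor.fq.gz"

print(corrected_name("sample.R1.fastq.gz"))   # sample.R1.cor.fq.gz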

Trimmed and untrimmed read reported

Hi, I ran the following command on several human samples.

lighter -r proband_1.fastq -r proband_2.fastq -K 25 3100000000 -t 8

Note that I did not invoke the -trim flag. However, the output contains many reads that are reported twice in Lighter's output. It appears that the first appearance is the trimmed version and the second is the untrimmed version.

@E00170:22:H00WVALXX:1:1212:18711:26308/1 cor bad_suffix=9 ak
TCAGATGCTCACCTGGGGGTGTGGGTGCTGCTCCAGGTTGTCAGATGCTCACCAGGGGGTGCAGGGGCTGCTCCC  
--
@E00170:22:H00WVALXX:1:1212:18711:26308/1 cor bad_suffix=9 ak                                                                                               
TGCTCGCCTTGGGGTGTGGGTGCTCTTCCAGGCTGTCACGTGCGCACCTGGGGGTGCAGGGTGCTGTTGCAGGCTGTCAGATGCTCACCTGGGGGTGTGGGTGCTGCTCCAGGTTGTCAGATGCTCACCAGGGGGTGCAGGGGCTGCTCCC
--
@E00170:22:H00WVALXX:1:2114:24984:53135/1
TCCATTCCAGTTGATTCCATTCCACTCCGTTTCATTCCATT
--
@E00170:22:H00WVALXX:1:2114:24984:53135/1
TTCCACTCCATTCCATATTATTCCATTACACTCCATTCCATTCTATTACTTTCATTTCCATTCAATTCCATTCCAGTTGATTCCATTCCACTCCGTTTCATTCCATTTGAGTCCATTCCATTCCATTCCATTCCATTCCGTTCCATTCCAT

The duplicated reads aren't adjacent in the output; sometimes there is error-correction metadata appended to the record and sometimes there is not. Any idea what the problem might be?

This was run using the latest master (a768570) on Linux.
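
The duplicates can be confirmed with a quick scan of the header lines; a minimal sketch (the file name is a placeholder for Lighter's corrected output):

import gzip
from collections import Counter

def duplicated_read_ids(path):
    """Return read IDs that occur more than once in a FASTQ file."""
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                counts[line.split()[0]] += 1   # the ID is the first header field
    return [read_id for read_id, n in counts.items() if n > 1]

print(len(duplicated_read_ids("proband_1.cor.fq")))   # placeholder name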

Segmentation fault when using -t > 2

Hello @mourisl,

I wanted to use Lighter as part of the shovill pipeline (with several threads) but ran into a "Segmentation fault (core dumped)" error message right after the message "Finish storing trusted kmers". I had a look at main.cpp and realized that you handle the processing differently depending on the number of threads (1, 2, and more than 2). I tried to pinpoint the problem and finally resolved it by changing lines 806-809 to read:

int batchSize[3] ;
bool init = true, canJoinOutputThread = false ;
int tag = 2, prevTag ;
int fileInd[3] ;

Notice I enlarged the arrays' size to 3, as I think you will use three batches later on.

Trying Linuxbrew, I realized that the provided binary (bottle) works without problems. If I build the formula from source (using the system's GCC 7.3.0), I get the above error. If I install Linuxbrew's GCC 5.5.0 and compile the formula from source, it again works like the bottle.

I have not provided a pull request as I am unsure of the nature of this problem. Is it possible that GCC 5.5.0 might tolerate the wrong array size (if wrong at all) whereas GCC 7.3.0 is stricter with it?

Any comments on this?

Best regards
Stefan

Bloom Filter A error rate

In your experience, what Bloom filter A false positive rate is achievable: 0.05, more, less? Should I attempt to optimize parameters so that this rate is minimized, and if so, which parameters seem to be most influential?
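
For intuition, the standard Bloom filter estimate (the textbook formula, not Lighter's internal sizing code) says that with m bits, h hash functions, and n inserted k-mers the false positive rate is roughly (1 - e^(-h*n/m))^h, so it mainly reflects how many distinct k-mers are inserted relative to the filter size:

import math

def bloom_false_positive_rate(m_bits, n_items, n_hashes):
    """Textbook estimate: (1 - e^(-h*n/m))^h."""
    return (1.0 - math.exp(-n_hashes * n_items / m_bits)) ** n_hashes

# Example: ~15 bits per inserted item with 3 hash functions -> well under 1%.
print(bloom_false_positive_rate(m_bits=15_000_000, n_items=1_000_000, n_hashes=3))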

auto-detect quality encoding

I'm working with some old GA2 reads and get the output below, which I suppose has to do with differences in Phred encoding. It would be awesome if Lighter could auto-detect the encoding, or at least allow me to specify the Phred offset.

[2014-11-17 23:50:15] Bad quality threshold is "-"
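
A common heuristic for the auto-detection (not something Lighter currently does, as far as I know) is to scan the observed quality characters and use their range to choose between Phred+33 and Phred+64; a minimal sketch:

import gzip

def guess_phred_offset(path, max_reads=10000):
    """Guess 33 vs 64: ASCII codes below 59 only occur with Phred+33,
    codes above 74 suggest Phred+64."""
    lo, hi = 255, 0
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:                         # quality line
                codes = [ord(c) for c in line.rstrip("\n")]
                if codes:
                    lo, hi = min(lo, min(codes)), max(hi, max(codes))
    if lo < 59:
        return 33
    if hi > 74:
        return 64
    return 33                                      # ambiguous: default to Phred+33

print(guess_phred_offset("reads.fq"))              # placeholder name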

Update quality scores when doing error correction

Would it be possible to update the quality scores of corrected bases? I'm not sure if there is a clever way to do it based on the confidence of the trusted k-mer, but even setting them to a constant, user-specified value would be nice so that downstream applications don't ignore those bases.

Invalid fastq files produced

Hi,
I used lighter recently and found out that sometimes it produced invalid fastq files like this:

@ST-E00114:418:HHKV7CCXX:8:1109:26484:40020 2:N:0:GCCAAT
TTCAAGAGGTCTTAAGGAGGCTAACTCATATATAGGTTAGGTCTAGATCTAGAGGATGAGAGCATGGGAGACATGGATGTACATAGACTCGATAAGTACTT
+
AA<F<FAFAA<,FFKKK<FFFKKKKKKKKKKKFKKKAFFF79:40020 2:N:0:GCCAAT

As shown, the quality line is truncated and merged with a segment of the header line.

Error correction fails with small genome sizes

Lighter doesn't appear to like small genomes. Here I'm using ~4M reads and a 10 kb genome size, with which Lighter does nothing. If I increase the genome size to a megabase, the report looks more like what I'd expect. Is this a bug or a feature? Tested on Ubuntu and Yosemite.

10kb genome size:

bede-rmbp:lighter $ ./lighter -k 11 10000 0.1 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq -t 2
[2014-12-04 12:09:31] =============Start====================
[2014-12-04 12:09:32] Bad quality threshold is #
[2014-12-04 12:09:37] Finish sampling kmers
[2014-12-04 12:09:37] Bloom filter A's error rate: 0.998423
[2014-12-04 12:09:50] Finish storing trusted kmers
[2014-12-04 12:10:06] Finish error correction
Processed 3782706 reads:
    0 are error-free
    Corrected 0 bases(0.000000 corrections for reads with errors)
    Trimmed 0 reads with average trimmed bases 0.000000
    Discard 0 reads

1Mb genome size:

bede-rmbp:lighter $ ./lighter -k 11 1000000 0.1 -r H140260768-1B.R1.fastq -r H140260768-1B.R2.fastq -t 2
[2014-12-04 12:34:32] =============Start====================
[2014-12-04 12:34:33] Bad quality threshold is "#"
[2014-12-04 12:34:38] Finish sampling kmers
[2014-12-04 12:34:38] Bloom filter A's error rate: 0.001192
[2014-12-04 12:35:07] Finish storing trusted kmers
[2014-12-04 12:35:26] Finish error correction
Processed 3782706 reads:
    3543702 are error-free
    Corrected 233998 bases(0.979055 corrections for reads with errors)
    Trimmed 0 reads with average trimmed bases 0.000000
    Discard 0 reads
