lh3 / bfc Goto Github PK

High-performance error correction for Illumina resequencing data

License: MIT License

Makefile 0.55% C 36.51% C++ 5.22% JavaScript 1.41% TeX 56.32%

bfc's Introduction

Introduction

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though also performs well for small genomes.

The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in my fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). The use of bloom filter is how BFC is named, though other correctors such as Lighter and Bless actually rely more on bloom filter than BFC.

Usage

BFC can be invoked as:

bfc -s 3g -t16 reads.fq.gz | gzip -1 > corrected.fq.gz

where option -s specifies the approximate size of the genome. It is possible to use one set of reads to correct another set:

bfc -s 3g -t16 readset1.fq.gz readset2.fq.gz | gzip -1 > corrected_readset2.fq.gz

and to process data from Unix pipes ("<(command)" is bash specific):

bash -c "bfc -s 3g -t16 <(bzip2 -dc reads.fq.bz2) <(bzip2 -dc reads.fq.bz2) | gzip -1 > out.fq.gz"

BFC also offers an option to trim reads containing singleton k-mers (don't switch -s and -k as some options are ordered):

bfc -1 -s 3g -k51 -t16 corrected.fq.gz | gzip -1 > trimmed.fq.gz

This command line keeps k-mer occurring twice or more in a bloom filter (with some false positives) and identifies the longest stretch in a read that has hits in the bloom filter. K-mer trimming is about four times as fast as error correction.

BFC-KMC

An alternative implementation of the algorithm is available at the kmc branch of this repository. It uses KMC2 for k-mer counting and keeps high-occurrence k-mers in a bloom filter. BFC-KMC should be invoked as:

kmc -k55 reads.fq.gz prefix tmpdir
bfc-kmc -t16 prefix reads.fq.gz | gzip -1 > corrected.fq.gz

KMC2 source code and precompiled binaries are available at the KMC website.

bfc's People

Contributors

Stargazers

Watchers

Forkers

bioinformaticsarchive cyang-2014 xtmgah zhmz90 sjackman amystringer abdalbassir yuanyuan517 jz44 ingo-ch wangren09 tianyic xiaoguang-hh

bfc's Issues

Segmentation fault

Any thoughts on why bfc segfaulted here? It had been running for 11.5 hours and was using 952 GB of RAM when it crashed.

❯❯❯ bfc -v
r181
❯❯❯ gunzip -c data.fq.gz | bfc -t16 -s14G -d data.bfc /dev/stdin /dev/null
[M::main] applied `-k 35 -b 37'
[M::bfc_count_cb] read 663715 sequences
[M::bfc_count_cb] read 663537 sequences
[M::bfc_count_cb @248.2*112.3%] processed 663715 sequences; # distinct k-mers: 4945319
[M::bfc_count_cb] read 663527 sequences
…
[M::bfc_count_cb] read 663411 sequences
[M::bfc_count_cb @39980.6*1512.8%] processed 663497 sequences; # distinct k-mers: 12783060251
[M::bfc_count_cb] read 663383 sequences
[M::bfc_count_cb @39987.7*1512.8%] processed 663411 sequences; # distinct k-mers: 12784917041
[M::bfc_count_cb] read 663362 sequences
Segmentation fault (core dumped)

Description of output header

Hi,
I'd like to see how/if error correction worked for my dataset.
Can you explain what e.g.
ec:Z:3
ec:Z:0_0:10_0_0:0_0
ec:Z:0_31:1_0_0:0_0
ec:Z:0_6:6_1_5:4_0
means? What about unaltered header lines?

thanks,
Sven

Memory consuming

Hi Heng,

I am using bfc to do error correction on a 32G server. The reads are from illumina with length 100bp, and coverage is ~10x. I am using command "bfc -s 3g -t 5 -k 19 -b 20 XXX.fq". It seems the program is always killed because the memory consuming is high. Is there a way to run with less memory usage? Thank you.

Best,
Chong

code for arXiv evaluation table

Heng,

Are you willing to share the code you used to evaluate the different error correction algorithms, e.g. Table 1 in your arXiv manuscript?

Matt

How does bfc assign quality scores?

After error correction, I noticed that some quality scores are strange:
?????????????????????????????????????????????????????????????????????????????????????????????+????????+???????????+?????????????

drop reads containing specific set of kmers

Hey @lh3, this looks great.

But we do low-coverage (5x) sequencing of many (non-human) individuals - where removing rare kmers is a bad idea. So our ideal approach is to combine all data into a big dataset (500-1000x coverage total), use that to identify bad kmers, dump those kmers to a file. Then go through each individual low-coverage dataset to eliminate the list of bad kmers. Can you add an option to bfc that can help with this last step? Or is it already hidden somewhere?

Cheers,
Yannick

bfc with MiSeq data?

Is it possible to use bfc with MiSeq data with read lengths on the order of 250 to 300bp?

phred33 to 64 encoding change

Hi,

We have some illumina 1.5/phred64 files that we need to process. We used Trimmomatic's tophred33 option to convert to phred33 encoding (the idea being to do all downstream analysis in phred33). But after then running through bfc, some of the reads were converted back to phred64. Here is an example. The conversion back to phred64 happens in the reads that get "ec:Z:3" added to the info line.

original read

@FCC3PRVACXX:5:1101:5113:1970#AGCGCTAG/1
NACACCGGCACCCTTAAAATTCTACGGTATCGATTTCGGAGACAGCAGACTTTTGGAATTATACGTATTCCGTTTATATGTGTTCTGTAAGCAGTTTTAT
+
BPaceeeegggggiiiihiiiiiiiiiegiiifhiiiiihifghiiiihigggggeeeeedddddbdccedcbcccccedcceccdcddcbcccccdccc

post trimmomatic -tophred33, and leading N also trimmed

@FCC3PRVACXX:5:1101:5113:1970#AGCGCTAG/1
ACACCGGCACCCTTAAAATTCTACGGTATCGATTTCGGAGACAGCAGACTTTTGGAATTATACGTATTCCGTTTATATGTGTTCTGTAAGCAGTTTTAT
+
1BDFFFFHHHHHJJJJIJJJJJJJJJFHJJJGIJJJJJIJGHIJJJJIJHHHHHFFFFFEEEEECEDDFEDCDDDDDFEDDFDDEDEEDCDDDDDEDDD

post bfc

@FCC3PRVACXX:5:1101:5113:1970#AGCGCTAG/1 ec:Z:3
ACACCGGCACCCTTAAAATTCTACGGTATCGATTTCGGAGACAGCAGACTTTTGGAATTATACGTATTCCGTTTATATGTGTTCTGTAAGCAGTTTTAT
+
Paceeeegggggiiiihiiiiiiiiiegiiifhiiiiihifghiiiihigggggeeeeedddddbdccedcbcccccedcceccdcddcbcccccdccc

Segfault

Hi, I got a segfault and got the following from gdb tracer.
#0 bseq_read (fp=0x0, chunk_size=100000000, keep_comment=keep_comment@entry=0, n_=n_@entry=0x7fffb00008c0) at bseq.c:56
#1 0x0000000000404f6f in bfc_ec_cb (shared=0x7fffffffc790, step=, _data=) at correct.c:580
#2 0x0000000000401cfd in ktp_worker (data=0x7fffffffc670) at kthread.c:106
#3 0x00007ffff76ae0db in start_thread () from /lib64/libpthread.so.0
#4 0x00007ffff73deddd in clone () from /lib64/libc.so.6

I have used bfc on other inputs fine, but for whatever reason it thinks fp is 0x0.

Please tag a new stable release

Thanks!

bfc strips pairing info

Heng - working with r181 here.

If a set of reads have paired-end info encoded as per the most recent fastq format:

@HWI-D00310:79:C730GANXX:3:1101:1153:1917 1:Y:0:ATTACTCG
NATTTTGTGGCCACAAAAGAGTATGAACATTTAAAATAGTGAGAGTGGATAACCTTTATAGAGGACCCAATAACACACTGGTCTTCTATGGTCCTCTCCTGGTGTCTGTAATGTGTTGTCTGTGC
+
#=30<E1=1>C1E01@00@C11>E1EGGF111111111?1<:1:E1:011<1<1>11>111=:1100/>0=F>1=1:C111:<:<1111111:000<00>F:;0:0>00;B00000;;C0<0000
@HWI-D00310:79:C730GANXX:3:1101:1153:1917 2:Y:0:ATTACTCG
TCTATTTGGTTCTGGTTGATACCTGTTGGAGTGGTTGAGGTAGTGTTGCATGGTATAAGGGTTAAAGGAATGGTTCCAGGTTTTCAGATTGATGAAGATTTTCATATTGTAGTGCTTTATGCGGC
+
3<3<0111100?11111/1@111=11<10E11101=/=111=>=111011?<1=11>1111110>11111:<11:11111=11010:0=00000=E0000?00000000000:008<00808...

bfc correction (bfc -s 800m -k 55 -t 16 inter.fq) results in stripping off the pairing info, which is problematic.

@HWI-D00310:79:C730GANXX:3:1101:1153:1917   ec:Z:3
NATTTTGTGGCCACAAAAGAGTATGAACATTTAAAATAGTGAGAGTGGATAACCTTTATAGAGGACCCAATAACACACTGGTCTTCTATGGTCCTCTCCTGGTGTCTGTAATGTGTTGTCTGTGC
+
#=30<E1=1>C1E01@00@C11>E1EGGF111111111?1<:1:E1:011<1<1>11>111=:1100/>0=F>1=1:C111:<:<1111111:000<00>F:;0:0>00;B00000;;C0<0000
@HWI-D00310:79:C730GANXX:3:1101:1153:1917   ec:Z:3
TCTATTTGGTTCTGGTTGATACCTGTTGGAGTGGTTGAGGTAGTGTTGCATGGTATAAGGGTTAAAGGAATGGTTCCAGGTTTTCAGATTGATGAAGATTTTCATATTGTAGTGCTTTATGCGGC
+
3<3<0111100?11111/1@111=11<10E11101=/=111=>=111011?<1=11>1111110>11111:<11:11111=11010:0=00000=E0000?00000000000:008<00808...

The workflow I am using is:

interleave-reads.py (from @dib-lab/khmer)
bfc
split-paired-reads.py
...

Obviously, when the pairing info is removed, splitting back into their left and right files does not work.

Provide summary information

It would be nice to have summary information shown at the end of the output, such as:

Total reads: XXX
Correctable reads: XXX (x%)
Reads corrected: XXX (x%)
Total bases corrected: XXX (x%)
Error rate: (x%) -- total bases corrected / total correctable bases

Seg fault/double free or corruption

Hi Heng, I saw the arXiv preprint and and trying out bfc but I'm getting an odd error. The input is adapter trimmed <125 HiSeq reads and for the same file, I will alternately get either a seq fault or ** Error in `bfc': double free or corruption (out).

command:
bfc -s 2.9g -k 55 -t 31 smaller.fq > test_corrected.fq

compiled with gcc version 4.8.2. 10,000 read test fq file is here: https://www.dropbox.com/s/67d24d9or15e5wj/smaller.fq?dl=0

Read counter not incrementing

I'm running bfc on my full set now (1.4 billion reads) and the log is giving me a sequence count that is oscillating. Is this the expected behavior?

bfc -s 2.9g -k 55 -t 31 all_reads.fastq.gz |gzip >all_reads.cor.fastq.gz

[M::bfc_count_cb @4667.7_1946.9%] processed 809616 sequences; # distinct k-mers: 2635769628
[M::bfc_count_cb] read 809443 sequences
[M::bfc_count_cb @4670.6_1947.4%] processed 809443 sequences; # distinct k-mers: 2636069642
[M::bfc_count_cb] read 809500 sequences
[M::bfc_count_cb @4673.9_1947.7%] processed 809500 sequences; # distinct k-mers: 2636370228
[M::bfc_count_cb] read 809470 sequences
[M::bfc_count_cb @4676.6_1948.2%] processed 809470 sequences; # distinct k-mers: 2636673453
[M::bfc_count_cb] read 809400 sequences
[M::bfc_count_cb @4679.5*1948.6%] processed 809400 sequences; # distinct k-mers: 2636970314

Document the meaning of -R (refine error-corrected reads)

Refining of reads is not discussed in the paper or the README (that I can find). It would be helpful to know what it means and when it's necessary.

BFC-ht versus BFC-bf

The arXiv preprint reports that the ht mode of BFC is generally more accurate than the bf mode.. Does the code in the readme e.g., bfc -s 3g -t16 reads.fq.gz | gzip -1 > corrected.fq.gz specify the ht or bf mode?