decodegenetics / bamhash

License: GNU General Public License v3.0

BamHash

Hash BAM and FASTQ files to verify data integrity

For each pair of reads in a BAM or FASTQ file we compute a hash value composed of the read name, whether the read is first or last in the pair, the sequence and the quality values. All the hash values are summed, so the result is independent of the ordering within the files. The result can be compared to verify that the pair of FASTQ files contains the same read information as the aligned BAM file. The program is written in C++ and uses SeqAnHTS v1.0 for parsing FASTQ, gzip-compressed FASTQ and BAM files. SeqAnHTS is a fork of the SeqAn library (Döring et al., 2008) that uses htslib to read SAM/BAM/CRAM files.
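The order-independence described above can be illustrated with a minimal Python sketch. It is not BamHash's implementation: the way the fields are concatenated, the truncation of the MD5 digest to 64 bits, and the function names `read_hash` and `checksum` are all assumptions made for illustration. Only the general idea (hash each read, then sum the hashes) comes from the description.

```python
import hashlib

MASK64 = (1 << 64) - 1  # keep the running sum within 64 bits


def read_hash(name, mate, seq, qual):
    """Hash one read. The field layout here is an assumption, not BamHash's."""
    payload = f"{name}/{mate}{seq}{qual}".encode("ascii")
    digest = hashlib.md5(payload).digest()
    # Assumption: use the first 8 bytes of the MD5 digest as a 64-bit integer.
    return int.from_bytes(digest[:8], "little")


def checksum(reads):
    """Sum the per-read hashes modulo 2**64 and count the reads."""
    total = 0
    count = 0
    for name, mate, seq, qual in reads:
        total = (total + read_hash(name, mate, seq, qual)) & MASK64
        count += 1
    return f"{total:016x}", count


# Hypothetical toy reads: (name, mate number, sequence, quality string).
reads = [
    ("r1", 1, "ACGT", "IIII"),
    ("r1", 2, "TTGA", "HHHH"),
    ("r2", 1, "GGCC", "JJJJ"),
]

# Addition is commutative, so shuffling the reads gives the same result.
print(checksum(reads) == checksum(list(reversed(reads))))
```

Because addition is commutative, a coordinate-sorted BAM file and the unsorted FASTQ files it came from would yield the same sum under this scheme, provided every per-read field is unchanged.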

Manuscript

Arna Óskarsdóttir, Gísli Másson and Páll Melsted (2015) BamHash: a checksum program for verifying the integrity of sequence data. Bioinformatics, btv539.

A preprint is available on bioRxiv.

Usage

The program has three executables which are used for different filetypes. Running them with --help displays detailed help messages.

Common options

All programs work with sets of reads. Each read is made up of a read name, a sequence and quality information. All of these components go into the hash, but the read name or the quality information can be ignored if necessary. This would be the case if a pipeline mangled the names, quantized the quality scores, or changed the quality scores during realignment.

The default mode is to assume paired end reads. If you have single end reads you can supply the --no-paired option.

A debug option -d prints the information and hash value of each read individually. This can be helpful if BamHash is not cooperating with your pipeline.

Multiline FASTA and FASTQ are both supported, as is gzipped input for either format.

BAM

bamhash_checksum_bam [OPTIONS] <in.bam> <in2.bam> ...
bamhash_checksum_bam [OPTIONS] -r <reference-file> <in.cram>

processes a number of BAM files. BAM files are assumed to contain paired end reads. If you run with --no-paired it treats all reads as single end and displays a warning if any read is marked as "second in pair" in the BAM file.

FASTQ

bamhash_checksum_fastq [OPTIONS] <in1.fastq.gz> [in2.fastq.gz ... ]

processes a number of FASTQ files. FASTQ files are assumed to contain paired end reads, such that the first two files contain the first pair of reads, etc. If the read names in the two files of a pair don't match, the program exits with failure.

FASTA

bamhash_checksum_fasta [OPTIONS] <in1.fasta> [in2.fasta ... ]

processes a number of FASTA files. All FASTA files are assumed to contain single end reads with no quality information. To compare to a BAM file, run bamhash_checksum_bam --no-paired --no-quality.

Compiling

External dependencies are OpenSSL, for the MD5 implementation, and the htslib library (version 1.9).

bamhash's Issues

Check option

As with the standard md5sum command, it would be really handy to have a 'check' option (-c) to check a fastq/bam against its bamhash and programmatically verify its consistency.

Also, it's confusing that fastq read pairs are only counted once whereas they're counted twice in the bam. It would be clearer if the numbers agreed as well as the hashes.
Thanks!

FASTQ read name/description when computing hash

Hi,

I have a question regarding how you use the FASTQ description field to calculate the hash. The question arises because I've been doing some tests and I only manage to get the same md5 using the -R option in both the BAM file and the original FASTQ files.

More details following:

I've trimmed a FASTQ sample for testing purposes, the reads look like this:

~> zcat ../data/NA12878_trimmed_1.fastq.gz | head -n 4
@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1
GGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG
+
CC@FFFFFHHHHHJJJFHIIJJJJJJIHJIIJJJJJJJJIIGIJJIJJJIJJJIJIJJJJJJJJJJIJHHHHFFFDEEEEEEEEDDDCDDEEDDDDDDDDD

When, after the analysis, I run bamhash_checksum_fastq on the FASTQ files and bamhash_checksum_bam on the resulting BAM file, I get different md5's:

~> bamhash_checksum_fastq ../data/NA12878_trimmed_*
a05de49644a0fb5d        10000

~> bamhash_checksum_bam final/NA12878_trimmed/NA12878_trimmed-ready.bam
d4d5ece0f619d83d        20000

When I convert the BAM file back to FASTQ, I see that the FASTQ read description disappears, i.e.:

~> samtools fastq final/NA12878_trimmed/NA12878_trimmed-ready.bam | head -n 4
@ERR194147.6389/2
CATCGGATTTTTGTTTTTTTTGTTTTGGGTGGGGGGGGTTGGTGGGGTTGTGTGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGTGGTTGG
+
11++40+2)2,+)2:3AA8)))00)))((('''''&&&))&&(((&&&&&&(((++(&05;BB7@>B@BDBBDDB>BDD@@3>9<&5-&&&&&&)&&)+(&

Is that description after the readname used to calculate the hash? I'm pretty confident that this is the problem, since if I run BamHash with -R it does return the expected result:

~> bamhash_checksum_fastq -R ../data/NA12878_trimmed_1.fastq.gz ../data/NA12878_trimmed_2.fastq.gz
f4524c00c70e9b83        10000

~> bamhash_checksum_bam -R final/NA12878_trimmed/NA12878_trimmed-ready.bam
f4524c00c70e9b83        20000

Thanks for your help!
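One way to examine the hypothesis in this report is to reduce both headers to a bare read name and compare them. The Python sketch below is a diagnostic aid only: the helper name `bare_read_name` is hypothetical, and whether BamHash itself includes or strips the description field is not established here.

```python
import re


def bare_read_name(header):
    """Return the read name without '@', description, or a trailing /1 or /2.

    This is an illustrative normalization, not BamHash's own parsing.
    It expects a non-empty FASTQ header line.
    """
    token = header.lstrip("@").split(None, 1)[0]  # drop everything after the first whitespace
    return re.sub(r"/[12]$", "", token)           # drop a mate suffix if present


# Header from the original FASTQ file (name followed by a description).
original = "@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1"
# Header as written by samtools fastq from the BAM file (no description).
from_bam = "@ERR194147.6389/2"

print(bare_read_name(original))
print(bare_read_name(from_bam))
```

If a tool hashed the full header line of the original FASTQ but only the name stored in the BAM, the two inputs would differ for every read, which would be consistent with the hashes agreeing only when -R is given.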

Does this tool take in fully-processed bams as input or unprocessed bams?

When running bamhash on a bam file, should the bam be fully processed (i.e. sorted, deduped, and recalibrated)?

I just tested bamhash on a fully-processed bam file and its source paired-end fastq files, however the resulting hashes differ, so I'm wondering if it's because the input bam was fully processed.

UPDATE:

I just re-ran bamhash on a non-fully processed bam file, and got the following results:

bam result:
2490f971d6f15fa2	764438888

source fastq result:
2490f971d6f15fa2	382219444

What do the two columns represent? Is it enough that one of the columns match between the files?
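The two result lines above can be compared programmatically. The Python sketch below assumes the first column is the hash and the second a count of records (other reports in this tracker call it the read count); the function names `parse_result` and `compare` are hypothetical and not part of BamHash.

```python
def parse_result(line):
    """Split one 'hash<whitespace>count' output line into (hash, count)."""
    digest, count = line.split()
    return digest, int(count)


def compare(bam_line, fastq_line):
    """Compare a BAM result line with a FASTQ result line."""
    bam_hash, bam_count = parse_result(bam_line)
    fq_hash, fq_count = parse_result(fastq_line)
    return {
        "hash_match": bam_hash == fq_hash,
        # For the paired-end data reported here, this ratio is 2.
        "count_ratio": bam_count / fq_count,
    }


# Values copied from the report above.
result = compare("2490f971d6f15fa2\t764438888", "2490f971d6f15fa2\t382219444")
print(result)
```

In this report the hashes are identical and the BAM count is exactly twice the FASTQ count, which fits the assumption that the FASTQ tool counts each read pair once while the BAM tool counts each read.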

CRAM support?

Hello, are there any plans to support CRAM directly?

Thanks,
Andreas

Read count in paired fastq bamhash computation

The read count reported for a pair of FASTQ files is not the same as the read count reported for the BAM file that the FASTQ files were converted from (as already mentioned in #4 (comment)). The FASTQ count is exactly half the BAM count:

$ bamhash_checksum_bam sample.bam
c0039f91693d4bfd	1749217454

$ bamhash_checksum_fastq sample_R1.fq sample_R2.fq
c0039f91693d4bfd	874608727

My expectation would be that the read count numbers are the same (I would expect the number from the bam bamhash computation). Is this behavior intentional? Otherwise it would be great if this could be fixed!

$ bamhash_checksum_fastq --version
bamhash_checksum_fastq version 1.1

Thanks,
Oliver

Add support for interleaved FASTQ files

Thanks for developing this! This will certainly allow us to confidently delete original data files by first verifying data integrity.

I do have a feature request: it would be useful to have support for interleaved FASTQ files (where read pairs are consecutive in the file; compatible with BWA MEM).

Also, perhaps once this is implemented (if you choose to do so), could you create a new release with all the commits since v1.0? Thanks.

new tagged version

A new version would be great so we can integrate this tool downstream in Galaxy.

Thanks!