
caveman's Introduction

CaVEMan

A C implementation of the CaVEMan algorithm. CaVEMan uses an expectation maximisation approach to call single base substitutions in paired tumour/normal data. It is designed for use with a compute farm/cluster; most steps in the program make use of an index parameter. The split step divides the genome into chunks of adjustable size so that runtime and memory usage can be tuned. For simple execution of CaVEMan please see cgpCaVEManWrapper.


Installation

See INSTALL.TXT

Prerequisites

  • As of version 1.6.0 CaVEMan supports CRAM files (with index).
  • BWA-mapped, indexed, duplicate-marked/removed BAM files for both a normal and a tumour sample
  • Reference.fasta and index
  • A one-based, BED-style tab-separated file of regions to ignore during analysis (see the specified format below).
  • zlib >= 1.2.3.5
  • linasm >= 1.13

Optional inputs (will result in more accurate calls)

  • Normal and tumour copy number files (see specified format).
  • An estimate of the normal contamination of the tumour

Processing flow

CaVEMan is executed in several distinct steps in the order listed below.

Setup

Generates a config file for use with the remaining CaVEMan steps (it will save you a lot of time typing command-line arguments). Also generates a file named 'alg_bean' in the run directory.

bin/caveman setup || ./bin/setupCaveman
Usage: caveman setup -t tum.bam -n norm.bam -r reference.fa.fai -g ignore_regions.tab -e tum_cn.bed -j norm_cn.bed [-f path] [-l path] [-a path] [-wzu]

-t  --tumour-bam [file]             Location of tumour bam
-n  --normal-bam [file]             Location of normal bam
-r  --reference-index [file]        Location of reference fasta index
-g  --ignore-regions-file [file]    Location of tsv ignore regions file

Optional
-c  --config-file [file]            File to write caveman run config file [default:'./caveman.cfg.ini']
-f  --results-folder [file]         Folder to write results [default:'./results']
-l  --split-file [file]             File to write list of split sections [default:'./splitList']
-a  --alg-bean-file [file]          Location to write alg-bean [default:'./alg_bean']
-w  --include-smith-waterman        Include SW mapped reads in the analysis
-z  --include-single-end            Use single end reads for this analysis
-u  --include-duplicates            Include reads marked as duplicates in the analysis
-e  --tumour-copy-no-file [file]    Location of tumour copy number bed file (if the extension is not .bed the file will be treated as having a 1-based start). If no copy number file is supplied then the default cn of 2 will be used
-j  --normal-copy-no-file [file]    Location of normal copy number bed file (if the extension is not .bed the file will be treated as having a 1-based start). If no copy number file is supplied then the default cn of 2 will be used
-h  --help                          Display this usage information.
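
For illustration only (the bam, index and ignore-file names below are placeholders, not files shipped with CaVEMan), a minimal setup invocation using just the required options might look like:

bin/caveman setup -t tumour.bam -n normal.bam -r genome.fa.fai -g ignore_regions.tab

With no optional arguments the config file is written to ./caveman.cfg.ini and the alg_bean to ./alg_bean, as described above.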

Split

Requires one job per entry in your reference.fasta.fai file. Each job creates a list of segments to be analysed; the segments are determined by total read count and do not include reads from ignored regions. The size of sections can be tuned using the -m, -c and -e parameters. Once all jobs have completed successfully you will need to concatenate all split files into a single file with the name passed to the setup step via the -l/--split-file parameter (or splitList if you used the default).

bin/caveman split || ./bin/splitCaveman
Usage: caveman split -i jobindex -f path [-c int] [-m int] [-e int]

-f --config-file file           Path to the config file produced by setup [default: 'caveman.cfg.ini'].
-i  --index int                 Job index (e.g. from $LSB_JOBINDEX)

Optional
-c  --increment int             Increment to use when deciding split sizes
-m  --max-read-count double     Proportion of read-count to allow as a max in a split section
-e  --read-count int            Guide for maximum read count in a section
-h  --help                      Display this usage information.
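
A sketch of how the split step might be driven as a farm array job, one element per line of the reference index (the per-contig output naming used in the concatenation is an assumption; check the prefix your run actually produces):

bin/caveman split -i $LSB_JOBINDEX -f caveman.cfg.ini
# after all array elements finish, build the merged split list
cat splitList.* > splitList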

Mstep

Requires one job per entry in the merged split file. The -i parameter refers to a line in the merged split file. -a can be used to tune the size of section retrieved from the bam at a time, which allows tuning of the memory footprint. The mstep will create a file for each job under the results folder (specified as a parameter in the setup step).

bin/caveman mstep || ./bin/mstepCaveman
Usage: caveman mstep -f config.file -i jobindex [-m int] [-a int]

-f --config-file file                  Path to the config file produced by setup [default: 'caveman.cfg.ini'].
-i --index int                         Job index.

Optional
-a --split_size int                    Size of section to retrieve at a time from bam file. Allows memory footprint tuning [default:50000].
-m  --min-base-qual int                Minimum base quality to include in analysis [default:11]
-h  --help                             Display this usage information.
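
As an example, a single mstep array element could be launched as follows; the -a value shown is just an illustration of lowering the fetch size from the default of 50000 to reduce memory use:

bin/caveman mstep -f caveman.cfg.ini -i $LSB_JOBINDEX -a 25000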

Merge

Runs as a single job, merging all of the per-job covariate files generated by the mstep into a single file representing the profile of the whole genome. The resulting covariate array (covs_arr by default) is written to the root run folder, along with a probability array (probs_arr by default).

bin/caveman merge || ./bin/mergeCaveman
Usage: caveman merge -f config_file [-c path] [-p path]

-f --config-file file                Path to the config file produced by setup. [default: 'caveman.cfg.ini']

Optional
-c  --covariate-file filename        Location to write merged covariate array [default: covs_arr]
-p  --probabilities-file filename    Location to write probability array [default: probs_arr]
-h  --help                           Display this usage information.
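
A merge invocation using the default output names listed above would simply be:

bin/caveman merge -f caveman.cfg.ini -c covs_arr -p probs_arr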

Estep

The final step in calling variants with CaVEMan. As with the mstep, one job per entry in the merged split list is required. Copy number (-e, -j) and normal contamination (-k) are required (see Default Settings for advice on obtaining results if you don't have these). Each job will create several files named according to the corresponding line in the merged split file. *_muts.vcf and *_snps.vcf contain the calls above the given cutoffs for somatic and germline (SNP) probabilities respectively. The *.no_analysis.bed file lists sections not analysed by CaVEMan because they were ignored or lacked coverage. Using the debug option (-s) will output another file named *.dbg.vcf containing a line for every analysed position in the genome, with the read counts (as seen by CaVEMan) and the top two calculated probabilities.

bin/caveman estep || ./bin/estepCaveman
Usage: caveman estep -i jobindex [-f file] [-m int] [-k float] [-b float] [-p float] [-q float] [-x int] [-y int] [-c float] [-d float] [-a int]

-i  --index [int]                                Job index (e.g. from $LSB_JOBINDEX)

Optional
-f  --config-file [file]                         Path to the config file produced by setup. [default:'caveman.cfg.ini']
-m  --min-base-qual [int]                        Minimum base quality for inclusion of a read position [default:11]
-c  --prior-mut-probability [float]              Prior somatic probability [default:0.000006]
-d  --prior-snp-probability [float]              Prior germline mutant probability [default:0.000100]
-k  --normal-contamination [float]               Normal contamination of tumour [default:0.100000]
-b  --reference-bias [float]                     Reference bias [default:0.950000]
-p  --mut-probability-cutoff [float]             Minimum probability call for a somatic mutant position to be output [default:0.800000]
-q  --snp-probability-cutoff [float]             Minimum probability call for a germline mutant position to be output [default:0.950000]
-x  --min-tum-coverage [int]                     Minimum tumour coverage for analysis of a position [default:1]
-y  --min-norm-coverage [int]                    Minimum normal coverage for analysis of a position [default:1]
-a  --split-size [int]                           Size of section to retrieve at a time from bam file. Allows memory footprint tuning [default:50000].
-s  --debug                                      Adds an extra output to a debug file. Every base analysed has an output
-g  --cov-file [file]                            File location of the covariate array. [default:'covs_arr']
-o  --prob-file [file]                           File location of the prob array. [default:'probs_arr']
-v  --species-assembly [string]                  Species assembly (eg 37/GRCh37), required if bam header SQ lines do not contain AS and SP information.
-w  --species [string]                           Species name (eg Human), required if bam header SQ lines do not contain AS and SP information.
-n  --normal-copy-number [int]                   Copy number to use when filling gaps in the normal copy number file [default:2].
-t  --tumour-copy-number [int]                   Copy number to use when filling gaps in the tumour copy number file [default:2].
-l  --normal-protocol [string]                   Normal protocol. Ideally this should match -r but is not checked (WGS|WXS|RNA) [default:WGS].
-r  --tumour-protocol [string]                   Tumour protocol. Ideally this should match -l but is not checked (WGS|WXS|RNA) [default:WGS].
-P  --normal-platform [string]                   Normal platform. Overrides the values retrieved from bam header.
-T  --tumour-platform [string]                   Tumour platform. Overrides the values retrieved from bam header.
-M  --max-copy-number [int]                      Maximum copy number permitted. If exceeded, the copy number for the offending region will be set to this value. [default:10].
-h  --help                                       Display this usage information.
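
As a sketch only (the normal contamination value and the species/assembly strings are placeholders for your own data, and the latter are only needed if your bam headers lack AS/SP information; copy number files, passed with -e and -j, are omitted here), one estep array element could be launched as:

bin/caveman estep -f caveman.cfg.ini -i $LSB_JOBINDEX -k 0.1 -g covs_arr -o probs_arr -v GRCh37 -w Human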

File Formats

Copy Number

Copy number files are accepted in a 1-based, bed-style tab-separated format, or in standard BED format if the suffix is .bed. The columns are chromosome, start, stop and copy number (integer). A separate file is required for the normal and the tumour. Each file should have a copy number assigned for every region requested for analysis (NB, CaVEMan sets the CN to 2 in regions where the supplied copy number is 0).

Example:
	1	0	20000	2
	2	0	2500	4
	2	2501	500000	6

Ignored regions file

A 1-based bed style tab separated format file of regions to be ignored during analysis. An example might be regions known to have extreme depth of mapped reads through mismapping.
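
A minimal sketch of such a file, assuming the same tab-separated chromosome/start/stop layout as the copy number format but without the copy number column (coordinates are invented purely for illustration):

	1	121480000	121500000
	X	58560000	58580000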

Default Settings

There may be cases where copy number is not available. In order to extract the best results it is advised to use a copy number of 2 in the normal and 5 in the tumour, in combination with a normal contamination of 0.1. This gives CaVEMan a broader range over which variants will be called than a copy number of 2 in the tumour.
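
Using the estep options documented above, that advice corresponds to something along these lines (a sketch only; -n and -t here fill copy number gaps with 2 for the normal and 5 for the tumour):

bin/caveman estep -f caveman.cfg.ini -i $LSB_JOBINDEX -n 2 -t 5 -k 0.1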

LICENCE

Copyright (c) 2014-2018 Genome Research Ltd.

Author: Cancer Genome Project [email protected]

This file is part of CaVEMan.

CaVEMan is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

  1. The usage of a range of years within a copyright statement contained within this distribution should be interpreted as being equivalent to a list of years including the first and last year specified and all consecutive years between them. For example, a copyright statement that reads ‘Copyright (c) 2005, 2007-2009, 2011-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2007, 2008, 2009, 2011, 2012’ and a copyright statement that reads ‘Copyright (c) 2005-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012’.

caveman's People

Contributors

blex-max, davidrajones, jwhsanger, kathryn-beal, keiranmraine, rulixxx

caveman's Issues

1.13.5 - Segmentation fault (core dumped) - Estep

As per sections 1 and 2 of the following comment on issue #82

  1. I tried out the pre-release 1.13.5 and I got the segmentation fault (core dumped) error message that we all fear :-(
    I used the exact same inputs and commands as with versions 1.13.2 and 1.13.3.

  2. I first tried using only the contigs representing chromosomes 21 and 22 as a test and got the segmentation fault. I attach a zip file with the logs, files and commands for that run. Hope it helps.

CaVEMan.run.zip

Reported by @ChristopheLegendre

Installation problem - CaVEMan 1.7.5

Hey, I'm trying to install CaVEMan 1.7.5 (I need to support some old versions) and I'm getting the following error. My installation procedure is at the end.

============== System information ====
+ lsb_release -a
LSB Version:	core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.4 LTS
Release:	16.04
Codename:	xenial
+ uname -a
Linux 298af5cd93a5 4.9.60-linuxkit-aufs #1 SMP Mon Nov 6 16:00:12 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
+ sw_vers
./setup.sh: line 91: sw_vers: command not found
+ system_profiler
./setup.sh: line 92: system_profiler: command not found
+ grep MemTotal /proc/meminfo
MemTotal:        5068104 kB
+ set +x


+ '[' '!' -e htslib ']'
+ get_distro htslib https://github.com/samtools/htslib/archive/1.2.tar.gz
+ EXT=
+ DECOMP='gunzip -f'
+ echo htslib
htslib
+ [[ https://github.com/samtools/htslib/archive/1.2.tar.gz == *.tar.bz2* ]]
+ [[ https://github.com/samtools/htslib/archive/1.2.tar.gz == *.tar.gz* ]]
+ EXT=tar.gz
+ hash curl
+ curl -sS -o htslib.tar.gz -L https://github.com/samtools/htslib/archive/1.2.tar.gz
+ mkdir -p htslib
++ gunzip -f htslib.tar.gz
+ tar --strip-components 1 -C htslib -xf htslib.tar
+ patch htslib/cram/cram_index.c
patching file htslib/cram/cram_index.c
+ make -C htslib -j4
make: Entering directory '/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib'
gcc -g -Wall -O2 -I. -c -o kfunc.o kfunc.c
gcc -g -Wall -O2 -I. -c -o knetfile.o knetfile.c
gcc -g -Wall -O2 -I. -c -o kstring.o kstring.c
gcc -g -Wall -O2 -I. -c -o bgzf.o bgzf.c
gcc -g -Wall -O2 -I. -c -o faidx.o faidx.c
gcc -g -Wall -O2 -I. -c -o hfile.o hfile.c
gcc -g -Wall -O2 -I. -c -o hfile_net.o hfile_net.c
echo '#define HTS_VERSION "1.2"' > version.h
gcc -g -Wall -O2 -I. -c -o regidx.o regidx.c
gcc -g -Wall -O2 -I. -c -o sam.o sam.c
gcc -g -Wall -O2 -I. -c -o synced_bcf_reader.o synced_bcf_reader.c
gcc -g -Wall -O2 -I. -c -o vcf_sweep.o vcf_sweep.c
gcc -g -Wall -O2 -I. -c -o tbx.o tbx.c
gcc -g -Wall -O2 -I. -c -o vcf.o vcf.c
gcc -g -Wall -O2 -I. -c -o vcfutils.o vcfutils.c
gcc -g -Wall -O2 -I. -c -o cram/cram_codecs.o cram/cram_codecs.c
gcc -g -Wall -O2 -I. -c -o cram/cram_decode.o cram/cram_decode.c
gcc -g -Wall -O2 -I. -c -o cram/cram_encode.o cram/cram_encode.c
gcc -g -Wall -O2 -I. -c -o cram/cram_index.o cram/cram_index.c
gcc -g -Wall -O2 -I. -c -o cram/cram_io.o cram/cram_io.c
gcc -g -Wall -O2 -I. -c -o cram/cram_samtools.o cram/cram_samtools.c
gcc -g -Wall -O2 -I. -c -o cram/cram_stats.o cram/cram_stats.c
gcc -g -Wall -O2 -I. -c -o cram/files.o cram/files.c
gcc -g -Wall -O2 -I. -c -o cram/mFILE.o cram/mFILE.c
gcc -g -Wall -O2 -I. -c -o cram/md5.o cram/md5.c
gcc -g -Wall -O2 -I. -c -o cram/open_trace_file.o cram/open_trace_file.c
gcc -g -Wall -O2 -I. -c -o cram/pooled_alloc.o cram/pooled_alloc.c
gcc -g -Wall -O2 -I. -c -o cram/rANS_static.o cram/rANS_static.c
gcc -g -Wall -O2 -I. -c -o cram/sam_header.o cram/sam_header.c
gcc -g -Wall -O2 -I. -c -o cram/string_alloc.o cram/string_alloc.c
gcc -g -Wall -O2 -I. -c -o cram/thread_pool.o cram/thread_pool.c
gcc -g -Wall -O2 -I. -c -o cram/vlen.o cram/vlen.c
gcc -g -Wall -O2 -I. -c -o cram/zfio.o cram/zfio.c
gcc -g -Wall -O2 -I. -fpic -c -o kfunc.pico kfunc.c
gcc -g -Wall -O2 -I. -fpic -c -o knetfile.pico knetfile.c
gcc -g -Wall -O2 -I. -fpic -c -o kstring.pico kstring.c
gcc -g -Wall -O2 -I. -fpic -c -o bgzf.pico bgzf.c
gcc -g -Wall -O2 -I. -fpic -c -o faidx.pico faidx.c
gcc -g -Wall -O2 -I. -fpic -c -o hfile.pico hfile.c
gcc -g -Wall -O2 -I. -fpic -c -o hfile_net.pico hfile_net.c
gcc -g -Wall -O2 -I. -fpic -c -o hts.pico hts.c
gcc -g -Wall -O2 -I. -fpic -c -o regidx.pico regidx.c
gcc -g -Wall -O2 -I. -fpic -c -o sam.pico sam.c
gcc -g -Wall -O2 -I. -fpic -c -o synced_bcf_reader.pico synced_bcf_reader.c
gcc -g -Wall -O2 -I. -fpic -c -o vcf_sweep.pico vcf_sweep.c
gcc -g -Wall -O2 -I. -fpic -c -o tbx.pico tbx.c
gcc -g -Wall -O2 -I. -fpic -c -o vcf.pico vcf.c
gcc -g -Wall -O2 -I. -fpic -c -o vcfutils.pico vcfutils.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_codecs.pico cram/cram_codecs.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_decode.pico cram/cram_decode.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_encode.pico cram/cram_encode.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_index.pico cram/cram_index.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_io.pico cram/cram_io.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_samtools.pico cram/cram_samtools.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/cram_stats.pico cram/cram_stats.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/files.pico cram/files.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/mFILE.pico cram/mFILE.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/md5.pico cram/md5.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/open_trace_file.pico cram/open_trace_file.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/pooled_alloc.pico cram/pooled_alloc.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/rANS_static.pico cram/rANS_static.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/sam_header.pico cram/sam_header.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/string_alloc.pico cram/string_alloc.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/thread_pool.pico cram/thread_pool.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/vlen.pico cram/vlen.c
gcc -g -Wall -O2 -I. -fpic -c -o cram/zfio.pico cram/zfio.c
gcc -g -Wall -O2 -I. -c -o bgzip.o bgzip.c
gcc -g -Wall -O2 -I. -c -o htsfile.o htsfile.c
gcc -g -Wall -O2 -I. -c -o tabix.o tabix.c
gcc -g -Wall -O2 -I. -c -o test/fieldarith.o test/fieldarith.c
gcc -g -Wall -O2 -I. -c -o test/hfile.o test/hfile.c
gcc -g -Wall -O2 -I. -c -o test/sam.o test/sam.c
gcc -g -Wall -O2 -I. -c -o test/test-regidx.o test/test-regidx.c
gcc -g -Wall -O2 -I. -c -o test/test_view.o test/test_view.c
gcc -g -Wall -O2 -I. -c -o test/test-vcf-api.o test/test-vcf-api.c
gcc -g -Wall -O2 -I. -c -o test/test-vcf-sweep.o test/test-vcf-sweep.c
gcc -g -Wall -O2 -I. -c -o hts.o hts.c
gcc -shared -Wl,-soname,libhts.so.1 -pthread  -o libhts.so kfunc.pico knetfile.pico kstring.pico bgzf.pico faidx.pico hfile.pico hfile_net.pico hts.pico regidx.pico sam.pico synced_bcf_reader.pico vcf_sweep.pico tbx.pico vcf.pico vcfutils.pico cram/cram_codecs.pico cram/cram_decode.pico cram/cram_encode.pico cram/cram_index.pico cram/cram_io.pico cram/cram_samtools.pico cram/cram_stats.pico cram/files.pico cram/mFILE.pico cram/md5.pico cram/open_trace_file.pico cram/pooled_alloc.pico cram/rANS_static.pico cram/sam_header.pico cram/string_alloc.pico cram/thread_pool.pico cram/vlen.pico cram/zfio.pico  -lz -lm
ln -sf libhts.so libhts.so.1
ar -rc libhts.a kfunc.o knetfile.o kstring.o bgzf.o faidx.o hfile.o hfile_net.o hts.o regidx.o sam.o synced_bcf_reader.o vcf_sweep.o tbx.o vcf.o vcfutils.o cram/cram_codecs.o cram/cram_decode.o cram/cram_encode.o cram/cram_index.o cram/cram_io.o cram/cram_samtools.o cram/cram_stats.o cram/files.o cram/mFILE.o cram/md5.o cram/open_trace_file.o cram/pooled_alloc.o cram/rANS_static.o cram/sam_header.o cram/string_alloc.o cram/thread_pool.o cram/vlen.o cram/zfio.o
ranlib libhts.a
gcc -pthread  -o bgzip bgzip.o libhts.a  -lz
gcc -pthread  -o htsfile htsfile.o libhts.a  -lz
gcc -pthread  -o tabix tabix.o libhts.a  -lz
gcc -pthread  -o test/fieldarith test/fieldarith.o libhts.a  -lz
gcc  -o test/hfile test/hfile.o libhts.a  -lz
gcc -pthread  -o test/sam test/sam.o libhts.a  -lz
gcc -pthread  -o test/test-regidx test/test-regidx.o libhts.a  -lz
gcc -pthread  -o test/test_view test/test_view.o libhts.a  -lz
gcc -pthread  -o test/test-vcf-api test/test-vcf-api.o libhts.a  -lz
gcc -pthread  -o test/test-vcf-sweep test/test-vcf-sweep.o libhts.a  -lz
make: Leaving directory '/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib'
+ touch /tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib.success
+ mkdir -p /tmp/downloads/tmp/install_tmp/caveman/c/bin
+ make -j4
rm -f ./src/*.o *~ ./bin/caveman ./bin/generateCavemanUMNormVCF ./bin/* ./tests/tests_log ./tests/bam_access_tests ./tests/config_file_access_tests ./tests/ign_region_access_tests ./tests/alg_bean_tests ./tests/split_access_tests ./tests/cn_access_tests ./tests/algos_tests ./tests/output_tests ./tests/covs_access_tests ./tests/fai_access_tests ./tests/genotype_tests ./src/*.gcda ./src/*.gcov ./src/*.gcno *.gcda *.gcov *.gcno ./tests/*.gcda ./tests/*.gcov ./tests/*.gcno
rm -rf ./bin ./caveman_tmp
mkdir ./bin
mkdir ./caveman_tmp
#Do some magic to ensure we compile CaVEMan with the static libhts.a rather than libhts.so
ln -s /tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/libhts.a ./caveman_tmp/libhts.a
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/file_tests.c  -o src/file_tests.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/List.c  -o src/List.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/List_algos.c  -o src/List_algos.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/bam_access.c  -o src/bam_access.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/config_file_access.c  -o src/config_file_access.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/fai_access.c  -o src/fai_access.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/ignore_reg_access.c  -o src/ignore_reg_access.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/alg_bean.c  -o src/alg_bean.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/split_access.c  -o src/split_access.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/covs_access.c  -o src/covs_access.o
src/covs_access.c: In function ‘covs_access_write_covs_to_file’:
src/covs_access.c:138:20: warning: unused variable ‘s’ [-Wunused-variable]
  int i,j,k,m,n,p,r,s;
                    ^
src/covs_access.c: In function ‘covs_access_read_covs_from_file’:
src/covs_access.c:171:20: warning: unused variable ‘s’ [-Wunused-variable]
  int i,j,k,m,n,p,r,s;
                    ^
src/covs_access.c: In function ‘covs_access_write_probs_to_file’:
src/covs_access.c:427:20: warning: unused variable ‘s’ [-Wunused-variable]
  int i,j,k,m,n,p,r,s;
                    ^
src/covs_access.c: In function ‘covs_access_read_probs_from_file’:
src/covs_access.c:502:20: warning: unused variable ‘s’ [-Wunused-variable]
  int i,j,k,m,n,p,r,s;
                    ^
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/cn_access.c  -o src/cn_access.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/genotype.c  -o src/genotype.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/algos.c  -o src/algos.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/output.c  -o src/output.o
In file included from src/output.h:36:0,
                 from src/output.c:35:
src/algos.h:69:13: warning: inline function ‘algos_run_per_position_estep_maths’ declared but never defined
 inline void algos_run_per_position_estep_maths(estep_position_t *pos);
             ^
src/algos.h:68:12: warning: inline function ‘algos_run_per_read_estep_maths’ declared but never defined
 inline int algos_run_per_read_estep_maths(genotype_store_t *genos,read_pos_t *read, int8_t ref_base_idx, long double base_norm_contam);
            ^
src/algos.h:67:13: warning: inline function ‘finalise_probabilities_and_find_top_prob’ declared but never defined
 inline void finalise_probabilities_and_find_top_prob(estep_position_t *pos,long double norm_factor);
             ^
src/algos.h:66:20: warning: inline function ‘algos_calculate_per_base_normal_contamination’ declared but never defined
 inline long double algos_calculate_per_base_normal_contamination(uint8_t norm_copy_no,uint8_t tum_copy_no);
                    ^
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/setup.c  -o src/setup.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/split.c  -o src/split.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/mstep.c  -o src/mstep.o
In file included from src/mstep.c:47:0:
src/algos.h:69:13: warning: inline function ‘algos_run_per_position_estep_maths’ declared but never defined
 inline void algos_run_per_position_estep_maths(estep_position_t *pos);
             ^
src/algos.h:68:12: warning: inline function ‘algos_run_per_read_estep_maths’ declared but never defined
 inline int algos_run_per_read_estep_maths(genotype_store_t *genos,read_pos_t *read, int8_t ref_base_idx, long double base_norm_contam);
            ^
src/algos.h:67:13: warning: inline function ‘finalise_probabilities_and_find_top_prob’ declared but never defined
 inline void finalise_probabilities_and_find_top_prob(estep_position_t *pos,long double norm_factor);
             ^
src/algos.h:66:20: warning: inline function ‘algos_calculate_per_base_normal_contamination’ declared but never defined
 inline long double algos_calculate_per_base_normal_contamination(uint8_t norm_copy_no,uint8_t tum_copy_no);
                    ^
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/merge.c  -o src/merge.o
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -c src/estep.c  -o src/estep.o
In file included from src/output.h:36:0,
                 from src/estep.c:47:
src/algos.h:69:13: warning: inline function ‘algos_run_per_position_estep_maths’ declared but never defined
 inline void algos_run_per_position_estep_maths(estep_position_t *pos);
             ^
src/algos.h:68:12: warning: inline function ‘algos_run_per_read_estep_maths’ declared but never defined
 inline int algos_run_per_read_estep_maths(genotype_store_t *genos,read_pos_t *read, int8_t ref_base_idx, long double base_norm_contam);
            ^
src/algos.h:67:13: warning: inline function ‘finalise_probabilities_and_find_top_prob’ declared but never defined
 inline void finalise_probabilities_and_find_top_prob(estep_position_t *pos,long double norm_factor);
             ^
src/algos.h:66:20: warning: inline function ‘algos_calculate_per_base_normal_contamination’ declared but never defined
 inline long double algos_calculate_per_base_normal_contamination(uint8_t norm_copy_no,uint8_t tum_copy_no);
                    ^
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -o ./bin/caveman ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm ./src/caveman.c
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic -o ./bin/generateCavemanUMNormVCF ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm ./src/generateCavemanVCFUnmatchedNormalPanel.c
In file included from src/output.h:36:0,
                 from ./src/generateCavemanVCFUnmatchedNormalPanel.c:40:
src/algos.h:69:13: warning: inline function ‘algos_run_per_position_estep_maths’ declared but never defined
 inline void algos_run_per_position_estep_maths(estep_position_t *pos);
             ^
src/algos.h:68:12: warning: inline function ‘algos_run_per_read_estep_maths’ declared but never defined
 inline int algos_run_per_read_estep_maths(genotype_store_t *genos,read_pos_t *read, int8_t ref_base_idx, long double base_norm_contam);
            ^
src/algos.h:67:13: warning: inline function ‘finalise_probabilities_and_find_top_prob’ declared but never defined
 inline void finalise_probabilities_and_find_top_prob(estep_position_t *pos,long double norm_factor);
             ^
src/algos.h:66:20: warning: inline function ‘algos_calculate_per_base_normal_contamination’ declared but never defined
 inline long double algos_calculate_per_base_normal_contamination(uint8_t norm_copy_no,uint8_t tum_copy_no);
                    ^
rsync -uE ./scripts/* ./bin/
chmod u+x ./bin/setupCaveman ./bin/splitCaveman ./bin/mstepCaveman ./bin/mergeCaveman ./bin/estepCaveman ./bin/mergeCavemanResults
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/bam_access_tests.c   -o tests/bam_access_tests
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/config_file_access_tests.c   -o tests/config_file_access_tests
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/ign_region_access_tests.c   -o tests/ign_region_access_tests
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/alg_bean_tests.c   -o tests/alg_bean_tests
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/split_access_tests.c   -o tests/split_access_tests
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/cn_access_tests.c   -o tests/cn_access_tests
gcc -O3 -g -DCAVEMAN_VERSION='"1.7.2"' -DTEST_REF='""""' -Wall -Isrc -I/tmp/downloads/tmp/install_tmp/caveman/install_tmp/htslib/ -rdynamic ./src/file_tests.o ./src/List.o ./src/List_algos.o ./src/bam_access.o ./src/config_file_access.o ./src/fai_access.o ./src/ignore_reg_access.o ./src/alg_bean.o ./src/split_access.o ./src/covs_access.o ./src/cn_access.o ./src/genotype.o ./src/algos.o ./src/output.o ./src/setup.o ./src/split.o ./src/mstep.o ./src/merge.o ./src/estep.o -L./caveman_tmp -lhts -lpthread -lz -lm    tests/algos_tests.c   -o tests/algos_tests
In file included from tests/algos_tests.c:34:0:
src/algos.h:69:13: warning: inline function ‘algos_run_per_position_estep_maths’ declared but never defined
 inline void algos_run_per_position_estep_maths(estep_position_t *pos);
             ^
src/algos.h:68:12: warning: inline function ‘algos_run_per_read_estep_maths’ declared but never defined
 inline int algos_run_per_read_estep_maths(genotype_store_t *genos,read_pos_t *read, int8_t ref_base_idx, long double base_norm_contam);
            ^
src/algos.h:67:13: warning: inline function ‘finalise_probabilities_and_find_top_prob’ declared but never defined
 inline void finalise_probabilities_and_find_top_prob(estep_position_t *pos,long double norm_factor);
             ^
src/algos.h:66:20: warning: inline function ‘algos_calculate_per_base_normal_contamination’ declared but never defined
 inline long double algos_calculate_per_base_normal_contamination(uint8_t norm_copy_no,uint8_t tum_copy_no);
                    ^
/tmp/ccl2GMBI.o: In function `test_algos_calculate_per_base_normal_contamination':
/tmp/downloads/tmp/install_tmp/caveman/tests/algos_tests.c:392: undefined reference to `algos_calculate_per_base_normal_contamination'
/tmp/ccl2GMBI.o: In function `test_finalise':
/tmp/downloads/tmp/install_tmp/caveman/tests/algos_tests.c:498: undefined reference to `finalise_probabilities_and_find_top_prob'
/tmp/downloads/tmp/install_tmp/caveman/tests/algos_tests.c:562: undefined reference to `finalise_probabilities_and_find_top_prob'
collect2: error: ld returned 1 exit status
<builtin>: recipe for target 'tests/algos_tests' failed
make: *** [tests/algos_tests] Error 1

Installation on ubuntu:16.04

export OPT=/opt/bundle
export PERL5LIB=$OPT/lib/perl5:$PERL5LIB
export PATH=$OPT/bin:$PATH
export PATH=/root/perl5/perlbrew/perls/perl-5.18.4/bin:$PATH
export LD_LIBRARY_PATH=$OPT/lib

# install system dependencies
apt-get -yqq update --fix-missing
apt-get -yqq install \
    build-essential \
    libbz2-dev \
    libcurl4-gnutls-dev \
    libgnutls-dev \
    liblzma-dev \
    libncurses5-dev \
    libssl-dev \
    locales \
    nettle-dev \
    wget \
    zlib1g-dev \
    curl \
    libexpat1-dev \
    libgoogle-perftools-dev \
    lsof \
    time \
    psmisc \
    libgd-dev \
    gfortran \
    libreadline-dev \
    build-essential \
    fort77 \
    xorg-dev \
    liblzma-dev \
    libblas-dev \
    gcc-multilib \
    gobjc++ \
    libreadline-dev \
    lsb-core \
    libboost-dev \
    libboost-all-dev \
    libpstreams-dev \
    libgtk2.0-dev \
    libglib2.0-dev \
    r-base-core \
    python-pip

apt-get clean

# Configure default locale, see
# https://github.com/rocker-org/rocker/issues/19
# http://click.pocoo.org/5/python3/
echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen
locale-gen en_US.utf8
/usr/sbin/update-locale LANG=en_US.UTF-8

# install perlbrew and correct perl version
cpan App::perlbrew
perlbrew init
perlbrew install 5.18.4 --thread --notest --noman
perlbrew switch perl-5.18.4

# install variables
DOWNLOAD_TAR=/tmp/downloads/tmp.tar.gz
DOWNLOAD_DIR=/tmp/downloads/tmp

# install PCAP-core
DOWNLOAD_URL=https://github.com/ICGC-TCGA-PanCancer/PCAP-core/archive/v1.14.0.tar.gz
mkdir -p $DOWNLOAD_DIR $OPT/bin $OPT/etc $OPT/lib $OPT/share
curl -sSL -o $DOWNLOAD_TAR --retry 10 $DOWNLOAD_URL
tar -C $DOWNLOAD_DIR --strip-components 1 -zxf $DOWNLOAD_TAR
cd $DOWNLOAD_DIR && ./setup.sh $OPT
cd && rm -rf $DOWNLOAD_DIR $DOWNLOAD_TAR

# install cgpVcf
DOWNLOAD_URL=https://github.com/cancerit/cgpVcf/archive/v1.3.1.tar.gz
mkdir -p $DOWNLOAD_DIR $OPT/bin $OPT/etc $OPT/lib $OPT/share
curl -sSL -o $DOWNLOAD_TAR --retry 10 $DOWNLOAD_URL
tar -C $DOWNLOAD_DIR --strip-components 1 -zxf $DOWNLOAD_TAR
cd $DOWNLOAD_DIR && ./setup.sh $OPT
cd && rm -rf $DOWNLOAD_DIR $DOWNLOAD_TAR

# install cgpCavemanPostProcessing
DOWNLOAD_URL=https://github.com/cancerit/cgpCaVEManPostProcessing/archive/1.5.2.tar.gz
mkdir -p $DOWNLOAD_DIR $OPT/bin $OPT/etc $OPT/lib $OPT/share
curl -sSL -o $DOWNLOAD_TAR --retry 10 $DOWNLOAD_URL
tar -C $DOWNLOAD_DIR --strip-components 1 -zxf $DOWNLOAD_TAR
cd $DOWNLOAD_DIR && ./setup.sh $OPT
cd && rm -rf $DOWNLOAD_DIR $DOWNLOAD_TAR

# install cgpCavemanWrapper
DOWNLOAD_URL=https://github.com/cancerit/cgpCaVEManWrapper/archive/1.7.5.tar.gz
mkdir -p $DOWNLOAD_DIR $OPT/bin $OPT/etc $OPT/lib $OPT/share
curl -sSL -o $DOWNLOAD_TAR --retry 10 $DOWNLOAD_URL
tar -C $DOWNLOAD_DIR --strip-components 1 -zxf $DOWNLOAD_TAR
cd $DOWNLOAD_DIR && ./setup.sh $OPT
cd && rm -rf $DOWNLOAD_DIR $DOWNLOAD_TAR

Error with caveman setup command

During the installation process all tests seem to have passed without applying any manual patch to htslib. But when I use caveman setup or setupCaveman, I get the following error:

caveman: src/alg_bean.c:279: alg_bean_write_list_char: Assertion `((li)->count) > 0' failed.
./setupCaveman: line 35: 30077 Aborted                 (core dumped) ${0%/*}/caveman setup "$@"

I am hoping this is caused by not applying the patch to htslib. If so, shouldn't the tests reflect that instead of passing successfully?

estep help for options -l and -r

Hi folks, just a small issue. I noticed that the help entries for the estep -l and -r options say:
Normal protocol. Ideally this should match -r/-l but not checked (WGS|WGX|RNA) [default:WGS].

However, when trying to use "WGX" you get an error:
Normal protocol 'WGX' is invalid should be one of (WGS|WXS|RNA).

Best,
Simon

CAVEMAN: Somatic Calls in SNPs File

We have been using CaVEMan to call variants in our high-depth (200X) WGS BAMs. We are seeing that some true calls are currently being put into the SNPs file.

Below is an example where a true call with 16% VAF in the tumour and 0% VAF in the normal is not in the somatic file but in the SNPs file.

I wanted to ask if there is a way we can rescue such calls in a systematic way.
Thank you,
Teja

yellapav$ zless ALL101-T2-1-D1-1_vs_ALL101-N1-1-D1-1.snps.ids.vcf.gz | grep 25380285
12 25380285 70a63090-7430-11e6-a392-90a00151774b G A . . DP=325;MP=4.8e-05;GP=1.0e+00;TG=GG/AG;TP=4.8e-05;SG=AG/GG;SP=1.0e+00 GT:FAZ:FCZ:FGZ:FTZ:RAZ:RCZ:RGZ:RTZ:PM 0|0:0:0:20:0:0:0:17:0:0.0e+00 0|1:19:1:112:1:26:0:129:0:1.6e-01

Crash when encountering a "bad" read

I've been trying to test CaVEMan on some exome sequencing data that I downloaded from the EGA. The data seems to have been pre-processed with GATK (indel realignment/base quality recalibration). When I run CaVEMan (tag 1.6.2), I get the following crash during mstep:

~/Code/CaVEMan/build/bin/caveman mstep -i 1 -f D132.cfg
Looking at section chr1:1-9188602 for mstep
M-stepping section chr1:1-564449 for mstep
fetched a reference seq of length 564449 for this section.
M-stepping section chr1:570372-724136 for mstep
fetched a reference seq of length 153765 for this section.
M-stepping section chr1:727044-825006 for mstep
fetched a reference seq of length 97963 for this section.
M-stepping section chr1:825116-2583334 for mstep
fetched a reference seq of length 1758219 for this section.
[ERROR] (src/algos.c: algos_mstep_read_position:180 errno: No such file or directory) Error calculating map qual index.
*** glibc detected *** /users/bschuster/Code/CaVEMan/build/bin/caveman: double free or corruption (fasttop): 0x0000000002621550 ***
Segmentation fault

I've concluded that this is actually due to a sanity check on some reads in algos_mstep_read_position(), but I wonder:

  • what is causing this assertion to fail and
  • why the app then goes into meltdown

I can't share the full data unfortunately but I'd be happy to test and report back.

emitting calls with impossible genotypes

Example from pancancer data set for donor: 11ab53c8-6366-4bc9-b60d-23e6ed2d1cae

$ head -n 500 11ab53c8-6366-4bc9-b60d-23e6ed2d1cae/caveman/tmpCaveman/a0218c53-0b70-4940-8e14-cdc7dde9e5a7_vs_cdccdb98-4446-41f2-bcf2-e5459a2e477d.muts.ids.vcf | tail -n 1
1   5728410 aff71d72-2469-11e4-8fad-b029640c9a49    T   G   .   .   DP=113;MP=9.7e-01;GP=2.5e-02;TG=TT/GTTT;TP=5.3e-01;SG=TT/CTTT;SP=3.1e-01    GT:FAZ:FCZ:FGZ:FTZ:RAZ:RCZ:RGZ:RTZ:PM   0|0:1:0:0:19:1:0:0:19:0.0e+00   0|1:1:0:0:35:4:0:0:33:0.0e+00

Shows the change as T->G when there is no G in the allele counts; the second most likely is also shown as T->C, again not possible based on the genotypes.

mstep/estep 5.

pileup (the CaVEMan-specific one) gives:

$ fetchSinglePosPileUp.pl -b 11ab53c8-6366-4bc9-b60d-23e6ed2d1cae/tumour1.bam -c 1:5728410
READ_NAME   CALL    MAP_QUAL    BASE_QUALITY    READ_POS    READ_LENGTH STRAND  XT  CIGAR
HWI-ST805:192:C0MNLACXX:2:1107:3011:9133    T   60  34  4   101 -1  -   101M
HWI-ST580:248:C11VKACXX:4:2305:15586:7661   T   60  28  1   101 -1  -   5S96M
HWI-ST580:220:C0MW6ACXX:2:1209:7836:7880    T   60  35  94  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:2109:13870:13807  T   60  35  93  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:1205:9534:18493   T   60  35  13  101 -1  -   101M
HWI-ST580:248:C11VKACXX:4:2208:19801:67900  T   60  35  14  101 -1  -   101M
HWI-ST580:220:C0MW6ACXX:2:1204:4293:63938   T   60  35  15  101 -1  -   101M
HWI-ST805:251:C11LTACXX:4:1103:15451:17878  T   60  35  87  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:2301:8086:57987   T   60  35  86  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:2110:5453:17975   T   60  35  84  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:2206:2038:85337   T   60  35  18  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:1107:4139:60661   T   60  29  84  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:1316:12431:98910  T   60  35  83  101 1   -   101M
HWI-ST580:220:C0MW6ACXX:2:2205:3740:56605   T   60  35  21  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:2208:8629:95728   T   60  35  81  101 1   -   101M
HWI-ST580:220:C0MW6ACXX:2:2211:8242:23836   T   60  35  80  101 1   -   101M
HWI-ST580:248:C11VKACXX:4:2216:12502:29324  T   60  35  80  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:2106:7346:95661   T   60  35  78  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:2316:4377:95223   T   60  35  26  101 -1  -   101M
HWI-ST580:248:C11VKACXX:4:1202:9225:60171   T   60  35  75  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:2304:19413:98996  T   60  35  27  101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:1111:6545:40916   T   60  33  72  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:1309:6275:65130   T   60  35  32  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:1301:4764:52191   T   60  35  67  101 1   -   101M
HWI-ST580:248:C11VKACXX:4:2214:7876:11756   T   60  35  64  101 1   -   101M
HWI-ST580:248:C11VKACXX:4:2306:12987:33366  T   60  31  63  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:1301:8395:16227   A   6   35  40  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:2110:18177:53767  T   60  35  41  101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:1306:20969:30180  T   60  35  42  101 -1  -   101M
HWI-ST580:220:C0MW6ACXX:2:2212:19117:97744  T   60  35  43  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:1305:15212:31660  T   60  34  45  101 -1  -   101M
HWI-ST580:219:C0MW5ACXX:5:1116:17394:60970  T   60  34  48  101 -1  -   101M
HWI-ST580:219:C0MW5ACXX:5:1116:18074:69479  T   60  35  49  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:2111:18141:24752  T   60  35  53  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:1315:12969:2705   T   60  35  52  101 1   -   101M
HWI-ST580:220:C0MW6ACXX:2:2111:19839:31100  T   60  35  51  101 -1  -   101M
HWI-ST580:219:C0MW5ACXX:5:2116:8696:80890   T   60  35  45  101 1   -   101M
HWI-ST580:248:C11VKACXX:3:2316:2210:49505   T   60  35  59  101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:1214:1956:53010   T   60  31  42  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:2303:12079:39921  T   60  35  64  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:1101:19990:87622  T   60  35  38  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:1207:11639:24256  T   60  35  38  101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:1202:12949:67136  T   60  34  37  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:1111:11412:89595  T   60  32  36  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:1210:14938:31225  T   60  35  66  101 -1  -   101M
HWI-ST580:248:C11VKACXX:4:2216:16939:32071  A   5   35  35  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:1315:8366:35459   A   0   35  55  101 -1  -   13S88M
HWI-ST580:248:C11VKACXX:3:2106:17002:80274  A   21  35  58  101 -1  -   10S91M
HWI-ST580:248:C11VKACXX:4:1115:14664:71967  A   0   35  62  101 -1  -   6S95M
HWI-ST580:248:C11VKACXX:4:2204:20269:2704   T   60  25  73  101 -1  -   101M
HWI-ST805:251:C11LTACXX:4:2309:1592:69092   T   60  35  26  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:1213:11630:15850  A   15  35  77  101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:1104:9631:25928   A   0   34  25  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:1303:14078:71460  T   57  34  77  101 -1  -   101M
HWI-ST580:220:C0MW6ACXX:2:2209:3319:91398   A   0   34  80  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:2311:19115:28233  T   50  34  22  101 1   -   101M
HWI-ST580:248:C11VKACXX:4:1108:10939:73827  T   50  35  22  101 1   -   101M
HWI-ST580:220:C0MW6ACXX:2:1209:10248:15016  T   50  35  81  101 -1  -   101M
HWI-ST580:248:C11VKACXX:4:1107:17447:20749  T   60  32  83  101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:2305:17299:25726  A   9   35  83  101 -1  -   101M
HWI-ST805:251:C11LTACXX:4:2306:12724:38246  T   60  35  16  101 1   -   101M
HWI-ST580:248:C11VKACXX:4:2108:10503:35030  T   60  35  86  101 -1  -   101M
HWI-ST580:220:C0MW6ACXX:2:1315:17819:72930  T   60  35  15  101 1   -   101M
HWI-ST580:248:C11VKACXX:4:2106:7176:23526   T   60  35  14  101 1   -   101M
HWI-ST580:220:C0MW6ACXX:2:1213:6467:91846   T   60  34  13  101 1   -   101M
HWI-ST805:192:C0MNLACXX:2:1106:19196:32669  T   60  35  12  101 1   -   101M
HWI-ST805:251:C11LTACXX:4:1313:1209:88840   T   53  32  91  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:1216:12063:22854  T   50  35  92  101 -1  -   101M
HWI-ST580:219:C0MW5ACXX:5:2214:5824:81554   T   39  35  96  101 -1  -   101M
HWI-ST580:219:C0MW5ACXX:5:2212:3713:22000   T   48  17  97  101 -1  -   101M
HWI-ST580:248:C11VKACXX:3:2205:6065:53224   T   48  35  5   101 1   -   101M
HWI-ST580:219:C0MW5ACXX:5:2315:7394:74369   T   48  34  4   101 1   -   101M
HWI-ST580:248:C11VKACXX:4:2106:4142:77749   T   48  34  98  101 -1  -   101M
HWI-ST805:251:C11LTACXX:4:1316:2539:24796   T   48  33  100 101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:1312:10936:56407  T   48  29  100 101 -1  -   101M
HWI-ST805:192:C0MNLACXX:2:1315:9810:3948    T   48  33  2   101 1   -   101M
HWI-ST805:251:C11LTACXX:4:2302:19343:89060  T   48  31  101 101 -1  -   101M
Total coverage: 77
Not properly paired (excluded from output): 1

Setup step symlink path

The setup step follows symlink paths; it should record the absolute path of the symlink itself, not of the file the symlink points to.

Fast gzip compress estep snps/muts VCFs to reduce I/O & tmp footprint

If the estep writes the split VCFs at gzip level 1, the I/O and tmp storage footprint will be greatly reduced for minimal CPU cost.

Not a full-size split example, but you get the idea:

$ time (cat snps.vcf > t)

real	0m0.005s
user	0m0.000s
sys	0m0.000s
$ time (gzip -c snps.vcf > t)

real	0m0.007s
user	0m0.000s
sys	0m0.000s
$ ls -l snps.vcf t
-rw-r----- 1 kr2 cancer 31439 Mar 18 10:39 snps.vcf
-rw-r----- 1 kr2 cancer  5865 Mar 18 10:40 t

Split step not creating regular sized splits

A new means of checking split sizes is needed: perhaps a joint iteration over normal and tumour rather than separate counts, simply counting up to 1 million reads and making that the cutoff position?

Something along similar lines was seen with the following command line:

caveman estep -i 372 -e 14c65927-b9dd-4ef6-9917-db7bc431ff2f/norm.cn.bed -j 14c65927-b9dd-4ef6-9917-db7bc431ff2f/tum.cn.bed -k 0.100000 -g 14c65927-b9dd-4ef6-9917-db7bc431ff2f/caveman/tmpCaveman/cov_arr -o 14c65927-b9dd-4ef6-9917-db7bc431ff2f/caveman/tmpCaveman/prob_arr -v hs37d5 -w HUMAN -f 14c65927-b9dd-4ef6-9917-db7bc431ff2f/caveman/tmpCaveman/caveman.cfg.ini

Estep section 11:101671969-101750094
fetched a reference seq of length 78126 for this section.
Start to get reads: 11:101671969-101721969
Got reads: 11:101671969-101721969
Start to get reads: 11:101721970-101750094
Got reads: 11:101721970-101750094
Estep section 11:101750095-101828219
fetched a reference seq of length 78125 for this section.
Start to get reads: 11:101750095-101800095
Got reads: 11:101750095-101800095

I added a few prints to stderr so I could correlate the exact time points with the memory spikes.

On the 3rd read set the memory jumped to ~12GB during the read selection step; during the processing of the data it climbs rapidly above 50GB (at which point I've had to kill it).

On investigation I can see the following read distribution across each request:

tumour

samtools view -c 14c65927-b9dd-4ef6-9917-db7bc431ff2f/tumour1.bam 11:101671969-101721969
88798
samtools view -c 14c65927-b9dd-4ef6-9917-db7bc431ff2f/tumour1.bam 11:101721970-101750094
47921
samtools view -c 14c65927-b9dd-4ef6-9917-db7bc431ff2f/tumour1.bam 11:101750095-101800095
889103

normal

samtools view -c 14c65927-b9dd-4ef6-9917-db7bc431ff2f/normal1.bam 11:101671969-101721969
20170
samtools view -c 14c65927-b9dd-4ef6-9917-db7bc431ff2f/normal1.bam 11:101721970-101750094
11248
samtools view -c 14c65927-b9dd-4ef6-9917-db7bc431ff2f/normal1.bam 11:101750095-101800095
20780

Copy number file defaults

The README and the help for caveman setup suggest that when no CNV files are provided, CaVEMan will create some defaults:

-e --tumour-copy-no-file [file] Location of tumour copy number bed file (if the extension is not .bed the file will be treated as 1 based start). If no copy number file is supplied then the default cn of 2 will be used

However, when I tried this now, it simply crashes with

[ERROR] (src/cn_access.c: cn_access_populate_cn:56 errno: No such file or directory) Error trying to open copy number file for reading .
[ERROR] (src/cn_access.c: cn_access_get_copy_number_for_location:132 errno: None) Error populating copy number from file .
[ERROR] (src/algos.c: algos_estep_read_position:659 errno: None) Error fetching normal copy number for chr1:761998
Segmentation fault

This is using 1.6.3. I tried with and without setting -t and -n for the estep. Do I have to create a dummy copy-number file myself?

error in caveman estep 1.12.0

[ERROR] (src/output.c: output_vcf_header:302 errno: No such file or directory) Error (0) writing contigs.
[ERROR] (src/estep.c: estep_main:517 errno: None) Error writing header to muts file.

mergeCavemanResults

Hello,
I did not find any documentation for this script in the main documentation page, but did run across it when searching the repository. Is this script meant to be run after the estep to combine all the variant calls?

I did try running it as follows:

mergeCavemanResults --output cavemen_muts_all_merge.vcf -s splitList -f results/%/%.muts/vcf
and get the following error

Expected 1726 files but got 0. at ./mergeCavemanResults line 54.

Could you please let me know if this is meant to be run this way? If not, is there a tool you would recommend for combining the final variant calls?

thanks
Arun

caveman mstep with error: split_access_get_section_from_indexv Error opening split list file.

Hello,
I'm using 1.11.0 and followed the “Processing flow” on the main page. After running the commands:

bin/caveman setup -t testData/testing_mt.bam -n testData/testing_wt.bam -r testData/ref.fai -g testData/ign.test -f results/
bin/caveman split -i 1 -f caveman.cfg.ini

The command:
bin/caveman mstep -f caveman.cfg.ini -i 1
throws the error message:

[ERROR] (src/split_access.c: split_access_get_section_from_index:50 errno: No such file or directory) Error opening split list file.
[ERROR] (src/mstep.c: mstep_main:193 errno: None) Error fetching region from split file.
Segmentation fault (core dumped)

It seems that the code in split_access.c didn't pick up the information from -i. Is there something wrong with my parameters?

1.13.9 - split error

Encountering an error when I use caveman's split command.
~/CaVEMan-dev/bin/caveman split -f CAVEMAN_analysis/MCF10A_config.txt -i 1
Found chr: 1 of length: 249250621 at index 1
*** Error in `/home/gmsywss/CaVEMan-dev/bin/caveman': free(): invalid next size (fast): 0x00000000010ff920 ***
Aborted
I tried various -i integers and still got the same error. Issue #80 appears similar; there the dev asked for the command to be executed under valgrind to see what the output is.
This is what I get when I execute the split command using valgrind:

~/valgrind-3.14.0/bin/valgrind ~/CaVEMan-dev/bin/caveman split -i 17 -f CAVEMAN_analysis/MCF10A_config.txt
==39798== Memcheck, a memory error detector
==39798== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==39798== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==39798== Command: /home/gmsywss/CaVEMan-dev/bin/caveman split -i 17 -f CAVEMAN_analysis/MCF10A_config.txt
==39798==
Found chr: 17 of length: 81195210 at index 17
==39798== Invalid write of size 1
==39798== at 0x57C9774: _IO_vfscanf (vfscanf.c:1073)
==39798== by 0x57D901B: __isoc99_vsscanf (isoc99_vsscanf.c:43)
==39798== by 0x57D8FA6: __isoc99_sscanf (isoc99_sscanf.c:31)
==39798== by 0x411FE7: ignore_reg_access_get_ign_reg_count_for_chr (ignore_reg_access.c:55)
==39798== by 0x4211BC: split_main (split.c:211)
==39798== by 0x477002: setup_options (caveman.c:76)
==39798== by 0x5792B44: (below main) (libc-start.c:287)
==39798== Address 0x5b1e5f2 is 0 bytes after a block of size 50 alloc'd
==39798== at 0x4C27E65: malloc (vg_replace_malloc.c:299)
==39798== by 0x411FC9: ignore_reg_access_get_ign_reg_count_for_chr (ignore_reg_access.c:53)
==39798== by 0x4211BC: split_main (split.c:211)
==39798== by 0x477002: setup_options (caveman.c:76)
==39798== by 0x5792B44: (below main) (libc-start.c:287)
==39798==

valgrind: m_mallocfree.c:307 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 128, hi = 4851576727769661767.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.

host stacktrace:
==39798== at 0x580973AC: show_sched_status_wrk (m_libcassert.c:369)
==39798== by 0x580974B7: report_and_quit (m_libcassert.c:440)
==39798== by 0x58097651: vgPlain_assert_fail (m_libcassert.c:506)
==39798== by 0x580A0EBD: get_bszB_as_is (m_mallocfree.c:305)
==39798== by 0x580A0EBD: get_bszB (m_mallocfree.c:315)
==39798== by 0x580A0EBD: get_pszB (m_mallocfree.c:389)
==39798== by 0x580A0EBD: vgPlain_describe_arena_addr (m_mallocfree.c:1592)
==39798== by 0x5808D143: vgPlain_describe_addr (m_addrinfo.c:187)
==39798== by 0x5808C0B6: vgMemCheck_update_Error_extra (mc_errors.c:1187)
==39798== by 0x58090D22: vgPlain_maybe_record_error (m_errormgr.c:823)
==39798== by 0x5808B5BB: vgMemCheck_record_address_error (mc_errors.c:767)
==39798== by 0x1003969794: ???
==39798== by 0x10032EDF2F: ???
==39798== by 0x100200838F: ???
==39798== by 0x100200838F: ???
==39798== by 0x1C0F: ???
==39798== by 0x10032EDF3F: ???
==39798== by 0x3: ???
==39798== by 0x111CC: ???
==39798== by 0x7: ???

sched status:
running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 39798)
==39798== at 0x57C981A: _IO_vfscanf (vfscanf.c:1158)
==39798== by 0x57D901B: __isoc99_vsscanf (isoc99_vsscanf.c:43)
==39798== by 0x57D8FA6: __isoc99_sscanf (isoc99_sscanf.c:31)
==39798== by 0x411FE7: ignore_reg_access_get_ign_reg_count_for_chr (ignore_reg_access.c:55)
==39798== by 0x4211BC: split_main (split.c:211)
==39798== by 0x477002: setup_options (caveman.c:76)
==39798== by 0x5792B44: (below main) (libc-start.c:287)
client stack range: [0x1FFEFFD000 0x1FFF000FFF] client SP: 0x1FFEFFF880
valgrind stack range: [0x10031EE000 0x10032EDFFF] top usage: 9096 of 1048576

Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using. Thanks.
Hope the devs find this useful.

specifying external htslib

The provided Makefile breaks if an external HTSLIB is specified unless it is patched with:

diff Makefile Makefile.org 
21c21
< OPTINC?=-I$(HTSLOC)/include

---
> OPTINC?=-I$(HTSLOC)/
91c91
<       ln -s $(HTSLOC)/lib/libhts.a $(HTSTMP)/libhts.a

---
>       ln -s $(HTSLOC)/libhts.a $(HTSTMP)/libhts.a

By the way, the install instructions provide a patch for htslib 1.2.1, with the implication that 1.3 is fine and the patch is not needed. The above works for htslib installed in /apps/htslib/1.3.
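For reference, a sketch of the build invocation the patched Makefile above is aimed at, assuming HTSLOC is an install prefix containing include/ and lib/ (as in /apps/htslib/1.3):

# Sketch: build against an external htslib install prefix.
make clean
make HTSLOC=/apps/htslib/1.3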

Pre-release tag overlapping_reads

Pull all hotfixes into this feature and create a pre-release tag for use in large scale testing via docker.

Note: for some reason this branch isn't feature/overlapping_reads. Needs renaming.

Genome.fa

The test genome.fa file is too big; use a smaller chromosome.

estep doesn't allow user to specify PROTOCOL

The VCF file requires the protocol value to be defined in the 'sample' lines. It should also be possible to set platform (unless this is from the BAM header), source and accession.

no calls gives unusual file sets

If CaVEMan gives no calls during the estep you get a full set of *.muts.vcf files but no *.snps.vcf.

This causes the mergeCavemanResults script to fail.

Not sure how this should be handled or classified, as the data is of no use, but a clean run with a warning would be helpful. I encountered this when running a WXS pair from cghub where it appears the tumour and normal don't have any common coverage.

Investigate possible buffering of BAM access

The M/E steps do all reads from disk in ~64 KB chunks; this is explainable as the size of the compressed blocks (65,536 bytes) of a BAM file.

Is it possible to stage the data in a buffered stream? I know 1 MB would work better for Lustre; not sure about EC2.

caveman estep 1.12.0

This error has been reported before

[ERROR] (src/output.c: output_vcf_header:302 errno: No such file or directory) Error (0) writing contigs.
[ERROR] (src/estep.c: estep_main:517 errno: None) Error writing header to muts file.

The thread was closed with a suspected zlib error (fixed in 1.2.8).
However, we have the same error on CentOS 7 with zlib 1.2.7:
$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
and it currently ships zlib:
[root@server213-171-196-65 /]# rpm -qa zlib*
zlib-1.2.7-17.el7.x86_64
zlib-1.2.7-17.el7.i686
zlib-devel-1.2.7-17.el7.x86_64

As far as I understand, there isn't a zlib 1.2.8 package available for CentOS 7.

Any advice will be appreciated.
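If a newer distro package really isn't an option, one route is a local build of zlib 1.2.8 into a private prefix; a sketch follows. Whether the CaVEMan build picks the headers and libraries up from these environment variables alone, rather than needing CFLAGS/LDFLAGS edits in the Makefile, is an assumption to verify.

# Sketch: local zlib 1.2.8 when the distro only ships 1.2.7.
PREFIX=$HOME/opt/zlib-1.2.8
tar xzf zlib-1.2.8.tar.gz          # fetch the 1.2.8 tarball from a source you trust
cd zlib-1.2.8
./configure --prefix="$PREFIX"
make && make install
# Point the compiler and runtime linker at the private copy.
export CPATH="$PREFIX/include:${CPATH:-}"
export LIBRARY_PATH="$PREFIX/lib:${LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="$PREFIX/lib:${LD_LIBRARY_PATH:-}"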

Error when executing split command

Hi!

I have been trying to run CaVEMan on a few of my WGS samples but each time I run the split command I get this strange error:

caveman split -i 1 -f /home/druce/caveman.cfg.ini
Found chr: 1 of length: 195471971 at index 1
*** Error in `caveman': free(): invalid next size (fast): 0x0000000001f0b250 ***
*** Error in `caveman': malloc(): memory corruption: 0x0000000001f0b290 ***

I have tried looking into this error but cannot figure out a way to solve it. Do you have any advice? I'm using CaVEMan version 1.13.1.

generateCavemanUMNormVCF - Unmatched VCF generation errors

#0 bam_name2id (h=0x0, ref=0x7fffffffdc40 "chr19") at sam.c:189
#1 0x0000000000432a60 in hts_itr_querys (idx=0x756b60, reg=0x6d0150 "chr19:1-2000000", getid=0x433260 <bam_name2id>, hdr=, itr_query=0x431e80 <hts_itr_query>, readrec=<optimized out>) at hts.c:1532

#2 0x000000000040c914 in bam_access_get_by_position_counts ()
#3 0x0000000000475d53 in gen_panel_generate_pileups_for_segment ()
#4 0x000000000040b4cd in main ()

Caused by not reading the header via sam_hdr_read.

Split bug (sparse data edge-case)

To reproduce you need 3 conditions (occurs under 1.11.2, unlikely fixed in 1.12.0):

  1. Chromosome with ignored region
  2. Range at start of chromosome where tumour has no coverage prior to the ignored region
  3. Same range does have coverage in normal

e.g.

Ignored regions:

$ grep '^21' /lustre/scratch112/sanger/cgppipe/canpipe/live/ref/human/GRCh37d5/genome.gap.tab
21	9698543	9700123	4	"..."

Tumour read count 21:1-9698543 (correct flags)

$ samtools view -c /.../Tumour.sample.dupmarked.bam 21:1-9698543
0

Normal read count in 21:1-9698543 (correct flags)

$ samtools view -c /nfs/cancer_ref01/nst_links/live/1751/PD29727b/PD29727b.sample.dupmarked.bam 21:1-9698543
4822

This results in a split file for chr 21 of:

$ cat tmpCaveman/splitList.21
21	0	0
21	0	48129895

When this is merged and the mstep subsequently attempts to use it, the following error is raised:

[ERROR] (src/mstep.c: mstep_main:193 errno: None) Error fetching region from split file.

This is raised on validation of the split record as it expects the stop value to never be 0. Looks to be a straightforward fix.
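Until that fix lands, affected chromosomes can be spotted by scanning the per-contig split files for records whose stop column is 0 (columns as in the example above: chrom, start, stop):

# Sketch: flag split records with a zero stop value, which trip the
# mstep validation described above.
awk '$3 == 0 {print FILENAME ": " $0}' tmpCaveman/splitList.*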

generateCavemanUMNormVCF - command line help has errors

@drjsanger, it looks like some of this is a bit squiffy. In the usage, the program name doesn't match, and some of the param short names don't appear in the description area.

$ generateCavemanUMNormVCF -h
Usage: generateCavemanVCFUnmatchedNormalPanel -i jobindex [-f path] [-c int] [-m int] [-e int] 

-i  --index [int]                 Job index (e.g. from $LSB_JOBINDEX)

-b  --bam-files [file]            Path to the bam files to be analysed (comma separated list, in the same order as sample-names).
-o  --out-file [file]             Filename to write results to.
-r  --reference [file]            Path to reference file (genome.fa).
-l  --split-sections [file]       Path to the bed file of split sections to analyse.
-s  --species [string]            Species name (used in VCF output).
-v  --spp-vers [string]           Species version (used in VCF output).
-n  --sample-names [string]       Name of samples being analysed (comma separated list, in the same order as bam-files).
-p  --platform [string]           Platform sample was generated on (Illumina). [default:Illumina]
-t  --protocol [string]           Protocol used to generate sample data (WGS|WXS). [default:WGS]
Optional
-q  --min-base-qual [int]         Minimum base quality [default:0]
-h  help                          Display this usage information.

estep issue with version 1.13.3 - fai_access Error

@drjsanger
Hi David,

I just installed version 1.13.3 and reran it with the exact same inputs as the previous version, 1.13.2, which had the issue with the estep and the zlib buffer.

All the steps up to the merge step ran without error, but the estep raised new errors.
After running the following command:

for lineid in $(seq 1 22); do caveman estep -i ${lineid} -f caveman.cfg.ini -l WXS -r WXS -w Human -g ./covs_arr -o ./probs_arr; done

I unfortunately got the following errors:

********** >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
[ERROR] (src/fai_access.c: fai_access_get_count_length_all_contigs:84 errno: No such file or directory) Wrong number of entries (1) found in fasta index file line >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
[ERROR] (src/estep.c: estep_main:519 errno: None) Error establishing contig count and name length.

I also tried just caveman estep -i 22 to simplify the command, and got the exact same error.

The list of contigs I gave you earlier (#77) is the right list of contigs seen in the BAM header, but based on that error message I thought the system did not recognize the contigs in the reference genome fasta file, because annotations are added after the contig name in the lines starting with the ">" character.

Below I have attached 1) the lines from the fasta file where the contigs are defined, and 2) the associated .fai file.

As I thought the annotation could be the problem, I created a reference genome without the annotation in the contig lines.
My .fai file was created using the samtools faidx ${REF_GENOME_FASTA} command.

Unfortunately, I got the same error for ALL the jobids that ran with the estep module.
********** >1
[ERROR] (src/fai_access.c: fai_access_get_count_length_all_contigs:84 errno: None) Wrong number of entries (1) found in fasta index file line >1
[ERROR] (src/estep.c: estep_main:519 errno: None) Error establishing contig count and name length.
In order to reduce runtime, I subsetted the processing to only two contigs, 21 and 22; I got the exact same error as described right above.

Let me know if you have questions or if you need further information.

hs37d5_contigs.zip
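For comparison, a samtools-generated .fai has five tab-separated fields per contig (name, length, offset, line bases, line width) and never contains a '>' character, so a quick sanity check on the index being handed to CaVEMan looks like this (genome.fa stands in for the actual reference path):

# Sketch: regenerate the index and confirm every line has 5 fields.
samtools faidx genome.fa
awk -F'\t' 'NF != 5 {print "suspect line " NR ": " $0}' genome.fa.fai
head -3 genome.fa.fai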

mstep - covs arrays could be optimised to reduce I/O for little CPU cost

At present each mstep element creates an 8.8 MB file, but the fastest compression level reduces this to 753K (on a whole genome that's ~52 GB -> 4 GB), e.g.

$ gzip -1 -c tmpCaveman/cov_arr > t
$ ls -lh tmpCaveman/cov_arr t
-rw-rw-r-- 1 ubuntu ubuntu 753K Mar 10 18:51 t
-rw-r--r-- 1 ubuntu ubuntu 8.8M Mar 10 07:27 tmpCaveman/cov_arr

Obviously the later steps will need to decompress, but the overhead at gzip level 1 is minuscule.

Deleting the individual covs arrays once they are successfully merged would also be beneficial to the overall tmp footprint.
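A rough sketch of that clean-up, assuming the default config name and the results/<chr>/<start>_<stop>.covs layout seen in a standard run, and only deleting when the merge exits successfully:

# Sketch: remove per-section cov arrays only after a successful merge.
caveman merge -f caveman.cfg.ini && rm -f results/*/*.covs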

Error opening file to read cov array

Hi,
I got the error described below at the merge step.

caveman setup -t $tumourBam -n $normalBam -r $REF -g $IGN

For the test run, I used -i 1:

caveman split -i 1 -f $INI
mv splitList.1 splitList
caveman mstep -f $INI -i 1

Created results directory './results'.
Found chr: 1 of length: 249250621 at index 1
Found 0 ignored regions for chromosome 1.
Looking at section 1:1-928895 for mstep
M-stepping section 1:1-464448 for mstep
fetched a reference seq of length 464448 for this section.
M-stepping section 1:464449-928895 for mstep
fetched a reference seq of length 464447 for this section.
Created directory ./results/1.

I got this error from the merge step for the covs files:

caveman merge -f $INI

[ERROR] (src/covs_access.c: covs_access_read_covs_from_file:167 errno: No such file or directory) Error opening file to read cov array: ./results/1/928896_1222529.covs.

When I checked the folder, the file "928896_1222529.covs" could not be found in the "results" folder; only the file "1_928895.covs" had been generated.

-bash-4.1$ ls results/1/
1_928895.covs

I also checked the splitList. The splitList has 502 lines:

1	0	928895
1	928895	1222529
1	1222529	1527127
1	1527127	1893065
1	1893065	2189793
1	2189793	2464429
1	2464429	2763184

It seems that the mstep didn't generate the other covs files. Do you have any idea how to solve this error?

Suimye
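For anyone else hitting this: the merge expects one covs file per line of the merged split list, so the mstep has to be run for every index, not just -i 1. A minimal sketch assuming the default config and split list names:

# Sketch: one mstep job per split list entry, then merge.
N=$(wc -l < splitList)
for i in $(seq 1 "$N"); do caveman mstep -f caveman.cfg.ini -i "$i"; done
caveman merge -f caveman.cfg.ini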

Estep memory explosion to >50GB

Found that a closely grouped set of split regions was causing a memory issue. As this hadn't been seen in our internal version, I investigated the copy number in that region.

It turns out the copy number was 58 in the tumour. The internal system caps copy number at 10 during the construction of the CN input files to prevent this. As this completely prevents caveman from running, it probably makes more sense to cap it within the algorithm.

I'll add the change in a branch, but feel free to argue against doing it in the code.

Version checking between steps

Add a check in each step to find the version number in the alg bean and compare it to the current binary version. If they differ, don't permit running.

check BAM headers in setup for readgroups

A user attempted to run caveman using a BAM with no readgroup header lines (or RG tags on reads). Check that the header has some during the setup step, as otherwise it doesn't fail until the mstep.
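Until such a check exists in setup, a quick pre-flight test for the condition described (the bam paths are placeholders):

# Sketch: warn about BAMs with no @RG header lines before running setup.
for bam in tumour.bam normal.bam; do
  rg=$(samtools view -H "$bam" | grep -c '^@RG' || true)
  [ "$rg" -eq 0 ] && echo "WARNING: $bam has no @RG header lines"
done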

1.13.5 - mstep errors - negative positions

As per section 3 of the following comment on issue #82

As another test, I also tried it using ALL the contigs listed in the BAM (in SQ lines). I noticed that the mstep is actually looking for non-existent contigs, which did not happen in item 3 above when using only chr21 and chr22.

Example of errors with the mstep, which impact the merge and estep:

M-stepping section 4:181586940-185010141 for mstep
fetched a reference seq of length 3423202 for this section.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region X:-1365916695-144904260.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region 19:-1591198826-38028818.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region 16:-1844986504-74660215.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region ERCC-00053:-1104487891-1023.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region 16:-1844986504-50701913.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region 20:-1531084355-44680223.
[ERROR] (src/fai_access.c: fai_access_get_ref_seqeuence_for_pos:108 errno: No such file or directory) Error fetching reference sequence for region X:-1365916695-128699773.
Looking at section 1:28206191-29652276 for mstep
M-stepping section 1:28206191-249250621 for mstep

Failure to parse output VCF files due to invalid UTF-8

Dear CaVEMan team,

Thanks for the useful work.

I'm having issues parsing the output files (e.g. ./results/chr1/1_249250621.snps.vcf).
I use the readVcf method of the VariantAnnotation Bioconductor package.
The method fails to parse any of the output VCF files, with the following messages:

Error in FUN(X[[i]], ...) : subscript out of bounds
In addition: Warning messages:
1: In strsplit(keyval0[!idx], ",(?=[[:alnum:]]+=)", perl = TRUE) :
  input string 2 is invalid UTF-8
2: In strsplit(keyval0[!idx], ",(?=[[:alnum:]]+=)", perl = TRUE) :
  input string 3 is invalid UTF-8
3: In strsplit(keyval0[!idx], ",(?=[[:alnum:]]+=)", perl = TRUE) :
  input string 4 is invalid UTF-8
4: In strsplit(keyval0[!idx], ",(?=[[:alnum:]]+=)", perl = TRUE) :
  input string 5 is invalid UTF-8
5: In strsplit(keyval0[!idx], ",(?=[[:alnum:]]+=)", perl = TRUE) :
  input string 6 is invalid UTF-8

Upon examination, the files have a header that looks like this:

  • screenshot
  • paste from terminal (notice the garbled symbols near assembly and species):
##fileformat=VCFv4.1
##fileDate=20170619
##reference=/gpfs2/well/ratcliff/pipeline/reference_data/genomes/hg19/hg19.fa
##vcfProcessLog=<InputVCF=<.>,InputVCFSource=<CaVEMan>,InpuVCFVer=<"1.11.2">,InputVCFParam=<NORMAL_CONTAMINATION=0,REF_BIAS=0.8,PRIOR_MUT_RATE=6e-06,PRIOR_SNP_RATE=6e-06,SNP_CUTOFF=0.95,MUT_CUTOFF=0.8>>
##cavemanVersion=1.11.2
##contig=<ID=chr1,length=249250621,assembly=x<E1>X<BF>5,species=x<E1>X<BF>5>
##contig=<ID=chr2,length=243199373,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr3,length=198022430,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr4,length=191154276,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr5,length=180915260,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr6,length=171115067,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr7,length=159138663,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chrX,length=155270560,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr8,length=146364022,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr9,length=141213431,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr10,length=135534747,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr11,length=135006516,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr12,length=133851895,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr13,length=115169878,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr14,length=107349540,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr15,length=102531392,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr16,length=90354753,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr17,length=81195210,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr18,length=78077248,assembly=^X<E2>X<BF>5,species=^X<E2>X<BF>5>
##contig=<ID=chr20,length=63025520,assembly=^X<E2>X<BF>5,species=8<E2>X<BF>5>
##contig=<ID=chrY,length=59373566,assembly=<A8><E2>X<BF>5,species=x<E1>X<BF>5>
##contig=<ID=chr19,length=59128983,assembly<E3>X<BF>5,species=x<E1>X<BF>5>
##contig=<ID=chr22,length=51304566,assembly=,species=>
##contig=<ID=chr21,length=48129895,assembly=,species=>
##contig=<ID=chr6_ssto_hap7,length=4928567,assembly=,species=>
##contig=<ID=chr6_mcf_hap5,length=4833398,assembly=,species=>
##contig=<ID=chr6_cox_hap2,length=4795371,assembly=,species=>
##contig=<ID=chr6_mann_hap4,length=4683263,assembly=,species=>
##contig=<ID=chr6_apd_hap1,length=4622290,assembly=,species=>

Thanks in advance for advising/fixing!

Best wishes,
Kevin

Can the --ignore-regions-file flag be used to ignore all but the chromosome of interest?

We are trying to speed up CaVEMan on some very large bam pairs; the tumour bam is greater than 200 GB. Is it possible to use the -g flag to include many regions to ignore, so that we can call variants on a single chromosome at a time?

We are actually trying a massively parallel approach on AWS via cgc.sbgenomics.com (NCI's cancer cloud pilot).

Thank you

Dave
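If it's useful, one way to try this is to generate the ignore-regions file from the reference .fai so that it covers every contig except the one being called. The column layout below (contig, 1-based start, end) is an assumption based on the "one based bed style" description, so verify it against the documented ignore-regions format first:

# Sketch: ignore everything except a single chromosome of interest.
KEEP=1                         # contig to analyse
FAI=reference.fa.fai
awk -v keep="$KEEP" 'BEGIN{OFS="\t"} $1 != keep {print $1, 1, $2}' "$FAI" > ignore_all_but_${KEEP}.tab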

ERROR of esteps in 1.10.0

I use python3 to run many esteps at the same time. Some errors were shown to me, but I can still get the VCF files. See the attachment. Ubuntu 16.04, x86-64.
2016-11-15 17-52-51

split.c - cli options parsed but never used

Primarily, should they just be ignored and --read-count be used on its own?

Two of the variables indicated as being used in the refinement of split size don't appear to be used:

Search for increment (increment)

  • Declared as default
  • Updated via optarg if present
  • Variable never used

Search for maxPropRdCount (--max-read-count)

  • Declared as default
  • Updated via optarg if present
  • Variable never used
