gregoryfaust / samblaster Goto Github PK

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.

License: MIT License

Makefile 0.63% C++ 99.37%

samblaster's Introduction

samblaster

Written by: Greg Faust ([email protected]) Ira Hall Lab, University of Virginia

Please cite: Faust, G.G. and Hall, I.M., “SAMBLASTER: fast duplicate marking and structural variant read extraction,” Bioinformatics Sept. 2014; 30(17): 2503-2505.

Also see: SAMBLASTER_Supplemental.pdf for additonal discussion and statistics about the duplicates marked by samblaster vs. Picard using the NA12878 sample dataset. Click the preceeding link or download the file from this repository.

Current version: 0.1.26

Support for Linux and OSX (Version 10.7 or higher).

Summary

samblaster is a fast and flexible program for marking duplicates in read-id grouped¹ paired-end SAM files. It can also optionally output discordant read pairs and/or split read mappings to separate SAM files, and/or unmapped/clipped reads to a separate FASTQ file. When marking duplicates, samblaster will require approximately 20MB of memory per 1M read pairs.

Installation

samblaster is self contained and therefore has no installation dependencies beyond g++ and make.

samblaster can be downloaded from the releases tab or manually downloaded via git clone. Afterwards, simply use make and copy samblaster to a directory in your path. For example:

git clone git://github.com/GregoryFaust/samblaster.git
cd samblaster
make
cp samblaster /usr/local/bin/.

Usage

See the SAM File Format Specification for details about the SAM alignment format.

By default, samblaster reads SAM input from stdin and writes SAM to stdout. Input SAM files usually contain paired end data (see Duplicate Identification below), must contain a sequence header, and must be read-id grouped¹. By default, the output SAM file will contain all the alignments in the same order as the input, with duplicates marked with SAM FLAG 0x400. The --removeDups option will instead remove duplicate alignments from the output file.

¹A read-id grouped SAM file is one in which all alignments for a read-id (QNAME) are grouped together in adjacent lines. Aligners naturally produce such files. They can also be created by sorting a SAM file by read-id. But as shown below, sorting the input to samblaster by read-id is not required if the alignments are already grouped.

COMMON USAGE SCENARIOS:

To take input alignments directly from bwa mem and output to samtools view to compress SAM to BAM:

bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster | samtools view -Sb - > samp.out.bam

When using the bwa mem -M option, also use the samblaster -M option:

bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M | samtools view -Sb - > samp.out.bam

To additionally output discordant read pairs and split read alignments:

bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster -e -d samp.disc.sam -s samp.split.sam | samtools view -Sb - > samp.out.bam

To pull split reads and discordants read pairs from a pre-existing BAM file with duplicates already marked:

samtools view -h samp.bam | samblaster -a -e -d samp.disc.sam -s samp.split.sam -o /dev/null

To process a BAM file of singleton long reads to pull split and/or unmapped reads with/without marking duplicates:

samtools view -h samp.bam | samblaster --ignoreUnmated [-e] --maxReadLength 100000 [-s samp.split.sam] [-u samp.umc.fasta] | samtools view -Sb - > samp.out.bam
samtools view -h samp.bam | samblaster --ignoreUnmated -a [-e] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

OPTIONS: Default values enclosed in square brackets []

Input/Output Options:
-i --input           FILE Input sam file [stdin].
-o --output          FILE Output sam file for all input alignments [stdout].
-d --discordantFile  FILE Output discordant read pairs to this file. [no discordant file output]
-s --splitterFile    FILE Output split reads to this file abiding by paramaters below. [no splitter file output]
-u --unmappedFile    FILE Output unmapped/clipped reads as FASTQ to this file abiding by parameters below. [no unmapped file output].
                          Requires soft clipping in input file.  Will output FASTQ if QUAL information available, otherwise FASTA.

Other Options:
-a --acceptDupMarks       Accept duplicate marks already in input file instead of looking for duplicates in the input.
-e --excludeDups          Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
-r --removeDups           Remove duplicates reads from all output files. (Implies --excludeDups).
   --addMateTags          Add MC and MQ tags to all output paired-end SAM lines.
   --ignoreUnmated        Suppress abort on unmated alignments. Use only when sure input is read-id grouped,
                          and either paired-end alignments have been filtered or the input file contains singleton reads.
                          --ignoreUnmated is not recommended for general use on paired-end data. It disables checks that detect incorrectly sorted input.
-M                        Compatibility mode (details below); both FLAG 0x100 and 0x800 denote supplementary (chimeric). Similar to bwa mem -M option.
   --maxReadLength    INT Maximum allowed length of the SEQ/QUAL string in the input file. [500]
   --maxSplitCount    INT Maximum number of split alignments for a read to be included in splitter file. [2]
   --maxUnmappedBases INT Maximum number of un-aligned bases between two alignments to be included in splitter file. [50]
   --minIndelSize     INT Minimum structural variant feature size for split alignments to be included in splitter file. [50]
   --minNonOverlap    INT Minimum non-overlaping base pairs between two alignments for a read to be included in splitter file. [20]
   --minClipSize      INT Minumum number of bases a mapped read must be clipped to be included in unmapped file. [20]

-h --help                 Print samblaster help to stderr.
-q --quiet                Output fewer statistics.
   --version              Print samblaster version number to stderr.

ALIGNMENT TYPE DEFINITIONS: Below, we will use the following definitions for alignment types. Starting with samblaster release 0.1.22, these definitions are affected by the use of the -M option. By default, samblaster will use the current definitions of alignment types as specified in the SAM Specification. Namely, alignments marked with FLAG 0x100 are considered secondary, while those marked with FLAG 0x800 are considered supplementary. If the -M option is specified, alignments marked with either FLAG 0x100 or 0x800 are considered supplementary, and no alignments are considered secondary. A primary alignment is always one that is neither secondary nor supplementary. Only primary and supplementary alignments are used to find chimeric (split-read) mappings. The -M flag is used for backward compatibility with older SAM/BAM files in which "chimeric" alignments were marked with FLAG 0x100, and should also be used with output from more recent runs of bwa mem using its -M option.

DUPLICATE IDENTIFICATION: A duplicate read pair is defined as a pair that has the same signature for each mapped read as a previous read pair in the input SAM file. The signature is comprised of the combination of the sequence name, strand, and the reference offset where the 5' end of the read would fall if the read were fully aligned (not clipped) at its 5' end. The 5' aligned reference position is calculated using a combination of the POS field, the strand, and the CIGAR string. This definition of signature matches that used by Picard MarkDuplicates.

For pairs in which both reads are mapped, both signatures must match.
For pairs in which only one side is mapped (an "orphan"), the signature of the mapped read must match a previously seen orphan. Starting with samblaster 0.1.25, forward and reverse strand orphans/singletons are allowed to be duplicates of each other. In an orphan pair, the unmapped read need not appear in the input file. In addition, mapped non-paired singleton (possible long) read alignments will be treated the same as an orphan pair with a missing unmapped read.
No doubly unmapped pair will be marked as a duplicate.
Any secondary or supplementary alignment associated with a duplicate primary alignment will also be marked as a duplicate.

DISCORDANT READ PAIR IDENTIFICATION: A discordant read pair is one which meets all of the following criteria:

Both side of the read pair are mapped (neither FLAG 0x4 or 0x8 is set).
The properly paired FLAG (0x2) is not set.
Secondary or supplementary alignments are never output as discordant, although a discordant read pair can have such alignments associated with them.
Duplicate read pairs that meet the above criteria will be output as discordant unless the -e option is used.

SPLIT READ IDENTIFICATION: Split Read alignments are derived from a single read when one portion of the read aligns to a different region of the reference genome than another portion of the read. Such pairs of alignments often define a structural variant (SV) breakpoint, and are therefore useful input to SV detection algorithms such as LUMPY. samblaster uses the following strategy to identify split reads alignments.

Identify reads that have between two and --maxSplitCount primary and supplementary alignments.
Sort these alignments by their strand-normalized position along the read.
Two alignments are output as splitters if they are adjacent on the read, and meet these criteria:
- each covers at least --minNonOverlap base pairs of the read that the other does not.
- the two alignments map to different reference sequences and/or strands.
- the two alignments map to the same sequence and strand, and represent a SV that is at least --minIndelSize in length, and have at most --maxUnmappedBases of un-aligned base pairs between them.
Split read alignments that are part of a duplicate read will be output unless the -e option is used.

UNMAPPED/CLIPPED READ IDENTIFICATION: An unmapped or clipped read is a primary alignment that is unaligned over all or part of its length respectively. The lack of a full alignment may be caused by a SV breakpoint that falls within the read. Therefore, samblaster will optionally output such reads to a FASTQ file for re-alignment by a tool, such as YAHA, geared toward finding split-read mappings. samblaster applies the following strategy to identify and output unmapped/clipped reads:

An unmapped read has the unmapped read FLAG set (0x4).
A clipped read is a mapped read with a CIGAR string that begins or ends with at least --minClipSize unaligned bases (CIGAR code S and/or H), and is not from a read that has one or more supplementary alignments.
In order for samblaster to output the entire sequence for clipped reads, the input SAM file must have soft clipped primary alignments.
samblaster will output unmapped/clipped reads into a FASTQ file if QUAL information is available in the input file, and a FASTA file if not.
Unmapped/clipped reads that are part of a duplicate read pair will be output unless the -e option is used.

samblaster's People

Stargazers

Watchers

samblaster's Issues

Support for BED files?

I don't see anything in the documentation about support for applying Samblaster to specific regions of a genome using something like a BED file. Is this a feature that could be integrated easily?

remove duplicate for single-end read

Hi, I saw a repo used Samblaster for removing single-end duplicate(Repo link).

Is Samblaster suitable for removing single-end reads? If so, does it work similar to Picard-MarkDuplicates?

Thank you for your time.
Ji

output file option segmentation fault

Hi,

When running samblaster, specifying an input sam file (sorted by read group, with header as specified) I get a segmentation fault after about 1 minute of running:

$ samblaster -i Sample_PI574_486.wdups.sorted_by_readname.sam --output Sample_PI574_486.nodups.sorted_by_readname.sam

samblaster: Version 0.1.23
samblaster: Opening Sample_PI574_486.wdups.sorted_by_readname.sam for read.
samblaster: Opening Sample_PI574_486.nodups.sorted_by_readname.sam for write.
samblaster: Loaded 20 header sequence entries.
Segmentation fault

Interestingly, when I don't specify an output file it writes directly to stdout fine
Also, when I pipe the output from stdout into samtools view, it also works:

This command ran to completion:
$samblaster -i Sample_PI574_486.wdups.sorted_by_readname.sam | /projects/bioinformatics/builds/samtools-1.3.1/samtools-1.3.1/samtools view -Sb > Sample_PI574_486.wdups.sorted_by_readname.bam

It segfaults only when using -o or --output. The permissions are all open, I have plenty of disc space available.

after extracting disc and split reads, does the final bam still contain them?

bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster -e -d samp.disc.sam -s samp.split.sam | samtools view -Sb - > samp.out.bam

Does samp.out.bam still contain the discordants and split reads?

Thanks,
Ming

how do samblaster define a split read?

Dear @GregoryFaust

Samblaster is a great tool to identify split reads and discordant read pairs.
I wonder to know how samblaster to define split reads?
In addition, if a split read must be flag with 2048?
Could you help me?

kind regards
Zheng Wang

can samblaster tackle reads with UMI?

threading

Silly question: would it be feasible to thread samblaster? (i.e., would it be worth my taking a whack at it)

Reason I ask is that it's choking on biscuit (https://github.com/zwdzwd/biscuit) output when I run biscuit with loads of threads (40-56) and perhaps it's possible to overcome this one way or another. I hate the idea of sorting twice, and I hate the reality even worse. But with enough reads we choke samblaster.

I'm willing to help with threading, but I was wondering if the authors can immediately point out a choke point that makes this a dumb and pointless idea. This is not meant as a feature request per se!

Makefile error OS X

I'm getting an error when trying to install samblaster. Our IT specialist thinks it could be an OS X compatibility issue as I have an older version of OS X (10.6.8):

g++ -Wall -O3 -D BUILDNUM=21 -c samblaster.cpp -o samblaster.o
samblaster.cpp: In function ‘splitLine_t* readLine(FILE_)’:
samblaster.cpp:326: error: ‘getline’ was not declared in this scope
make: *_* [samblaster.o] Error 1

Thanks in advance.

Read pair detection failure from biscuit-aligned reads

I was just trying to run Samblaster on some output from Biscuit, and was given the following error:

samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 66 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: SRR892990.1.1
samblaster:    At location: 5:46377045
samblaster:    Are you sure the input is sorted by read ids?

A small test dataset to replicate the issue may be found here.

Add support for N opcode

I think this could be supported by simply using else if (opCode == 'D' || opcode == 'N') instead of else if (opCode == 'D') in line 647 of samblaster.cpp.

What do you think?

Duplicates were removed but log says 0 dups removed

Hi @GregoryFaust

We are executing samblaster 0.1.26 using below command

samblaster_0.1.26.simg samblaster --ignoreUnmated  -a -r -i input.sam -o output.sam

samblaster log

samblaster: Version 0.1.26
samblaster: Opening input.sam for read.
samblaster: Opening output.sam for write.
samblaster: Loaded 25 header sequence entries.
samblaster: Found        62867 of    2402373 (2.617%) total read ids are marked paired yet are unmated.
samblaster: Please double check that input file is read-id (QNAME) grouped.
samblaster: Found          838 of    2402373 (0.035%) total read ids with no primary alignment.
samblaster: Please double check that input file is read-id (QNAME) grouped.
samblaster: Removed          0 of    2402373 (0.000%) total read ids as duplicates using 18128k memory in 1.911S CPU seconds and 3S wall time.

Flagstats of input sam

4744272 + 0 in total (QC-passed reads + QC-failed reads)
4423 + 0 secondary
0 + 0 supplementary
2178640 + 0 duplicates
4743484 + 0 mapped (99.98% : N/A)
4739849 + 0 paired in sequencing
2372230 + 0 read1
2367619 + 0 read2
4724503 + 0 properly paired (99.68% : N/A)
4737919 + 0 with itself and mate mapped
1142 + 0 singletons (0.02% : N/A)
3131 + 0 with mate mapped to a different chr
3064 + 0 with mate mapped to a different chr (mapQ>=5)

Flagstats of output sam post executing samblaster

2565632 + 0 in total (QC-passed reads + QC-failed reads)
2740 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2564846 + 0 mapped (99.97% : N/A)
2562892 + 0 paired in sequencing
1283161 + 0 read1
1279731 + 0 read2
2551398 + 0 properly paired (99.55% : N/A)
2560969 + 0 with itself and mate mapped
1137 + 0 singletons (0.04% : N/A)
2585 + 0 with mate mapped to a different chr
2525 + 0 with mate mapped to a different chr (mapQ>=5)

Questions

Why does samblaster show this warning ? samblaster: Please double check that input file is read-id (QNAME) grouped. . Our input sam is already QNAME sorted. We verified that by checking SAM header SO (sort order)
Why does samblaster say that 0 duplicates removed while we can see that duplicates were indeed removed as shown by samtools flagstat

samblaster: Removed          0 of    2402373 (0.000%) total read ids as duplicates using 18128k memory in 1.911S CPU seconds and 3S wall time.

Write duplication metrics to file

Thanks for a great tool!

It would be nice if it was possible to write the output metrics to a separate file like so:

bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M -m metrics.txt | samtools view -Sb - > samp.out.bam

The new part being -m metrics.txt. A structured format would be nice, tsv probably easiest to parse downstream.

Current output to stderr: samblaster: Removed 1229 of 4596 (26.74%) read ids as duplicates using 1324k memory in 0.000S CPU seconds and 0S wall time. . My suggestion would be something like this:

TOTAL_READS   DUPLICATES_REMOVED   MEMORY_USED   CPU_SECONDS   WALL_TIME
4596          1229                 1324k         0.000         0

Aligning in Biscuit with samblaster and samtools

Hi,
I am getting an output error relating to samblaster when trying to align to me reference genome:

samblaster: Inputting from stdin
samblaster: Outputting to stdout
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[gzread] (null)
samblaster: Loaded 172124 header sequence entries.
samblaster: No reads in input SAM file

This is my code:
module load gcc
module load bowtie2/2.4.1
module load samtools/1.16.1
module load samblaster/0.1.26
module load perl/5.30.2

biscuit align /home/chadders/projects/def-frasiert/methyl_seq/ref_genome/Eubalaena_glacialis_HiC.fasta 1151_B_S11_L001-FP.fastq 1151_B_S11_L001-RP.fastq |
samblaster | samtools sort -o my_output.bam -O BAM -

samtools index my_output.bam

I am super new to this and not sure if I have made a misstep in creating the reference index, but the outputs for this all looked ok. I know the fastq files are ok as they ran fine in bismark.

Thanks

Question about "A read-id grouped SAM file"

Hi all,

I have a SAM file which sorted by coordinate. In order to use samblaster, I use the following samtools command to sort the SAM file by read-id (QNAME)

samtools sort -n sample.bam sample.QNAME.sorted.bam

After I execute the command, the SO in @hd tag is queryname. But I still have problems of using samblaster. The problem is listed below:

samblaster: Can't find first and/or second of pair in sam block of length 1 for id:
samblaster: At location: chr20:50456848
samblaster: Are you sure the input is sorted by read ids?

Really appreciate if you could provide any suggestions and looking forward to discuss.

Thanks.

Filter out soft/hard clipped reads in a pipe SAM -> SAM ?

I was looking for a fast tool to put into a Unix pipe between bwa and samtools which would remove any alignments with S or H CIGAR clipping, as opposed to extracting said reads.

It doesn't seem samblaster can do this?

spitters : I don't get understand the code :-)

cross posted on biostars : https://www.biostars.org/p/292925/

Hi,
Could you please explain your code here : https://github.com/GregoryFaust/samblaster/blob/master/samblaster.cpp#L1125 (with a figure ?)

i don't really understand what is the logic when handling the clipped reads and how you're handling the INS or the DELS. As far as I understand the code, the start/end of a read can be inverted if the read is on '+' or '-' strand (?) and it confuses me.

Thank you ! :-)

Can't find first and/or second of pair in sam block of length 1 for id:...

Hi, I've recently downloaded SAMBLASTER for retrieving softclipped, split, and unmapped reads from whole genome sequencing data. At the moment I'm practicing on chromosome 17 of NA12878 (available from GIAB).

I've extracted chr17, generated a bam file and indexed this all using samtools. I ran

samtools view -h input.bam | samblaster -a -e -d dup.sam -s split.sam -u unmap.sam -o /dev/null

In this first instance it runs, but I then get the following messgaes to my screen

samblaster: Inputting from stdin
samblaster: Opening /dev/null for write.
samblaster: Opening 17dup.sam for write.
samblaster: Opening 17soft.sam for write.
samblaster: Opening 17unmap.sam for write.
samblaster: Loaded 86 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: H06HDADXX130110:1:1101:3355:34950
samblaster:    At location: 17:798227
samblaster:    Are you sure the input is sorted by read ids?

In checking, I don't think my original input.bam was sorted by QNAME, so I've done the following

samtools sort -n input.bam #fyi, I can't now index this, is that a problem?
samtools view -h input.bam | samblaster -a -e -d dup.sam -s split.sam -u unmap.sam -o /dev/null

In re-running this, I get the same error messages as before BUT there are output .sam files in my current directory. Has Samblaster quit with only a part of the data in these -d -u -s files, or is that the final output? and how can I overcome these errors about my paired reads and sorting by read ID

Thanks!! :)

Too many sequences in header of input sam file

When the number of sequences inside SAM/BAM file is larger than 32768 samblaster will exit on the following location in the source(samblaster.cpp):

if (count >= (1 << 15))
        {
            fatalError("samblaster: Too many sequences in header of input sam file.  Exiting.\n");
        }

How can this error be caught when it is used piping with bwa mem and sambamba view/sort? In this scenario execution finishes fine(without an error), while the output BAM file is corrupted (contains 1.5MB, while input fastq files are ~50GB).

Handle chrM discordants differently

Since chrM is circular but is linearized in the reference for ease of alignments, reads that span the ends of chrM are incorrectly considered discordant. It would be nice to have the option to include these either as single-end reads in the main output or as split reads in the splitters file.

Segmentation fault on getting split reads from long read file?

Hi there, just following up on #42 I received this crash

 samtools view -h reads_lr_skbr3.fa_ngmlr-0.2.3_mapped.readsort.bam|/home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster --ignoreUnmated -s samp.split-sam -o /dev/null
samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Opening /dev/null for write.
samblaster: Opening samp.split-sam for write.
samblaster: Loaded 86 header sequence entries.
Segmentation fault (core dumped)

The segmentation fault happens after the samp.split-sam file is 3.2Gb

I ran it again with valgrind and got this

ubuntu@ip-172-31-43-125:/data2$ samtools view -h reads_lr_skbr3.fa_ngmlr-0.2.3_mapped.readsort.bam| valgrind /home/linuxbrew/.linuxbrew/Cellar/samblaster/0
.1.24/bin/samblaster --ignoreUnmated -s samp.split-sam -o /dev/null
==6489== Memcheck, a memory error detector
==6489== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==6489== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==6489== Command: /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster --ignoreUnmated -s samp.split-sam -o /dev/null
==6489==
samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Opening /dev/null for write.
samblaster: Opening samp.split-sam for write.
samblaster: Loaded 86 header sequence entries.
==6489== Invalid read of size 4
==6489==    at 0x40524D: hashTableInsert(hashTable*, unsigned long) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x4038A4: markDupsDiscordants(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x404216: processSAMBlock(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x401738: main (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==  Address 0x1005bec998 is 1,927,528 bytes inside a block of size 2,367,424 in arena "dinfo"
==6489==
==6489== Invalid read of size 4
==6489==    at 0x405250: hashTableInsert(hashTable*, unsigned long) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x4038A4: markDupsDiscordants(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x404216: processSAMBlock(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x401738: main (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==  Address 0x1005bec99c is 1,927,532 bytes inside a block of size 2,367,424 in arena "dinfo"
==6489==
==6489== Invalid read of size 8
==6489==    at 0x405272: hashTableInsert(hashTable*, unsigned long) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x4038A4: markDupsDiscordants(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x404216: processSAMBlock(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x401738: main (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==  Address 0x1005bec990 is 1,927,520 bytes inside a block of size 2,367,424 in arena "dinfo"
==6489==
==6489== Invalid read of size 8
==6489==    at 0x405318: hashTableInsert(hashTable*, unsigned long) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x4038A4: markDupsDiscordants(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x404216: processSAMBlock(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x401738: main (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==  Address 0x4466d9570f662424 is not stack'd, malloc'd or (recently) free'd
==6489==
==6489==
==6489== Process terminating with default action of signal 11 (SIGSEGV)
==6489==  General Protection Fault
==6489==    at 0x405318: hashTableInsert(hashTable*, unsigned long) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x4038A4: markDupsDiscordants(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x404216: processSAMBlock(splitLine*, state_struct*) (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==    by 0x401738: main (in /home/linuxbrew/.linuxbrew/Cellar/samblaster/0.1.24/bin/samblaster)
==6489==
==6489== HEAP SUMMARY:
==6489==     in use at exit: 14,692,245 bytes in 812 blocks
==6489==   total heap usage: 1,855 allocs, 1,043 frees, 28,039,720 bytes allocated
==6489==
==6489== LEAK SUMMARY:
==6489==    definitely lost: 0 bytes in 0 blocks
==6489==    indirectly lost: 0 bytes in 0 blocks
==6489==      possibly lost: 0 bytes in 0 blocks
==6489==    still reachable: 14,692,245 bytes in 812 blocks
==6489==         suppressed: 0 bytes in 0 blocks
==6489== Rerun with --leak-check=full to see details of leaked memory
==6489==
==6489== For counts of detected and suppressed errors, rerun with: -v
==6489== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

Not very good with reading stack traces like this but thought it could help

Can't find first and/or second of pair in sam block of length

Hi,

I try to use your software but I return the following sentence :

samblaster: Version 0.1.26
samblaster: Inputting from stdin
samblaster: Opening /dev/null for write.
samblaster: Opening samp.split.sam for write.
samblaster: Loaded 291 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: 31c20002-7721-47a8-88e3-9dc223225e9a
samblaster:    At location: Minc_v4_shac_contig_1:1
samblaster:    Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Output           0 split reads to samp.split.sam
samblaster: Marked           0 of          3 (0.000%) total read ids as duplicates using 4016k memory in 0.000S CPU seconds and 0S wall time.
samblaster: Premature exit (return code 1).```

Do you have some idea of how I can fixe it please ? 

Thank you

Option to remove split and disc alignments?

Thank you for this awesome tool! Is it possible to output a bam that does not have the split and discordant reads if specifying -d and -s with samblaster? Thanks again!

discordant read pairs and split reads

I want to know whether the output file of "--discordantFile" option includes the output file of "--splitterFile" option or not ?

Support multiple clipping operations at the beginning of a read

It's possible to have multiple hard clip and soft clip operations at the beginning of a read. Currently, I believe samblaster only parses the first of these as a "sclip" event and any subsequent as an end clip.

Here's a SAM example of the behavior. I believe 3/4 reads should be marked as duplicates but only 2 of 4 are.

@HD VN:1.3  SO:queryname
@SQ SN:1    LN:249250621
r1  99  1   15729   27  151M    =   16153   575 CAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT AAFFFKAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r1  147 1   16153   27  151M    =   15729   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
r2  1123    1   15734   27  5S146M  =   16153   575 CAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT AAFFFKAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r2  1171    1   16153   27  151M    =   15734   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
r3  1123    1   15734   27  5H146M  =   16153   575 CTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT  KAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A  NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r3  1171    1   16153   27  151M    =   15734   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27
r4  99  1   15734   27  2H3S146M    =   16153   575 GGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCCCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCGCCCAGCTT   AAFFFKAKKKKFA<F<FK,AAF7AF,FFFFKFF7AFFFFKKKKAKKFFFAA,<A<<FKFFAFF(<FAKKKKF<FFAFK<FKKKF,AAFKKKKKAFFFKKA,,AFK,7A,AFKAAFAF7<,<7AAFK7,,A,,,7,AFA,A,,,,((,<(,A NM:i:2  MD:Z:51T90C8    AS:i:141    XS:i:136    MQ:i:27
r4  147 1   16153   27  151M    =   15734   -575    GTGCAGCTTCTCCCTGCTTCCTCTCGAGCCCCCACCCTGCAGAGCCGGACCCCTGAGCTAGCCATGCTCTGACAGTCTGAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGATACAGC ,,,77,,,,,,,,KKK<<AF<77,,((F<FFA<,7,<,,AA,(7(((<,<F<A,,,<<,,<AKF77FF77AKA7KA<,,F7A,FAKAKKKKKKKKKFKKFFFKKKKKFAKK<KFAFA,,KKKFFAFA7KKFKKKKKFKKKFAAKKKFF7AA NM:i:6  MD:Z:6A1A16T19T32C66C5  AS:i:122    XS:i:122    MQ:i:27

unmated reads

Greetings,

I see that samblaster has an --ignoreUnmated flag. It seems to require that the bam file be sorted by query name. However, it seems like it also ought to work with a bam file resulting from

bwa mem ... | samtools view -b > blah.bam

From quickly looking at the source code, it seems like using the --ignoreUnmated flag on blah.bam should work since their query names are together.

What do you think? It would be nice to skip the samtools sort -n blah.bam step.

Support for UMIs

Hi There,
Thanks for samblaster, it's such a great tool. I wonder if you have considered supporting unique molecular identifiers at all? bcl2fastq supports adding them to the read name. So if a standard read name was:
@M01277:231:000000000-A8KH6:1:1101:21912:1442 1:N:0:0
then it might be like this when an ID is added
@M01277:231:000000000-A8KH6:1:1101:21912:1442:TTTCCT 1:N:0:0
If there are two barcodes then they are separated by a plus, e.g.
@M01277:231:000000000-A8KH6:1:1101:21912:1442:TTTCCT+AACCTT 1:N:0:0
(I think - I only have data from one-barcode source).

The additional condition would be to check in the read signature if the putative duplicates share the ID.

Nugen have an algorithm here https://github.com/nugentechnologies/nudup/blob/master/nudup.py but it has so much overhead it would be great if samblaster supported the IDs.

workaround for bam files with some singleton reads

Hi!

I am so glad samblaster exists! It makes sense of the chaos of bwa-mem, which has been the bane of my existence recently ... but oh-so-useful for SV calling.
One issue I have encountered, however, is when I have re-mapped reads extracted from a mapped bam in which some filtering has happened and there are no longer all the pairs present.
So, when I sort it by name for samblaster, there are some reads with no pairs and samblaster complains that the file might not be sorted by name.
If I may suggest adding an option that would allow to ignore that check and skip mateless reads, that would be awesome!!

Cheers,

Elizabeth

Is samblaster readgroup aware?

Hi,

Is samblaster read group aware? I am following GATK best practice.
https://software.broadinstitute.org/gatk/documentation/article.php?id=3060

Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files.

Picard Mark Duplicates handles that. want to know if samblaster handles that as well.
https://gatkforums.broadinstitute.org/gatk/discussion/6199/picard-mark-duplicates-handling-of-library-information

Thanks!
Tommy

Samblaster memory use

Hi there,
Thanks for the fantastic work on samblaster! I noticed during a run that memory use was somehow off:

35074 klrl262   20   0 44.9g  44g  904 S 42.9 17.8   0:47.14 samblaster

All my 4 runs showed memory use reaching 44g. Is this usual?

Edit: Version is 0.1.14

split reads detection parameters

Hi, I found samblaster very useful to detect the discordant and split reads.
However, I am not quite understand the meaning of parameters.

I would like to know which parameters I should modify if I want to detect overlapping reads with short split like the following representation:

The read length are 300 bp and they are paired end. Assuming 290bp is mapped on chr1 and the remaining 10bp is mapped on chr2. I found they cannot be detected using default parameteres. How can I detect them? Thanks.

------------chr1------|-----chr2-------------

             [====:==>
             <====:===]                     both reads split  

       [=======>
             <====:===]                     read 2 split but overlapping

             [====:==>
                    <========]             read 1 split but overlapping

Can not run MarkDup for very large bam file

If I have a very large bam file, the size of the bam file is 129GB. How should I set the parameters of markdup to run it fast? My server has 64 cores, 256G memory. No error is currently reported, but it seems to be just starting, but not running.

Please adapt the metrics

Please, consider the use of tabs to delimit between columns in the metrics. Will make life to parse easier :).
You also might want to reconsider the -m flag as proposed by #16 .
For the rest awesomeness!

Getting splitters BAM from long reads data?

Hi there,
I was curious if this tool works for getting splitters for a long read BAM file. I am currently running the steps to try it out but I was wondering if it's something I could recommend to other people (I have a visualization tool that would be ideal if it just got a BAM file with the splitters with everything else filtered out)

No license

Unless a specific license is added to this repository, the code falls into standard copyright, which means it cannot be distributed without the author's consent.

Odd output from samblaster

I'm getting very odd output from samblaster 0.1.17 (saw the same in v0.1.16). This is from a simple SE alignment against the human genome using Novoalign, so no read groups are assigned. In the following gist link I have two files, a simple samtools view -h file | samblaster run and a run w/o samblaster.

https://gist.github.com/cjfields/70d67adc346ece49d902

If it's not immediately apparent, blank lines and/or additional characters are added into the output. Any ideas on what could be going on? Needless to say, samtools chokes on this.

Pipe output to samblaster from bwa-mem2

Hi,

I am trying to upgrade my command from bwa to bwa-mem2. This command usually works.
bwa mem -M -R "@RG\tID:id\tSM:sample\tLB:lib" human_g1k_v37.fasta sample.1.fq sample.2.fq \ | samblaster -M --excludeDups --addMateTags --maxSplitCount 2 --minNonOverlap 20 \ | samtools view -S -b - \ > sample.bam

Once I start using bwa-mem2, the piping breaks.
bwa-mem2 mem -M -R "@RG\tID:id\tSM:sample\tLB:lib" human_g1k_v37.fasta sample.1.fq sample.2.fq \ | samblaster -M --excludeDups --addMateTags --maxSplitCount 2 --minNonOverlap 20 \ | samtools view -S -b - \ > sample.bam

samblaster: Version 0.1.26
samblaster: Inputting from stdin
samblaster: Outputting to stdout

Executing in AVX2 mode!!

SA compression enabled with xfactor: 8
[E::bwa_set_rg] no ID at the read group line

Important parameter settings:
BATCH_SIZE: 512
MAX_SEQ_LEN_REF: 256
MAX_SEQ_LEN_QER: 128
MAX_SEQ_LEN8: 128
SEEDS_PER_READ: 500
SIMD_WIDTH8 X: 32
SIMD_WIDTH16 X: 16
AVG_SEEDS_PER_READ: 64
samblaster: Input file is empty. Exiting.
samblaster: Premature exit (return code 1).

Do you have any idea to run samblaster from bwa-mem2 ??

Mate tags

Hey Greg,

I'm having an issue running Samblaster with "addMateTags" on long, 250 bp reads. It shows the following error and then dies partway through the file.

samblaster: New buffer length exceeds maximum while adding Mate tags.

Seems like it might be because the CIGAR string is longer than the number of bytes allocated for that field. Would it be easy to fix this? I can provide an example BAM if needed.

Add "mate cigar" SAM tag

For efficient structural variant genotyping in SpeedSeq, we need the CIGAR string of a read's mate to be stored on the same line. The reserved SAM tag for mate CIGAR is "MC" (http://sourceforge.net/p/samtools/mailman/message/32040174/, http://samtools.github.io/hts-specs/SAMv1.pdf).

Can you please add a feature to Samblaster to optionally output the mate CIGAR string with an MC tag (type Z)?

Example: MC:Z:71M30S

Duplicate `MC` tags in samblaster output

Starting in version 0.7.16, BWA produces SAM files containing MC tags. Running samblaster with --addMateTags on BWA's output then produces a output file with duplicate MC tags.

We would like to use samblaster to add MQ tags to BWA's output, but the duplicate MC tags are causing issues for downstream tools.

non-human genomes with many contigs

I use samblaster as part of a pipeline for human and non-human genome processing (it was suggested to me by by Aaron Quinlan who used it for his speed seek pipeline).

I have observed that non-human genomes with many contigs do not work with samblaster because of a hard limit in the code (as well as datatype length limits). Also when I have a job that is just below the contig limit, runs use a very large amount of RAM.

This occurs because the hash matrix you build uses contig-contig matches for binning (basically grows N^2). However, if you assign each contig an offset early on, you can bin them using those offsets instead and generate bin-bin matches for the hash rather than contig-contig matches. This both reduces RAM usage, speeds up runtimes on files with many contigs, and allows even genomes with hundreds of thousands of contigs to run without issue.

I have attached a samblaster.cpp file with the changes I mentioned. For human genomes, there is no penalty for doing this extra processing step (runtimes stay the same with the bin-bin matrix vs the contig-contig matrix).

I also changes the position datatype to a signed 32bit int rather than an unsigned one because it can apparently be negative when there are softclips near the stat of the alignments (not an issue in human because of NNN padding in the assembly, but it is an issue in other species).

I hope these changes can be integrated into samblaster as it will open it's usage up to non-human species and greatly reduce RAM usage for those genomes
samblaster.cpp.txt

walltime is twice cpu seconds

First thanks for such a nice tool.

From samblaster's stderr:

samblaster: Marked   260861786 of 1716036289 (15.201%) total read ids as duplicates using 26070192k memory in 1H6M37S(3996.760S) CPU seconds and 2H28S(7228S) wall time.

Any clue why could wall time be 2X cpu time? I'm writing read by read to stdin from python, perhaps I should buffer and write big chunks at the same time? By the way, I'm 100% sure IO is not the issue

More info for missing pair

@GregoryFaust ,

I'm trying to track down a problem with grouping reads. The error I get:

samblaster: Can't find first and/or second of pair in sam block of length 1 for id: xxxxx

It would be exceptionally helpful if the error reported the chromosome and position of the read so they could be quickly pulled for examination.

Thank you.

Alpine-based docker build fails

Hi!

I am building v0.1.26 under Alpine based docker and get the following error:
g++ -Wall -O3 -D BUILDNUM=26 -c samblaster.cpp -o samblaster.o

samblaster.cpp: In function 'UINT64 diffTVs(timeval*, timeval*)':
samblaster.cpp:107:20: error: invalid use of incomplete type 'struct timeval'
  107 |     return (((endTV->tv_sec - startTV->tv_sec) * 1000000) + (endTV->tv_usec - startTV->tv_usec));
      |                    ^~
samblaster.cpp:105:31: note: forward declaration of 'struct timeval'
  105 | inline UINT64 diffTVs (struct timeval * startTV, struct timeval * endTV)
      |                               ^~~~~~~
samblaster.cpp:107:38: error: invalid use of incomplete type 'struct timeval'
  107 |     return (((endTV->tv_sec - startTV->tv_sec) * 1000000) + (endTV->tv_usec - startTV->tv_usec));
      |                                      ^~
samblaster.cpp:105:31: note: forward declaration of 'struct timeval'
  105 | inline UINT64 diffTVs (struct timeval * startTV, struct timeval * endTV)
      |                               ^~~~~~~
samblaster.cpp:107:67: error: invalid use of incomplete type 'struct timeval'
  107 |     return (((endTV->tv_sec - startTV->tv_sec) * 1000000) + (endTV->tv_usec - startTV->tv_usec));
      |                                                                   ^~
samblaster.cpp:105:31: note: forward declaration of 'struct timeval'
  105 | inline UINT64 diffTVs (struct timeval * startTV, struct timeval * endTV)
      |                               ^~~~~~~
samblaster.cpp:107:86: error: invalid use of incomplete type 'struct timeval'
  107 |     return (((endTV->tv_sec - startTV->tv_sec) * 1000000) + (endTV->tv_usec - startTV->tv_usec));
      |                                                                                      ^~
samblaster.cpp:105:31: note: forward declaration of 'struct timeval'
  105 | inline UINT64 diffTVs (struct timeval * startTV, struct timeval * endTV)
      |                               ^~~~~~~
make: *** [Makefile:27: samblaster.o] Error 1

What goes wrong here?

Regards, Felix

Plans to use htslib ?

Greg,

Do you have any plans to incorporate htslib into samblaster?
https://github.com/samtools/htslib

Torsten

segfault during hashTableInsert

I'm running into a segfault with samblaster when streaming input from bwa. We've
seen this a few times with different user reports on bcbio but this is the first
reproducible case I've had.

The segfault happens in different positions on two different machines but after
processing to ~7Gb of SAM output it will fail during hashTableInsert with this
traceback from gdb:

#0  0x0000000000405212 in hashTableInsert(hashTable*, unsigned long) ()
#1  0x00000000004038d1 in markDupsDiscordants(splitLine*, state_struct*) ()
#2  0x0000000000404107 in processSAMBlock(splitLine*, state_struct*) ()
#3  0x00000000004017b0 in main ()

Here is a set of input files and shell script that reproduce the issue for me on
two different machines:
https://s3.amazonaws.com/chapmanb/testcases/samblaster_segfault.tar.gz

Thanks so much for any suggestions or tips to avoid the problem.

"Unknown opcode '*' in" message

Hi, I'm trying out samblaster and so far it is working nicely but I'm getting thousands of "Unknown opcode '*' in" messages in the stderr log. I can't figure out what is going on and it doesn't seem to be messing up the duplicate removal but it is annoying! Here is a 100 read file that had a few messages when mapping with BWA 0.7.5a-r405 against mm10.
reads https://www.dropbox.com/s/swruxw93pnhzdab/test.fastq.gz

Support for DT tag to distinguish PCR and optical duplicates

I'd like to request a feature analogous to the Picard MarkDuplicates TAGGING_POLICY option, where setting All will record the Duplicate Type (PCR or optical) in the optional DT tag, and OpticalOnly will only mark optical duplicates. It's often recommended to only mark optical duplicates on data from PCR-free library prep, which includes most WGS. Thanks!

Very different result between samtools and samblaster

Dear @GregoryFaust

I am using Lumpy to detect CNV. During preparing the discordant paired-end alignments, I found that the results from the following two pipelines are very different. In my test, the samblaster just reported 0 discordant read pair, but samtools reported many.

(1) samtools view -b -F 1294 sample.bam #refer to the github of lumpy (https://github.com/arq5x/lumpy-sv)
(2) samtools sort -@ 4 -O SAM -n sample.bam | samblaster (Version.0.1.25) -a -e -d $prefix.disc.sam -s $prefix.split.sam -o /dev/null

Hope you can help me, thank you.

Sincerely,
Zheng Zhuqing

Can't find first and/or second of pair error

Hi,

I am trying to realign a bam file to a new reference and in the process I would like to use samblaster. When I use the command below, I am running into this error.

samblaster: Loaded 84 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: SRR622461.3665340
samblaster:    At location: *:0
samblaster:    Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Marked           0 of        297 (0.000%) total read ids as duplicates using 1620k memory in 0.001S CPU seconds and 13M33S(813S) wall time.
samblaster: Premature exit (return code 1).

Here is my command

samtools collate -uOn128 NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.collate | samtools fastq - | bwa mem -pt20 -R  '@RG\tID:NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211\tLB:NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211\tSM:NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211\tPL:ILLUMINA' -M human_g1k_v37_Ensembl_MT_66.fasta - | samtools sort --threads=20 -m4G -n -O sam | samblaster -M | samtools view -Sb - > NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.marked.bam

Now, my question is in this scenario can I safely add "--ignoreUnmated" flag or is samblaster is not suited for this purpose. Please let me know.