
mosdepth's Introduction

fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. Install with bioconda.


mosdepth can output:

  • per-base depth, about 2x as fast as samtools depth--about 25 minutes of CPU time for a 30X genome.
  • mean per-window depth given a window size--as would be used for CNV calling.
  • the mean per-region given a BED file of regions.
  • the mean or median per-region cumulative coverage histogram given a window size.
  • a distribution of the proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
  • quantized output that merges adjacent bases as long as they fall in the same coverage bin, e.g. (10-20).
  • threshold output to indicate how many bases in each region are covered at the given thresholds.
  • a summary of mean depths per chromosome and within specified regions per chromosome.
  • a d4 file (better than bigwig).

When appropriate, the output files are bgzipped and indexed for ease of use.

usage

mosdepth 0.3.6

  Usage: mosdepth [options] <prefix> <BAM-or-CRAM>

Arguments:

  <prefix>       outputs: `{prefix}.mosdepth.global.dist.txt`
                          `{prefix}.mosdepth.summary.txt`
                          `{prefix}.per-base.bed.gz` (unless -n/--no-per-base is specified)
                          `{prefix}.regions.bed.gz` (if --by is specified)
                          `{prefix}.quantized.bed.gz` (if --quantize is specified)
                          `{prefix}.thresholds.bed.gz` (if --thresholds is specified)

  <BAM-or-CRAM>  the alignment file for which to calculate depth.

Common Options:

  -t --threads <threads>     number of BAM decompression threads [default: 0]
  -c --chrom <chrom>         chromosome to restrict depth calculation.
  -b --by <bed|window>       optional BED file or (integer) window-sizes.
  -n --no-per-base           dont output per-base depth. skipping this output will speed execution
                             substantially. prefer quantized or thresholded values if possible.
  -f --fasta <fasta>         fasta file for use with CRAM files [default: ].
  --d4                       output per-base depth in d4 format.

Other options:

  -F --flag <FLAG>                  exclude reads with any of the bits in FLAG set [default: 1796]
  -i --include-flag <FLAG>          only include reads with any of the bits in FLAG set. default is unset. [default: 0]
  -x --fast-mode                    dont look at internal cigar operations or correct mate overlaps (recommended for most use-cases).
  -q --quantize <segments>          write quantized output see docs for description.
  -Q --mapq <mapq>                  mapping quality threshold. reads with a quality less than this value are ignored [default: 0]
  -l --min-frag-len <min-frag-len>  minimum insert size. reads with a smaller insert size than this are ignored [default: -1]
  -u --max-frag-len <max-frag-len>  maximum insert size. reads with a larger insert size than this are ignored. [default: -1]
  -T --thresholds <thresholds>      for each interval in --by, write number of bases covered by at
                                    least threshold bases. Specify multiple integer values separated
                                    by ','.
  -m --use-median                   output median of each region (in --by) instead of mean.
  -R --read-groups <string>         only calculate depth for these comma-separated read groups IDs.
  -h --help                         show help

If you use mosdepth, please cite the publication in Bioinformatics.

See the section below for more info on distribution.

If --by is a BED file with 4 or more columns, it is assumed that the 4th column is the name. That name will be propagated to the mosdepth output in the 4th column, with the depth in the 5th column. If you don't want this behavior, simply send a BED file with 3 columns.

exome example

To calculate the coverage in each exome capture region:

mosdepth --by capture.bed sample-output sample.exome.bam

For a 5.5GB exome BAM and all 1,195,764 ensembl exons as the regions, this completes in 1 minute 38 seconds with a single CPU.

Per-base output will go to sample-output.per-base.bed.gz and the mean for each region will go to sample-output.regions.bed.gz; each of those will be written along with a CSI index that can be used for tabix queries. The distribution of depths will go to sample-output.mosdepth.global.dist.txt.
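Because these outputs are BED-format text compressed with bgzip (which is gzip-compatible), they can be post-processed with only the Python standard library. Below is a minimal sketch that pulls out regions overlapping a query interval; the file name and coordinates are assumptions for illustration, and for indexed random access you would instead query the CSI index with tabix or a similar tool.

# Minimal sketch: stream the bgzipped regions output and print intervals
# overlapping a query region. bgzip files are valid gzip, so the standard
# library suffices for a linear scan.
import gzip

def regions_overlapping(path, chrom, start, end):
    """Yield rows (chrom, start, end, [name,] mean_depth) overlapping chrom:start-end."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            c, s, e = fields[0], int(fields[1]), int(fields[2])
            if c == chrom and s < end and e > start:
                yield fields

if __name__ == "__main__":
    # hypothetical query interval on the output of the command above
    for row in regions_overlapping("sample-output.regions.bed.gz", "chr1", 1000000, 1050000):
        print("\t".join(row))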

WGS example

For 500-base windows

mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram

-n means don't output per-base data; this will make mosdepth a bit faster, as there is some cost to outputting that much text.

--fast-mode avoids the extra calculations of mate pair overlap and cigar operations, and also allows htslib to extract less data from CRAM, providing a substantial speed improvement.

Callable regions example

To create a set of "callable" regions as in GATK's callable loci tool:

# by setting these ENV vars, we can control the output labels (4th column)
export MOSDEPTH_Q0=NO_COVERAGE   # 0 -- defined by the arguments to --quantize
export MOSDEPTH_Q1=LOW_COVERAGE  # 1..4
export MOSDEPTH_Q2=CALLABLE      # 5..149
export MOSDEPTH_Q3=HIGH_COVERAGE # 150 ...
mosdepth -n --quantize 0:1:5:150: $sample.quantized $sample.wgs.bam

In this case, regions with a depth of 0 are labelled as "NO_COVERAGE", those with coverage of 1-4 are labelled as "LOW_COVERAGE", and so on.

The result is a BED file where adjacent bases with depths that fall into the same bin are merged into a single region with the 4th column indicating the label.
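As a minimal sketch of how such a quantized BED might be summarized downstream (the file name follows the command above; this is an illustration, not part of mosdepth), the snippet below totals the number of bases assigned to each label:

# Sum the number of bases per quantize label (e.g. CALLABLE, LOW_COVERAGE)
# from the quantized output produced above.
import gzip
from collections import Counter

def bases_per_label(path):
    totals = Counter()
    with gzip.open(path, "rt") as fh:
        for line in fh:
            chrom, start, end, label = line.rstrip("\n").split("\t")[:4]
            totals[label] += int(end) - int(start)
    return totals

if __name__ == "__main__":
    for label, nbases in bases_per_label("sample.quantized.bed.gz").most_common():
        print(f"{label}\t{nbases}")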

Distribution only with modified precision

To get only the distribution output, without the region or per-base files, using 3 threads:

MOSDEPTH_PRECISION=5 mosdepth -n -t 3 $sample $bam

Output will go to $sample.mosdepth.global.dist.txt

Setting MOSDEPTH_PRECISION=5 also forces the output to have 5 decimals of precision rather than the default of 2.

D4

D4 is a format created by Hao Hou in the Quinlan lab. It is incorporated into mosdepth as of version 0.3.0 for per-base output with the --d4 flag. It improves write speed dramatically: for one test case, mosdepth takes 24.8s to write per-base.bed.gz compared to 7.7s to write a d4 file. For the same case, running mosdepth without writing per-base output takes 5.9 seconds, so D4 greatly mitigates the cost of outputting per-base depth, and the output is more useful.

Installation

The simplest option is to download the binary from the releases.

Another quick way is to install with bioconda

It can also be installed with brew as brew install brewsci/bio/mosdepth or used via docker with quay:

docker pull quay.io/biocontainers/mosdepth:0.3.3--h37c5b7d_2
docker run -v /hostpath/:/opt/mount quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth -n --fast-mode -t 4 --by 1000 /opt/mount/sample /opt/mount/$bam

The binary from releases is static, with no dependencies. If you build it yourself, mosdepth requires htslib version 1.4 or later. If you get an error about "libhts.so not found", set LD_LIBRARY_PATH to the directory that contains libhts.so. e.g.

LD_LIBRARY_PATH=~/src/htslib/ mosdepth -h

If you get the error could not import: hts_check_EOF you may need to install a more recent version of htslib.

If you do want to install from source, see the install.sh.

If you use archlinux, you can install it as a package.

distribution output

This is useful for QC.

The $prefix.mosdepth.global.dist.txt file contains a cumulative distribution indicating the proportion of total bases (or the proportion of bases within the --by regions, for $prefix.mosdepth.region.dist.txt) that were covered at or above a given coverage value. It does this for each chromosome and for the whole genome.

Each row will indicate:

  • chromosome (or "total")
  • coverage level
  • proportion of bases covered at that level

The last row for each chromosome will have a coverage level of 0 paired with a proportion of 1.0, since every base is covered at a depth of at least 0.
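Because the file is cumulative, common QC numbers can be read off directly. A minimal sketch, assuming a hypothetical prefix and a 20x cutoff, that reports the genome-wide proportion covered at or above that depth:

# Read $prefix.mosdepth.global.dist.txt and report the proportion of the
# whole genome ("total" rows) covered at >= MIN_DEPTH. Because the file is a
# cumulative distribution, the answer is the value on the row whose coverage
# level equals MIN_DEPTH.
MIN_DEPTH = 20  # assumption: pick whatever QC cutoff you care about

def proportion_at_depth(path, min_depth, chrom="total"):
    for line in open(path):
        c, depth, prop = line.split()
        if c == chrom and int(depth) == min_depth:
            return float(prop)
    return 0.0  # min_depth exceeds the maximum observed coverage

if __name__ == "__main__":
    p = proportion_at_depth("sample.mosdepth.global.dist.txt", MIN_DEPTH)
    print(f"proportion of genome at >= {MIN_DEPTH}x: {p:.2f}")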

A python plotting script is provided in scripts/plot-dist.py that will make plots like the ones below. Usage is python scripts/plot-dist.py *global.dist.txt and the output is dist.html with a plot for the full set along with one for each chromosome.

Using something like that, we can plot the distribution from the entire genome. Below we show this for samples with ~60X coverage:

(figure: genome-wide coverage distribution for ~60X WGS samples)

We can also view the Y chromosome to verify that males and females track separately. Below, we see that female samples cluster along the axes while male samples have close to 30X coverage for almost 40% of the genome.

(figure: chromosome Y coverage distribution, separating male and female samples)

See this blog post for more details.

thresholds

Given a set of regions via the --by argument, mosdepth can report the number of bases in each region that are covered at or above each threshold value given to --thresholds, e.g.:

mosdepth --by exons.bed --thresholds 1,10,20,30 $prefix $bam

will create a file $prefix.thresholds.bed.gz with an extra column for each requested threshold. An example output for the above command (assuming exons.bed had a 4th column with gene names) would look like (including the header):

#chrom  start   end     region           1X   10X  20X  30X
1       11869   12227   ENSE00002234944  358  157  110  0
1       11874   12227   ENSE00002269724  353  127  10   0
1       12010   12057   ENSE00001948541  47   8    0    0
1       12613   12721   ENSE00003582793  108  0    0    0

If there is no name (4th) column in the BED file sent to --by, then that column will contain "unknown" in the output.

This is extremely efficient. In our tests, excluding per-base output (-n) and using this argument with 111K exons and 12 values to --thresholds increases the run-time by < 5%.
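For downstream QC it is often handier to have fractions rather than raw base counts. A minimal sketch (not part of mosdepth; the file name and tab-separated layout are assumptions based on the example output above) that divides each threshold count by its region length:

# Convert the raw base counts in the thresholds output into per-region
# fractions (bases at >= threshold / region length).
import gzip

def threshold_fractions(path):
    with gzip.open(path, "rt") as fh:
        header = fh.readline().rstrip("\n").lstrip("#").split("\t")
        thresholds = header[4:]  # e.g. ["1X", "10X", "20X", "30X"]
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            chrom, start, end, region = fields[0], int(fields[1]), int(fields[2]), fields[3]
            length = end - start
            fracs = [int(v) / length for v in fields[4:]]
            yield chrom, start, end, region, dict(zip(thresholds, fracs))

if __name__ == "__main__":
    # hypothetical output name; depends on the prefix given to mosdepth
    for chrom, start, end, region, fracs in threshold_fractions("prefix.thresholds.bed.gz"):
        print(f"{region}\t{fracs.get('20X', 0.0):.3f}")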

quantize

quantize allows splitting coverage into bins and merging adjacent regions that fall into the same bin even if they have different exact coverage values. This can dramatically reduce the size of the output compared to the per-base.

It also allows outputting regions of low, high, and "callable" coverage as in GATK's callable loci tool.

An example of quantize arguments:

--quantize 0:1:4:100:200: # ... arbitrary number of quantize bins.

indicates bins of: 0:1, 1:4, 4:100, 100:200, 200:infinity where the upper endpoint is non-inclusive.

The default for mosdepth is to output labels as above (0:1, 1:4, 4:100... etc.)

To change what is reported as the bin number, a user can set environment variables e.g.:

export MOSDEPTH_Q0=NO_COVERAGE
export MOSDEPTH_Q1=LOW_COVERAGE
export MOSDEPTH_Q2=CALLABLE
export MOSDEPTH_Q3=HIGH_COVERAGE

In this case, the bin label is replaced by the text in the appropriate environment variable.
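These bin semantics can be mirrored in a few lines; the sketch below illustrates the documented behavior (half-open bins, a trailing ':' meaning infinity, and MOSDEPTH_Q{i} relabelling) and is not code taken from mosdepth itself.

# Map a depth value to its quantize bin label, mirroring the documented
# semantics: breakpoints define half-open bins [b_i, b_{i+1}), and the
# MOSDEPTH_Q{i} environment variable can relabel bin i.
import bisect
import os

BREAKS = [0, 1, 5, 150]  # from --quantize 0:1:5:150:

def bin_label(depth, breaks=BREAKS):
    i = bisect.bisect_right(breaks, depth) - 1
    if i < 0:
        return None  # depth below the first breakpoint is not reported
    env = os.environ.get(f"MOSDEPTH_Q{i}")
    if env:
        return env
    upper = breaks[i + 1] if i + 1 < len(breaks) else "inf"
    return f"{breaks[i]}:{upper}"  # illustrative default label

if __name__ == "__main__":
    for d in (0, 3, 5, 149, 150, 10000):
        print(d, bin_label(d))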

This is very efficient. In our tests, excluding per-base output (-n) and using this argument with 9 bins to --quantize increases the run-time by ~ 20%. In contrast, the difference in time with and without -n can be 2-fold.

how it works

As it encounters each chromosome, mosdepth creates an array the length of the chromosome. For every start it encounters, it increments the value in that position of the array. For every stop, it decrements that position. From this, the depth at a particular position is the cumulative sum of all array positions preceding it (a similar algorithm is used in BEDTools, where starts and stops are tracked separately). mosdepth avoids double-counting overlapping mate-pairs and it tracks every aligned part of every read using the CIGAR operations. Because of this data structure, the coverage distribution calculation can be done without a noticeable increase in run-time. The image below conveys the concept:

(figure: diagram of the start/stop counting algorithm)
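A toy sketch of the same counting scheme (plain intervals only; real mosdepth works per CIGAR-aligned block and corrects mate overlaps):

# Increment at each read's start, decrement at its stop, then take a
# cumulative sum to recover per-base depth.
import itertools

def depth_array(chrom_len, intervals):
    arr = [0] * (chrom_len + 1)          # one extra slot for decrements at chrom_len
    for start, end in intervals:         # 0-based, half-open [start, end)
        arr[start] += 1
        arr[end] -= 1
    return list(itertools.accumulate(arr))[:chrom_len]

if __name__ == "__main__":
    reads = [(0, 5), (2, 7), (4, 9)]     # toy alignments on a 10 bp "chromosome"
    print(depth_array(10, reads))        # [1, 1, 2, 2, 3, 2, 2, 1, 1, 0]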

This array accounting is very fast. There are no extra allocations or objects to track, and it is also conceptually simple. For these reasons, it is faster than samtools depth, which works by using the pileup machinery that tracks each read and each base.

The mosdepth method has some limitations. Because a large array is allocated, and because it is required (in general) to take the cumulative sum of all preceding positions to know the depth at any position, it is slower for small, one-time regional queries. It is, however, fast for window-based or BED-based regions, because it first calculates the full chromosome coverage and then reports the coverage for each region in that chromosome. Another downside is that it uses more memory than samtools. The amount of memory is approximately equal to 32 bits times the longest chromosome length, so for the 249 Mb chromosome 1, it will require about 1GB of memory.

mosdepth is written in Nim and uses htslib via the hts-nim wrapper.

speed and memory comparison

mosdepth, samtools, bedtools, and sambamba were run on a 30X genome. Relative times are relative to mosdepth per-base mode with a single thread.

mosdepth can report the mean depth in 500-base windows genome-wide in under 9 minutes of user time with 3 threads.

format  tool      threads  mode    relative time  run-time  memory (MB)
BAM     mosdepth  1        base    1              25:23     1196
BAM     mosdepth  3        base    0.57           14:27     1197
CRAM    mosdepth  1        base    1.17           29:47     1205
CRAM    mosdepth  3        base    0.56           14:08     1225
BAM     mosdepth  3        window  0.34           8:44      1277
BAM     mosdepth  1        window  0.80           20:26     1212
CRAM    mosdepth  3        window  0.35           8:47      1233
CRAM    mosdepth  1        window  0.88           22:23     1209
BAM     sambamba  1        base    5.71           2:24:53   166
BAM     samtools  1        base    1.98           50:12     27
CRAM    samtools  1        base    1.79           45:21     451
BAM     bedtools  1        base    5.31           2:14:44   1908

Note that the threads to mosdepth (and samtools) are decompression threads. After about 4 threads, there is no benefit for additional threads:

(figure: run-time scaling with additional decompression threads)

Accuracy

We compared samtools depth with default arguments to mosdepth without overlap detection and discovered no differences across the entire chromosome.

mosdepth's People

Contributors

alok123t, arq5x, brentp, bwlang, danielecook, dwinter, hyphaltip, jaudoux, jethror1, ludvigolsen, martin-g, nagacombio, rick-heig, rollf, ryanlayer, skanwal, wdecoster


mosdepth's Issues

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I built the current head of the repo (e0131e3) on CentOS 6.9 with the head of the nim repo (that was the only combination I was able to make work -- had to use nim head b/c c2nim would not build on 0.17.2; had to use mosdepth head b/c the 0.2.0 release failed with a type error mosdepth.nim(362, 10) Error: type mismatch: got (seq[int64], int64, int literal(0)); could not use the prebuilt binary b/c of old glibc). I'm using RUNPATHs in the executable instead of LD_LIBRARY_PATH for pcre and htslib (set with patchelf). I'm testing with 1000g data and for the most part tests run fine, except for intermittent segfaults:

$ mosdepth -n --by 500 -t3 HG00096_WGS \
    data/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

The bam file is from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.

The segfaults seem to be happening only when using non-default values for the bam decompression threads (-t) but even then a couple of (identical) runs did finish without a segfault - i.e. it does not occur in 100% of runs.

Replacing usage of sambamba depth with mosdepth

Brent;
mosdepth looks awesome and we're looking into replacing depth calculation code in bcbio with it. As a first pass, we'd like to replace our usage of sambamba depth for QC and coverage calculations (and then hope to tackle goleft depth and other usage). This is work that @vladsaveliev did in bcbio so cc'ing him in for the discussion as well.

We use sambamba depth in two ways: calculation of average depth in regions and depth at thresholds. Standard mosdepth usage can replace the first and --distribution can replace the second. There are just a couple of usage differences I wanted to ask about:

  • We currently limit depth calculations based on SAM flags, removing reads with unmapped, mate unmapped, not primary, failedQC and duplicate flags set. Is it possible to add this for restricting the mosdepth calculations?

  • sambamba depth adds the BED name column, the 4th field, into the output. This is handy for aggregating depth calculations by gene names. Would it be possible to include that in the mosdepth output? Or would we be better doing this with some combination of bedtools intersect and awk magic coming out?

Thanks again for mosdepth; brilliant work.

./mosdepth: ��f)�: ELF: x: Error 1195033751

wget https://github.com/brentp/mosdepth/releases/download/v0.2.3/mosdepth

chmod +x mosdepth

mosdepth
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

LD_LIBRARY_PATH=/home/linuxbrew/.linuxbrew/lib ./mosdepth
./mosdepth: ���: ELF: x: Error 1942262935

file /home/linuxbrew/.linuxbrew/opt/htslib/lib/libhts.so
/home/linuxbrew/.linuxbrew/opt/htslib/lib/libhts.so: symbolic link to `libhts.so.1.8'

plot-dist.py scripts not working

Hi Brent,

Sorry for another silly question !!
I have run mosdepth on an exome Sample.
mosdepth --by testTarget.bed 0200136071.bwa.bam > test.dist
My bed file is a 3-column bed file, and now I am using this file test.dist for plotting with plot-dist.py, so I am getting the below error.
Traceback (most recent call last):
File "plot-dist.py", line 23, in
for x, y in (x.rstrip().split("\t") for x in open(f)):
ValueError: too many values to unpack
I know why this error is coming, but I am just wondering if I am missing something so that I can plot using that script.

Can you please help me on this !!

Thanks

per-base coverage, break out into per-base for each position

Not a bug, and probably a @arq5x bedtools operation, but asking here because this started with mosdepth. The per-base coverage output collapses consecutive bases with identical coverage into a single interval:

MT      95      96      4184
MT      96      98      4214
MT      98      99      4222

My coverage at position 96 on chromosome MT is 4184x, 97 and 98 is 4214X, and 99 is 4222x.

I'd like to break this out to get a single entry for every single base. E.g., something like

MT      95      96      4184
MT      96      97      4214
MT      97      98      4214
MT      98      99      4222

What's the best way to accomplish this?
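One straightforward option, outside of mosdepth itself, is to expand the collapsed intervals in a small script; a minimal sketch (the file name is an assumption, and a bedtools-based approach would also work):

# Expand the collapsed per-base intervals into one row per position.
import gzip

def expand_per_base(path):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            chrom, start, end, depth = line.rstrip("\n").split("\t")
            for pos in range(int(start), int(end)):
                yield chrom, pos, pos + 1, depth

if __name__ == "__main__":
    for chrom, s, e, depth in expand_per_base("sample.per-base.bed.gz"):
        print(f"{chrom}\t{s}\t{e}\t{depth}")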

BAPQ filtering

Hi,
As filtering on base qualities is performed in most variant callers, it is important to take this metric into account when calculating the number of base pairs below a given threshold (especially in clinical analysis). Could you integrate this BAPQ filtering in your tool?
Thanks,

Full per base output despite 'by' region definition

When specifying regions to output from a bed file (in 0.1.19 release binary), the per-base output is for the entire genome. While this might be the desired behavior, I'd rather get the per-base limited to the same as that specified via by/bed.

In that regard, I'd recommend two separate options:

  1. --by: specifying interval lengths.
  2. --region (or --bed or something): specifying specifying a file with regions (bed-file).

Currently, the two available --by behaviors are quite different. In the second case, IMHO the per-base output should be limited to the requested region(s).

Out of memory error

Hello @brentp, thank you for mosdepth, it's great.

I'm getting an out of memory error with 0.2.2 and a different error with 0.2.3:

v0.2.2:

mosdepth -b 1000 -n -t 50 -q 60 out in.bam
out of memory

v0.2.3:

mosdepth -b 1000 -n -t 50 -q 60 out in.bam
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

This is with a read set I have aligned to other genomes and a genome I have aligned other read sets to; mosdepth has worked fine on these other BAMs.

This is Oxford Nanopore data; basic stats for the read set:

 num_seqs         sum_len  min_len  avg_len  max_len     N50
2,051,690  12,566,075,726        5  6,124.7  265,781  13,969

And the genome:

num_seqs     sum_len  min_len   avg_len  max_len      N50
     408  19,282,250    1,000  47,260.4  956,893  300,798

Checking the BAM, there are reads mapping to 408 of the 408 contigs, so I don't think this looks like the bug that was fixed in 0.2.3.

Based on the region.dist.txt and global.dist.txt files, mosdepth processed 191 of the 408 contigs when it crashed. The last line is reporting for 0x coverage so it looks like it got through the 191st contig completely. That contig was 1706bp long; the next contig to be processed (according to the BAM header) was 28845bp long.

Any ideas as to what might be happening here?

Thanks
John

Add output file with # of alignment records used and ignored

Hi Brent,

Would it be easy to output, along with the coverage files, counts of the alignment records parsed during the analysis?

According to the --flag option value in the command line, we can have a small file which contains something like:

READS_USED    4567
READS_IGNORED    1234

This would allow computing a theoretical (or intended) coverage for the experiment and comparing it, for instance, with what is really on-target, without having to parse the BAM elsewhere.

Thanks,
Anthony

Base Quality Thresholding

Hello! This tool is pretty cool, definitely faster than the one I wrote for Python a while back. Anyways, I'm wondering about computing a Q30% value per BED interval. This would essentially be the number of bases in a particular region with at least phred base quality of 30 divided by the total number of bases in the region. Similar to your threshold option, I could see this as a raw count of the number of times this happens, leaving the final division to the end user. Thoughts on this?

Empty "mosdepth.dist.txt" file

I am using the mosdepth bioconda implementation to analyze coverage of a WES bam file. My command was:

mosdepth --by exons.bed example_prefix example_tumor.bam

The example_prefix.mosdepth.dist.txt file is empty after this command. Am I doing something wrong?

Other output files are populated. I do get several warnings about chromosome regions not found (e.g. [mosdepth] warning chromosome:chr1 from bed with 68500 regions not found).

Thanks!

mosdepth pronunciation

Brent,
One of the major issues in our lab is the correct pronunciation of 'mosdepth'.
Currently we have:

  • more depth
  • mow depth
  • mo depth
  • mowz depth
  • moss def (...)

Could you please enlighten us?

ENH: output per-region counts (not means) for --by

Currently, the output for each region specified via --by is the mean per-base coverage of the region. This is because that's what the machinery in mosdepth facilitates.

For many use-cases, users want the actual count of reads that overlap an interval. I have stubbed out a way to do this that is fast enough to incorporate using lapper.

On my test-case with an exome, and a --by file with 1,195,764 intervals, this takes the run-time for mosdepth from 130 seconds to 140 seconds, and user-time with 2 decompression threads from 50 seconds to 60 seconds. I think this is acceptable given the utility and the fact that samtools bedcov takes ~33 minutes on the same bam and bed file. So, this is 2:20 for mosdepth and 33:30 for samtools bedcov.

With a bed file of 10 million 300-base windows (to simulate --by 300 because I haven't implemented this for integer window yet), the user time goes only up to 80 seconds for the same (exome bam).

I would like to get feedback from @chapmanb and anyone else interested on how to expose this. Should this replace the output for $prefix.regions.bed.gz ? Or should it be a new file $prefix.region-counts.bed.gz? I would prefer to replace the output in $prefix.regions.bed.gz but are there any common use-cases for the current mean output?

note that this would leave all other output files unchanged.

Feature request: filter by readgroup

Many groups are now doing eWGS, where an exome and whole-genome are created from different libraries, but sequenced together (with different indices). They're generally aligned together into one bam/cram, since we want all the possible depth for making variant calls.

The problem is that in order to do WGS copy number calling, we generally have to make a separate bam of just the WGS data. If mosdepth supported extracting only certain readgroups, that would save a lot of time (or if it supported streamed input somehow, so that we could filter on the fly).

Undoubtedly, this would slow things down during that use case, because you'd have to decode those fields, but it should still result in a substantial overall speed-up for eWGS workflows that now have to create a new bam.

per-base.bed.gz with --by

I have a WGS BAM which I'd like to just get per-base coverage in regions in a BED file.

But when I use
mosdepth -Q 10 --threads 4 -b BED BASE BAM

The per-base.bed.gz generated is still for the whole genome. Can mosdepth output per-base coverage only for the given regions?

mosdepth now in linuxbrew

Too hard to build from source at this stage, so I've elf-patched the binary for Linux only:

brew install brewsci/bio/mosdepth

Region-level distribution

Hi Brent,

What do you think about adding a feature to report % of an interval covered at certain depth thresholds? Similar to sambamba depth with -T options:

sambamba depth region sample.bam -L regions.bed -T 1 -T 10 -T 30

Which outputs:

chrom  chromStart  chromEnd  F3  readCount  meanCoverage  percentage1  percentage10  percentage30  sampleName
chr21  9483184     9483680   .   225        33.7641       100          98.7903       73.9919       sample
...

It gives information to detect poorly and unevenly covered regions, useful to inspect quality of capture and amplicon panels.

Bcbio used to report this with sambamba and will miss it with a full migration to mosdepth, so another reason to have it is that it'll allow migrating without losing anything. cc @chapmanb

Vlad

Sort error

I'm trying to use your mosdepth tool to get a feel for what was covered in a recent Illumina exome that I ran; however, I keep getting an error that my bam isn't sorted when I'm pretty sure it is.

This is the command I'm using to run mosdepth :-

mosdepth --by ccds.bed 2450 ~/2450.recalibrated.sorted.bam

And this is the error I receive :-

[hts-nim] starts out of order for:2450.regions.bed.gz in:1 8384339 8384836 CCDS30577.1_exon_0_50_1_8384390_f 47.71
[E::hts_idx_push] Unsorted positions on sequence #1: 67208705 followed by 8384339
[hts-nim] error adding to csi index

could not load: libhts.so

Hi Brent,

I installed mosdepth using bioconda but then I get the error:

mosdepth -h
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

When I installed, conda also installed htslib. I went looking for the libhts.so file:


ls /home/indapa/anaconda2/lib/ | grep hts
libhts.a
libhts.so
libhts.so.1.7
libhts.so.2

And then tried the following but the same error:


LD_LIBRARY_PATH=/home/indapa/anaconda2/lib mosdepth -h
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

I'm not sure what the best way forward is. I really want to use your tool, so if you can help point me in the right direction, that would be great.

Feature Request - base quality filter

Hey Brent,

Mosdepth is pulling some impressive timings for some deep WGS samples I am running--well done.

I typically use sambamba or samtools to get coverage stats and would like to switch over. One thing missing here though, and also with goleft depth, is the ability to filter on base qualities. Any particular reason you have not implemented it?

unhandled exception: pair_depth == 0

Brent;
Thanks again for helping get mosdepth packaged and ready to go. I've done some initial integration work in bcbio and started testing on larger inputs and ran into an assertion error related to pair_depth:

mosdepth -t 2 -F 1804 -b /mnt/work/NA12878-hg38-val/work/bedprep/cov-cov-cov-Exome-AZ.bed -d /mnt/work/NA12878-hg38-val/work/bcbiotx/tmpxFqJGS/cov-cov-cov-Exome-AZ-distribution.txt /mnt/work/NA12878-hg38-val/work/align/NA12878-1/NA12878-1-sort.bam > /mnt/work/NA12878-hg38-val/work/bcbiotx/tmpRCOY_q/cov-cov-cov-Exome-AZ-coverage.bed
Traceback (most recent call last)
mosdepth.nim(435)        mosdepth
mosdepth.nim(240)        window_main
system.nim(3554)         failedAssertImpl
system.nim(3546)         raiseAssert
system.nim(2656)         sysFatal
Error: unhandled exception: pair_depth == 0  [AssertionError]

This is publicly available data so I can try to isolate a test case, but it's a big whole-genome run so I am not sure about the best way to narrow it down at first glance. If just the traceback is not useful, I'll look into slicing and dicing the BED to try and provide a reproducible case. Thanks again.

Mosdepth on Windows OS

I've just managed to compile htslib and create htslib.dll on the MSYS2/MinGW stack on Windows.
Is it possible to compile this beauty you just created here for Windows?
Would you please help me?

calculating mean per chromosome

Hey,

What's the best/quickest way of getting a single value for the mean depth per chromosome (and mean depth for entire genome)?

Many Thanks,
Ismail Moghul
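The $prefix.mosdepth.summary.txt file described in the README above reports a mean per chromosome plus a total row, so no extra computation is needed. A minimal sketch for pulling those means out; the column layout assumed here (chrom, length, bases, mean, min, max) is an assumption, so check the header of your own file:

# Read the per-chromosome (and "total") mean depths from the summary output.
def mean_depths(path):
    means = {}
    with open(path) as fh:
        header = fh.readline().split()
        mean_col = header.index("mean") if "mean" in header else 3
        for line in fh:
            fields = line.split()
            means[fields[0]] = float(fields[mean_col])
    return means

if __name__ == "__main__":
    for chrom, mean in mean_depths("sample.mosdepth.summary.txt").items():
        print(f"{chrom}\t{mean:.2f}")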

Potential speed up

Nice work btw.

You don't need to decode things like quality values, auxiliary tags and read names. These are all intermingled together in BAM, but are (usually) in separate blocks for CRAM. This means there can be potential speed gains from selectively decoding only the bits you need.

I don't know anything about the Nimble API to htslib, but see for example samtools flagstat which is much faster on CRAM than BAM due to this:

https://github.com/samtools/samtools/blob/10b403702571299cb9a85333935b6d762badbb88/bam_stat.c#L142-L146

if (hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS,
                SAM_FLAG | SAM_MAPQ | SAM_RNEXT)) {
    fprintf(stderr, "Failed to set CRAM_OPT_REQUIRED_FIELDS value\n");
    return 1;
}

Output only to two decimal places in $prefix.mosdepth.dist.txt

The current version of mosdepth (downloaded as a binary release) outputs cumulative distribution information only to two decimal places and has a lot of rows with 0.00 as a result. I noticed that an older version (downloaded Jan 4, 2018) output to 4 decimal places. E.g.:

#old version - $prefix.mosdepth.dist.txt:

chr1  591  0.0001  
chr1  590  0.0001  
chr1  589  0.0001  
...
total	2	0.9944
total	1	0.9978
total	0	1.0000

#current version - $prefix.mosdepth.dist.txt:

chr1  591  0.00  
chr1  590  0.00  
chr1  589  0.00
...
total	2	0.99
total	1	1.00
total	0	1.00

Please bring back the 4 decimal places in future releases. Thanks for this awesome tool!

[hts-nim] error in bam.queri

Hi @brentp

I'm getting the error message:
[hts-nim] error in bam.queri
multiple times -- 51, in fact, on the last run. However, the program completes and the output looks sensible.

The error is only occurring when running in multithreaded mode.

The command I ran was:
mosdepth -n -t 8 -f Homo_sapiens_assembly38.fasta sample sample.cram

OSX build for bioconda

I am using macOS v10.13.6 and tried installing mosdepth in my env with the following command

conda install -c bioconda mosdepth==0.2.3

However, I got the following error

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - mosdepth==0.2.3

Current channels:

  - https://conda.anaconda.org/bioconda/osx-64
  - https://conda.anaconda.org/bioconda/noarch
  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/pro/osx-64
  - https://repo.anaconda.com/pkgs/pro/noarch
  - https://conda.anaconda.org/r/osx-64
  - https://conda.anaconda.org/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Warnings about invalid chromosome intervals for very short contigs?

Turns out I shouldn't write issues when I'm tired -- I had mixed up the input files, of course. The possible integer overflow error is still present.

I'm running the latest mosdepth v0.2.1 using -b and a bed file from gff2bed and I'm getting numerous warnings like

[mosdepth] warning requested interval outside of chromosome range:4669..4765

But looking at the bed and fasta files I can't see where the invalid range is.

$grep -P '\t4669\t4765\t' *bed
jcf7180001448645_firstRound 4669 4765 evm.model.jcf7180001448645_firstRound.2.exon3 . + . exon . ID=evm.model.jcf7180001448645_firstRound.2.exon3;Parent=evm.model.jcf7180001448645_firstRound.2

$grep jcf7180001448645_firstRound *fai
jcf7180001448645_firstRound 5521 501247228 60 61

So this contig is 5521 bases long, definitely enough for the gene.

The bam header says the same:

@sq SN:jcf7180001448645_firstRound LN:5521

I do not see this warning for pseudomolecules, only for small-ish contigs <= 11kb.

Should this line really compare the coverage and the start? (not 100% sure what the coverage variable stores, don't know nim) https://github.com/brentp/mosdepth/blob/master/mosdepth.nim#L361

Unrelated, I'm also seeing what looks like integer overflows for highly repetitive pieces where the coverage is suddenly negative:

Scaffold02946   16      248     evm.model.Scaffold02946.1.exon1 174395057.34
Scaffold02946   323     526     evm.model.Scaffold02946.1.exon2 79154426.28
Scaffold02946   602     805     evm.model.Scaffold02946.1.exon3 -42867359.94
Scaffold02946   891     1357    evm.model.Scaffold02946.1.exon4 26488868.74

But for now I'm only interested in exons that have a coverage of 0.00 so I'm just ignoring those.

Thank you for your software, it is SO much faster and easier on the resources than what I've normally been using!

feature request: per base depth in regions

Hi Brent,

How would you feel about adding a flag to calculate the per-base depth for regions in a bed file?
We have a bunch of exome data, and we'd like to calculate the per base coverage of all regions in a given bed file.

Thanks!
M

Disagreement of results with samtools and bedtools

Dear,

We have been testing the performance of mosdepth and we came across some disagreement that we are not able to explain.

The mean depth (calculated with the per-base file in the ROI region) is below the value that we get with bedtools and samtools. These last two tools agree with each other. To make sure that we are certainly counting the alignments that we want, we first select them with samtools view (-Q20 -F4).

The difference is bigger as the mean depth of the sample rises.

We have tuned the tool's quality parameters and turned off some bam flags with no success.

For some reason, mosdepth does not count certain alignments, but we can't guess the criteria.

Can someone help us?

some of the results are:

sample  Depth (bedtools genomecov)  Depth (mosdepth Q20 -F4)  Diff
S1      226.35                      140.17                    86.18
S2      282.6                       201.94                    80.66
S3      22.1                        13.59                     8.51
S4      321.53                      232.04                    89.49
S5      302.2                       200.68                    101.52
S6      463.69                      309.07                    154.62
S7      1712.59                     1218.7                    493.89
S8      283.97                      193.77                    90.2
S9      565.61                      384.24                    181.37
S10     318.84                      228.56                    90.28
S11     498.36                      330.7                     167.66
S12     1023                        719.74                    303.26
S13     1084.31                     759.14                    325.17
S14     263.12                      187.77                    75.35
S15     288.81                      209.73                    79.08
S16     883.44                      643.81                    239.63
S17     572.41                      434.08                    138.33
S18     1423.13                     1046.05                   377.08

In addition, we have checked the results base by base and the difference with respect to IGV is clear. The example shows a base that is 1165X and mosdepth output is 828X:
screenshot from 2018-09-05 12-20-49

Build binaries with older glibc Centos 6

Brent;
Is it possible to provide the pre-built binaries on CentOS 6 with an older glibc? When using the current binaries on older systems I get:

mosdepth: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by mosdepth)

It works fine on more up-to-date machines, but we support these older ones with bioconda (and many bcbio HPC users have them).

understanding regions.bed file

Hello brentp,

I am wondering about the output of the regions.bed file created for a sample (NA12156, BAM test files) which I downloaded from here.

regions.bed output-file:

head sample-output.regions.bed

			NC_000017.10	0	50	0.00
			NC_000017.10	50	100	0.00
			NC_000017.10	100	150	0.00
			NC_000017.10	150	200	0.00
			NC_000017.10	200	250	0.00
			NC_000017.10	250	300	0.00
			NC_000017.10	300	350	0.00
			NC_000017.10	350	400	0.00
			NC_000017.10	400	450	0.00
			NC_000017.10	450	500	0.00

This output-file was produced by the following command:

mosdepth --by 50 sample-output mapt.NA12156.altex.bam

I thought the --by command was supposed to report the coverage in windows of a specific size. Since I get only zeros in my output file, I was wondering if mosdepth --by 50 sample-output mapt.NA12156.altex.bam was wrong or I didn't understand how exactly this --by command works.

So I would like to ask if you could explain whether my output makes sense or, if it's incorrect, how exactly I have to use --by.

I would like to have an output like this:

			NC_000017.10	0	50	1
			NC_000017.10	50	100	2
			NC_000017.10	100	150	15
			NC_000017.10	150	200	5
			NC_000017.10	200	250	7
			NC_000017.10	250	300	0
			NC_000017.10	300	350	38
			NC_000017.10	350	400	12
			NC_000017.10	400	450	4
			NC_000017.10	450	500	1

where the last column tells us the number of reads that cover each interval specified with --by.

greetings taco

"Out of memory" if the BED file has track lines

When trying mosdepth on a machine with 128G of RAM (of which 123G is free) I get an "out of memory" error with no additional explanation. The reason lies in the BED file being used, which has one track line for display in the UCSC genome browser:

track name='Gene panel' description='Regions of interest'

Removing the line allows the calculation to proceed.

Weird results in `.region.dist.txt`

Brent,

I might be misunderstanding how .region.dist.txt is calculated. I generated a BED file from the .fa.fai and passed it to mosdepth with the --by option:

awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' gdc-viral.fa.fai > gdc.bed
mosdepth diploid_tumor_viral_mapped_mosdepth diploid_tumor_viral.mapped.bam -n --thresholds 1,5 --by gdc.bed

I would expect to get identical results from .global and .regions, however:

grep -w 5 diploid_tumor_viral_mapped_mosdepth.mosdepth.region.dist.txt | sort -r -k3,3 | head
HPV45   5       1.00
HPV39   5       1.00
HPV35   5       1.00
HPV33   5       1.00
HPV31   5       1.00
HPV26   5       1.00
HPV18   5       1.00
total   5       0.01
HPV82   5       0.01
HPV71   5       0.01
grep -w 5 diploid_tumor_viral_mapped_mosdepth.mosdepth.global.dist.txt | sort -r -k3,3 | head
HPV18   5       1.00
total   5       0.01
HPV82   5       0.01
HPV71   5       0.01
HPV25   5       0.01
HPV21   5       0.01
HPV20   5       0.01
HPV19   5       0.01
HPV14   5       0.01
HCV-2   5       0.01

It's weird that the HPV26 contig even appears in the stats, even though it is not present in the BAM file at all:

samtools view diploid_tumor_viral.mapped.bam | grep HPV26 | wc
0

I would blame the BED file if .thresholds.bed.gz didn't confirm the .global stats, not .region:

zcat diploid_tumor_viral_mapped_mosdepth.thresholds.bed.gz | grep -v ^# | awk 'BEGIN {FS="\t"} { print $1 FS $2 FS $3 FS $5 FS $5/($3-$2) }' | sort -r -k 5,5 | head
#chrom  start   end     region  5X
HPV18   0       7857    7857    1
HCV-1   0       9646    98      0.0101597
HCV-2   0       9711    97      0.00998867
HPV71   0       8037    78      0.00970511
HPV19   0       7685    72      0.0093689
HPV82   0       7870    60      0.00762389
HPV20   0       7757    58      0.00747712
HPV21   0       7779    54      0.00694177
HPV25   0       7713    47      0.00609361
HPV14   0       7713    42      0.00544535

Attaching a tarball with all the files (BAM, BED, mosdepth output). I'm using mosdepth 0.2.2.
mosdepth_duploid_tumor_viral.tar.gz

Increase precision of mean depth in output

This might be a niche use case, but when dealing with low-coverage datasets (for instance, batches that will subsequently be merged) it would be nice to be able to report the mean depth at a higher precision. Ideally this is something that could be specified on the command line.

number of decimals in .global.dist.txt

I discovered mosdepth, which gives me more info about coverage in my bams. I noticed the 'total' in .global.dist.txt reaches as deep as the highest depth in the file, but the value next to it only has 2 decimals. Am I running it correctly? How do I increase the number of decimals in the fraction? E.g. below:

total   131     0.00
total   130     0.00
total   129     0.00
total   128     0.00
total   127     0.00
total   126     0.00
total   125     0.00
total   124     0.00
total   123     0.00
total   122     0.00
total   121     0.00
total   120     0.00
total   119     0.00
...
total   23      0.00
total   22      0.01
total   21      0.01
total   20      0.01
total   19      0.01
total   18      0.01
total   17      0.01
total   16      0.01
total   15      0.01
total   14      0.01
total   13      0.01
total   12      0.01
total   11      0.01
total   10      0.02
total   9       0.02
total   8       0.02
total   7       0.02
total   6       0.03
total   5       0.04
total   4       0.05
total   3       0.07
total   2       0.13
total   1       0.34
total   0       1.00

Fill insert size between mates

Currently, there is apparently no option similar to -pc in bedtools genomecov. Would it be a large effort to implement this?

lower coverage observed than actual

I was trying to run mosdepth on a few large bam files (40 GB) generated from human data. I ran mosdepth without the -n option and calculated the average autosomal coverage; it was way lower than what was previously calculated by another tool. To confirm, I loaded the bed file in IGV and found that mosdepth actually reports coverage 2 lower at some spots than is actually observed. I am wondering what is setting this coverage calculation off.

Fails for long chromosomes

mosdepth v. 0.2.2 from bioconda fails when run on a simple csi-indexed bam file with a chromosome longer than 537 Mb:

curl https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam 2>/dev/null | sed -E 's/LN:(.*)/LN:6000000\1/' | awk -vOFS="\t" 'NR==1 {print} NR==3 {$4=$4+600000000;$5=$5+600000000;print}' | samtools view -b -otest.bam
samtools index -c test.bam
mosdepth test test.bam

Error message:

[E::hts_idx_push] Region -1..600000006 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 6
[hts-nim] error adding to csi index

It is likely somehow related to the max chromosomal length of 2^29 - 1 of .tbi/.bai indexes though it should not be a problem for a .csi index (max 2^31 - 1) which is mentioned in the error message.

Maybe a check is performed with the .tbi/.bai limit rather than the .csi limit.

Maybe using an updated version of htslib would solve the issue.

See e.g. samtools/samtools#241 and samtools/htslib#470.
