Tools to process and analyze deep sequencing data.

License: Other

Python 99.45% Shell 0.55%

python bioinformatics genomics ngs rna-seq chip-seq

deepTools

User-friendly tools for exploring deep-sequencing data

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats, that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.

For support or questions, please post to Biostars. For bug reports and feature requests, please open an issue on GitHub.

Citation:

Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, Heyne S, Dündar F, Manke T. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Research. 2016 Apr 13:gkw257.

Documentation:

Our documentation contains more details on the individual tool scopes and usages and an introduction to our deepTools Galaxy web server including step-by-step protocols.

Please see also the FAQ, which we update regularly. Our Gallery may give you some more ideas about the scope of deepTools.

For more specific troubleshooting, feedback, and tool suggestions, please post to Biostars.

Installation

deepTools are available for:

Command line usage (via pip / conda / github)
Integration into Galaxy servers (via toolshed/API/web-browser)

There are many easy ways to install deepTools. More details can be found here.

In Brief:

Install through pypi

$ pip install deeptools

Install via conda

$ conda install -c bioconda deeptools

Install by cloning the repository

$ git clone https://github.com/deeptools/deepTools
$ cd deepTools
$ pip install .

Galaxy Installation

deepTools can be easily integrated into Galaxy. Please see the installation instructions in our documentation for further details.

Note: From version 2.3 onwards, deepTools supports Python 3.

This tool suite is developed by the Bioinformatics Facility at the Max Planck Institute for Immunobiology and Epigenetics, Freiburg.

Documentation | deepTools Galaxy | FAQ

deeptools's Issues

bamCorrelate problem with option --region

If there are not enough reads in the given region, bamCorrelate reports an error:
File "bamCorrelate", line 447, in main num_reads_per_bin[:, col])[0]
ValueError: setting an array element with a sequence.
It would be good to print a warning saying that there is not enough data in the selected region instead of raising this error.

to-do for figure updates in the wiki/manual section

to be filled up :)

computeMatrix: how Galaxy handles output with multiple clusters
profiler: color option

setup.py is broken

With the current setup.py script it is not possible to use:

python setup.py --home /foo/bar or
python setup.py --install-lib

Multiple profiles in one plot - deeptools profiler

Hi guys,

I was wondering if there is a way to plot several profiles in one plot using the profiler tool. It is always easier to compare the profiles if they are in one plot.

All best,

Galaxy script for renaming chromosomes

I have not figured out a quick and painless way to remove or add "chr" for chromosome names with the tools that are present currently. Perhaps we should add a small python script.
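
A small script along these lines might be enough. This is only a sketch: the function names are hypothetical, it assumes tab-separated BED input, and it does not handle special cases such as chrM vs. MT.

```python
def toggle_chr_prefix(name, add_prefix):
    """Add or strip a leading 'chr' from a single chromosome name."""
    if add_prefix:
        return name if name.startswith("chr") else "chr" + name
    return name[3:] if name.startswith("chr") else name


def rename_bed_lines(lines, add_prefix):
    """Yield BED lines with the first column renamed.

    Empty lines and header lines (comment, track, browser) pass through untouched.
    """
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        fields[0] = toggle_chr_prefix(fields[0], add_prefix)
        yield "\t".join(fields) + "\n"
```

It could be wired to stdin/stdout so that it works as a filter both on the command line and inside a Galaxy wrapper.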

Galaxy: sorting according to gene-body-length is missing

at the moment, only sorting "descendingly", "ascendingly" and "no sort" are available within Galaxy while for the command line, region-length-sorting is available, too

--sortUsing region_length

Make a new tool for calculating the effective genome size

bamCorrelate issue: error: argument --numberOfProcessors/-p: lot and 2 other issues

Dear developers,

I came across an issue in bamCorrelate. For some reason, I was unable to make it run.
It is always prompting the error:

"bamCorrelate bins: error: argument --numberOfProcessors/-p: lot is not a valid number of processors"

Specifying the number of processors did not make any difference.

Also, I noticed differences in the arguments that bamCorrelate takes between the example run:
"An example usage is: bamCorrelate bins -b treatment.bam input.bam -plot
correlation.png -f 200 -method pearson" and the list of arguments from the help menu.
Also, python-2.7 $KH/bin/deepTools/bin/bamCorrelate -h is not detailed enough to know the parameters for bamCorrelate

Sorry for all that,
Hope that it will help though and many thanks for the tool,

P.K

Galaxy: adjust wrapper for R script to combine several profiler runs

Galaxy: update of the explanations within Galaxy?

Currently, the wiki pages are a bit more elaborate than the explanations that users see when they select a tool within Galaxy.
My questions:

Should we include the long descriptions? (I lean towards no)
Should we include the links to the wiki pages (I lean towards yes)
Where could I modify this information? If we really migrate the wiki once more to the deepTools website eventually, all these links will need to be updated. I could do it if I knew where to look.

handling of bed files: the 6th column should not break the tool

When a BED file contains something other than a plus or minus sign in the 6th column, an error is produced. This should be changed to a warning. (A warning should also be raised when the BED file does not contain any strand information.)
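
As a rough illustration of the requested behaviour, the strand could be read like this. The function name is hypothetical and not part of deepTools; treating "." as a legitimate unstranded marker that does not trigger a warning is my assumption.

```python
import warnings

VALID_STRANDS = {"+", "-"}


def parse_strand(fields, line_number):
    """Return '+', '-' or '.' for an already split BED line.

    Unexpected or missing strand values produce a warning instead of an error.
    """
    if len(fields) < 6:
        warnings.warn(f"line {line_number}: no strand column, treating region as unstranded")
        return "."
    strand = fields[5]
    if strand in VALID_STRANDS:
        return strand
    if strand != ".":
        warnings.warn(f"line {line_number}: unexpected strand value {strand!r}, treating region as unstranded")
    return "."
```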

Galaxy: correctGCbias output for bigWig and bedGraph not working

with correctGCBias 1.5.6-45-g7190dd0 I get the following message when trying to output a bedgraph file:

format: bedgraph, database: dm3
applying correction genome partition size for multiprocessing: 362078 using region 4 sh: open: No such file or directory

it's a warning, but the file is empty.

with output = bigWig I get a similar message:

end (1321) before start (1321484) line 23342 of /tmp/tmplJWLcp

computeGCbias fails when chromosome names of 2bit and bam file do not match

This behavior is OK, but the tool should proceed when the BAM chromosome names are a subset of the 2bit chromosome names.

empty bigWig file - should have meaningful, less threatening error message

Make deepTools tolerant towards chromosome naming (chr1 vs. 1)

This is related to the issue I first raised for the Galaxy implementation, but Fidel and I agreed that it would be meaningful to make deepTools capable of processing BAM and BED files even if their chromosome naming conventions do not match 100%. After all, IGV and other tools can do it, too.

Discrepancies should still be reported as to make the user aware of the fact the chromosomes are labelled differently, but this should not break the code.
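
One possible way to do the matching, sketched under the assumption that the only difference between the two naming schemes is a leading "chr". The function name is hypothetical; the returned list of unmatched names is what would be reported to the user.

```python
def match_chromosome_names(bam_names, bed_names):
    """Map each BED chromosome name to a BAM name, tolerating a 'chr' prefix mismatch.

    Returns (mapping, unmatched): a dict of BED name -> BAM name and a list of
    BED names that could not be matched either way.
    """
    bam_set = set(bam_names)
    mapping = {}
    unmatched = []
    for name in bed_names:
        alternative = name[3:] if name.startswith("chr") else "chr" + name
        if name in bam_set:
            mapping[name] = name
        elif alternative in bam_set:
            mapping[name] = alternative
        else:
            unmatched.append(name)
    return mapping, unmatched
```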

bamCoverage mappingQualityFilter not working as expected

from Fidel:
The problem with the normalization (either RPKM or normalizeTo1X) is that they are based on the total number of mapped reads and not on the total number of reads of quality > minMappingQuality.

That means that people still need to filter the bamFile before running bamCoverage (which is contrary to our intention)

bamCompare should not be heavily affected by this.

bamCorrelate should have an option to remove outliers

The pearson correlation run by bamCorrelate is heavily affected by outliers that commonly occur on satellite regions that accumulate large numbers of reads.
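
To show how strongly a single bin can dominate, here is a plain-Python sketch. Dropping bins above a fixed count threshold is just one possible filter and is my assumption, not the method bamCorrelate uses; both function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_outlier_bins(x, y, max_count):
    """Keep only bins where both samples have at most max_count reads."""
    kept = [(a, b) for a, b in zip(x, y) if a <= max_count and b <= max_count]
    return [a for a, _ in kept], [b for _, b in kept]


# Four anti-correlated bins plus one satellite-like bin with 1000 reads in both samples.
sample_a = [1, 2, 3, 4, 1000]
sample_b = [4, 3, 2, 1, 1000]
with_outlier = pearson(sample_a, sample_b)
without_outlier = pearson(*drop_outlier_bins(sample_a, sample_b, max_count=100))
```

With the outlier bin included, the coefficient is close to +1; without it, the remaining bins are perfectly anti-correlated. A percentile-based cutoff might be a more robust default than a fixed count.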

Galaxy: place the option to limit the operation to a genome region in the main display, not in the advanced options

FAQ: How does the GC correction influence the downstream analyses?

no duplicate removal

script for turning help texts into markdown for wiki

1 wiki page should contain all the options for all the programs

every time the help within the python scripts is updated, one should just run this script, which should generate a wiki-compatible markdown page
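
If the tools build their help with argparse, such a script could be quite short. The sketch below renders the help of one parser as a markdown section with an indented code block; help_to_markdown is a hypothetical name, and the demo parser is a stand-in, not the real bamCorrelate parser.

```python
import argparse


def help_to_markdown(parser):
    """Render the help text of one argparse parser as a markdown section."""
    # Indenting by four spaces makes markdown treat the help text as preformatted.
    body = "\n".join("    " + line for line in parser.format_help().splitlines())
    return f"## {parser.prog}\n\n{body}\n"


if __name__ == "__main__":
    # Stand-in parser for illustration only.
    demo = argparse.ArgumentParser(prog="bamCorrelate", description="demo parser")
    demo.add_argument("--region", help="limit the operation to this region")
    print(help_to_markdown(demo))
```

Concatenating the output for all tools would give the single wiki page with every option.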

commandLine: BED-output of clustered heatmap cannot be directly re-used with computeMatrix

The issue is caused by the lack of # lines in the output of heatmapper nowadays. To make heatmapper etc. work with Galaxy, Bjoern added the feature that the cluster ID is indicated in the 7th column of a bed-file, so that users can easily separate the file into separate data sets and supply them individually to computeMatrix. On the command line, computeMatrix expects just one BED file where the groups are separated by #. That means that currently the user needs to know that he has to turn the file with the format:

chr1 10 12 Cluster1
chr1 20  22 Cluster2

into a file like this:

chr1 10 12
# cluster1
chr1 20  22
# cluster2

Perhaps we can come up with a more elegant solution in the future.
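
Until then, the conversion could be scripted. This sketch assumes that the cluster label is the last whitespace-separated column, that the input is already sorted by cluster, and that each group is closed by a "# label" line as in the example above; the function name is hypothetical.

```python
def clusters_to_hash_groups(lines):
    """Convert BED lines with a trailing cluster label into '#'-delimited groups."""
    out = []
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        *region, label = fields
        # A change of label means the previous group is complete.
        if current is not None and label != current:
            out.append("# " + current)
        current = label
        out.append("\t".join(region))
    if current is not None:
        out.append("# " + current)
    return out
```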

--quiet option for computeMatrix does not keep quiet

Hi guys,

I appreciate that computeMatrix is telling me all the problems that it encounters (I know, it's a tough job, poor little bugger), but I would appreciate it even more if it would stop littering my stderror when I set the --quiet option. In short: I don't think the -q option works completely properly :)

The command line tools PE_fragment_size and bigwigCompare are not part of Galaxy.

I think we should add these tools to Galaxy at some point. PE_fragment_size probably needs a more meaningful name.

documentation fixes

I saw several small mistakes/inconsistencies with the command line interface of heatmapper in the supplement of the paper. I tried to fix most of them. If anyone can have a brief look ... that would be nice. We can update the paper during the proofreading, I hope.

computeGCbias lists --fragmentLength as optional, but it is required

Add tools for specific RNA-seq analysis issues, e.g. FPKM for genes etc.

a tool to generate count tables of unnormalized read numbers would be nice (very similar to what bamCorrelate is already doing) --> this could be useful to generate the count tables required by DESeq etc.
calculate the normalized read coverages (typically FPKM) per genes

Simplify the installation

The installation is not as easy as it could be. Especially more verbose warnings/errors would be useful, to assist the user.

from distutils import spawn
spawn.find_executable('samtools')

That can be used to determine binaries at runtime and fail with a meaningful and helpful message.
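
On current Python versions the same check can be written with shutil.which, since distutils was removed from the standard library in Python 3.12. A minimal sketch, with a hypothetical helper name and message wording:

```python
import shutil


def require_executable(name):
    """Return the full path of an external program or fail with a helpful message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"Required program '{name}' was not found on your PATH. "
            "Please install it or adjust PATH before running the installation again."
        )
    return path
```

Calling require_executable('samtools') early in the setup would then stop with a readable message instead of a later, less obvious failure.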

labels for heatmaps

at the moment, labels are only put into the profiles

bamCompare: using --verbose with --scaleFactors throws an error

Hi,
I came across the following bug in bamCompare. Specifying the "--scaleFactors" in the command line in the --verbose mode throws the error below:
Hope this will help,
P.

Traceback (most recent call last):
File "/g/furlong1/khoueiry/bin/deepTools/bin/bamCompare", line 315, in
main(args)
File "/g/furlong1/khoueiry/bin/deepTools/bin/bamCompare", line 279, in main
"RPKM is {0}".format(scaleFactor)
UnboundLocalError: local variable 'scaleFactor' referenced before assignment

computeMatrix should not break when the first line of a peaks file is a comment or a track line
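
One way to handle this would be to drop header lines only at the top of the file, so that later "#" lines can still act as group separators. The sketch below is an assumption about a reasonable behaviour, not the actual computeMatrix code, and the function name is hypothetical.

```python
def skip_header_lines(lines):
    """Yield lines after dropping leading empty, comment, 'track' and 'browser' lines."""
    in_header = True
    for line in lines:
        if in_header and (not line.strip() or line.startswith(("#", "track", "browser"))):
            continue
        # The first real record ends the header; everything after it is kept as-is.
        in_header = False
        yield line
```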

add names of regions from BED file to bamCorrelate --outRawCounts

at the moment, --outRawCounts returns a matrix without row names. if a BED file is given for bamCorrelate, it makes sense to output the name of the regions in the BED file as the rowname. this matrix could then be directly plugged into DESeq and other downstream applications that require an unnormalized read count matrix

bamCompare syntax error

in line 124/125, comma missing before 'default=1000'

Galaxy: labeling of effective genome size selections is inconsistent

another thing pointed out by our proof-readers: the way users are asked to supply genome or effective genome sizes is currently not consistent between the different Galaxy tools

bamCompare

when "normalize to 1x sequencing depth" is selected, the following header and an empty field appear:

Report normalized coverage to 1x sequenceing depth:

--> this entry should probably be named "Effective genome size" instead

bamCompare, bamCoverage

There's an explanation for the genome size entry:

Enter the genome size to normalize the reads counts. Sequencing depth is defined as the total number of mapped reads * fragment length / effective genome size. To use this option, the effective genome size has to be given. Common values are: mm9: 2150570000, hg19:2451960000, dm3:121400000 and ce10:93260000.

This should read instead:

Enter the effective genome size to normalize the reads counts (the part of the genome that can be mapped, i.e. without undefined bases and highly repetitive regions). Sequencing depth is defined as the total number of mapped reads * fragment length / effective genome size. Common values are: mm9: 2150570000, hg19:2451960000, dm3:121400000 and ce10:93260000.
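
For the help text, a worked example of this definition might also be useful. The sketch below uses the mm9 value quoted above; the read count and fragment length are made-up example numbers, and taking the 1x scale factor as the reciprocal of the depth is my reading of "normalize to 1x".

```python
def sequencing_depth(mapped_reads, fragment_length, effective_genome_size):
    """Sequencing depth = mapped reads * fragment length / effective genome size."""
    return mapped_reads * fragment_length / effective_genome_size


def scale_factor_to_1x(mapped_reads, fragment_length, effective_genome_size):
    """Factor that rescales the observed coverage to a depth of 1x (assumed: 1 / depth)."""
    return 1.0 / sequencing_depth(mapped_reads, fragment_length, effective_genome_size)


# Example numbers (assumed): 10 million mapped reads with 200 bp fragments on mm9.
MM9_EFFECTIVE_SIZE = 2150570000
depth = sequencing_depth(10_000_000, 200, MM9_EFFECTIVE_SIZE)
factor = scale_factor_to_1x(10_000_000, 200, MM9_EFFECTIVE_SIZE)
```

Here the depth comes out slightly below 1x, so the scale factor is slightly above 1.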

in correctGCbias and computeGCbias
there is a very elaborate description of the effective genome size including a drop down box (while in the other tools, it's an empty field where the user has to enter the value himself)

I think, we should make it consistent. Actually, I like the way it's done for the GCbias tools better than the empty field. The empty field, however, is consistent with how MACS has it...

FAQ: What to do when reference genome is not known

Galaxy: how to integrate IGV browser?

@diehlsa & @bgruening : what is needed to directly load data from Galaxy into IGV browser?

move the manual pages to a proper wiki

I'm aiming at something like this:
https://github.com/snowplow/snowplow/wiki

Structure should be something along these lines:
0. Installation (this should be kept in the README.md so that it's still the first thing people see)

How we use deepTools
Documentation of the tools (QC, normalization, visualization)
Recipes (Galaxy and command line)
FAQ (Galaxy and command line)

Suggestions welcome.
Cheers,
Friederike

Option --ignoreDuplicates has no effect in bamCorrelate

The option --ignoreDuplicates appears in bamCorrelate but it has no effect. This is because such option is not passed to the countReadsPerBin.getCoverageOfRegion function.

FAQ: Should bamFingerprint be used on raw BAM files only or also on bias-normalized files?

Fabian asked for recommendations here

FAQ: Which correlation is recommended for bamCorrelate?

i.e. when to use Spearman and Pearson

I think, we, ourselves, haven't fully understood this issue given the current plots we produced, right?

Galaxy should be much more file-format-restrictive when showing the selection of available files

For example, for computeMatrix only true BED or INTERVAL files should be shown in the drop down menu for the regions while only true BIGWIG files should be shown in the drop down menu for the scores.

At the moment, almost all files in a history tend to be shown, very often with the comment ("as BED" or "as BIGWIG") which, in most cases, will not work and will lead to confusion.

Is this something that needs to be changed every time the tools are updated or (re-)installed in Galaxy? And where does it need to be changed?

bamCoverage: --missingDataAsZero option should be integrated

Just as a reminder: currently, --missingDataAsZero {yes, no} only exists in bamCompare. There is no reason why it should not be available in bamCoverage (since 1.5.7., the default in bamCoverage is set to "yes")

The effective genome size numbers for bamCoverage and bamCorrelate are for uniquely mapped reads

The effective genome size (mappable portion of a genome) numbers for bamCoverage and bamCorrelate are for uniquely mapped reads. However, nowadays it is common to have larger effective genome sizes when using random mapping of multi-reads. Furthermore, the mappable portion of the genome changes depending on the read length used.

Maybe we should add all options to the tools or point to Table 2 of this paper: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0030377

Galaxy: add information about data upload into the tool description

perhaps just add the link to the wiki page (https://github.com/fidelram/deepTools/wiki/Galaxy#wiki-dataup) within the TIP section underneath the selection box

Implement alternative coverage measure (distinct reads)

From: https://www.biostars.org/p/101413/

I'm quite interested in whether it is possible to integrate side tools into your framework. Actually, when working with exome data I've come across an alternative coverage measure that uses distinct reads, i.e. reads that cover a given position and have different offsets. It is also referred to as molecular coverage. The inspiration comes from this paper http://genomebiology.com/content/12/1/R6, Fig1. Such a measure could be more prone to sequencing artifacts. If you're interested, the basic implementation using the Picard API is here https://github.com/mikessh/exome-misc/blob/master/src/molcount/MolCountExome.groovy and here https://github.com/mikessh/exome-misc/releases/download/v1.0.0/exome-tools-v1.0.0.jar compiled as jar.

FAQ: Error with UCSCtool bedGraph to bigWig

Now that we have deepTools being tolerant towards chromosome naming, I totally forgot that other tools are not that forgiving. I just put the error here so I eventually make an FAQ entry out of it:

BedGraph to bigWig on data 76
An error occurred with this dataset: 2L is not found in chromosome sizes file

Galaxy: add abbreviation behind the name of each tool

and another thing pointed out by innocent users: they would like the abbreviation that we use in the overview table to be shown behind each tool name, i.e.:

bamCorrelate (QC) correlates pairs..
bamFingerprint (QC) plots profiles...
computeGCbias (QC) ...
correctGCbias (N) ...
bamCoverage (N) ...
bamCompare (N) ...
computeMatrix (V) ...
heatmapper (V) ...
profiler (V)

make terms in wiki/help content with command line version AND galaxy consistent

perhaps a glossary would be meaningful? Fabian reports that he's struggling with the terms and them not being consistently used...(which is most likely very true)

any opinion on whether one should use the term NGS or HTS? I tried to use HTS because Thomas preferred it, but I have the feeling that NGS is much more commonly used

last, but not least: ANY idea on how to unify the terms in the wiki, the help texts in the python scripts and the galaxy help texts most efficiently? perhaps a "help-text-a-thon" is needed where we edit all 3 texts simultaneously? I honestly cannot bring myself to go through these massive texts all on my own, but I think it would tremendously decrease the frustration potential if we did it