maxplanck-ie / snakepipes
Customizable workflows based on snakemake and python for the analysis of NGS data
Home Page: http://snakepipes.readthedocs.io
License: MIT License
We should move the existing index/help/readme content to Read the Docs for now. Afterwards we can open a series of issues about documenting each workflow separately.
In recent snakemake versions, a --cluster-status option has been added that lets snakemake simply query whether a job has completed or not. It'd be nice to add this as an option in the config file, where if it's empty it is ignored entirely (I'll make a SlurmStatus command for us to use for this).
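A minimal sketch of what such a SlurmStatus command could look like, assuming a sacct-based check (the script name and the use of sacct are assumptions; snakemake's --cluster-status contract expects the script to print "success", "failed" or "running" for a given job ID):

```python
#!/usr/bin/env python
# Hypothetical SlurmStatus script for snakemake's --cluster-status option.
# snakemake invokes it with a job ID and reads one of "success", "failed"
# or "running" from stdout.
import subprocess
import sys


def map_state(state):
    """Map a SLURM sacct job state string to a snakemake status string."""
    state = state.strip().split()[0] if state.strip() else ""
    if state.startswith("COMPLETED"):
        return "success"
    if state.startswith(("RUNNING", "PENDING", "COMPLETING", "SUSPENDED")):
        return "running"
    # BOOT_FAIL, CANCELLED, FAILED, TIMEOUT, OUT_OF_MEMORY, NODE_FAIL, ...
    return "failed"


def main(jobid):
    out = subprocess.run(["sacct", "-j", jobid, "-o", "State", "-n", "-P"],
                         capture_output=True, text=True).stdout
    lines = out.splitlines()
    print(map_state(lines[0] if lines else ""))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```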
Currently, the DNA-seq pipeline downsamples by taking the first N reads in a fastq file. Actual randomization can be achieved with seqtk sample, also for paired-end reads (seqtk sample must be called for each fastq file separately, and -s has to be identical for both fastq files).
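A toy illustration (plain Python, not the pipeline's code) of why -s must be identical for both fastq files: the same seed makes the sampler pick the same record indices in R1 and R2, so mates stay paired.

```python
import random


def sampled_records(n_records, n_sample, seed):
    """Indices of FASTQ records a seeded sampler would keep; mirrors the
    behaviour that makes seqtk sample reproducible for a fixed -s seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), n_sample))


# Identical seeds select identical records from R1 and R2, keeping read
# pairs in sync; different seeds would break the pairing.
r1_keep = sampled_records(1_000_000, 1_000, seed=100)
r2_keep = sampled_records(1_000_000, 1_000, seed=100)
assert r1_keep == r2_keep
```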
We should use the latest MACS2 version for BAMPE mode.
The X chromosome should be excluded from the calculation of the 1x normalization in bamCoverage.
The metrics file output by picard MarkDuplicates reports statistics only for "Unknown Library", although it should report per library (= read group, @RG). The read group is defined during (bowtie) mapping, but does not seem to be recognized by MarkDuplicates. Does anyone know how to fix this? It's annoying because MultiQC will fail on those metrics files.
The suffixes of the two bamCompare outputs are filtered.subtract.input.bw and filtered.log2ratio.over_input.bw, but the control sample doesn't have to be an input. In my example I have both Input and H3 controls, and if I re-run with the H3 control, the input-subtracted files would either be overwritten or not produced.
In order to support multi-threading, the deepTools plot* commands should not be run for very many samples (say >20). Tools that can provide tabular output (i.e. plotCorrelation) should be restricted to that output in such cases.
Personally, I would vote to remove the scatterplot altogether; this can be left for dedicated analysis.
plotFingerprint takes a long time with the default -n 500000. Perhaps this parameter could be reduced? Please also calculate the metrics using --outQualityMetrics.
I will implement the HiC workflow with input from Fidel. It would be an independent workflow.
The current master branch does not handle salmon output correctly for DESeq; it simply converts gene counts from floats to integers. tximport needs to be implemented using the tx2gene annotation and the quant.sf files.
This step is currently broken on the develop branch.
It actually represents the value for "paired" in the config, which is confusing.
In discussions it emerged that a --filterBAM option would be good to have, to additionally filter the mapped BAM for some regions, like chromosome contigs and blacklisted sites. This might be useful for DNA mapping and other DNA workflows: ChIP-seq, ATAC-seq and Hi-C. Maybe we can discuss this.
During the cleanup at the end of the pipeline, only the non-empty files in cluster_logs should be kept.
PNG outputs are not zoomable.
Things like the FASTQ rule should be under the localrules keyword so they run quickly and we don't have to wait for NFS lag or for the cluster to be free.
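In a Snakefile this is a one-line change; the rule body below is illustrative, not the pipeline's actual FASTQ rule:

```python
# Snakefile fragment: rules listed under `localrules` are executed on the
# submission host instead of being sent to the cluster scheduler.
localrules: FASTQ

rule FASTQ:
    input: "originalFASTQ/{sample}.fastq.gz"
    output: "FASTQ/{sample}.fastq.gz"
    shell: "ln -s ../{input} {output}"
```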
If you choose --merge_samples, the workflow continues using the unmerged samples for e.g. TAD calling. Suggestion: switch to the merged sample for further processing after merging.
For the RNA-seq pipeline: sometimes it is convenient not to run the whole pipeline but only part of it. The suggested parameters would stop the pipeline early and avoid the creation of unneeded files.
The HiC workflow needs a cluster config to manage per-job memory.
Allow submitting multiple sample sheets for pair-wise differential gene expression. For example, use the current routines for differential analysis and produce one output folder per submitted sample sheet.
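A sketch of the bookkeeping this would need; the DESeq2_<sheetname> naming scheme is an assumption:

```python
import os


def diff_expr_outdirs(sample_sheets, base="DESeq2"):
    """Map each submitted sample sheet to its own output folder, e.g.
    sampleSheet1.tsv -> DESeq2_sampleSheet1 (naming scheme is illustrative)."""
    outdirs = {}
    for sheet in sample_sheets:
        name = os.path.splitext(os.path.basename(sheet))[0]
        outdirs[sheet] = "{}_{}".format(base, name)
    return outdirs
```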
The log files (especially for deepTools_qc) should contain the actual call; it can be quite cumbersome to infer the precise call from the snakemake files. As it stands, those log files simply report some standard error, which is usually uninformative.
Integrate Travis CI for established workflows (this would take some effort).
It would be great if the HiC workflow allowed us to perform the hicPlotDistVsCounts analysis on all the corrected matrices merged with the --bin_size option.
Only previous developers can do that; I am unaware of previous changes the workflow went through. We should retrospectively tag the commits for versions (starting with v0.1). For example:
etc.
The allele-specific workflow would then get its own minor version number (v0.X.0)
...
The new ATAC-seq module from @mirax87 is already there as PR #49.
We should merge it once the new ChIP-seq revamp is merged (I still have to open the PR). I am waiting for Felix Krueger (the SNPsplit author) to fix one issue with SNPsplit, after which it would be fine.
After merging the ChIP-seq revamp we would tag v0.5, then after merging ATAC-seq we would tag v0.6.
The rMATS module is deprecated and doesn't work. I would recommend removing it, since no-one has been using it anyway.
Is there a reason that GTF files are munged beyond recognition in the RNA-seq workflow? They're fine as is for bulk RNA-seq and we can't even report that "we used Gencode m15" in the methods if we modify the crap out of them.
I got this error (local variable 'lower' referenced before assignment) while running the HiC pipeline. I have checked the get_mad_score function and I think 'lower' should be initialized before the for loop, don't you think? I added lower = 0.0 above the for loop in my local copy and it works fine.
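A simplified stand-in (not the pipeline's actual function; the loop condition here is illustrative) showing the failure mode and the one-line fix:

```python
def get_mad_score(values, threshold=3.0):
    """Simplified stand-in for the pipeline's get_mad_score: without the
    `lower = 0.0` line, any input where the condition never fires raises
    UnboundLocalError ('lower' referenced before assignment) at the return."""
    lower = 0.0  # the missing initialization reported in the issue
    for v in values:
        if v < -threshold:
            lower = v
    return lower
```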
After the ongoing migration of modules to conda, the wrappers fail since they can't find the Python installation after module load snakemake. The DNA-mapping wrapper still works since it loads a specific version of snakemake. @dpryan79 needs to fix the conda version.
In DNA-mapping, deepTools plotCorrelation produces heatmaps of the correlation between samples. If the correlation is non-negative, the color range of the correlation heatmap spans [0, 1] with a divergent colormap.
Suggested fixes:
a) Fix the color scale at [-1, 1]
b) Switch from a divergent to a sequential colormap if all values x satisfy either x > 0 or x < 0
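Both suggestions can be combined in one helper (a sketch; the colormap names are matplotlib's and are just examples):

```python
def pick_colormap(cor_min, cor_max):
    """Choose heatmap settings for a correlation matrix: a fixed divergent
    scale over [-1, 1] when values straddle zero, otherwise a sequential
    colormap spanning just the observed range."""
    if cor_min < 0 < cor_max:
        return {"cmap": "RdBu_r", "vmin": -1.0, "vmax": 1.0}
    return {"cmap": "viridis", "vmin": cor_min, "vmax": cor_max}
```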
For all workflows: if one job fails and the user re-runs the workflow, the old log file is overwritten. I think it would be better to create another log file named DNA_mapping.run2.log and so on, to keep the run history.
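A sketch of how the next log name could be chosen (a pure function over the existing file names; the base name is just an example):

```python
import re


def next_log_name(existing, base="DNA_mapping"):
    """Return the next run-numbered log file name (DNA_mapping.log, then
    DNA_mapping.run2.log, DNA_mapping.run3.log, ...) so old logs survive."""
    first = "{}.log".format(base)
    if first not in existing:
        return first
    pattern = re.compile(r"{}\.run(\d+)\.log$".format(re.escape(base)))
    runs = [int(m.group(1)) for m in map(pattern.match, existing) if m]
    return "{}.run{}.log".format(base, max(runs) + 1 if runs else 2)
```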
After the last conversation on Slack, I think the scRNA-seq workflow is not in sync with the newer changes. Maybe it's time to update it?
To avoid changes unknown to others, please make all changes through pull requests.
For downstream analysis, merging at multiple bin sizes is useful (large bins for displaying matrices at the chromosomal level, even larger bins for hicPlotDistVsCounts). It would be very helpful to be able to specify multiple values for the --bin_size option and produce the corresponding merged matrices (which are then corrected, used for TAD calling, etc.).
For the first public release of our pipeline, we should make it self-contained using docker and conda. Then we can also see whether we can add some Travis tests (issue #34).
Hi, sorry for bringing up some naive issues, but since I am very new to using the snakemake workflows I ran into some problems with their documentation. For example, for the ChIP-seq pipeline I couldn't find it mentioned anywhere that the directory of the input BAM files and the ChIP BAM files should be the same; I figured it out by looking at the Python code after getting an error while running the pipeline. Maybe one could even change the naming format that is written in the code, but if that isn't necessary, just explaining it in the ChIP-seq documentation would be fine. Another thing I noticed was the explanation of how to use the snakemake options (it is correct when running -h, but there are dashes missing in the snakepipes documentation).
I frequently run:
estimateReadFiltering -b Bowtie2/*bam -o metrics.tsv --ignoreDuplicates --minMappingQuality 5 --samFlagExclude 3844 --blackListFileName BL.bed
after mapping. It would be great if this could always be included, possibly with other defaults if you don't like the above.
In order to count genes on very large genomes (e.g. human with exons AND introns), more than the default memory is required. In particular, Salmon's suffix array construction (salmon index <genome.fa>) consumes a lot of memory.
Add a parameter to the config file to change SLURM's --mem-per-cpu if needed (by default, stay with the SLURM default).
WIP...
The scripts are a mix of Python 2 and 3 at the moment, which causes problems on the users' side.
In the BAM filtering rule, only uniquely mapping reads should be kept by default; we can add a flag to count the multimappers if users wish to.