maxplanck-ie / snakepipes
Customizable workflows based on snakemake and python for the analysis of NGS data
Home Page: http://snakepipes.readthedocs.io
License: MIT License
We should move the existing index/help/readme content to Read the Docs for now. Afterwards we can open a series of issues about documenting each workflow separately.
In recent snakemake versions, a --cluster-status option has been added that lets snakemake simply query whether a job has completed or not. It'd be nice to add this as an option in the config file, where if it's empty it is ignored entirely (I'll make a SlurmStatus command for us to use for this).
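A minimal sketch of what such a SlurmStatus command could look like, assuming a sacct-based check (the script name and the use of sacct are assumptions; snakemake's --cluster-status contract expects the script to print "success", "failed" or "running" for a given job ID):

```python
#!/usr/bin/env python
# Hypothetical SlurmStatus script for snakemake's --cluster-status option.
# snakemake invokes it with a job ID and reads one of "success", "failed"
# or "running" from stdout.
import subprocess
import sys


def map_state(state):
    """Map a SLURM sacct job state string to a snakemake status string."""
    state = state.strip().split()[0] if state.strip() else ""
    if state.startswith("COMPLETED"):
        return "success"
    if state.startswith(("RUNNING", "PENDING", "COMPLETING", "SUSPENDED")):
        return "running"
    # BOOT_FAIL, CANCELLED, FAILED, TIMEOUT, OUT_OF_MEMORY, NODE_FAIL, ...
    return "failed"


def main(jobid):
    out = subprocess.run(["sacct", "-j", jobid, "-o", "State", "-n", "-P"],
                         capture_output=True, text=True).stdout
    lines = out.splitlines()
    print(map_state(lines[0] if lines else ""))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```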
Currently, the DNA-seq pipeline downsamples by taking the first N reads in a fastq file. Actual randomization can be achieved with seqtk sample, also for paired-end reads (seqtk sample must be called for each fastq file separately, and -s has to be identical for both fastq files).
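A toy illustration (plain Python, not the pipeline's code) of why -s must be identical for both fastq files: the same seed makes the sampler pick the same record indices in R1 and R2, so mates stay paired.

```python
import random


def sampled_records(n_records, n_sample, seed):
    """Indices of FASTQ records a seeded sampler would keep; mirrors the
    behaviour that makes seqtk sample reproducible for a fixed -s seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), n_sample))


# Identical seeds select identical records from R1 and R2, keeping read
# pairs in sync; different seeds would break the pairing.
r1_keep = sampled_records(1_000_000, 1_000, seed=100)
r2_keep = sampled_records(1_000_000, 1_000, seed=100)
assert r1_keep == r2_keep
```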
We should use the latest MACS2 version for BAMPE mode.
The X chromosome should be excluded from the calculation of the 1x normalization in bamCoverage.
The metrics file output by picard MarkDuplicates reports statistics only for "Unknown Library", although it should report per library (= read group, @RG). The read group is defined during (bowtie) mapping, but does not seem to be recognized by MarkDuplicates. Does anyone know how to fix this? It's annoying because MultiQC will fail on those metrics files.
The suffixes of the two bamCompare outputs are filtered.subtract.input.bw and filtered.log2ratio.over_input.bw, but the control sample doesn't have to be an input. In my example I have both Input and H3 controls, and if I re-run with the H3 control, the input-subtracted files would either be overwritten or not produced.
In order to support multi-threading, the deepTools plot* commands should not be run for very many samples (say >20). Tools that can provide tabular output (i.e. plotCorrelation) should be restricted to that output in such cases.
Personally, I would vote to remove the scatterplot altogether; this can be left for dedicated analysis.
plotFingerprint takes a long time with the default -n 500000. Perhaps this parameter could be reduced? Please also calculate the metrics using --outQualityMetrics.
I will implement the HiC workflow with input from Fidel. It would be an independent workflow.
The current master branch does not handle salmon output correctly for DESeq; it simply converts gene counts from floats to integers. tximport needs to be implemented using the tx2gene annotation and the quant.sf files.
This step is currently broken on the develop branch.
It actually represents the value for "paired" in the config, which is confusing.
In discussions it emerged that a --filterBAM option would be good to have, to additionally filter the mapped BAM for some regions, like chromosome contigs and blacklisted sites. This might be useful for DNA mapping and other DNA workflows: ChIP-seq, ATAC-seq and Hi-C. Maybe we can discuss this.
During the cleanup at the end of the pipeline, only the non-empty files in cluster_logs should be kept.
PNG outputs are not zoomable.
Things like the FASTQ rule should be under the localrules keyword so they run quickly and we don't have to wait for NFS lag or for the cluster to be free.
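In a Snakefile this is a one-line change; the rule body below is illustrative, not the pipeline's actual FASTQ rule:

```python
# Snakefile fragment: rules listed under `localrules` are executed on the
# submission host instead of being sent to the cluster scheduler.
localrules: FASTQ

rule FASTQ:
    input: "originalFASTQ/{sample}.fastq.gz"
    output: "FASTQ/{sample}.fastq.gz"
    shell: "ln -s ../{input} {output}"
```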
If you choose --merge_samples, the workflow continues using the unmerged samples for e.g. TAD calling. Suggestion: switch to the merged sample for further processing after merging.
For the RNA-seq pipeline: sometimes it is convenient not to run the whole pipeline but only part of it. The suggested parameters would stop the pipeline early and avoid the creation of unneeded files.
The HiC workflow needs a cluster config to manage per-job memory.
Allow submitting multiple sample sheets for pair-wise differential gene expression. For example, use the current routines for differential analysis and produce one output folder per submitted sample sheet.
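A sketch of the bookkeeping this would need; the DESeq2_<sheetname> naming scheme is an assumption:

```python
import os


def diff_expr_outdirs(sample_sheets, base="DESeq2"):
    """Map each submitted sample sheet to its own output folder, e.g.
    sampleSheet1.tsv -> DESeq2_sampleSheet1 (naming scheme is illustrative)."""
    outdirs = {}
    for sheet in sample_sheets:
        name = os.path.splitext(os.path.basename(sheet))[0]
        outdirs[sheet] = "{}_{}".format(base, name)
    return outdirs
```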
The log files (especially for deepTools_qc) should contain the actual call; it can be quite cumbersome to infer the precise call from the snakemake files. As it stands, those log files simply report some standard error, which is usually uninformative.
Integrate Travis CI for established workflows (this would take some effort).
It would be great if the HiC workflow allowed us to perform the hicPlotDistVsCounts analysis on all the corrected matrices merged with the --bin_size option.
Only previous developers can do that; I am unaware of previous changes the workflow went through. We should retrospectively tag the commits for versions (starting with v0.1). For example:
etc.
The allele-specific workflow would then get its own minor version number (v0.X.0)
...
The new ATAC-seq module from @mirax87 is already there as PR #49.
We should merge it once the new ChIP-seq revamp is merged (I still have to open the PR). I am waiting for Felix Krueger (the SNPsplit author) to fix one issue with SNPsplit, after which it would be fine.
After merging the ChIP-seq revamp we would tag v0.5, then after merging ATAC-seq we would tag v0.6.
The rMATS module is deprecated and doesn't work. I would recommend removing it, since no-one has been using it anyway.
Is there a reason that GTF files are munged beyond recognition in the RNA-seq workflow? They're fine as is for bulk RNA-seq and we can't even report that "we used Gencode m15" in the methods if we modify the crap out of them.
I got this error (local variable 'lower' referenced before assignment) while running the HiC pipeline. I have checked the get_mad_score function and I think 'lower' should be initialized before the for loop, don't you think? I added lower = 0.0 above the for loop in my local copy and it works fine.
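A simplified stand-in (not the pipeline's actual function; the loop condition here is illustrative) showing the failure mode and the one-line fix:

```python
def get_mad_score(values, threshold=3.0):
    """Simplified stand-in for the pipeline's get_mad_score: without the
    `lower = 0.0` line, any input where the condition never fires raises
    UnboundLocalError ('lower' referenced before assignment) at the return."""
    lower = 0.0  # the missing initialization reported in the issue
    for v in values:
        if v < -threshold:
            lower = v
    return lower
```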
After the ongoing migration of modules to conda, the wrappers fail since they can't find the Python installation after module load snakemake. The DNA-mapping wrapper still works since it loads a specific version of snakemake. @dpryan79 needs to fix the conda version.
In DNA-mapping, deepTools plotCorrelation produces heatmaps of the correlation between samples. If the correlation is non-negative, the color range of the correlation heatmap spans [0, 1] with a divergent colormap.
Suggested fixes:
a) Fix the color scale at [-1, 1]
b) Switch from a divergent to a sequential colormap if all values x satisfy either x > 0 or x < 0
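Both suggestions can be combined in one helper (a sketch; the colormap names are matplotlib's and are just examples):

```python
def pick_colormap(cor_min, cor_max):
    """Choose heatmap settings for a correlation matrix: a fixed divergent
    scale over [-1, 1] when values straddle zero, otherwise a sequential
    colormap spanning just the observed range."""
    if cor_min < 0 < cor_max:
        return {"cmap": "RdBu_r", "vmin": -1.0, "vmax": 1.0}
    return {"cmap": "viridis", "vmin": cor_min, "vmax": cor_max}
```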
For all workflows: if one job fails and the user re-runs the workflow, the old log file is overwritten. I think it would be better to create another log file named DNA_mapping.run2.log and so on, to keep the run history.
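A sketch of how the next log name could be chosen (a pure function over the existing file names; the base name is just an example):

```python
import re


def next_log_name(existing, base="DNA_mapping"):
    """Return the next run-numbered log file name (DNA_mapping.log, then
    DNA_mapping.run2.log, DNA_mapping.run3.log, ...) so old logs survive."""
    first = "{}.log".format(base)
    if first not in existing:
        return first
    pattern = re.compile(r"{}\.run(\d+)\.log$".format(re.escape(base)))
    runs = [int(m.group(1)) for m in map(pattern.match, existing) if m]
    return "{}.run{}.log".format(base, max(runs) + 1 if runs else 2)
```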
After the last conversation on Slack, I think the scRNA-seq workflow is not in sync with the newer changes. Maybe it's time to update it?
To avoid changes unknown to others, please make all changes through pull requests.
For downstream analysis, merging at multiple bin sizes is useful (large bins for displaying matrices at the chromosomal level, even larger bins for hicPlotDistVsCounts). It would be very helpful to be able to specify multiple values for the --bin_size option and produce the corresponding merged matrices (which are then corrected, used for TAD calling, etc.).
For the first public release of our pipeline, we should make it self-contained using docker and conda. Then we can also see whether we can add some Travis tests (issue #34).
Hi, sorry for bringing up some naive issues, but since I am very new to using the snakemake workflows I ran into some problems with their documentation. For example, for the ChIP-seq pipeline I couldn't find it mentioned anywhere that the directory of the input BAM files and the ChIP BAM files should be the same; I figured it out by looking at the Python code after getting an error while running the pipeline. Maybe one could even change the naming format that is written in the code, but if that isn't necessary, just explaining it in the ChIP-seq documentation would be fine. Another thing I noticed was the explanation of how to use the snakemake options (it is correct when running -h, but there are dashes missing in the snakepipes documentation).
I frequently run:
estimateReadFiltering -b Bowtie2/*bam -o metrics.tsv --ignoreDuplicates --minMappingQuality 5 --samFlagExclude 3844 --blackListFileName BL.bed
after mapping. It would be great if this could always be included, possibly with other defaults if you don't like the above.
In order to count genes on very large genomes (e.g. human with exons AND introns), more than the default memory is required. In particular, Salmon's suffix array construction (salmon index <genome.fa>) consumes a lot of memory.
Add a parameter to the config file to change SLURM's --mem-per-cpu if needed (by default, stay with the SLURM default).
WIP...
The scripts are a mix of Python 2 and 3 at the moment, which causes problems on the users' side.
In the BAM filtering rule, only uniquely mapping reads should be kept by default; we can add a flag to count the multimappers if users wish to.