snakepipes's Issues

sambamba

  1. replace Picard markdup with sambamba markdup (much faster)
  2. run sambamba flagstat as additional QC after mapping

Initial RTD migration

We would move the existing index/help/readme content to RTD for now. Afterwards we can open a series of issues about the documentation of each workflow separately.

--cluster-status

In recent snakemake versions, a --cluster-status option has been added to allow snakemake to simply query whether a job has completed or not. It'd be nice to add this as an option in the config file, where if it's empty then it's completely ignored (I'll make a SlurmStatus command for us to use for this).
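A rough sketch of what such a status command could look like, assuming SLURM accounting is reachable through sacct. The script layout and the names slurm_to_snakemake and query_state are made up for illustration and are not the SlurmStatus command mentioned above. Snakemake passes the job ID as the only argument and expects one of success, running or failed on stdout.

```python
#!/usr/bin/env python3
"""Hypothetical status script for snakemake's --cluster-status option."""
import subprocess
import sys

# SLURM states that mean the job is still queued or running
RUNNING_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}


def slurm_to_snakemake(state):
    """Map a SLURM job state to one of: success, running, failed."""
    # keep only the first word, e.g. "CANCELLED by 1000" -> "CANCELLED"
    state = state.strip().split()[0].rstrip("+") if state.strip() else ""
    if state == "COMPLETED":
        return "success"
    if state in RUNNING_STATES or state == "":
        # empty output: accounting may not have registered the job yet
        return "running"
    return "failed"


def query_state(job_id):
    """Ask sacct for the state of a job and return the first line."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True, check=False,
    ).stdout
    lines = out.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__" and len(sys.argv) == 2:
    print(slurm_to_snakemake(query_state(sys.argv[1])))
```

Treating an empty sacct answer as running is a design choice in this sketch; it avoids failing jobs that the accounting database has not seen yet.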

Alternative downsampling method

Currently, the DNA-seq pipeline downsamples by using the first N reads in a fastq file. Actual randomization can be achieved using seqtk sample, also for paired-end reads (seqtk sample must be called for each FASTQ file separately, and -s has to be identical for both FASTQ files).
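A minimal sketch of how the two calls could be generated so that both mates get the same seed. The function name, the out_prefix argument and the output naming are assumptions for illustration, not what the pipeline does today.

```python
import shlex


def seqtk_sample_commands(fastq_r1, fastq_r2, n_reads, seed=11, out_prefix="downsampled"):
    """Build one 'seqtk sample' shell command per mate, sharing the seed.

    Using the same -s value for both files keeps the read pairs in sync.
    Note that seqtk writes uncompressed FASTQ to stdout.
    """
    commands = []
    for mate, fastq in enumerate((fastq_r1, fastq_r2), start=1):
        out = "{}_R{}.fastq".format(out_prefix, mate)
        commands.append(
            "seqtk sample -s {} {} {} > {}".format(
                seed, shlex.quote(fastq), n_reads, shlex.quote(out)
            )
        )
    return commands


if __name__ == "__main__":
    for cmd in seqtk_sample_commands("sample_R1.fastq.gz", "sample_R2.fastq.gz", 1000000, seed=42):
        print(cmd)
```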

Picard MarkDuplicates and metrics file

The metrics file output from Picard MarkDuplicates reports the statistics only for "Unknown Library", although it should report per library = read group "@RG".
The read group is defined during (bowtie) mapping, but does not seem to be recognized by MarkDuplicates. Does anyone know how to fix this? It's annoying because MultiQC will fail on those metrics files.

change the bamCompare output suffix

The suffixes are filtered.subtract.input.bw and filtered.log2ratio.over_input.bw for the two bamCompare commands, but the control sample doesn't have to be input. In my example I have both Input and H3 controls, and if I re-run with the H3 control, the Input-subtracted files would be either overwritten or not produced.

restrict deepTools:plot*

The deepTools plot* commands should not be run for very many samples (say >20). Tools that can provide tabular output (e.g. plotCorrelation) should be restricted to that output in such cases.

Personally I would vote to remove scatterplot altogether. This can be left for dedicated analysis.

plotFingerprint

plotFingerprint takes a long time with the default -n 500000. Perhaps this parameter could be reduced? Please also calculate the quality metrics using --outQualityMetrics.

new module : HiC

I will implement the HiC workflow with input from Fidel. It would be an independent workflow.

salmon --> DESeq

The current master branch is not handling salmon output correctly for DESeq. It simply converts gene counts from floats to integers. tximport needs to be implemented using the tx2gene annotation and quant.sf files.

This step is currently broken in the develop branch.

--filterBAM option

In discussions it emerged that a --filterBAM option would be good to have, to additionally filter the mapped BAM for some regions, like chromosome contigs and blacklisted sites. This might be useful for DNA mapping and other DNA workflows: ChIP-seq, ATAC-seq and Hi-C. Maybe we can discuss this.

Make use of localrules

Things like the FASTQ rule should be under the localrules keyword so they run quickly and we don't have to wait for NFS lag or the cluster to be free.

RNA-seq - allow for multiple DESeq sample sheets

Allow submitting multiple sample sheets for pair-wise differential gene expression.

For example, use the current routines for differential analysis and produce one output folder per submitted sample sheet.

deepTools_qc log-files are uninformative

The log-files (especially for deepTools_qc) should contain the actual call. It can be quite cumbersome to infer the precise call from the snakemake-files.
As it stands those log-files simply report some standard error, which is usually uninformative.
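One way this could look, sketched as a small helper that writes the exact command line at the top of the log before running the tool. The name run_logged is hypothetical and the snakemake rules would need their own equivalent; this only illustrates the idea.

```python
import shlex
import subprocess


def run_logged(cmd, log_path):
    """Run cmd (a list of arguments) and record the exact call in the log.

    The first line of the log is the shell-quoted command, followed by
    whatever the tool writes to standard error.
    """
    with open(log_path, "w") as log:
        log.write("# command: {}\n".format(" ".join(shlex.quote(c) for c in cmd)))
        log.flush()  # make sure the header precedes the tool's own output
        return subprocess.run(cmd, stderr=log).returncode
```

With this, the log of a failed job can be copied and re-run by hand without reading the snakemake files.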

Travis

Integrate Travis CI for established workflows (this would take some effort).

  • Add travis (basic)
  • Add test datasets
  • Turn on testing (IDK if it's possible since it means installing the dependencies in travis as well)

Retrospectively tag versions

Only previous developers can do that.

I am unaware of the previous changes the workflow went through. We should retrospectively tag the commits for versions (starting with v0.1). For example:

  • v0.1 : first working implementation.
  • v0.2 : addition of RNAseq
  • v0.3 : scRNAseq

etc.

The allele-specific workflow would then get its own minor version number (v0.X.0).

New module : ATAC-seq

The new ATAC-seq module from @mirax87 is already there as PR #49.

We should merge it once the new Chip-Seq_revamp is merged (I still have to open the PR). I am waiting for Felix Kruger (SNPsplit author) to fix one issue with SNPsplit, after which it would be fine.

After merging the ChIP-seq revamp we would tag v0.5, then after merging ATAC-seq we would tag v0.6.

rMATS module: remove?

The rMATS module is deprecated and doesn't work. I would recommend removing it, since no one has been using it anyway.

Why are we modifying GTF files by default?

Is there a reason that GTF files are munged beyond recognition in the RNA-seq workflow? They're fine as is for bulk RNA-seq and we can't even report that "we used Gencode m15" in the methods if we modify the crap out of them.

HiC_get_mad_score

I got this error (local variable 'lower' referenced before assignment) while I was running the HiC pipeline. I have checked the get_mad_score function and I think 'lower' should have been initialized before the for loop. Don't you think so? I have added lower = 0.0 above the for loop in my local copy and it is working fine.
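To illustrate the pattern behind this error, here is a stand-in function (not the real get_mad_score, whose body is not shown here). If the variable is only assigned inside the loop and the condition is never met, the return statement hits an unbound name; assigning a default before the loop avoids that.

```python
def lowest_passing_value(values, cutoff):
    """Illustrative stand-in, not the actual get_mad_score function.

    Returns the smallest value that is >= cutoff, or 0.0 if none is.
    """
    lower = 0.0  # defined up front, so the return below never sees an unbound name
    for value in sorted(values):
        if value >= cutoff:
            lower = value
            break
    return lower
```

Whether 0.0 is the right fallback for the real function is a separate question; the sketch only shows why the initialization removes the error.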

module load error

Following the ongoing migration of modules to conda, the wrappers fail since they can't find the python installation after module load snakemake. The DNA-mapping wrapper is still working since it loads a specific version of snakemake. @dpryan79 needs to fix the conda version.

Misleading color range of plotCorrelation in DNA-mapping

In DNA-mapping, deepTools plotCorrelation produces heatmaps to plot the correlation between samples. If all correlations are non-negative, the color range of the correlation heatmap spans [0, 1], yet a divergent colormap is used.
Suggested fixes:
a) Fix color scale at [-1, 1]
b) Change from divergent to a sequential colormap, if all x are either x > 0 or x < 0
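The two suggestions can be combined; a small sketch of the selection logic follows. The function name is hypothetical and the returned names are just matplotlib colormap names chosen for illustration, not what plotCorrelation currently uses.

```python
def correlation_color_range(values):
    """Pick a colormap name and fixed (vmin, vmax) for correlation values.

    Mixed signs: diverging map fixed at [-1, 1] (suggestion a).
    Single sign: sequential map over the matching half range (suggestion b).
    """
    lowest, highest = min(values), max(values)
    if lowest < 0 < highest:
        return "RdBu_r", -1.0, 1.0
    if lowest >= 0:
        return "Reds", 0.0, 1.0
    return "Blues_r", -1.0, 0.0
```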

Don't over-write the main workflow log files

This applies to all workflows. If one job fails and the user re-runs the workflow, the old log file is overwritten. I think it would be better to create another log file, named DNA_mapping.run2.log and so on, to keep the run history.
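A possible way to pick the next free log name, as a sketch. The helper name next_log_path and the assumption that the first run writes to <workflow>.log are mine.

```python
import os


def next_log_path(directory, workflow):
    """Return the first unused log path for a workflow.

    First run:  <workflow>.log
    Later runs: <workflow>.run2.log, <workflow>.run3.log, ...
    """
    first = os.path.join(directory, "{}.log".format(workflow))
    if not os.path.exists(first):
        return first
    run = 2
    while True:
        candidate = os.path.join(directory, "{}.run{}.log".format(workflow, run))
        if not os.path.exists(candidate):
            return candidate
        run += 1
```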

update scRNAseq workflow

After the last conversation on Slack, I think the scRNAseq workflow is not in sync with the newer changes. Maybe it's time to update it?

HiC workflow: multiple --bin_size parameters

For downstream analysis, merging at multiple bin sizes is useful (large bins for displaying matrices at the chromosomal level, even larger bins for hicPlotDistVsCounts).

It would be very helpful to be able to specify multiple values for the --bin_size option and make the corresponding merged matrices (that are then corrected, used for TAD calling, etc).

Docker/Conda

For the first public release of our pipeline, we should make it independent using docker and conda. Then we can also see whether we can add some travis tests (issue #34).

Documentation

Hi, sorry for bringing up some naive issues, but since I am very new to the snakemake workflows I ran into some problems with their documentation. For example, for the ChIP-seq pipeline I couldn't find any mention that the input BAM files and the ChIP BAM files must be in the same directory. I figured it out by looking at the Python code after getting an error while running the pipeline. Maybe the naming format written in the code could even be changed, but if that isn't necessary, just explaining it in the ChIP-seq documentation would be fine. Another thing I noticed is the explanation of how to use snakemake options: it is correct when running -h, but there are dashes missing in the snakepipes documentation.

include estimateReadFiltering (deepTools@develop)

I frequently run:
estimateReadFiltering -b Bowtie2/*bam -o metrics.tsv --ignoreDuplicates --minMappingQuality 5 --samFlagExclude 3844 --blackListFileName BL.bed

after mapping. It would be great if this could be always included - possibly with other defaults if you don't like the above.

Fix memory issue for "gigantic" genomes

In order to count genes on very large genomes (e.g. hs with exons AND introns), more memory than the default is required.
In particular, Salmon's suffix array construction (salmon index <genome.fa>) consumes a lot of memory.
Add a parameter to the config file to change SLURM --mem-per-cpu if needed (by default, stay with the SLURM default).
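A sketch of how such a setting could be applied, assuming a hypothetical config key mem_per_cpu that is simply left out (or empty) to keep the SLURM default:

```python
def sbatch_args(config):
    """Build the sbatch argument list from a config dictionary.

    'mem_per_cpu' is a hypothetical key; when it is missing or empty,
    no memory option is passed and SLURM's own default applies.
    """
    args = ["sbatch"]
    mem = config.get("mem_per_cpu")
    if mem:
        args.append("--mem-per-cpu={}".format(mem))
    return args
```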

switch to python 3

The scripts are a mix of python 2 and 3 at this moment, which causes problems on the user's side.
