
zarp's People

Contributors

angrymaciek, balajtimate, dominikburri, fgypas, magmir71, mkatsanto, mzavolan, nielsschlusser, ninsch3000, noepozzan, uniqueg


zarp's Issues

Run C.elegans samples

Test the workflow on another organism, additionally exercising the SRA download. For example, C. elegans samples from project SRP330447 could be used.

tempdir specification required for STAR-related rules

In a very specific test case where @dominikburri cloned the repo on his local machine and tried to execute the pipeline with Singularity containers, two STAR-related rules failed. The input data were on our login node, mounted to his OS. STAR could not find a tempdir and required an additional flag.

I think this could be fixed in the profiles with Snakemake's --default-resources keyword: we would only need to specify tmpdir= there.

CC @mkatsanto
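A minimal sketch of what the profile entry could look like, assuming a recent Snakemake where tmpdir is a standard resource (the path is a placeholder, and the quoting of string resources may differ between Snakemake versions):

    # profile config.yaml (sketch; path is a placeholder)
    default-resources:
      - tmpdir="'/scratch/snakemake_tmp'"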

chore: inspect the required resource scaling

Is your feature request related to a problem? Please describe.
We have encountered a problem with cutadapt crashing due to low memory on some datasets; we should check how all the steps scale with the input data size and account for that accordingly in the pipeline.

refactor: remove {seqmode} wildcard if applicable

Is your feature request related to a problem? Please describe.
As sample names have to be unique, there could not be one sample with two different sequencing modes. That means the seqmode doesn't have to appear in filenames, and consequently the wildcard {seqmode} is redundant. However, whether this is true has to be checked first.

Describe the solution you'd like
Check whether it might be possible to remove seqmode from filenames. If yes, implement.

Describe alternatives you've considered
Maybe this theory is wrong and we need seqmode as a wildcard.

Additional context
At least in some places it should be possible to remove the wildcard

feat: Automatically increase compute resources if insufficient

Currently for HPC (Slurm) execution, the compute resources like memory and time are specified per job in profiles/slurm-config.json.

If a job fails because it needs more resources, one has to adjust the above file and re-run the pipeline manually, which is not very efficient.

Snakemake provides functionality to re-run failed jobs with adjusted resources: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=attempt#resources

What it requires:

  • the option restart-times has to be added to the snakemake command with the number of attempts
  • each affected rule should use the attempt parameter to adjust its memory and/or time limits (e.g. by multiplying the base memory in MB with attempt)

In this way, jobs that fail because of insufficient resources are re-submitted with more resources.
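A minimal sketch of how such a rule could look (rule name, paths and base values are illustrative, not taken from zarp), combined with e.g. restart-times: 2 on the command line or in the profile:

    rule trim_example:
        input:
            "results/{sample}/reads.fastq.gz"
        output:
            "results/{sample}/trimmed.fastq.gz"
        resources:
            # base memory is a placeholder; doubled, tripled, ... on each retry
            mem_mb=lambda wildcards, attempt: attempt * 4000
        shell:
            "cutadapt -o {output} {input}"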

Snakemake update

Describe the bug
Updating Snakemake is crucial, as it addresses Slurm execution.
Currently there are too many sacct calls during Slurm execution, which might cause issues for the execution system.
Snakemake-Profiles/slurm#81

unset temporary directory for map genome STAR?

This is rather a note than an issue.

The provided tests/input_files/rule_config.yaml contains specification for the temporary directory in rule map_genome_star (and pe_map_genome_star):

--outTmpDir: '/tmp/star_tmp'

I guess this temporary directory is fine on Linux systems and for local execution, but I ran into problems when running on Slurm.
I did not have problems with this before, but recently I got errors with the temporary directory, for some samples only (unfortunately I don't have the error message anymore; it got overwritten by new, successful runs).
I tried to

  • switch /tmp/star_tmp to $TMPDIR/star_tmp within the rule config,
  • set it directly in the rule map_genome_star in workflow/rules/single_end.snakefile.smk,
  • use the actual value on the cluster instead of $TMPDIR,

but all attempts produced the same error message that the temporary directory is not accessible. Binding the temp dir in the --singularity-args did not help either.

I ended up removing the temporary-directory specification completely and using the default instead: "the temp directory will default to outFileNamePrefix_STARtmp" (from the STAR manual 2.7.10a), which resolved the issue.
This is not ideal, as the temp directory now sits inside the results directory and does not use the cluster's temp directories.
Since I don't know whether the problem lies with the cluster, Singularity (e.g. binding paths), or the Snakemake set-up, I cannot propose a solution and would rather ask for help.

bug: local test with singularity fails?

Describe the bug
I have cloned the repo fresh and followed the instructions from the README to run a local test with singularity containers.

To Reproduce
Follow the testing as in the README.

Expected behavior
Clean pass

Screenshots

I cannot post the screenshot, so as not to disclose the sciCORE directory tree. Please check Slack.

bug: local test with conda fails?

Describe the bug
Running bash tests/test_integration_workflow_with_conda/test.local.sh exits with

WorkflowError:
Python package jinja2 must be installed to create reports.
  File "/home/christina/miniconda3/envs/zarp/lib/python3.7/site-packages/snakemake/report/__init__.py", line 594, in auto_report

This happens at the step

+ snakemake --snakefile=../../workflow/Snakefile --configfile=../input_files/config.yaml --report=snakemake_report.html

although the environment.yml for zarp conda env specifies jinja2=2.11.2

To Reproduce
Steps to reproduce the behavior:

  1. Follow installation instructions in zarp README to create the conda env (I did a fresh creation of the conda env root & update with dev)
  2. conda activate zarp
  3. run bash tests/test_integration_workflow_with_conda/test.local.sh

Expected behavior
exit status 0.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

feat: create .snakemake-workflow-catalog.yml

Is your feature request related to a problem? Please describe.
For snakemake workflow catalogue visibility

Describe the solution you'd like
Create the file; check out the instructions at https://snakemake.github.io/snakemake-workflow-catalog/?rules=true

usage:
  mandatory-flags: # optional definition of additional flags
    desc: # describe your flags here in a few sentences (they will be inserted below the example commands)
    flags: # put your flags here
  software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)
    conda: true # whether pipeline works with --use-conda
    singularity: true # whether pipeline works with --use-singularity
    singularity+conda: true # whether pipeline works with --use-singularity --use-conda
  report: true # add this to confirm that the workflow allows to use 'snakemake --report report.zip' to generate a report containing all results and explanations
                

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Combine conda installations into one

Is your feature request related to a problem? Please describe.
When installing zarp, one usually wants to test the workflow with the provided tests.
But this requires the additional installation of the dev environment: https://github.com/zavolanlab/zarp#5-non-essential-dependencies-installation
In particular, bedtools is missing from the "standard" environment.

Describe the solution you'd like
One conda installation should be sufficient to run the provided tests.

Describe alternatives you've considered
Keep the dev environment but explain the behaviour in more detail in the README.

cutadapt output is inconsistent while running in singularity and conda

@dominikburri discovered this issue while debugging today.

At the end of the workflow we get different output files for cutadapt depending on which execution strategy we use; (on the test data) the only difference is that one column with the version number is added to the output table. It is just a format issue; the actual result values are the same.

This is strange however since we specify the same version of cutadapt for both conda and singularity. Maybe they are pulling different patch versions? or different builds?

Anyway, it should be easy to fix by freezing exactly the same version of the software for both approaches.
This could potentially solve the problem we discovered in #16 (comment); I think this might be the same thing manifested in a different place.

FYI @fgypas
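As a sketch, the conda side could pin the exact version (and, if needed, the build) in the rule's environment file; the version below is illustrative, not necessarily what zarp currently uses:

    channels:
      - conda-forge
      - bioconda
    dependencies:
      - cutadapt=3.4

On the Singularity side, the corresponding BioContainers image tag (including the build suffix) would then need to be pinned to exactly the same version.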

Github actions too slow

Is your feature request related to a problem? Please describe.
The GitHub Actions CI test creates the same environment multiple times and uses separate jobs for tasks that could be combined into one job.
Using mamba may make it faster.

Describe the solution you'd like
Use mamba and combine all tasks into one job, creating the zarp env only once.

Additional context
https://github.com/conda-incubator/setup-miniconda

Slurm QOS parameter

Is your feature request related to a problem? Please describe.
Not every Slurm cluster supports QOS, but the default configuration specifies one:

"qos": "6hours",

This will cause errors on clusters without QOS.

Describe the solution you'd like
Ideally, each slurm cluster can handle the workflow without modifications.
The QOS assignment above could be removed from the default slurm configuration, and an additional one could be created for clusters with QOS.

Describe alternatives you've considered
Alternatively, note the current behaviour with QOS in the readme or in some other format (e.g. FAQ).

Support for unstranded data

This issue is a transferal from the previous issue on gitlab (number 89).

Original description

It seems that the v0.1 milestone will support only stranded data.
We need to make sure that in later versions we support unstranded data as well.

Comments

@fgypas
It seems that we might not support it in the end, so closing for now.

@uniqueg
I disagree strongly and think it should be part of the published version. The very first test with external real-world data (SARS-related samples) demonstrated that there are still plenty of relevant unstranded libraries around. Besides, whether a library is stranded or not is (unfortunately) not typically reported when uploading samples to SRA, so whenever anyone wants to run any sample from there, it's a gamble whether we will accept it (after first checking with another pipeline) or not. That's a really poor user experience...

@mzavolan
Well... I think that generating unstranded data makes no sense, especially now. When analyzing such data, people make choices that are not warranted and certainly introduce errors (e.g. pooling reads from the plus and minus strands and, of course, discarding regions where there is transcription in both directions). I don't want to spend time implementing and testing choices like this. How do you want to proceed?

@uniqueg
I understand the sentiment, but then that argument applies more or less to any data obtained with inferior protocols or outdated technology. And to my taste that depends a little too much on personal opinion and on what kind of resources you have access to. Unstranded data have proven useful in the past, so I can't really see how analyzing them makes no sense, despite all the obvious drawbacks.
Anyway, how about we leave this issue open and defer the discussion until we know how to proceed with Rhea? If we want to allow our users to analyze samples straight from SRA, I think Rhea should handle the vast majority of samples on there, and that would probably mean it should handle unstranded libraries. If we don't care too much about that particular use case and mostly concern ourselves with how Rhea can serve ourselves, then we should probably drop it for the reasons mentioned.

@mkatsanto
This feature will not be implemented in the near future according to our scope. If the public requires it we can implement it in later versions.

@uniqueg
Do it if reviewers ask for it; otherwise wait until users ask for it.

bug: results subdir architecture bug?

While @meric466 was running the test data, @mkatsanto and I discovered that there is probably a bug in the results directory architecture.
Under results/samples we have a per-sample results directory. However, besides each sample s there is another directory s.se which contains only the sort_genomic_alignments_samtools subdirectory. We are probably not evaluating wildcards properly in some rule...

Binding path error

Describe the bug
The singularity slurm test fails with error: unable to add "{path}" to mount list: destination must be an absolute path

To Reproduce
Run zarp/tests/test_integration_workflow/test.slurm.sh

Expected behavior
The test should pass; instead, the workflow fails with the error described above.
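The error message comes from the bind specification handed to Singularity: bind destinations must be absolute paths. A sketch of the expected syntax (paths are placeholders, not the project's actual fix):

    snakemake --use-singularity \
        --singularity-args "--bind /absolute/host/path:/absolute/container/path"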

outdated gitignore

Describe the bug

gitignore is outdated

To Reproduce
Steps to reproduce the behavior:

  1. Create a new run within config.
  2. Observe that the run files are not ignored by git.

Expected behavior
All results and config files from own samples should be ignored.

Possible fix:
The lines containing run in .gitignore (lines 329 & 330) should be replaced by config.

Download genome resources

The genome sequence and annotation file for the species from which the RNA-seq data comes are needed as inputs to zarp.

We use the reference files from ENSEMBL.

The user supplies the species name (e.g. homo_sapiens) or taxon ID (e.g. 9606) and the chromosome sequences that compose the karyotype (no haplotypes) are downloaded, along with the corresponding annotation.
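As a rough sketch of the kind of download such a rule would perform (the release number is an example, not something zarp pins, and the primary assembly here stands in for the karyotype-only selection described above):

    release=110  # example Ensembl release
    wget "https://ftp.ensembl.org/pub/release-${release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
    wget "https://ftp.ensembl.org/pub/release-${release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${release}.gtf.gz"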

feat: sra download

Download run files from SRA.

Describe the solution you'd like
This download should NOT happen within core ZARP. Instead, a separate (small) snakemake pipeline will for now be created within the workflow/rules/ directory, which can be called by ZARP-cli. At a later stage that pipeline might be migrated elsewhere.

Inputs:

  • SRR identifier OR SRA run table

Outputs:

  • directory containing all downloaded SRR files

See related zarp-cli issue
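A rough sketch of what such a rule could look like using sra-tools (rule name, paths, the {srr} wildcard and the env file are illustrative only):

    rule download_sra_run:
        output:
            directory("results/sra_downloads/{srr}")
        conda:
            "../envs/sra_tools.yaml"  # hypothetical env providing sra-tools
        shell:
            """
            prefetch {wildcards.srr} --output-directory {output}
            fasterq-dump {output}/{wildcards.srr} --outdir {output}
            """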

doc: config/README.md

Is your feature request related to a problem? Please describe.
For snakemake workflow catalogue visibility

Describe the solution you'd like
move section "running the workflow on your own samples" to config/README.md

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

bug: htsinfer: ValueError on max function

Describe the bug
Running the htsinfer workflow fails on some SRA samples.
It seems to be a problem when no read_lengths are present, so the max function has nothing to operate on.

Tried with this sample table and SRA download workflow (which downloads the fq.gz successfully).

sample	fq1	fq2
SRR18223821		
SRR18125741		
SRR19162082		
SRR19770954		
SRR701798		

Screenshots
Parts of the error within htsinfer_to_tsv.stderr.log

    tparams["index_size"] = max([int(i) for i in read_lengths if i is not None])

ValueError: max() arg is an empty sequence
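One way the failing line could be guarded (a sketch only; read_lengths and tparams come from the traceback above, and the fallback value is a placeholder that would need to be agreed on):

    lengths = [int(i) for i in read_lengths if i is not None]
    # fall back to a sentinel instead of crashing when nothing could be inferred
    tparams["index_size"] = max(lengths) if lengths else None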

Do not require logo & URL for MultiQC report

The user should have the ability to choose not to supply a custom logo and URL for the MultiQC report when running ZARP. The outcome should be that no logo and URL are provided in the report.

To achieve this, the user should be able to specify empty strings for these parameters. To further improve user experience, the report_logo and report_url parameters could/should be made entirely optional in config.yaml (then taking on empty strings as defaults).

Another question in this context is whether, in the current implementation, a user is allowed to specify an HTTP(S) URL pointing to a logo, rather than the logo itself. It would just be nice to test this out while implementing the above, so that we can document that behavior more accurately.
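A sketch of how the two parameters could be made optional in the Snakefile, defaulting to empty strings (key names follow the issue text; this is not the current implementation):

    report_logo = config.get("report_logo", "")
    report_url = config.get("report_url", "")
    # only forward them to the MultiQC configuration if they are non-empty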

Fastqc report after adapter trimming

Is your feature request related to a problem? Please describe.
The current fastqc is applied to the raw fastq files. However, it is still relevant for the user to see what the read quality is, and whether overrepresented sequences remain, after trimming.

Describe the solution you'd like
Introduce a second rule fastqc_trimmed, where the trimmed fastq file is used as input.
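A rough sketch of such a rule (paths and names are illustrative, not zarp's actual ones):

    rule fastqc_trimmed:
        input:
            "results/{sample}/{sample}.trimmed.fastq.gz"
        output:
            directory("results/{sample}/fastqc_trimmed")
        shell:
            "mkdir -p {output} && fastqc --outdir {output} {input}"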

docs: add changelog

Set up changelog (in CHANGELOG.md in root directory), describing v1.0.0 features in bullet point format

Choose different SRA samples for SRA download pipeline test

Is your feature request related to a problem? Please describe.
The two samples specified in the current test file tests/input_files/sra_samples.tsv are not suitable for testing zarp: the first is Nanopore sequencing (zarp only supports Illumina) and the second is metagenomic, which makes it impossible to test the organism inference of the htsinfer pipeline that is supposed to run after the SRA pipeline.

Describe the solution you'd like
Put different SRA numbers in the sra_samples.tsv file, e.g. some from mouse.

Describe alternatives you've considered
SRAs from E.coli, or C.elegans, or D.melanogaster, or S.cerevisiae

Additional context
Add any other context or screenshots about the feature request here.

Infer sample metadata from data itself with HTSinfer

Is your feature request related to a problem? Please describe.

Currently, metadata for each sample has to be determined and put in a sample table, which is tedious.

Describe the solution you'd like

We can call HTSinfer and have the tool sniff the missing metadata. As HTSinfer does not estimate fragment size distributions, for single-ended libraries we should set a "reasonable" default here, if the user does not provide them.

Note that this is a meta-issue, as it will require several changes, at least the following (prepare one issue each for the following):

  • Publish HTSinfer on Conda and BioContainers.
  • Create a rule that runs HTSinfer
  • Parse original sample table and call HTSinfer rule accordingly (no need to call it with all functionalities for all samples if metadata was already supplied; rule or bare Python code)
  • Prepare an updated sample table (or dictionary that Snakemake can access?) from the output (rule or bare Python code)
  • Read values from updated table or dictionary across entire pipeline

Additional context

Originally, we were thinking of integrating HTSinfer into ZARP-cli. However, as it is potentially a long-running tool, it is better to run it as part of the pipeline in its own environment.

feat: replace snakemake pkg manager: conda -> mamba

Since we use a Snakemake version > 6.0.0, the preferred frontend package manager for building internal, per-rule envs is now mamba, not conda.

I tried to tackle it quickly but, strangely enough, when I specify mamba our cutadapt output files do not match the expected checksums... (see: #16 (comment)). Therefore, for now, our profiles specify --conda-frontend conda.

It might be nice to replace conda with mamba, eventually...
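For reference, the switch would amount to changing the frontend setting, e.g. on the command line (or the corresponding conda-frontend key in the profile):

    snakemake --snakefile workflow/Snakefile --use-conda --conda-frontend mamba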

Scale time resources

Is your feature request related to a problem? Please describe.
Each rule has a static time requirement. However, real running times scale with the input file sizes. On top of that, the existing implementation specifies queues particular to the sciCORE Slurm system in the Slurm profile, which might cause incompatibilities with other Slurm systems.

Describe the solution you'd like

  • Make use of time variable within each rule. Scale the estimate with the input sizes.

Based on some existing runs, I can fit size vs. runtime by least squares and use the coefficient for scaling (see the sketch after these notes). I am not sure this will always work (multiple inputs, multiprocessing optimisation), but as we gather more information on successful runs we can improve the accuracy of these estimates.

  • Check if the attempt keyword can be used to increase the time requirement upon restart.

It turns out there is no access to the attempt parameter in the params field. The plan is to first calculate the time estimate in resources and then pass it on to the params field, so that we can feed it to the cluster.json.

  • Check if the time estimate is translated correctly in terms of the queueing system.
    For the 6hour queue this works fine. Jobs that require less than 30 minutes are still going to the 6hour queue, which is specific to this slurm instance and should be fine in general.
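A sketch of how the first point could look, assuming a recent Snakemake that exposes input.size_mb to resource callables (rule name, paths and coefficients are placeholders to be fitted per rule):

    rule map_genome_star_example:
        input:
            "results/{sample}/trimmed.fastq.gz"
        output:
            "results/{sample}/aligned.bam"
        resources:
            # base minutes plus minutes per MB of input, scaled up on each retry
            runtime=lambda wildcards, input, attempt: attempt * (30 + int(0.05 * input.size_mb))
        shell:
            # the actual STAR call is omitted; placeholder only
            "touch {output}"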

Implementation tasks:

  • Add params and resources time parameters and use standard times first
  • Fix the time parameter to be scaled to the input sizes by finding coefficient of scaling for a few rules (ensure the concept works).
  • Fix the time parameter to be scaled to the input sizes by finding coefficient of scaling for each rule
  • Remove specific rule specifications in cluster.json files
  • Run standard tests
  • Run some real dataset to ensure the estimates are reasonable
  • Check efficiency score of the workflow to observe if it has increased

Describe alternatives you've considered
If all of the above does not work, focus on eliminating the queue specification from the cluster.json files.

chore: snakemake version update

Is your feature request related to a problem? Please describe.
There has been a new major Snakemake release since we previously pushed this project: 7.0.0+.
Should we update @mkatsanto?

Describe the solution you'd like
If yes - it would require changing two lines in the environment recipes, should be easy...

take a closer look at the "notemp" flag in snakemake profiles

An issue we discovered with @mkatsanto today while debugging:

We are not sure the notemp flag should be set in the Snakemake profiles. However, if we get rid of it, multiqc fails during pipeline execution: it does not find files it expects... They are probably removed due to the temp() mechanism.

We need to take a closer look at what needs to be specified as required input for that rule...
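A sketch of the suspected situation: if an upstream output is marked temp() but is not declared as input to the MultiQC rule, Snakemake may delete it before MultiQC runs. Declaring everything MultiQC reads as input would keep those files around long enough (paths and rule name are hypothetical):

    rule multiqc_example:
        input:
            # everything MultiQC needs must be listed here, even if an
            # upstream rule marks it with temp()
            "results/fastqc",
            "results/star_logs",
        output:
            "results/multiqc_report.html"
        shell:
            "multiqc {input} --outdir results"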

chore: migrate future issues from GitLab

We have some ongoing work/discussion in a few issues that we could come back to in the future.
It would be very nice to have them transferred into here.
Should we transfer them manually or is there an automatic way?

chore: set up automatic versioning

When consistently using semantic/conventional commits, it is possible to automatically create releases, for example, whenever we merge into the main branch.

This should cover the following:

  • Determine whether a release should bump the major, minor or patch version (this requires strict following of the conventional commits specs!)
  • Tag and describe the release
  • Update the changelog (see #5)

Possibly we could also configure this to publish updates to Zenodo and other outlets (e.g., WorkflowHub, Dockstore) via web hooks, but I'm not sure.

Normalize results by fraction of reads in some MultiQC plots

Is your feature request related to a problem? Please describe.
Some MultiQC plots (e.g., from FastQC, tin-score-calculation, STAR), report absolute numbers of reads rather than fractions of reads. While it is useful to know the sequencing depths of the different samples/libraries, it is perhaps not very informative/helpful to represent these differences in multiple plots, as this makes it hard to compare key metrics across samples, e.g., the fraction of (uniquely) mapped reads across samples. This is especially problematic when these plots are exported and put in publications as static images.

Describe the solution you'd like
Have one plot that reports sequencing depths/library sizes for each sample as bar plots. Have all other plots report fractions of reads instead.

Additional context
It may not be possible to configure MultiQC in a way that read fractions are visualized instead of read numbers. In that case, it is probably too much of a hassle to implement the desired solution.

bugfix: update & pull new version of the TIN score repo

The TIN score script takes too much memory; the sciCORE team needed to kill the cluster jobs...
It seems that there is still some problem with the parallelization.
REPO: https://github.com/zavolanlab/tin-score-calculation

Once this is fixed we should bump up the version and create a new release in the TIN score repo + push to pypi (automatic in the CI) + upload to biocontainers. Then we should refer to that new version in our zarp main Snakefile here.

CC: @mkatsanto @dominikburri

chore: create an example run

Created on GitLab by @mkatsanto, initially assigned to @meric466 :

Create a run with some real samples, along with the reports and visualisations.
Input data, configs, intermediate data & output should be made available on Zenodo and linked from the documentation.

rule star_rpm with FileNotFoundError

This is more of a note for others who experience similar behaviour, so that they can rule this out as the problem.

Describe the bug
Workflow fails on rule star_rpm with FileNotFoundError.

To Reproduce
Steps to reproduce the behavior:

  1. create a zarp run, but keep config and results directories separate.
  2. Start zarp with Singularity from base repo directory.
  3. Workflow will likely fail on star_rpm
  4. Remove shadow: "full" from the rule, restart.
  5. Workflow will run through.

Expected behavior
The shadow directive resolves paths relative to the working directory. Maybe a more meaningful error message would help, or a check of the file locations could be done before the start.

Additional context
The shadow keyword reduces the workflow's disk space footprint and makes it possible to delete whole temp directories.
But it requires that all files are in subdirectories of the working directory.
Maybe removing it altogether is less hassle?
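For illustration, a minimal rule using the directive discussed above (rule name and paths are placeholders): with shadow: "full" the rule is executed in a shadow copy of the working directory, so, as described above, all of its files need to live under the working directory.

    rule star_rpm_example:
        input:
            "results/{sample}/aligned.bam"  # must live under the working directory
        output:
            "results/{sample}/rpm_signal.bw"
        shadow:
            "full"
        shell:
            # the actual command is omitted; placeholder only
            "touch {output}"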
