zavolanlab / zarp
The Zavolab Automated RNA-seq Pipeline
License: Apache License 2.0
Test the workflow on another organism, additionally using the SRA download; e.g., one could try C. elegans samples from project SRP330447.
In a very specific test case, where @dominikburri cloned the repo on his local machine and tried to execute the pipeline with singularity containers, two STAR-related rules failed. The input data were on our login node, mounted to his OS. STAR could not find a tempdir and required an additional flag.
I think this could be fixed in the profiles with the --default-resources keyword for snakemake: we would need to specify tmpdir= just there.
CC @mkatsanto
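A possible fix along those lines would be a one-liner on the snakemake invocation (or the equivalent default-resources entry in the profile config); this is an untested sketch, and the path is a placeholder for an existing, writable scratch directory:

    snakemake --default-resources "tmpdir='/scratch/tmp'" ...

Note that the value is evaluated as a Python expression, hence the nested quotes.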
Is your feature request related to a problem? Please describe.
We have encountered a problem with cutadapt
crashing due to low memory on some datasets; we should check how all the steps scale with input data size and account for that accordingly in the pipeline (see the sketch below).
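As a starting point for the scaling, the cutadapt rule's memory could be derived from the input size; the sketch below is illustrative (paths, multiplier, and floor are placeholders to calibrate, not the actual rule from the workflow):

    rule cutadapt:
        input:
            reads="results/{sample}/{sample}.fq.gz"
        output:
            "results/{sample}/{sample}.trimmed.fq.gz"
        resources:
            # at least 2 GB, otherwise ~2 MB of memory per MB of input
            mem_mb=lambda wc, input: max(2000, int(2 * input.size / 1e6))
        shell:
            "cutadapt -o {output} {input.reads}"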
Currently blocked by snakemake/snakefmt#118
Last update: snakemake/snakefmt#118 (comment)
Is your feature request related to a problem? Please describe.
As sample names have to be unique, there could not be one sample with two different sequencing modes. That means the seqmode doesn't have to appear in filenames, and consequently the wildcard {seqmode} is redundant. However, whether this is true has to be checked first.
Describe the solution you'd like
Check whether it might be possible to remove seqmode from filenames. If yes, implement.
Describe alternatives you've considered
Maybe this theory is wrong and we need seqmode as a wildcard.
Additional context
At least in some places it should be possible to remove the wildcard.
Currently, for HPC (Slurm) execution, the compute resources like memory and time are specified per job in profiles/slurm-config.json.
If the execution fails because a job needs more resources, one needs to adjust the above file and re-run the pipeline manually, which is not very effective.
Snakemake provides functionality to re-run failed jobs: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=attempt#resources
What it requires:
- restart-times has to be added to the snakemake command, with the number of attempts
- per-rule resources need to use the attempt variable, with which the memory and/or time limits can be adjusted (e.g. multiplying attempt with the memory in MB)
In this way, jobs that fail because of insufficient resources are re-submitted with more resources.
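A sketch of how the two pieces fit together (the rule name mirrors the workflow's, but the body and numbers are placeholders):

    rule map_genome_star:
        input:
            "data/{sample}.fq.gz"
        output:
            "results/{sample}.bam"
        resources:
            # 32 GB on the first attempt, 64 GB on the second, and so on
            mem_mb=lambda wc, attempt: 32000 * attempt
        shell:
            "touch {output}"

combined with, e.g., snakemake --restart-times 3 on the command line, or restart-times: 3 in the profile.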
Describe the bug
Updating snakemake is crucial, as it will deal with slurm execution.
Currently there are too many sacct calls during slurm execution, which might cause issues for the execution system.
Snakemake-Profiles/slurm#81
This is rather a note than an issue.
The provided tests/input_files/rule_config.yaml contains a specification for the temporary directory in rule map_genome_star (and pe_map_genome_star): see zarp/tests/input_files/rule_config.yaml, line 171 at commit 33a8e32.
I guess this temporary directory is fine on Linux systems and for local execution, but I ran into problems when running it on slurm.
Before, I did not have problems with this, but recently I got - for some samples only - errors with the temporary directory (unfortunately I don't have the error message anymore; it got overwritten by new, successful runs).
I tried to:
- change /tmp/star_tmp to $TMPDIR/star_tmp within the rule config
- set it directly in map_genome_star in workflow/rules/single_end.snakefile.smk
- replace $TMPDIR with the actual value on the cluster
but all with the same error message that the temporary directory is not accessible. Binding the temp dir in the singularity-args did not help.
I ended up completely removing the specification of the temp directory and instead using the default one: "the temp directory will default to outFileNamePrefix_STARtmp" (from STAR manual 2.7.10a), which resolved the issue.
This is not ideal, as it is now within the results directory and does not use the cluster's temp directories.
Since I don't know whether the problem is with the cluster, Singularity (e.g. binding paths), or the snakemake set-up, I cannot propose a solution and rather ask for help.
Describe the bug
The cluster.json tends to underestimate the resources required for a successful run.
Expected behavior
Resources have to be updated manually
Describe the bug
I have cloned the repo fresh and followed the instructions from the README to run a local test with singularity containers.
To Reproduce
Follow the testing as in the README.
Expected behavior
Clean pass
Screenshots
I cannot post the screenshot, as it would disclose the sciCORE directory tree. Please check Slack.
Describe the bug
Running bash tests/test_integration_workflow_with_conda/test.local.sh
exits with
WorkflowError:
Python package jinja2 must be installed to create reports.
File "/home/christina/miniconda3/envs/zarp/lib/python3.7/site-packages/snakemake/report/__init__.py", line 594, in auto_report
This happens at the step
+ snakemake --snakefile=../../workflow/Snakefile --configfile=../input_files/config.yaml --report=snakemake_report.html
although the environment.yml for zarp conda env specifies jinja2=2.11.2
To Reproduce
Steps to reproduce the behavior:
- follow the README to create the conda env (I did a fresh creation of the conda env root & update with dev)
- conda activate zarp
- bash tests/test_integration_workflow_with_conda/test.local.sh
Expected behavior
exit status 0.
Is your feature request related to a problem? Please describe.
For snakemake workflow catalogue visibility
Describe the solution you'd like
Create the file; see the instructions at https://snakemake.github.io/snakemake-workflow-catalog/?rules=true

    usage:
      mandatory-flags: # optional definition of additional flags
        desc: # describe your flags here in a few sentences (they will be inserted below the example commands)
        flags: # put your flags here
      software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)
        conda: true # whether pipeline works with --use-conda
        singularity: true # whether pipeline works with --use-singularity
        singularity+conda: true # whether pipeline works with --use-singularity --use-conda
      report: true # add this to confirm that the workflow allows to use 'snakemake --report report.zip' to generate a report containing all results and explanations
Is your feature request related to a problem? Please describe.
When installing zarp, one usually wants to test the workflow with the provided tests.
But this requires the additional installation of the dev environment: https://github.com/zavolanlab/zarp#5-non-essential-dependencies-installation
In particular, bedtools is missing from the "standard" environment.
Describe the solution you'd like
One conda installation should be sufficient to run the provided tests.
Describe alternatives you've considered
Keep the dev environment, but explain the behaviour in more detail in the README.
@dominikburri discovered this issue while debugging today.
At the end of the workflow we get different output files for cutadapt depending on which execution strategy we use (on the test data), only because one column with the version number is added to the output table. It is just a format issue; the actual result values are the same.
This is strange, however, since we specify the same version of cutadapt for both conda and singularity. Maybe they are pulling different patch versions? Or different builds?
Anyway, this should be easy to fix by pinning exactly the same version of the software for both approaches.
This could potentially solve the problem we discovered in #16 (comment). I think this might be the same thing manifesting in a different place.
FYI @fgypas
Is your feature request related to a problem? Please describe.
The GitHub Actions CI test creates the same environment multiple times and uses separate jobs for tasks that could run in one job.
Using mamba may make it faster.
Describe the solution you'd like
Use mamba and run all tasks in one job, creating the zarp env only once (see the sketch below).
Additional context
https://github.com/conda-incubator/setup-miniconda
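A hedged sketch of the consolidated setup step, using options from the setup-miniconda README (the environment file path matches the repo; everything else is illustrative):

    - uses: conda-incubator/setup-miniconda@v2
      with:
        miniforge-variant: Mambaforge
        use-mamba: true
        environment-file: install/environment.yml
        activate-environment: zarp

Subsequent test steps in the same job could then reuse the activated zarp env instead of re-creating it.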
Is your feature request related to a problem? Please describe.
Not every slurm cluster supports QOS, but the default configuration specifies one: see zarp/profiles/slurm-config.json, line 4 at commit 33a8e32.
Describe the solution you'd like
Ideally, each slurm cluster can handle the workflow without modifications.
The QOS assignment above could be removed from the default slurm configuration, and an additional one could be created for clusters with QOS.
Describe alternatives you've considered
Alternatively, note the current behaviour with QOS in the readme or in some other format (e.g. FAQ).
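For illustration, a default entry without the QOS field might look like this (key names and values are assumptions, not the actual file); a second, QOS-enabled configuration would simply add, e.g., "qos": "6hours" to it:

    {
        "__default__": {
            "time": "04:00:00",
            "mem": "4G",
            "threads": "1"
        }
    }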
This issue is a transferal from the previous issue on gitlab (number 89).
It seems that the v0.1 milestone will support only stranded data.
We need to make sure that in later versions we support unstranded data as well.
@fgypas
It seems that we might not support it in the end, so closing for now.
@uniqueg
I disagree strongly and think it should be part of the published version. The very first test with external real-world data (SARS-related samples) demonstrated that there are still plenty of relevant unstranded libraries around. Besides, whether a library is stranded or not is (unfortunately) not something that is typically reported when uploading samples to SRA, so whenever anyone wants to run any sample from there, it's a gamble whether we will accept it (after checking first with another pipeline) or not. That's a really poor user experience...
@mzavolan
Well... I think that generating unstranded data makes no sense, especially now. When analyzing such data, people make choices that are not warranted and certainly introduce errors (e.g. cumulating reads from the plus and minus strands and, of course, discarding regions where there is transcription in both directions). I don't want to spend time implementing and testing choices like this. How do you want to proceed?
@uniqueg
I understand the sentiment, but that argument applies more or less to any data obtained with inferior protocols or outdated technology. And to my taste, that depends a little too much on personal opinion and on what kind of resources you have access to. Unstranded data has proven useful in the past, so I can't really see how analyzing it makes no sense, despite all the obvious drawbacks.
Anyway, how about we leave this issue open and defer the discussion until we know how to proceed with Rhea? If we want to allow our users to analyze samples straight from SRA, I think Rhea should handle the vast majority of samples on there, and that would probably mean it should handle unstranded libraries. If we don't care too much about that particular use case and mostly concern ourselves with how Rhea can serve ourselves, then we should probably drop it for the reasons mentioned.
@mkatsanto
This feature will not be implemented in the near future according to our scope. If the public requires it we can implement it in later versions.
@uniqueg
Do it if reviewers ask for it; otherwise wait until users ask for it.
While @meric466 ran the test data, @mkatsanto and I discovered that there is probably a bug in the results directory architecture.
Under results/samples we have a per-sample results directory. However, besides each sample s we have another directory s.se, which contains only a sort_genomic_alignments_samtools subdir. We are probably not evaluating wildcards properly in some rule...
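One guard worth trying (a sketch, not a verified fix): constrain the sample wildcard globally so that a rule can never match a name like s.se where another rule matches s plus a literal .se suffix:

    wildcard_constraints:
        sample="[^./]+"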
Describe the bug
The singularity slurm test fails with error: unable to add "{path}" to mount list: destination must be an absolute path
To Reproduce
Run zarp/tests/test_integration_workflow/test.slurm.sh
Expected behavior
The workflow should pass; instead, it fails with the error in the description.
Link to the whole document available @ the project's Slack workspace.
CC: @mzavolan @uniqueg @fgypas @mkatsanto @ninsch3000 @dominikburri @meric466
@uniqueg and @mkatsanto have access to the responses in the form. People might sign up there for
@uniqueg wrote:
Let's take care of this after next release/submission.
Describe the bug
gitignore is outdated
To Reproduce
Steps to reproduce the behavior:
- run the workflow with your own samples and config.
Expected behavior
All results and config files from own samples should be ignored.
Possible fix:
The lines with run in .gitignore (lines 329 & 330) should be replaced by config.
Currently there are still some links to gitlab, e.g. on how to clone the repository.
These should be replaced by the new links.
The genome sequence and annotation file for the species from which the RNA-seq data comes are needed as inputs to zarp.
We use the reference files from ENSEMBL.
The user supplies the species name (e.g. homo_sapiens) or taxon ID (e.g. 9606) and the chromosome sequences that compose the karyotype (no haplotypes) are downloaded, along with the corresponding annotation.
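For homo_sapiens, the files this step would fetch look roughly like the following (the release number is illustrative; the primary assembly is the ENSEMBL file that excludes haplotypes):

    wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    wget http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.gtf.gz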
Download run files from SRA.
Describe the solution you'd like
This download should NOT happen within core ZARP. Instead, a separate (small) snakemake pipeline will for now be created within the workflow/rules/
directory, which can be called by ZARP-cli. At a later stage that pipeline might be migrated elsewhere.
Inputs:
Outputs:
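As a rough sketch, one rule of that small pipeline could wrap sra-tools as follows (the rule name, paths, and single-end output layout are hypothetical):

    rule download_sra_run:
        output:
            "sra_downloads/{run}/{run}.fastq.gz"
        shell:
            """
            prefetch {wildcards.run} --output-directory sra_downloads
            fasterq-dump sra_downloads/{wildcards.run} --outdir sra_downloads/{wildcards.run}
            gzip sra_downloads/{wildcards.run}/*.fastq
            """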
Is your feature request related to a problem? Please describe.
For snakemake workflow catalogue visibility
Describe the solution you'd like
Move the section "running the workflow on your own samples" to config/README.md.
Describe the bug
Running the htsinfer workflow fails on some SRA samples.
It seems to be a problem when read_lengths is empty, so the max function has nothing to operate on.
Tried with this sample table and SRA download workflow (which downloads the fq.gz successfully).
sample fq1 fq2
SRR18223821
SRR18125741
SRR19162082
SRR19770954
SRR701798
Screenshots
Parts of the error within htsinfer_to_tsv.stderr.log
tparams["index_size"] = max([int(i) for i in read_lengths if i is not None])
ValueError: max() arg is an empty sequence
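A minimal guard around the failing line could look like this (a sketch only; DEFAULT_INDEX_SIZE is a hypothetical fallback that would need to be agreed on):

    lengths = [int(i) for i in read_lengths if i is not None]
    tparams["index_size"] = max(lengths) if lengths else DEFAULT_INDEX_SIZE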
The user should have the ability to choose not to supply a custom logo and URL for the MultiQC report when running ZARP. The outcome should be that no logo and URL are provided in the report.
To achieve this, the user should be able to specify empty strings for these parameters. To further improve user experience, the report_logo
and report_url
parameters could/should be made entirely optional in config.yaml
(then taking on empty strings as defaults).
Another question in this context is whether, in the current implementation, a user is allowed to specify an HTTP(S) URL pointing to a logo, rather than the logo itself. It would just be nice to test this out while implementing the above, so that we can document that behavior more accurately.
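The optional-parameter part could be as simple as the following sketch, assuming the values are read from the Snakemake config object (key names as above, empty strings as the suggested defaults):

    report_logo = config.get("report_logo", "")  # empty string -> no logo in the report
    report_url = config.get("report_url", "")    # empty string -> no URL in the report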
Is your feature request related to a problem? Please describe.
Currently, fastqc is applied to the raw fastq files. However, it is still relevant for the user to see what the quality of the reads is, or whether there are still overrepresented sequences after trimming.
Describe the solution you'd like
Introduce a second rule, fastqc_trimmed, where the trimmed fastq file is used as input (see the sketch below).
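A sketch of the proposed rule (paths and the container tag are illustrative, not taken from the actual workflow):

    rule fastqc_trimmed:
        input:
            reads="results/{sample}/{sample}.trimmed.fq.gz"
        output:
            outdir=directory("results/{sample}/fastqc_trimmed")
        container:
            "docker://quay.io/biocontainers/fastqc:0.11.9--0"
        shell:
            "mkdir -p {output.outdir}; fastqc --outdir {output.outdir} {input.reads}"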
Set up changelog (in CHANGELOG.md
in root directory), describing v1.0.0
features in bullet point format
Is your feature request related to a problem? Please describe.
The two samples specified in the current test file tests/input_files/sra_samples.tsv
are not suitable for testing zarp: the first one is Nanopore sequencing (zarp only supports Illumina) and the second is metagenomic, which makes it impossible to test the organism inference of the htsinfer pipeline, which is supposed to run after the sra pipeline.
Describe the solution you'd like
Put different SRA numbers in the sra_samples.tsv
file, e.g. some from mouse.
Describe alternatives you've considered
SRAs from E.coli, or C.elegans, or D.melanogaster, or S.cerevisiae
Currently the install/environment.yml
and install/environment.root.yml
contain different versions of snakemake (6.5.5 vs. 5.19.2). Is this intended?
Is your feature request related to a problem? Please describe.
Currently, metadata for each sample has to be determined and put in a sample table, which is tedious.
Describe the solution you'd like
We can call HTSinfer and have the tool sniff the missing metadata. As HTSinfer does not estimate fragment size distributions, for single-ended libraries we should set a "reasonable" default here, if the user does not provide them.
Note that this is a meta-issue, as it will require several changes, at least the following (prepare one issue each for the following):
Additional context
Originally, we were thinking of integrating HTSinfer into ZARP-cli. However, as it is potentially a long-running tool, it is better to run it as part of the pipeline in its own environment.
Since we use snakemake version > 6.0.0, the preferred frontend package manager for building internal, per-rule envs is now mamba, not conda.
I have tried to tackle this quickly but, strangely enough, when I specify mamba, our cutadapt output files do not match the expected checksums... (see: #16 (comment)). Therefore, for now, our profiles specify --conda-frontend conda.
It might be nice to replace conda with mamba, eventually...
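The eventual switch itself should be trivial, since it is just the frontend flag in the profiles (the flag exists in snakemake >= 6):

    snakemake --use-conda --conda-frontend mamba ...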
Is your feature request related to a problem? Please describe.
Each rule has a static requirement for the time needed to run. However, real running times scale with the input file sizes. On top of that, in the existing implementation, specific queues of the sciCORE slurm system are specified in the slurm profile, which might cause incompatibilities with other slurm systems.
Describe the solution you'd like
Based on some existing runs, I can fit the size-runtime relationship by least squares and use the coefficients for scaling. I am not sure this will always work (multiple inputs, multiprocessing optimisation); still, as we gather more information on successful runs, we can improve the accuracy of these estimates.
It turns out there is no access to the attempt parameter in the params field. The plan is to first calculate the time estimate in the resources and then interface that to the params field, so that we can feed it to the cluster.json (see the sketch after this issue).
Implementation tasks:
Describe alternatives you've considered
If all of the above does not work, focus on eliminating the queue specification from the cluster.json.
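A sketch of the resources-to-params interfacing described above (coefficients, names, and paths are placeholders; params functions can take resources as an argument, by which point attempt has already been applied):

    rule example:
        input:
            "data/{sample}.fq.gz"
        output:
            "results/{sample}.out"
        resources:
            # least-squares fit: time ~ a * input_size + b, rescaled per attempt
            time_min=lambda wc, input, attempt: int(10 + 0.05 * input.size / 1e6) * attempt
        params:
            cluster_time=lambda wc, resources: str(resources.time_min)
        shell:
            "touch {output}"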
Describe the bug
When testing the workflow with bash tests/test_integration_workflow/test.slurm.sh, jinja2 is not found.
To Reproduce
Run the command above in the zarp conda env.
Expected behavior
jinja2 should be available.
Is your feature request related to a problem? Please describe.
There has been a new major Snakemake release since we last pushed this project: 7.0.0+.
Should we update, @mkatsanto?
Describe the solution you'd like
If yes - it would require changing two lines in the environment recipes, should be easy...
It would be nice to include templates for github: issues (feature request, bug report) as well as pull requests.
An issue we discovered with @mkatsanto today while debugging:
We are not sure the notemp flag should be set in the snakemake profiles. However, if we get rid of it, then during pipeline execution multiqc fails - it does not find files which it expects... They are probably removed due to the temp() mechanism.
We need to take a closer look at what needs to be specified as required input for that rule...
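For reference, the interplay looks like this (schematic, with made-up paths): a temp() output is deleted as soon as all rules declaring it as input have run, so any file MultiQC merely discovers on disk, without it being declared, may already be gone:

    rule upstream:
        output:
            temp("results/{sample}/stats.txt")
        shell:
            "touch {output}"

    rule multiqc:
        input:
            # declaring the files keeps them alive until MultiQC has run
            expand("results/{sample}/stats.txt", sample=["a", "b"])
        output:
            "results/multiqc_report.html"
        shell:
            "multiqc results -o results -n multiqc_report.html"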
We have some ongoing work/discussion in a few issues that we could come back to in the future.
It would be very nice to have them transferred here.
Should we transfer them manually or is there an automatic way?
When consistently using semantic/conventional commits, it is possible to automatically create releases, for example, whenever we merge into the main
branch.
This should cover the following:
Possibly we could also configure this to publish updates to Zenodo and other outlets (e.g., WorkflowHub, Dockstore) via web hooks, but I'm not sure.
Is your feature request related to a problem? Please describe.
Some MultiQC plots (e.g., from FastQC, tin-score-calculation, STAR) report absolute numbers of reads rather than fractions of reads. While it is useful to know the sequencing depths of the different samples/libraries, it is perhaps not very informative/helpful to represent these differences in multiple plots, as this makes it hard to compare key metrics across samples, e.g., the fraction of (uniquely) mapped reads. This is especially problematic when these plots are exported and put in publications as static images.
Describe the solution you'd like
Have one plot that reports sequencing depths/library sizes for each sample as bar plots. Have all other plots report fractions of reads instead.
Additional context
It may not be possible to configure MultiQC in a way that read fractions are visualized instead of read numbers. In that case, it is probably too much of a hassle to implement the desired solution.
The TIN score script takes too much memory; the sciCORE team needed to kill the cluster jobs...
It seems that there is still some problem with the parallelization.
REPO: https://github.com/zavolanlab/tin-score-calculation
Once this is fixed we should bump up the version and create a new release in the TIN score repo + push to pypi (automatic in the CI) + upload to biocontainers. Then we should refer to that new version in our zarp main Snakefile here.
We have just updated samtools to version 1.9 from bioconda; the previous one raised errors, see here.
We should do the same for singularity runs too, to use the same versions of tools consistently.
Created on GitLab by @mkatsanto, initially assigned to @meric466 :
Create a real sample run along with the reports and visualisations.
Input data, configs, intermediate data & output should be made available on Zenodo and linked to the documentation
Test the workflow on more organisms, one of which is E. coli.
This is more of a note, if others experience similar behaviour, so that they can rule out that this is the problem.
Describe the bug
Workflow fails on rule star_rpm
with FileNotFoundError.
To Reproduce
Steps to reproduce the behavior:
- run the workflow until rule star_rpm fails
- remove shadow: "full" from the rule, restart.
Expected behavior
The shadow directive takes paths relative to the working directory. Maybe a more meaningful error message would help, or a check of the file locations could be done prior to the start.
Additional context
The shadow directive reduces the disk space footprint of the workflow and makes it possible to delete whole temp directories.
But it requires that all files are in subdirectories of the working directory.
Maybe removing it altogether is less hassle?
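Schematic of the constraint (not the actual rule): with shadow: "full", the rule runs in a temporary shadow directory and results are moved back afterwards, which only works when outputs are relative paths under the working directory; an absolute results path outside it produces exactly this kind of FileNotFoundError:

    rule star_rpm_sketch:
        output:
            "results/{sample}/coverage.bg"  # relative path: fine with shadow
        shadow:
            "full"
        shell:
            "touch {output}"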