avsastry / modulome-workflow
Workflow to download, process, and explore microbial RNA-seq data from NCBI SRA
License: MIT License
I had a couple of runs recently that jumped straight to the last process (assemble_tmp) or the one before it (multiqc) without running the rest of the required steps. As far as I can tell, this is caused by the .ifEmpty([]) statement in the input lines. It looks like it creates a bypass around generating the output of the previous processes (maybe when there are too many cores/samples). What is the reason for having the ifEmpty step?
We should try to minimize notebook usage as much as possible. Notebooks can lead to issues if the cells are not run in the exact order, and they require more manual work than scripts. We can convert the QC/QA notebooks to scripts that output everything the user needs to know to run QC/QA (e.g. cluster figure, Pearson correlation between replicates, etc.), and the user can change the input parameters of the script to change the QC thresholds.
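As a rough illustration of the script-based approach, the sketch below exposes a QC threshold as a command-line option instead of a notebook cell. The option name --min-replicate-corr, its default of 0.95, and the helper passes_replicate_qc are assumptions made for this example; they are not taken from the existing notebooks.

```python
import argparse


def build_parser():
    """Build a command-line parser exposing QC thresholds as options."""
    parser = argparse.ArgumentParser(description="Sketch of a script-based QC step")
    # Hypothetical option name and default; not taken from the existing notebooks.
    parser.add_argument(
        "--min-replicate-corr",
        type=float,
        default=0.95,
        help="minimum Pearson correlation required between replicates",
    )
    return parser


def passes_replicate_qc(replicate_corr, min_corr):
    """Return True when a replicate correlation meets the threshold."""
    return replicate_corr >= min_corr


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--min-replicate-corr", "0.9"])
print(passes_replicate_qc(0.93, args.min_replicate_corr))
```

With this layout, changing a threshold means changing one argument on the command line rather than editing and re-running cells in order.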
I am having an issue, both on a local MacBook and on a virtual Linux machine on Azure, where the OptICA step does not finish. It seems to hang indefinitely. For instance, I ran this on a dataset of 164 samples:
bash ./run_ica.sh -n 16 -o ../data/interim/ -v ../data/processed_data/log_tpm_norm.csv
Here is the output, where it hangs:
Computing dimension 160 of 164
##################################
Setting up...
0.25 seconds elapsed
Running ICA...
Completed run 1 of 7 on Processor 0
2.10 minutes elapsed
Completed run 2 of 7 on Processor 0
2.08 minutes elapsed
Completed run 3 of 7 on Processor 0
1.85 minutes elapsed
Completed run 4 of 7 on Processor 0
1.58 minutes elapsed
Completed run 5 of 7 on Processor 0
2.07 minutes elapsed
Completed run 6 of 7 on Processor 0
52.93 seconds elapsed
Completed run 7 of 7 on Processor 0
1.60 minutes elapsed
All ICA runs complete!
12.33 minutes elapsed
So I get the A and M files for dimension 150 in this case, but not for 160. I get the same issue doing this as well, where dimension 152 does not complete:
bash ./run_ica.sh -n 16 -m 152 -s 2 -o ../data/interim/ -v ../data/processed_data/log_tpm_norm.csv
Thanks for any help!
/Mathias
Details of machine:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
Linux avm-sdt-nilmat-ica 5.15.0-1054-azure #62~20.04.1-Ubuntu SMP Wed Jan 17 12:22:56 UTC 2024 x86_64 GNU/Linux
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Memory:
total used free shared buff/cache available
Mem: 128756 3437 122362 6 2956 124229
Can someone please help with this error?
N E X T F L O W ~ version 23.10.1
Launching main.nf
[scruffy_wright] DSL2 - revision: 53840c131a
ERROR ~ No signature of method: groovyx.gpars.dataflow.DataflowBroadcast.into() is applicable for argument types: (Script_e1bcc410eabc93ca$_runScript_closure1) values: [Script_e1bcc410eabc93ca$_runScript_closure1@50a1af86]
Possible solutions: find(), any(), bind(java.lang.Object), with(groovy.lang.Closure), print(java.io.PrintWriter), print(java.lang.Object)
-- Check script 'main.nf' at line: 65 or see '.nextflow.log' file for more details
It seems like the NCBI APIs are incompatible with the current version of sratools (I think), so the latter must be updated in the fasterq-dump container.
Replacing the version number worked for me:
FROM ubuntu:18.04
# Metadata
MAINTAINER Anand Sastry <[email protected]>
# Set noninteractive mode
ENV DEBIAN_FRONTEND noninteractive
# Install pigz and sra-toolbox
USER root
RUN apt-get update && apt-get install -y procps pigz wget libxml-libxml-perl
RUN wget -q http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.6/sratoolkit.3.0.6-ubuntu64.tar.gz -O /tmp/sratoolkit.tar.gz && tar zxf /tmp/sratoolkit.tar.gz -C /opt/ && rm /tmp/sratoolkit.tar.gz
RUN mkdir //ncbi && mkdir //ncbi/public && mkdir //ncbi/public/sra && mkdir //ncbi/public/refseq && chmod -R 777 //ncbi
ENV PATH="/opt/sratoolkit.3.0.6-ubuntu64/bin/:${PATH}"
Your README states that all "Pre-requisite software" required for using the workflows without Docker is listed under each respective workflow.
However, no such list is given anywhere; each workflow simply refers to your Docker images.
Could you please provide a list of dependencies for installing this WITHOUT using Docker?
I am facing an error during the second step of the pipeline. I suspect it is related to prefetch and fasterq-dump while fetching data from SRA. Note that I am running the pipeline with Nextflow version 22.10.8, because I get errors with the latest version.
It would be great if someone could help with the following error.
sudo ./nextflow run main.nf -profile local --organism mycobacterium_abscessus --metadata mab.tsv --sequence_dir sequence_dir/ --outdir results
N E X T F L O W ~ version 22.10.8
Launching main.nf
[reverent_bartik] DSL1 - revision: ef90b5fca3
executor > local (13)
[27/ad4b19] process > bowtie_build [100%] 1 of 1 ✔
[8d/7bec19] process > gff2bed [100%] 1 of 1 ✔
[56/467f86] process > download_fastq (14) [ 1%] 5 of 306, failed: 5, retries: 5
[- ] process > stage_fastq_single -
executor > local (14)
[27/ad4b19] process > bowtie_build [100%] 1 of 1 ✔
[8d/7bec19] process > gff2bed [100%] 1 of 1 ✔
[41/5bd9cd] process > download_fastq (20) [ 1%] 6 of 307, failed: 6, retries: 6
[- ] process > stage_fastq_single -
Use https://hub.docker.com/r/sbrg/pymodulon as a base container, add R/R libraries, and any other required software.
The pie chart in expression_QC_part1 showing the final pass/fail split is flipped. It can be fixed easily by changing the list passed to the reindex function:
_, _, pcts = plt.pie(pass_qc.value_counts().reindex([False, True]),
                     labels=['Failed', 'Passed'],
                     colors=['tab:red', 'tab:blue'],
                     autopct='%.0f%%', textprops={'size': 16});
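To see why the order of the list passed to reindex matters, here is a minimal sketch using made-up QC flags (the pass_qc values below are hypothetical, not data from the notebook). value_counts orders its result by frequency, so the counts only line up with the 'Failed', 'Passed' labels if they are explicitly reindexed to False first, then True.

```python
import pandas as pd

# Hypothetical QC flags: True = sample passed, False = sample failed.
pass_qc = pd.Series([True, True, True, False], name="passed_qc")

labels = ["Failed", "Passed"]

# Reindex so the counts follow the label order: False (failed) first, then True (passed).
counts = pass_qc.value_counts().reindex([False, True], fill_value=0)

# Pair each label with the count it would be drawn with in the pie chart.
paired = dict(zip(labels, counts.tolist()))
print(paired)
```

If the list were [True, False] instead, the same labels would be attached to the opposite counts, which is the flipped chart described above.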
I can't push to this repo, but get_dimensions.py is missing the python3 shebang at the top, so it can't be run directly from the command line.
Use a JSON file instead of CSV/TSV as the input metadata file.
I just talked to some of the people at DTU who developed antiSMASH, and they mentioned that using CSV files can lead to unexpected outcomes or errors that may be harder to catch once you start scaling your pipeline. Some of our most common errors arise from using this format. We should consider switching to JSON instead. This will require lots of changes.
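As a starting point for such a switch, an existing TSV metadata table could be converted to JSON with the standard library alone. The sketch below is only an assumption about how that might look: the function name tsv_to_json and the column names and accessions in the example table are illustrative and not taken from the workflow.

```python
import csv
import io
import json


def tsv_to_json(tsv_text):
    """Convert tab-separated metadata text into a JSON array of row objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return json.dumps(list(reader), indent=2)


# Hypothetical two-sample metadata table; column names are illustrative only.
example_tsv = (
    "Experiment\tLibraryLayout\tRun\n"
    "SRX000001\tPAIRED\tSRR000001\n"
    "SRX000002\tSINGLE\tSRR000002\n"
)

# Round-trip through JSON to show each row becomes one object keyed by column name.
records = json.loads(tsv_to_json(example_tsv))
print(records[0]["Run"])
```

Each sample becomes one JSON object, so a missing or misnamed field shows up as a missing key rather than a silently shifted column.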
nextflow run main.nf -profile local --organism bacillus_subtilis --metadata ../test/test_metadata.tsv --sequence_dir ../test/sequence_files/ --outdir ../test/nf_results/
Error executing process > 'multiqc (1)'
Caused by:
Process multiqc (1) terminated with an error exit status (125)
Command executed:
multiqc -f -c multiqc_config.yaml .
assemble_qc_stats.py multiqc_data
Command exit status:
125
Command output:
(empty)
Command error:
Unable to find image 'avsastry/multiqc-rockhopper:1.0' locally
docker: Error response from daemon: pull access denied for avsastry/multiqc-rockhopper, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
Any idea regarding the source of this error?