avsastry / modulome-workflow

Workflow to download, process, and explore microbial RNA-seq data from NCBI SRA

License: MIT License

Python 0.15% Shell 0.05% Nextflow 0.10% Jupyter Notebook 26.54% HTML 73.06% Dockerfile 0.01% PostScript 0.10%

rna-seq ncbi-sra nextflow independent-component-analysis

modulome-workflow's Issues

nextflow jumps to last process

I had a couple of runs recently that jumped to the last process (assemble_tmp) or the one before it (multiqc) without running the rest of the required processes. As far as I can tell, it's because of the .ifEmpty([]) statement in the input lines. It looks like these statements create a bypass around generating the output from previous processes (maybe when there are too many cores/samples). What is the reason for having the ifEmpty step?

Convert QC/QA notebooks to scripts

We should try to minimize notebook usage as much as possible. Notebooks can lead to issues if the cells are not run in the exact order, and they require more manual work than scripts. We can convert the QC/QA notebooks to scripts that output everything the user needs to know to run QC/QA (e.g. cluster figure, Pearson correlation between replicates, etc.), and the user can change the input parameters of the script to change QC thresholds.
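A minimal sketch of what such a script could look like for the replicate-correlation check, using only the standard library. The file layout (a genes-by-samples CSV plus a two-column sample/group CSV), the function names, the `--min-r` flag, and its 0.95 default are all assumptions made for illustration, not taken from the existing notebooks.

```python
import argparse
import csv
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def flag_replicates(expression, groups, min_r):
    """Return (sample_a, sample_b, r, passed) for every pair of replicates.

    expression: dict mapping sample ID -> list of expression values
    groups:     dict mapping sample ID -> replicate-group label
    min_r:      pairs with a correlation below this threshold fail QC
    """
    results = []
    for a, b in combinations(sorted(expression), 2):
        if groups.get(a) is None or groups.get(a) != groups.get(b):
            continue  # not replicates of each other
        r = pearson(expression[a], expression[b])
        results.append((a, b, r, r >= min_r))
    return results


def read_expression(path):
    """Read a genes-by-samples CSV (first column = gene ID) into per-sample lists."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        samples = next(reader)[1:]
        columns = {sample: [] for sample in samples}
        for row in reader:
            for sample, value in zip(samples, row[1:]):
                columns[sample].append(float(value))
    return columns


def main(argv=None):
    # Hypothetical command-line interface; the QC threshold is an input parameter.
    parser = argparse.ArgumentParser(description="Replicate-correlation QC (sketch)")
    parser.add_argument("expression_csv")
    parser.add_argument("groups_csv", help="two columns: sample ID, replicate group")
    parser.add_argument("--min-r", type=float, default=0.95)
    args = parser.parse_args(argv)

    expression = read_expression(args.expression_csv)
    with open(args.groups_csv, newline="") as handle:
        groups = {row[0]: row[1] for row in csv.reader(handle) if row}
    for a, b, r, passed in flag_replicates(expression, groups, args.min_r):
        print(f"{a}\t{b}\t{r:.3f}\t{'PASS' if passed else 'FAIL'}")
```

Saved to a file with a final `main()` call, this could be run with a different `--min-r` value to change the threshold without editing any code, which is the main advantage over re-running notebook cells by hand.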

Unfinished OptICA step

I am having an issue, both on a local MacBook and on a virtual Linux machine on Azure, where the OptICA step does not finish. It seems to hang indefinitely. For instance, I ran this on a dataset of 164 samples:
bash ./run_ica.sh -n 16 -o ../data/interim/ -v ../data/processed_data/log_tpm_norm.csv

Here is the output, where it hangs:

Computing dimension 160 of 164

##################################

Setting up...
0.25 seconds elapsed

Running ICA...
Completed run 1 of 7 on Processor 0
2.10 minutes elapsed
Completed run 2 of 7 on Processor 0
2.08 minutes elapsed
Completed run 3 of 7 on Processor 0
1.85 minutes elapsed
Completed run 4 of 7 on Processor 0
1.58 minutes elapsed
Completed run 5 of 7 on Processor 0
2.07 minutes elapsed
Completed run 6 of 7 on Processor 0
52.93 seconds elapsed
Completed run 7 of 7 on Processor 0
1.60 minutes elapsed

All ICA runs complete!
12.33 minutes elapsed

So I get the A and M files for dimension 150 in this case, but not for 160. I get the same issue doing this as well, where dimension 152 does not complete:
bash ./run_ica.sh -n 16 -m 152 -s 2 -o ../data/interim/ -v ../data/processed_data/log_tpm_norm.csv

Thanks for any help!
/Mathias

Details of machine:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Linux avm-sdt-nilmat-ica 5.15.0-1054-azure #62~20.04.1-Ubuntu SMP Wed Jan 17 12:22:56 UTC 2024 x86_64 GNU/Linux

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GH

Memory:
              total        used        free      shared  buff/cache   available
Mem:         128756        3437      122362           6        2956      124229

error during step 2 (processing of raw data)

Can someone please help with this error?

N E X T F L O W ~ version 23.10.1
Launching main.nf [scruffy_wright] DSL2 - revision: 53840c131a
ERROR ~ No signature of method: groovyx.gpars.dataflow.DataflowBroadcast.into() is applicable for argument types: (Script_e1bcc410eabc93ca$_runScript_closure1) values: [Script_e1bcc410eabc93ca$_runScript_closure1@50a1af86]
Possible solutions: find(), any(), bind(java.lang.Object), with(groovy.lang.Closure), print(java.io.PrintWriter), print(java.lang.Object)

-- Check script 'main.nf' at line: 65 or see '.nextflow.log' file for more details

`fasterq-dump` image needs an update

It seems like the current version of sratools is incompatible with the NCBI APIs (I think), so sratools must be updated in the fasterq-dump container.

Replacing the version number worked for me:

FROM ubuntu:18.04

# Metadata
MAINTAINER Anand Sastry <[email protected]>

# Set noninteractive mode
ENV DEBIAN_FRONTEND noninteractive

# Install pigz and sra-toolbox
USER root
RUN apt-get update && apt-get install -y procps pigz wget libxml-libxml-perl
RUN wget -q http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.6/sratoolkit.3.0.6-ubuntu64.tar.gz -O /tmp/sratoolkit.tar.gz && tar zxf /tmp/sratoolkit.tar.gz -C /opt/ && rm /tmp/sratoolkit.tar.gz

RUN mkdir //ncbi && mkdir //ncbi/public && mkdir //ncbi/public/sra && mkdir //ncbi/public/refseq && chmod -R 777 //ncbi

ENV PATH="/opt/sratoolkit.3.0.6-ubuntu64/bin/:${PATH}"

"Pre-requisite software" not listed anywhere

Your README states that all "Pre-requisite software" required for using the workflows without Docker is listed under each respective workflow.
However, no such list is given anywhere. Each workflow simply refers to your Docker containers.
Could you please provide a list of dependencies for installing this WITHOUT using docker?

Add installation instructions for optICA

Make requirements.txt file
Make conda yaml file
Create docker container
Add Pre-requisite software instructions

Issue with RNA-seq data processing (Step 2)

I am facing an error during the 2nd step of the pipeline. I suppose the error is related to prefetch and fasterq-dump while fetching data from SRA. I should also mention that I am using Nextflow version 22.10.8 to run the pipeline, because I get errors with the latest version.
It would be great if someone could help with the following error.

sudo ./nextflow run main.nf -profile local --organism mycobacterium_abscessus --metadata mab.tsv --sequence_dir sequence_dir/ --outdir results
N E X T F L O W ~ version 22.10.8
Launching main.nf [reverent_bartik] DSL1 - revision: ef90b5fca3
executor > local (13)
[27/ad4b19] process > bowtie_build [100%] 1 of 1 ✔
[8d/7bec19] process > gff2bed [100%] 1 of 1 ✔
[56/467f86] process > download_fastq (14) [ 1%] 5 of 306, failed: 5, retries: 5
[- ] process > stage_fastq_single -
executor > local (14)
[27/ad4b19] process > bowtie_build [100%] 1 of 1 ✔
[8d/7bec19] process > gff2bed [100%] 1 of 1 ✔
[41/5bd9cd] process > download_fastq (20) [ 1%] 6 of 307, failed: 6, retries: 6
[- ] process > stage_fastq_single -

Create Docker container for Jupyter notebooks

Use https://hub.docker.com/r/sbrg/pymodulon as a base container, add R/R libraries, and any other required software.

Make conda environment yml file
Make docker container
Test QC notebooks in container
Test QC notebooks in conda environment
Test characterization notebooks in container
Test characterization notebooks in conda environment
Add instructions to README.md

The pass/fail pie chart is flipped

The pie chart in expression_QC_part1 showing the final pass/fail split is flipped. It can be easily fixed by changing the list passed to the reindex function.

_,_,pcts = plt.pie(pass_qc.value_counts().reindex([False,True]),
        labels = ['Failed','Passed'],
        colors=['tab:red','tab:blue'],
        autopct='%.0f%%',textprops={'size':16});
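The flip happens because pandas' `value_counts()` sorts by frequency, while the labels are hard-coded in a fixed order, so the two only line up when the order is pinned explicitly. Below is a pandas-free sketch of the same idea; the helper name and the `pass_qc` values are made up for illustration.

```python
from collections import Counter


def counts_in_label_order(flags, order=(False, True)):
    """Count boolean QC flags and return the counts in a fixed, explicit order."""
    counts = Counter(flags)
    return [counts.get(key, 0) for key in order]


pass_qc = [True, True, True, False]  # made-up QC results: 3 passed, 1 failed
labels = ["Failed", "Passed"]

# Frequency order puts the most common value first, so here it would pair
# the 3 passing samples with the "Failed" label.
by_frequency = [n for _, n in Counter(pass_qc).most_common()]

# A fixed order always matches the hard-coded labels.
fixed = counts_in_label_order(pass_qc)
for label, n in zip(labels, fixed):
    print(f"{label}: {n}")
```

Pinning the order also covers the edge case where every sample passes: the missing category gets a count of 0 instead of silently shifting the labels.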

python shebang missing from get_dimensions.py

I can't push to this repo, but get_dimensions.py is missing the python3 shebang at the top, so it can't be run directly from the command line.
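For reference, the conventional first line is `#!/usr/bin/env python3`, and the file also needs its executable bit set. A small sketch of a helper that checks for a shebang and adds one if it is missing; the function names are hypothetical and nothing here is taken from get_dimensions.py itself.

```python
from pathlib import Path

SHEBANG = "#!/usr/bin/env python3"


def has_shebang(path):
    """Return True if the first line of the file starts with '#!'."""
    with open(path, encoding="utf-8") as handle:
        return handle.readline().startswith("#!")


def add_shebang(path, shebang=SHEBANG):
    """Prepend a shebang and set the executable bits if the file lacks one.

    Returns True if the file was modified, False if it already had a shebang.
    """
    path = Path(path)
    if has_shebang(path):
        return False
    original = path.read_text(encoding="utf-8")
    path.write_text(shebang + "\n" + original, encoding="utf-8")
    # Equivalent to `chmod +x`: add execute permission for user, group, and others.
    path.chmod(path.stat().st_mode | 0o111)
    return True
```

Calling `add_shebang("get_dimensions.py")` from the directory that contains the script would apply the fix in place.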

Use JSON file as input metadata

Use JSON file instead of csv/tsv as input metadata file.

I just talked to some of the people at DTU who developed antiSMASH, and they mentioned that using CSV files can lead to unexpected outcomes and errors that may be harder to catch when you start scaling your pipeline. Some of our most common errors arise from using this format. We should consider switching to JSON instead. This will require lots of changes:

Integrate json into Nextflow
Allow users to manually add things that are converted to json (maybe something like ALE sheets)
Add checks on data types
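As a rough sketch of the conversion and type-check steps, the snippet below reads a TSV, validates each column against a schema, and emits JSON. The column names (`sample_id`, `layout`, `read_length`), the allowed layout values, and the example accession are hypothetical and are not the workflow's actual metadata columns.

```python
import csv
import io
import json


def non_empty(value):
    """Reject missing or blank cells."""
    if value is None or not value.strip():
        raise ValueError("empty value")
    return value.strip()


def parse_layout(value):
    """Accept only the two layouts this sketch knows about."""
    if value not in {"SINGLE", "PAIRED"}:
        raise ValueError(f"expected SINGLE or PAIRED, got {value!r}")
    return value


# Hypothetical schema: column name -> converter that raises on bad input.
SCHEMA = {"sample_id": non_empty, "layout": parse_layout, "read_length": int}


def tsv_to_records(handle, schema=SCHEMA):
    """Convert TSV rows to typed dicts, collecting every error before failing."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    records, errors = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        record = {}
        for column, convert in schema.items():
            try:
                record[column] = convert(row[column])
            except (ValueError, TypeError) as exc:
                errors.append(f"line {line_no}, column {column}: {exc}")
        records.append(record)
    if errors:
        raise ValueError("\n".join(errors))
    return records


example = "sample_id\tlayout\tread_length\nSRX000001\tPAIRED\t150\n"
print(json.dumps(tsv_to_records(io.StringIO(example)), indent=2))
```

Reporting all problems at once, with line and column, is a deliberate choice: it is the kind of early, specific failure that is hard to get when a CSV is read loosely further down the pipeline.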

Error executing process > 'multiqc (1)'

nextflow run main.nf -profile local --organism bacillus_subtilis --metadata ../test/test_metadata.tsv --sequence_dir ../test/sequence_files/ --outdir ../test/nf_results/

Error executing process > 'multiqc (1)'

Caused by:
Process multiqc (1) terminated with an error exit status (125)

Command executed:

multiqc -f -c multiqc_config.yaml .
assemble_qc_stats.py multiqc_data

Command exit status:
125

Command output:
(empty)

Command error:
Unable to find image 'avsastry/multiqc-rockhopper:1.0' locally
docker: Error response from daemon: pull access denied for avsastry/multiqc-rockhopper, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.

Any idea regarding the source of this error?
