talusbio / nf-encyclopedia

A NextFlow pipeline for chromatogram library DIA proteomics workflows

License: Apache License 2.0

Makefile 0.09% Nextflow 45.05% Python 30.09% HTML 3.33% Groovy 3.29% Shell 0.98% Dockerfile 5.04% R 12.14%

nextflow bioinformatics data-independent-acquisition pipeline proteomics workflow

nf-encyclopedia's Issues

Tests fail locally

Hello there,

I noticed that tests fail when running them locally with this error:

Command error:
  Unable to find image 'talusbio/nf-encyclopedia:latest' locally
  docker: Error response from daemon: pull access denied for talusbio/nf-encyclopedia, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

because they use this line:

nf-encyclopedia/conf/test.config

Line 10 in 6b469ea

process.container = "nf-encyclopedia"

and therefore fail: even though ghcr.io/talusbio/nf-encyclopedia:latest had been pulled, its name does not match that tag.

Notably, that same line is what makes the tests pass in CI/CD, because the image is built right before the tests run using:

/usr/bin/docker buildx build --iidfile /tmp/docker-build-push-V0tuRX/iidfile --tag nf-encyclopedia:latest --load --metadata-file /tmp/docker-build-push-V0tuRX/metadata-file .

https://github.com/TalusBio/nf-encyclopedia/actions/runs/3192943440/jobs/5210985034#step:5:102

So maybe a complete solution would entail checking environment variables and, where appropriate, using a config file that contains only "nf-encyclopedia" as the container definition, and otherwise using "ghcr.io/talusbio/nf-encyclopedia:latest".
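
A minimal sketch of that decision logic, written in Python purely for illustration (in the pipeline itself the check would live in the Nextflow config, which is Groovy). The two image names come from this issue; the variable NF_ENCYCLOPEDIA_LOCAL_IMAGE is a hypothetical name, and relying on CI=true assumes a runner such as GitHub Actions that sets it.

```python
import os

LOCAL_IMAGE = "nf-encyclopedia"  # tag built right before the tests in CI
REMOTE_IMAGE = "ghcr.io/talusbio/nf-encyclopedia:latest"  # published image


def pick_container(env=None):
    """Return the container name the test config should use.

    NF_ENCYCLOPEDIA_LOCAL_IMAGE is a hypothetical opt-in variable for
    people who built the image on their own machine.
    """
    env = os.environ if env is None else env
    built_locally = (
        env.get("CI") == "true"  # assumption: the CI runner exports CI=true
        or env.get("NF_ENCYCLOPEDIA_LOCAL_IMAGE") == "1"
    )
    return LOCAL_IMAGE if built_locally else REMOTE_IMAGE


if __name__ == "__main__":
    print(pick_container())
```

With neither variable set, the sketch falls back to the published image, so a fresh checkout would not need a local build.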

LMK what you think!

TL;DR:
Tests fail if you didn't build the image on the same computer. We can either document it or fix it :P

[TODO] Add documentation to the README

Eventually we should update the README to include details about this workflow, including what the workflow does and how to test it locally.

[Feature] Add Slack notification support

Currently, we support notifications through email. However, it would be nice to support notifications through Slack, like nf-core pipelines such as quantms do.

Add Documentation

We need better documentation in addition to the README. After a conversation with @cia23, here are some things to add, with likely more to come:

Parameter documentation

This could take the form of a JSON file, like what is used by nf-core (example). The JSON file is easy to parse, but unfortunately it does not live in the config file.
Alternatively, this could take the form of special comments incorporated into the config file. I'm thinking something akin to doxygen for C++ or roxygen2 for R. The downside here would be that we'd need to write a parser for it to build documentation.
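
To give a feel for how small such a parser could be, here is a rough Python sketch. The `//'` comment marker is a hypothetical convention modeled on roxygen2's `#'`, not something Nextflow defines, and the sketch only handles simple `name = value` lines. The parameter names in the example (input, dlib, fasta) are the ones used on the command line elsewhere in these issues; their descriptions and defaults are made up for illustration.

```python
import re

DOC_PREFIX = "//'"  # hypothetical doc-comment marker, modeled on roxygen2
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*=\s*(.+?)\s*$")


def parse_param_docs(config_text):
    """Map each documented parameter to its default value and description."""
    docs = {}
    pending = []  # doc-comment lines seen since the last non-comment line
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(DOC_PREFIX):
            pending.append(stripped[len(DOC_PREFIX):].strip())
            continue
        match = ASSIGNMENT.match(line)
        if match and pending:
            name, default = match.groups()
            docs[name] = {"default": default, "doc": " ".join(pending)}
        pending = []  # any other line ends the current comment block
    return docs


EXAMPLE = """
params {
    //' CSV file listing the input MS data files.
    input = null
    //' Spectral library (DLIB)
    //' used for the searches.
    dlib = null
    fasta = null
}
"""

if __name__ == "__main__":
    for name, info in parse_param_docs(EXAMPLE).items():
        print(f"{name} (default: {info['default']}): {info['doc']}")
```

Undocumented parameters such as fasta in the example are skipped, which would make it easy to list what still needs a description.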

Running the pipeline

We should have an example command for folks to run.
We need to better document prerequisites like Docker.

Unable to access jarfile /code/encyclopedia.jar

I am running the pipeline in a WSL2 setup and getting the following error:

./nextflow run TalusBio/nf-encyclopedia -r latest --input input.csv --dlib proteins.dlib --fasta proteins.fasta
N E X T F L O W  ~  version 22.10.2
Launching `https://github.com/TalusBio/nf-encyclopedia` [dreamy_minsky] DSL2 - revision: 63c5d914a2 [latest]
executor >  local (2)
[-        ] process > CONVERT_TO_MZML:MSCONVERT                         -
[-        ] process > BUILD_CHROMATOGRAM_LIBRARY:ENCYCLOPEDIA_SEARCH    -
[-        ] process > BUILD_CHROMATOGRAM_LIBRARY:ENCYCLOPEDIA_AGGREGATE -
[fe/53691a] process > PERFORM_QUANT:ENCYCLOPEDIA_SEARCH (1)             [100%] 2 of 2, failed: 2 ✘
[-        ] process > PERFORM_QUANT:ENCYCLOPEDIA_AGGREGATE              -
[-        ] process > MSSTATS                                           -
Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'PERFORM_QUANT:ENCYCLOPEDIA_SEARCH (2)'

Caused by:
  Process `PERFORM_QUANT:ENCYCLOPEDIA_SEARCH (2)` terminated with an error exit status (1)

Command executed:

  gzip -df mz600-604.210712_ratio_22m_01_058.mzML.gz
  java -Djava.aws.headless=true -Xmx31G -jar /code/encyclopedia.jar \
      -i mz600-604.210712_ratio_22m_01_058.mzML \
      -f proteins.fasta \
      -l proteins.dlib \
      -percolatorVersion v3-01 -quantifyAcrossSamples true -scoringBreadthType window \
       \
  | tee mz600-604.210712_ratio_22m_01_058.mzML.local.log
  gzip mz600-604.210712_ratio_22m_01_058.mzML.features.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error: Unable to access jarfile /code/encyclopedia.jar

Work dir:
  /home/ash022/work/eb/9560810abcdc8abe743dda3e406ae4

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

The inputs are from the tests/data folder of the base repo: https://github.com/TalusBio/nf-encyclopedia/tree/main/tests/data

cat input.csv
file, chrlib
mz600-604.210712_ratio_22m_01_046.mzML.gz, false
mz600-604.210712_ratio_22m_01_058.mzML.gz, false

Any ideas how to proceed?

[msstats] Gracefully handle contrasts with invalid R column names

Currently, conditions that contain spaces or begin with numbers will cause the pipeline to fail. Instead, we should be able to gracefully handle these cases.

I think setting check.names = FALSE in the read.csv() call (the argument is passed through to read.table()) should fix this:

nf-encyclopedia/bin/msstats.R

Line 56 in 6b469ea

annot_df <- read.csv(annot_csv, header = TRUE, stringsAsFactors = FALSE) %>%
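
As a complement, the affected conditions could be detected up front so the user gets a clear message. The sketch below is Python for illustration only (the script in question is R) and uses an ASCII-only approximation of R's rule for syntactic names: letters, digits, `.` and `_`, starting with a letter or with a `.` that is not followed by a digit. It ignores R's reserved words, and every condition name in the example other than DMSO is hypothetical.

```python
import re

# ASCII-only approximation of R's syntactic-name rule (reserved words ignored).
R_NAME = re.compile(r"(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*")


def invalid_r_names(conditions):
    """Return the condition names that R would not accept as-is."""
    return [name for name in conditions if not R_NAME.fullmatch(name)]


if __name__ == "__main__":
    conditions = ["DMSO", "10uM drug", "drug A", "drug_B"]  # hypothetical
    print(invalid_r_names(conditions))
```

A check like this could either warn about the names or feed a renaming step, depending on which behavior we prefer.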

[Bug] MSstats drops some proteins

We've observed that MSstats is dropping some proteins. At first, we suspected this was due to too much missing data for a peptide. Now, after @cia23's discussions with the Drug Disco team, it seems to be a bug.

Currently, we perform an inner join on the proteins from the EncyclopeDIA proteins.txt and peptides.txt output, under the assumption that the Protein columns should always yield a 1-to-many match and all peptides with a protein accepted at 1% FDR would be accounted for. Unfortunately, this assumption is wrong: the Protein column in peptides.txt does not contain the protein groups in the Protein column from proteins.txt, leading to missing proteins.

The fix here is to use the PeptideSequences column in proteins.txt to map proteins to peptides, which will be a headache.
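
A rough sketch of that mapping, in Python with plain dictionaries standing in for rows of the two tables. The Protein and PeptideSequences column names come from the description above; the "Peptide" column name in peptides.txt and the ";" separator inside PeptideSequences are assumptions that should be checked against real EncyclopeDIA output, and the example rows are invented.

```python
def map_proteins_to_peptides(protein_rows, peptide_rows, sep=";"):
    """Attach each accepted protein group to its peptides' rows.

    protein_rows: dicts with "Protein" and "PeptideSequences" (proteins.txt).
    peptide_rows: dicts with "Peptide" plus quant columns (peptides.txt).
    The "Peptide" key and the separator are assumptions, not confirmed.
    """
    by_sequence = {row["Peptide"]: row for row in peptide_rows}
    joined = []
    for prot in protein_rows:
        for seq in prot["PeptideSequences"].split(sep):
            seq = seq.strip()
            if seq in by_sequence:
                # Overwrite the peptide-level Protein with the protein group.
                joined.append({**by_sequence[seq], "Protein": prot["Protein"]})
    return joined


if __name__ == "__main__":
    proteins = [{"Protein": "P1;P2", "PeptideSequences": "AAAK;CCCR"}]
    peptides = [
        {"Peptide": "AAAK", "Protein": "P1", "sample_1": 10.0},
        {"Peptide": "CCCR", "Protein": "P2", "sample_1": 20.0},
        {"Peptide": "DDDK", "Protein": "P9", "sample_1": 30.0},
    ]
    for row in map_proteins_to_peptides(proteins, peptides):
        print(row)
```

Because the join key is the peptide sequence rather than the Protein string, a protein group such as "P1;P2" keeps all of its peptides even though no row in peptides.txt carries that exact group label.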

Find the difference between the EncyclopeDIA CLI and GUI

We've noticed significant differences between results obtained with the EncyclopeDIA GUI and CLI. Unfortunately, talking with Brian and Seth hasn't revealed anything that could be the cause.

Here's how we should find the problem:

Create a small mzML file to iterate with. I think the best way to do this is to take a normal file and filter for 1-2 DIA windows using msconvert.
Verify that we can reproduce our problems with this smaller file.
Add a print statement to the EncyclopeDIA codebase to see exactly what parameters are being used by the CLI. @ricomnl has already made some progress on this. @ricomnl - do you know if we can add it as some kind of "debug level" logging in the official EncyclopeDIA version?

I hope that this small file is one that we could incorporate for unit tests as well.

[maintenance] Update encyclopedia to v2

LMK if you want me to take a look at this!

[TODO] Add process capturing QC metrics

We want to add a process to the pipeline that captures a set of QC metrics and then writes these to a DB.

Current idea:

Change unique_peptides_proteins → qc_task
Within the qc_task, read all the wide elib, the quant_peptides, and the quant_proteins
Calculate unique peptides and proteins, %cv across dmso samples, gopher values
For now drop it in S3 → show it in the data standard-report
Then store it in noSQL in scispot
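
As a starting point for the metric calculations in the list above, here is a minimal Python sketch covering the unique counts and the %CV across DMSO samples. The row layout (one dictionary per quantified peptide with "Peptide", "Protein", and one intensity per sample) and the sample names are assumptions for illustration; the Gopher values and the S3/DB steps are left out.

```python
from statistics import mean, median, stdev


def percent_cv(values):
    """Coefficient of variation in percent, or None if it is undefined."""
    values = [v for v in values if v is not None]
    if len(values) < 2:
        return None  # the sample standard deviation needs two points
    avg = mean(values)
    if avg == 0:
        return None
    return 100.0 * stdev(values) / avg


def qc_metrics(quant_rows, dmso_samples):
    """Summarize unique peptides/proteins and the median %CV over DMSO runs."""
    cvs = [percent_cv([row.get(s) for s in dmso_samples]) for row in quant_rows]
    cvs = [cv for cv in cvs if cv is not None]
    return {
        "unique_peptides": len({row["Peptide"] for row in quant_rows}),
        "unique_proteins": len({row["Protein"] for row in quant_rows}),
        "median_dmso_cv": median(cvs) if cvs else None,
    }


if __name__ == "__main__":
    rows = [  # hypothetical quant table
        {"Peptide": "AAAK", "Protein": "P1", "dmso_1": 90.0, "dmso_2": 110.0},
        {"Peptide": "CCCR", "Protein": "P1", "dmso_1": 50.0, "dmso_2": 50.0},
    ]
    print(qc_metrics(rows, ["dmso_1", "dmso_2"]))
```

Returning a flat dictionary keeps the result easy to serialize as JSON, which should suit both the S3 drop and a later NoSQL store.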

[msstats] Remove row names from output

The msstats.proteins.txt output contains the row names, which causes problems for Excel users. We should just remove them.