mlst-nf's Introduction

mlst-nf

A Nextflow pipeline for running mlst on a set of assemblies.

flowchart TD
  assembly --> quast(quast)
  quast --> assembly_qc
  assembly --> mlst(mlst)
  mlst --> mlst.json
  mlst --> parse_alleles(parse_alleles)
  parse_alleles --> alleles.csv
  parse_alleles --> sequence_type.csv

Usage

nextflow run BCCDC-PHL/mlst-nf \
  --assembly_input </path/to/assemblies> \
  --outdir </path/to/outdir>

The pipeline also supports a 'samplesheet input' mode. Pass a samplesheet.csv file with the headers ID, ASSEMBLY:

nextflow run BCCDC-PHL/mlst-nf \
  --samplesheet_input </path/to/samplesheet.csv> \
  --outdir </path/to/outdir>
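
A samplesheet with the two required headers can be generated from a directory of assemblies. The following is a minimal sketch, not part of the pipeline: the `write_samplesheet` helper, the list of accepted file extensions, and the choice to use the filename stem as the sample ID are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_samplesheet(assembly_dir, samplesheet_path, extensions=(".fa", ".fasta", ".fna")):
    """Write a CSV with ID and ASSEMBLY columns, one row per assembly file found."""
    rows = []
    for path in sorted(Path(assembly_dir).iterdir()):
        if path.is_file() and path.suffix in extensions:
            # Assumption: the sample ID is the filename without its extension.
            rows.append({"ID": path.stem, "ASSEMBLY": str(path.resolve())})
    with open(samplesheet_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ID", "ASSEMBLY"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `write_samplesheet("/path/to/assemblies", "samplesheet.csv")` produces a file that can be passed to `--samplesheet_input`.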

Outputs

Outputs for each sample will be written to a separate directory under the output directory, named using the sample ID.

The following output files are produced for each sample.

sample-01
├── sample-01_20211202154752_provenance.yml
├── sample-01_alleles.csv
├── sample-01_mlst.json
└── sample-01_sequence_type.csv

The mlst.json output is generated directly by the mlst tool. It has the following format:

[
   {
      "scheme" : "sepidermidis",
      "alleles" : {
         "mutS" : "1",
         "yqiL" : "1",
         "tpiA" : "1",
         "pyrR" : "2",
         "gtr" : "2",
         "aroE" : "1",
         "arcC" : "16"
      },
      "sequence_type" : "184",
      "filename" : "test/example.gbk.gz",
      "id" : "test/example.gbk.gz"
   }
]

The alleles.csv file is generated from the .json output. It includes two boolean (True/False) fields that indicate whether each allele call is a perfect match or a novel allele, based on the presence of ? or ~ characters in the allele call, as described in the mlst documentation.

The per-locus score field is computed using the scoring rules described in the mlst documentation.

The fields in the alleles.csv output are:

sample_id
scheme
locus
allele
perfect_match
novel_allele
score
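
The two boolean fields can be derived from the allele string alone. The sketch below is an illustration rather than the code in parse_alleles.py: the function name `classify_allele` is hypothetical, and it assumes that `~` marks a novel allele, `?` marks a partial match, `-` marks a missing allele, and that a call with none of these counts as a perfect match.

```python
def classify_allele(allele):
    """Return (perfect_match, novel_allele) for a single mlst allele call."""
    allele = str(allele)
    novel_allele = "~" in allele      # e.g. "~16": novel allele similar to 16
    partial = "?" in allele           # e.g. "16?": partial match to allele 16
    missing = allele in ("-", "")     # no allele call for this locus
    # Assumption: a perfect match is any call without the markers above.
    perfect_match = not (novel_allele or partial or missing)
    return perfect_match, novel_allele
```

For the example above, `classify_allele("16")` gives `(True, False)`, while `classify_allele("~16")` gives `(False, True)`.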

The sequence_type.csv file includes an overall sequence type ID based on the allele calls for each locus, and an overall score, which is the sum of the per-locus scores for the sample. The fields in the sequence_type.csv output are:

sample_id
scheme
sequence_type
score
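
Because the overall score is the sum of the per-locus scores, it can be recomputed from an alleles.csv file as a consistency check. This is a small sketch under that assumption. The `overall_score` helper is hypothetical, and the score values in the example rows are made up for illustration; they are not real pipeline output.

```python
import csv
import io


def overall_score(alleles_csv_text):
    """Sum the per-locus 'score' column of an alleles.csv document."""
    reader = csv.DictReader(io.StringIO(alleles_csv_text))
    return sum(float(row["score"]) for row in reader)


# Illustrative rows only: the score values here are invented.
example = (
    "sample_id,scheme,locus,allele,perfect_match,novel_allele,score\n"
    "sample-01,sepidermidis,arcC,16,True,False,10.0\n"
    "sample-01,sepidermidis,aroE,~1,False,True,5.0\n"
)
total = overall_score(example)
```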

Provenance

Each analysis will create a provenance.yml file for each sample. The filename of the provenance.yml file includes a timestamp with format YYYYMMDDHHMMSS to ensure that a unique file will be produced if a sample is re-analyzed and outputs are stored to the same directory.

- pipeline_name: BCCDC-PHL/mlst-nf
  pipeline_version: 0.1.4
  nextflow_session_id: f18b89aa-06f7-41e4-b016-3519dfd5a5cb
  nextflow_run_name: sharp_bhaskara
  timestamp_analysis_start: 2024-02-20T22:59:37.862710
- input_filename: NC-000913.3.fa
  input_path: /home/runner/work/mlst-nf/mlst-nf/.github/data/assemblies/NC-000913.3.fa
  sha256: 6b195feda4c66140f6762742eb8b30c2652f02b45878b174f5b00ef85ecc95d7
- process_name: mlst
  tools:
    - tool_name: mlst
      tool_version: 2.16.1
      parameters:
      - parameter: minid
        value: 95
      - parameter: mincov
        value: 10
      - parameter: minscore
        value: 50
- process_name: quast
  tools:
    - tool_name: quast
      tool_version: 5.0.2
      parameters:
        - parameter: --space-efficient
          value: null
        - parameter: --fast
          value: null
        - parameter: --min-contig
          value: 0
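
Two pieces of the provenance record are easy to reproduce outside the pipeline: the timestamped filename and the sha256 checksum of the input file. The sketch below shows one way to compute both. The function names are hypothetical and the pipeline's own implementation may differ.

```python
import hashlib
from datetime import datetime


def provenance_filename(sample_id, when=None):
    """Build '<sample_id>_<YYYYMMDDHHMMSS>_provenance.yml'."""
    when = when or datetime.now()
    return f"{sample_id}_{when.strftime('%Y%m%d%H%M%S')}_provenance.yml"


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `sha256_of_file()` on an input assembly against the `sha256` value in provenance.yml confirms that the same file was analyzed.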

mlst-nf's People

Contributors

dfornika

mlst-nf's Issues

Add support for `--collect_outputs`

We currently only generate a separate output directory for each sample. But it would be convenient to collect the sequence types for all samples into a single .csv file as well. The user should be able to specify a prefix for the collected outputs, using a --collected_outputs_prefix flag, whose default value is collected.
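
Until such a flag exists, the per-sample files can be collected after a run. The following is a rough sketch of that idea, not a proposed implementation: the `collect_sequence_types` helper and the `<prefix>_sequence_type.csv` output name are assumptions.

```python
import csv
from pathlib import Path


def collect_sequence_types(outdir, prefix="collected"):
    """Concatenate every <sample>/<sample>_sequence_type.csv under outdir into one CSV."""
    outdir = Path(outdir)
    fieldnames = ["sample_id", "scheme", "sequence_type", "score"]
    rows = []
    # Only look one directory level down, so the collected file itself is never re-read.
    for path in sorted(outdir.glob("*/*_sequence_type.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    # Assumption: the collected file is written to the top of the output directory.
    collected_path = outdir / f"{prefix}_sequence_type.csv"
    with open(collected_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return collected_path
```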

Remove `versioned_outdir` param

The versioned_outdir param hasn't proven to be useful, and it clutters up our publishDir directives.

Remove the versioned_outdir param.

Pipeline fails on low-quality assembly

QUAST fails when given an assembly with no contig greater than 500bp, which in turn causes the pipeline to fail. A single poor-quality sample can therefore crash a full run. The pipeline would be more robust if it kept running when it encountered a low-quality sample.
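
One possible approach is to check contig lengths before QC and skip assemblies that would fail. The sketch below only illustrates that check and is not how the pipeline handles the problem. The function names are hypothetical, and the 500 bp threshold and the `>=` comparison are assumptions taken from the description above.

```python
def longest_contig_length(fasta_path):
    """Return the length of the longest sequence in an uncompressed FASTA file."""
    longest = 0
    current = 0
    with open(fasta_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous contig.
                longest = max(longest, current)
                current = 0
            else:
                current += len(line)
    return max(longest, current)


def passes_min_contig(fasta_path, min_length=500):
    """True if at least one contig reaches min_length (assumed threshold)."""
    return longest_contig_length(fasta_path) >= min_length
```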

`parse_alleles.py` fails when no alleles included in mlst output

Command error:
  Traceback (most recent call last):
    File "/home/dfornika/.nextflow/assets/BCCDC-PHL/mlst-nf/bin/parse_alleles.py", line 78, in <module>
      main(args)
    File "/home/dfornika/.nextflow/assets/BCCDC-PHL/mlst-nf/bin/parse_alleles.py", line 29, in main
      num_alleles = len(mlst[sample]['alleles'])
  TypeError: object of type 'NoneType' has no len()

The JSON output from mlst was:

{
   "sample-X.fa" : {
      "scheme" : "-",
      "sequence_type" : "-",
      "alleles" : null,
      "filename" : "sample-X.fa"
   }
}
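
The traceback comes from calling `len()` on the `null` value. A defensive version of that step could treat a missing or null `alleles` entry as an empty mapping. This is a minimal sketch of the guard, not the actual fix applied to parse_alleles.py, and `count_alleles` is a hypothetical helper name.

```python
import json


def count_alleles(mlst_record):
    """Count allele calls, treating a missing or null 'alleles' entry as empty."""
    alleles = mlst_record.get("alleles") or {}
    return len(alleles)


# The record from the report above, where mlst found no matching scheme.
mlst = json.loads(
    '{"sample-X.fa": {"scheme": "-", "sequence_type": "-",'
    ' "alleles": null, "filename": "sample-X.fa"}}'
)
num_alleles = {sample: count_alleles(record) for sample, record in mlst.items()}
```

With this guard, `num_alleles` is `{"sample-X.fa": 0}` and no `TypeError` is raised.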

Adopt nf-core conventions

In anticipation of integrating with tools and platforms like Seqera Platform, we'd like to evaluate what would be necessary to adopt the nf-core conventions for our existing pipelines. Since this is a fairly simple pipeline, it's a good candidate for conversion to nf-core.

Make input QC optional

There are cases where we run this pipeline on the outputs of another pipeline (generally BCCDC-PHL/routine-assembly). That pipeline may already perform QC on its outputs, so running essentially the same QC on the inputs of this pipeline would be redundant.

Add a --skip_input_qc flag that causes the QUAST analysis on the input assemblies to be skipped.

Add optional versioned output directory

The pipeline currently creates one output directory per sample and publishes all outputs there, e.g.:

publishDir "${params.outdir}/${sample_id}", mode: 'copy', pattern: "${sample_id}_mlst.json"

When combining this pipeline with others, it may be useful to encapsulate the outputs from this pipeline in a sub-directory that is named with the pipeline name and version.

So by default we would create outputs of this structure:

.
├── sample-01
│   ├── sample-01_alleles.csv
│   └── sample-01_sequence_type.csv
├── sample-02
│   ├── sample-02_alleles.csv
│   └── sample-02_sequence_type.csv
└── sample-03
    ├── sample-03_alleles.csv
    └── sample-03_sequence_type.csv

...but when running with a --versioned_outdir flag, we would produce:

.
├── sample-01
│   └── mlst-nf-v0.1-output
│       ├── sample-01_alleles.csv
│       └── sample-01_sequence_type.csv
├── sample-02
│   └── mlst-nf-v0.1-output
│       ├── sample-02_alleles.csv
│       └── sample-02_sequence_type.csv
└── sample-03
    └── mlst-nf-v0.1-output
        ├── sample-03_alleles.csv
        └── sample-03_sequence_type.csv
 

...then a subsequent analysis could produce similar outputs alongside:

.
├── sample-01
│   ├── mlst-nf-v0.1-output
│   │   └── sample-01_mlst.csv
│   └── routine-assembly-v0.2-output
│       ├── sample-01_bakta.gbk
│       └── sample-01_unicycler.fa
├── sample-02
│   ├── mlst-nf-v0.1-output
│   │   └── sample-02_mlst.csv
│   └── routine-assembly-v0.2-output
│       ├── sample-02_bakta.gbk
│       └── sample-02_unicycler.fa
└── sample-03
    ├── mlst-nf-v0.1-output
    │   └── sample-03_mlst.csv
    └── routine-assembly-v0.2-output
        ├── sample-03_bakta.gbk
        └── sample-03_unicycler.fa
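
The directory naming in these trees can be expressed as a small path-building rule. The sketch below illustrates the layout only and is not the pipeline's publishDir logic. The `sample_outdir` helper is hypothetical, and truncating the version to major.minor is an assumption based on the `mlst-nf-v0.1-output` names shown above.

```python
from pathlib import Path


def sample_outdir(outdir, sample_id, pipeline_name="mlst-nf",
                  pipeline_version="0.1.4", versioned=False):
    """Return the per-sample output directory, optionally nested in a versioned sub-directory."""
    base = Path(outdir) / sample_id
    if not versioned:
        return base
    # Assumption: only the major.minor part of the version appears in the name.
    major_minor = ".".join(pipeline_version.split(".")[:2])
    return base / f"{pipeline_name}-v{major_minor}-output"
```

For example, `sample_outdir("out", "sample-01", versioned=True)` yields `out/sample-01/mlst-nf-v0.1-output`.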
