pangolin-nf

Call SARS-CoV-2 lineages using pangolin across many sequencing runs. Offers the ability to update pangolin/pangoLEARN to ensure that the latest lineage definitions are used. This pipeline is designed to take the output of BCCDC-PHL/ncov2019-artic-nf as its input, and makes some assumptions about directory structures for finding consensus sequences to analyze.

This pipeline also incorporates a 'genome completeness threshold' to assist with quality control. Genome completeness is the proportion of the full SARS-CoV-2 genome for which consensus sequence was successfully generated. That statistic is included in the output. In addition, the genome_completeness_status field indicates whether the sample was above or below the genome completeness threshold. The threshold is set to 85% by default but can be set to another value using the --genome_completeness_threshold flag.
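The threshold check described above can be sketched as a small shell function. This is an illustration only, not the pipeline's own code: the function name `completeness_status` is hypothetical, and treating a value exactly equal to the threshold as "above" is an assumption the text does not settle.

```shell
#!/usr/bin/env bash
# Sketch: classify a genome completeness value against a threshold.
# Assumption: a value exactly equal to the threshold counts as "above".
completeness_status() {
    local completeness="$1"
    local threshold="${2:-85}"   # mirrors the documented 85% default
    awk -v c="$completeness" -v t="$threshold" 'BEGIN {
        if (c + 0 >= t + 0) {
            print "ABOVE_GENOME_COMPLETENESS_THRESHOLD"
        } else {
            print "BELOW_GENOME_COMPLETENESS_THRESHOLD"
        }
    }'
}

completeness_status 95.1      # default threshold of 85
completeness_status 75.2      # default threshold of 85
completeness_status 88.0 90   # custom threshold of 90
```

The comparison is done in awk because the shell's built-in integer tests cannot compare decimal values such as 95.1.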

The --update_pangolin flag controls whether or not pangolin should be updated before proceeding with analysis. The --update_pangolin_data flag controls whether pangolin's data dependencies such as pangoLEARN models and lineage definitions should be updated before proceeding with analysis. Updates are disabled by default.

Usage

nextflow run BCCDC-PHL/pangolin-nf \
  [--update_pangolin] \
  [--update_pangolin_data] \
  [--ivar_consensus] \
  [--genome_completeness_threshold <genome_completeness_threshold>] \
  --analysis_parent_dir <analysis_parent_dir> \
  --outdir <outdir>

Output

| run_id | sample_id | genome_completeness | genome_completeness_status | lineage | conflict | pangoLEARN_version | pangolin_version | pango_version | status | note |
|---|---|---|---|---|---|---|---|---|---|---|
| 210330_M01234_0123_000000000-G653A | sample-01 | 95.1 | ABOVE_GENOME_COMPLETENESS_THRESHOLD | B.1 | 0 | 2021-04-28 | 2.4 | v1.1.23 | passed_qc | |
| 210330_M01234_0123_000000000-G653A | sample-02 | 75.2 | BELOW_GENOME_COMPLETENESS_THRESHOLD | P.1 | 0 | 2021-04-28 | 2.4 | v1.1.23 | passed_qc | 15/17 P.1 (B.1.1.28.1) SNPs (1 ref and 0 other) |
| 210330_M01234_0123_000000000-G653A | sample-03 | 0 | BELOW_GENOME_COMPLETENESS_THRESHOLD | None | 0 | 2021-04-28 | 2.4 | v1.1.23 | fail | N_content:1.0 |

pangolin-nf's Issues

Output a single `pangolin_lineages.csv` file

The pipeline currently produces two output files, named pangolin_lineages_with_incomplete.csv and pangolin_lineages.csv.

Create only one output file, called pangolin_lineages.csv that includes both incomplete and complete genomes.

Adjust update parameters

As of v3.x, pangolin provides two separate update-related flags: --update and --update-data. The --update flag is used to update the pangolin tool, while the --update-data flag is used to update the pangoLEARN model and other datasets related to scorpio.

Change this pipeline's --update flag to --update_pangolin.

Add a --update_pangolin_data flag.
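One way to picture the mapping between the pipeline's two switches and pangolin's own flags is the dry-run sketch below. The function name and the "true"/"false" string encoding of the switches are assumptions made for illustration. The function only prints the commands it would run, so it is safe to try without pangolin installed.

```shell
#!/usr/bin/env bash
# Sketch: translate the pipeline's update switches into pangolin commands.
# $1 stands in for --update_pangolin, $2 for --update_pangolin_data
# (hypothetical "true"/"false" encoding). Commands are printed, not executed.
pangolin_update_commands() {
    local update_tool="$1"
    local update_data="$2"
    if [ "$update_tool" = "true" ]; then
        echo "pangolin --update"        # updates the pangolin tool itself
    fi
    if [ "$update_data" = "true" ]; then
        echo "pangolin --update-data"   # updates models and lineage data
    fi
}

pangolin_update_commands true false
pangolin_update_commands false true
```

Keeping the two switches independent means that data can be refreshed without also pulling in a new tool version.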

Make pipeline more robust to changes in pangolin output format

The current implementation of this pipeline has a fairly rigid dependency on the number of fields and order of fields in the pangolin output.

In particular, this command relies on the ordering and number of fields in pangolin output remaining unchanged:

head -n 1 ${lineage_report} | awk -F ',' 'BEGIN {OFS=FS}; {print \$1,\$2,"genome_completeness","genome_completeness_status",\$3,\$4,\$5,\$6,\$7,\$8,\$9,\$10,\$11,\$12,\$13,\$14}' > header.csv

This makes the pipeline brittle whenever the pangolin output changes. In general it is preferable to parse the output by field header, to handle extra or missing fields gracefully, and to make no assumptions about the order of fields in the output.
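A minimal sketch of header-based parsing in awk follows. It is not the fix that was adopted in the pipeline; the function name `select_columns` is hypothetical, and the naive comma split assumes that no field contains a quoted comma.

```shell
#!/usr/bin/env bash
# Sketch: pick CSV columns by header name instead of by position.
# $1: comma-separated list of wanted column names; stdin: CSV with a header row.
# Limitation: fields containing quoted commas are not handled.
select_columns() {
    awk -F ',' -v wanted="$1" '
        BEGIN { OFS = FS; n = split(wanted, names, ",") }
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i }   # header name -> position
        {
            line = ""
            for (j = 1; j <= n; j++) {
                if (NR == 1) {
                    value = names[j]            # emit the requested header
                } else if (names[j] in idx) {
                    value = $(idx[names[j]])    # look the value up by name
                } else {
                    value = "NA"                # column missing from the input
                }
                sep = (j > 1) ? OFS : ""
                line = line sep value
            }
            print line
        }'
}

printf 'taxon,lineage,extra,status\ns1,B.1,x,passed_qc\n' \
    | select_columns 'taxon,status,lineage,note'
```

Extra input columns are ignored and missing ones are filled with NA, so a reordered or extended report no longer shifts values into the wrong fields.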

Include positive controls in results

Request to include the positives in the pangolin pipeline results (for positive control QC - ensure they aren't a different lineage than expected).

Negatives probably not important since we monitor % completeness for those.

This will also allow John/end users to check positives in our merged output file with updated lineages.

Select between ivar-generated and freebayes-generated consensus sequences

The consensus sequences generated by BCCDC-PHL/ncov2019-artic-nf are written to two directories. Those generated by ivar are stored in <outdir>/ncovIllumina_sequenceAnalysis_makeConsensus, while those generated by freebayes are stored in ncovIllumina_sequenceAnalysis_callConsensusFreebayes.

Add a parameter --ivar_consensus that will select the ivar-generated consensus sequence. Otherwise, the freebayes-generated sequence will be used.
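The selection rule can be sketched as below. The function name and the "true"/"false" encoding of the flag are assumptions; the two directory names are taken from the description above.

```shell
#!/usr/bin/env bash
# Sketch: choose the consensus subdirectory based on --ivar_consensus.
# $1: "true" when --ivar_consensus was given (hypothetical string encoding).
consensus_subdir() {
    if [ "$1" = "true" ]; then
        echo "ncovIllumina_sequenceAnalysis_makeConsensus"            # ivar
    else
        echo "ncovIllumina_sequenceAnalysis_callConsensusFreebayes"   # freebayes (default)
    fi
}

consensus_subdir true
consensus_subdir false
```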

Choose most recent ARTIC analysis output directory version instead of specifying one

As we update our BCCDC-PHL/ncov2019-artic-nf pipeline, we produce output directories that are named with the pipeline version (e.g. `ncov2019-artic-nf-v1.3-output`).

Originally, this pipeline used a parameter to choose which ARTIC output directory version to analyze. But we are now at a point where no single ARTIC analysis version has results for all of our runs.

Instead, choose the most recent ARTIC analysis output directory (based on version number).
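A sketch of picking the highest-versioned directory is shown below. It assumes that `sort -V` (version sort, available in GNU coreutils) and a `find` that supports `-maxdepth` are present; the function name is hypothetical, and this is not necessarily how the pipeline implements the selection.

```shell
#!/usr/bin/env bash
# Sketch: find the most recent ncov2019-artic-nf output directory by version.
# $1: run directory containing ncov2019-artic-nf-v*-output subdirectories.
# Version sort matters: it orders v1.10 after v1.3, unlike a plain text sort.
latest_artic_output_dir() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name 'ncov2019-artic-nf-v*-output' \
        | sort -V \
        | tail -n 1
}

# Example use (the path is a placeholder):
#   latest_artic_output_dir /path/to/run_dir
```

The function prints nothing when no matching directory exists, so callers can test for an empty result before continuing.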

Analysis will fail if a run directory exists but has no qc.csv file

Before running pangolin, we use the prepare_multi_fasta process to create a multi-fasta file for each run. This is done by looking at the run's qc.csv file that is generated by the ncov2019-artic-nf pipeline:

export LATEST_ANALYSIS=\$(cat ${latest_artic_analysis_version})
tail -n+2 ${analysis_dir}/ncov2019-artic-nf-\${LATEST_ANALYSIS}-output/*.qc.csv | grep -iv '^NEG' | grep -iv '^POS' | awk -F "," 'BEGIN {OFS=FS}; \$2 < (100 - ${genome_completeness_threshold}) {print \$1,(100 - \$2)}' > ${run_id}_above_completeness_threshold.csv
tail -n+2 ${analysis_dir}/ncov2019-artic-nf-\${LATEST_ANALYSIS}-output/*.qc.csv | grep -iv '^NEG' | grep -iv '^POS' | awk -F "," 'BEGIN {OFS=FS}; \$2 > (100 - ${genome_completeness_threshold}) {print \$1,(100 - \$2)}' > ${run_id}_below_completeness_threshold.csv
while IFS="," read -r sample_id percent_n; do
    cat ${analysis_dir}/ncov2019-artic-nf-\${LATEST_ANALYSIS}-output/ncovIllumina_sequenceAnalysis_makeConsensus/\${sample_id}*.fa \
    | awk -F "_" '/^>/ { split(\$2, a, "."); print ">"a[1] }; !/^>/ { print \$0 }' \
    >> ${run_id}.consensus.fa;
done < <(cat ${run_id}_above_completeness_threshold.csv ${run_id}_below_completeness_threshold.csv)

In a case where an analysis output directory exists, but the qc.csv file does not exist or is empty, one of the outputs (<run_id>.consensus.fa) of prepare_multi_fasta will not be created, causing the pipeline to fail.
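One possible guard is sketched below: check that qc.csv is usable before reading it, and always create the output file so that the declared output exists. Both function names and their arguments are hypothetical, and whether downstream steps tolerate an empty multi-fasta file is an open assumption.

```shell
#!/usr/bin/env bash
# Sketch: make multi-fasta preparation tolerant of a missing or empty qc.csv.

# Succeeds only if the file exists, is non-empty, and has a row after the header.
qc_file_usable() {
    [ -s "$1" ] && tail -n +2 "$1" | grep -q .
}

# $1: path to qc.csv, $2: output multi-fasta path (hypothetical arguments).
prepare_consensus() {
    : > "$2"   # always create the output, even if it stays empty
    if ! qc_file_usable "$1"; then
        echo "WARNING: no usable qc.csv at $1; writing empty $2" >&2
        return 0
    fi
    # ... append each sample's consensus sequence to "$2" here ...
}

# Example use (paths are placeholders):
#   prepare_consensus run_dir/run.qc.csv run.consensus.fa
```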

Use nextflow work dir as tempdir

Pangolin allows the user to specify which directory to use as the tempdir. By default it uses the system's $TMPDIR. On some systems that directory has limited capacity, and filling it can cause problems for other processes on the system.

Use the nextflow work dir as the tempdir.
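A sketch of this idea follows: create a scratch directory under the current working directory (inside a Nextflow task, that is the task's work dir) and hand it to pangolin through its --tempdir option. The function name and the `pangolin_tmp` directory name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: create a scratch directory inside the current working directory
# and print its path. Inside a Nextflow task, $PWD is the task's work dir.
make_local_tempdir() {
    local dir="${PWD}/pangolin_tmp"   # hypothetical directory name
    mkdir -p "$dir"
    echo "$dir"
}

# Intended use (not executed here; consensus.fa is a placeholder input):
#   pangolin --tempdir "$(make_local_tempdir)" consensus.fa
```

Because the scratch directory lives in the task's work dir, it is cleaned up along with the rest of the work dir.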

Include genome completeness in output

Include a field genome_completeness for each sample in the output file.

The value should be calculated as 100 - pct_N_bases (from the ARTIC qc.csv file).
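The calculation can be sketched with awk as below. The sketch assumes the sample name is in column 1 and pct_N_bases in column 2 of qc.csv, which is how the pipeline's existing awk commands read the file. The function name, the output header names, and the rounding to one decimal place are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: derive genome_completeness as 100 - pct_N_bases from a qc.csv stream.
# Assumes column 1 is the sample name and column 2 is pct_N_bases.
add_genome_completeness() {
    awk -F ',' 'BEGIN { OFS = FS }
        NR == 1 { print "sample_id", "genome_completeness"; next }   # replace header
        { printf "%s%s%.1f\n", $1, OFS, 100 - $2 }'
}

printf 'sample_name,pct_N_bases\nsample-01,4.9\nsample-02,24.8\n' \
    | add_genome_completeness
```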

Bump pangolin version

Bump the version of pangolin that's installed in the pipeline's conda env.

In practice, we will also update pangolin directly within the env on a regular basis as updates to pangolin are released.

Field `pangolin_version` is duplicated in output

As of pangolin v2.4.2, the pangolin_version field is included in the standard pangolin output. This pipeline had been adding that field manually, but that is no longer necessary. Currently, the pangolin_version field is duplicated, and the note field is missing from the output.

Remove the duplicate pangolin_version field from the output and restore the output format.

Include records for all samples for run in output

Include records for all samples on the sequencing run in the output. This includes:

  • Samples whose consensus files are below the genome completeness threshold
  • Samples that failed to produce a consensus sequence

...but does not include:

  • Positive and negative control samples

For samples that are below the genome completeness threshold or don't have a consensus sequence, report their lineage as NA. Also add a value to the note field to explain why the lineage is NA.

Make pangolin updates optional

A few recent pangolin updates have introduced changes that cause this pipeline to break. It would be useful to have a flag that controls whether or not pangolin is updated before the pipeline is run. With that flag available, we could temporarily turn off pangolin updates while preparing the rest of the pipeline for the update.
