analysis-workflows

Overview

The McDonnell Genome Institute (MGI) and contributing staff, faculty, labs and departments of Washington University School of Medicine (WUSM) share Common Workflow Language (CWL) workflow definitions focused on reusable, reproducible analysis pipelines for genomics data.

Structure

The main structure of this repo is described in the following table:

| Path | Description |
| --- | --- |
| definitions | parent directory containing all CWL tool and workflow definitions |
| definitions/pipelines | workflows that rely on subworkflows and tools to produce final outputs |
| definitions/subworkflows | workflows that combine multiple tools to produce intermediate outputs, which serve as inputs to other subworkflows or pipelines |
| definitions/tools | CWL definitions that wrap individual command-line interfaces or scripts |
| definitions/types | custom CWL data types for inputs to tools and workflows |
| example_data | example input data, input YAML files, and expected output files for testing |
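
To illustrate the layout, a minimal tool wrapper in definitions/tools might look like the following sketch. The tool, file name, and Docker tag are hypothetical and not taken from this repository; consult the actual definitions for real examples.

```yaml
#!/usr/bin/env cwl-runner
# samtools_flagstat.cwl -- hypothetical example of a tool wrapper
cwlVersion: v1.0
class: CommandLineTool
label: "samtools flagstat (illustrative sketch)"
baseCommand: ["samtools", "flagstat"]
requirements:
  - class: DockerRequirement
    dockerPull: "mgibio/samtools-cwl:1.0.0"  # hypothetical image tag
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  flagstats:
    type: stdout
stdout: "$(inputs.bam.nameroot).flagstat"
```

Pipelines and subworkflows then reference such wrappers as `run:` steps, which is what keeps the tool definitions reusable across workflows.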

Documentation

All documentation of CWL pipelines, subworkflows, and tools, as well as additional information regarding test data, continuous integration, and configuration, can be found on the GitHub wiki: https://github.com/genome/analysis-workflows/wiki

Quick Start

Workflows

Download our repository with git clone https://github.com/genome/analysis-workflows.git

The official CWL user guide covers the basics of reading and writing CWL files, constructing input files, and running workflows.
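
As a sketch, an input YAML pairs each workflow input with a value or a File entry. The input names and paths below are illustrative only; the YAMLs packaged with the example data (see the Data section) show the real input names for each pipeline.

```yaml
# example_inputs.yml -- illustrative input file for a hypothetical pipeline
reference:
  class: File
  path: /path/to/all_sequences.fa
tumor_bam:
  class: File
  path: /path/to/tumor.bam
normal_bam:
  class: File
  path: /path/to/normal.bam
```

A file like this is then passed to a CWL runner, e.g. `cwltool definitions/pipelines/<pipeline>.cwl example_inputs.yml`.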

Workflow Execution Service

These workflow definitions are built for interoperability with any Workflow Execution Service (WES) schema compatible implementation that supports CWL.

Each CWL file is validated using cwltool. Additional workflow definition testing has been performed with Cromwell, although there are currently no automated workflow tests using Cromwell.

Docker

In order to provide a portable environment, each tool in our workflow definitions has a designated Docker container. Docker can be downloaded from https://www.docker.com/.

All MGI-supported Docker images used in the tool definitions are available from the mgibio organization on Docker Hub.

Many tools rely on third-party Docker images publicly available from sources such as Docker Hub and BioContainers.
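
Each tool definition pins its container through a CWL DockerRequirement; a sketch with a hypothetical image tag (third-party images from Docker Hub or BioContainers are pinned the same way):

```yaml
hints:
  - class: DockerRequirement
    dockerPull: "mgibio/bam-readcount:1.0.0"  # hypothetical image and tag
```

Pinning an explicit tag rather than `latest` is what makes the resulting runs reproducible.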

Data

Full reference data is documented on the wiki and will be available for download there (coming soon).

Example data, packaged together with fully populated input YAMLs corresponding to the top-level workflows in this repo's definitions/pipelines directory, can be found in our public GCP bucket. To download this package, use our helper Docker container: docker run -v <desired_absolute_path>:/staging mgibio/data_downloader:0.1.0 gsutil -m cp -r gs://analysis-workflows-example-data /staging

Note: We are currently migrating and updating our example data. Files within the example_data directory of this repository are no longer fully supported, and some are out of date. Moving forward, all data will be hosted in GCP. The instructions above currently download the full, uncompressed example data set (~800 MB). More granular, compressed downloads are upcoming. Advanced users may explore the bucket structure and download individual files using wget https://storage.googleapis.com/analysis-workflows-example-data/[path_to_file] (omitting path_to_file will download a manifest describing the directory structure).

Contributions

A big thanks to all of the developers, bioinformaticians, and scientists who built this resource. For a complete list of software contributions (i.e., commits) to this repository, please see the GitHub Contributors section.

Collaborators

The following WUSM collaborators have provided significant contributions in terms of workflow design, scientific direction, and validation of analysis-workflows output.

Departments, Institutes, and Labs

Partners

DOI

Contributors

acoffman, apaul7, bryanfisk, chad388, chrisamiller, dufeiyu, grua, gschang, guesu, irenaeuschan, jasonwalker80, jhundal, johnegarza, johnmaruska, leylabwustl, malachig, matthew-mosior, saimukund20, sam16711, sridhar0605, susannasiebert, tmooney, zlskidmore


analysis-workflows's Issues

Add DoCM genotyping

Would want to add read counts (not necessarily from bam-readcount) and VAFs to the VCF.
Platypus, SAMtools, or GATK to call variants at each site?
Does it support indels?
Proposed filter: at least 5 supporting reads and 1% VAF.
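
The proposed threshold could be sketched as a simple predicate (assuming the 1% refers to VAF; the function and parameter names are illustrative, not from this repository):

```python
def passes_docm_filter(alt_reads: int, total_reads: int,
                       min_reads: int = 5, min_vaf: float = 0.01) -> bool:
    """Check the proposed DoCM filter: at least `min_reads` supporting
    reads and a variant allele fraction of at least `min_vaf`."""
    if total_reads == 0:
        return False
    vaf = alt_reads / total_reads
    return alt_reads >= min_reads and vaf >= min_vaf
```

A site with 10 supporting reads out of 200 (VAF 5%) would pass, while 5 reads out of 1000 (VAF 0.5%) would not.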

Add synonyms input

Define a synonyms input file to map chromosome names between GRCh38DH and Ensembl.
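
Such a synonyms file is typically a two-column, tab-separated mapping between equivalent chromosome names; a sketch (the exact format depends on the consuming tool, and these rows are illustrative):

```
# tab-separated: GRCh38DH name <TAB> Ensembl name
chr1	1
chr2	2
chrM	MT
```

Tools that accept a synonyms file can then resolve records regardless of which naming scheme the input uses.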

Import GRCh38 reference genome.

GRCh38 reference FASTA with index and sequence dictionary:

/gscmnt/gc9005/info/model_data/2887491634/build4ec1c5bd1f6941b8a99f2e230217cb91/all_sequences.fa
/gscmnt/gc9005/info/model_data/2887491634/build4ec1c5bd1f6941b8a99f2e230217cb91/all_sequences.dict
/gscmnt/gc9005/info/model_data/2887491634/build4ec1c5bd1f6941b8a99f2e230217cb91/all_sequences.fa.fai

Basic somatic SNV and small indel variant detection strategy.

On the Arvados trial, implement a basic SNV and small indel detection strategy using CWL and containerized tools (via Docker). The workflow will run HCC1395 exome and WGS data through variant detection and a false-positive filter. The final product is a VEP-annotated somatic VCF.

Add Exome QC step

An optional QC step to evaluate target regions and coverage. Metrics to be provided by Dave Spencer via @dufeiyu

Strelka requires increased keep cache

Hi Jason,

Strelka seems to be very I/O intensive, and it's very likely that it is bottlenecked on the input data coming from Keep. The way to solve this is by increasing the Keep cache. To do so, add:

$namespaces:
  arv: "http://arvados.org/cwl#"

(similar to requirements/hints)

and under hints:

  - class: arv:RuntimeConstraints
    keep_cache: 40000

The 40000 example is in MB. You should change the keep_cache number depending on the node size you ask for. When doing the calculations, the keep_cache should be similar to RAM: for example, if you ask for a 2000 MB Keep cache, ask for a node size whose minimum RAM is 2 GB plus however much you need for the actual computation.
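
Put together, the relevant fragment of the workflow definition would look roughly like this (the keep_cache value is illustrative and should be sized to the node's RAM as described above):

```yaml
$namespaces:
  arv: "http://arvados.org/cwl#"
hints:
  - class: arv:RuntimeConstraints
    keep_cache: 40000  # MB; size this similarly to the node's RAM
```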

Hope this helps!

Thanks,
Bryan

On Thu Nov 03 16:45:47 2016, [email protected] wrote:

Hi,

It appears our Strelka job is still running as part of our WGS trial on
the public cloud cluster. Here is the pipeline URL:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-ujd9m79gutuc3bs

I'd normally expect this job to complete in 4 hours with 8 CPUs. The log
shows the current resource utilization, but I'm not seeing any logs from
strelka itself.

This is related to ticket #137.

Thanks,
Jason

bam-readcount for pVAC-Seq

After all variant detection and annotation steps have completed, run bam-readcount to generate the native-format files used as input to pVAC-Seq.
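
An invocation of roughly this shape could produce the per-site counts (the file names and quality thresholds here are illustrative, not taken from this repository):

```
bam-readcount -f all_sequences.fa -l variant_sites.txt \
    -q 20 -b 20 -w 1 tumor.bam > tumor.readcounts.tsv
```

Here -f gives the reference FASTA, -l restricts counting to a file of variant sites, -q and -b set minimum mapping and base qualities, and -w limits repeated warnings.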

Account for CRAM conversion.

We either need to:

  1. Natively support CRAM throughout the workflow
  2. Convert back to BAM first thing
  3. Convert back to BAM only for steps that require BAM instead of CRAM
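
For options 2 and 3, the conversion step itself is a single samtools call; a sketch with illustrative file names (CRAM decoding requires the same reference the file was encoded against):

```
samtools view -b -T all_sequences.fa -o aligned.bam aligned.cram
```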

Import HCC1395 WGS aligned BAM files.

Tumor:
/gscmnt/gc1401/info/build_merged_alignments/merged-alignment-blade14-4-8.gsc.wustl.edu-jwalker-31054-58617c541153434c8c61e0ee93aef61a/58617c541153434c8c61e0ee93aef61a.bam

Normal:
/gscmnt/gc13030/info/build_merged_alignments/merged-alignment-blade13-2-1.gsc.wustl.edu-jwalker-6069-2bf0e5226390471dadd4e1127e30976b/2bf0e5226390471dadd4e1127e30976b.bam

Attempt running detect_variants.cwl with Toil

As a test of CWL portability and an evaluation of the Toil workflow engine, attempt to run our detect_variants.cwl workflow, ideally using Toil's --batchSystem option with LSF for job execution.
