analysis-workflows

Overview

The McDonnell Genome Institute (MGI) and contributing staff, faculty, labs and departments of Washington University School of Medicine (WUSM) share Common Workflow Language (CWL) workflow definitions focused on reusable, reproducible analysis pipelines for genomics data.

Structure

The main structure of this repo is described in the following table:

| Path | Description |
| --- | --- |
| definitions | parent directory containing all CWL tool and workflow definitions |
| definitions/pipelines | workflows that rely on subworkflows and tools to produce final outputs |
| definitions/subworkflows | workflows that combine multiple tools to produce intermediate outputs, which serve as inputs to other subworkflows or pipelines |
| definitions/tools | CWL definitions that wrap individual command-line interfaces or scripts |
| definitions/types | custom CWL data types for inputs to tools and workflows |
| example_data | example input data, input YAML files, and expected output files for testing |
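
To illustrate the layout, a minimal tool wrapper in definitions/tools might look like the following sketch. The tool, file name, and Docker tag are hypothetical and not taken from this repository; consult the actual definitions for real examples.

```yaml
#!/usr/bin/env cwl-runner
# samtools_flagstat.cwl -- hypothetical example of a tool wrapper
cwlVersion: v1.0
class: CommandLineTool
label: "samtools flagstat (illustrative sketch)"
baseCommand: ["samtools", "flagstat"]
requirements:
  - class: DockerRequirement
    dockerPull: "mgibio/samtools-cwl:1.0.0"  # hypothetical image tag
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  flagstats:
    type: stdout
stdout: "$(inputs.bam.nameroot).flagstat"
```

Pipelines and subworkflows then reference such wrappers as `run:` steps, which is what keeps the tool definitions reusable across workflows.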

Documentation

All documentation of CWL pipelines, subworkflows, and tools, as well as additional information regarding test data, continuous integration, and configuration, can be found on the GitHub wiki: https://github.com/genome/analysis-workflows/wiki

Quick Start

Workflows

Download our repository with git clone https://github.com/genome/analysis-workflows.git

The official CWL user guide covers the basics of reading and writing CWL files, constructing input files, and running workflows.
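
As a sketch, an input YAML pairs each workflow input with a value or a File entry. The input names and paths below are illustrative only; the YAMLs packaged with the example data (see the Data section) show the real input names for each pipeline.

```yaml
# example_inputs.yml -- illustrative input file for a hypothetical pipeline
reference:
  class: File
  path: /path/to/all_sequences.fa
tumor_bam:
  class: File
  path: /path/to/tumor.bam
normal_bam:
  class: File
  path: /path/to/normal.bam
```

A file like this is then passed to a CWL runner, e.g. `cwltool definitions/pipelines/<pipeline>.cwl example_inputs.yml`.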

Workflow Execution Service

These workflow definitions are built for interoperability with any Workflow Execution Service (WES) schema compatible implementation that supports CWL.

Each CWL file is validated using cwltool. Additional workflow definition testing has been performed with Cromwell, although there are currently no automated workflow tests using Cromwell.

Docker

In order to provide a portable environment, each tool in our workflow definitions has a designated Docker container. Docker can be downloaded from https://www.docker.com/.

All MGI-supported Docker images used in the tool definitions are available from the mgibio organization on Docker Hub.

Many tools rely on third-party Docker images publicly available from sources such as Docker Hub and BioContainers.
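
Each tool definition pins its container through a CWL DockerRequirement; a sketch with a hypothetical image tag (third-party images from Docker Hub or BioContainers are pinned the same way):

```yaml
hints:
  - class: DockerRequirement
    dockerPull: "mgibio/bam-readcount:1.0.0"  # hypothetical image and tag
```

Pinning an explicit tag rather than `latest` is what makes the resulting runs reproducible.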

Data

Full reference data is documented on the wiki and will be available for download there (coming soon).

Example data, packaged together with fully populated input YAMLs corresponding to the top-level workflows in this repo's definitions/pipelines directory, can be found in our public GCP bucket. To download this package, use our helper Docker container: docker run -v <desired_absolute_path>:/staging mgibio/data_downloader:0.1.0 gsutil -m cp -r gs://analysis-workflows-example-data /staging

Note: We are currently migrating and updating our example data. Files within the example_data directory of this repository are no longer fully supported, and some are out of date. Moving forward, all data will be hosted in GCP. The instructions above currently download the full, uncompressed example data set (~800 MB). More granular, compressed downloads are upcoming. Advanced users may explore the bucket structure and download individual files using wget https://storage.googleapis.com/analysis-workflows-example-data/[path_to_file] (omitting path_to_file will download a manifest describing the directory structure).

Contributions

A big thanks to all of the developers, bioinformaticians, and scientists who built this resource. For a complete list of software contributions (i.e., commits) to this repository, please see the GitHub Contributors section.

Collaborators

The following WUSM collaborators have provided significant contributions in terms of workflow design, scientific direction, and validation of analysis-workflows output.

Departments, Institutes, and Labs

Partners

DOI

Contributors

acoffman, apaul7, bryanfisk, chad388, chrisamiller, dufeiyu, grua, gschang, guesu, irenaeuschan, jasonwalker80, jhundal, johnegarza, johnmaruska, leylabwustl, malachig, matthew-mosior, saimukund20, sam16711, sridhar0605, susannasiebert, tmooney, zlskidmore


analysis-workflows's Issues

Add DoCM genotyping

Would want to add read counts (not necessarily from bam-readcount) and VAFs to the VCF.
Platypus, SAMtools, or GATK to call variants at each site?
Does it support indels?
Proposed filter: at least 5 supporting reads and 1% VAF.
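
The proposed threshold could be sketched as a simple predicate (assuming the 1% refers to VAF; the function and parameter names are illustrative, not from this repository):

```python
def passes_docm_filter(alt_reads: int, total_reads: int,
                       min_reads: int = 5, min_vaf: float = 0.01) -> bool:
    """Check the proposed DoCM filter: at least `min_reads` supporting
    reads and a variant allele fraction of at least `min_vaf`."""
    if total_reads == 0:
        return False
    vaf = alt_reads / total_reads
    return alt_reads >= min_reads and vaf >= min_vaf
```

A site with 10 supporting reads out of 200 (VAF 5%) would pass, while 5 reads out of 1000 (VAF 0.5%) would not.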

Add synonyms input

Define a synonyms input file to map chromosome names between GRCh38DH and Ensembl.
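
Such a synonyms file is typically a two-column, tab-separated mapping between equivalent chromosome names; a sketch (the exact format depends on the consuming tool, and these rows are illustrative):

```
# tab-separated: GRCh38DH name <TAB> Ensembl name
chr1	1
chr2	2
chrM	MT
```

Tools that accept a synonyms file can then resolve records regardless of which naming scheme the input uses.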

Import GRCh38 reference genome.

GRCh38 reference FASTA with index and sequence dictionary:

/gscmnt/gc9005/info/model_data/2887491634/build4ec1c5bd1f6941b8a99f2e230217cb91/all_sequences.fa
/gscmnt/gc9005/info/model_data/2887491634/build4ec1c5bd1f6941b8a99f2e230217cb91/all_sequences.dict
/gscmnt/gc9005/info/model_data/2887491634/build4ec1c5bd1f6941b8a99f2e230217cb91/all_sequences.fa.fai

Basic somatic SNV and small indel variant detection strategy.

On the Arvados trial, implement a basic SNV and small indel detection strategy using CWL and containerized tools (via Docker). The workflow will run HCC1395 exome and WGS data through variant detection and a false-positive filter. The final product is a VEP-annotated somatic VCF.

Add Exome QC step

An optional QC step to evaluate target regions and coverage. Metrics to be provided by Dave Spencer via @dufeiyu

Strelka requires increased keep cache

Hi Jason,

Strelka seems to be very I/O intensive, and it's very likely that it is bottlenecked on the input data coming from Keep. The way to solve this is by increasing the Keep cache. To do so, add:

$namespaces:
  arv: "http://arvados.org/cwl#"

(similar to requirements/hints)

and under hints:

  - class: arv:RuntimeConstraints
    keep_cache: 40000

The 40000 example is in MB. You should change the keep_cache number depending on the node size you ask for. When doing the calculations, the keep_cache should be similar to RAM: for example, if you ask for a 2000 MB Keep cache, ask for a node size whose minimum RAM is 2 GB plus however much you need for the actual computation.
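
Put together, the relevant fragment of the workflow definition would look roughly like this (the keep_cache value is illustrative and should be sized to the node's RAM as described above):

```yaml
$namespaces:
  arv: "http://arvados.org/cwl#"
hints:
  - class: arv:RuntimeConstraints
    keep_cache: 40000  # MB; size this similarly to the node's RAM
```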

Hope this helps!

Thanks,
Bryan

On Thu Nov 03 16:45:47 2016, [email protected] wrote:

Hi,

It appears our Strelka job is still running as part of our WGS trial on
the public cloud cluster. Here is the pipeline URL:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-ujd9m79gutuc3bs

I'd normally expect this job to complete in 4 hours with 8 CPUs. The log
shows the current resource utilization, but I'm not seeing any logs from
strelka itself.

This is related to ticket #137.

Thanks,
Jason

bam-readcount for pVAC-Seq

After all variant detection and annotation steps have completed, run bam-readcount to generate the native-format files used as input to pVAC-Seq.
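
An invocation of roughly this shape could produce the per-site counts (the file names and quality thresholds here are illustrative, not taken from this repository):

```
bam-readcount -f all_sequences.fa -l variant_sites.txt \
    -q 20 -b 20 -w 1 tumor.bam > tumor.readcounts.tsv
```

Here -f gives the reference FASTA, -l restricts counting to a file of variant sites, -q and -b set minimum mapping and base qualities, and -w limits repeated warnings.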

Account for CRAM conversion.

We either need to:

  1. Natively support CRAM throughout the workflow
  2. Convert back to BAM first thing
  3. Convert back to BAM only for steps that require BAM instead of CRAM
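
For options 2 and 3, the conversion step itself is a single samtools call; a sketch with illustrative file names (CRAM decoding requires the same reference the file was encoded against):

```
samtools view -b -T all_sequences.fa -o aligned.bam aligned.cram
```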

Import HCC1395 WGS aligned BAM files.

Tumor:
/gscmnt/gc1401/info/build_merged_alignments/merged-alignment-blade14-4-8.gsc.wustl.edu-jwalker-31054-58617c541153434c8c61e0ee93aef61a/58617c541153434c8c61e0ee93aef61a.bam

Normal:
/gscmnt/gc13030/info/build_merged_alignments/merged-alignment-blade13-2-1.gsc.wustl.edu-jwalker-6069-2bf0e5226390471dadd4e1127e30976b/2bf0e5226390471dadd4e1127e30976b.bam

Attempt running detect_variants.cwl with Toil

As a test of CWL portability and an evaluation of the Toil workflow engine, attempt to run our detect_variants.cwl workflow, ideally using Toil's --batchSystem option with LSF for job execution.
