Giter Club home page Giter Club logo

aws-batch-genomics's Introduction

Genomics Research on AWS

A tutorial on how to package and deploy a bioinformatics workflow on AWS using AWS Batch

This tutorial will cover the material presented within the "Genomics Workflows on AWS" blog post series (Part 1, Part 2, Part 3, Part 4) that covers the basics of bootstrapping a bioinformatics analysis pipeline on AWS. We break down the tutorial roughly as follows:

  1. Setting up your AWS account (if you do not already have one)
  2. Package a set of bioinformatics applications using Docker
  3. Create a AWS Batch environment for analysis
  4. Define and deploy AWS Step Functions to control the data processing steps
  5. Initiate a workflow

Step 1. Set up an AWS account and the AWS Command Line Interface

If you don't already have a AWS account, please follow the "Backup Files to Amazon S3 using the AWS CLI" 10 minute tutorial to make sure you have a working AWS environment needed for this tutorial.

We will also assume that you are working on a system that can build Docker images. If not, please refer to the Docker getting started guides.

To ensure that you are logged into your Amazon ECR repos, type eval $(aws ecr get-login) into your terminal

Step 2. Building Docker images for applications

Each individual application used in the tutorial has been packaged as a Docker container. The Docker container packaging methodology is described in Part 2 of the tutorial. We have provided a Makefile to build all of the necessary Docker images for this demonstration project, as well as deploy each container to a AWS EC2 Container Registry (ECR). Below is an example to enable you to build and deploy the Docker container image for Isaac. Prior to installing Isaac please review the license here and verify it is acceptable to you for your use.

# From the root of the repository directory
cd tools/isaac/docker

# Create a ECR repository for Isaac, then copy the `repositoryUri` into a variable
REPO_URI=$(aws ecr create-repository --repository-name isaac \
  --output text --query "repository.repositoryUri")

# If the repository already exists, then
# query ECR for the `repositoryUri`
REPO_URI=$(aws ecr describe-repositories \
  --repository-names isaac \
  --output text --query "repositories[0].repositoryUri")

# execute the build supplying the repositoryUri
make REGISTRY=${REPO_URI}

Repeat the above for SAMtools (license here), Strelka (license here), and SnpEff (license is LGPLv3, as referenced here).

Step 3. Setting up AWS Batch for an analysis environment

First, you will want to set up all of your IAM policies and roles with the following command:

aws cloudformation create-stack --stack-name iam-batch --template-body file://batch/setup/iam.template.yaml --capabilities CAPABILITY_NAMED_IAM

We have provided a CloudFormation template as well as a python script (either/or) to configure your batch environment. While you can also do this via the CLI, the CloudFormation template at batch/setup/batch_env.template.yaml (recommended) or the python script at batch/setup/create_batch_env.py (if following the blog posts) will do this for you. Please see the README.md in batch/setup for additional information on how to run these.

Step 4. Define and deploy AWS Step Functions to control the data processing steps

First, you will need to create Lambda functions (python2.7) for each tool to submit the jobs, as well as get the status of a jobs. They can be found at:

  • batchSubmitJob: batch/runner/batch_submit_job.py
  • batchSubmitIsaacJob: tools/isaac/lambda/lambda_function.py
  • batchSubmitStrelkaJob: tools/strelka/lambda/lambda_function.py
  • batchSubmitSamtoolsStatsJob: tools/samtools_stats/lambda/lambda_function.py
  • batchSubmitSnpeffJob: tools/snpeff/lambda/lambda_function.py
  • batchGetJobStatus: batch/runner/batch_get_job_status.py

Next, capture the ARNs from the newly created Lambda functions and use that information in your to deploy the CloudFormation stack for the workflow.

aws cloudformation create-stack --stack-name genomics-workflow \
--template-body file://workflow/deploy_state_machine.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--parameters <list of parameters with the above ARNs>

Record the arn of the state machine that was created.

Step 5. Initiate a workflow

Modify the file workflow/test.input.json to direct to the appropriate paths in your own account.

Next, execute the following CLI command:

aws stepfunctions start-execution --state-machine-arn <your-arn> --input file://workflow/test.input.json

You can then navigate to the AWS Step Functions console to monitor your workflow's progress.

aws-batch-genomics's People

Contributors

ajfriedman18 avatar delagoya avatar dougalb avatar gf-atebbe avatar stephen-terrell avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.