
swipe's Introduction

SWIPE: SFN-WDL infrastructure for pipeline execution

Swipe is a Terraform module for creating AWS infrastructure to run WDL workflows. Swipe uses Step Functions, Batch, S3, and Lambda to run WDL workflows in a scalable, performant, reliable, and observable way.

With swipe you can run a WDL workflow on S3 inputs with a single call to the AWS Step Functions API; your results land in S3, and the paths to those results appear in the Step Function output.

Why use swipe?

  • Minimal infrastructure setup: Swipe is an infrastructure module first, so you won't need to do much infrastructure configuration, as you might with other tools that are primarily software. Once you set the minimal swipe configuration variables in the Terraform module and apply, you can start using swipe at high scale right away.
  • Highly optimized for working with large files: Many bioinformatics tools are local tools designed to take files as input and produce files as output. These files can get very large, yet many tools either don't support distributed approaches or would be made much slower by one. Swipe is highly optimized for this use case. By default swipe:
    • Configures AWS Batch to work with NVMe drives for very fast file I/O
    • Has a built-in multi-threaded S3 uploader and downloader that can saturate a 10 GB/s network connection, so your input and output files transfer quickly
    • Has a built-in input cache, so inputs common to all of your pipeline runs can be safely re-used across jobs. This is particularly useful if your pipeline uses large reference databases that don't change from run to run, which is typical of many bioinformatics workloads.
  • Cost savings while preserving pipeline throughput and latency: Swipe tries each workflow first on a Spot instance for cost savings, then retries the workflow on demand after the first failure. This results in high cost savings with minimal sacrifice of throughput and latency. Retrying on Spot alone might keep throughput high, but by retrying on demand swipe also keeps latency (the time for a single pipeline to complete) relatively low. This is useful if you have users waiting on results.
  • Built-in monitoring: Swipe automatically monitors key workflow metrics, and you can analyze failures in the AWS console.
  • Easy integration: Using AWS EventBridge you can easily route SNS notifications to other services to report workflow status; see the sketch below.
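
For example, you could forward Step Functions execution status changes to an SNS topic with an EventBridge rule. This is a minimal sketch, not part of swipe itself; the rule name and topic ARN are hypothetical placeholders, and the topic's resource policy must allow events.amazonaws.com to publish.

import json
import boto3

events = boto3.client("events")

# Hypothetical names for illustration; substitute your own.
rule_name = "my-test-app-swipe-status"
topic_arn = "arn:aws:sns:us-west-2:123456789012:my-test-app-notifications"

# Match Step Functions execution status changes and forward them to the SNS topic.
events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {"status": ["SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"]},
    }),
)
events.put_targets(Rule=rule_name, Targets=[{"Id": "sns", "Arn": topic_arn}])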

Why not use swipe?

  • You are not using AWS: Swipe is highly opinionated about the infrastructure it runs on. If you are not using AWS you can't use swipe.
  • You are running distributed big data jobs: At the time of writing, swipe is optimized for workflows with local files. If you intend to run distributed big data jobs, such as MapReduce jobs, swipe is probably not the right choice.

Usage

Basic Usage

Create swipe infrastructure

To use swipe you first need to create the infrastructure with Terraform. You will need an S3 bucket to store your inputs and outputs, called a workspace, and an S3 bucket to store your WDL files. They can be the same bucket, but for clarity I will use two different buckets, and I recommend you do the same.

resource "aws_s3_bucket" "workspace" {
  bucket = "my-test-app-swipe-workspace"
}

resource "aws_s3_bucket" "wdls" {
  bucket = "my-test-app-swipe-wdls"
}

module "swipe" {
    source = "github.com/chanzuckerberg/swipe?ref=v0.7.0-beta"

    app_name               = "my-test-app"
    workspace_s3_prefixes  = [aws_s3_bucket.workspace.bucket]
    wdl_workflow_s3_prefix = aws_s3_bucket.workspace.bucket
}

This will produce an output called sfn_arns, a map of Step Function names to their ARNs. By default swipe creates a single Step Function named default.
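
If your application runs outside Terraform, one way to look up the default ARN is to read it from terraform output. This is a minimal sketch; it assumes you re-export the module output at the root of your configuration (for example output "sfn_arns" { value = module.swipe.sfn_arns }) and run the script from the Terraform working directory.

import json
import subprocess

# Read the sfn_arns output as JSON; requires the output to be declared at the root module.
raw = subprocess.check_output(["terraform", "output", "-json", "sfn_arns"])
sfn_arns = json.loads(raw)

# swipe creates a single Step Function named "default" by default.
default_sfn_arn = sfn_arns["default"]
print(default_sfn_arn)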

Upload your WDL workflow to S3

Now we need to define a workflow to run. Here is a basic WDL workflow that takes a file as input:

version 1.0
workflow hello_swipe {
  input {
    File hello
    String docker_image_id
  }

  call add_world {
    input:
      hello = hello,
      docker_image_id = docker_image_id
  }

  output {
    File out = add_world.out
  }
}

task add_world {
  input {
    File hello
    String docker_image_id
  }

  command <<<
    cat ~{hello} > out.txt
    echo world >> out.txt
  >>>

  output {
    File out = "out.txt"
  }

  runtime {
      docker: docker_image_id
  }
}

Let's save this one as hello.wdl and upload it:

aws s3 cp hello.wdl s3://my-test-app-swipe-wdls/hello.wdl

Let's also make a test input file for it and upload that:

echo hello > input.txt
aws s3 cp input.txt s3://my-test-app-swipe-workspace/input.txt

Run your WDL

You can run your WDL with inputs and an output path using the AWS API. Here I will use Python and boto3 for readability:

import boto3
import json

client = boto3.client('stepfunctions')

response = client.start_execution(
    stateMachineArn='DEFAULT_STEP_FUNCTION_ARN',
    name='my-swipe-run',
    input=json.dumps({
      "RUN_WDL_URI": "s3://my-test-app-swipe-wdls/hello.wdl",
      "OutputPrefix": "s3://my-test-app-swipe-workspace/outputs/",
      "Input": {
          "Run": {
              "hello": "s3://my-test-app-swipe-workspace/input.txt",
          }
      }
    }),
)

Once your step function completes, your output should be at s3://my-test-app-swipe-workspace/outputs/hello/out.txt. Note that the name out.txt comes from the WDL workflow.
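
If you want to block until the run finishes, you can poll the execution with boto3 and then fetch the result. This is a minimal sketch that reuses the response returned by start_execution above and the output path described in this section.

import time
import boto3

client = boto3.client("stepfunctions")
s3 = boto3.client("s3")

execution_arn = response["executionArn"]  # returned by start_execution above

# Poll until the execution leaves the RUNNING state.
while True:
    status = client.describe_execution(executionArn=execution_arn)["status"]
    if status != "RUNNING":
        break
    time.sleep(30)

if status == "SUCCEEDED":
    # Download the file the workflow wrote under the OutputPrefix.
    s3.download_file("my-test-app-swipe-workspace", "outputs/hello/out.txt", "out.txt")
else:
    print("Execution finished with status:", status)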

Development

Requirements

  • Terraform
  • Python 3.9
  • Docker + Docker Compose

Running tests

Set up Python dependencies:

virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

Bring up mock AWS infrastructure:

make up

Run the tests:

make test

swipe's People

Contributors

dtsai-czi, ebezzi, jakeyheath, jgadling, kislyuk, morsecodist, rzlim08


swipe's Issues

Add status notification queue

Right now your application needs to poll Step Function statuses to be aware of job completion. We are adding SQS queues with events containing the completion status of each Step Function execution that you can subscribe to in your user-facing application.
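
Once such a queue exists, a consumer could look roughly like the sketch below. The queue URL is a hypothetical placeholder and the message payload shape is not finalized.

import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL; the real queue name is not decided yet.
queue_url = "https://sqs.us-west-2.amazonaws.com/123456789012/my-test-app-swipe-status"

# Long-poll for completion events and acknowledge them after processing.
messages = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20, MaxNumberOfMessages=10)
for message in messages.get("Messages", []):
    body = json.loads(message["Body"])
    print("execution finished:", body)  # exact payload shape is not finalized
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])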

Support for IMDSv2

Description

It would be great if Swipe supported IMDSv2. Right now, prodsec and infrasec are trying to enable this feature to help defend against Server-Side Request Forgery (SSRF) attacks, but it requires the caller to first make a PUT request to the metadata endpoint to grab a short-lived token, and then make a GET request to the metadata endpoint with that token in the header.
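
For reference, the IMDSv2 token flow looks like this; a minimal sketch using only the standard library, with the endpoint and headers as documented by AWS.

import urllib.request

# Step 1: PUT to the token endpoint to obtain a short-lived session token.
token_request = urllib.request.Request(
    "http://169.254.169.254/latest/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
)
token = urllib.request.urlopen(token_request).read().decode()

# Step 2: GET the metadata path with the token in the request header.
# Note: spot/instance-action returns 404 (an HTTPError here) unless an interruption is scheduled.
metadata_request = urllib.request.Request(
    "http://169.254.169.254/latest/meta-data/spot/instance-action",
    headers={"X-aws-ec2-metadata-token": token},
)
print(urllib.request.urlopen(metadata_request).read().decode())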

References

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html
https://czi.atlassian.net/wiki/spaces/INFRASEC/pages/2090696711/AWS+Metadata+V2+Migration
https://github.com/chanzuckerberg/SSRFs-Up
https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/
https://github.com/chanzuckerberg/SSRFs-Up/wiki/FAQ#i-read-the-other-docs-but-i-still-dont-really-get-why-this-matters-can-you-provide-an-example-of-how-ssrf-affects-my-application

Proposed Change

% git diff origin/main
diff --git a/terraform/modules/swipe-sfn-batch-job/batch_job_container_properties.yml b/terraform/modules/swipe-sfn-batch-job/batch_job_container_properties.yml
index 095cd9e..5f174a8 100644
--- a/terraform/modules/swipe-sfn-batch-job/batch_job_container_properties.yml
+++ b/terraform/modules/swipe-sfn-batch-job/batch_job_container_properties.yml
@@ -11,7 +11,7 @@ image: "${batch_docker_image}"
 command:
   - "/bin/bash"
   - "-c"
-  - "for i in \"$@\"; do eval \"$i\"; done; cd /"
+  - 'for i in "$@"; do eval "$i"; done; cd /'
   - "swipe"
   - "set -a"
   - "if [ -f /etc/environment ]; then source /etc/environment; fi"
@@ -19,7 +19,7 @@ command:
   - "set +a"
   - >-
     while true; do
-    if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action; then
+    if TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` && curl -H "X-aws-ec2-metadata-token: $TOKEN" -sf http://169.254.169.254/latest/meta-data/spot/instance-action; then
     echo WARNING: THIS SPOT INSTANCE HAS BEEN SCHEDULED FOR TERMINATION >> /dev/stderr;
     fi;
     sleep 10;

Only retry with on-demand instances in cases where Spot has terminated an instance

Currently any failure (unless it is a known failure) results in a retry on an on-demand instance if the workflow fails on Spot. This is highly inefficient. Ideally, we would only execute this retry in cases where the instance was terminated. This is particularly bad for timeouts: a timeout kills the batch job before it can produce a known error via logging, so the very long pipeline is retried and usually times out again on the more costly on-demand instance. We should also identify any failures that are not a result of termination and not retry in those cases.

This issue is more important if pipelines fail often.

System is not operable without an SQS queue

In the default configuration, the SQS queue is not created or managed by this module, yet it is required. If the queue is not present, the following error prevents any execution from succeeding:

{
    "timestamp": "2022-04-20 12:56:22.849000-07:00",
    "type": "ExecutionFailed",
    "id": 22,
    "previousEventId": 21,
    "executionFailedEventDetails": {
        "error": "QueueDoesNotExist",
        "cause": "{\"errorMessage\":\"An error occurred (AWS.SimpleQueueService.NonExistentQueue) when calling the SendMessage operation: The specified queue does not exist for this wsdl version.\",\"errorType\":\"QueueDoesNotExist\",\"stackTrace\":[\"  File \\\"/var/task/app.py\\\", line 79, in handle_failure\\n    raise failure_type(cause)\\n\"]}"
    }
}

We should edit the code to either make the queue optional, or make it a required Terraform input variable.
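
Until then, one way to catch this before launching runs is to verify the queue exists up front. This is a minimal sketch; the queue name is a hypothetical placeholder for whatever name your deployment expects.

import boto3

sqs = boto3.client("sqs")

# Hypothetical queue name; substitute the name your deployment expects.
queue_name = "my-test-app-swipe-notifications"

try:
    queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    print("queue exists:", queue_url)
except sqs.exceptions.QueueDoesNotExist:
    print("queue is missing; executions will fail with QueueDoesNotExist")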

Fix localstack tests

Currently our CI/CD localstack tests are failing. We are not receiving state change notifications in the SQS queue with localstack. It is not currently clear why. For now I have disabled these tests.

Error while running `terraform apply` in SWIPE root

Thanks for maintaining and extending SWIPE.

On a terraform apply in a new clone, I get:

╷
│ Error: Error putting IAM role policy handle_failure: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: 5b6a1db0-809a-47cf-8b69-ab50e1c4ed68
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["handle_failure"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy handle_success: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: 63b2a74d-7e94-4e4a-9fcc-2851362f5ffe
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["handle_success"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy report_spot_interruption: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: 544dda7b-0b0c-4d37-a8ea-46f4046b3e3a
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["report_spot_interruption"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy process_stage_output: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: 96d4072b-f1f3-4932-8726-6392ba5357a2
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["process_stage_output"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy process_sfn_event: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: 1102847b-4703-47b0-acad-53f01a76e107
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["process_sfn_event"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy process_batch_event: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: a322d329-6694-4f11-aa07-f5990f31f22c
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["process_batch_event"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy report_metrics: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: c9d33df1-5d07-459b-a5f5-d4a1c1f1f46f
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["report_metrics"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵
╷
│ Error: Error putting IAM role policy preprocess_input: MalformedPolicyDocument: Policy statement must contain resources.
│ 	status code: 400, request id: 57298d4b-37b7-41d3-a5c3-7255524044e7
│
│   with module.sfn.module.sfn_io_helper.aws_iam_role_policy.iam_role_policy["preprocess_input"],
│   on terraform/modules/sfn-io-helper-lambdas/main.tf line 43, in resource "aws_iam_role_policy" "iam_role_policy":
│   43: resource "aws_iam_role_policy" "iam_role_policy" {
│
╵

Haven't looked closely into what causes this, but just wanted to give a heads up.
