
cdk-demo's Introduction

Step Functions-managed Batch jobs connected through a common EFS volume

This project is an example of using AWS Step Functions to manage and track a series of AWS Batch jobs in N_TO_N mode.

Tech:

  • AWS Batch
  • AWS Step Functions
  • AWS Lambda
  • Amazon S3
  • Amazon ECS (Managed by AWS Batch)
  • Amazon EFS
  • Amazon EventBridge

Getting started

Prerequisites:

  • Node.js and Yarn
  • AWS CDK CLI
  • Docker
  • AWS credentials for your target account and region

Standing up:

First:

  • Ensure the Docker engine is running
  • Authenticate your shell session with your desired AWS account and region.

Then run:

yarn
cdk deploy

The initial deployment may take around 5-10 minutes; subsequent updates will be faster.

Tearing down

cdk destroy

Several parts of this sample will not be destroyed by CDK and must be deleted manually afterwards:

  • EFS Volume
  • S3 Bucket
  • CloudWatch log groups

Validating deployment:

From the AWS console:

  • Find the step function launched ('BatchStateMachine{id}').

  • Click 'Start execution' and leave the name and input as the defaults. The step function will begin and will likely look like this while it waits for the Batch job process to complete:

    Batch running

  • While it's waiting, click into any of the Batch Job steps and follow the resource link. This shows the nodes currently running and whether there have been any successes or failures.

  • Eventually the step function will complete; check the S3 bucket to view its result output.

Solution Architecture (High level)

Solution Design

What's not noted here (but exists in the deployed CDK code)

  • AWS IAM (Policies, roles, etc. All documented in CDK code)
  • Amazon VPC
  • Amazon CloudWatch

Step function flow:

  1. A cron (EventBridge) trigger begins the step function
  2. The first Lambda runs and has several responsibilities (a sketch follows this list):
    • Generate a unique Id for the current step function workflow, to be passed into subsequent steps
    • Generate a variable range of Ids, then divide them up based on the maximum number of Ids allowed per Batch job array node
    • Provision directories in EFS:
      • A directory for the current step function Id
      • A directory for each calculated node (number of Ids / Ids allowed per node)
      • A /prep directory under each node index to hold that node's Ids
    • Return the number of array nodes required by Batch
  3. All three Batch jobs are started with the array size determined by the previous Lambda's return value. They are submitted one after another in rapid succession, without waiting for each job to finish (a worker sketch follows this list):
    1. Batch Job 1:
      • Loads the Ids for the current index
      • Generates random 'data' for the Ids
      • Provisions a /data directory under the current index
      • Places the new 'data' into the /data directory
    2. Batch Job 2 (N_TO_N dependency on Batch Job 1):
      • Loads the Ids for the current index
      • Loads the data from the previous step
      • Performs a simple function based on the data
      • Provisions a /results directory under the current index
      • Places the function results into the /results directory
    3. Batch Job 3 (N_TO_N dependency on Batch Job 2):
      • Moves the data from the /results directory to the egress S3 bucket
  4. While these are running, a 'watching' Lambda is kicked off (a sketch follows this list):
    • Gets all 3 Batch job Ids
    • Checks the status of all 3 Batch jobs using the AWS SDK
    • If all 3 are no longer running, finishes the step function.
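
A minimal sketch of the chunking logic in the first Lambda, assuming a hypothetical EFS mount path, per-node Id limit, and file layout (none of these names come from the repository):

// Hypothetical sketch of the first Lambda: generate Ids, chunk them per node, provision EFS directories.
import { randomUUID } from "crypto";
import { mkdir, writeFile } from "fs/promises";
import * as path from "path";

const EFS_ROOT = "/mnt/efs";   // assumed EFS mount path inside the Lambda
const IDS_PER_NODE = 40;       // assumed maximum Ids per Batch array node

export const handler = async () => {
  const workflowId = randomUUID();                       // unique Id for this execution
  const ids = Array.from({ length: 400 }, (_, i) => i);  // variable range of Ids (illustrative)
  const nodeCount = Math.ceil(ids.length / IDS_PER_NODE);

  for (let node = 0; node < nodeCount; node++) {
    const prepDir = path.join(EFS_ROOT, workflowId, String(node), "prep");
    await mkdir(prepDir, { recursive: true });           // provision the per-node /prep directory
    const chunk = ids.slice(node * IDS_PER_NODE, (node + 1) * IDS_PER_NODE);
    await writeFile(path.join(prepDir, "ids.json"), JSON.stringify(chunk));
  }

  // The array size is consumed by the subsequent Batch job submissions.
  return { workflowId, arraySize: nodeCount };
};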
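
Inside each Batch array job, every child node can discover its own index via the AWS_BATCH_JOB_ARRAY_INDEX environment variable set by AWS Batch. A minimal sketch of how Batch Job 1 might locate its work (the WORKFLOW_ID variable, paths, and file names are hypothetical):

// Hypothetical sketch of a Batch array worker (Batch Job 1).
import { mkdir, readFile, writeFile } from "fs/promises";
import * as path from "path";

const EFS_ROOT = "/mnt/efs";

async function main() {
  const workflowId = process.env.WORKFLOW_ID!;                 // assumed to be passed from the step function
  const index = process.env.AWS_BATCH_JOB_ARRAY_INDEX ?? "0";  // this node's index, set by AWS Batch
  const nodeDir = path.join(EFS_ROOT, workflowId, index);

  // Load the Ids prepared for this index.
  const ids: number[] = JSON.parse(await readFile(path.join(nodeDir, "prep", "ids.json"), "utf8"));

  // Provision /data under the current index and place the generated 'data' there.
  const dataDir = path.join(nodeDir, "data");
  await mkdir(dataDir, { recursive: true });
  const data = ids.map((id) => ({ id, value: Math.random() }));
  await writeFile(path.join(dataDir, "data.json"), JSON.stringify(data));
}

main().catch((err) => { console.error(err); process.exit(1); });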
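
A minimal sketch of the 'watching' Lambda, assuming the three Batch job Ids are passed in on the event (the event shape is hypothetical; the DescribeJobs call is the standard AWS SDK v3 Batch API):

// Hypothetical sketch of the watcher Lambda polling the three Batch jobs.
import { BatchClient, DescribeJobsCommand } from "@aws-sdk/client-batch";

const batch = new BatchClient({});

export const handler = async (event: { jobIds: string[] }) => {
  const { jobs = [] } = await batch.send(new DescribeJobsCommand({ jobs: event.jobIds }));
  // A parent array job is finished once it reaches SUCCEEDED or FAILED.
  const allFinished = jobs.every((j) => j.status === "SUCCEEDED" || j.status === "FAILED");
  // The step function can poll on this flag and complete once all three jobs are done.
  return { allFinished };
};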

Visualisation:

Step function visualisation

Solution Architecture (Low level)

Another layer of the batch jobs exposed:

Solution Design

Batch Job detail

These Batch jobs are configured to run with N_TO_N dependencies, meaning they all run concurrently: once a node of a given index completes in a prior step, the next job can run the node with the same index. For example, if node 1 finishes in step 1 while node 0 is still running, node 1 in step 2 can still run.
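
A minimal sketch of what an N_TO_N dependency looks like when submitting array jobs with the AWS SDK. The job names, queue, and definition below are hypothetical; in this sample the jobs are submitted by the step function, but the dependency shape is the same:

// Hypothetical sketch: submit two array jobs where job 2 has an N_TO_N dependency on job 1.
import { BatchClient, SubmitJobCommand } from "@aws-sdk/client-batch";

const batch = new BatchClient({});

async function submitPipeline(arraySize: number) {
  const job1 = await batch.send(new SubmitJobCommand({
    jobName: "procurement",
    jobQueue: "pipeline-queue",
    jobDefinition: "procurement-def",
    arrayProperties: { size: arraySize },
  }));

  // Each index of job 2 starts as soon as the same index of job 1 succeeds.
  await batch.send(new SubmitJobCommand({
    jobName: "processing",
    jobQueue: "pipeline-queue",
    jobDefinition: "processing-def",
    arrayProperties: { size: arraySize },
    dependsOn: [{ jobId: job1.jobId, type: "N_TO_N" }],
  }));
}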

Batch Fade Process

Each of these Batch jobs runs in 'array mode' and can scale up to an array size of 10,000 per job. The jobs are placed into queues; while jobs can share queues (and, in turn, compute environments), doing so has the side effect of subsequent steps queuing behind the first Batch job, which prevents concurrent behaviour.

Each Batch job queue sits on top of a defined compute environment. Once a queue is populated, the managed ECS portion of Batch spins up a cluster capable of processing it. It will either spin up the minimum number of instances needed to process the queue, so small queues won't request excess resources, or spin up enough instances to sit within the predefined limits (max vCPUs, min vCPUs, etc.).
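
A minimal CDK sketch of a dedicated queue per Batch job, each on its own managed compute environment, assuming the stable aws-cdk-lib/aws-batch module; the construct Ids and vCPU limits are illustrative, not this sample's actual values:

// Hypothetical helper: one managed compute environment plus a queue per pipeline step.
import * as batch from "aws-cdk-lib/aws-batch";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import { Construct } from "constructs";

export function addJobQueue(scope: Construct, id: string, vpc: ec2.IVpc): batch.JobQueue {
  // Managed compute environment: Batch scales ECS instances within these vCPU limits.
  const computeEnv = new batch.ManagedEc2EcsComputeEnvironment(scope, `${id}ComputeEnv`, {
    vpc,
    minvCpus: 0,    // scale to zero when the queue is empty
    maxvCpus: 256,  // upper bound for large arrays
  });

  // Giving each step its own queue avoids later steps queuing behind the first job.
  return new batch.JobQueue(scope, `${id}Queue`, {
    computeEnvironments: [{ computeEnvironment: computeEnv, order: 1 }],
  });
}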

CDK-NAG

For security best practices, cdk-nag is used during the CDK process. This can be configured and/or disabled by removing the integration within bin/cdk.ts. More information is available in the cdk-nag documentation.
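
A minimal sketch of how a cdk-nag integration is typically wired up in bin/cdk.ts; the stack import and name below are hypothetical and the exact code in this repository may differ:

// Hypothetical bin/cdk.ts wiring; removing the Aspects line disables the checks.
import { App, Aspects } from "aws-cdk-lib";
import { AwsSolutionsChecks } from "cdk-nag";
import { CdkDemoStack } from "../lib/cdk-stack"; // hypothetical stack import

const app = new App();
new CdkDemoStack(app, "CdkDemoStack");
Aspects.of(app).add(new AwsSolutionsChecks({ verbose: true }));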

FAQ

Why 3 batch tasks

To demonstrate the N_TO_N nature of Batch, and how the three typical steps of a data pipeline (procurement, processing, and egress) sit together, three steps are most appropriate.

ECS managed clusters

You do not need to manage your own ECS clusters with this solution.

How high can this scale?

The solution has been tested with 400,000 records across a Batch job array size of 10,000. According to the AWS Batch service limits page, that is the maximum array size allowed. That page outlines more of the scaling limits and will stay more up to date than this README.


Disaster Recovery Stack

Overview

This Stack contains the AWS Cloud Development Kit (CDK) code for deploying a disaster recovery stack in AWS. The stack is designed to ensure data durability and availability for a product management system through the use of AWS services such as DynamoDB, Lambda, API Gateway, AWS Backup, and EventBridge.

The recovery methods used in this stack:

  • Point in Time Recovery (PITR)
  • On Demand Recovery
  • Cross-region Recovery
  • Scheduled Backups
  • Pilot Light
  • Warm Standby

Features

  • DynamoDB Table: A table named 'Product' with pay-per-request billing, point-in-time recovery, and cross-region replication in 'us-east-1'.
  • Lambda Function: A Docker-based Lambda function ('ProductLambda') for performing CRUD operations on the DynamoDB table and creating backups.
  • AWS Backup Plan: A plan ('DynamoDB-Backup-Plan') for creating daily backups of the 'Product' table at 9:10 AM UTC, with backups retained for 30 days.
  • EventBridge Rule: A rule ('BackupScheduleRule') for triggering the Lambda function at 10:00 AM UTC for post-backup processes.
  • API Gateway REST API: An API ('Product-Lambda-API') with endpoints to manage products and initiate backups.
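
A minimal CDK sketch covering the table and backup-plan features above (the key name, construct Ids, and exact props in this stack may differ; the schedule and retention mirror the values listed):

// Hypothetical sketch: DynamoDB table with PITR and a us-east-1 replica, plus a daily AWS Backup plan.
import { Duration } from "aws-cdk-lib";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import * as backup from "aws-cdk-lib/aws-backup";
import * as events from "aws-cdk-lib/aws-events";
import { Construct } from "constructs";

export function addProductTableWithBackups(scope: Construct): dynamodb.Table {
  const table = new dynamodb.Table(scope, "Product", {
    partitionKey: { name: "product_id", type: dynamodb.AttributeType.STRING }, // assumed key name
    billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    pointInTimeRecovery: true,
    replicationRegions: ["us-east-1"], // cross-region replica
  });

  const plan = new backup.BackupPlan(scope, "DynamoDB-Backup-Plan");
  plan.addRule(new backup.BackupPlanRule({
    scheduleExpression: events.Schedule.cron({ minute: "10", hour: "9" }), // daily at 09:10 UTC
    deleteAfter: Duration.days(30),                                        // retain backups for 30 days
  }));
  plan.addSelection("ProductTableSelection", {
    resources: [backup.BackupResource.fromDynamoDbTable(table)],
  });

  return table;
}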

Usage

First: ensure Docker is running

  1. Deployment: Use the AWS CDK CLI to deploy this stack to your AWS account.
cdk deploy DisasterRecoveryStack
  2. Interacting with the API: Use the output API Gateway URL to interact with the product database. Supported operations include adding, getting, updating, and deleting products, as well as creating backups.

    Adding data: you can use either Postman or Thunder Client to POST to the following URL:

https://2l0qferg69.execute-api.us-east-2.amazonaws.com/prod/addProduct

and paste the following test data into the raw JSON body:

{
    "product_category": "Accessories",
    "product_title": "Airpods pro 2"
}
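
Alternatively, a minimal sketch of calling the endpoint from Node 18+; replace the URL with the API Gateway URL output by your own deployment:

// Hypothetical client call to the addProduct endpoint.
const API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"; // from the stack output

async function addProduct() {
  const res = await fetch(`${API_URL}/addProduct`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ product_category: "Accessories", product_title: "Airpods pro 2" }),
  });
  console.log(res.status, await res.json());
}

addProduct().catch(console.error);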

Cleanup

To avoid incurring future charges, remember to delete the resources created by this stack:

cdk destroy DisasterRecoveryStack
