
AWS Batch Operational Dashboard

AWS Batch Operational Dashboard provides a code sample to deploy a solution that shows Amazon EC2 resource and container resource usage by AWS Batch jobs.

This solution relies on a serverless architecture to create a Grafana dashboard that visualizes compute and memory resource usage by AWS Batch jobs. It provides job-level insight into how Amazon EC2 resources are used.

This application is designed to scale by collecting data from events and API calls through Amazon EventBridge; it does not make API calls to describe your resources. Data collected from events and API calls is partially aggregated in Amazon DynamoDB to consolidate information and generate Amazon CloudWatch metrics with the Embedded Metric Format. The application also deploys several dashboards that display job states, the Amazon EC2 instances belonging to your Amazon ECS clusters (AWS Batch compute environments), and Auto Scaling groups (ASGs) across Availability Zones.
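As an illustration of the Embedded Metric Format, the sketch below prints a single EMF record from a shell function. An EMF record is a JSON log event whose _aws object tells CloudWatch which top-level fields to extract as metrics and dimensions; once the line reaches CloudWatch Logs (for example, printed to stdout by a Lambda function), CloudWatch creates the metric without a separate PutMetricData call. The namespace BatchOpDashboard, the JobQueue dimension, and the JobsSucceeded metric are hypothetical names, not necessarily the ones this solution emits.

```shell
#!/usr/bin/env bash
# Minimal sketch of a CloudWatch Embedded Metric Format (EMF) record.
# Namespace, dimension and metric names are hypothetical placeholders.
emit_emf_record() {
  local namespace="$1" job_queue="$2" metric="$3" value="$4" ts_ms="$5"
  # "_aws" holds the metric metadata; the top-level keys hold the
  # dimension value (JobQueue) and the metric value itself.
  printf '{"_aws":{"Timestamp":%s,"CloudWatchMetrics":[{"Namespace":"%s","Dimensions":[["JobQueue"]],"Metrics":[{"Name":"%s","Unit":"Count"}]}]},"JobQueue":"%s","%s":%s}\n' \
    "$ts_ms" "$namespace" "$metric" "$job_queue" "$metric" "$value"
}

# Example: one succeeded job in queue "my-queue" at a fixed timestamp (ms).
emit_emf_record "BatchOpDashboard" "my-queue" "JobsSucceeded" 1 1700000000000
```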

Prerequisites

Install the AWS Serverless Application Model Command Line Interface (AWS SAM CLI), version 1.72.0 or later, by following the instructions. The commands in this guide also use the AWS CLI, jq, and Python 3.
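Before deploying, you can check the tools used in this guide. The helper below is a sketch, not part of the repository: it checks that sam, aws, jq, and python3 are on your PATH and compares the SAM CLI version against 1.72.0 with a version sort (sort -V, as provided by GNU coreutils). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check the command-line tools used in this guide.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

check_prereqs() {
  local status=0 tool version
  for tool in sam aws jq python3; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      status=1
    fi
  done
  if command -v sam >/dev/null 2>&1; then
    # `sam --version` prints a line such as "SAM CLI, version 1.72.0".
    version="$(sam --version | awk '{print $NF}')"
    if ! version_ge "$version" "1.72.0"; then
      echo "SAM CLI $version is older than 1.72.0" >&2
      status=1
    fi
  fi
  return "$status"
}
```

After sourcing the functions, run check_prereqs && echo ready.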

Dashboard

The dashboard lets you visualize AWS Batch job status, start and stop times, job queue, instance type, instance ID, Availability Zone, and the Amazon CloudWatch logs associated with each job. You can also navigate through time and focus on a specific AWS Batch job to observe the Amazon EC2 CPU and memory usage, the container CPU and memory requested and used, and the EBS operations related to that job.

Grafana AWS Batch Dashboard

Architecture

The architecture tracks AWS Batch job events through Amazon EventBridge. The events are routed to an AWS Step Functions state machine that stores the job state, Availability Zone, instance type, instance ID, instance pricing model, and log stream in an Amazon DynamoDB table. You can visualize the results in Amazon Managed Grafana through Amazon Athena.
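The events in question are AWS Batch "Batch Job State Change" events, which EventBridge delivers with the source aws.batch. The sketch below builds an event pattern of that kind, optionally restricted to some job statuses. The SAM template defines its own rule, so this helper and the rule name batch-job-state-demo are hypothetical, meant only for experimenting in a test account.

```shell
#!/usr/bin/env bash
# Illustrative sketch: build an EventBridge event pattern for AWS Batch job
# state changes. The deployed SAM stack defines its own rule; this helper and
# the rule name in the usage comment are hypothetical.
batch_job_event_pattern() {
  local base='"source":["aws.batch"],"detail-type":["Batch Job State Change"]'
  if [ "$#" -eq 0 ]; then
    # No arguments: match every job state change.
    printf '{%s}\n' "$base"
  else
    # Arguments: restrict to the given job statuses, e.g. SUCCEEDED FAILED.
    local statuses
    statuses="$(printf '"%s",' "$@")"
    printf '{%s,"detail":{"status":[%s]}}\n' "$base" "${statuses%,}"
  fi
}

# Usage in a test account (hypothetical rule name):
# aws events put-rule --name batch-job-state-demo \
#     --event-pattern "$(batch_job_event_pattern SUCCEEDED FAILED)"
```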

Operational Dashboard Architecture

Deployment

The deployment of the dashboard is composed of four steps.

AWS Organization and AWS IAM Identity Center

Amazon Managed Grafana relies on single sign-on using your organization’s identity provider to authenticate users. The following steps guide you through setting up AWS Organizations and AWS IAM Identity Center.

NOTE: If you already have AWS Organizations and AWS IAM Identity Center set up, you can skip these steps.

1. Create an organization in AWS Organizations

  1. Open the AWS Organizations console.
  2. Choose Create an Organization. By default, the organization is created with all features enabled.
  3. The organization is created and the AWS accounts page appears. The only account present is your management account, and it's currently under the root organizational unit (OU).

2. Enable AWS IAM Identity Center

  1. Open AWS IAM Identity Center.
  2. Choose Enable.
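If you prefer the CLI for creating the organization, a minimal sketch could look like the following. It assumes you run it with credentials for the account that should become the management account; IAM Identity Center is still enabled from the console as described above. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: CLI alternative to the console steps for AWS Organizations.
# Function names are hypothetical.

# Succeeds when the current account already belongs to an organization.
has_organization() {
  aws organizations describe-organization >/dev/null 2>&1
}

ensure_organization() {
  if has_organization; then
    echo "Account already belongs to an organization; skipping."
  else
    # Same as the console default: all features enabled.
    aws organizations create-organization --feature-set ALL
  fi
}
```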

Deploy the architecture

To deploy the serverless application, run the following in your shell:

BATCH_DASHBOARD_NAME="batch-op-dashboard"
sam build
sam deploy --stack-name ${BATCH_DASHBOARD_NAME} \
    --guided \
    --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND

Follow the prompts and fill in the parameters.

Once the stack is deployed, retrieve the Amazon Managed Grafana workspace ID.

GRAFANA_ID=`sam list stack-outputs --stack-name ${BATCH_DASHBOARD_NAME} \
    --output json | \
    jq -r '.[] | select(.OutputKey=="GrafanaWorkspaceId") | .OutputValue'`
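If jq is not installed, the same value can be read with the AWS CLI's built-in --query option. The helper below is a sketch that assumes the stack name you passed to sam deploy; the function name stack_output is hypothetical. It also works for the other outputs used later in this guide, such as AthenaSpillBucket and AthenaDataSource.

```shell
#!/usr/bin/env bash
# Sketch: read a single CloudFormation stack output without jq.
# The function name is hypothetical.
stack_output() {
  local stack="$1" key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack" \
    --query "Stacks[0].Outputs[?OutputKey=='${key}'].OutputValue" \
    --output text
}

# Usage:
# GRAFANA_ID="$(stack_output "${BATCH_DASHBOARD_NAME}" GrafanaWorkspaceId)"
```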

Create groups and grant them permissions in Grafana

Amazon Managed Grafana integrates with AWS IAM Identity Center to provide identity federation. Through this federation, users and groups are granted access to Amazon Managed Grafana as Viewer, Editor, or Admin. The following steps guide you to create a viewer group and an admin group.

1. Retrieve the identity store ID.

  1. Open the AWS IAM Identity Center settings.
  2. Copy the Identity store ID from the identity store tab. You will use it in the next step.
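Alternatively, the identity store ID can be read from the CLI. This sketch assumes a single IAM Identity Center instance and that your CLI is configured for the Region where it is enabled; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read the identity store ID of the first IAM Identity Center
# instance. The function name is hypothetical.
get_identity_store_id() {
  aws sso-admin list-instances \
    --query 'Instances[0].IdentityStoreId' \
    --output text
}

# Usage:
# IDENTITY_STORE="$(get_identity_store_id)"
```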

2. Create groups and users in IAM Identity Center

Set the identity store ID to the value copied in the previous step.

IDENTITY_STORE="d-1234567890"

Create the Grafana admin and viewer groups.

ADMIN_GROUP=`aws identitystore create-group --identity-store-id ${IDENTITY_STORE} \
    --display-name 'grafana-batch-op-dashboard-admin' \
    --query GroupId \
    --output text`

VIEWER_GROUP=`aws identitystore create-group --identity-store-id ${IDENTITY_STORE} \
    --display-name 'grafana-batch-op-dashboard-viewer' \
    --query GroupId \
    --output text`

Create a user. Replace the user name, display name, given and family names, and email address below with the user's own details.

USER_ID=`aws identitystore create-user --identity-store-id ${IDENTITY_STORE} \
    --user-name 'johndoe' \
    --display-name 'John' \
    --name FamilyName=Doe,GivenName=John \
    --emails Value=johndoe@example.com,Type=work,Primary=true \
    --query UserId \
    --output text`

Add the user to a group, here the admin group.

aws identitystore create-group-membership --identity-store-id ${IDENTITY_STORE} \
    --group-id ${ADMIN_GROUP} \
    --member-id UserId=${USER_ID}
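If you rerun these steps, create-group typically fails because a group with the same display name already exists. In that case you can look up the existing group ID by display name instead. A sketch, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Sketch: look up an existing IAM Identity Center group ID by display name.
# The function name is hypothetical.
get_group_id() {
  local store="$1" name="$2"
  aws identitystore get-group-id \
    --identity-store-id "$store" \
    --alternate-identifier "{\"UniqueAttribute\":{\"AttributePath\":\"displayName\",\"AttributeValue\":\"${name}\"}}" \
    --query GroupId \
    --output text
}

# Usage:
# VIEWER_GROUP="$(get_group_id "${IDENTITY_STORE}" grafana-batch-op-dashboard-viewer)"
```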

Set up Amazon Managed Grafana

First, add the groups created previously to the Grafana workspace.

aws grafana update-permissions --workspace-id ${GRAFANA_ID} \
    --update-instruction-batch \
    "action=ADD,role=ADMIN,users=[{id=$ADMIN_GROUP,type=SSO_GROUP}]"

aws grafana update-permissions --workspace-id ${GRAFANA_ID} \
    --update-instruction-batch \
    "action=ADD,role=VIEWER,users=[{id=$VIEWER_GROUP,type=SSO_GROUP}]"

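The --update-instruction-batch parameter also accepts JSON, which avoids shorthand quoting mistakes such as a missing id= key. The helper below is a hypothetical sketch that builds one instruction; the role can be ADMIN, EDITOR, or VIEWER.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a JSON value for --update-instruction-batch that
# adds one IAM Identity Center group to the workspace with the given role.
grafana_permission_instruction() {
  local role="$1" group_id="$2"   # role: ADMIN, EDITOR or VIEWER
  printf '[{"action":"ADD","role":"%s","users":[{"id":"%s","type":"SSO_GROUP"}]}]\n' \
    "$role" "$group_id"
}

# Usage (assumes GRAFANA_ID and ADMIN_GROUP are set as in the steps above):
# aws grafana update-permissions --workspace-id "${GRAFANA_ID}" \
#     --update-instruction-batch "$(grafana_permission_instruction ADMIN "${ADMIN_GROUP}")"
```

You can then check the result with aws grafana list-permissions --workspace-id "${GRAFANA_ID}".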
Next, get the URL to access the dashboard.

aws grafana describe-workspace --workspace-id ${GRAFANA_ID} \
    --query workspace.endpoint \
    --output text

You should get a URL such as g-1234567890.grafana-workspace.us-east-1.amazonaws.com. Paste it into your web browser.

You will be prompted to log in with the user name created earlier. When a user is created, they receive an initial password by email. Use the password from that email to log in.

Create the dashboard in Amazon Managed Grafana

Once logged in as an administrator, start by setting up the data sources.

1. Add Amazon CloudWatch Data source

  1. Select the hamburger menu on the left pane.
  2. Expand Administration
  3. Choose Data sources.
  4. Choose Add new data source.
  5. Select CloudWatch.
  6. On the Default Region menu, choose your AWS Region.
  7. Choose Save & test.

2. Add Amazon Athena Data source

Before starting, retrieve the name of the S3 bucket created to store the AWS Batch job data accessed through Amazon Athena.

In a terminal:

sam list stack-outputs --stack-name ${BATCH_DASHBOARD_NAME} \
    --output json | \
    jq -r '.[] | select(.OutputKey=="AthenaSpillBucket") | .OutputValue'

Copy the output; you will use it to set up the Amazon Athena data source in Amazon Managed Grafana.

In the Amazon Managed Grafana dashboard:

  1. Select the hamburger menu on the left pane.
  2. Expand Administration
  3. Choose Data sources.
  4. Choose Add new data source.
  5. Choose Amazon Athena.
  6. On the Default Region menu, choose your AWS Region.
  7. On the Data source menu, choose aws-batch-jobs-data.
  8. On the Database menu, choose default.
  9. On the Workgroup menu, choose batch-wg.
  10. In the Output Location field, paste the bucket value, for example s3://DOC-EXAMPLE-BUCKET.
  11. Choose Save & test.

3. Create dashboard

To create a dashboard in Amazon Managed Grafana for AWS Batch, start from the template provided in this repository to generate a dashboard for your environment.

In a terminal, run the following commands:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

TABLE=`sam list stack-outputs --stack-name ${BATCH_DASHBOARD_NAME} \
    --output json | \
    jq -r '.[] | select(.OutputKey=="AthenaDataSource") | .OutputValue'`

python3 ./generate-grafana-dashboard.py --table ${TABLE}
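Before importing, you may want to confirm that the generated file is valid JSON. This sketch assumes the script writes batch-grafana-dashboard.json to the current directory (the file name used in the import steps) and relies on Python's json.tool module; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the generated dashboard before uploading it.
# The function name and default file location are assumptions.
validate_dashboard_json() {
  local file="${1:-batch-grafana-dashboard.json}"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # json.tool exits non-zero when the file is not valid JSON.
  python3 -m json.tool "$file" >/dev/null
}
```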

Once you have generated your dashboard in JSON format, import it into Amazon Managed Grafana:

  1. Select the hamburger menu on the left pane.
  2. Choose Dashboards.
  3. Choose New on the right side.
  4. Choose Import.
  5. Choose Upload dashboard JSON file, and select the batch-grafana-dashboard.json file.
  6. Choose Load.
  7. Select the Athena and CloudWatch data sources you created previously.
  8. Choose Import.

You will be redirected to the dashboard you imported. Once your first AWS Batch jobs are running, you will be able to see the data associated with them in the dashboard.

Clean up

To delete the SAM application deployment, run the following in a terminal:

sam delete --stack-name ${BATCH_DASHBOARD_NAME}
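The groups and user created earlier with aws identitystore are not part of the SAM stack, so deleting the stack leaves them in place. If you no longer need them, a sketch like the following removes them; it assumes the IDENTITY_STORE, USER_ID, ADMIN_GROUP, and VIEWER_GROUP variables from the earlier steps are still set, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: remove the IAM Identity Center user and groups created for Grafana.
# Arguments: identity store ID, user ID, then one or more group IDs.
cleanup_identity_store() {
  local store="$1" user_id="$2"
  shift 2
  # Delete the user first, then each group passed as a remaining argument.
  aws identitystore delete-user --identity-store-id "$store" --user-id "$user_id"
  local group_id
  for group_id in "$@"; do
    aws identitystore delete-group --identity-store-id "$store" --group-id "$group_id"
  done
}

# Usage:
# cleanup_identity_store "${IDENTITY_STORE}" "${USER_ID}" "${ADMIN_GROUP}" "${VIEWER_GROUP}"
```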

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

aws-batch-operational-dashboards's People

Contributors

amazon-auto, mhuguesaws

aws-batch-operational-dashboards's Issues

Add GPU metric support

Currently, the solution obtains CPU, memory, and EBS metrics through CloudWatch.
It would be great to have individual GPU metrics such as compute and memory utilization.

Add display jobs in RUNNING status

Currently, the dashboard displays only jobs with a stoppedAt time, so only FAILED or SUCCEEDED jobs are shown.
It would be nice to see RUNNING jobs as well.

Display Instance Id in EC2 metrics

EC2 metrics display the name of the metric. It would be more useful to display the instance ID.
Solution: specify the alias {{InstanceId}} in Grafana.

EBS Read IOPS reported incorrectly

There are a couple of places where the 'Ops' metric should be used.
For example, the template file batch-grafana-dashboard-template.json has the following section, where EBSReadOps_Average should be used:

{
  "title": "EBS Read IOPS",
  "transformations": [
    {
      "id": "calculateField",
      "options": {
        "alias": "EBS Read IOPS",
        "binary": {
          "left": "EBSReadBytes_Average",
          "operator": "/",
          "reducer": "sum",
          "right": "60"
        },
        "mode": "binary",
        "reduce": {
          "include": [
            "EBSReadBytes_Average"
          ],
          "reducer": "sum"
        },
        "replaceFields": true
      }
    }
  ],
  "type": "timeseries"
}

Fetch error: 404 Not Found Instantiating; when trying to add Athena data source

Hello,
When I try to add an Amazon Athena data source, I get the error "Fetch error: 404 Not Found Instantiating".
I am not prompted to add any information related to the Athena data source, as shown in your tutorial. The moment I click Athena on the Add data source page, it shows "Data Source added" and the error immediately appears.
I've tried Athena plugin versions 2.13.5 and 2.14.0. I am running Grafana 9.4.
Is Athena part of the Enterprise plugins, and is that why I'm getting this error?

Deployment failure for an account without an organization

During deployment with "sam deploy --stack-name ${BATCH_DASHBOARD_NAME} ...", the following error can occur if the account is not a member of an organization:

Resource handler returned message: "Your account is not a member of an organization. (Service: AWSSingleSignOn; Status Code: 400; Error Code: AccessDeniedException; Request ID: 7fdf6cad-1b9b-4d28-b959-26262ec7a900; Proxy: null) (Service: Grafana, Status Code: 403, Request ID: 66819d95-0456-4853-ae0b-ba5a8e02191e)" (RequestToken: 869a9138-6c5f-f50a-1c72-3bb9c43c4b00, HandlerErrorCode: AccessDenied)

Cost Dashboard Instructions

The instructions for the Cost Dashboard are slightly off, as there are no data sources in the JSON file that reference CloudWatch.


No Athena type available in Managed Grafana "Add Data Source" dialog

Hello,
After deploying the solution today, I tried to follow the guide, but at the step that requires adding an Athena data source I'm unable to do so:
Athena doesn't appear as an available data source type.
The previous data source (CloudWatch) was added successfully, but not this one.

Are there missing prerequisites for Athena to become available as a data source?

Fix swapped vcpus and memory values

When the job definition is registered with memory first and vcpus second, the values are swapped, which leads to an incorrect display in the dashboard.

Set Container Insights retention

By default, Container Insights data retention is 1 day.
This feature would set the Container Insights retention based on a variable.
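A minimal sketch of what such a setting could do from the CLI, assuming the Container Insights log group naming /aws/ecs/containerinsights/<cluster>/performance and a hypothetical default of 7 days. CloudWatch Logs accepts only specific retention values, such as 1, 3, 5, 7, 14, or 30 days. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: set the retention of a Container Insights performance log group.
# The cluster argument is the ECS cluster behind an AWS Batch compute
# environment; the default of 7 days is an arbitrary assumption.
set_container_insights_retention() {
  local cluster="$1" days="${2:-7}"
  aws logs put-retention-policy \
    --log-group-name "/aws/ecs/containerinsights/${cluster}/performance" \
    --retention-in-days "$days"
}
```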

Fix grafana update permissions

aws grafana update-permissions --workspace-id ${GRAFANA_ID} \
    --update-instruction-batch \
    action=ADD,role=ADMIN,users=[{$ADMIN_GROUP,type=SSO_GROUP}]

Parameter validation failed:
Invalid type for parameter updateInstructionBatch[0].users[0], value: 14d844e8-3081-7007-3d2b-1cec5daf3063, type: <class 'str'>, valid types: <class 'dict'>
Invalid type for parameter updateInstructionBatch[1].users[0], value: type=SSO_GROUP, type: <class 'str'>, valid types: <class 'dict'>
