
no-tears-cluster's Introduction

AWS HPC Cluster Deployment Guide 🚀

[Architecture diagram: No Tears Cluster]


Step 1

To deploy, click Launch next to your region:

  • North Virginia (us-east-1): Launch
  • Oregon (us-west-2): Launch
  • Ireland (eu-west-1): Launch
  • Frankfurt (eu-central-1): Launch

More Regions

Warning! This QuickStart is untested in the following regions. Use at your own risk.

  • Ohio (us-east-2): Launch
  • California (us-west-1): Launch
  • London (eu-west-2): Launch
  • Paris (eu-west-3): Launch
  • Stockholm (eu-north-1): Launch
  • Middle East (me-south-1): Launch
  • South America (sa-east-1): Launch
  • Canada (ca-central-1): Launch
  • Hong Kong (ap-east-1): Launch
  • Tokyo (ap-northeast-1): Launch
  • Seoul (ap-northeast-2): Launch
  • Mumbai (ap-south-1): Launch
  • Singapore (ap-southeast-1): Launch
  • Sydney (ap-southeast-2): Launch

If you have not authenticated with the AWS Management Console, you will be prompted to log in with the AWS Account ID or alias, IAM user name, and password that were provided to you.

Step 2

After clicking Launch Stack, you will be redirected to the CloudFormation Quick create stack screen.

[Screenshot: Quick create stack]

Step 3

Scroll to the bottom of the page, and leave most of the Parameters as they are.

  1. Enter an initial BudgetLimit for the project. This will be used to track spending and send an alert when spending crosses 80% of the limit.

  2. Update NotificationEmail with the address that will receive budget notifications.

  3. Fill out the UserPasswordParameter with a temporary password. Keep it simple! You will be prompted to change it on first use.

  4. Select I acknowledge that AWS CloudFormation might create IAM resources.

  5. Now, click Create Stack to deploy the QuickStart Environment.
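
Equivalently, the stack can be created from the AWS CLI. This is a hedged sketch: the template URL is the published 0.2.3 template referenced later in this guide, the parameter values are placeholders, and CAPABILITY_IAM corresponds to the IAM acknowledgement above.

$ aws cloudformation create-stack \
    --stack-name AWS-HPC-Quickstart \
    --template-url https://notearshpc-quickstart.s3.amazonaws.com/0.2.3/cfn.yaml \
    --parameters ParameterKey=BudgetLimit,ParameterValue=100 \
                 ParameterKey=NotificationEmail,ParameterValue=you@example.com \
                 ParameterKey=UserPasswordParameter,ParameterValue='ChangeMe123!' \
    --capabilities CAPABILITY_IAM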

[Screenshot: HPC QuickStart parameters]

Deployment

Deployment takes about 15 minutes. This QuickStart provisions:

  • a Cloud9 Integrated Development Environment (IDE) in the selected region;
  • an AWS ParallelCluster environment, named hpc-cluster, for batch-scheduled jobs and interactive computing;
  • a non-root IAM user with full Administrator access to the AWS Console, for creating custom architectures;
  • a budget notification email sent at 80% of the BudgetLimit.

Provisioning is complete when all Stacks show CREATE_COMPLETE.
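
You can also wait for completion from the command line. A minimal sketch, assuming the stack name from the Onboarding section below; the wait command polls until the stack reaches CREATE_COMPLETE:

$ aws cloudformation wait stack-create-complete --stack-name AWS-HPC-Quickstart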



Onboarding Users 👨‍💻👩‍💻

When all stacks show CREATE_COMPLETE, click on the Outputs tab of the AWS-HPC-Quickstart stack. Send the end user the following information (3 of 5 are on the Outputs tab; a CLI sketch for fetching them follows this list):

  1. ResearchWorkspaceURL -- URL to directly access the Cloud9 Research Environment.

  2. UserLoginUrl -- URL to sign in to the AWS Console to create custom architectures.

  3. UserName -- The AWS IAM username for the end user.

  4. Password that you entered to launch the CloudFormation stack.

  5. The USER_GUIDE included in this repository.
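
The Outputs values can also be pulled with the AWS CLI (a minimal sketch, assuming the default stack name AWS-HPC-Quickstart):

$ aws cloudformation describe-stacks --stack-name AWS-HPC-Quickstart \
    --query 'Stacks[0].Outputs' --output table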

[Screenshot: Stack Outputs tab]


Embedding this Solution in Your Own CloudFormation Template


NoTears is provided as a standalone CloudFormation template with a large number of parameters. Use cases where configuration options need to be limited, or where NoTears is embedded as one component of a broader architecture, are also supported. To do this, nest the stack as follows:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  NoTearsHPC:
    Type: 'AWS::CloudFormation::Stack'
    Properties:
      TemplateURL: 'https://notearshpc-quickstart.s3.amazonaws.com/0.2.3/cfn.yaml'
      Parameters:
        OperatingSystem: alinux2
        EnableBudget: false
        CreateUserAndGroups: true
        EnableFSxLustre: false
        SpackVersion: v0.15.0

An illustrative example is provided in inherit.yaml.
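
The wrapper can then be deployed like any other template (a minimal sketch; the stack name is arbitrary, and CAPABILITY_IAM is needed because the nested stack creates IAM resources):

$ aws cloudformation deploy --template-file inherit.yaml \
    --stack-name my-hpc-wrapper --capabilities CAPABILITY_IAM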


FAQ 🧐

  • If you see the following: The security token included in the request is invalid

    It's likely an issue with the account having just been created. Wait a few minutes and try again.

  • On first login to the Cloud9 terminal you may see something like:

    Agent pid 3303
    Identity added: /home/ec2-user/.ssh/AWS-HPC-Quickstart-NJCH9esX (/home/ec2-user/.ssh/AWS-HPC-Quickstart-NJCH9esX)
    

    Ignore these messages. They indicate that your SSH key for ParallelCluster was located.

Developer Setup


Note: This section is only for developing the solution. To create a cluster, see Launch the HPC Quickstart Environment.

The first step is installing Node.js; this can be done easily with Homebrew. After that completes, install aws-cdk:

$ brew install node
$ npm install -g aws-cdk

(Alternatively, use brew install aws-cdk)

Optionally create a python virtualenv or conda environment. We assume this is in ./.env.

Now you can activate the python virtualenv and install the python dependencies:

$ source .env/bin/activate               # Can be skipped if not using a virtualenv
$ pip install -r requirements.txt

Make sure your region and AWS credentials are set up by running:

$ aws configure
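
To confirm the credentials are picked up (a quick check; get-caller-identity returns the account and principal in use):

$ aws sts get-caller-identity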

Acquire a GitHub auth token:

  1. Sign in to your account.
  2. Settings -> Developer Settings -> Personal Access Tokens.
  3. Create a token without checking any boxes.
  4. Copy the provided token and add it to your $HOME/.bashrc:
export GIT_AUTH_TOKEN=XXXXXXX

At this point, it's time to set up CDK. The following needs to be done once in each account and region:

$ cdk bootstrap

And finally, deploy the app:

$ cdk deploy --parameters UserPasswordParameter=******* --parameters NotificationEmail=*******@amazon.com

I've surely missed a bunch of Python dependencies; the format for installing those is:

$ pip install aws-cdk.custom-resources

Once the deploy finishes, you'll get an output with a link to the Cloud9 URL. Click it to open the Cloud9 environment.

Deploying to Public Bucket

Use the included upload.sh script to publish the template and assets (lambda zips, bootstrap scripts, etc.) to an S3 bucket.

$ sh upload.sh [target-bucket-name]

This will create a modified cfn.yaml in the app directory. The script also creates buckets:

  • s3://target-bucket-name, the central repository for the template and assets

  • s3://target-bucket-name-region, a clone of the central repository, where region expands to all AWS public regions.

Before publishing the template, the upload script synthesizes a yaml template and transforms the parameters and asset URLs to match the regional buckets. The final output, cfn.yaml, will reflect all changes.
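
Conceptually, the regional clone step behaves like the loop below. This is a hedged sketch, not the actual script; it only illustrates the bucket layout described above.

# For each public region, create the regional bucket (if missing) and mirror the central one.
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  aws s3 mb "s3://target-bucket-name-${region}" --region "${region}" || true
  aws s3 sync "s3://target-bucket-name" "s3://target-bucket-name-${region}"
done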

Pro-Tips

Use cdk synth | less to see the generated template.

Provide parameters to the stack via cdk deploy --parameters pcluster:KEY=VALUE.

Useful commands

  • cdk ls: list all stacks in the app
  • cdk synth: emit the synthesized CloudFormation template
  • cdk deploy: deploy this stack to your default AWS account/region (remember to update the version number on the quick-launch buttons in the README!)
  • cdk diff: compare the deployed stack with the current state
  • cdk docs: open the CDK documentation

Enjoy!

no-tears-cluster's People

Contributors

bollig, dependabot[bot], sean-smith, stephenmsachs


no-tears-cluster's Issues

nodejs 14

An update is required from Node.js 10.x to 12.x or 14.x for the C9BootstrapProvider framework.

Make user creation optional

Lots of accounts restrict creating IAM users. We should make the user optional, either via a CreateUser=True/False parameter or by skipping user creation when the password is blank.

pcluster not found on Cloud9 workspace

When I launch the stack and connect to the Cloud9 workspace, pcluster is not there:

rsignell:~/environment $ more Welcome.txt 
  _   _         _______                     _____ _           _
 | \ | |       |__   __|                   / ____| |         | |
 |  \| | ___      | | ___  __ _ _ __ ___  | |    | |_   _ ___| |_ ___ _ __
 | . ` |/ _ \     | |/ _ \/ _` | '__/ __| | |    | | | | / __| __/ _ \ '__|
 | |\  | (_) |    | |  __/ (_| | |  \__ \ | |____| | |_| \__ \ ||  __/ |
 |_| \_|\___/     |_|\___|\__,_|_|  |___/  \_____|_|\__,_|___/\__\___|_|

 To get started, run:

 $ pcluster list --color

 To ssh into the cluster, once it's gone into CREATE_COMPLETE, run:

 $ pcluster ssh hpc-cluster

 To connect to the desktop GUI via DCV run:

 $ pcluster dcv connect hpc-cluster
rsignell:~/environment $ pcluster list --color
Traceback (most recent call last):
  File "/usr/bin/pcluster", line 33, in <module>
    sys.exit(load_entry_point('aws-parallelcluster==2.8.0', 'console_scripts', 'pcluster')())
  File "/usr/bin/pcluster", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/local/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 558, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/local/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 215, in from_name
    raise PackageNotFoundError(name)
importlib_metadata.PackageNotFoundError: No package metadata was found for aws-parallelcluster
rsignell:~/environment $ 

It seemed from the no tears video that pcluster would already be installed.

I tried installing pcluster (using conda), but then I get:

(aws) rsignell:~/environment $ pcluster list --color
ERROR: Error when calling DescribeInstanceTypes for instances c5.2xlarge: Request has expired.

@sean-smith I hope this provides enough clues what I'm not doing correctly...

Make VPC and Subnet Optional

Create a parameter createNetwork = true/false and vpc, public_subnet, private_subnet parameters.

This would allow you to use an existing VPC and subnets instead of no-tears-cluster creating them every time.

expired credentials

Admin:~/environment $ pcluster update -c al2-config.ini hpc-cluster-al6
Retrieving configuration from CloudFormation for cluster hpc-cluster-al6...
ERROR: Failed when retrieving cluster config version from DynamoDB with error An error occurred (UnrecognizedClientException) when calling the GetItem operation: The security token included in the request is invalid

Note that ~/.aws/credentials exists on Cloud9 and contains:

# Do not modify this file, if this file is modified it will not be updated. If the file is deleted, it will be recreated on Mon Oct 05 2020 23:48:15 GMT+0000 (Coordinated Universal Time).
# [...] #

[default]
aws_access_key_id=[...]
aws_secret_access_key=[...]
aws_session_token=[...]
region=us-east-1

No-Ingress Cluster Access

Feature Request

Cloud9 recently launched a feature that allows you to connect to Cloud9 without having to open up the security group for SSH access.

It would be really cool to support this through No-Tears-Cluster so users need no ingress access for their cluster at all.

Cloud9 Not Bootstrapping

When FSx Lustre is disabled, the Cloud9 bootstrap process does not execute. Perhaps related to a custom resource executing before Cloud9 is done booting? Or the parent stack finishing before the custom resource can execute?

No-Tears-Cluster CLI command

The value in no-tears-cluster comes from the ability to make the pcluster config file generic, i.e. you can share a pcluster config file by posting it on S3 and anyone can use it just by setting ConfigS3Uri to point to that file.

This feature request proposes decoupling that from the stack creation via a CLI command.

$ no-tears-cluster configure cluster_template.ini -o cluster.ini
$ pcluster create cluster -c cluster.ini

Where cluster.ini could look like:

[global]
cluster_template = hpc
update_check = true
sanity_check = true

[aws]
aws_region_name = ${AWS_DEFAULT_REGION}

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster hpc]
key_name = ${ssh_key_id}
base_os = ubuntu1804
scheduler = slurm
master_instance_type = c5.2xlarge
compute_instance_type = c5.18xlarge
vpc_settings = public-private
fsx_settings = fsx-scratch2
disable_hyperthreading = true
dcv_settings = dcv
post_install = ${post_install_script_url}
post_install_args = "/shared/spack-0.15 releases/v0.15 /opt/slurm/log sacct.log"
s3_read_resource = arn:aws:s3:::*
s3_read_write_resource = ${s3_read_write_resource}/*
initial_queue_size = 0
max_queue_size = 10
placement_group = DYNAMIC
master_root_volume_size = 50
compute_root_volume_size = 30
ebs_settings = myebs
cw_log_settings = cw-logs

[ebs myebs]
volume_size = 200
shared_dir = /shared

[fsx fsx-scratch2]
shared_dir = /scratch
storage_capacity = 1200
deployment_type = SCRATCH_2
import_path = ${s3_read_write_url}

[dcv dcv]
enable = master
port = 8443
access_from = 0.0.0.0/0

[cw_log cw-logs]
enable = false

[vpc public-private]
vpc_id = ${vpc_id}
master_subnet_id = ${master_subnet_id}
compute_subnet_id = ${compute_subnet_id}
use_public_ips = false

Cloud9 instance: no space left on device

I came back to my Cloud9 environment created by no-tears-cluster after a few weeks of inactivity, and when I tried to log in, I got no space left on device. Is it doing some logging that fills up the disk during periods of inactivity?
Looks like the EBS volume is 10 GB.

I was able to do

yum clean packages

to free up enough space that I could store the resize.sh script mentioned in the resize EBS instructions, but the script failed.

Needless to say, a bit of a hurdle for a mere researcher.
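
For reference, the standard manual procedure such a script automates looks roughly like this (a hedged sketch, assuming an ext4 root filesystem on /dev/xvda1; the volume ID and size are placeholders):

$ aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 20
$ sudo growpart /dev/xvda 1
$ sudo resize2fs /dev/xvda1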

Timeout comment incorrect

# returns the InstanceId. Timeout of 20000 milliseconds.
while True:
    # Filters SSM Managed Instances to find only the matching cloud9 env
    instance_info = ssm_client.describe_instance_information(Filters=[{
        'Key': 'tag:aws:cloud9:environment', 'Values': ["{}".format(cloud9_environment)]
    }])
    # if instance not found/not registered to SSM, instance_info is empty.
    if instance_info.get('InstanceInformationList'):
        if instance_info.get('InstanceInformationList')[0].get('PingStatus') == 'Online':
            return instance_info.get('InstanceInformationList')[0].get('InstanceId')
    if context.get_remaining_time_in_millis() < 20000:

The function has a 900-second timeout and the code gives up with less than 20 seconds remaining. Update the comment to state that the effective timeout is 880 seconds.

failed build

I attempted to use the 'no tears' solution to build a 'test' HPC server with a $50 budget via the web interface.

My build failed with the following:
You have attempted to create more buckets than allowed (Service: Amazon S3; Status Code: 400; Error Code: TooManyBuckets; f

What is the lowest-cost budget system I can create with this solution?

SRA toolkit now named 'sratoolkit'

USERGUIDE.md contains text and screen captures using sra-toolkit for the Spack package, which is named sratoolkit in the current Spack release.
