deeplearning-cfn's Introduction

Distributed Deep Learning on AWS Using MXNet and TensorFlow

AWS CloudFormation, which creates and configures Amazon Web Services resources with a template, simplifies the process of setting up a distributed deep learning cluster. The AWS CloudFormation Deep Learning template uses the Amazon Deep Learning AMI (which provides MXNet, TensorFlow, Caffe, Theano, Torch, and CNTK frameworks) to launch a cluster of EC2 instances and other AWS resources needed to perform distributed deep learning.
With this template, we continue with our mission to make distributed deep learning easy. AWS CloudFormation creates all resources in the customer account.

What's New?

We've updated the AWS CloudFormation Deep Learning template to add some exciting new features and capabilities.

May 16 2019

  • Updated AWS DLAMI Conda version in CFN template file to v23.0. Check release notes for details.

April 30 2019

  • Add Support to Amazon Linux 2. Check product overview for details.
  • Updated AWS DLAMI Conda version in CFN template file to v22.0. Check release notes for details.

Feb 12 2019

  • Updated AWS DLAMI Conda version in CFN template file to v21.2. Check release notes for details.

Dec 19 2018

  • Updated AWS DLAMI Conda version in CFN template file to v20.0. Check release notes for details.
  • Updated instructions to use TensorFlow with Horovod for distributed training.

Nov 7 2018

  • Introduced AWS DLAMI Conda v16.0 - For developers who want pre-installed pip packages of deep learning frameworks in separate virtual environments, the Conda-based AMI is available for Ubuntu and Amazon Linux. Check the AWS DLAMI official web page for more details.

  • We now support 11 AWS regions - us-east-1, us-west-2, eu-west-1, us-east-2, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1, eu-central-1, ap-southeast-1, us-west-1.

  • We now support g3 and c5 instances, and have removed support for g2 instances.

  • Updated the MXNet submodule to 1.3.0.

Mar 22 2018

  • We now support 10 AWS regions - us-east-1, us-west-2, eu-west-1, us-east-2, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1, eu-central-1, ap-southeast-1.

  • We now support p3 instances.

Older Release Notes

  • We now support 5 AWS regions - us-east-1, us-east-2, us-west-2, eu-west-1 and ap-southeast-2.

  • We've enhanced the AWS CloudFormation Deep Learning template with automation that continues stack creation even if the provisioned number of worker instances falls short of the desired count. In the previous version of the template, if one of the worker instances failed to be provisioned, for example because it hit an account limit, AWS CloudFormation rolled back the stack and required you to adjust your desired count and restart the stack creation process. The new template includes a function that automatically adjusts the count down and proceeds with setting up the rest of the cluster (stack).

  • We now support creating a cluster of CPU Amazon EC2 instance types.

  • We've also added Amazon Elastic File System (Amazon EFS) support for the cluster created with the template.

    • Amazon EFS is automatically mounted on all worker instances during startup.
    • Amazon EFS allows sharing of code, data, and results across worker instances.
    • Using Amazon EFS doesn't degrade performance for densely packed files (for example, .rec files containing image data).
  • We now support creating a cluster of instances running Ubuntu. See the Ubuntu Deep Learning AMI.

EC2 Cluster Architecture

The following architecture diagram shows the EC2 cluster infrastructure.

Resources Created by the Deep Learning Template

The Amazon Deep Learning template creates a stack that contains the following resources:

  • A VPC in the customer account.
  • The requested number of worker instances (or the available number, if fewer can be provisioned) in an Auto Scaling group within the VPC. These worker instances are launched in a private subnet.
  • A master instance in a separate Auto Scaling group that acts as a proxy to enable SSH connectivity to the cluster. AWS CloudFormation places this instance within the VPC and connects it to both the public and private subnets. This instance has both a public IP address and a public DNS name.
  • An Amazon EFS file storage system configured in General Purpose performance mode.
  • A mount target to mount Amazon EFS on the instances.
  • A security group that allows external SSH access to the master instance.
  • A security group that allows the master and worker instances to mount and access Amazon EFS through NFS port 2049.
  • Two security groups that open ports on the private subnet for communication between the master and workers.
  • An AWS Identity and Access Management (IAM) role that allows instances to poll Amazon Simple Queue Service (Amazon SQS) and access and query Auto Scaling groups and the private IP addresses of the EC2 instances.
  • A NAT gateway used by the instances within the VPC to talk to the outside world.
  • Two Amazon SQS queues to configure the metadata at startup on the master and the workers.
  • An AWS Lambda function that monitors the Auto Scaling group's launch activities and modifies the desired capacity of the Auto Scaling group based on availability.
  • An Amazon Simple Notification Service (Amazon SNS) topic to trigger the Lambda function on Auto Scaling events.
  • An AWS CloudFormation WaitCondition and WaitConditionHandle, with a stack creation timeout of 55 minutes for metadata setup to complete.

How the Deep Learning Template Works

The startup script enables SSH forwarding on all hosts. Enabling SSH agent forwarding is essential because frameworks such as MXNet use SSH for communication between master and worker instances during distributed training.
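
For example, from your local machine you might add your key to the SSH agent and connect to the master with agent forwarding enabled along these lines (a minimal sketch; the placeholders stand in for your own key and the master's address, and the login user is ec2-user on Amazon Linux AMIs and ubuntu on Ubuntu AMIs):

# Add your key to the local SSH agent, then connect to the master with
# agent forwarding (-A) so the master can reach the workers on your behalf
ssh-add <path-to-your-key.pem>
ssh -A <user>@<master-public-dns-or-ip>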

The startup script on the master polls the master SQS queue for messages confirming that Auto Scaling setup is complete. The Lambda function sends two messages, one when the master Auto Scaling group is successfully set up, and a second when either the requested capacity is satisfied or when instances fail to launch on the worker Auto Scaling group. When instance launch fails on the worker Auto Scaling group, the Lambda function modifies the desired capacity to the number of instances that have been successfully launched.
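
For reference, the capacity adjustment that the Lambda function makes is equivalent to the following Auto Scaling operation, shown here with the AWS CLI (the group name and count are placeholders):

# Lower the worker Auto Scaling group's desired capacity to the number of
# instances that actually launched (the Lambda function does this through
# the Auto Scaling API)
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name <worker-asg-name> \
    --desired-capacity <launched-instance-count>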

Upon receiving messages on the master Amazon SQS queue, the setup script on the master configures all of the necessary worker metadata (the IP addresses of the workers, the GPU count, and so on) and broadcasts it on the worker SQS queue. The startup scripts on the worker instances, which poll the worker SQS queue, receive this message and configure the metadata on the workers.

The following environment variables are set up on all the instances:

  • $DEEPLEARNING_WORKERS_PATH: The file path that contains the list of workers

  • $DEEPLEARNING_WORKERS_COUNT: The total number of workers

  • $DEEPLEARNING_WORKER_GPU_COUNT: The number of GPUs on the instance

  • $EFS_MOUNT: The directory where Amazon EFS is mounted
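
For example, once the stack is up you can inspect this metadata from a shell on any instance (a quick sanity check; the values in the comments are illustrative):

# Inspect the cluster metadata configured by the startup scripts
cat $DEEPLEARNING_WORKERS_PATH       # one worker hostname per line
echo $DEEPLEARNING_WORKERS_COUNT     # e.g., 2
echo $DEEPLEARNING_WORKER_GPU_COUNT  # e.g., 8 on a p2.8xlarge
df -h $EFS_MOUNT                     # shared Amazon EFS mount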

Setting Up a Deep Learning Stack

To set up a deep learning AWS CloudFormation stack, follow Using the AWS CloudFormation Deep Learning Template.
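
If you prefer the AWS CLI to the console, a stack can also be created directly from the template, roughly as follows (a sketch only; the template path and parameter keys below are illustrative placeholders, so check the template and the linked guide for the actual names):

# Hypothetical CLI launch of the deep learning stack; the parameter keys are
# placeholders, not necessarily the template's real parameter names
aws cloudformation create-stack \
    --stack-name deeplearning-cluster \
    --template-body file://<path-to-deep-learning-template> \
    --parameters ParameterKey=KeyName,ParameterValue=<your-ec2-key-pair> \
                 ParameterKey=InstanceType,ParameterValue=p3.2xlarge \
                 ParameterKey=WorkerCount,ParameterValue=2 \
    --capabilities CAPABILITY_IAM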

Running Distributed Training

To demonstrate how to run distributed training using the MXNet and TensorFlow frameworks, we use the standard CIFAR-10 image-classification model. The model is complex enough to benefit from a distributed setup, yet small enough to train quickly on one.

Log in to the Master Instance

Follow Step 3 in Using the AWS CloudFormation Deep Learning Template.

Clone the awslabs/deeplearning-cfn repo that contains the examples onto the EFS mount

Note: This could take a few minutes.

# Clone the repo onto the EFS mount, then fetch the tensorflow/models and
# dmlc/mxnet (incubator-mxnet) repos as submodules
git clone https://github.com/awslabs/deeplearning-cfn $EFS_MOUNT/deeplearning-cfn && \
cd $EFS_MOUNT/deeplearning-cfn && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/models && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet && \
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/3rdparty/dmlc-core

# Pull the latest dmlc-core code so the example works in the conda environment
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/3rdparty/dmlc-core
git fetch origin master
git checkout -b master origin/master

Running Distributed Training on MXNet

The following example shows how to run CIFAR-10 with data parallelism on MXNet. Note the use of the DEEPLEARNING_* environment variables.

# Terminate all running Python processes across the workers
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH

# Navigate to the MXNet Gluon image-classification example directory
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/example/gluon/

# Run the distributed CIFAR-10 training example in the MXNet conda environment
# (change mxnet_p36 to mxnet_p27 to run in the Python 2.7 environment)
../../tools/launch.py -n $DEEPLEARNING_WORKERS_COUNT -H $DEEPLEARNING_WORKERS_PATH "source activate mxnet_p36 && python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_device_sync"

We were able to run the training for 100 epochs in 25 minutes on two p2.8xlarge EC2 instances and achieve a training accuracy of 92%.

These steps summarize how to get started. For more information about running distributed training on MXNet, see Run MXNet on Multiple Devices.

Running Distributed Training on TensorFlow with Horovod

Horovod is a distributed training framework that makes distributed deep learning easy and fast. You can find more information about the advantages of using Horovod for distributed training with TensorFlow or other frameworks in the Horovod documentation. Starting with DLAMI Conda v19.0, the AMI comes with preconfigured example scripts that show how to train a model with TensorFlow and Horovod on multiple GPUs and machines.

The following example shows how to train a ResNet-50 model with synthetic data.

cd ~/examples/horovod/tensorflow
vi hosts

You should see localhost slots=8 in your hosts file. The number of slots specifies how many GPUs on that machine you want to use to train the model. You should also append your worker nodes to the hosts file and assign a GPU count to each of them. To find out how many GPUs are available on an instance, run nvidia-smi. After the change, your hosts file should look like this:

localhost slots=<#GPUs>
<worker node private ip> slots=<#GPUs>
......

You can calculate the total number of GPUs used for training by summing the slots available on each machine. Note that the argument passed to the train_synthetic.sh script below is forwarded to the -np parameter of mpirun. The -np argument is the total number of processes, and the slots entries in the hosts file describe how those processes are split across the machines.

Then run ./train_synthetic.sh 24, replacing 24 with the number of GPUs you are using.
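
For reference, a script like this typically wraps an mpirun invocation along the following lines (a sketch using common Horovod/Open MPI conventions, not the exact contents of train_synthetic.sh; the training script name is a placeholder):

# Illustrative Horovod launch: 24 processes spread across the machines in
# the hosts file, one process per GPU slot
mpirun -np 24 -hostfile hosts \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python <your_training_script.py>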


deeplearning-cfn's Issues

Getting ROLLBACK_COMPLETE status when I try to Create cluster in Asia-Pacific (Mumbai) Region

When I used the default config file (provided in this repository), it said there was no AMI for this region, so I changed the AMI ID to a suitable one. After changing the AMI ID, when I finally click Create stack, I get a "ROLLBACK_COMPLETE" status with sub-statuses of "Resource Creation Cancelled".

Following is the event status:

11:59:30 UTC+0550 ROLLBACK_COMPLETE AWS::CloudFormation::Stack ParameterOptimization  
  11:58:16 UTC+0550 ROLLBACK_IN_PROGRESS AWS::CloudFormation::Stack ParameterOptimization
  11:58:15 UTC+0550 CREATE_FAILED AWS::SQS::Queue WorkerQueue
  11:58:15 UTC+0550 CREATE_FAILED AWS::SQS::Queue MasterQueue
  11:58:15 UTC+0550 CREATE_FAILED AWS::EC2::VPC Vpc
  11:58:15 UTC+0550 CREATE_FAILED AWS::EC2::EIP NATGatewayEIP
  11:58:15 UTC+0550 CREATE_FAILED AWS::EC2::InternetGateway InternetGateway
  11:58:14 UTC+0550 CREATE_FAILED AWS::EFS::FileSystem FileSystem

Kindly help

/etc/hosts appears to be assigned at random now

Whenever we spin up an environment with this template, the script /opt/deeplearning/dl_cfn_setup_v2.py gets called on the instances. This script updates the /etc/hosts file on each node with the list of all the addresses of the hosts in the cluster so it looks like this:

# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6
172.31.4.18 deeplearning-master
172.31.4.18 deeplearning-worker1
172.31.17.209 deeplearning-worker2
172.31.23.139 deeplearning-worker3
172.31.28.203 deeplearning-worker4
172.31.39.46 deeplearning-worker5
172.31.0.67 deeplearning-worker6
172.31.4.223 deeplearning-worker7
172.31.40.227 deeplearning-worker8

Note that the addresses for deeplearning-master and deeplearning-worker1 are the same. This is correct. The last few times I created a new environment, though, it has been assigning the hostnames based on their IP order. This means that deeplearning-master could end up being assigned as any one of the other worker nodes.

The scripts being used internally by our engineers testing the distributed training systems on AWS rely on deeplearning-master being assigned as deeplearning-worker1. Something changed recently that's stopping this from happening.

Has anyone else encountered this? As far as I can tell, it's worked fine up until a few days ago.

Permission denied (publickey)

Hi,
I am trying to follow all the steps to run this parallel training example, but I hit an error.
When I run

# Terminate all running Python processes across workers
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH

It will output:
Permission denied (publickey).
I ran the code on the master server; I think something is wrong with the connection between the master and the workers.
Can anyone help? I'd really appreciate it.
Best regards
Daniel

Create multiple stacks with the same efs

When I create two stacks with the same EFS, the second one fails and rolls back with the following error:
requested subnet for new mount target is not in the same VPC as existing mount targets (Service: AmazonElasticFileSystem; Status Code: 409; Error Code: MountTargetConflict; Request ID: 57f54cf9-cdf7-11e8-ade0-4fc9b2d0aee0)

I believe it is common to create different stacks with the same EFS, so can anyone please resolve this bug?

broken link

In the section "Running Distributed Training on TensorFlow", the link "Distributed Tensorflow" is broken. The sentence that contains it reads:

TensorFlow and the distributed training sample code discussed in Distributed Tensorflow.

Please clarify Step 2, Number 2 of StackSetup.md

I'm following through the otherwise perfectly clear instructions in StackSetup.md, and then hit this:


  1. Enable SSH agent forwarding. This enables communication with all of the instances in the private subnet. Using the DNS/IP address that you recorded in the first step, modify the SSH configuration to include these lines:
 Host IP/DNS-from-above  
 ForwardAgent yes

My question is: Where do I "modify the SSH configuration"?

Does that refer to my .ssh/config file on my local machine?

Or, some setting buried deep in the config of some instance in the DeepLearning Stack?

Or, ... ???

Sorry if this is incredibly obvious to the more experienced.
