deeplearning-cfn's Introduction

Distributed Deep Learning on AWS Using MXNet and TensorFlow

AWS CloudFormation, which creates and configures Amazon Web Services resources with a template, simplifies the process of setting up a distributed deep learning cluster. The AWS CloudFormation Deep Learning template uses the Amazon Deep Learning AMI (which provides MXNet, TensorFlow, Caffe, Theano, Torch, and CNTK frameworks) to launch a cluster of EC2 instances and other AWS resources needed to perform distributed deep learning.
With this template, we continue with our mission to make distributed deep learning easy. AWS CloudFormation creates all resources in the customer account.

What's New?

We've updated the AWS CloudFormation Deep Learning template to add some exciting new features and capabilities.

May 16 2019

  • Updated AWS DLAMI Conda version in CFN template file to v23.0. Check release notes for details.

April 30 2019

  • Add Support to Amazon Linux 2. Check product overview for details.
  • Updated AWS DLAMI Conda version in CFN template file to v22.0. Check release notes for details.

Feb 12 2019

  • Updated AWS DLAMI Conda version in CFN template file to v21.2. Check release notes for details.

Dec 19 2018

  • Updated AWS DLAMI Conda version in CFN template file to v20.0. Check release notes for details.
  • Updated instructions to use TensorFlow with Horovod for distributed training.

Nov 7 2018

  • Introduced AWS DLAMI Conda v16.0 - For developers who want pre-installed pip packages of deep learning frameworks in separate virtual environments, the Conda-based AMI is available for Ubuntu and Amazon Linux. Check the AWS DLAMI official web page for more details.

  • We now support 11 AWS regions - us-east-1, us-west-2, eu-west-1, us-east-2, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1, eu-central-1, ap-southeast-1, us-west-1.

  • We now support g3 and c5 instances, and have removed support for g2 instances.

  • Updated the MXNet submodule to 1.3.0.

Mar 22 2018

  • We now support 10 AWS regions - us-east-1, us-west-2, eu-west-1, us-east-2, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1, eu-central-1, ap-southeast-1.

  • We now support p3 instances.

Older Release Notes

  • We now support 5 AWS regions - us-east-1, us-east-2, us-west-2, eu-west-1 and ap-southeast-2.

  • We've enhanced the AWS CloudFormation Deep Learning template with automation that continues stack creation even if the provisioned number of worker instances falls short of the desired count. In the previous version of the template, if one of the worker instances failed to be provisioned, for example because it hit an account limit, AWS CloudFormation rolled back the stack and required you to adjust your desired count and restart the stack creation process. The new template includes a function that automatically adjusts the count down and proceeds with setting up the rest of the cluster (stack).

  • We now support creating a cluster of CPU Amazon EC2 instance types.

  • We've also added Amazon Elastic File System (Amazon EFS) support for the cluster created with the template.

    • Amazon EFS is automatically mounted on all worker instances during startup.
    • Amazon EFS allows sharing of code, data, and results across worker instances.
    • Using Amazon EFS doesn't degrade performance for densely packed files (for example, .rec files containing image data).
  • We now support creating a cluster of instances running Ubuntu. See the Ubuntu Deep Learning AMI.

EC2 Cluster Architecture

The following architecture diagram shows the EC2 cluster infrastructure.

Resources Created by the Deep Learning Template

The Amazon Deep Learning template creates a stack that contains the following resources:

  • A VPC in the customer account.
  • The requested number of worker instances (or the available number, if fewer can be provisioned) in an Auto Scaling group within the VPC. These worker instances are launched in a private subnet.
  • A master instance in a separate Auto Scaling group that acts as a proxy to enable SSH connectivity to the cluster. AWS CloudFormation places this instance within the VPC and connects it to both the public and private subnets. This instance has both a public IP address and a public DNS name.
  • An Amazon EFS file storage system configured in General Purpose performance mode.
  • A mount target to mount Amazon EFS on the instances.
  • A security group that allows external SSH access to the master instance.
  • A security group that allows the master and worker instances to mount and access Amazon EFS through NFS port 2049.
  • Two security groups that open ports on the private subnet for communication between the master and workers.
  • An AWS Identity and Access Management (IAM) role that allows instances to poll Amazon Simple Queue Service (Amazon SQS) and access and query Auto Scaling groups and the private IP addresses of the EC2 instances.
  • A NAT gateway used by the instances within the VPC to talk to the outside world.
  • Two Amazon SQS queues to configure the metadata at startup on the master and the workers.
  • An AWS Lambda function that monitors the Auto Scaling group's launch activities and modifies the desired capacity of the Auto Scaling group based on availability.
  • An Amazon Simple Notification Service (Amazon SNS) topic to trigger the Lambda function on Auto Scaling events.
  • An AWS CloudFormation WaitCondition and WaitConditionHandle, with a stack creation timeout of 55 minutes for metadata setup to complete.

How the Deep Learning Template Works

The startup script enables SSH forwarding on all hosts. Enabling SSH agent forwarding is essential because frameworks such as MXNet use SSH for communication between master and worker instances during distributed training.
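
For example, from your local machine you might add your key to the SSH agent and connect to the master with agent forwarding enabled along these lines (a minimal sketch; the placeholders stand in for your own key and the master's address, and the login user is ec2-user on Amazon Linux AMIs and ubuntu on Ubuntu AMIs):

# Add your key to the local SSH agent, then connect to the master with
# agent forwarding (-A) so the master can reach the workers on your behalf
ssh-add <path-to-your-key.pem>
ssh -A <user>@<master-public-dns-or-ip>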

The startup script on the master polls the master SQS queue for messages confirming that Auto Scaling setup is complete. The Lambda function sends two messages, one when the master Auto Scaling group is successfully set up, and a second when either the requested capacity is satisfied or when instances fail to launch on the worker Auto Scaling group. When instance launch fails on the worker Auto Scaling group, the Lambda function modifies the desired capacity to the number of instances that have been successfully launched.
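
For reference, the capacity adjustment that the Lambda function makes is equivalent to the following Auto Scaling operation, shown here with the AWS CLI (the group name and count are placeholders):

# Lower the worker Auto Scaling group's desired capacity to the number of
# instances that actually launched (the Lambda function does this through
# the Auto Scaling API)
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name <worker-asg-name> \
    --desired-capacity <launched-instance-count>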

Upon receiving messages on the master Amazon SQS queue, the setup script on the master configures all of the necessary worker metadata (the IP addresses of the workers, the GPU count, and so on) and broadcasts it on the worker SQS queue. The startup scripts on the worker instances, which poll the worker SQS queue, receive this message and configure the metadata on the workers.

The following environment variables are set up on all the instances:

  • $DEEPLEARNING_WORKERS_PATH: The file path that contains the list of workers

  • $DEEPLEARNING_WORKERS_COUNT: The total number of workers

  • $DEEPLEARNING_WORKER_GPU_COUNT: The number of GPUs on the instance

  • $EFS_MOUNT: The directory where Amazon EFS is mounted
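
For example, once the stack is up you can inspect this metadata from a shell on any instance (a quick sanity check; the values in the comments are illustrative):

# Inspect the cluster metadata configured by the startup scripts
cat $DEEPLEARNING_WORKERS_PATH       # one worker hostname per line
echo $DEEPLEARNING_WORKERS_COUNT     # e.g., 2
echo $DEEPLEARNING_WORKER_GPU_COUNT  # e.g., 8 on a p2.8xlarge
df -h $EFS_MOUNT                     # shared Amazon EFS mount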

Setting Up a Deep Learning Stack

To set up a deep learning AWS CloudFormation stack, follow Using the AWS CloudFormation Deep Learning Template.
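
If you prefer the AWS CLI to the console, a stack can also be created directly from the template, roughly as follows (a sketch only; the template path and parameter keys below are illustrative placeholders, so check the template and the linked guide for the actual names):

# Hypothetical CLI launch of the deep learning stack; the parameter keys are
# placeholders, not necessarily the template's real parameter names
aws cloudformation create-stack \
    --stack-name deeplearning-cluster \
    --template-body file://<path-to-deep-learning-template> \
    --parameters ParameterKey=KeyName,ParameterValue=<your-ec2-key-pair> \
                 ParameterKey=InstanceType,ParameterValue=p3.2xlarge \
                 ParameterKey=WorkerCount,ParameterValue=2 \
    --capabilities CAPABILITY_IAM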

Running Distributed Training

To demonstrate how to run distributed training using the MXNet and TensorFlow frameworks, we use the standard CIFAR-10 image-classification model. The model is complex enough to benefit from a distributed setup, yet small enough to train quickly on one.

Log in to the Master Instance

Follow Step 3 in Using the AWS CloudFormation Deep Learning Template.

Clone the awslabs/deeplearning-cfn repo that contains the examples onto the EFS mount

Note: This could take a few minutes.

# Clone the repo onto the EFS mount, then fetch the tensorflow/models and
# dmlc/mxnet (incubator-mxnet) repos as submodules
git clone https://github.com/awslabs/deeplearning-cfn $EFS_MOUNT/deeplearning-cfn && \
cd $EFS_MOUNT/deeplearning-cfn && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/models && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet && \
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/3rdparty/dmlc-core

# Pull the latest dmlc-core code so the example works in the conda environment
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/3rdparty/dmlc-core
git fetch origin master
git checkout -b master origin/master

Running Distributed Training on MXNet

The following example shows how to run CIFAR-10 with data parallelism on MXNet. Note the use of the DEEPLEARNING_* environment variables.

# Terminate all running Python processes across the workers
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH

# Navigate to the MXNet Gluon image-classification example directory
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/example/gluon/

# Run the distributed CIFAR-10 training example in the MXNet conda environment
# (change mxnet_p36 to mxnet_p27 to run in the Python 2.7 environment)
../../tools/launch.py -n $DEEPLEARNING_WORKERS_COUNT -H $DEEPLEARNING_WORKERS_PATH "source activate mxnet_p36 && python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_device_sync"

We were able to run the training for 100 epochs in 25 minutes on two p2.8xlarge EC2 instances and achieve a training accuracy of 92%.

These steps summarize how to get started. For more information about running distributed training on MXNet, see Run MXNet on Multiple Devices.

Running Distributed Training on TensorFlow with Horovod

Horovod is a distributed training framework that makes distributed deep learning easy and fast. You can find more information about the advantages of using Horovod for distributed training with TensorFlow or other frameworks in the Horovod documentation. Starting with DLAMI Conda v19.0, the AMI comes with preconfigured example scripts that show how to train a model with TensorFlow and Horovod on multiple GPUs and machines.

The following example shows how to train a ResNet-50 model with synthetic data.

cd ~/examples/horovod/tensorflow
vi hosts

You should see localhost slots=8 in your hosts file. The number of slots specifies how many GPUs on that machine you want to use to train the model. You should also append your worker nodes to the hosts file and assign a GPU count to each of them. To find out how many GPUs are available on an instance, run nvidia-smi. After the change, your hosts file should look like this:

localhost slots=<#GPUs>
<worker node private ip> slots=<#GPUs>
......

You can calculate the total number of GPUs used for training by summing the slots available on each machine. Note that the argument passed to the train_synthetic.sh script below is forwarded to the -np parameter of mpirun. The -np argument is the total number of processes, and the slots entries in the hosts file describe how those processes are split across the machines.

Then run ./train_synthetic.sh 24, replacing 24 with the number of GPUs you are using.
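
For reference, a script like this typically wraps an mpirun invocation along the following lines (a sketch using common Horovod/Open MPI conventions, not the exact contents of train_synthetic.sh; the training script name is a placeholder):

# Illustrative Horovod launch: 24 processes spread across the machines in
# the hosts file, one process per GPU slot
mpirun -np 24 -hostfile hosts \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python <your_training_script.py>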


deeplearning-cfn's Issues

Getting ROLLBACK_COMPLETE status when I try to Create cluster in Asia-Pacific (Mumbai) Region

When I used the default config file (provided in this repository), it said there was no AMI for this region, so I changed the AMI ID to a suitable one. After changing the AMI ID, when I finally click Create stack, I get a "ROLLBACK_COMPLETE" status with sub-statuses of "Resource Creation Cancelled".

Following is the event status:

11:59:30 UTC+0550 ROLLBACK_COMPLETE AWS::CloudFormation::Stack ParameterOptimization  
  11:58:16 UTC+0550 ROLLBACK_IN_PROGRESS AWS::CloudFormation::Stack ParameterOptimization
  11:58:15 UTC+0550 CREATE_FAILED AWS::SQS::Queue WorkerQueue
  11:58:15 UTC+0550 CREATE_FAILED AWS::SQS::Queue MasterQueue
  11:58:15 UTC+0550 CREATE_FAILED AWS::EC2::VPC Vpc
  11:58:15 UTC+0550 CREATE_FAILED AWS::EC2::EIP NATGatewayEIP
  11:58:15 UTC+0550 CREATE_FAILED AWS::EC2::InternetGateway InternetGateway
  11:58:14 UTC+0550 CREATE_FAILED AWS::EFS::FileSystem FileSystem

Kindly help

/etc/hosts appears to be assigned at random now

Whenever we spin up an environment with this template, the script /opt/deeplearning/dl_cfn_setup_v2.py gets called on the instances. This script updates the /etc/hosts file on each node with the list of all the addresses of the hosts in the cluster so it looks like this:

# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6
172.31.4.18 deeplearning-master
172.31.4.18 deeplearning-worker1
172.31.17.209 deeplearning-worker2
172.31.23.139 deeplearning-worker3
172.31.28.203 deeplearning-worker4
172.31.39.46 deeplearning-worker5
172.31.0.67 deeplearning-worker6
172.31.4.223 deeplearning-worker7
172.31.40.227 deeplearning-worker8

Note that the addresses for deeplearning-master and deeplearning-worker1 are the same. This is correct. The last few times I created a new environment, though, it has been assigning the hostnames based on their IP order. This means that deeplearning-master could end up being assigned as any one of the other worker nodes.

The scripts being used internally by our engineers testing the distributed training systems on AWS rely on deeplearning-master being assigned as deeplearning-worker1. Something changed recently that's stopping this from happening.

Has anyone else encountered this? As far as I can tell, it's worked fine up until a few days ago.

Permission denied (publickey)

Hi,
I am trying to follow all the steps to run this parallel training example, but I hit an error.
When I run

# Terminate all running Python processes across workers
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH

It will output:
Permission denied (publickey).
I ran the code on the master server; I think something is wrong with the connection between the master and the workers.
Can anyone help? I'd really appreciate it.
Best regards
Daniel

Create multiple stacks with the same efs

When I create two stacks with the same EFS, the second one fails and rolls back with the following error:
requested subnet for new mount target is not in the same VPC as existing mount targets (Service: AmazonElasticFileSystem; Status Code: 409; Error Code: MountTargetConflict; Request ID: 57f54cf9-cdf7-11e8-ade0-4fc9b2d0aee0)

I believe it is common to create different stacks with the same EFS, so can anyone please resolve this bug?

broken link

In the section "Running Distributed Training on TensorFlow", the link "Distributed Tensorflow" is broken. The sentence that contains it reads:

TensorFlow and the distributed training sample code discussed in Distributed Tensorflow.

Please clarify Step 2, Number 2 of StackSetup.md

I'm following through the otherwise perfectly clear instructions in StackSetup.md, and then hit this:


  1. Enable SSH agent forwarding. This enables communication with all of the instances in the private subnet. Using the DNS/IP address that you recorded in the first step, modify the SSH configuration to include these lines:
 Host IP/DNS-from-above  
 ForwardAgent yes

My question is: Where do I "modify the SSH configuration"?

Does that refer to my .ssh/config file on my local machine?

Or, some setting buried deep in the config of some instance in the DeepLearning Stack?

Or, ... ???

Sorry if this is incredibly obvious to the more experienced.
