PyTorch provides several options for data-parallel training, as laid out in the official documentation. For applications that gradually grow from simple to complex and from prototype to production, the common development trajectory is:
- Use single-device training if the data and model fit on one GPU and training speed is not a concern.
- Use single-machine multi-GPU DataParallel if there are multiple GPUs on the server and you would like to speed up training with minimal code change (see the sketch after this list).
- Use single-machine multi-GPU DistributedDataParallel if you would like to further speed up training and are willing to write a little more code to set it up.
- Use multi-machine DistributedDataParallel and the launching script if the application needs to scale across machine boundaries.
- Use torchelastic to launch distributed training if errors (e.g., OOM) are expected or if resources can join and leave dynamically during training.
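As a reference, here is a minimal sketch of how little code DataParallel needs compared to a single-device setup (the ResNet-18 model here is just an illustrative choice):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=100)

# DataParallel is a one-line change: each forward pass splits the input
# batch across the visible GPUs and gathers the outputs on GPU 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()
```

Note that DataParallel replicates the model on every forward pass and funnels gradients through a single process, which is why DistributedDataParallel (one process per GPU) usually scales better.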
In this repo, I compare the first three setups: (1) single-device training, (2) single-machine multi-GPU DataParallel, and (3) single-machine multi-GPU DistributedDataParallel.
- NVIDIA RTX 2080 Ti × 2
- torch==1.7.1
- torchvision==0.8.2
All dependencies are listed in requirements.txt, and you can also build the environment through the Dockerfile.
All three folders - src/single/, src/dp/, and src/ddp/ - are independent of one another; each can be run on its own.
$ sh src/single/run_single.sh
$ sh src/dp/run_dp.sh
$ sh src/ddp/run_ddp.sh
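The actual DDP code lives in src/ddp/; for orientation, here is a minimal sketch of a DDP entry point in the torch==1.7.1 style, assuming it is launched with `python -m torch.distributed.launch --nproc_per_node=2` (that launcher passes `--local_rank` to each process; the dataset handling and batch size below are illustrative, not the repo's exact code):

```python
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms
from torchvision.models import resnet18

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# One process per GPU; NCCL is the recommended backend for CUDA tensors.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

model = resnet18(num_classes=100).cuda()
model = DDP(model, device_ids=[args.local_rank])

# DistributedSampler gives each process a disjoint shard of the data,
# so the effective (global) batch size is 128 * world_size here.
train_set = datasets.CIFAR100(root="data", train=True, download=True,
                              transform=transforms.ToTensor())
sampler = DistributedSampler(train_set)
loader = DataLoader(train_set, batch_size=128, sampler=sampler)
```

Remember to call `sampler.set_epoch(epoch)` at the start of each epoch so that shuffling differs between epochs.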
Batch size is set to 128 or 256. SyncBatchNorm is recommended for DDP training, but I used vanilla BatchNorm, so the DDP experiment was trained only with a batch size of 256 (see the SyncBatchNorm sketch below). The best model is selected according to validation top-1 accuracy.
I did not tune the hyperparameters in detail, so you may be able to improve performance by changing some settings (e.g., using the Adam optimizer).
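If you do want synchronized batch statistics in DDP, PyTorch can convert an existing model automatically. A minimal sketch (run the conversion before wrapping the model in DDP):

```python
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=100)

# Replace every BatchNorm layer with SyncBatchNorm so that mean/variance
# are computed over the global batch across all DDP processes, instead
# of per-GPU statistics. Do this BEFORE wrapping the model with DDP.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```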
Dataset | Model | Test Loss | Top-1 Acc | Top-5 Acc | Batch Size | Method |
---|---|---|---|---|---|---|
CIFAR-100 | ResNet-18 | 1.3728 | 70.99% | 91.57% | 128 | Single |
CIFAR-100 | ResNet-18 | 1.3394 | 70.64% | 91.60% | 256 | Single |
CIFAR-100 | ResNet-18 | 1.2974 | 71.48% | 91.65% | 128 | DataParallel (DP) |
CIFAR-100 | ResNet-18 | 1.3373 | 71.20% | 91.53% | 256 | DataParallel (DP) |
CIFAR-100 | ResNet-18 | 1.2268 | 71.17% | 91.84% | 256 | DistributedDataParallel (DDP) |
- Experiment results are averaged over random seeds 2, 4, and 42.
- Automatic Mixed Precision (AMP) is applied in every experiment; a reference sketch of the pattern follows below.
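For reference, the standard torch.cuda.amp training step in torch==1.7.1 looks roughly like this (a generic sketch of the recipe, not necessarily the exact code in this repo):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def train_step(model, optimizer, criterion, images, targets):
    """One AMP training step; model/optimizer/criterion/batch are
    assumed to be created elsewhere and already on the GPU."""
    optimizer.zero_grad()
    with autocast():                # run the forward pass in mixed precision
        outputs = model(images)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)          # unscales gradients, then optimizer.step()
    scaler.update()                 # adapt the scale factor for the next step
    return loss.item()
```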
- [Docs] Distributed Communication Package - torch.distributed
- [Post] Technologies behind Distributed Deep Learning - AllReduce :: Keisuke Fukuda
- [Post] PyTorch Distributed Training :: leimao blog
- [Post] Distributed data parallel training in Pytorch :: yangkky blog
- [Repo] PyTorch Official Example
- [Repo] pytorch-distributed :: tczhangzhi