Distributed Training in PyTorch

This repo walks through the distributed training setups you can try, following the PyTorch documentation (see Reference).

PyTorch provides several options for data-parallel training. For applications that gradually grow from simple to complex and from prototype to production, the common development trajectory would be:

  1. Use single-device training, if the data and model can fit in one GPU, and the training speed is not a concern.
  2. Use single-machine multi-GPU DataParallel, if there are multiple GPUs on the server, and you would like to speed up training with the minimum code change. Use single-machine multi-GPU DistributedDataParallel, if you would like to further speed up training and are willing to write a little more code to set it up.
  3. Use multi-machine DistributedDataParallel and the launching script, if the application needs to scale across machine boundaries.
  4. Use torchelastic to launch distributed training, if errors (e.g., OOM) are expected or if the resources can join and leave dynamically during the training.

In this repo, I compare single-device training (option 1) with single-machine multi-GPU DataParallel and single-machine multi-GPU DistributedDataParallel (both option 2). The sketch below shows how the two multi-GPU setups differ in how the model is wrapped and launched.
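The following is a minimal, self-contained sketch of that difference, not the actual code in src/: the toy linear model, tensor shapes, and the rendezvous address 127.0.0.1:29500 are placeholders, and it assumes two GPUs are available.

    import torch
    import torch.nn as nn
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def ddp_worker(rank, world_size):
        # One process per GPU; NCCL is the usual backend for GPU training.
        dist.init_process_group(backend="nccl",
                                init_method="tcp://127.0.0.1:29500",
                                rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        model = nn.Linear(32, 10).cuda(rank)        # placeholder model
        ddp_model = DDP(model, device_ids=[rank])
        ddp_model(torch.randn(128, 32).cuda(rank))  # each process sees its own shard of the batch
        dist.destroy_process_group()

    if __name__ == "__main__":
        # DataParallel: a single process splits every batch across all visible GPUs
        # and gathers the outputs back on the default GPU.
        dp_model = nn.DataParallel(nn.Linear(32, 10).cuda())
        dp_model(torch.randn(256, 32).cuda())

        # DistributedDataParallel: one process per GPU; gradients are all-reduced
        # across processes during the backward pass.
        mp.spawn(ddp_worker, args=(2,), nprocs=2)

In real training, DDP is usually combined with torch.utils.data.distributed.DistributedSampler so that each process reads a disjoint subset of the dataset.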

Environment

  • NVIDIA RTX 2080 Ti × 2
  • torch==1.7.1
  • torchvision==0.8.2

All dependencies are listed in requirements.txt, and a Dockerfile is also provided.
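A typical setup looks like the following; the image tag distributed-training is only an illustrative name, and the docker run command assumes the NVIDIA Container Toolkit is installed.

$ pip install -r requirements.txt

or, with Docker:

$ docker build -t distributed-training .
$ docker run --gpus all -it distributed-training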

How to Run

The three folders src/single/, src/dp/, and src/ddp/ are self-contained; each can be run independently of the others.

Single

$ sh src/single/run_single.sh

DataParallel

$ sh src/dp/run_dp.sh

DistributedDataParallel

$ sh src/ddp/run_ddp.sh
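run_ddp.sh presumably wraps PyTorch's multi-process launcher. For reference, a direct two-GPU launch with torch 1.7.1 generally looks like the line below; the entry-point name main.py is an assumption, so check the script for the actual file and arguments.

$ python -m torch.distributed.launch --nproc_per_node=2 src/ddp/main.py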

Result

Batch size is set to 128 or 256. SyncBatchNorm is recommended for DDP training, but I used vanilla BatchNorm, so the DDP experiment was trained only with a batch size of 256. The best model is selected by validation top-1 accuracy. A SyncBatchNorm conversion sketch is shown below.
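If you do want synchronized batch statistics across DDP processes, the conversion is a one-liner applied before wrapping the model (a sketch, assuming a torchvision ResNet-18 and an initialized process group at training time):

    import torch
    import torchvision

    model = torchvision.models.resnet18(num_classes=100)
    # Replace every BatchNorm layer with SyncBatchNorm so that batch statistics
    # are synchronized across DDP processes during training.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    # ... then move the model to the local GPU and wrap it in DistributedDataParallel as usual.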

I did not tune the hyperparameters carefully, so you may be able to improve these numbers by changing some settings (e.g., switching to the Adam optimizer).

Dataset    Model      Test Loss  Top-1 Acc  Top-5 Acc  Batch Size  Method
CIFAR-100  ResNet-18  1.3728     70.99%     91.57%     128         Single
CIFAR-100  ResNet-18  1.3394     70.64%     91.60%     256         Single
CIFAR-100  ResNet-18  1.2974     71.48%     91.65%     128         DataParallel (DP)
CIFAR-100  ResNet-18  1.3373     71.20%     91.53%     256         DataParallel (DP)
CIFAR-100  ResNet-18  1.2268     71.17%     91.84%     256         DistributedDataParallel (DDP)
  • Reported results are averaged over random seeds 2, 4, and 42.
  • Automatic Mixed Precision (AMP) is applied in every experiment; the sketch below shows the training-step pattern.
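
The AMP pattern with torch.cuda.amp looks roughly like the following. This is a minimal sketch; the toy model, optimizer, and batch are placeholders rather than the repo's actual training loop.

    import torch

    model = torch.nn.Linear(32, 10).cuda()           # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()             # keeps fp16 gradients from underflowing

    inputs = torch.randn(256, 32).cuda()             # placeholder batch
    targets = torch.randint(0, 10, (256,)).cuda()

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # run the forward pass in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()                    # backward on the scaled loss
    scaler.step(optimizer)                           # unscales gradients, then steps
    scaler.update()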

Reference

  • PyTorch Distributed Overview: https://pytorch.org/tutorials/beginner/dist_overview.html
