Giter Club home page Giter Club logo

saina's Introduction

SAINA

Straggler-Aware In-Network Aggregation (SAINA) mitigates the straggler problem in distributed deep learning while preventing accuracy degradation. It aggregates local gradients of the fastest $k$ workers to exclude stragglers and changes $k$ adaptively to balance the tradeoff between training speed and accuracy.

We proposed Switch-Friendly Convergence Detection (SFCD) algorithm which detects a convergence point and increases $k$ incrementally whenever a training error reaches the convergence point. The SFCD algorithm employs a simple bitwise operation on the signs of global gradients, which is supported by commodity programmable switches and thus it can be implemented practically. Simulation results demonstrate that SAINA can reduce the waiting time for stragglers and shows higher time-to-accuracy (TTA) performance than the existing in-network aggregation scheme and the straggler mitigation schemes.

The source code of SAINA was implemented on top of SwitchML.

Performance Results

Time-to-Accuracy Performance (VGG-16)

Batch Size = 16 Batch Size = 32 Batch Size = 64
VGG-16 (batch size = 16) VGG-16 (batch size = 32) VGG-16 (batch size = 64)

Time-to-Accuracy Performance (SqueezeNet)

Batch Size = 16 Batch Size = 32 Batch Size = 64
SqueezeNet (batch size = 16) SqueezeNet (batch size = 32) SqueezeNet (batch size = 64)

Time-to-Accuracy Performance (ResNet50)

Batch Size = 16 Batch Size = 32 Batch Size = 64
ResNet50 (batch size = 16) ResNet50 (batch size = 32) ResNet50 (batch size = 64)

For ResNet50 model, when the batch size is set to 64, all schemes could not reach the target accuracy. Although we need to investigate better hyperparameters for improving the performance of all schemes, SAINA can reach the target accuracy faster than other schemes for other batch sizes.

Effect of $s_{th}$

$s_{th}$ VGG-16 SqueezeNet
30% 472.4s 771.2s
40% 448.8s 543.1s
50% 426.0s 401.0s
60% 555.6s 508.4s
70% 564.0s 530.3s

Effect of $c_{th}$

$c_{th}$ VGG-16 SqueezeNet
1 451.0s 475.5s
3 426.3s 464.1s
5 426.0s 401.0s
7 451.1s 417.3s
9 463.0s 434.1s
$k$* 423.8s 396.5s

Effect of SFCD Algorithm

Scheme VGG-16 Final Accuracy VGG-16 TTA Reduction
SAINA ($k$=1) 73.9% -
SAINA ($k$=8) 76.3% 1.51x
SAINA ($k^*$) 78.0% 1.70x
Scheme VGG-16 Final Accuracy SqueezeNet TTA Reduction
SAINA ($k$=1) 70.1% 2.28x
SAINA ($k$=8) 72.2% 2.15x
SAINA ($k^*$) 72.5% 2.84x

The above tables are results when the batch size is set to 32. In these tables, Final Accuracy means the accuracy when all training is completed, and TTA Reduction means how much TTA is reduced compared to when $k$ is fixed to 16.

saina's People

Contributors

hotephen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.