Proteus

This is the official implementation of "Proteus: Simulating the Performance of Distributed DNN Training". [arXiv]

Proteus is the first standalone simulator to model the performance of complex parallelization strategies through simulated execution. Proteus first models a parallelization strategy with a unified representation called a Strategy Tree. It then compiles the strategy tree into a distributed execution graph and simulates complex runtime behaviors, such as computation-communication overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). Proteus is evaluated across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves 3.0% average prediction error and preserves the relative order of training throughput across parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.

Installation

First, compile NCCL in external/nccl by running the following commands:

cd external/nccl
make -j src.build

Then, install the Python dependencies and add Proteus to PYTHONPATH:

pip install graphviz toposort
export PYTHONPATH=$PYTHONPATH:/path/to/proteus
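A quick way to confirm the setup is to check that the modules resolve from Python. The sketch below is only a sanity check, and it assumes the repository is importable as `proteus` once it is on PYTHONPATH; that import name is an assumption, not something stated in this README.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # graphviz and toposort come from the pip command above;
    # "proteus" is an ASSUMED import name for this repository.
    missing = missing_modules(["graphviz", "toposort", "proteus"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All modules found")
```

If `proteus` is reported as not importable, the path exported in PYTHONPATH is the first thing to check.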

Usage

Cluster Configuration

The cluster configuration is defined by a device topology file and a cluster JSON file. The device topology file specifies the topology of a single node, and the cluster JSON file specifies the cluster information. Example topology files and cluster JSON files are provided in examples/clusters/. The device topology file is generated by running nccl-tests with the NCCL_TOPO_DUMP_FILE environment variable set.
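Before pointing Proteus at a freshly dumped topology file, it can help to check what the file contains. The sketch below assumes the dump is an XML document, which is how NCCL writes its topology dumps as far as we know, and simply counts element tags without relying on any particular schema. The `SAMPLE` string and its tag names are made up for illustration and are not taken from a real dump.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text):
    """Count how often each element tag appears in an XML document."""
    root = ET.fromstring(xml_text.strip())
    return Counter(elem.tag for elem in root.iter())


# Made-up miniature stand-in for a topology dump (illustration only).
SAMPLE = """
<system>
  <cpu>
    <pci><gpu/></pci>
    <pci><gpu/></pci>
  </cpu>
</system>
"""

if __name__ == "__main__":
    # Print one line per tag, sorted by tag name.
    for tag, n in sorted(count_tags(SAMPLE).items()):
        print(tag, n)
```

For a real file, read its contents and pass the text to `count_tags`; the counts give a rough picture of how many devices and links the dump describes.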

Run Examples

We provide some examples in examples/. Try Proteus with:

cd examples
mkdir -p log
python alexnet.py -model alexnet -bs 256 -cluster clusters/dgx1_v100_2ib/n1_g1.json -ps dp --profile-iters 50
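To compare several settings, the command above can be generated in a loop. The sketch below reuses only the flags shown in the example; the helper names `build_command` and `run_sweep` are hypothetical, and the batch sizes other than 256 are illustrative values that the script is assumed, not confirmed, to accept. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_command(batch_size, cluster="clusters/dgx1_v100_2ib/n1_g1.json"):
    """Build the alexnet example invocation for one batch size."""
    return [
        "python", "alexnet.py",
        "-model", "alexnet",
        "-bs", str(batch_size),
        "-cluster", cluster,
        "-ps", "dp",
        "--profile-iters", "50",
    ]


def run_sweep(batch_sizes, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for bs in batch_sizes:
        cmd = build_command(bs)
        print(shlex.join(cmd))
        if not dry_run:
            # Must be launched from the examples/ directory.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 128 and 512 are hypothetical batch sizes chosen for illustration.
    run_sweep([128, 256, 512])
```

Calling `run_sweep([...], dry_run=False)` from inside examples/ would execute each command in turn and stop at the first failure.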

Citation

@article{duan2023proteus,
  title={Proteus: Simulating the Performance of Distributed DNN Training},
  author={Duan, Jiangfei and Li, Xiuhong and Xu, Ping and Zhang, Xingcheng and Yan, Shengen and Liang, Yun and Lin, Dahua},
  journal={arXiv preprint arXiv:2306.02267},
  year={2023}
}
