A curated list of awesome projects and papers for distributed training and inference, especially for large models. A few short, illustrative code sketches are interspersed below to give a flavor of the core techniques.
- Megatron-LM: Ongoing Research Training Transformer Models at Scale
- DeepSpeed: A Deep Learning Optimization Library that Makes Distributed Training and Inference Easy, Efficient, and Effective
- ColossalAI: A Unified Deep Learning System for Large-Scale Parallel Training
- OneFlow: A Performance-Centered and Open-Source Deep Learning Framework
- Mesh TensorFlow: Model Parallelism Made Easier
- FlexFlow: A Distributed Deep Learning Framework that Supports Flexible Parallelization Strategies
- Alpa: Auto Parallelization for Large-Scale Neural Networks
- Easy Parallel Library: A General and Efficient Deep Learning Framework for Distributed Model Training
- FairScale: PyTorch Extensions for High Performance and Large Scale Training
- Pipeline Parallelism and SPMD for PyTorch
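Most of these frameworks extend or interoperate with PyTorch's built-in data parallelism, which is the common baseline. A minimal `DistributedDataParallel` sketch (assuming a `torchrun --nproc_per_node=<gpus>` launch; the model and tensor sizes are placeholders):

```python
# Minimal PyTorch DDP loop: each rank holds a full model replica;
# gradients are bucketed and all-reduced automatically in backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")             # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device=local_rank)
loss = model(x).sum()
loss.backward()                                     # overlaps grad all-reduce
opt.step()
```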
- Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis by Tal Ben-Nun et al., ACM Computing Surveys 2020
- A Survey on Auto-Parallelism of Neural Networks Training by Peng Liang et al., TechRxiv 2022
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism by Yanping Huang et al., NeurIPS 2019
- PipeDream: generalized pipeline parallelism for DNN training by Deepak Narayanan et al., SOSP 2019
- Memory-Efficient Pipeline-Parallel DNN Training by Deepak Narayanan et al., ICML 2021
- DAPPLE: a pipelined data parallel approach for training large models by Shiqing Fan et al., PPoPP 2021
- Chimera: efficiently training large-scale neural networks with bidirectional pipelines by Shigang Li et al., SC 2021
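A quick way to see what these schedules optimize: with `p` pipeline stages and `m` micro-batches, a GPipe-style flush schedule idles for a "bubble" fraction of `(p - 1) / (m + p - 1)`, so splitting the batch into more micro-batches shrinks the bubble (1F1B keeps the same bubble while capping in-flight activations at `p` micro-batches). A small sketch of that arithmetic:

```python
# Pipeline bubble fraction for a GPipe-style schedule with p stages
# and m micro-batches: (p - 1) / (m + p - 1).
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"p=4 stages, m={m:2d} micro-batches -> bubble = {bubble_fraction(4, m):.1%}")
```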
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding by Dmitry Lepikhin et al., ICLR 2021
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models by Jiaao He et al., PPoPP 2022
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale by Samyam Rajbhandari et al., ICML 2022
- Tutel: Adaptive Mixture-of-Experts at Scale by Changho Hwang et al., arXiv 2022
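Common to these systems is top-k gating: route each token to a few experts and mix the expert outputs by renormalized gate probabilities. A self-contained sketch of top-2 routing, omitting the all-to-all dispatch, capacity factors, and load-balancing losses the papers add (all names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Top-2 gated mixture-of-experts layer (dispatch done with masks)."""
    def __init__(self, d_model=64, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: [tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)  # routing probabilities
        top_p, top_i = probs.topk(self.k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(8, 64)).shape)       # torch.Size([8, 64])
```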
- Efficient large-scale language model training on GPU clusters using Megatron-LM by Deepak Narayanan et al., SC 2021
- GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training by Arpan Jain et al., SC 2020
- Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training by Can Karakus et al., arXiv 2021
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training by Zhengda Bian et al., arXiv 2021
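The tensor (intra-layer) parallelism at the heart of Megatron-LM rests on a simple identity: for an MLP block Y = GeLU(XA)B, split A column-wise and B row-wise, and the per-device partial products sum exactly to Y, so a single all-reduce recovers the output. A single-process check of that identity (sizes are arbitrary):

```python
# Megatron-style tensor parallelism for Y = GeLU(X @ A) @ B: shard A by
# columns and B by rows; the partial outputs sum (the all-reduce) to Y.
import torch
import torch.nn.functional as F

X = torch.randn(4, 8)
A = torch.randn(8, 16)
B = torch.randn(16, 8)

full = F.gelu(X @ A) @ B

partials = []
for A_i, B_i in zip(A.chunk(2, dim=1), B.chunk(2, dim=0)):
    partials.append(F.gelu(X @ A_i) @ B_i)   # each "rank" computes a partial
approx = sum(partials)                        # stands in for the all-reduce

print(torch.allclose(full, approx, atol=1e-4))  # True
```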
- Training deep nets with sublinear memory cost by Tianqi Chen et al., arxiv 2016
- ZeRO: memory optimizations toward training trillion parameter models by Samyam Rajbhandari et al., SC 2020
- Capuchin: Tensor-based GPU Memory Management for Deep Learning by Xuan Peng et al., ASPLOS 2020
- SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping by Chien-Chin Huang et al., ASPLOS 2020
- Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization by Paras Jain et al., MLSys 2020
- Dynamic Tensor Rematerialization by Marisa Kirisame et al., ICLR 2021
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training by Jianfei Chen et al., ICML 2021
- ZeRO-Offload: Democratizing Billion-Scale Model Training by Jie Ren et al., ATC 2021
- ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning by Samyam Rajbhandari et al., SC 2021
- PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management by Jiarui Fang et al., arXiv 2021
- GACT: Activation Compressed Training for Generic Network Architectures by Xiaoxuan Liu et al., ICML 2022
- MegTaiChi: dynamic tensor-based memory management optimization for DNN training by Zhongzhe Hu et al., ICS 2022
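The starting point of this line of work is activation checkpointing (Chen et al. 2016, above): drop intermediate activations in the forward pass and recompute them during backward, trading compute for memory. PyTorch ships a utility for the sequential case:

```python
# Activation checkpointing with PyTorch's built-in utility: only the
# activations at segment boundaries are kept; the rest are recomputed
# on the fly during backward().
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(model, 2, x)   # keep 2 segments' worth of activations
out.sum().backward()                       # recomputes the dropped activations
```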
- Mesh-tensorflow: Deep learning for supercomputers by Noam Shazeer et al., NeurIPS 2018
- Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks by Zhihao Jia et al., ICML 2018
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
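A toy version of what these systems automate: score each candidate sharding of an operator with a communication cost model and search for the cheapest plan. Real systems (FlexFlow's MCMC search, Alpa's ILP and dynamic programming) do this jointly over whole graphs and device meshes; the cost model below is purely illustrative:

```python
# Illustrative-only cost model for sharding a [M,K] x [K,N] matmul
# across d devices; real auto-parallel systems score whole graphs.
def comm_cost(plan: str, M: int, K: int, N: int, d: int) -> int:
    return {
        "shard_rows(M)": K * N,                     # replicate the [K,N] weight
        "shard_cols(N)": M * K,                     # replicate the [M,K] input
        "shard_inner(K)": 2 * M * N * (d - 1) // d, # all-reduce the partial outputs
    }[plan]

M, K, N, d = 8192, 1024, 1024, 8
plans = ["shard_rows(M)", "shard_cols(N)", "shard_inner(K)"]
best = min(plans, key=lambda p: comm_cost(p, M, K, N, d))
print(best, {p: comm_cost(p, M, K, N, d) for p in plans})
```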
- Blink: Fast and Generic Collectives for Distributed ML by Guanhua Wang et al., MLSys 2020
- GC3: An Optimizing Compiler for GPU Collective Communication by Meghan Cowan et al., arXiv 2022
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads by Abhinav Jangda et al., ASPLOS 2022
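These papers optimize the collectives underneath everything above; the user-facing surface is usually as small as a single `torch.distributed` call. A minimal all-reduce (launch with `torchrun --nproc_per_node=2`; the `gloo` backend keeps it CPU-only):

```python
# Each rank contributes a tensor; after all_reduce every rank holds the sum.
import torch
import torch.distributed as dist

dist.init_process_group("gloo")            # env vars provided by torchrun
rank = dist.get_rank()
t = torch.ones(4) * (rank + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # in-place global sum
print(rank, t)                             # identical result on every rank
```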
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale by Reza Yazdani Aminabadi et al., arXiv 2022
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models by Jiangsu Du et al., arXiv 2022
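As one concrete entry point, DeepSpeed's inference engine wraps a trained model, injects fused kernels, and shards it tensor-parallel. A sketch following the DeepSpeed inference tutorial; the argument names reflect the 2022-era API and may differ in later releases:

```python
# Launch with: deepspeed --num_gpus 2 infer.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-model-parallel degree
    dtype=torch.half,                 # serve in fp16
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused kernels
)
model = engine.module                 # drop-in replacement for the original
```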
- BaGuaLu: targeting brain scale pretrained models with over 37 million cores by Zixuan Ma et al., PPoPP 2022
- NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism by Shixiong Zhao et al., ASPLOS 2022
All contributions to this repository are welcome. Open an issue or send a pull request.