A curated list of awesome projects and papers for distributed training and inference, especially for large models. A few short, illustrative code sketches are interspersed below to give a flavor of the core techniques.
- Megatron-LM: Ongoing Research Training Transformer Models at Scale
- DeepSpeed: A Deep Learning Optimization Library that Makes Distributed Training and Inference Easy, Efficient, and Effective
- ColossalAI: A Unified Deep Learning System for Large-Scale Parallel Training
- OneFlow: A Performance-Centered and Open-Source Deep Learning Framework
- Mesh TensorFlow: Model Parallelism Made Easier
- FlexFlow: A Distributed Deep Learning Framework that Supports Flexible Parallelization Strategies
- Alpa: Auto Parallelization for Large-Scale Neural Networks
- Easy Parallel Library: A General and Efficient Deep Learning Framework for Distributed Model Training
- FairScale: PyTorch Extensions for High Performance and Large Scale Training
- Pipeline Parallelism and SPMD for PyTorch
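Most of these frameworks extend or interoperate with PyTorch's built-in data parallelism, which is the common baseline. A minimal `DistributedDataParallel` sketch (assuming a `torchrun --nproc_per_node=<gpus>` launch; the model and tensor sizes are placeholders):

```python
# Minimal PyTorch DDP loop: each rank holds a full model replica;
# gradients are bucketed and all-reduced automatically in backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")             # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device=local_rank)
loss = model(x).sum()
loss.backward()                                     # overlaps grad all-reduce
opt.step()
```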
- Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis by Tal Ben-Nun et al., ACM Computing Surveys 2020
- A Survey on Auto-Parallelism of Neural Networks Training by Peng Liang et al., TechRxiv 2022
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism by Yanping Huang et al., NeurIPS 2019
- PipeDream: generalized pipeline parallelism for DNN training by Deepak Narayanan et al., SOSP 2019
- Memory-Efficient Pipeline-Parallel DNN Training by Deepak Narayanan et al., ICML 2021
- DAPPLE: a pipelined data parallel approach for training large models by Shiqing Fan et al., PPoPP 2021
- Chimera: efficiently training large-scale neural networks with bidirectional pipelines by Shigang Li et al., SC 2021
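A quick way to see what these schedules optimize: with `p` pipeline stages and `m` micro-batches, a GPipe-style flush schedule idles for a "bubble" fraction of `(p - 1) / (m + p - 1)`, so splitting the batch into more micro-batches shrinks the bubble (1F1B keeps the same bubble while capping in-flight activations at `p` micro-batches). A small sketch of that arithmetic:

```python
# Pipeline bubble fraction for a GPipe-style schedule with p stages
# and m micro-batches: (p - 1) / (m + p - 1).
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"p=4 stages, m={m:2d} micro-batches -> bubble = {bubble_fraction(4, m):.1%}")
```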
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding by Dmitry Lepikhin et al., ICLR 2021
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models by Jiaao He et al., PPoPP 2022
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale by Samyam Rajbhandari et al., ICML 2022
- Tutel: Adaptive Mixture-of-Experts at Scale by Changho Hwang et al., arXiv 2022
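Common to these systems is top-k gating: route each token to a few experts and mix the expert outputs by renormalized gate probabilities. A self-contained sketch of top-2 routing, omitting the all-to-all dispatch, capacity factors, and load-balancing losses the papers add (all names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Top-2 gated mixture-of-experts layer (dispatch done with masks)."""
    def __init__(self, d_model=64, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: [tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)  # routing probabilities
        top_p, top_i = probs.topk(self.k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(8, 64)).shape)       # torch.Size([8, 64])
```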
- Efficient large-scale language model training on GPU clusters using Megatron-LM by Deepak Narayanan et al., SC 2021
- GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training by Arpan Jain et al., SC 2020
- Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training by Can Karakus et al., arXiv 2021
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training by Zhengda Bian et al., arXiv 2021
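The tensor (intra-layer) parallelism at the heart of Megatron-LM rests on a simple identity: for an MLP block Y = GeLU(XA)B, split A column-wise and B row-wise, and the per-device partial products sum exactly to Y, so a single all-reduce recovers the output. A single-process check of that identity (sizes are arbitrary):

```python
# Megatron-style tensor parallelism for Y = GeLU(X @ A) @ B: shard A by
# columns and B by rows; the partial outputs sum (the all-reduce) to Y.
import torch
import torch.nn.functional as F

X = torch.randn(4, 8)
A = torch.randn(8, 16)
B = torch.randn(16, 8)

full = F.gelu(X @ A) @ B

partials = []
for A_i, B_i in zip(A.chunk(2, dim=1), B.chunk(2, dim=0)):
    partials.append(F.gelu(X @ A_i) @ B_i)   # each "rank" computes a partial
approx = sum(partials)                        # stands in for the all-reduce

print(torch.allclose(full, approx, atol=1e-4))  # True
```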
- Training deep nets with sublinear memory cost by Tianqi Chen et al., arxiv 2016
- ZeRO: memory optimizations toward training trillion parameter models by Samyam Rajbhandari et al., SC 2020
- Capuchin: Tensor-based GPU Memory Management for Deep Learning by Xuan Peng et al., ASPLOS 2020
- SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping by Chien-Chin Huang et al., ASPLOS 2020
- Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization by Paras Jain et al., MLSys 2020
- Dynamic Tensor Rematerialization by Marisa Kirisame et al., ICLR 2021
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training by Jianfei Chen et al., ICML 2021
- ZeRO-Offload: Democratizing Billion-Scale Model Training by Jie Ren et al., ATC 2021
- ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning by Samyam Rajbhandari et al., SC 2021
- PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management by Jiarui Fang et al., arXiv 2021
- GACT: Activation Compressed Training for Generic Network Architectures by Xiaoxuan Liu et al., ICML 2022
- MegTaiChi: dynamic tensor-based memory management optimization for DNN training by Zhongzhe Hu et al., ICS 2022
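The starting point of this line of work is activation checkpointing (Chen et al. 2016, above): drop intermediate activations in the forward pass and recompute them during backward, trading compute for memory. PyTorch ships a utility for the sequential case:

```python
# Activation checkpointing with PyTorch's built-in utility: only the
# activations at segment boundaries are kept; the rest are recomputed
# on the fly during backward().
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(model, 2, x)   # keep 2 segments' worth of activations
out.sum().backward()                       # recomputes the dropped activations
```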
- Mesh-tensorflow: Deep learning for supercomputers by Noam Shazeer et al., NeurIPS 2018
- Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks by Zhihao Jia et al., ICML 2018
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
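A toy version of what these systems automate: score each candidate sharding of an operator with a communication cost model and search for the cheapest plan. Real systems (FlexFlow's MCMC search, Alpa's ILP and dynamic programming) do this jointly over whole graphs and device meshes; the cost model below is purely illustrative:

```python
# Illustrative-only cost model for sharding a [M,K] x [K,N] matmul
# across d devices; real auto-parallel systems score whole graphs.
def comm_cost(plan: str, M: int, K: int, N: int, d: int) -> int:
    return {
        "shard_rows(M)": K * N,                     # replicate the [K,N] weight
        "shard_cols(N)": M * K,                     # replicate the [M,K] input
        "shard_inner(K)": 2 * M * N * (d - 1) // d, # all-reduce the partial outputs
    }[plan]

M, K, N, d = 8192, 1024, 1024, 8
plans = ["shard_rows(M)", "shard_cols(N)", "shard_inner(K)"]
best = min(plans, key=lambda p: comm_cost(p, M, K, N, d))
print(best, {p: comm_cost(p, M, K, N, d) for p in plans})
```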
- Blink: Fast and Generic Collectives for Distributed ML by Guanhua Wang et al., MLSys 2020
- GC3: An Optimizing Compiler for GPU Collective Communication by Meghan Cowan et al., arXiv 2022
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads by Abhinav Jangda et al., ASPLOS 2022
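These papers optimize the collectives underneath everything above; the user-facing surface is usually as small as a single `torch.distributed` call. A minimal all-reduce (launch with `torchrun --nproc_per_node=2`; the `gloo` backend keeps it CPU-only):

```python
# Each rank contributes a tensor; after all_reduce every rank holds the sum.
import torch
import torch.distributed as dist

dist.init_process_group("gloo")            # env vars provided by torchrun
rank = dist.get_rank()
t = torch.ones(4) * (rank + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # in-place global sum
print(rank, t)                             # identical result on every rank
```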
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale by Reza Yazdani Aminabadi et al., arXiv 2022
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models by Jiangsu Du et al., arXiv 2022
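As one concrete entry point, DeepSpeed's inference engine wraps a trained model, injects fused kernels, and shards it tensor-parallel. A sketch following the DeepSpeed inference tutorial; the argument names reflect the 2022-era API and may differ in later releases:

```python
# Launch with: deepspeed --num_gpus 2 infer.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-model-parallel degree
    dtype=torch.half,                 # serve in fp16
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused kernels
)
model = engine.module                 # drop-in replacement for the original
```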
- BaGuaLu: targeting brain scale pretrained models with over 37 million cores by Zixuan Ma et al., PPoPP 2022
- NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism by Shixiong Zhao et al., ASPLOS 2022
All contributions to this repository are welcome. Open an issue or send a pull request.