OpenDiT

An Easy, Fast and Memory-Efficient System for DiT Training and Inference

[Homepage] | [Discord] | [WeChat] | [Twitter] | [Zhihu] | [Media]

Latest News 🔥

  • [2024/03/18] Support OpenSora: An open-source initiative dedicated to efficiently reproducing OpenAI's Sora.
  • [2024/02/27] Officially release OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.

About

OpenDiT is an open-source project that provides a high-performance implementation of the Diffusion Transformer (DiT), powered by Colossal-AI. It is specifically designed to make training and inference efficient for DiT applications, including text-to-video and text-to-image generation.

OpenDiT has been adopted by OpenSora and MiniSora.

OpenDiT boosts performance through the following techniques:

  1. Up to 80% speedup and 50% memory reduction on GPU
    • Kernel optimizations including FlashAttention, fused AdaLN, and a fused LayerNorm kernel (a plain-PyTorch reference for the adaLN modulation these kernels fuse appears after this list).
    • Hybrid parallelism methods including ZeRO, Gemini, and DDP. Sharding the EMA model further reduces memory cost.
  2. FastSeq: a novel sequence parallelism method
    • Specially designed for DiT-like workloads, where the activation size is large but the parameter size is small.
    • Up to 48% communication savings for intra-node sequence parallelism.
    • Breaks the memory limitation of a single GPU and reduces overall training and inference time.
  3. Ease of use
    • Huge performance gains with only a few lines of code changed.
    • Users do not need to know the implementation details of distributed training.
  4. Complete pipeline for text-to-image and text-to-video generation
    • Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallelism code.
    • We verified OpenDiT's accuracy with text-to-image training on ImageNet and released a checkpoint.
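
To make the fused-kernel target concrete, here is a plain-PyTorch reference for the adaLN modulation used in DiT blocks. This is only an unfused reference for illustration, not OpenDiT's fused Triton kernel; the fused kernel computes the same chain of normalization and elementwise ops in a single pass.

import torch
import torch.nn.functional as F

def ada_layernorm(x, shift, scale):
    # LayerNorm without a learned affine, followed by a per-sample
    # scale and shift regressed from the conditioning (adaLN modulation).
    x = F.layer_norm(x, (x.shape[-1],))
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

B, S, D = 2, 16, 64
x = torch.randn(B, S, D)            # token activations [B, S, D]
cond = torch.randn(B, D)            # conditioning embedding [B, D]
proj = torch.nn.Linear(D, 2 * D)    # regresses shift and scale
shift, scale = proj(cond).chunk(2, dim=-1)
print(ada_layernorm(x, shift, scale).shape)  # torch.Size([2, 16, 64])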

[Figure: end-to-end performance]

Authors: Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, Yang You

OpenDiT will continue to integrate more open-source DiT models. Stay tuned for upcoming enhancements and additional features!

Installation

Prerequisites:

  • Python >= 3.10
  • PyTorch >= 1.13 (we recommend version 2.0 or later)
  • CUDA >= 11.6

We strongly recommend using Anaconda to create a new environment (Python >= 3.10) to run our examples:

conda create -n opendit python=3.10 -y
conda activate opendit

Install ColossalAI:

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
git checkout adae123df3badfb15d044bd416f0cf29f250bc86
pip install -e .

Install OpenDiT:

git clone https://github.com/oahzxl/OpenDiT
cd OpenDiT
pip install -e .

(Optional but recommended) Install libraries for training & inference speedup:

# Install Triton for fused adaln kernel
pip install triton

# Install FlashAttention
pip install flash-attn

# Install apex for fused layernorm kernel
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
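
After installation, a quick sanity check (a small standalone snippet, not part of OpenDiT itself) confirms the optional libraries are importable:

import importlib

# Try to import each optional speedup library and report its status.
for name in ("triton", "flash_attn", "apex"):
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError:
        print(f"{name}: not installed")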

Usage

Here are the supported models and their usage:

Model    | Source                                   | Function       | Usage | Optimized
---------|------------------------------------------|----------------|-------|----------
DiT      | https://github.com/facebookresearch/DiT | label-to-image | Usage | ✅
OpenSora | https://github.com/hpcaitech/Open-Sora  | text-to-video  | Usage | ✅

FastSeq

[Figure: FastSeq overview]

In the realm of visual generation models such as DiT, sequence parallelism is indispensable for efficient long-sequence training and low-latency inference. Two key characteristics distinguish these workloads:

  • The model parameters are small compared with LLMs, but the sequences can be very long, making communication the bottleneck.
  • Because the model is relatively small, sequence parallelism is only needed within a node.

However, existing methods like DeepSpeed-Ulysses and Megatron-LM Sequence Parallelism face limitations when applied to such tasks. They either introduce excessive sequence communication or lack efficiency in handling small-scale sequence parallelism.

To this end, we present FastSeq, a novel sequence parallelism method for long sequences and small-scale parallelism. Our method minimizes sequence communication by employing only two communication operators per transformer layer. We leverage AllGather to enhance communication efficiency, and we strategically employ an async ring to overlap AllGather communication with the qkv computation, further optimizing performance. A simplified sketch of the idea follows.
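
Below is a minimal sketch of AllGather-based sequence parallelism for a single attention layer. It is an illustration, not the actual FastSeq implementation: queries stay local while the key/value shards are gathered (two communication ops per layer), and the async-ring overlap of AllGather with the qkv projection is omitted for clarity. Run it with, e.g., torchrun --nproc_per_node=2 sketch.py.

import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_seq(t):
    # AllGather sequence shards from all ranks along the sequence dim.
    parts = [torch.empty_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(parts, t.contiguous())
    return torch.cat(parts, dim=1)  # [B, S, D]

def sp_attention(x_local, w_qkv, num_heads):
    # x_local: this rank's sequence shard, shape [B, S/P, D].
    B, s, D = x_local.shape
    q, k, v = (x_local @ w_qkv).chunk(3, dim=-1)   # local qkv shards
    k_full, v_full = gather_seq(k), gather_seq(v)  # two comm ops per layer
    def heads(t):
        return t.view(B, -1, num_heads, D // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(heads(q), heads(k_full), heads(v_full))
    return out.transpose(1, 2).reshape(B, s, D)    # output stays sharded

if __name__ == "__main__":
    dist.init_process_group("gloo")  # use "nccl" on GPUs
    torch.manual_seed(0)             # same data on every rank
    world, rank = dist.get_world_size(), dist.get_rank()
    B, S, D, H = 2, 16, 64, 4
    x = torch.randn(B, S, D)
    w_qkv = torch.randn(D, 3 * D) / D ** 0.5
    shard = x.chunk(world, dim=1)[rank]            # this rank's shard
    y = sp_attention(shard, w_qkv, H)
    print(f"rank {rank}: output shard {tuple(y.shape)}")

In FastSeq itself, the gather is issued asynchronously in a ring so that it overlaps with the qkv projection, hiding most of the communication cost; this sketch performs the gather synchronously.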

Here are the results of our experiments; more results are coming soon:

[Figure: FastSeq experimental results]

DiT Reproduction Result

To verify our accuracy, we trained DiT from scratch on ImageNet with OpenDiT, following the original training recipe, for 80k steps on 8x A100 GPUs. Here are some samples generated by our trained DiT:

[Figure: generated ImageNet samples]

Our loss also aligns with the results listed in the paper:

[Figure: training loss curve]

To reproduce our results, you can follow our instructions.

Acknowledgement

We extend our gratitude to Zangwei Zheng for providing valuable insights into algorithms and aiding in the development of the video pipeline. Additionally, we acknowledge Shenggan Cheng for his guidance on code optimization and parallelism. Our appreciation also goes to Fuzhao Xue, Shizun Wang, Yuchao Gu, Shenggui Li, and Haofan Wang for their invaluable advice and contributions.

This codebase borrows from:

  • OpenSora: An open-source initiative dedicated to efficiently reproducing OpenAI's Sora.
  • DiT: Scalable Diffusion Models with Transformers.
  • PixArt: An open-source DiT-based text-to-image model.
  • Latte: An attempt to efficiently train DiT for video.

Contributing

If you encounter problems using OpenDiT or have a feature request, feel free to create an issue! We also welcome pull requests from the community.

Citation

@misc{zhao2024opendit,
  author = {Xuanlei Zhao and Zhongkai Zhao and Ziming Liu and Haotian Zhou and Qianli Ma and Yang You},
  title = {OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/NUS-HPC-AI-Lab/OpenDiT}},
}

Star History

[Figure: star history chart]
