ENGLISH | 中文版
DeepStream is a distributed machine learning training framework, currently under development. It uses MPI to pass parameters and synchronize gradient updates across machines, and it can also train on a single machine. DeepStream currently supports:
- Synchronous training
- Asynchronous training
- Data parallelism
- Pipeline model parallelism
Planned features, still in development:
- Tensor model parallelism
- Passing parameters through Gloo
- Disaster recovery
DeepStream currently relies on MPI for parameter synchronization, so OpenMPI must be installed. Do not install OpenMPI and MPICH side by side.
# Debian/Ubuntu
sudo apt install openmpi-bin libopenmpi-dev
# Arch Linux
sudo pacman -S openmpi
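Since OpenMPI and MPICH should not be installed together, it can help to check which implementation the `mpirun` on your `PATH` belongs to. The sketch below is not part of DeepStream; `classify_mpi` and `detect_mpi` are hypothetical helper names, and the matching relies on the assumption that OpenMPI's version banner contains "Open MPI" while MPICH's Hydra launcher prints "HYDRA" build details.

```shell
# Hypothetical helper: classify the text printed by `mpirun --version`.
# Assumes OpenMPI's banner contains "Open MPI" and MPICH's contains "HYDRA" or "MPICH".
classify_mpi() {
  case "$1" in
    *"Open MPI"*) echo "openmpi" ;;
    *HYDRA*|*MPICH*) echo "mpich" ;;
    *) echo "unknown" ;;
  esac
}

# Hypothetical helper: report which MPI implementation the mpirun on PATH belongs to.
detect_mpi() {
  if command -v mpirun >/dev/null 2>&1; then
    classify_mpi "$(mpirun --version 2>&1)"
  else
    echo "not-installed"
  fi
}

detect_mpi
```

If this prints anything other than `openmpi`, the OpenMPI packages above are probably missing or shadowed by another MPI installation.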
# Usage: ./build_run.sh <node num>
# Run with 1 node
./build_run.sh 1
# Run with 3 nodes
./build_run.sh 3
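When the node count comes from a variable or another script, a small guard can catch a bad value before the build starts. This is only a sketch: `is_valid_node_count` is a hypothetical helper, not something DeepStream ships, and how `./build_run.sh` itself handles invalid arguments is not documented here.

```shell
# Hypothetical helper, not part of DeepStream:
# succeeds only when $1 is a positive integer.
is_valid_node_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -gt 0 ]               # reject zero
}

# Example use (assumes ./build_run.sh is in the current directory):
# nodes=3
# is_valid_node_count "$nodes" && ./build_run.sh "$nodes"
```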