This guide provides instructions for setting up and using Podman containers for running deep learning applications with PyTorch and NVIDIA GPUs.
- Containers For Deep Learning: NVIDIA User Guide
- Podman and the NVIDIA Container Toolkit: Installing Podman
- Support for Container Device Interface: Running a Workload with CDI
- Running PyTorch in Docker Containers with NVIDIA GPUs: NVIDIA PyTorch Notes
- Run on an On-Prem Cluster: PyTorch Cluster Setup
- Project Folder:
  - Rename your project folder to `my_project`.
- Environment Variables:
  - Open the `.env`/`.argfile` file in the root directory.
  - Set your project name as an environment variable (e.g., `PROJECT_NAME=my_project`).
  - Set the Jupyter Lab port (e.g., `JUPYTER_PORT=8000`).
  - Configure cluster settings (`MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, `NODE_RANK`).
  - Set NCCL environment variables.
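As an illustration of the variables listed above, a `.env` file for the first node of a two-node cluster might look like the sketch below. Every value is a placeholder assumption: the address, ports, and network interface are hypothetical. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL variables, shown only as examples of what one might set.

```shell
# Hypothetical .env for node 0 of a two-node cluster; adjust every value.
PROJECT_NAME=my_project
JUPYTER_PORT=8000

# Cluster settings (placeholder values).
# MASTER_ADDR: address of the rank-0 node (assumed).
MASTER_ADDR=192.168.1.10
# MASTER_PORT: any free port reachable from all nodes (assumed).
MASTER_PORT=29500
# WORLD_SIZE: world size as expected by your launcher (assumed to be 2 here).
WORLD_SIZE=2
# NODE_RANK: 0 on the master node, a different rank on each other node.
NODE_RANK=0

# NCCL settings (examples only; the interface name is an assumption).
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=eth0
```

Comments are kept on their own lines so the file stays valid whether it is sourced by a shell or read as a plain key-value file.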
- Requirements File:
  - Add any necessary pip dependencies to the `requirements.txt` file.
- Starting the Container:
  - Run `bash build.sh` to build and start the container using Podman.
- Accessing Jupyter Lab:
  - Connect to Jupyter Lab at `http://<ip-address>:<JUPYTER_PORT>/?token=<token>`.
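The URL pattern above can be assembled from the values in `.env`. The helper below is a minimal sketch: `jupyter_url` is a hypothetical name, and the fallback port of 8000 is an assumption taken from the example value in this guide.

```shell
# Hypothetical helper: print the Jupyter Lab URL for a given host and token.
# JUPYTER_PORT is expected to come from .env; 8000 is only a fallback.
jupyter_url() {
    host="${1:-}"
    token="${2:-}"
    printf 'http://%s:%s/?token=%s\n' "$host" "${JUPYTER_PORT:-8000}" "$token"
}
```

After sourcing `.env`, calling `jupyter_url <ip-address> <token>` prints the address to open in a browser.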
- Direct File Execution:
  - To execute a file, such as a Python script, directly from the terminal, use a command like the following:
    `( source .env && podman exec -w /workspace/my_project $PROJECT_NAME-$NODE_RANK conda run --live-stream -n accelerate accelerate launch my-project.py --arg1 ../path/to/data )`
  - This command sources your environment variables from `.env` and executes the specified Python script or Jupyter notebook inside the Podman container.
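The command above builds the container name from `PROJECT_NAME` and `NODE_RANK`, so an empty variable would produce a wrong name. One way to guard against that is a small pre-flight check, sketched below. `require_vars` is a hypothetical helper and not part of this project.

```shell
# Hypothetical pre-flight check: fail early if a variable needed by the
# podman exec command is unset or empty.
require_vars() {
    for name in "$@"; do
        # Indirect lookup via eval keeps this portable to plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            return 1
        fi
    done
}
```

It could be chained in front of the real command, for example `source .env && require_vars PROJECT_NAME NODE_RANK && podman exec ...`, so the container is only contacted when both values are present.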
- Synchronization between Nodes with Optional File Execution:
  - The `sync` folder contains a script for synchronizing your working directory with remote nodes, which is essential for training on a cluster.
  - The script supports `start` and `stop` actions for synchronizing and managing containers on remote nodes.
  - Additionally, the `sync/sync.sh` command can take an optional fourth argument specifying a file path (script or notebook) from the project directory, which will then be executed.
  - Starting Synchronization and Containers:
    - Usage: `bash sync/sync.sh <local_absolute_path> <remote_relative_path> start [optional_file_path]`.
    - For example, to start synchronization and execute a script: `bash sync/sync.sh ~/my_project .sync/my_project start /scripts/my-script.py`.
  - Stopping Remote Containers:
    - Usage: `bash sync/sync.sh <local_absolute_path> <remote_relative_path> stop`.
    - For example: `bash sync/sync.sh ~/my_project .sync/my_project stop`.
  - Configuring Sync Settings:
    - Update the `sync/config.json` file to include your own nodes, their respective SSH access details, and keys. Be sure to replace `node1`, `node2`, etc., with your actual node details.
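The argument order for `sync/sync.sh` described above can be checked before the script is invoked. The function below is a sketch under the assumption that the usage lines are the full contract (absolute local path, relative remote path, action `start` or `stop`). `check_sync_args` is a hypothetical name and does not exist in the `sync` folder.

```shell
# Hypothetical argument check mirroring the documented sync/sync.sh usage.
check_sync_args() {
    local_path="${1:-}"
    remote_path="${2:-}"
    action="${3:-}"

    # The first argument is documented as an absolute path.
    case "$local_path" in
        /*) ;;
        *) echo "local path must be absolute: $local_path" >&2; return 1 ;;
    esac

    # The second argument is documented as a relative path.
    case "$remote_path" in
        "") echo "remote path is required" >&2; return 1 ;;
        /*) echo "remote path must be relative: $remote_path" >&2; return 1 ;;
    esac

    # Only the start and stop actions are described in this guide.
    case "$action" in
        start|stop) ;;
        *) echo "action must be start or stop: $action" >&2; return 1 ;;
    esac
}
```

A call such as `check_sync_args ~/my_project .sync/my_project start && bash sync/sync.sh ~/my_project .sync/my_project start` would then run the sync only when the arguments have the expected shape. Note that `~` is expanded by the shell to an absolute path before the function sees it.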