maxivhuber / nvidia-docker

Deep Learning Container Setup and Usage Guide

This guide provides instructions for setting up and using Podman containers for running deep learning applications with PyTorch and NVIDIA GPUs.

Useful Resources

Containers For Deep Learning: NVIDIA User Guide
Podman and the NVIDIA Container Toolkit: Installing Podman
Support for Container Device Interface: Running a Workload with CDI
Running PyTorch in Docker Containers with NVIDIA GPUs: NVIDIA PyTorch Notes
Run on an On-Prem Cluster: PyTorch Cluster Setup

Setup Instructions

Project Folder:
- Rename your project folder to my_project.
Environment Variables:
- Open the .env/.argfile file in the root directory.
- Set your project name as an environment variable (e.g., PROJECT_NAME=my_project).
- Set the Jupyter Lab port (e.g., JUPYTER_PORT=8000).
- Configure cluster settings (MASTER_PORT, MASTER_ADDR, WORLD_SIZE, NODE_RANK).
- Set any NCCL environment variables your cluster needs (for example NCCL_DEBUG or NCCL_SOCKET_IFNAME).
Requirements File:
- Add any necessary pip dependencies to the requirements.txt file.
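
As an illustration of the environment settings listed above, the following bash sketch writes a sample .env to a temporary file and checks that every variable named in this guide is set. The variable names come from the guide; all values (address, ports, interface name) are placeholder assumptions, and `check_env` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: create a sample .env and verify the variables named in this guide.
# All values are placeholders (assumptions), not the repository's defaults.
env_file="$(mktemp)"
cat > "$env_file" <<'EOF'
PROJECT_NAME=my_project
JUPYTER_PORT=8000
# Cluster settings: MASTER_ADDR and MASTER_PORT are the same on every node,
# NODE_RANK is different on every node.
MASTER_ADDR=192.0.2.10
MASTER_PORT=29500
WORLD_SIZE=2
NODE_RANK=0
# NCCL settings: real NCCL variables, example values only.
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=eth0
EOF

required="PROJECT_NAME JUPYTER_PORT MASTER_ADDR MASTER_PORT WORLD_SIZE NODE_RANK"

# Hypothetical helper: succeed only if the given file sets every required variable.
check_env() {
    (
        # Run in a subshell so nothing leaks into the calling shell.
        unset $required            # ignore values inherited from the caller
        . "$1"
        for name in $required; do
            [ -n "${!name}" ] || { echo "missing: $name" >&2; exit 1; }
        done
    )
}

check_env "$env_file" && echo "env ok"
```

Running the same check against your real .env before building the container can catch a missing cluster setting early.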

Usage

Starting the Container:
- Run bash build.sh to build and start the container using Podman.
Accessing Jupyter Lab:
- Connect to Jupyter Lab through http://<ip-address>:<JUPYTER_PORT>/?token=<token>
Direct File Execution:
- To execute a file, such as a Python script, directly from the terminal, use a command like the following:
  - ( source .env && podman exec -w /workspace/my_project $PROJECT_NAME-$NODE_RANK conda run --live-stream -n accelerate accelerate launch my-project.py --arg1 ../path/to/data )
- This command sources your environment variables from .env and executes the specified file inside the Podman container. The surrounding parentheses run it in a subshell, so the sourced variables do not persist in your current shell.
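
The direct-execution command above can be wrapped in a small bash function so that the script name and its arguments are the only things that change between runs. This is a minimal sketch: `build_exec_cmd` is a hypothetical helper, and it assumes the working directory inside the container is /workspace/$PROJECT_NAME, which matches the example only when PROJECT_NAME=my_project. It prints the command rather than running it, so it can be inspected first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble the podman exec command from this guide
# for a given script and its arguments, and print it.
build_exec_cmd() {
    local script="$1"
    shift
    local cmd=(
        podman exec
        -w "/workspace/${PROJECT_NAME}"      # assumed working directory
        "${PROJECT_NAME}-${NODE_RANK}"       # container name used in the guide
        conda run --live-stream -n accelerate
        accelerate launch "$script" "$@"
    )
    echo "${cmd[*]}"
}

# Normally these come from `source .env`; set here so the sketch is self-contained.
PROJECT_NAME=my_project
NODE_RANK=0
build_exec_cmd my-project.py --arg1 ../path/to/data
```

To actually run the result, the printed line can be pasted into the terminal on the node that hosts the container.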

Synchronization between Nodes

Synchronization between Nodes with Optional File Execution:
- The sync folder contains a script for synchronizing your working directory with remote nodes, essential for training on a cluster.
- The script supports start and stop actions for synchronizing and managing containers on remote nodes.
- Additionally, the sync/sync.sh command can take an optional fourth argument specifying a file/path (script or notebook) from the project directory, which will then be executed.
- Starting Synchronization and Containers:
  - Usage: bash sync/sync.sh <local_absolute_path> <remote_relative_path> start [optional_file_path].
  - For example, to start synchronization and execute a script: bash sync/sync.sh ~/my_project .sync/my_project start /scripts/my-script.py.
- Stopping Remote Containers:
  - Usage: bash sync/sync.sh <local_absolute_path> <remote_relative_path> stop.
  - For example: bash sync/sync.sh ~/my_project .sync/my_project stop.
- Configuring Sync Settings:
  - Update the sync/config.json file to include your own nodes, their SSH access details, and keys. Be sure to replace node1, node2, etc., with your actual node details.
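
The argument rules in the usage lines above can be sketched as a small bash check to run before calling sync/sync.sh. This is not the repository's script: `validate_sync_args` is a hypothetical helper, and the rule that stop takes no file argument is an assumption drawn from the usage lines, which show the optional file only with start.

```shell
#!/usr/bin/env bash
# Sketch of the documented argument rules; not the repository's sync.sh.
validate_sync_args() {
    local local_path="$1" remote_path="$2" action="$3" file="${4:-}"

    # First argument: documented as an absolute local path.
    case "$local_path" in
        /*) ;;
        *) echo "local path must be absolute: $local_path" >&2; return 1 ;;
    esac

    # Second argument: documented as a relative path on the remote node.
    case "$remote_path" in
        ""|/*) echo "remote path must be relative: $remote_path" >&2; return 1 ;;
    esac

    # Third argument: start or stop.
    case "$action" in
        start) ;;
        stop)
            # Assumption: a file to execute is only accepted with start.
            [ -z "$file" ] || { echo "stop takes no file argument" >&2; return 1; }
            ;;
        *) echo "action must be start or stop: $action" >&2; return 1 ;;
    esac
}

validate_sync_args "$HOME/my_project" .sync/my_project start /scripts/my-script.py \
    && echo "arguments look valid"
```

Note that a path such as ~/my_project is already absolute by the time a script sees it, because the shell expands the tilde before the call.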
