Giter Club home page Giter Club logo

orion's Introduction

Orion

Orion is a fine-grained scheduler for interference-free GPU sharing across ML workloads. It is based on our EuroSys'24 paper "Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications".

Table of Contents

Introduction

Orion is a fine-grained, interference-free scheduler for GPU sharing across ML workloads. We assume one of the clients is high-priority, while the rest of the clients are best-effort.

Orion intercepts CUDA, CUDNN, and CUBLAS calls and submits them into software queues. The Scheduler polls these queues and schedules operations based on their resource requirements and their priority. See ARCHITECTURE for more details on the system and the scheduling policy.

Orion expects that each submitted job has a file where all of its operations, along with their profiles and Straming Multiprocessor (SM) requirements are listed. See PROFILE for detailed instructions on how to profile a client applications, and how to generate the profile files.

Example

We have set up a docker image: fotstrt/orion-ae with all packages pre-installed. Alternatively, follow the instructions on the 'setup' directory, and check INSTALL, to install Orion and its dependencies.

See PROFILE to generate profiling files for each workload. Create a json file containing all the info for the workloads that are about to share the GPU. See examples under 'artifact_evaluation/example'.

The file 'launch_jobs.py' is responsible for spawning the scheduler and the application thread(s).

Project Structure

> tree .
├── profiling                     # Scripts and instructions for profiling
│   ├── benchmarks                # Scripts of DNN models for profiling
│   ├── postprocessing            # Scripts for processing of profile files
└── src                           # Source code
│   ├── cuda_capture              # Code to intercept CUDA/CUDNN/CUBLAS calls
│   └── scheduler                 # Implementation of the scheduling policy
│   └── scheduler_frontend.py     # Python interface for the Orion scheduler
└── benchmarking                  # Scripts and configuration files for benchmarking
|   ├── benchmark_suite           # Training and inference scripts
|   ├── model_kernels             # Files containing profile information for the submitted models
└── related                       # Some of the related baselines: MPS, Streams, Tick-Tock
└── artifact_evaluation           # Scripts and instructions for artifact evaluation
|   ├── example                   # Basic example to test Orion functionality
|   ├── fig7                      # Scripts to reproduce Figure 7 of the paper
|   ├── fig10                     # Scripts to reproduce Figure 10 of the paper
└── setup                         # Instructions and scripts to install Orion's prerequisites.

Hardware Requirements

Orion currently supports NVIDIA GPUs.

Hardware Configuration used in the paper

For the experiments presented in the paper, we evaluated Orion in Google Cloud Platform VMs with the following configurations:

  • n1-standard-8 VM (8 vCPUs, 30GB of DRAM) with an V100-16GB GPU, with CUDA 10.2
  • a2-highgpu-1g VM (12 vCPUs, 85GB of DRAM) with an A100-40GB GPU, with CUDA 11.3

In both cases, the machines have Ubuntu 18.04.

Installation

see INSTALL.

Debugging

see DEBUGGING.

Paper

If you use Orion, please cite our paper:

@inproceedings {eurosys24orion,
  author = {Strati Foteini and Ma Xianzhe and Klimovic Ana},
  title = {Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications},
  booktitle = {},
  year = {2024},
  isbn = {},
  address = {},
  pages = {},
  url = {},
  doi = {},
  publisher = {Association for Computing Machinery},

}

orion's People

Contributors

fotstrt avatar xianzhema avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.