
ai-infra-cluster-provisioning

Overview

Purpose

The purpose of this tool is to provide a quick and simple way to provision Google Cloud Platform (GCP) compute clusters of accelerator-optimized machines.

Machine Type Comparison

Feature \ Machine    A2                       A3
Nvidia GPU Type      A100 -- 40GB and 80GB    H100 80GB
VM Shapes            Several                  8 GPUs
GPUDirect-TCPX       Unsupported              Supported
Multi-NIC            Unsupported              5 vNICs -- 1 for CPU and 4 for GPUs (one per pair of GPUs)
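
To check which of these accelerator-optimized shapes are offered in a zone you care about, one option is the following (the zone below is only an example):

# list the A2 and A3 machine types available in a given zone
gcloud compute machine-types list \
  --zones=us-central1-a \
  --filter="name~'^a[23]-'"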

Repository Content Summary

This repository contains:

  • sets of terraform modules that create GCP resources, each tailored toward running AI/ML workloads on a specific accelerator-optimized machine type.
  • an entrypoint script that will find or create a terraform backend in a Google Cloud Storage (GCS) bucket, call the appropriate terraform commands using the terraform modules and a user-provided terraform variables (tfvars) file, and upload all logs to the GCS backend bucket.
  • a docker image -- us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image -- with all necessary tools installed, which calls the entrypoint script and creates a cluster for you.
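
In rough terms, the entrypoint's backend handling resembles the sketch below; the bucket and variable names are illustrative placeholders, not the script's actual code:

# illustrative sketch only -- the real entrypoint script does this for you
gsutil ls -b "gs://${BACKEND_BUCKET}" || gsutil mb -p "${PROJECT_ID}" "gs://${BACKEND_BUCKET}"
terraform init -backend-config="bucket=${BACKEND_BUCKET}"
terraform apply -var-file="terraform.tfvars"
gsutil cp ./*.log "gs://${BACKEND_BUCKET}/logs/"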

How to provision a cluster

Prerequisites

In order to provision a cluster, the following are required:

Google Cloud Authentication

The command to authorize tools to create resources on your behalf is:

gcloud auth application-default login

The above command is:

  • recommended when using the docker image along with exposing your credentials to the container with the -v "${HOME}/.config/gcloud:/root/.config/gcloud" flag (explained below). Without this, the tool will prompt you on every invocation to authorize itself to create GCP resources for you.
  • necessary when using this repository in an existing terraform module or HPC-Toolkit blueprint.
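
To confirm that application-default credentials are in place before invoking the tool, a quick check is:

# prints a token only if application-default credentials are configured
gcloud auth application-default print-access-token > /dev/null && echo "credentials OK"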

Methods

After running through the prerequisites above, there are a few ways to provision a cluster:

  1. Run the docker image: do this if you don't have any existing infrastructure as code.
  2. Integrate into an existing terraform project: do this if you already have (or plan to have) a terraform project and would like to have the same terraform apply create this cluster along with all your other infrastructure.
  3. Integrate into an existing HPC Toolkit Blueprint: do this if you already have (or plan to have) an HPC Toolkit Blueprint and would like to have the same ghpc deploy create this cluster along with all your other infrastructure.

Build docker image to deploy cluster

Clone this repo in a Cloud Shell terminal, a GCE VM, a Cloud Workstation, or on your own machine:

git clone https://github.com/llm-on-gke/ai-infra-cluster-provisioning
cd ai-infra-cluster-provisioning

If this is the first time you are building your own image with code customizations, run the following command to create the Artifact Registry repository:

gcloud artifacts repositories create cluster-provision-dev --repository-format=docker --location=us

Then build your own image:

gcloud builds submit .

Wait a few minutes, then verify from the Artifact Registry page in the Google Cloud Console that the following docker image was created:

Artifact Registry Repo and image: cluster-provision-dev/cluster-provision-image
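
The same check can be done from the command line; replace the project ID placeholder with your own project:

# list the images in the newly created repository (my-gcp-project is a placeholder)
gcloud artifacts docker images list us-docker.pkg.dev/my-gcp-project/cluster-provision-dev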

Run the docker image

For this method, all you need (in addition to the above requirements) is a terraform.tfvars file in your current directory (user-generated or copied from an example such as a3-mega; a sketch appears after the flag explanations below) and the ability to run docker. In a terminal, run:

# create/update the cluster
docker run \
  --rm \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  create a3-mega mig-cos

# destroy the cluster
docker run \
  --rm \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  destroy a3-mega mig-cos

Quick explanation of the docker run flags and arguments (in the same order as above):

  • -v "${HOME}/.config/gcloud:/root/.config/gcloud" exposes gcloud credentials to the container so that it can access your GCP project.
  • -v "${PWD}:/root/aiinfra/input" exposes the current working directory to the container so the tool can read the terraform.tfvars file.
  • create/destroy tells the tool whether it should create or destroy the whole cluster.
  • a3-mega specifies which type of cluster to provision -- this will influence mainly machine type, networking, and startup scripts.
  • mig-cos tells the tool to create a Managed Instance Group and start a container at boot.
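
As mentioned above, a terraform.tfvars for this workflow might look roughly like the following sketch; the variable names are placeholders -- copy the real file from the a3-mega example rather than typing these:

# hypothetical sketch only -- use the a3-mega example tfvars as the source of truth
cat > terraform.tfvars <<'EOF'
project_id      = "my-gcp-project"      # placeholder
region          = "us-central1"         # placeholder
resource_prefix = "my-a3-mega-cluster"  # placeholder
EOF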

Integrate into an existing terraform project

For this method, you need to install terraform. Examples of usage as a terraform module can be found in the main.tf files in any of the examples -- a3-mega. Cluster provisioning then happens the same as with any other terraform project:

# assuming the directory containing main.tf is the current working directory

# create/update the cluster
terraform init && terraform validate && terraform apply -var-file="terraform.tfvars"

# destroy the cluster
terraform init && terraform validate && terraform apply -destroy
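
As a rough illustration of what such a main.tf might contain -- the module source path and input names below are placeholders, so take the real ones from the a3-mega example:

# hypothetical sketch only -- mirror the a3-mega example's main.tf in practice
cat > main.tf <<'EOF'
module "a3_mega_cluster" {
  # placeholder source path; point this at the matching module in this repository
  source          = "github.com/llm-on-gke/ai-infra-cluster-provisioning//terraform/modules/cluster/mig-cos"
  project_id      = "my-gcp-project"      # placeholder
  resource_prefix = "my-a3-mega-cluster"  # placeholder
  region          = "us-central1"         # placeholder
}
EOF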

Integrate into an existing HPC Toolkit Blueprint

For this method, you need to build ghpc. Examples of usage as an HPC Toolkit Blueprint can be found in the blueprint.yaml files in any of the examples -- a3-mega. Cluster provisioning then happens the same as with any other blueprint:

# assuming the ghpc binary and blueprint.yaml are both in
# the current working directory

# create/update the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc deploy a3-mega-mig-cos

# destroy the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc destroy a3-mega-mig-cos
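
A blueprint for this flow might be shaped roughly like the sketch below; the module id, source path, and settings are placeholders, so start from the a3-mega example blueprint.yaml:

# hypothetical sketch only -- start from the a3-mega example blueprint.yaml
cat > blueprint.yaml <<'EOF'
blueprint_name: a3-mega-mig-cos
vars:
  project_id: my-gcp-project       # placeholder
  deployment_name: a3-mega-mig-cos
  region: us-central1              # placeholder
deployment_groups:
  - group: cluster
    modules:
      - id: a3-mega-cluster
        # placeholder source path; use the one from the example blueprint
        source: github.com/llm-on-gke/ai-infra-cluster-provisioning//terraform/modules/cluster/mig-cos
EOF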

