cloud-gpu-reliability

After encountering some reliability issues with on-demand provisioning of GPU resources, I put together this benchmarking harness to test AWS vs. GCP availability.

To maximize the statistical and practical significance of results:

  • Each provisioning request uses the same GPU configuration (currently a T4). GCP provides more flexibility here since its accelerators can be attached to any hardware configuration, whereas AWS only provisions these more powerful GPUs on designated VM configurations.
  • Both providers are provisioned at approximately the same time, roughly 48 times a day. Each launch is spawned in a separate thread because async support isn't yet available for the official AWS and GCP Python APIs.
  • Trial times are chosen at random throughout the day rather than on a fixed schedule. This attempts to account for daily variability in demand from jobs that don't fit a set schedule.
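
The scheduling and threading described above can be sketched roughly as follows. This is a minimal illustration rather than the harness's actual code: `sample_trial_offsets`, `run_trial`, and the zero-argument launch callables are hypothetical names, and uniform sampling over the day is an assumption about what the random search looks like.

```python
import random
import threading
from datetime import timedelta

TRIALS_PER_DAY = 48  # "roughly 48 times a day", per the list above


def sample_trial_offsets(n=TRIALS_PER_DAY, seed=None):
    """Pick n random offsets within a 24-hour day, sorted ascending.

    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    seconds = sorted(rng.uniform(0, 24 * 60 * 60) for _ in range(n))
    return [timedelta(seconds=s) for s in seconds]


def run_trial(providers):
    """Start one launch per provider at approximately the same time.

    `providers` maps a provider name to a zero-argument callable, a
    hypothetical stand-in for the real AWS/GCP provisioning code.
    Returns {name: return value, or the exception that was raised}.
    """
    results = {}
    lock = threading.Lock()
    barrier = threading.Barrier(len(providers))

    def worker(name, launch):
        barrier.wait()  # hold every thread until all are ready, then release together
        try:
            outcome = launch()
        except Exception as exc:  # record the failure instead of killing the thread
            outcome = exc
        with lock:
            results[name] = outcome

    threads = [
        threading.Thread(target=worker, args=(name, launch))
        for name, launch in providers.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The barrier is there so that neither provider gets a head start from thread start-up order; a failed launch is stored as a result so it can be counted as an availability miss.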

At the risk of stating the obvious: running this locally will create cloud resources that you'll have to pay for while they run. This package takes every care to clean up the resources it creates, but run it at your own risk.

Getting Started

This repo manages dependencies with Poetry. A regular pip install -e . should work fine but might not pull in the dependency versions that have been tested.

poetry install

You'll also have to configure a .env file with your AWS and GCP credentials before running anything. This should be relatively straightforward given the key names that are specified in Settings. The GCP service key has to be base64-encoded, so you'll have to do something like:

cat ~/personal-gcp-service-key.json | base64
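
For reference, the same round trip can be sketched in Python. The helper names here are hypothetical and this is not necessarily how Settings reads the value; it only illustrates encoding the JSON key to a single line and decoding it back. Note that GNU base64 wraps its output at 76 characters unless given -w 0, which matters if the value has to fit on one line of the .env file.

```python
import base64
import json


def encode_service_key(raw_json: str) -> str:
    """Return a single-line base64 string for a service-key JSON document."""
    json.loads(raw_json)  # fail early if the input is not valid JSON
    return base64.b64encode(raw_json.encode("utf-8")).decode("ascii")


def decode_service_key(encoded: str) -> dict:
    """Decode a base64 string back into the service-key dictionary.

    b64decode ignores characters outside the base64 alphabet by default,
    so line breaks left over from a wrapped encoding are tolerated.
    """
    return json.loads(base64.b64decode(encoded))
```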

When you're ready to run the trial:

docker-compose up

Errors

GCP:

Operation type [insert] failed with message "The zone 'projects/{project}/zones/{zone}' does not have enough resources available to fulfill the request. Try a different zone, or try again later."
Resource exhausted (HTTP 429): ZONE_RESOURCE_POOL_EXHAUSTED
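
When tallying results, it helps to separate capacity failures like the two above from unrelated errors such as bad credentials or quota misconfiguration. A minimal sketch, assuming the error text is available as a plain string; the function name and the marker list are illustrative, built only from the messages shown in this section:

```python
# Substrings taken from the GCP error messages listed above.
CAPACITY_MARKERS = (
    "does not have enough resources available",
    "ZONE_RESOURCE_POOL_EXHAUSTED",
)


def is_capacity_error(message: str) -> bool:
    """Return True if the message looks like a GPU capacity shortage."""
    return any(marker in message for marker in CAPACITY_MARKERS)
```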
