Giter Club home page Giter Club logo

odin's Introduction

Odin

About

Odin is a lightweight framework for automating machine learning workflows with Kubernetes. The codebase is compact and is designed to orchestrate and run ML workflows with a simple Git-backed configuration to describe processing pipelines. Odin provides a simple API for launching and monitoring jobs and resources over a cluster. It is a minimalistic effort, built under the DRY philosophy, and it relies on industrial strength tools such as Git, PostgreSQL and Kubernetes underneath. It also optionally supports (but does not require) commonly used Kubernetes operators for distributed training including a la carte operators provided by the Kubeflow and PyTorch Elastic projects, to make it easy to train across multiple devices or machines. It can be run in Google Kubernetes Engine in the cloud, or on local Kubernetes clusters, providing a simple and unified interface for scheduling ML jobs.

We built Odin with simplicity and reproducibility in mind. You define your workflows declaratively in a compact YAML configuration language and submit them to Odin with a single command or HTTP request. Training machine learning models at scale is already challenging -- the tooling for running these jobs should be as simple and transparent as possible. Odin is completely written in Python and communicates with Kubernetes using the official API. It is very hackable too -- the code provides a native Python tier, allowing developers to embed the graph executor directly. It also provides a thin, lightweight WebSocket tier which allows the development of alternative sub-systems from other programming languages. Finally, it offers a simple HTTP layer that provides full access to user, job and pipeline management. We also have Python client code for HTTP and WebSockets to make integration straightforward at any level.

Design

Odin Core

At the lowest level, the core of Odin provides Python code for embedding a directed acyclic graph (DAG) scheduler in python to orchestrate containers via the Kubernetes API. There are several supported resources, including Pods and additional CRDs can be extended via sub-classing. These resources can be orchestrated via a simple YAML file, and each step in the pipeline has access to the Odin Data store, which can be either MongoDB or PostgreSQL backend.

While the core code is all written in Python, we wanted to provide an agnostic interface for scheduling to higher layers, facilitating polyglot architectures handling the more advanced system services such as user management, resource access, and Restful APIs. To enable higher level building blocks from the core, we provide a very simple WebSocket service tier in the core, which makes it easy, for example, to write a node.js server that can access the core functionality, while providing higher level management functions.

Midgard and Odin HTTP

HTTP API

The HTTP API for Odin provides a REST interface to user and resource management with JWT authentication. This layer extends the base with a Postgres user database and user management functionality, as well as a documented API for scheduling worfklows within Odin. Additionally, the web-service tier is git-backed making it easy to communicate pipeline updates via 3rd party tooling while at the same time ensuring an industrial grade versioning system for pipelines that can connect easily with CI/CD solutions.

Midgard

The midgard DaemonSet, provides node-level information concerning devices like GPUs as well as system resources to the HTTP API which aggregates this information and makes it available via a central endpoint that can be queried within the cluster. This information makes it easy to get cluster-level granular information including information that would normally be retrieved on single systems using nvidia-smi.

Integrating with a metrics service

Odin has been used as a framework for large-scale training within Interactions and autoML, where a set of hyper-parameters are explored and logged via a metrics server. While, in theory, any metrics server that can log and retrieve results from experiments for a given run would work, xpctl is a service that is known to work well within the Odin ecosystem. It is an extensible, task-level metric service that (like Odin) is Mongo or Postgres-backed, provides a simple HTTP API for logging and retrieving experiments, and is containerized with images on Dockerhub.

We will add more documentation on using a metrics service for continuous deployment and autoML in the near future.

Setup

There are sample setup Kubernetes config YAMLs and a detailed tutorial for setting up Odin from scratch on microk8s

Some sample pipelines can be found here.

odin's People

Contributors

blester125 avatar dpressel avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.