Giter Club home page Giter Club logo

engineering.labs's Introduction

Operationalizing Pytorch Model with Mlflow

Context

MLOps is a new buzzword in Software Development. It goes beyond DevOps by adding a brand new perspective over data and models. In fact, Any reasonable Machine Learning solution must deal with assets like ML Models, datasets, features, etc. Therefore, a simple ML application may turn into a complex workflow in order to ensure the best model runs in Production. Besides that, developing ML models demands a relevant time assessing and comparing parameters, algorithms and model performance. MLOps goal is to address these concerns by tying tools and ML frameworks in order to quickly deploy models to Production (and keep them there).

PyTorch is one of the main machine learning libraries used in applications such as computer vision and natural language processing. It is catching up with TensorFlow accelerating the path from research prototyping to production deployment. Recently, Pytorch community announced a number of technical contributions to enable end-to-end support for MLflow usage with PyTorch.

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. One may interact with its components by command line interface or well-known APIs (Python, R, Java and REST), building a workflow capable of handling development and production activities. Actually, Its simple design allows small teams (or a lone data scientist as well) to rapidly set up a working environment and execute a whole Experimentation and Assesment stage before going to formal software development.

In this lab, MLOps community wants to test and provide feasible ways to build an end-to-end model lifecycle with Pytorch assets and Mlflow.

Team

  • Alexey Naiden
  • John Savage
  • Michel Vasconcelos
  • Varuna Jayasiri

Repo Organization

The following list describes the most relevant folders in which this lab is organized. Some folders in this repo are self-explanatory (like imgs) and therefore aren't mentioned in the list.

  • src: Main source code folder. It comprises all software produced during this lab. Src itself is organized into subfolders according to which components it belongs to;
  • docs: Binary documentation, drawings and reports produced during this lab;

Proposal

Design Overview

Our team intend to orchestrate common unix management toolkits (shell commands, ssh, etc), MLFlow CLI and MLFlow components to build a seamless end-to-end pipeline. It should comprehend Continuous Integration (CI), Continuous Delivery (CD) and Continuous Traning (CT). The following figure depicts the proposed architecture.

Proposed architecture

Nodes are bare-metal or virtual machines that host a part of the pipeline process. ML components are ML-based code that should be trained, tested and deployed by end-to-end pipeline.They must be packaged as MLFlow projects. Training nodes host the training process which runs in Training containers, running dockerized images built with MLFlow and necessary libs to train ML components. Each ML component defines its needs, that is, libs and external dependencies. To ensure traceability, provenance and model configuration management, one may track hyperparameters, training steps, test results, extra data and the model itself into MLFlow Tracking Server hosted in the Tracking Node. All these items, metadata and trained models, are stored in external resources: RDBMS and Cloud Storage Services. Trained models are embedded into dockerized Serving Instances, containers running Torchserve providing access to the models via HTTP.

In order to provide feedback loop, a Monitor Node checks the status and performance meters, issuing new training cycles or a full rebuild process. Control or Operator Machine is the VM used by Configuration Manager or Operation Engineer to follow up the process or issue commands to the nodes in the pipeline. This process may run through many different perimeters, however this solution comprises Development and Production stages only.

Architectural Components and Decisions

In this section, we record the (most relevant) design decisions, components selection and the rationale around them.

# Subject Description Affected Items Solution Rationale
1 Computing ML needs processing power as well as GPU. Our architecure relies on distributed elements, thus needing a reasonable number of physical or virtual machines (some with GPU-power) All Nodes and Machines GCP Compute Service It provides $ 300,00 of (enough) free services and we can also spin some gpu-powered machines
2 Durable Data Storage Every serious solution needs to store durable data Tracking Server Google Cloud SQL A Postgres Database hosted by Google Cloud SQL is reliable and fast enough for our needs
3 Large and Binary Data Storage ML projects eventually need to store models and extra large files, This type of data doesn't fit in RDBMS Tracking Server
Artifact Store
Google Cloud Storage Same as #1
4 Shipment and Deployment There are many deployable assets as well as acessory tools that need specific environment and SO libraries to run. Just installing them into a machine isn't a viable option Training Node
Google Cloud Services
Docker It's the de facto pattern for shipping and deploying things
5 Lib Management Python has a powerful but sometimes confusing and conflicting library ecossystem. Our components may depend on conflicting libs and that will lead to problems in Production All Nodes and Machines Pip Pip is heavily used by Python community. It also work quite well with isolation provided by Docker images
6 Remote Access Sometimes we need to issue remote commands or operate a remote machine All Nodes and Machines SSH Hey... It's SSH. There's no need to rationale :-P
7 CI/CD & Workflow Management We need to seamlessly integrate and deploy ML stuff. We also have to coordinate the process between nodes All Items Github Actions We're using GitHub, it's heavily used by community and comes off-the-shelf
8 Project Metadata We need to describe the ML Component, its structure and how it should run Source Code Repository
Training Node
MLFlow Project Definition Non-functional Requirement we must adhere
9 ML Lifecycle Management We must track the ML Process All Nodes and Machines MLFlow Non-functional Requirement we must adhere
10 ML Framework We are building ML Components. Therefore, we must use a ML framework to train our models Training Node
Serving Instances
Pytorch Non-functional Requirement we must adhere

Workflow

A common ML development workflow should comprise Experimentation, Development and Production stages. While Development and Production are common stages in Software Development Industry, Experimentation brings trust to the model being developed. That's a core activity for any serious Machine Learning development. During Experimentation, Data Science teams may run various scenarios, compare key metrics and realize their progress with model performance before going to Production pipeline. Although MLFlow provides an easy path to that stage, our emphasis is on the release process. Experimentation may be addressed in a future work.

The following workflow depicts a common development scenario for our pipeline: a production web application uses a ML Model to provide business value.

Solution Workflow

The first stage is Development Workflow. It concerns the continuous integration of new ML code and Dataset changes into a Development Server. Here follows the steps:

  1. Commited code or changes in the Dataset fires a training execution. These events trigger Github Actions (GH Actions);

  2. ML pipeline starts. we are considering that previous data activities (Extraction, Validation and Preparation) were already executed. GH Actions issues a gcloud command to Training Node, so the ML Component code gets up to date. The ML Component itself describes its structure (MLFlow project file) and how it should be run (Dockerfile). The MLFlow CLI starts the training process by issuing: mlflow run. The training happens in a GPU-powered Docker container configured with the Tracking Server (MLFLOW_TRACKING_URI) . At the end of this step, there's a new trained model with its respective metadata;

  3. The Training Node sends the trained model, its extra files, and metadata to the Model Registry and Tracking Server respectively. In fact, these assets are physically stored in a Cloud Storage Service acessible from Internet;

  4. The development workflow ends by deploying the trained model in Google Cloud Run Service. GH Actions creates a new image that contains the trained model (mlflow create deployment) which is tagged with the MLflow model name and version and then is pushed to Google Container Registry. This image is deployed to a development Cloud Run deployment, where various acceptance tests can be performed to ensure the image is ready for production.

The decision to promote a model to Production happens after some experiments. Model version X is selected. This external event triggers the Production Workflow:

  1. Operations and Product Management Team decides to deploy a model to Production environment. A release command is issued, specifying a model name and version to deploy, triggering GitHub Actions. The runner pulls and configures the corresponding Docker image and deploys this to the Production Cloud Run environment

  2. The Monitor node tracks how model is performing and periodically starts a local training pipeline with Production data in order to avoid Model Erosion;

  3. The model realizes the model doesn't perform anymore (Model Erosion) or there are significant changes in the data (changes in the data semantic, new relevant fields, etc). It start a feedback loop triggering a full model rebuild process.

License

MIT

engineering.labs's People

Contributors

anayden avatar michelav avatar savagej avatar ivannardini avatar aponte411 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.