
MLOps

  • Taking machine learning models to production, then maintaining & monitoring them.

  • You should have Microsoft VSCode and Docker Desktop installed and running on your local machine. To install Docker Desktop, follow the Docker installation guidelines for your operating system.

MLOps Workflow


  • Steps included in the successful creation of an MLOps project:

    1. Data management & analysis

    2. Experimentation

    3. Solution development & testing

    4. Deployment & serving

    5. Monitoring & maintenance

Coding Guidelines


  • Guidelines on writing code for the project:

    1. Organize code into clean, reusable units 🔧 - functions, classes & modules. 💡

    2. Use git for code versioning.

    3. Follow style guidelines: write comments, docstrings, type annotations.

    4. Keep requirements.txt and Dockerfile updated.

    5. Write tests for your code.

Git basics


You should have Git set up and running on your local machine.

Git Setup

Configuring user information used across all local repositories:

git config --global user.name "[firstname lastname]"
git config --global user.email "[valid-email]"

Init

Initializing and cloning repositories

git init
git clone [url]

Useful commands

  1. Check current status

    git status
  2. Add files for versioning and tracking

     git add <f_name>
  3. Commit staged content

    git commit -m "[description]"
  4. List all branches in git. A * will appear next to the currently active branch.

    git branch
  5. Create a new branch and switch to it.

    git checkout -b "[branch-name]"
  6. Add a remote Git repository URL under an alias.

    git remote add "[alias]" <URL>
  7. Fetch down all the branches from that Git remote.

    git fetch "[alias]"
  8. Merge a remote branch into your current branch to bring it up to date.

    git merge "[alias]/[branch]"
  9. Transmit local branch commits to the remote repository branch

    git push "[alias]" "[branch]"
  10. Fetch and merge any commits from the tracking remote branch.

    git pull

Project Organization


We are going to use the PyScaffold Cookiecutter Data Science project template.

PyScaffold Setup


  1. Install PyScaffold

    pip install pyscaffoldext-dsproject
  2. Install pre-commit

    pip install pre-commit
  3. Initialize an empty project with the cookiecutter data science project structure

    putup --dsproject <Name of your project>

Structure of project directory/repository


├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or `tox -e build` to build.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   └── classify_covid      <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.

Pipelines in Machine Learning


A series of successive (and sometimes parallel) steps in which we process data.

General Steps

  1. Extracting, transforming and loading data.

  2. Creating a test/train split.

  3. Model Training

  4. Model Evaluation

  • Example: Suppose that you have 2 parameter settings - epochs = 10 and epochs = 20. You would probably want to run steps 1 & 2 only once for both settings, and steps 3 & 4 once for each setting. Having a pipeline makes this work easy.

  • Simple ML pipeline 🔧

    flowchart LR
    
    A(Load Data) --> B(Featurize)
    B --> C{Data Split}
    C -->|Train Data| D[Train Model]
    C -->|Test Data| E[Evaluate Model]
    
    

For production-ready projects, we need to convert Jupyter notebooks into .py modules.

  1. Makes versioning easier and allows us to automate and build pipelines.

  2. Keep parameters in a config file (config.yaml). We will use Hydra to load these configuration files. For example,

    params:
        batch_size: 32
        learning_rate: 0.01
        training_epoch: 30
        num_gpus: 4
  3. Move reusable code into .py modules, e.g. create visualize.py to contain visualization tasks.

  4. Create a .py module for each computation task (stage).

  5. Structure .py modules so that they can run in both modes: Jupyter & terminal.

💡 Example of structuring Jupyter Notebook code to a .py module

  • Converting dataset loading in a Jupyter notebook to a Python script. We will use hydra.cc for configuration loading later; for now, we are using the yaml library.

  • This example showcases the use of argparse to pass arguments to the module from the terminal.

    dataset_load.py

    from typing import Text
    import yaml
    import argparse
    
    def data_load(config_path: Text) -> None:
        cfg = yaml.safe_load(open(config_path))
        raw_data_path = cfg['data_load']['raw_data.path']
        ...
        ...
        data.to_csv(cfg['dataset_processed_path'])
    
    if __name__ == '__main__':
        args_parser = argparse.ArgumentParser()
        args_parser.add_argument('--config', dest='config', required=True)
        args = args_parser.parse_args()
    
        data_load(config_path = args.config)
  • To import this function into a Jupyter notebook: from dataset_load import data_load, then pass the config path to the function.

  • To run it from the terminal, change directory to the project root and execute: python -m src.stages.data_load --config=params.yaml

To build an ML pipeline, create modules for each stage like the one above, then run those modules sequentially, as sketched below.
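
A minimal sketch of a driver script that runs stages in order (the train and evaluate stage modules are hypothetical placeholders; substitute the stage modules of your own project):

# run_pipeline.py - run all pipeline stages sequentially.
# The imported stage modules below are placeholders for illustration only.
from src.stages.data_load import data_load
from src.stages.train import train
from src.stages.evaluate import evaluate

CONFIG_PATH = "params.yaml"

if __name__ == "__main__":
    data_load(config_path=CONFIG_PATH)   # extract, transform & load data
    train(config_path=CONFIG_PATH)       # train the model
    evaluate(config_path=CONFIG_PATH)    # evaluate on the test split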

DVC - Data Version Control


DVC is an open source version control system for ML projects. It will be used for

  1. Experiment management - creating pipelines, tracking metrics, parameters and dependencies.

  2. Data Versioning - Versioning data as we version codes using git.

Installing DVC - pip install dvc[all]

Initializing DVC - dvc init. This creates a .dvc folder containing metadata about the directory. You must put DVC under git control - git add . && git commit -m "Init DVC"

Automating pipelines with DVC 🛠️

  • Running stages in sequence manually can be cumbersome and time-consuming. DVC helps in organizing stages into a pipeline.
  • Stages might depend on parameters, outputs of other stages, and other dependencies. DVC tracks all of them and re-runs only the stages where a change is detected. (💡 Remember the example of the two epoch settings?)

DVC builds a dependency graph (a directed acyclic graph) of the stages to determine the order of execution and saves the stages in a dvc.yaml file. To view the graph - dvc dag

Adding stages to the DVC pipeline - dvc stage add

To add a stage to DVC pipeline, execute the following command.

dvc stage add -n <name> \ # Name: Name of the stage of pipeline
    -d <dependencies> \   # Dependencies: files(to track) on which processing of stage depends.
    -o <outputs> \        # Outputs: outputs of stage. DVC tracks them for any external change.
    -p <parameters> \     # Parameters: parameters in the config file to track for changes.
        command           # Command to execute on execution of this stage of pipeline

Example : Adding data_load.py module as a stage in DVC pipeline

dvc stage add -n data_load \
    -d src/data_load.py \
    -o data/iris.csv \
    -p data_load \
    python -m src.data_load --config=params.yaml

Structure of dvc.yaml file

stages:
    stage1: #Name of stage
        cmd: <Command to execute>
        deps: <Dependencies>
        params: <Parameters>
        outs: <Outputs>
    ...

You can manually add stages or make changes to stages in the dvc.yaml file.
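
For instance, the data_load stage added in the example above would appear in dvc.yaml roughly as follows (a sketch; exact paths and parameter names depend on your project):

stages:
    data_load:
        cmd: python -m src.data_load --config=params.yaml
        deps:
            - src/data_load.py
        params:
            - data_load
        outs:
            - data/iris.csv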

Running/Reproducing pipelines - dvc repro

After adding all stages to pipeline, execute:

dvc repro
git add .
git commit -m "Description"

DVC will run the pipeline and start monitoring all the parameters, dependencies and outputs specified. The next time you execute dvc repro:

  1. If a change to a stage's dependencies is detected, DVC runs the stages affected by that change. It won't re-run the unaffected stages.

  2. Before running a stage, it deletes all outputs of that stage.

  3. DVC then follows the pipeline downstream and reproduces the dependent stages.

To reproduce a single stage:

dvc repro -s <stage_name> # add -f to force execution

Versioning data and models with DVC

Why we need data versioning:

  1. Reproducible ML experiments require versioned data, models & artifacts.

  2. Meeting regulatory compliance & ethical AI requirements (e.g. in health & finance).

  3. Data processing takes a long time, resources are expensive, and results need to be deterministic and reproducible.

  4. We should not have to reproduce the same data repeatedly.

How data versioning works - reflinks


Use git to version code, dvc to version data.

DVC commands - dvc add, dvc push, dvc pull

  1. Add a file/folder to DVC:

    dvc add <file/folder> # creates a reflink to the cache for the added file
  2. Setting up remote storage: either create a local remote storage (dummy remote) or add S3, Google Drive, Azure Blob, etc.

    dvc remote add -f "<name of remote>" </>

    </> : /tmp/dvc for local storage

    </> : gdrive://<folder_id> for Google Drive

  3. To push data to remote or pull from remote:

    dvc push/pull

Tracking changes & switching between versions - dvc status & dvc checkout

  1. To check the status of files tracked by DVC:

    dvc status # returns any changes made to files tracked by DVC
  2. To switch between versions (see https://dvc.org/doc/command-reference/checkout):

    dvc checkout

Data access in DVC - dvc list, dvc get, dvc import

  1. To list project contents, including files, models, and directories tracked by DVC and by Git:

    dvc list "<URL>"
  2. To download data without keeping track of changes with the remote:

    dvc get "<URL>"
  3. To download data and keep track of changes:

    dvc import "<URL>"
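
If you prefer to read DVC-tracked data from Python code rather than the CLI, DVC also ships a small Python API (dvc.api). A minimal sketch, where the repository URL, file path and revision are placeholders:

import dvc.api
import pandas as pd

# Stream a DVC-tracked file straight from a repository without a full checkout.
with dvc.api.open(
    path="data/iris.csv",
    repo="https://github.com/<user>/<repo>",
    rev="main",  # any Git revision: branch, tag or commit
) as f:
    df = pd.read_csv(f)

print(df.head())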

Hydra


Hydra is a configuration management framework for Machine Learning / Data Science projects.

Installing Hydra: pip install hydra-core --upgrade

You must have all configurations in a folder named configs, as per our PyScaffold Cookiecutter Data Science template.

To import the configuration into a Python file:

  1. Method 1:

    import hydra
    from omegaconf import DictConfig, OmegaConf
    
    @hydra.main(version_base=None, config_path="conf", config_name="config")
    def my_app(cfg : DictConfig) -> None:
        print(OmegaConf.to_yaml(cfg))
    
    if __name__ == "__main__":
        my_app()
  2. Method 2:

     from hydra import compose, initialize
    
     # Loading configuration file using Hydra
     initialize(version_base=None, config_path='../../configs')
     config = compose(config_name='<config_name>')

To use the configuration, access nested keys with attribute notation: config.<group>.<key>
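
For example, with the params config shown earlier (a sketch assuming the file is configs/params.yaml; adjust the relative config_path for your module's location):

from hydra import compose, initialize

# Compose the configuration once, outside of a @hydra.main entry point.
with initialize(version_base=None, config_path="../../configs"):
    config = compose(config_name="params")

print(config.params.batch_size)     # 32
print(config.params.learning_rate)  # 0.01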

Weights & Biases (wandb)


Weights & Biases (wandb) is an experiment tracking utility for machine learning.

Installing Wandb: pip install wandb

To Login:

  1. Open wandb.ai > Settings > Danger Zone > API

  2. Copy your API key.

  3. Execute wandb login & paste your key.

You should now be logged in to your Weights & Biases account.

To start a new run:

import wandb
wandb.init(project = '<Project_name>', config = config)
# Note that config has to be loaded using Hydra.cc before
# calling this command.
# This will upload training configurations to W&B portal.

Integration with Keras:

# We use a Keras callback to integrate W&B with our model.
# This will log metrics such as accuracy and loss, as well as GPU & CPU usage.
# Pass the callback to model.fit
from wandb.keras import WandbCallback

model.fit(
  X_train,
  y_train,
  validation_data=(X_test, y_test),
  callbacks=[WandbCallback()]
)

To log any other metrics: wandb.log({'parameter_name': parameter_value})
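
For example, custom metrics can be logged as a dictionary during training (a minimal sketch; the project name and metric values below are placeholders):

import wandb

run = wandb.init(project="<Project_name>")

# Log an illustrative custom metric for a few epochs.
for epoch in range(3):
    wandb.log({"epoch": epoch, "train_loss": 1.0 / (epoch + 1)})

run.finish()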
