This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
- ScriptRunConfig Class
- Configure and submit training runs
- HyperDriveConfig Class
- How to tune hyperparameters
The dataset contains bank marketing data, and the task was to use a logistic regression model to determine whether a client would subscribe to a term deposit.
In this project we used HyperDrive to tune the hyperparameters of the logistic regression model. We also used AutoML to find the most optimized model for the same dataset. The best performing model AutoML came up with was a VotingEnsemble, with an accuracy of 0.9185.
The pipeline architecture, including the data, hyperparameter tuning, and classification algorithm, involved the following steps:
- Import data using TabularDatasetFactory
- Clean the data: one-hot encode categorical features and preprocess the date features
- Splitting of data into train and test data
- Using scikit-learn logistic regression model for classification
- Create the ScriptRunConfig, passing the necessary parameters such as the training script, the environment, and the compute target for the job
- Configure HyperDrive: a) parameter sampler, b) primary metric selection, c) early termination policy, d) resources and the run configuration, which includes the estimator
- Save the trained model
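The HyperDrive setup above can be sketched with the Azure ML SDK as follows. This is a hedged sketch, not the project's exact code: the search-space ranges, the compute target name `cpu-cluster`, the environment name `sklearn-env`, and the run limits are all assumed values, and `ws` is an already-loaded `Workspace`.

```python
from azureml.core import Environment, ScriptRunConfig
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, BanditPolicy,
    PrimaryMetricGoal, uniform, choice,
)

# Assumed search space -- illustrative ranges, not the exact project values.
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),             # inverse regularization strength
    "--max_iter": choice(50, 100, 200),    # solver iteration cap
})

# Terminate runs that fall outside a 10% slack of the best run.
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="cpu-cluster",                     # hypothetical name
    environment=Environment.get(ws, "sklearn-env"),   # hypothetical name
)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```

The resulting `hyperdrive_config` is then submitted to an experiment, which launches one child run per sampled hyperparameter combination.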
As specified above, we used a logistic regression model to perform binary classification and the HyperDrive tool to choose the best hyperparameter values from the parameter search space. Under the hood, logistic regression uses a sigmoid function to estimate the probability of the dependent/target variable from one or more independent variables (features).
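As a minimal illustration of that sigmoid mapping (plain Python with made-up weights, not the project's training script):

```python
import math

def sigmoid(z: float) -> float:
    """Squash a linear combination of features into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy linear model: z = w . x + b, then squash to a probability.
weights, bias = [0.8, -0.4], 0.1   # hypothetical learned coefficients
features = [2.0, 1.0]              # hypothetical client features
z = sum(w * x for w, x in zip(weights, features)) + bias
prob = sigmoid(z)                  # estimated P(client subscribes)
label = int(prob >= 0.5)           # threshold at 0.5 for a binary decision
```

Training simply searches for the weights and bias that make these probabilities match the observed labels as closely as possible.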
What are the benefits of the parameter sampler you chose?
The chosen parameter sampler was RandomParameterSampling because it supports both discrete and continuous hyperparameters. In random sampling, the hyperparameter values (C: smaller values specify stronger regularization; max_iter: the maximum number of iterations for the solver to converge) are selected at random from the defined search space. The biggest benefit of random sampling is that it picks hyperparameters randomly rather than exhaustively, reducing computational time and complexity.
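A rough standard-library sketch of what random sampling over such a search space does (the ranges below are illustrative, not the exact values from the project):

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter combination, mirroring RandomParameterSampling:
    a continuous uniform range for C and a discrete set for max_iter."""
    return {
        "C": rng.uniform(0.01, 1.0),            # continuous hyperparameter
        "max_iter": rng.choice([50, 100, 200]), # discrete hyperparameter
    }

rng = random.Random(42)  # seeded for reproducibility
trials = [sample_config(rng) for _ in range(5)]  # 5 candidate configurations
```

Each sampled configuration corresponds to one HyperDrive child run; a grid search over the same space would instead enumerate every combination.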
What are the benefits of the early stopping policy you chose?
The early stopping policy I chose was BanditPolicy because it is based on a slack factor and an evaluation interval. The benefit of Bandit is that it terminates runs whose primary metric is not within the specified slack factor of the best performing run, saving compute on unpromising runs.
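The slack-factor rule can be sketched in plain Python. The 0.1 slack factor and the metric values below are assumed for illustration:

```python
def bandit_should_terminate(metric: float, best_metric: float,
                            slack_factor: float = 0.1) -> bool:
    """Terminate a run when its primary metric falls outside the allowed
    slack of the current best run (for a metric being maximized)."""
    return metric < best_metric / (1.0 + slack_factor)

best = 0.91  # best accuracy seen so far (assumed)
keep = not bandit_should_terminate(0.89, best)   # within slack -> keep running
stop = bandit_should_terminate(0.80, best)       # outside slack -> terminate
```

With a slack factor of 0.1 and a best accuracy of 0.91, any run scoring below 0.91 / 1.1 ≈ 0.827 at an evaluation interval is cancelled.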
AutoML generated around 35 models for us with 4 cross-validations, and 4 of the 35 outperformed the logistic regression model from the Scikit-learn pipeline. The best performing model was VotingEnsemble, with an accuracy of 0.9185. VotingEnsemble works by combining the predictions from several models. The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators; the size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes; and learning_rate is a hyperparameter in the range (0.0, 1.0] that controls overfitting via shrinkage.
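A minimal, library-free sketch of the soft-voting idea behind an ensemble like VotingEnsemble; the per-model probabilities are made up for illustration:

```python
def soft_vote(model_probas):
    """Average class-probability vectors from several models and return
    the winning class index plus the averaged vector."""
    n_models = len(model_probas)
    n_classes = len(model_probas[0])
    avg = [sum(p[c] for p in model_probas) / n_models
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Three hypothetical base models scoring one client: [P(no), P(yes)]
probas = [[0.30, 0.70], [0.45, 0.55], [0.20, 0.80]]
pred, avg = soft_vote(probas)  # the ensemble prediction for this client
```

Because the models' errors are partly independent, the averaged prediction is typically more stable than any single base model's, which is why the ensemble edged out the standalone logistic regression here.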
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?

Comparing the two methods, AutoML gives us a variety of ready models with their performance metrics while also handling the data preparation tasks. The Python SDK pipeline provides more customization over hyperparameters, data preparation, and model selection; the amount of coding required is time consuming, but in cases where advanced feature engineering is required the Scikit-learn pipeline can play a vital role. Both methods performed almost the same, with AutoML doing slightly better.
What are some areas of improvement for future experiments? Why might these improvements help the model?

Model fairness would be a good area of improvement, and if there is any class imbalance, applying techniques that add samples of the underrepresented class could help the model generalize better.