
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Summary

The dataset contains bank marketing data, and the task is to use a logistic regression model to predict whether a client will subscribe to a term deposit.

In this project we used HyperDrive to tune the hyperparameters of the logistic regression model, and we also used AutoML to find the best-performing model for the same dataset. The best model AutoML produced was a VotingEnsemble with an accuracy of 0.9185.

Scikit-learn Pipeline

The pipeline covers the data, hyperparameter tuning, and the classification algorithm. These were the steps taken (a sketch of the training script follows the list):

  1. Import the data using TabularDatasetFactory
  2. Clean the data: one-hot encode categorical features and preprocess the date fields
  3. Split the data into train and test sets
  4. Use the scikit-learn logistic regression model for classification
  5. Create the ScriptRunConfig and pass the necessary parameters, such as the training script, the environment, and the compute target for the job
  6. Configure HyperDrive: a) the parameter sampler, b) the primary metric, c) the early termination policy, and d) the resources and run configuration (the ScriptRunConfig from step 5)
  7. Save the trained model
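
The following is a minimal sketch of the training script behind steps 1-4 and 7, assuming the bank-marketing CSV URL from the starter files (elided here) and a clean_data() helper modeled on the one provided; names, ranges, and defaults are illustrative rather than the exact starter code:

```python
# train.py -- minimal sketch of the HyperDrive training script
import argparse
import os

import joblib
import pandas as pd
from azureml.core.run import Run
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

DATA_URL = "<URL of the bank-marketing CSV from the starter files>"  # elided


def clean_data(dataset):
    """One-hot encode categorical features and split off the target column."""
    df = dataset.to_pandas_dataframe().dropna()
    y = df.pop("y").apply(lambda v: 1 if v == "yes" else 0)
    x = pd.get_dummies(df)
    return x, y


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    args = parser.parse_args()

    run = Run.get_context()
    run.log("Regularization Strength:", float(args.C))
    run.log("Max iterations:", int(args.max_iter))

    # Steps 1-3: import, clean, and split the data
    ds = TabularDatasetFactory.from_delimited_files(path=DATA_URL)
    x, y = clean_data(ds)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    # Step 4: fit the scikit-learn logistic regression classifier
    model = LogisticRegression(C=args.C, max_iter=args.max_iter)
    model.fit(x_train, y_train)
    run.log("Accuracy", float(model.score(x_test, y_test)))

    # Step 7: save the trained model so the best run can be registered later
    os.makedirs("outputs", exist_ok=True)
    joblib.dump(model, "outputs/model.joblib")


if __name__ == "__main__":
    main()
```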

As specified above, we used a logistic regression model to perform binary classification, and the HyperDrive tool to choose the best hyperparameter values from the parameter search space. Under the hood, logistic regression uses a sigmoid function to estimate the probability of the dependent/target variable given one or more independent variables (features).
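
To illustrate, the sigmoid maps the model's linear score over the features to a probability in (0, 1), and the class is predicted by thresholding that probability:

```python
import numpy as np


def sigmoid(z):
    """Map a linear score z = w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


print(sigmoid(0.0))  # 0.5   -> exactly on the decision boundary
print(sigmoid(2.0))  # ~0.88 -> predicted "subscribes" at the 0.5 threshold
```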

What are the benefits of the parameter sampler you chose?

The chosen parameter sampler was RandomParameterSampling because it supports both discrete and continuous hyperparameters. In random sampling, the hyperparameter values (C: smaller values specify stronger regularization; max_iter: the maximum number of iterations for the solver to converge) are randomly selected from the defined search space. The biggest benefit of random sampling is that it chooses hyperparameters randomly rather than exhaustively, reducing computational time and complexity. A sketch of the sampler follows.
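
A minimal sketch of such a sampler, assuming the training script exposes --C and --max_iter as command-line arguments; the ranges shown are illustrative, not necessarily the exact ones used:

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

ps = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),           # continuous: inverse regularization strength
    "--max_iter": choice(50, 100, 200),  # discrete: solver iteration cap
})
```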

What are the benefits of the early stopping policy you chose?

The early stopping policy I chose was BanditPolicy, which is based on a slack factor and an evaluation interval. The benefit of BanditPolicy is that it terminates runs whose primary metric is not within the specified slack factor of the best-performing run, so compute is not wasted on unpromising configurations. A sketch of the policy and the full HyperDrive configuration follows.
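
A minimal sketch tying the pieces together, assuming the ScriptRunConfig (src) from step 5 and the sampler (ps) above are in scope; the slack value and run counts are illustrative:

```python
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Terminate runs whose primary metric falls more than 10% short of the best
# run so far, checking every 2 evaluation intervals
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                     # ScriptRunConfig wrapping train.py (step 5)
    hyperparameter_sampling=ps,         # the RandomParameterSampling above
    policy=policy,
    primary_metric_name="Accuracy",     # must match the metric name logged in train.py
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```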

AutoML

AutoML generated around 35 models for us using 4-fold cross validation, and 4 of the 35 models outperformed the logistic regression model from the Scikit-learn pipeline. The best-performing model was a VotingEnsemble with an accuracy of 0.9185. A VotingEnsemble works by combining the predictions from several models. For the boosted-tree learners inside the ensemble, the number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators; the size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes; and learning_rate is a hyperparameter in the range (0.0, 1.0] that controls overfitting via shrinkage. A sketch of the AutoML configuration follows.
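
A minimal sketch of the AutoML configuration behind these results, assuming the cleaned training data is available as a TabularDataset with label column "y" and that a compute target is in scope; the timeout is illustrative:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,   # illustrative experiment budget
    task="classification",
    primary_metric="accuracy",
    training_data=train_ds,          # cleaned TabularDataset (assumed in scope)
    label_column_name="y",
    n_cross_validations=4,           # the 4-fold cross validation described above
    compute_target=compute_target,   # assumed in scope
)
```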

Pipeline comparison

Comparing the two approaches: AutoML gives us a variety of trained models with their performance metrics, and handles the data-preparation tasks along the way. The Python SDK pipeline provides more customization over hyperparameters, data preparation, and model selection; the coding it requires is time consuming, but in cases where advanced feature engineering is needed the Scikit-learn pipeline can play a vital role. Both methods performed almost the same, with AutoML doing slightly better.

Future work

Model fairness would be a good area of improvement, and if there is any class imbalance, techniques that add samples of the under-represented class (e.g. oversampling) could be applied. These changes would help the model make less biased predictions and generalize better to the minority class.
