
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Summary

This dataset contains data about direct marketing campaigns (phone calls) of a Portuguese banking institution (source: https://archive.ics.uci.edu/ml/datasets/bank+marketing). The classification goal is to predict whether the client will subscribe to a term deposit (variable y). We approach this with a hyperparameter search over a logistic regression model and with AutoML.

The best performing model was an AutoML model composed of a VotingEnsemble, which combines the predictions of several submodels. The accuracy was 0.91711. It must be noted that the label data is imbalanced, so a better metric to use would have been the F1 score.
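Why accuracy misleads here can be shown with a tiny sketch (the class ratio below is hypothetical, chosen only to mimic the imbalance):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 90 "no" vs 10 "yes".
y_true = [0] * 90 + [1] * 10
# A useless model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.2f}  f1={f1:.2f}")  # high accuracy, zero F1
```

The majority-class predictor scores 0.90 accuracy while never identifying a single subscriber; the F1 score of 0.0 exposes it immediately.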

Scikit-learn Pipeline

The scikit-learn pipeline was composed of a .py file and a hyperparameter tuning notebook. The .py file downloads the data, cleans it (one-hot encoding, transforming text into discrete features), splits it into a train-test split, fits a logistic regression model, and logs the accuracy. The notebook calls the .py file with different hyperparameter configurations for the logistic regression.
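The core of the .py file can be sketched as below; the toy DataFrame and its column names are stand-ins for the real bank-marketing data, and C / max_iter are the hyperparameters the notebook varies:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned bank-marketing data; real columns differ.
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"] * 25,
    "duration": range(100),
    "y": [0, 1, 0, 1] * 25,
})

# One-hot encode the text feature, as the cleaning step does.
x = pd.get_dummies(df.drop(columns="y"))
y = df["y"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# C and max_iter are the hyperparameters swept by the notebook.
model = LogisticRegression(C=1.0, max_iter=1000).fit(x_train, y_train)
acc = accuracy_score(y_test, model.predict(x_test))
print("accuracy:", acc)
```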

RandomParameterSampling often outperforms grid sampling, since frequently only a couple of hyperparameters really matter. With grid sampling, multiple runs are executed with the same value for the relevant hyperparameters while the hyperparameters that do not matter are varied. With random sampling you end up with more distinct sample points for your relevant hyperparameters. Karpathy once mentioned this in his computer vision course. Bayesian sampling would probably work even better, but I wanted to keep it simple for this exercise.
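The effect can be illustrated with scikit-learn's own samplers (the hyperparameter names and ranges below are illustrative). Suppose C matters and max_iter barely does: a 16-run grid only ever visits 4 distinct C values, while 16 random draws give 16 distinct C values for the same budget.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# 4x4 grid: 16 runs, but only 4 distinct values of the important C.
grid = ParameterGrid({"C": [0.01, 0.1, 1, 10],
                      "max_iter": [50, 100, 150, 200]})
grid_cs = {p["C"] for p in grid}

# 16 random draws: every run explores a fresh value of C.
sampler = ParameterSampler(
    {"C": uniform(0.01, 10), "max_iter": [50, 100, 150, 200]},
    n_iter=16, random_state=0,
)
random_cs = {p["C"] for p in sampler}

print(len(grid_cs), "distinct C values from grid,",
      len(random_cs), "from random sampling")
```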

Early stopping saves time and compute: if runs are no longer improving, subsequent runs are just wasted. The trick is knowing when you are no longer improving. A bandit policy is a simple early stopping policy that compares each run to the best one so far and stops runs that fall too far behind the best result.
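A minimal sketch of the slack-factor check a bandit policy performs (this is an illustration of the rule, not Azure's actual implementation; the metric values are hypothetical):

```python
def should_stop(run_metric: float, best_metric: float,
                slack_factor: float = 0.1) -> bool:
    """Bandit-style check for a higher-is-better metric (e.g. accuracy):
    stop the run if its metric falls outside the allowed slack
    relative to the best run seen so far."""
    return run_metric < best_metric / (1 + slack_factor)

best = 0.90  # best accuracy so far
print(should_stop(0.85, best))  # within 0.90/1.1 ~ 0.818 -> keep running
print(should_stop(0.80, best))  # below the threshold -> stop
```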

AutoML

AutoML tries many different models, scalers, and imputers and sees what works best. I noticed many boosting algorithms among the top models, which tend to do very well on classification problems.
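The advantage of boosting over a linear model is easy to demonstrate on a toy non-linear problem (this uses scikit-learn's synthetic moons data purely as an illustration, not the bank-marketing set):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A dataset with a non-linear decision boundary.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"logistic: {lr:.2f}  boosting: {gb:.2f}")
```

The boosted trees can carve out the curved class boundary that a single linear decision surface cannot.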

Pipeline comparison

The AutoML model performed better (0.92 vs 0.90 accuracy). This is because AutoML has a couple of advantages:

1. It tries out many different models.
2. These models are more advanced than a 'simple' logistic regression model.
3. The numeric features are scaled and imputed, which does not happen in the logistic regression pipeline.

Future work

Several areas could be improved in future experiments:

- Add more data, and engineer extra features that the model can use to make better predictions.
- Tune more hyperparameters of the logistic regression classifier, or increase the search space.
- Preprocess the data (scaling/normalization): this may reduce the bias toward features with large absolute values.
- Impute missing values, which may improve the score.
- Use the F1 score, which is a better way to judge model performance on imbalanced datasets.
- For the AutoML model, take the XGBoost model and run a hyperparameter search on that one.

Proof of cluster clean up

I did it in the code with the delete method. However, I don't fully understand why this should be done: a compute cluster with min_nodes = 0 does not incur any cost when idle, so isn't it useful to keep and re-use it? I do understand that Azure counts the nodes toward your quota even when you don't use them.
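For reference, a minimal Azure ML SDK v1 sketch of the create-and-delete cycle (the workspace config, cluster name, and VM size are illustrative; this only runs against a real Azure ML workspace):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # assumes a config.json for your workspace

# min_nodes=0 lets the cluster scale to zero when idle (no compute cost),
# but the nodes still count toward the subscription quota.
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2", min_nodes=0, max_nodes=4
)
cluster = ComputeTarget.create(ws, "cpu-cluster", config)
cluster.wait_for_completion(show_output=True)

# ... run the experiments ...

cluster.delete()  # release the nodes and free the quota
```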
