This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
- ScriptRunConfig Class
- Configure and submit training runs
- HyperDriveConfig Class
- How to tune hyperparameters
The dataset contains bank marketing data, and the task was to use a logistic regression model to determine whether a client would subscribe to a term deposit.
In this project we used HyperDrive to tune the hyperparameters of the logistic regression model. We also used AutoML to find the most optimized model for the same dataset. The best performing model AutoML came up with was a VotingEnsemble, with an accuracy of 0.9185.
The pipeline architecture, including the data, hyperparameter tuning, and classification algorithm, involved the following steps:
- Import data using TabularDatasetFactory
- Clean the data: one-hot encode categorical features and preprocess the date features
- Splitting of data into train and test data
- Using scikit-learn logistic regression model for classification
- Create the ScriptRunConfig, passing the necessary parameters such as the training script, the environment, and the compute target for the job
- Configure HyperDrive: a) parameter sampler, b) primary metric selection, c) early termination policy, d) resources and the run configuration, which includes the estimator
- Save the trained model
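The HyperDrive setup above can be sketched with the Azure ML SDK as follows. This is a hedged sketch, not the project's exact code: the search-space ranges, the compute target name `cpu-cluster`, the environment name `sklearn-env`, and the run limits are all assumed values, and `ws` is an already-loaded `Workspace`.

```python
from azureml.core import Environment, ScriptRunConfig
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, BanditPolicy,
    PrimaryMetricGoal, uniform, choice,
)

# Assumed search space -- illustrative ranges, not the exact project values.
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),             # inverse regularization strength
    "--max_iter": choice(50, 100, 200),    # solver iteration cap
})

# Terminate runs that fall outside a 10% slack of the best run.
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="cpu-cluster",                     # hypothetical name
    environment=Environment.get(ws, "sklearn-env"),   # hypothetical name
)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```

The resulting `hyperdrive_config` is then submitted to an experiment, which launches one child run per sampled hyperparameter combination.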
As specified above, we used a logistic regression model to perform binary classification and the HyperDrive tool to choose the best hyperparameter values from the parameter search space. Under the hood, logistic regression uses a sigmoid function to estimate the probability of the dependent/target variable from one or more independent variables (features).
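As a minimal illustration of that sigmoid mapping (plain Python with made-up weights, not the project's training script):

```python
import math

def sigmoid(z: float) -> float:
    """Squash a linear combination of features into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy linear model: z = w . x + b, then squash to a probability.
weights, bias = [0.8, -0.4], 0.1   # hypothetical learned coefficients
features = [2.0, 1.0]              # hypothetical client features
z = sum(w * x for w, x in zip(weights, features)) + bias
prob = sigmoid(z)                  # estimated P(client subscribes)
label = int(prob >= 0.5)           # threshold at 0.5 for a binary decision
```

Training simply searches for the weights and bias that make these probabilities match the observed labels as closely as possible.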
What are the benefits of the parameter sampler you chose?
The chosen parameter sampler was RandomParameterSampling because it supports both discrete and continuous hyperparameters. In random sampling, the hyperparameter values (C: smaller values specify stronger regularization; max_iter: the maximum number of iterations for the solver to converge) are selected at random from the defined search space. The biggest benefit of random sampling is that it picks hyperparameters randomly rather than exhaustively, reducing computational time and complexity.
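A rough standard-library sketch of what random sampling over such a search space does (the ranges below are illustrative, not the exact values from the project):

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter combination, mirroring RandomParameterSampling:
    a continuous uniform range for C and a discrete set for max_iter."""
    return {
        "C": rng.uniform(0.01, 1.0),            # continuous hyperparameter
        "max_iter": rng.choice([50, 100, 200]), # discrete hyperparameter
    }

rng = random.Random(42)  # seeded for reproducibility
trials = [sample_config(rng) for _ in range(5)]  # 5 candidate configurations
```

Each sampled configuration corresponds to one HyperDrive child run; a grid search over the same space would instead enumerate every combination.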
What are the benefits of the early stopping policy you chose?
The early stopping policy I chose was BanditPolicy because it is based on a slack factor and an evaluation interval. The benefit of Bandit is that it terminates runs whose primary metric is not within the specified slack factor of the best performing run, saving compute on unpromising runs.
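The slack-factor rule can be sketched in plain Python. The 0.1 slack factor and the metric values below are assumed for illustration:

```python
def bandit_should_terminate(metric: float, best_metric: float,
                            slack_factor: float = 0.1) -> bool:
    """Terminate a run when its primary metric falls outside the allowed
    slack of the current best run (for a metric being maximized)."""
    return metric < best_metric / (1.0 + slack_factor)

best = 0.91  # best accuracy seen so far (assumed)
keep = not bandit_should_terminate(0.89, best)   # within slack -> keep running
stop = bandit_should_terminate(0.80, best)       # outside slack -> terminate
```

With a slack factor of 0.1 and a best accuracy of 0.91, any run scoring below 0.91 / 1.1 ≈ 0.827 at an evaluation interval is cancelled.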
AutoML generated around 35 models for us with 4 cross-validations, and 4 of the 35 outperformed the logistic regression model from the Scikit-learn pipeline. The best performing model was VotingEnsemble, with an accuracy of 0.9185. VotingEnsemble works by combining the predictions from several models. The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators; the size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes; and learning_rate is a hyperparameter in the range (0.0, 1.0] that controls overfitting via shrinkage.
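A minimal, library-free sketch of the soft-voting idea behind an ensemble like VotingEnsemble; the per-model probabilities are made up for illustration:

```python
def soft_vote(model_probas):
    """Average class-probability vectors from several models and return
    the winning class index plus the averaged vector."""
    n_models = len(model_probas)
    n_classes = len(model_probas[0])
    avg = [sum(p[c] for p in model_probas) / n_models
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Three hypothetical base models scoring one client: [P(no), P(yes)]
probas = [[0.30, 0.70], [0.45, 0.55], [0.20, 0.80]]
pred, avg = soft_vote(probas)  # the ensemble prediction for this client
```

Because the models' errors are partly independent, the averaged prediction is typically more stable than any single base model's, which is why the ensemble edged out the standalone logistic regression here.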
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?

Comparing the two methods, AutoML gives us a variety of ready models with their performance metrics while also handling the data preparation tasks. The Python SDK pipeline provides more customization over hyperparameters, data preparation, and model selection; the amount of coding required is time consuming, but in cases where advanced feature engineering is required the Scikit-learn pipeline can play a vital role. Both methods performed almost the same, with AutoML doing slightly better.
What are some areas of improvement for future experiments? Why might these improvements help the model?

Model fairness would be a good area of improvement, and if there is any class imbalance, applying techniques that add samples of the underrepresented class could help the model generalize better.