This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
In 1-2 sentences, explain the problem statement: e.g. "This dataset contains data about... we seek to predict..." From https://archive.ics.uci.edu/ml/datasets/bank+marketing: the data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). We do this with a hyperparameter search over a logistic regression and with an AutoML approach.
In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..." The best performing model was an AutoML model, a VotingEnsemble, which combines the predictions of several submodels. Its accuracy was 0.91711. Note that the labels are imbalanced, so the F1 score would have been a better metric to use.
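To make "a mix of the predictions of several submodels" concrete, here is a minimal sketch of soft voting, where the class probabilities of each submodel are combined as a weighted average. The submodel probabilities and weights are hypothetical values chosen for illustration; they are not taken from the actual AutoML run, and the helper `soft_vote` is not part of any library.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities for one sample."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs of three submodels for one client: [P(no), P(yes)]
submodel_probs = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
# Hypothetical ensemble weights (assumed, not from the real run)
submodel_weights = [0.2, 0.3, 0.5]

combined = soft_vote(submodel_probs, submodel_weights)
# Index of the class with the highest combined probability (0 = no, 1 = yes)
predicted_class = max(range(len(combined)), key=combined.__getitem__)
```

Because the weights favor the two submodels that lean toward "yes", the ensemble predicts "yes" even though the first submodel disagrees.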
Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm. The Scikit-learn pipeline consists of a .py file and a hyperparameter tuning notebook. The .py file downloads the data and cleans it (one-hot encoding, transforming text into discrete features). It then splits the data into a train and a test set, fits a logistic regression and logs the accuracy. The notebook calls the .py file with different hyperparameter configurations for the logistic regression.
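The cleaning and splitting steps can be sketched in plain Python as follows. This is only an illustration of the idea, not the code of the actual .py file: the helper functions, the tiny sample of rows and the 20% test fraction are all assumptions made for the example.

```python
import random


def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical miniature of the bank marketing data (made-up rows)
marital = ["married", "single", "divorced", "single", "married"]
y_text = ["no", "yes", "no", "no", "yes"]

categories, x = one_hot(marital)                    # text column -> 0/1 columns
y = [1 if label == "yes" else 0 for label in y_text]  # yes/no label -> 1/0
x_train, x_test, y_train, y_test = train_test_split(x, y, test_fraction=0.2)
```

The logistic regression is then fitted on `x_train`/`y_train` and its accuracy is measured on the held-out `x_test`/`y_test`.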
What are the benefits of the parameter sampler you chose? RandomParameterSampling often outperforms grid sampling, because usually only a couple of hyperparameters really matter. With grid sampling, multiple runs are executed with the same value for the relevant hyperparameters while only the hyperparameters that do not matter are varied. With random sampling you end up with more distinct sample points for the relevant hyperparameters. Karpathy once mentioned this in his computer vision course. Bayesian sampling probably works even better, but I wanted to keep it simple for this exercise.
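The argument can be checked with a small stdlib-only sketch that spends the same budget of nine runs on a grid and on random samples, and then counts how many distinct values of the "important" hyperparameter were tried. The hyperparameter names `C` and `max_iter` and their ranges are assumptions for illustration, not the search space used in the notebook.

```python
import random


def grid_points(c_values, max_iter_values):
    """Every combination of the listed values (grid sampling)."""
    return [(c, m) for c in c_values for m in max_iter_values]


def random_points(n, c_low, c_high, max_iter_choices, seed=0):
    """n independent draws: C from a uniform range, max_iter from a list."""
    rng = random.Random(seed)
    return [
        (rng.uniform(c_low, c_high), rng.choice(max_iter_choices))
        for _ in range(n)
    ]


# Same budget of 9 runs for both strategies (hypothetical search space)
grid = grid_points([0.1, 1.0, 10.0], [50, 100, 150])
rand = random_points(9, 0.1, 10.0, [50, 100, 150])

# Suppose only C matters: how many different values of C did we try?
distinct_c_grid = len({c for c, _ in grid})
distinct_c_random = len({c for c, _ in rand})
```

The grid only ever tries three values of `C`, each repeated three times, while the random sampler tries a different value of `C` in every run.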
What are the benefits of the early stopping policy you chose? Early stopping saves time and compute. Once results stop improving, it is better to stop, since further compute is just wasted. The trick is knowing when you are no longer improving. A bandit policy is a simple early stopping policy: it compares each run to the best one so far and stops runs that fall too far behind the best result.
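As a rough sketch of the comparison a bandit policy makes: with a slack factor and a metric that is being maximized, a run is stopped when its metric drops below the best metric divided by (1 + slack factor). The accuracy values and the slack factor of 0.1 below are hypothetical, and the function is a simplification that ignores settings such as the evaluation interval and delayed evaluation.

```python
def bandit_should_stop(run_metric, best_metric, slack_factor):
    """True if a run is too far behind the best run (metric is maximized)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best_accuracy = 0.91   # hypothetical best run so far
slack = 0.1            # hypothetical slack factor

# Threshold is 0.91 / 1.1, roughly 0.827
stop_weak_run = bandit_should_stop(0.80, best_accuracy, slack)
stop_close_run = bandit_should_stop(0.90, best_accuracy, slack)
```

Here the run at 0.80 is terminated, while the run at 0.90 is within the allowed slack and keeps going.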
In 1-2 sentences, describe the model and hyperparameters generated by AutoML. AutoML tries many different combinations of models, scalers and imputers and sees what works best. I noticed many boosting algorithms among them, which tend to do very well on classification problems.
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one? The AutoML model performed better (0.92 vs 0.90). AutoML has a couple of advantages: 1) it tries out many different models, 2) these models are more advanced than a 'simple' logistic regression model, and 3) the numeric features are scaled and imputed, which does not happen in the logistic regression pipeline.
What are some areas of improvement for future experiments? Why might these improvements help the model? Add more data and engineer extra features that the model can use to make better predictions. Tune more hyperparameters of the logistic classifier, or widen the search space. Preprocessing the data (scaling/normalization) may help reduce a bias toward features with large absolute values. Imputing missing values may improve the score. The F1 score is a better way to judge model performance on imbalanced datasets. For the AutoML model, perhaps take the XGBoost model and run a hyperparameter search on that one.
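The point about the F1 score can be illustrated with a stdlib-only sketch: a "model" that always predicts the majority class gets a high accuracy on imbalanced labels but an F1 score of zero. The 10% positive rate below is a hypothetical value for illustration, not the measured class balance of the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 10 subscribers (1) and 90 non-subscribers (0)
y_true = [1] * 10 + [0] * 90
always_no = [0] * 100  # a trivial model that never predicts a subscription

acc = accuracy(y_true, always_no)
f1 = f1_score(y_true, always_no)
```

The trivial model reaches 90% accuracy without finding a single subscriber, which the F1 score of 0 makes visible.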
I did it in the code with the delete method. However, I don't fully understand why this should be done: a compute cluster with min_nodes = 0 does not incur any costs, so isn't it useful to keep it and re-use it? I do understand that Azure counts the nodes against your quota even if you don't use them.