
MLHyperparameterTuning

Author: Mario Bourgoin

Training of Python scikit-learn models on Azure

Overview

This scenario shows how to tune a Frequently Asked Questions (FAQ) matching model that can be deployed as a web service to provide predictions for user questions. For this scenario, "Input Data" in the architecture diagram refers to text strings containing the user questions to match with a list of FAQs. The scenario is designed for the scikit-learn machine learning library for Python but can be generalized to any scenario that uses Python models to make real-time predictions.

Design

[Architecture diagram]

The scenario uses a subset of Stack Overflow question data that includes original questions tagged as JavaScript, their duplicate questions, and their answers. It tunes a scikit-learn pipeline to predict the match probability of a duplicate question with each of the original questions. The application flow for this architecture is as follows (a code sketch of steps 4 through 8 appears after the list):

  1. Create an Azure ML Service workspace.
  2. Create an Azure ML Compute cluster.
  3. Upload training, tuning, and testing data to Azure Storage.
  4. Configure a HyperDrive random hyperparameter search.
  5. Submit the search.
  6. Monitor until complete.
  7. Retrieve the best set of hyperparameters.
  8. Register the best model.
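
The sketch below illustrates steps 4 through 8 using the azureml-sdk v1 Python API. It is a minimal example under assumptions: the training script name, hyperparameter arguments, and model path are illustrative placeholders, not necessarily what this repository's notebooks use.

    # Minimal sketch of steps 4-8 with the azureml-sdk v1 API.
    # Script name, hyperparameters, and model path are illustrative.
    from azureml.core import Workspace, Experiment, ScriptRunConfig
    from azureml.train.hyperdrive import (
        HyperDriveConfig, RandomParameterSampling, PrimaryMetricGoal, choice)

    ws = Workspace.from_config()  # workspace from step 1
    src = ScriptRunConfig(
        source_directory=".",
        script="TrainClassifier.py",                      # hypothetical training script
        compute_target=ws.compute_targets["hypetuning"],  # cluster from step 2
    )

    # Step 4: a random search over an illustrative hyperparameter space.
    sampling = RandomParameterSampling({
        "--estimators": choice(1000, 2000, 4000, 8000),
        "--min_child_samples": choice(5, 10, 20, 30),
    })
    hd_config = HyperDriveConfig(
        run_config=src,
        hyperparameter_sampling=sampling,
        primary_metric_name="logloss",  # must match a metric the script logs
        primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
        max_total_runs=96,
        max_concurrent_runs=8,
    )

    # Steps 5 and 6: submit the search and monitor until complete.
    run = Experiment(ws, "hypetuning").submit(hd_config)
    run.wait_for_completion(show_output=True)

    # Step 7: retrieve the best run and its hyperparameters.
    best_run = run.get_best_run_by_primary_metric()
    print(best_run.get_details()["runDefinition"]["arguments"])

    # Step 8: register the best model from the best run's outputs.
    best_run.register_model(model_name="FAQ_ranker", model_path="outputs/model.pkl")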

Prerequisites

  1. Linux (Ubuntu).
  2. Anaconda Python installed.
  3. Azure account.

The tutorial was developed on an Azure Ubuntu DSVM, which addresses the first two prerequisites. You can allocate such a VM on Azure Portal by creating a "Data Science Virtual Machine for Linux (Ubuntu)" resource.

Setup

To set up your environment to run these notebooks, follow these steps. They configure the notebooks to use Azure seamlessly.

  1. Create a Linux Ubuntu VM.
  2. Log in to your VM. We recommend that you use a graphical client such as X2Go to access your VM. The remaining steps are to be done on the VM.
  3. Open a terminal emulator.
  4. Clone, fork, or download the zip file for this repository:
    git clone https://github.com/Microsoft/MLHyperparameterTuning.git
    
  5. Enter the local repository:
    cd MLHyperparameterTuning
    
  6. Create the Python MLHyperparameterTuning virtual environment using the environment.yml:
    conda env create -f environment.yml
    
  7. Activate the virtual environment:
    source activate MLHyperparameterTuning
    
    The remaining steps should be done in this virtual environment.
  8. Log in to Azure:
    az login
    
    You can verify that you are logged in to your subscription by executing the command:
    az account show -o table
    
  9. If you have more than one Azure subscription, select the one you want to use:
    az account set --subscription <Your Azure Subscription>
    
  10. Start the Jupyter notebook server:
    jupyter notebook
    

Steps

After following the setup instructions above, run the Jupyter notebooks in order starting with 00_Data_Prep_Notebook.ipynb.

Cleaning up

The last Jupyter notebook describes how to delete the Azure resources created for running the tutorial. Consult the conda documentation for how to remove the conda environment created during setup. If you created a VM for this tutorial, you may also delete it.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


Issues

Use AML HyperDrive

Modify the Happy Path to use AML with HyperDrive in place of Batch AI.

Use Python logging in scripts

Mat: Instead of using print statements in the scripts, use Python logging. The main reason is that you can add debug logging statements that people can turn on or off. Print is fine in notebooks, but when running things on clusters it is sometimes nice to have a verbose mode.
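
A minimal sketch of the suggested change, with a hypothetical --verbose flag to switch debug logging on:

    import argparse
    import logging

    logger = logging.getLogger(__name__)

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--verbose", action="store_true",
                            help="enable debug logging")
        args = parser.parse_args()
        logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)

        logger.info("Loading training data")  # was: print("Loading training data")
        logger.debug("Per-step details, visible only with --verbose")

    if __name__ == "__main__":
        main()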

Add control of iteration used in testing

Currently, the testing script uses the maximum number of iterations of the trained model to score the data. Add an "early_stopping_rounds" argument to the training script so that it records the best iteration found on validation data, and an argument to the testing script that controls whether that iteration is used in scoring.
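
A sketch of the training-script side of this change, assuming a LightGBM-style estimator; the data loading and argument names are illustrative:

    import argparse
    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    parser = argparse.ArgumentParser()
    parser.add_argument("--early_stopping_rounds", type=int, default=None,
                        help="stop when validation loss has not improved "
                             "for this many rounds")
    args = parser.parse_args()

    # Stand-in data; the real script would load the tuning data from Azure Storage.
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    model = lgb.LGBMClassifier(n_estimators=8000)
    fit_kwargs = {}
    if args.early_stopping_rounds is not None:
        fit_kwargs = dict(
            eval_set=[(X_valid, y_valid)],
            eval_metric="binary_logloss",
            callbacks=[lgb.early_stopping(args.early_stopping_rounds)],
        )
    model.fit(X_train, y_train, **fit_kwargs)

    # Record the best iteration so the testing script can choose to score with it,
    # e.g. model.predict_proba(X_test, num_iteration=model.best_iteration_).
    print("best_iteration:", model.best_iteration_)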

Refactor scripts into functions

Mat: Some of the scripts created in the notebooks could be broken up a bit. Wherever you write a comment stating what the lines below do, those lines could be a function. That makes for easier composability. Completely optional though.
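
A hypothetical before-and-after of the suggested refactor, where a comment that names a block of lines becomes a function:

    import re

    # Before: a comment describing what the next lines do.
    #   # Normalize the question text before featurization.
    #   text = text.lower()
    #   text = re.sub(r"\s+", " ", text).strip()

    # After: the comment becomes the function name and docstring.
    def normalize_text(text):
        """Lower-case a question and collapse runs of whitespace."""
        text = text.lower()
        return re.sub(r"\s+", " ", text).strip()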

Add failOnStderr:true to all DevOps pipeline Bash steps

Add failOnStderr:true to all bash steps per Dan Grecoe:

There seems to be a change to the way the bash steps fail. That is, we've noticed in one of the paths that a notebook that throws an exception does not automatically stop the build as it used to.

Mathew Salvaris pointed out that there is a flag failOnStderr that defaults to false. We believe this used to be true. To ensure that the paths truly are building correctly, I'm asking each of the authors to go through their path and make the following change.
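
Hedging on the exact pipeline layout, the requested change looks roughly like this in an Azure DevOps YAML Bash step (the command and display name are illustrative):

    steps:
    - bash: |
        source activate MLHyperparameterTuning
        papermill 00_Data_Prep_Notebook.ipynb 00_output.ipynb
      displayName: Run data prep notebook
      failOnStderr: true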

Remove unused metric name

The script created by the 01_Training_Script.ipynb notebook defines a variable that is never used, which could be confusing:

# What to name the metric logged
metric_name = "logloss"

Those lines should be deleted.

Unable to run notebook in build system

Notebook:
https://github.com/Microsoft/MLHyperparameterTuning/blob/master/04_HyperDrive_Run_Recovery.ipynb

Issue:
Cell 5 Code:
run = get_run(exp, 'HyperDrive_Run_ID', rehydrate=True)
run

Cell 5 Comment:
Use the ID of the run to get a handle to it. That ID was printed with the run when it was submitted in the previous notebook. You can also find that ID in Azure Portal on your experiment's page. You may need to add a RunId column to the table of experiment runs.

The build system has no way to collect information from a notebook's output and substitute it inline in the code.
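
One possible workaround, with a hypothetical file name: have the submitting notebook persist the run ID, and have the recovery notebook read it back instead of relying on a hand-pasted value:

    # In the notebook that submits the HyperDrive run:
    with open("hyperdrive_run_id.txt", "w") as f:
        f.write(run.id)

    # In 04_HyperDrive_Run_Recovery.ipynb, replacing the hand-pasted ID:
    with open("hyperdrive_run_id.txt") as f:
        run_id = f.read().strip()
    run = get_run(exp, run_id, rehydrate=True)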

Include dev-tests

Mat: There is no easy way to dev-test a set of notebooks locally. I use Makefiles, but anything will do: just something that makes it easy for someone to run one command to create the environment, run the notebooks, and tear the environment down.
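
A minimal sketch of such a one-command driver, written in Python rather than a Makefile and assuming papermill is installed; the notebook list is abbreviated:

    import subprocess
    import papermill as pm

    NOTEBOOKS = [
        "00_Data_Prep_Notebook.ipynb",
        "01_Training_Script.ipynb",
        # ... remaining notebooks in order ...
    ]

    def main():
        # Create the environment, run every notebook in order, then tear down.
        subprocess.run(["conda", "env", "create", "-f", "environment.yml"],
                       check=True)
        try:
            for nb in NOTEBOOKS:
                pm.execute_notebook(nb, nb.replace(".ipynb", ".output.ipynb"))
        finally:
            subprocess.run(
                ["conda", "remove", "--name", "MLHyperparameterTuning",
                 "--all", "-y"],
                check=True)

    if __name__ == "__main__":
        main()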
