hi-ml's People

Contributors

ant0nsc, dccastro, dependabot[bot], dumbledad, fepegar, harshita-s, javier-alvarez, jonathantripp, jubinmathew1995, kenza-bouzid, markpinnock, maxilse, mebristo, microsoftopensource, peterhessey, pre-commit-ci[bot], sangamswadik, vale-salvatelli

hi-ml's Issues

Basic Unit Tests

Sketch basic unit tests & mocks for:

  1. Uploading a single Python file to AzureML, running it, and returning the run_id; then waiting for run completion and downloading the output files and stdout (see the sketch after this list)

  2. Uploading a single Python file and a data file, e.g. data.csv, rest as above.

  3. Uploading two Python files, to check imports, rest as above.
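
A minimal sketch of how the first of these could be mocked, assuming a hypothetical helper submit_and_wait(script, experiment) that submits the script and returns the run id once the run has completed (the helper name and signature are assumptions, not the hi-ml API):

    # Sketch only: submit_and_wait is a hypothetical helper; the mocks stand in
    # for the AzureML SDK Experiment and Run objects.
    from pathlib import Path
    from unittest import mock


    def test_submit_single_file(tmp_path: Path) -> None:
        script = tmp_path / "hello.py"
        script.write_text("print('hello world')")

        mock_run = mock.MagicMock()
        mock_run.id = "run_123"
        mock_run.get_status.return_value = "Completed"

        mock_experiment = mock.MagicMock()
        mock_experiment.submit.return_value = mock_run

        # The helper under test would be exercised like this:
        # run_id = submit_and_wait(script, mock_experiment)
        # assert run_id == "run_123"
        # mock_run.download_files.assert_called_once()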

Create a daily build for hi-ml that tests docker support

Building a Docker image takes about 20 minutes, which is too long for a PR build.

  • In the PR builds, only execute normal conda runs without docker.
  • In the daily build, test docker image generation, and in particular test if the docker_shm_size is respected
  • Ensure that, if the build fails, people get an email that is not the standard github email

Add all features required to support InnerEye

  • run_config has a field max_run_duration_seconds; make that an argument of submit_if_needed
  • Add a raise_on_error argument: wait_for_completion(self, ..., raise_on_error=True) (see the sketch below)
  • Add an argument for returning after submitting the job, rather than exiting
  • Add an argument for experiment name
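
A minimal sketch of the raise_on_error behaviour, assuming the surrounding class holds the AzureML Run as self.run (the class, argument names and defaults are assumptions):

    # Sketch only: self.run is assumed to be an azureml.core.Run held by the
    # surrounding class; argument names and defaults are assumptions.
    def wait_for_completion(self, show_output: bool = True, raise_on_error: bool = True) -> str:
        # Block until the AzureML run has finished, then return its final status.
        details = self.run.wait_for_completion(show_output=show_output, raise_on_error=False)
        status = details["status"]
        if raise_on_error and status != "Completed":
            raise ValueError(f"Run {self.run.id} in experiment {self.run.experiment.name} "
                             f"finished with status {status}")
        return status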

Use an alternative pypi server to upload the test packages

We have sporadic test pipeline failures when building a Docker image in the AzureML jobs (example here). The job claims that package version post282 does not exist, but the GitHub build agents successfully pulled that version a few minutes earlier. This probably comes from a package mirror that AML uses, which is not completely up to date with test.pypi.
As a workaround, we could publish to an MS-internal package feed. AML would probably not have a cache of that, and would hence always use the latest version.

Add tensorboard monitoring script

  • Script that starts a tensorboard server on the local box (see the sketch after this list)
  • should pick up most recent run automatically, or accept a run_id argument
  • See Word document for a link to examples from the chestXray project
  • This script should be installed automatically in /bin when installing the package.
  • Check with Mel whether there are any extra gadgets that would be helpful here. For example: should we store not only most_recent_run.txt, but a full submission history (most_recent_runs.txt), so that tensorboard monitoring can pick up, say, the last 3 runs?
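
A minimal sketch of the script's core, assuming the last submission's run id is stored in most_recent_run.txt and that the run's logs have already been downloaded or mounted locally (file names, arguments and folder layout are assumptions):

    # Sketch only: file names, arguments and the log folder layout are assumptions.
    from argparse import ArgumentParser
    from pathlib import Path

    from tensorboard import program


    def main() -> None:
        parser = ArgumentParser()
        parser.add_argument("--run_id", default="",
                            help="AzureML run to monitor; defaults to the most recent submission.")
        parser.add_argument("--logdir", default="outputs/tensorboard")
        args = parser.parse_args()
        run_id = args.run_id or Path("most_recent_run.txt").read_text().strip()
        print(f"Monitoring run {run_id}")
        tb = program.TensorBoard()
        tb.configure(argv=[None, "--logdir", args.logdir])
        print(f"TensorBoard is running at {tb.launch()}")
        input("Press Enter to stop monitoring")


    if __name__ == "__main__":
        main()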

Switch tests to use a separate AzureML workspace

  • Create a new AzureML workspace and storage account used only for the hi-ml package. Ideally this should be done with the Azure CLI tools, so that we have a completely script-driven way of creating:
    • resource group
    • workspace
    • storage account
    • service principal
    • data store
    • compute cluster
  • Document all the necessary steps in an MD file, and store the script(s) that create everything
  • Switch the github workflow to this new workspace and service principal

Add/check the `LGTM.UploadSnapshot` element of our security checks

Potential omission from #55 where it says

For CodeQL, please ensure the following (detailed instructions for CodeQL can be found here):

  • Select the source code language in the CodeQL task.
  • If your application was developed using multiple languages, add multiple CodeQL tasks.
  • Define the build variable LGTM.UploadSnapshot=true.
  • Configure the build to allow scripts to access the OAuth token.
  • If the code is hosted in GitHub, create an Azure DevOps PAT token with code read scope for the dev.azure.com/Microsoft (or 'all') organization and set the local task variable System_AccessToken with it. (Note: this only works for YAML-based pipelines.)
  • Review security issues by navigating to semmleportal.azurewebsites.net/lookup. It may take up to one day to process results.

This is not done yet (unless it happens automatically) and I cannot find any mention of LGTM in InnerEye to crib from.

The CodeQL Portal says "Only Visual Studio Team System and Azure Dev Ops URLs are supported" and will not upload a snapshot from GitHub

Add all items required for making the repository public

Ensure that all files have copyright notices, and that editors are set up to automatically insert them (PyCharm does it correctly on InnerEye)

You must run the following source code analysis tools:

  • CredScan
  • CodeQL (Semmle)
  • Component Governance Detection

The easiest way to run these tools is to add them in your build pipeline in a Microsoft-managed Azure DevOps account.

For CodeQL, please ensure the following (detailed instructions for CodeQL can be found here):

  • Select the source code language in the CodeQL task.
  • If your application was developed using multiple languages, add multiple CodeQL tasks.
  • Define the build variable LGTM.UploadSnapshot=true.
  • Configure the build to allow scripts to access the OAuth token.
  • If the code is hosted in GitHub, create an Azure DevOps PAT token with code read scope for the dev.azure.com/Microsoft (or 'all') organization and set the local task variable System_AccessToken with it. (Note: this only works for YAML-based pipelines.)
  • Review security issues by navigating to semmleportal.azurewebsites.net/lookup. It may take up to one day to process results.

Switch InnerEye to using hi-ml as a package

  • Branch on InnerEye-DeepLearning: antonsc/himl. This code uses hi-ml as a submodule
  • Remove the submodule hi-ml
  • Add hi-ml as a package in environment.yml
  • At present, there are two test failures that should vanish when switching to hi-ml as a package:
    • test_register_and_score_model in the TrainEnsemble leg
    • test_submit_for_inference in the TrainInAzureMLViaSubmodule leg

Set up a PR build and other pipelines

From the old repo, copy over everything that makes sense. This would include:

  • flake8 and mypy checks as github workflows
  • Running pytest and publishing pytest results to ADO
  • Running credscan and component governance
  • Copying issues to Azure DevOps

Handle the "v" in version numbering

Our code in setup.py will trigger on new tags. setuptools.setup will reject tags that are not release versions, but we could do more to make that explicit by checking for the leading "v".

Also, when we tag releases as, say, "v0.1.1", the leading "v" is carried through setuptools.setup, so it becomes part of the pip test download:

    Successfully installed pip-21.2.4
    Collecting hi-ml==v0.1.0
    Downloading hi_ml-0.1.0-py3-none-any.whl (25 kB)

(from here)

This works, but it would be cleaner to submit the version number using the public version identifier format mandated in PEP 440, i.e. without the leading "v".
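
A minimal sketch of normalizing the tag in setup.py, assuming the tag name arrives via the GITHUB_REF_NAME environment variable (how the tag reaches setup.py is an assumption):

    # Sketch only: GITHUB_REF_NAME as the source of the tag is an assumption;
    # the normalization itself follows the PEP 440 public version identifier format.
    import os
    import re


    def version_from_tag(tag: str) -> str:
        # "v0.1.1" -> "0.1.1"; reject anything that is not a plain release version.
        version = tag[1:] if tag.startswith("v") else tag
        if not re.fullmatch(r"\d+(\.\d+)*", version):
            raise ValueError(f"Tag {tag!r} is not a release version")
        return version


    package_version = version_from_tag(os.environ.get("GITHUB_REF_NAME", "v0.0.0"))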

Update tooling

  • Add vscode config files
  • Add more testing to makefile
  • Add code coverage comment bot

Add options for input and output datasets

submit_if_needed needs arguments for

  • default data store
  • input datasets
  • output datasets

Datasets can be specified either programmatically via a DatasetConfig, or as a string. DatasetConfig should contain

  • dataset name. If the dataset does not yet exist, create it
  • datastore name - if missing, use the default data store
  • version (optional) - only makes sense for input datasets
  • should the dataset be mounted or downloaded when running in AML?
  • folder location for mount or download when running in AML
  • Optional: folder location to use when running outside AML

Datasets as strings:

  • "foo": use dataset foo
  • "foo:2": Use version 2 of dataset "foo"
  • "mount:foo:/tmp/nix": Mount dataset at the given path. Create an equivalent for "download". Add options for versions

Depending on whether the dataset is in the inputs or the outputs list, we can create an input or output config from it.
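
A minimal sketch of the two forms described above; the field and function names follow the bullet points and are assumptions rather than the final API:

    # Sketch only: field and function names are assumptions based on the lists above.
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Optional


    @dataclass
    class DatasetConfig:
        name: str                             # created in AML if it does not exist yet
        datastore: str = ""                   # empty means: use the default data store
        version: Optional[int] = None         # only meaningful for input datasets
        use_mounting: bool = False            # mount rather than download when running in AML
        target_folder: Optional[Path] = None  # mount/download location when running in AML
        local_folder: Optional[Path] = None   # folder to use when running outside AML


    def parse_dataset_string(spec: str) -> DatasetConfig:
        # Accepts "foo", "foo:2", "mount:foo:/tmp/nix" (and "download:..." analogously).
        parts = spec.split(":", 2)
        if parts[0] in ("mount", "download"):
            return DatasetConfig(name=parts[1], use_mounting=(parts[0] == "mount"),
                                 target_folder=Path(parts[2]))
        if len(parts) == 2:
            return DatasetConfig(name=parts[0], version=int(parts[1]))
        return DatasetConfig(name=parts[0])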

Keeping private functions private

We are not consistent about prefixing private functions with one underscore (module functions) or two (instance methods). That doesn't matter much for an internal project, but as we hope to have external developers using these packages, we should clearly signal the difference between private functions, which should only be called by code inside the package, and public functions, which may be useful to consumers of the package.
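
A small illustration of the intended convention (all names here are hypothetical):

    # Illustration only; names are hypothetical.
    def _read_settings(path: str) -> dict:
        """Module-private helper: single leading underscore, not part of the public API."""
        return {}


    class RunSubmitter:
        def __validate(self) -> None:
            """Instance-private method: double leading underscore (name-mangled)."""

        def submit(self) -> None:
            """Public method, intended for consumers of the package."""
            self.__validate()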

When not providing ignored_folders, code crashes

amlignore_path is not assigned if ignored_folders is empty:

    if ignored_folders:
        amlignore_path = snapshot_root_directory or Path.cwd()
        amlignore_path = amlignore_path / ".amlignore"
        lines_to_append = [str(path) for path in ignored_folders] if ignored_folders else []
    with append_to_amlignore(
            amlignore=amlignore_path,
            lines_to_append=lines_to_append):
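
A minimal sketch of a possible fix is to bind both variables unconditionally before entering the context manager:

    # Possible fix (sketch only): make sure both variables are always assigned.
    amlignore_path = (snapshot_root_directory or Path.cwd()) / ".amlignore"
    lines_to_append = [str(path) for path in ignored_folders] if ignored_folders else []
    with append_to_amlignore(
            amlignore=amlignore_path,
            lines_to_append=lines_to_append):
        ...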

AML Entry script path as Linux-style path; setting data store

from old code:

    # AzureML seems to sometimes expect the entry script path in Linux format, hence convert to posix path
    entry_script_relative_path = source_config.entry_script.relative_to(source_config.root_folder).as_posix()

In the same washup:

    # Use blob storage for storing the source, rather than the FileShares section of the storage account.
    run_config.source_directory_data_store = workspace.datastores.get(WORKSPACE_DEFAULT_BLOB_STORE_NAME).name

Create documentation and usage examples

  • Start with the simplest possible script. Explain what is being uploaded and executed
  • Then add components one by one: Conda environment, pip extra index, docker
  • Use of existing environment (can skip that)
  • Input datasets
  • Output datasets
  • From the InnerEye documentation, copy over the section that explains how we are dealing with datasets. Extend and polish that. What are AML datasets, how do they relate to datasets in blob storage?

Better Requirements

Add a new file: run_requirements.txt with the package run requirements.
Parse this and add it to the setup.py install_requires array.
Add a new shell script to pip install all the requirements for ease of development.
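
A minimal sketch of the parsing step in setup.py (the file name run_requirements.txt is as proposed above; the rest of the setup() call is an assumption):

    # Sketch only: the surrounding setup() arguments are assumptions, not the real setup.py.
    from pathlib import Path

    from setuptools import find_packages, setup

    run_requirements = [
        line.strip()
        for line in Path("run_requirements.txt").read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]

    setup(
        name="hi-ml",
        packages=find_packages("src"),
        package_dir={"": "src"},
        install_requires=run_requirements,
    )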

Improve coverage for `submit_to_azure_if_needed`

  • Break submit_to_azure_if_needed into smaller functions
  • Write unit tests for those smaller functions
  • Write unit tests for the various utility functions they rely on (copied from Inner Eye where possible)
  • Where appropriate (hopefully everywhere!) mark as @pytest.mark.fast

Remove random post number from PyPi publication process

To publish a new package to PyPi (i.e. the public package repository, not the test one) we run these commands:

    make clean
    make build
    twine upload dist/*

That runs setup.py. To get local testing working, we added code to setup.py that checks whether it is running in GitHub; if it is not, it changes the post number to a string of nine random digits. Unfortunately that code also runs when packaging for real, i.e. not as part of testing, and so instead of build numbers like 0.0.1post1, 0.0.1post2, 0.0.1post3 etc. we get ones like 0.0.1post5725449762.
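
A minimal sketch of one way to make the random suffix opt-in rather than the default, assuming a hypothetical HIML_LOCAL_TEST_BUILD flag and BUILD_NUMBER variable (the existing setup.py instead keys off whether it runs inside GitHub):

    # Sketch only: HIML_LOCAL_TEST_BUILD and BUILD_NUMBER are hypothetical; GITHUB_ACTIONS
    # is the standard variable that GitHub sets to "true" on its build agents.
    import os
    import random

    is_github_build = os.environ.get("GITHUB_ACTIONS", "") == "true"
    wants_random_suffix = os.environ.get("HIML_LOCAL_TEST_BUILD", "") == "1"

    post_number = os.environ.get("BUILD_NUMBER", "1")
    if not is_github_build and wants_random_suffix:
        # Only randomize when a developer explicitly asks for a throwaway local test package.
        post_number = "".join(random.choice("0123456789") for _ in range(9))

    package_version = f"0.1.0.post{post_number}"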

Test Against Editable or Installed Package

When running the unit tests, some folders are created using the pytest tmp_path fixture and the src folder is copied into them. This means that the coverage tool thinks they are different from the installed package. When running as a test in a GitHub action, the package is already installed, so this should not be necessary. When running locally, it should be possible to install the src folder as an editable package with the -e option and still run the tests.

Upload and Run Very simple script

To get things started, upload and run a script that just prints out a message handed in as an argument, using our new submit_to_azure_if_needed function.
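
A minimal sketch of such a script; the import path and the parameter names of submit_to_azure_if_needed are assumptions and need checking against the actual hi-ml API:

    # Sketch only: the import path and parameters are assumptions.
    from argparse import ArgumentParser

    from health.azure.himl import submit_to_azure_if_needed


    def main() -> None:
        # Submits this very script to AzureML when run locally; inside AzureML the call
        # returns immediately and the rest of the script runs as the job.
        submit_to_azure_if_needed(compute_cluster_name="testing-nc12", wait_for_completion=True)
        parser = ArgumentParser()
        parser.add_argument("--message", default="Hello from AzureML")
        args = parser.parse_args()
        print(args.message)


    if __name__ == "__main__":
        main()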

Switch PR pipeline to use make

In the Makefile, we now have code to run the mypy and flake8 checks and to build the environment. Use those as building blocks in the PR build. This way we can ensure that the local dev environment is the same as the one used in the cloud.

Improve logic for waiting for job completion

Move this segment into the AzureML layer:

        # For PR builds where we wait for job completion, the job must have ended in a COMPLETED state.
        if self.azure_config.wait_for_completion and not is_run_and_child_runs_completed(azure_run):
            raise ValueError(f"Run {azure_run.id} in experiment {azure_run.experiment.name} or one of its child "
                             "runs failed.")

Add generic helper functions from InnerEye

  • package_setup_and_hacks - parts of it should be called by submit_if_needed (matplotlib, MKL)
  • is_global_rank_zero()
  • set_environment_variables_for_multi_node: This should probably be renamed to include something around Pytorch Lightning, because we set environment variables in a way that the PL trainer needs.

Add helper functions for use in AML jobs

Helper functions to use in code for downloading a file from a run (working seamlessly both in local runs and in AML), e.g. used for downloading a checkpoint

  • In AzureML, they should take authentication/workspace from the current run context
  • Outside AzureML, they need a config.json file (see the sketch below)
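
A minimal sketch of the local-vs-AML switch; Run.get_context, Run.get, Workspace.from_config and Run.download_file are standard AzureML SDK calls, while the helper name and the offline-run check are assumptions:

    # Sketch only: the helper name and the offline-run detection are assumptions.
    from pathlib import Path

    from azureml.core import Run, Workspace


    def download_file_from_run(run_id: str, filename: str, output_path: Path) -> Path:
        context = Run.get_context()
        if hasattr(context, "experiment"):
            # Inside AzureML: authentication and workspace come from the current run context.
            workspace = context.experiment.workspace
        else:
            # Outside AzureML: fall back to the config.json file next to the code.
            workspace = Workspace.from_config()
        run = Run.get(workspace, run_id)
        run.download_file(name=filename, output_file_path=str(output_path))
        return output_path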

Code coverage has 0% coverage

Unfortunately the build code coverage is still pointing at "src" when evaluating the package. Change it to point at "health"

Add a script to download files from runs

  • download via wildcard from the most recent run, or a given run
  • downloaded folder structure should include the run, to have multiple results side by side easily
  • Install as executable when installing the package, see #59

Use mypy_runner, not mypy

The github action, build-test-pr.yml, and makefile, both call mypy directly. Change them both to call mypy_runner instead.

Also change the build dependencies so that publishing to test.pypi depends only on pytest.

Improve dev onboarding documentation

  • Ensure that the Readme file on pypi is correct and meaningful
  • Describe how to set up a dev environment (create conda env/pip)
  • How to set up and use a Service Principal
  • How does device login work (and when does it not work)
  • Describe how to do a release (pep guidelines for versions)
