
Batch Scoring Deep Learning Models With Azure Machine Learning

Overview

As described in the associated page on the Azure Architecture Center, this repository uses the scenario of applying style transfer to a video (a collection of frames). The architecture can be generalized for any batch scoring scenario that uses deep learning. An alternative solution using Azure Kubernetes Service can be found here.

Design

Reference Architecture Diagram

The above architecture works as follows:

  1. A video file is uploaded to blob storage.
  2. The new file triggers a Logic App, which sends a request to the published Azure Machine Learning (AML) pipeline endpoint (a rough sketch of this request is shown below).
  3. The pipeline processes the video, applies style transfer with MPI, and postprocesses the video.
  4. Once the pipeline completes, the output is saved back to blob storage.
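
The request in step 2 is a plain REST call against the published pipeline endpoint. As a hypothetical sketch only (the endpoint URL and the parameter name below are placeholders, not values taken from this repo; the experiment name is borrowed from the issues further down):

import requests
from azureml.core.authentication import ServicePrincipalAuthentication

# Authenticate as the service principal that the Logic App uses
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<client-id>",
    service_principal_password="<client-secret>",
)

response = requests.post(
    "<published-pipeline-endpoint-url>",   # placeholder URL
    headers=auth.get_authentication_header(),
    json={
        "ExperimentName": "style_transfer_mpi",                # name seen in the issues below
        "ParameterAssignments": {"video": "input/video.mp4"},  # placeholder parameter
    },
)
response.raise_for_status()
print(response.json())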

What is Neural Style Transfer

Style image, input/content video, and output video: click through the links in the original repository to view the examples.

Prerequisites

Local/Working Machine:

Accounts:

While it is not required, it is also useful to install the Azure Storage Explorer to inspect your storage account.

Setup

  1. Clone the repo: git clone https://github.com/Azure/Batch-Scoring-Deep-Learning-Models-With-AML
  2. cd into the repo.
  3. Set up your conda environment using the environment.yml file: conda env create -f environment.yml - this will create a conda environment called batchscoringdl_aml.
  4. Activate your environment: conda activate batchscoringdl_aml
  5. Log in to Azure using the az CLI: az login (a quick sanity check is sketched below).
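
After step 5, you can confirm that the CLI is logged into the subscription you expect (a minimal sketch; assumes az is on your PATH):

import subprocess

# Print the active subscription as a table
subprocess.run(["az", "account", "show", "--output", "table"], check=True)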

Steps

Run through the following notebooks:

  1. Test the scripts
  2. Set up AML
  3. Develop & publish the AML pipeline
  4. Deploy Logic Apps
  5. Clean up

Clean up

To clean up your working directory, you can run the clean_up.sh script that comes with this repo. It will remove all temporary directories that were generated as well as any configuration files (such as Dockerfiles) created during the tutorials. The script will not remove the .env file.

To clean up your Azure resources, you can simply delete the resource group that all your resources were deployed into. This can be done in the az CLI using the command az group delete --name <name-of-your-resource-group>, or in the portal. If you want to keep certain resources, you can also use the az CLI or the Azure portal to cherry-pick the ones you want to deprovision. Finally, you should also delete the service principal using the az ad sp delete command.

All the steps above are covered in the final notebook.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Related projects

Microsoft AI GitHub: Find other best practice projects, and Azure AI designed patterns, in our central repository.

az-deep-batch-score's People

Contributors

danielleodean, dciborow, grecoe, jiata, marabout2015, microsoftopensource, msalvaris, msftgits


az-deep-batch-score's Issues

quota check fails in notebook 03

There is enough quota in eastus, but I get the following error:

print("Checking quota for family size DSv2...")
vm_family = "DSv2"
requested_cores = ffmpeg_node_count * vm_dict[vm_family]["cores"]

diff = check_quota(vm_family)
if diff <= requested_cores:
    print("Not enough cores of DSv2 in region, asking for {} but have {}".format(requested_cores, diff))
else:
    print("There are enough cores, you may continue...")

Checking quota for family size DSv2...

/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/dotenv/main.py:111: UserWarning: key REGION not found in /home/maxkaz/repos/Batch-Scoring-Deep-Learning-Models-With-AML/.env.
warnings.warn("key %s not found in %s." % (key, self.dotenv_path))


TypeError Traceback (most recent call last)
in <module>()
3 requested_cores = ffmpeg_node_count * vm_dict[vm_family]["cores"]
4
----> 5 diff = check_quota(vm_family)
6 if diff <= requested_cores:
7 print("Not enough cores of DSv2 in region, asking for {} but have {}".format(requested_cores, diff))

in check_quota(vm_family)
18 "--location", get_key(env_path, "REGION"),
19 "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % (vm_family)
---> 20 ], stdout=subprocess.PIPE)
21 quota = json.loads(''.join(results.stdout.decode('utf-8')))
22 return int(quota[0]['max']) - int(quota[0]['current'])

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
401 kwargs['stdin'] = PIPE
402
--> 403 with Popen(*popenargs, **kwargs) as process:
404 try:
405 stdout, stderr = process.communicate(input, timeout=timeout)

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
707 c2pread, c2pwrite,
708 errread, errwrite,
--> 709 restore_signals, start_new_session)
710 except:
711 # Cleanup if the child failed starting.

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1273 errread, errwrite,
1274 errpipe_read, errpipe_write,
-> 1275 restore_signals, start_new_session, preexec_fn)
1276 self._child_created = True
1277 finally:

TypeError: expected str, bytes or os.PathLike object, not NoneType
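
The UserWarning above is the real clue: the key REGION is missing from the .env file, so get_key returns None and subprocess is handed a None argument, producing the TypeError. A defensive sketch of check_quota (the az vm list-usage subcommand is an assumption, since the traceback truncates the actual command; the rest mirrors the frames above):

import json
import subprocess
from dotenv import get_key

def check_quota(vm_family, env_path=".env"):
    # Fail fast if REGION was never written to .env
    region = get_key(env_path, "REGION")
    if not region:
        raise ValueError("REGION not set in {}; re-run the setup notebook".format(env_path))
    results = subprocess.run(
        ["az", "vm", "list-usage", "--location", region,
         "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % vm_family],
        stdout=subprocess.PIPE,
    )
    quota = json.loads(results.stdout.decode("utf-8"))
    if not quota:
        raise ValueError("No usage entry found for VM family {}".format(vm_family))
    return int(quota[0]["max"]) - int(quota[0]["current"])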

cannot create non-GPU cluster

Creating ffmpeg-cluster
Creating
AmlCompute wait for completion finished
Terminal state of "Failed" has been reached
Provisioning errors: [{'code': 'BadRequest', 'message': 'The request is invalid', 'error': {'code': 'BadRequest', 'statusCode': 400, 'message': 'The request is invalid', 'details': [{'code': 'The request is invalid', 'message': "RequestId: \nError code: 'InvalidPropertyValue'. Target: ''. Message: 'The specified value Standard_D2s_v3 for property Cluster.Properties.VMSize is not a supported VM size. See additional details for supported VM sizes.'\n Error code: 'SupportedVMSizes'. Target: ''. Message: 'STANDARD_D1,STANDARD_D11,STANDARD_D11_V2,STANDARD_D12,STANDARD_D12_V2,STANDARD_D13,STANDARD_D13_V2,STANDARD_D14,STANDARD_D14_V2,STANDARD_D1_V2,STANDARD_D2,STANDARD_D2_V2,STANDARD_D3,STANDARD_D3_V2,STANDARD_D4,STANDARD_D4_V2,STANDARD_DS11_V2,STANDARD_DS12_V2,STANDARD_DS13_V2,STANDARD_DS14_V2,STANDARD_DS15_V2,STANDARD_DS1_V2,STANDARD_DS2_V2,STANDARD_DS3_V2,STANDARD_DS4_V2,STANDARD_DS5_V2,STANDARD_F16S_V2,STANDARD_F2S_V2,STANDARD_F32S_V2,STANDARD_F4S_V2,STANDARD_F64S_V2,STANDARD_F72S_V2,STANDARD_F8S_V2,STANDARD_NC12,STANDARD_NC12S_V2,STANDARD_NC12S_V3,STANDARD_NC24,STANDARD_NC24RS_V2,STANDARD_NC24RS_V3,STANDARD_NC24S_V2,STANDARD_NC24S_V3,STANDARD_NC24r,STANDARD_NC6,STANDARD_NC6S_V2,STANDARD_NC6S_V3,STANDARD_ND12S,STANDARD_ND24RS,STANDARD_ND24S,STANDARD_ND40s_V2,STANDARD_ND6S,STANDARD_NV12,STANDARD_NV24,STANDARD_NV6'\n"}]}}]
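
The error message itself lists the VM sizes AmlCompute accepts, and Standard_D2s_v3 is not among them. A minimal sketch of creating the cluster with a size from that list instead (STANDARD_DS3_V2 chosen arbitrarily; ws and the node counts come from the notebook):

from azureml.core.compute import AmlCompute, ComputeTarget

# Provision the CPU cluster with a supported size from the list above
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    min_nodes=0,
    max_nodes=2,
)
cluster = ComputeTarget.create(ws, "ffmpeg-cluster", config)
cluster.wait_for_completion(show_output=True)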

AML storage in 03 doesn't seem to get the right storage account info

my_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name=my_datastore_name,
    container_name=get_key(env_path, "STORAGE_CONTAINER_NAME"),
    account_name=get_key(env_path, "STORAGE_ACCOUNT_NAME"),
    account_key=get_key(env_path, "STORAGE_ACCOUNT_KEY"),
    overwrite=True
)


HttpOperationError Traceback (most recent call last)
~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in _register(ws, dto, create_if_not_exists, overwrite, auth, host)
311 client.data_store.create(ws._subscription_id, ws._resource_group, ws._workspace_name,
--> 312 dto, create_if_not_exists)
313 except HttpOperationError as e:

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/_restclient/operations/data_store_operations.py in create(self, subscription_id, resource_group_name, workspace_name, dto, create_if_not_exists, custom_headers, raw, **operation_config)
162 if response.status_code not in [200]:
--> 163 raise HttpOperationError(self._deserialize, response)
164

HttpOperationError: Operation returned an invalid status code 'Azure Storage Error. Please make sure the credential is correct.'

During handling of the above exception, another exception occurred:

HttpOperationError Traceback (most recent call last)
in <module>()
6 account_name=get_key(env_path, "STORAGE_ACCOUNT_NAME"),
7 account_key=get_key(env_path, "STORAGE_ACCOUNT_KEY"),
----> 8 overwrite=True
9 )

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/core/datastore.py in register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token, account_key, protocol, endpoint, overwrite, create_if_not_exists)
113 return Datastore._client().register_azure_blob_container(workspace, datastore_name, container_name,
114 account_name, sas_token, account_key, protocol,
--> 115 endpoint, overwrite, create_if_not_exists)
116
117 @staticmethod

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token, account_key, protocol, endpoint, overwrite, create_if_not_exists)
86 return _DatastoreClient._register_azure_storage(
87 workspace, datastore_name, constants.AZURE_BLOB, container_name, account_name, credential_type,
---> 88 sas_token or account_key, protocol, endpoint, overwrite, create_if_not_exists)
89
90 @staticmethod

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in _register_azure_storage(ws, datastore_name, storage_type, container_name, account_name, credential_type, credential, protocol, endpoint, overwrite, create_if_not_exists, auth, host)
276 datastore = DataStoreDto(datastore_name, storage_type, storage_dto)
277 module_logger.debug("Converted data into DTO")
--> 278 return _DatastoreClient._register(ws, datastore, create_if_not_exists, overwrite, auth, host)
279
280 @staticmethod

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in _register(ws, dto, create_if_not_exists, overwrite, auth, host)
316 client = _DatastoreClient._get_client(ws, auth, host)
317 client.data_store.update(dto.name, ws._subscription_id, ws._resource_group,
--> 318 ws._workspace_name, dto, create_if_not_exists)
319 else:
320 module_logger.error("Registering datastore failed with {} error code and error message\n{}"

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/_restclient/operations/data_store_operations.py in update(self, name, subscription_id, resource_group_name, workspace_name, dto, create_if_not_exists, custom_headers, raw, **operation_config)
338
339 if response.status_code not in [200]:
--> 340 raise HttpOperationError(self._deserialize, response)
341
342 if raw:

HttpOperationError: Operation returned an invalid status code 'Data store being updated does not exist.'
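
The first failure ("Azure Storage Error. Please make sure the credential is correct.") suggests the key in .env is stale or belongs to a different account; the second error is just the fallback update path failing afterwards. A sketch that cross-checks the stored key against what Azure currently reports (assumes the az CLI is logged in and env_path points at the notebook's .env):

import subprocess
from dotenv import get_key

env_path = ".env"
account = get_key(env_path, "STORAGE_ACCOUNT_NAME")

# Ask Azure for the account's current primary key
current_key = subprocess.run(
    ["az", "storage", "account", "keys", "list",
     "--account-name", account, "--query", "[0].value", "--output", "tsv"],
    stdout=subprocess.PIPE,
).stdout.decode("utf-8").strip()

print("Key in .env matches Azure:", current_key == get_key(env_path, "STORAGE_ACCOUNT_KEY"))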

failing AML pipeline in notebook 03

In the pipeline_run command I get:

Experiment: style_transfer_mpi
Id: 35cbbd6e-2173-4433-9337-010287a76d70
Type: azureml.PipelineRun
Status: Failed
Details Page: Link to Azure Portal
Docs Page: Link to Documentation

add a clear description to the test notebook 01

It's not clear whether the test notebook produces a full-blown short video with artistic style transfer or a video from a partially trained model. A description of what is being output is needed.

Docker image hosted in ACR or Dockerfile

Would it be a good idea to provide a Dockerfile, or a Docker image in ACR, that has the right environment? I'm not sure how much this tutorial depends on the versions of CUDA/cuDNN, but it's just a thought. Every time I see versions mentioned I automatically think of containerization.

test the ability to re-run the notebooks

For example, the first notebook fails on a re-run with

File 'local_test_orangutan/audio.aac/audio.aac' already exists. Overwrite ? [y/N]

and then it just hangs because it can't get user input.
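
The hang happens because ffmpeg is waiting for an overwrite confirmation that the notebook can never supply. Passing -y makes the call non-interactive (a sketch; the input path is hypothetical, the output path is taken from the prompt above):

import subprocess

subprocess.run(
    ["ffmpeg",
     "-y",                                           # overwrite outputs without prompting
     "-i", "local_test_orangutan/orangutan.mp4",     # hypothetical input file
     "local_test_orangutan/audio.aac/audio.aac"],    # output path from the prompt above
    check=True,
)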

notebook 02 shows up with tags

Notebook 02 initially loads as "not trusted" and has the option of adding tags to notebook cells, which makes it look cluttered. Every cell has an "Add tag" button in the top-right corner along with menu options.

notebook 02 storage account creation failure

I get the following error in the second notebook at storage account creation; the Azure CLI version and error are below:

az --version
azure-cli (2.0.55)

!az storage account create \
    -n {storage_account_name} \
    -g {resource_group} \
    --query 'provisioningState'

APIVersion 2018-07-01 is not available
Traceback (most recent call last):
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/knack/cli.py", line 206, in invoke
cmd_result = self.invocation.execute(args)
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 328, in execute
raise ex
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 386, in _run_jobs_serially
results.append(self._run_job(expanded_arg, cmd_copy))
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 379, in _run_job
six.reraise(*sys.exc_info())
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 356, in _run_job
result = cmd_copy(params)
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 171, in call
return self.handler(*args, **kwargs)
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/init.py", line 432, in default_command_handler
client = client_factory(cmd.cli_ctx, command_args) if client_factory else None
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/command_modules/storage/_client_factory.py", line 137, in cf_sa
return storage_client_factory(cli_ctx).storage_accounts
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/mgmt/storage/storage_management_client.py", line 230, in storage_accounts
raise NotImplementedError("APIVersion {} is not available".format(api_version))
NotImplementedError: APIVersion 2018-07-01 is not available
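
azure-cli 2.0.55 predates storage API version 2018-07-01, so its bundled storage SDK rejects the request. Upgrading the CLI inside the conda environment is the likely fix (a sketch; it assumes the CLI was installed via pip, which is itself an assumption about environment.yml):

import subprocess

# Upgrade azure-cli in the active environment
subprocess.run(["pip", "install", "--upgrade", "azure-cli"], check=True)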

notebook 02 region selection

For those new to Azure, I think it's hard to realize that region names in the portal map to system region names, e.g. "East US 2" maps to "eastus2" in the API. Is it possible to spell out these mappings? Also, getting the subscription ID is a bit tricky; you could just say "search for 'subscriptions' in the portal search bar".
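
The mapping can also be listed directly instead of guessed (a minimal sketch shelling out to the az CLI):

import subprocess

# Print display-name -> system-name pairs, e.g. "East US 2" -> "eastus2"
subprocess.run(
    ["az", "account", "list-locations",
     "--query", "[].{display:displayName, name:name}", "--output", "table"],
    check=True,
)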

check-quota function error


JSONDecodeError Traceback (most recent call last)
in <module>()
3 requested_cores = ffmpeg_node_count * vm_dict[vm_family]["cores"]
4
----> 5 diff = check_quota(vm_family)
6 if diff <= requested_cores:
7 print("Not enough cores of DSv3 in region, asking for {} but have {}".format(requested_cores, diff))

in check_quota(vm_family)
19 "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % (vm_family)
20 ], stdout=subprocess.PIPE)
---> 21 quota = json.loads(''.join(results.stdout.decode('utf-8')))
22 return int(quota[0]['max']) - int(quota[0]['current'])

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
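
Here check_quota reached the az call, but stdout did not contain JSON; the CLI sometimes prints warnings or errors instead. Capturing stderr separately makes the real failure visible (a sketch; the region and the az vm list-usage command are assumptions, as before):

import json
import subprocess

cmd = ["az", "vm", "list-usage", "--location", "eastus",  # region assumed
       "--query", "[?contains(localName, 'DSv3')].{max:limit, current:currentValue}"]
results = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    quota = json.loads(results.stdout.decode("utf-8"))
except json.JSONDecodeError:
    # Surface whatever az actually printed instead of JSON
    print("stdout:", results.stdout.decode("utf-8"))
    print("stderr:", results.stderr.decode("utf-8"))
    raise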

GPU cluster machine type

By default, my team's subscription doesn't have any NCSv3 core quota. Do we have to use this type?
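
Any GPU size from the supported list in the earlier issue should work in principle; older families such as STANDARD_NC6 (K80) will just be slower. A sketch that probes which NC families you actually have quota for before picking one (reuses the check_quota helper sketched earlier in this list):

# Probe spare cores per GPU family; check_quota as sketched above
for family in ["NCSv3", "NCSv2", "NC"]:
    try:
        print(family, "spare cores:", check_quota(family))
    except Exception as exc:
        print(family, "quota lookup failed:", exc)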
