
Batch Scoring Deep Learning Models With Azure Machine Learning

Overview

As described in the associated page on the Azure Architecture Center, this repository uses the scenario of applying style transfer to a video (a collection of frames). The architecture can be generalized for any batch scoring scenario that uses deep learning. An alternative solution using Azure Kubernetes Service can be found here.

Design

Reference Architecture Diagram

The above architecture works as follows:

  1. A video file is uploaded to blob storage.
  2. The new file triggers a Logic App, which sends a request to the published Azure Machine Learning (AML) pipeline endpoint (a rough sketch of this request is shown below).
  3. The pipeline processes the video, applies style transfer with MPI, and postprocesses the video.
  4. Once the pipeline completes, the output is saved back to blob storage.
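
The request in step 2 is a plain REST call against the published pipeline endpoint. As a hypothetical sketch only (the endpoint URL and the parameter name below are placeholders, not values taken from this repo; the experiment name is borrowed from the issues further down):

import requests
from azureml.core.authentication import ServicePrincipalAuthentication

# Authenticate as the service principal that the Logic App uses
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<client-id>",
    service_principal_password="<client-secret>",
)

response = requests.post(
    "<published-pipeline-endpoint-url>",   # placeholder URL
    headers=auth.get_authentication_header(),
    json={
        "ExperimentName": "style_transfer_mpi",                # name seen in the issues below
        "ParameterAssignments": {"video": "input/video.mp4"},  # placeholder parameter
    },
)
response.raise_for_status()
print(response.json())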

What is Neural Style Transfer

Style image, input/content video, and output video: click through the links in the original repository to view the examples.

Prerequisites

Local/Working Machine:

Accounts:

While it is not required, it is also useful to install the Azure Storage Explorer to inspect your storage account.

Setup

  1. Clone the repo: git clone https://github.com/Azure/Batch-Scoring-Deep-Learning-Models-With-AML
  2. cd into the repo.
  3. Set up your conda environment using the environment.yml file: conda env create -f environment.yml - this will create a conda environment called batchscoringdl_aml.
  4. Activate your environment: conda activate batchscoringdl_aml
  5. Log in to Azure using the az CLI: az login (a quick sanity check is sketched below).
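
After step 5, you can confirm that the CLI is logged into the subscription you expect (a minimal sketch; assumes az is on your PATH):

import subprocess

# Print the active subscription as a table
subprocess.run(["az", "account", "show", "--output", "table"], check=True)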

Steps

Run through the following notebooks:

  1. Test the scripts
  2. Set up AML
  3. Develop & publish the AML pipeline
  4. Deploy Logic Apps
  5. Clean up

Clean up

To clean up your working directory, you can run the clean_up.sh script that comes with this repo. It will remove all temporary directories that were generated as well as any configuration files (such as Dockerfiles) created during the tutorials. The script will not remove the .env file.

To clean up your Azure resources, you can simply delete the resource group that all your resources were deployed into. This can be done in the az CLI using the command az group delete --name <name-of-your-resource-group>, or in the portal. If you want to keep certain resources, you can also use the az CLI or the Azure portal to cherry-pick the ones you want to deprovision. Finally, you should also delete the service principal using the az ad sp delete command.

All the steps above are covered in the final notebook.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Related projects

Microsoft AI GitHub: Find other best practice projects, and Azure AI designed patterns, in our central repository.

az-deep-batch-score's People

Contributors

danielleodean, dciborow, grecoe, jiata, marabout2015, microsoftopensource, msalvaris, msftgits


az-deep-batch-score's Issues

quota check fails in notebook 03

There is enough quota in eastus, but I get the following error:

print("Checking quota for family size DSv2...")
vm_family = "DSv2"
requested_cores = ffmpeg_node_count * vm_dict[vm_family]["cores"]

diff = check_quota(vm_family)
if diff <= requested_cores:
    print("Not enough cores of DSv2 in region, asking for {} but have {}".format(requested_cores, diff))
else:
    print("There are enough cores, you may continue...")

Checking quota for family size DSv2...

/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/dotenv/main.py:111: UserWarning: key REGION not found in /home/maxkaz/repos/Batch-Scoring-Deep-Learning-Models-With-AML/.env.
warnings.warn("key %s not found in %s." % (key, self.dotenv_path))


TypeError Traceback (most recent call last)
in <module>()
3 requested_cores = ffmpeg_node_count * vm_dict[vm_family]["cores"]
4
----> 5 diff = check_quota(vm_family)
6 if diff <= requested_cores:
7 print("Not enough cores of DSv2 in region, asking for {} but have {}".format(requested_cores, diff))

in check_quota(vm_family)
18 "--location", get_key(env_path, "REGION"),
19 "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % (vm_family)
---> 20 ], stdout=subprocess.PIPE)
21 quota = json.loads(''.join(results.stdout.decode('utf-8')))
22 return int(quota[0]['max']) - int(quota[0]['current'])

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
401 kwargs['stdin'] = PIPE
402
--> 403 with Popen(*popenargs, **kwargs) as process:
404 try:
405 stdout, stderr = process.communicate(input, timeout=timeout)

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
707 c2pread, c2pwrite,
708 errread, errwrite,
--> 709 restore_signals, start_new_session)
710 except:
711 # Cleanup if the child failed starting.

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1273 errread, errwrite,
1274 errpipe_read, errpipe_write,
-> 1275 restore_signals, start_new_session, preexec_fn)
1276 self._child_created = True
1277 finally:

TypeError: expected str, bytes or os.PathLike object, not NoneType
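
The UserWarning above is the real clue: the key REGION is missing from the .env file, so get_key returns None and subprocess is handed a None argument, producing the TypeError. A defensive sketch of check_quota (the az vm list-usage subcommand is an assumption, since the traceback truncates the actual command; the rest mirrors the frames above):

import json
import subprocess
from dotenv import get_key

def check_quota(vm_family, env_path=".env"):
    # Fail fast if REGION was never written to .env
    region = get_key(env_path, "REGION")
    if not region:
        raise ValueError("REGION not set in {}; re-run the setup notebook".format(env_path))
    results = subprocess.run(
        ["az", "vm", "list-usage", "--location", region,
         "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % vm_family],
        stdout=subprocess.PIPE,
    )
    quota = json.loads(results.stdout.decode("utf-8"))
    if not quota:
        raise ValueError("No usage entry found for VM family {}".format(vm_family))
    return int(quota[0]["max"]) - int(quota[0]["current"])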

cannot create non-GPU cluster

Creating ffmpeg-cluster
Creating
AmlCompute wait for completion finished
Terminal state of "Failed" has been reached
Provisioning errors: [{'code': 'BadRequest', 'message': 'The request is invalid', 'error': {'code': 'BadRequest', 'statusCode': 400, 'message': 'The request is invalid', 'details': [{'code': 'The request is invalid', 'message': "RequestId: \nError code: 'InvalidPropertyValue'. Target: ''. Message: 'The specified value Standard_D2s_v3 for property Cluster.Properties.VMSize is not a supported VM size. See additional details for supported VM sizes.'\n Error code: 'SupportedVMSizes'. Target: ''. Message: 'STANDARD_D1,STANDARD_D11,STANDARD_D11_V2,STANDARD_D12,STANDARD_D12_V2,STANDARD_D13,STANDARD_D13_V2,STANDARD_D14,STANDARD_D14_V2,STANDARD_D1_V2,STANDARD_D2,STANDARD_D2_V2,STANDARD_D3,STANDARD_D3_V2,STANDARD_D4,STANDARD_D4_V2,STANDARD_DS11_V2,STANDARD_DS12_V2,STANDARD_DS13_V2,STANDARD_DS14_V2,STANDARD_DS15_V2,STANDARD_DS1_V2,STANDARD_DS2_V2,STANDARD_DS3_V2,STANDARD_DS4_V2,STANDARD_DS5_V2,STANDARD_F16S_V2,STANDARD_F2S_V2,STANDARD_F32S_V2,STANDARD_F4S_V2,STANDARD_F64S_V2,STANDARD_F72S_V2,STANDARD_F8S_V2,STANDARD_NC12,STANDARD_NC12S_V2,STANDARD_NC12S_V3,STANDARD_NC24,STANDARD_NC24RS_V2,STANDARD_NC24RS_V3,STANDARD_NC24S_V2,STANDARD_NC24S_V3,STANDARD_NC24r,STANDARD_NC6,STANDARD_NC6S_V2,STANDARD_NC6S_V3,STANDARD_ND12S,STANDARD_ND24RS,STANDARD_ND24S,STANDARD_ND40s_V2,STANDARD_ND6S,STANDARD_NV12,STANDARD_NV24,STANDARD_NV6'\n"}]}}]
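
The error message itself lists the VM sizes AmlCompute accepts, and Standard_D2s_v3 is not among them. A minimal sketch of creating the cluster with a size from that list instead (STANDARD_DS3_V2 chosen arbitrarily; ws and the node counts come from the notebook):

from azureml.core.compute import AmlCompute, ComputeTarget

# Provision the CPU cluster with a supported size from the list above
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    min_nodes=0,
    max_nodes=2,
)
cluster = ComputeTarget.create(ws, "ffmpeg-cluster", config)
cluster.wait_for_completion(show_output=True)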

AML storage in 03 doesn't seem to get the right storage account info

my_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name=my_datastore_name,
    container_name=get_key(env_path, "STORAGE_CONTAINER_NAME"),
    account_name=get_key(env_path, "STORAGE_ACCOUNT_NAME"),
    account_key=get_key(env_path, "STORAGE_ACCOUNT_KEY"),
    overwrite=True
)


HttpOperationError Traceback (most recent call last)
~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in _register(ws, dto, create_if_not_exists, overwrite, auth, host)
311 client.data_store.create(ws._subscription_id, ws._resource_group, ws._workspace_name,
--> 312 dto, create_if_not_exists)
313 except HttpOperationError as e:

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/_restclient/operations/data_store_operations.py in create(self, subscription_id, resource_group_name, workspace_name, dto, create_if_not_exists, custom_headers, raw, **operation_config)
162 if response.status_code not in [200]:
--> 163 raise HttpOperationError(self._deserialize, response)
164

HttpOperationError: Operation returned an invalid status code 'Azure Storage Error. Please make sure the credential is correct.'

During handling of the above exception, another exception occurred:

HttpOperationError Traceback (most recent call last)
in <module>()
6 account_name=get_key(env_path, "STORAGE_ACCOUNT_NAME"),
7 account_key=get_key(env_path, "STORAGE_ACCOUNT_KEY"),
----> 8 overwrite=True
9 )

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/core/datastore.py in register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token, account_key, protocol, endpoint, overwrite, create_if_not_exists)
113 return Datastore._client().register_azure_blob_container(workspace, datastore_name, container_name,
114 account_name, sas_token, account_key, protocol,
--> 115 endpoint, overwrite, create_if_not_exists)
116
117 @staticmethod

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token, account_key, protocol, endpoint, overwrite, create_if_not_exists)
86 return _DatastoreClient._register_azure_storage(
87 workspace, datastore_name, constants.AZURE_BLOB, container_name, account_name, credential_type,
---> 88 sas_token or account_key, protocol, endpoint, overwrite, create_if_not_exists)
89
90 @staticmethod

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in _register_azure_storage(ws, datastore_name, storage_type, container_name, account_name, credential_type, credential, protocol, endpoint, overwrite, create_if_not_exists, auth, host)
276 datastore = DataStoreDto(datastore_name, storage_type, storage_dto)
277 module_logger.debug("Converted data into DTO")
--> 278 return _DatastoreClient._register(ws, datastore, create_if_not_exists, overwrite, auth, host)
279
280 @staticmethod

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/data/datastore_client.py in _register(ws, dto, create_if_not_exists, overwrite, auth, host)
316 client = _DatastoreClient._get_client(ws, auth, host)
317 client.data_store.update(dto.name, ws._subscription_id, ws._resource_group,
--> 318 ws._workspace_name, dto, create_if_not_exists)
319 else:
320 module_logger.error("Registering datastore failed with {} error code and error message\n{}"

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azureml/_restclient/operations/data_store_operations.py in update(self, name, subscription_id, resource_group_name, workspace_name, dto, create_if_not_exists, custom_headers, raw, **operation_config)
338
339 if response.status_code not in [200]:
--> 340 raise HttpOperationError(self._deserialize, response)
341
342 if raw:

HttpOperationError: Operation returned an invalid status code 'Data store being updated does not exist.'
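
The first failure ("Azure Storage Error. Please make sure the credential is correct.") suggests the key in .env is stale or belongs to a different account; the second error is just the fallback update path failing afterwards. A sketch that cross-checks the stored key against what Azure currently reports (assumes the az CLI is logged in and env_path points at the notebook's .env):

import subprocess
from dotenv import get_key

env_path = ".env"
account = get_key(env_path, "STORAGE_ACCOUNT_NAME")

# Ask Azure for the account's current primary key
current_key = subprocess.run(
    ["az", "storage", "account", "keys", "list",
     "--account-name", account, "--query", "[0].value", "--output", "tsv"],
    stdout=subprocess.PIPE,
).stdout.decode("utf-8").strip()

print("Key in .env matches Azure:", current_key == get_key(env_path, "STORAGE_ACCOUNT_KEY"))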

failing AML pipeline in notebook 03

In the pipeline_run command I get:

Experiment: style_transfer_mpi
Id: 35cbbd6e-2173-4433-9337-010287a76d70
Type: azureml.PipelineRun
Status: Failed
Details Page: Link to Azure Portal
Docs Page: Link to Documentation

add a clear description to the test notebook 01

It's not clear whether the test notebook produces a full-blown short video with artistic style transfer or a video from a partially trained model. A description of what is being output is needed.

Docker image hosted in ACR or Dockerfile

Would it be a good idea to provide a Dockerfile, or a Docker image in ACR, that has the right environment? I'm not sure how much this tutorial depends on the versions of CUDA/cuDNN, but it's just a thought. Every time I see versions mentioned I automatically think of containerization.

test the ability to re-run the notebooks

For example, the first notebook fails on a re-run with

File 'local_test_orangutan/audio.aac/audio.aac' already exists. Overwrite ? [y/N]

and then it just hangs because it can't get user input.
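
The hang happens because ffmpeg is waiting for an overwrite confirmation that the notebook can never supply. Passing -y makes the call non-interactive (a sketch; the input path is hypothetical, the output path is taken from the prompt above):

import subprocess

subprocess.run(
    ["ffmpeg",
     "-y",                                           # overwrite outputs without prompting
     "-i", "local_test_orangutan/orangutan.mp4",     # hypothetical input file
     "local_test_orangutan/audio.aac/audio.aac"],    # output path from the prompt above
    check=True,
)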

notebook 02 shows up with tags

Notebook 02 initially loads as "not trusted" and has the option of adding tags to notebook cells, which makes it look cluttered. Every cell has an "Add tag" button in the top-right corner along with menu options.

notebook 02 storage account creation failure

I get the following error in the second notebook at storage account creation; the Azure CLI version and error are below:

az --version
azure-cli (2.0.55)

!az storage account create \
    -n {storage_account_name} \
    -g {resource_group} \
    --query 'provisioningState'

APIVersion 2018-07-01 is not available
Traceback (most recent call last):
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/knack/cli.py", line 206, in invoke
cmd_result = self.invocation.execute(args)
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 328, in execute
raise ex
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 386, in _run_jobs_serially
results.append(self._run_job(expanded_arg, cmd_copy))
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 379, in _run_job
six.reraise(*sys.exc_info())
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 356, in _run_job
result = cmd_copy(params)
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 171, in call
return self.handler(*args, **kwargs)
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/core/init.py", line 432, in default_command_handler
client = client_factory(cmd.cli_ctx, command_args) if client_factory else None
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/cli/command_modules/storage/_client_factory.py", line 137, in cf_sa
return storage_client_factory(cli_ctx).storage_accounts
File "/home/maxkaz/anaconda3/envs/batchscoringdl_aml/lib/python3.6/site-packages/azure/mgmt/storage/storage_management_client.py", line 230, in storage_accounts
raise NotImplementedError("APIVersion {} is not available".format(api_version))
NotImplementedError: APIVersion 2018-07-01 is not available
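
azure-cli 2.0.55 predates storage API version 2018-07-01, so its bundled storage SDK rejects the request. Upgrading the CLI inside the conda environment is the likely fix (a sketch; it assumes the CLI was installed via pip, which is itself an assumption about environment.yml):

import subprocess

# Upgrade azure-cli in the active environment
subprocess.run(["pip", "install", "--upgrade", "azure-cli"], check=True)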

notebook 02 region selection

For those new to Azure, I think it's hard to realize that region names in the portal map to system region names, e.g. "East US 2" maps to "eastus2" in the API. Is it possible to spell out these mappings? Also, getting the subscription ID is a bit tricky; you could just say "search for 'subscriptions' in the portal search bar".
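
The mapping can also be listed directly instead of guessed (a minimal sketch shelling out to the az CLI):

import subprocess

# Print display-name -> system-name pairs, e.g. "East US 2" -> "eastus2"
subprocess.run(
    ["az", "account", "list-locations",
     "--query", "[].{display:displayName, name:name}", "--output", "table"],
    check=True,
)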

check-quota function error


JSONDecodeError Traceback (most recent call last)
in <module>()
3 requested_cores = ffmpeg_node_count * vm_dict[vm_family]["cores"]
4
----> 5 diff = check_quota(vm_family)
6 if diff <= requested_cores:
7 print("Not enough cores of DSv3 in region, asking for {} but have {}".format(requested_cores, diff))

in check_quota(vm_family)
19 "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % (vm_family)
20 ], stdout=subprocess.PIPE)
---> 21 quota = json.loads(''.join(results.stdout.decode('utf-8')))
22 return int(quota[0]['max']) - int(quota[0]['current'])

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):

~/anaconda3/envs/batchscoringdl_aml/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
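
Here check_quota reached the az call, but stdout did not contain JSON; the CLI sometimes prints warnings or errors instead. Capturing stderr separately makes the real failure visible (a sketch; the region and the az vm list-usage command are assumptions, as before):

import json
import subprocess

cmd = ["az", "vm", "list-usage", "--location", "eastus",  # region assumed
       "--query", "[?contains(localName, 'DSv3')].{max:limit, current:currentValue}"]
results = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    quota = json.loads(results.stdout.decode("utf-8"))
except json.JSONDecodeError:
    # Surface whatever az actually printed instead of JSON
    print("stdout:", results.stdout.decode("utf-8"))
    print("stderr:", results.stderr.decode("utf-8"))
    raise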

GPU cluster machine type

By default, my team's subscription doesn't have any NCSv3 core quota. Do we have to use this type?
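
Any GPU size from the supported list in the earlier issue should work in principle; older families such as STANDARD_NC6 (K80) will just be slower. A sketch that probes which NC families you actually have quota for before picking one (reuses the check_quota helper sketched earlier in this list):

# Probe spare cores per GPU family; check_quota as sketched above
for family in ["NCSv3", "NCSv2", "NC"]:
    try:
        print(family, "spare cores:", check_quota(family))
    except Exception as exc:
        print(family, "quota lookup failed:", exc)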
