azure / azure-orbital-analytics-samples

Sample solution that demonstrates how to deploy and analyze spaceborne data using Azure Synapse Analytics

Home Page: https://aka.ms/synapse-geospatial-analytics

License: MIT License

Languages: Shell 16.00%, Bicep 37.08%, Python 46.63%, Dockerfile 0.15%, ASL 0.14%
Topics: geospatial, spaceborne, synapse-analytics, cognitive-services, computer-vision, azure-batch, azure-storage

azure-orbital-analytics-samples's Introduction

Project

This repository contains a sample solution that demonstrates how to deploy and run a geospatial analysis workload on Azure Synapse Analytics in your Azure tenant. We recommend that you read the document "Geospatial Analysis using Azure Synapse Analytics" before deploying this solution.

Disclaimer: The solution and samples provided in this repository are for learning purposes only. They're intended to explore the possibilities of the Azure services and serve as a starting point for developing your own solution. We recommend that you follow the security best practices in the Microsoft documentation for the individual services.

Getting Started

Start by following the README.md to set up the Azure resources required to run the pipeline.

This solution uses a Custom Vision model as a sample AI model to demonstrate an end-to-end Azure Synapse workflow for geospatial analysis. In this sample, the AI model detects swimming pools in the given geospatial data.

You can use this solution to integrate other AI models. Each AI model requires its input geospatial data to be in a specific format. When adapting this solution to a different AI model, make sure the geospatial data transformation steps are modified to match that model's needs, as illustrated in the sketch below.
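As a rough illustration, the transform chain can be modeled as an ordered list of steps that is swapped per model. This is a minimal sketch, not the repository's actual API; run_transforms and the step signature are illustrative names.

from typing import Callable, List

Transform = Callable[[str], str]  # takes an input path, returns an output path

def run_transforms(input_path: str, steps: List[Transform]) -> str:
    # Apply each geospatial transform in order, chaining outputs to inputs.
    path = input_path
    for step in steps:
        path = step(path)
    return path

For the sample Custom Vision model the chain is mosaic, crop, convert, and tile; a different AI model might need a different sequence, projection, or tile size.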

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

azure-orbital-analytics-samples's People

Contributors

karthick-rn, mandarinamdar, microsoft-github-operations[bot], microsoftopensource, senthilkungumaraj, sjyang18, sushilkm, taiyee, tushardhadiwal


azure-orbital-analytics-samples's Issues

Replace copy noop spark job with data flow

Replace the copy no-op Spark job with a data flow. This should unblock the pipeline and improve overall performance, since it avoids the file share mount in mssparkutils, which is currently failing.

Improve instructions for creating Synapse pipeline

Please provide more detailed instructions on which pipeline to import

The instructions currently say:

"Import the pipeline under the workflow folder to your Azure Synapse Analytics instance's workspace. Alternatively, you can copy the files to your repository (git or Azure DevOps) and link the repository to your Azure Synapse Analytics workspace."

Does the pipeline have to be created by using "Browse gallery" and picking "Spaceborne Data Analysis Master Pipeline"?

Under the workflow folder of the source code there are two folders with JSON files, and there appear to be multiple pipeline definitions under custom-vision-model-v2\pipeline, for example.

Implement a simplified transform for Custom Vision Model Pipeline

Implement a simplified version of the current transform pipeline for the Custom Vision model. The current pipeline performs mosaic, crop, convert, and tiling as a series of Spark jobs, which translates to a separate spark-submit to the Spark cluster in Synapse for each step. Every spark-submit incurs an overhead of 2 to 4 minutes, which adds significant lead time to each transform: mosaic, crop, convert, and tiling.

As a simple update, implement a consolidated version of the transform where the four transforms are submitted as a single Spark job, as sketched below. This removes the redundant lead time associated with having multiple Spark jobs.
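A minimal sketch of the consolidated job, assuming hypothetical mosaic/crop/convert/tile entry points (the function names and paths below are illustrative, not the repository's actual module layout):

from pyspark.sql import SparkSession

# Hypothetical per-step entry points; today each of these is its own
# spark-submit. Here they are plain functions sharing one Spark session.
def run_mosaic(spark: SparkSession, src: str) -> str: return src + "mosaic/"
def run_crop(spark: SparkSession, src: str) -> str: return src + "crop/"
def run_convert(spark: SparkSession, src: str) -> str: return src + "convert/"
def run_tiling(spark: SparkSession, src: str) -> str: return src + "tiles/"

def main() -> None:
    spark = SparkSession.builder.appName("consolidated-transform").getOrCreate()
    try:
        path = "raw/"
        # One application, one 2-4 minute startup cost for the whole chain.
        for step in (run_mosaic, run_crop, run_convert, run_tiling):
            path = step(spark, path)
    finally:
        spark.stop()

if __name__ == "__main__":
    main()

Submitted once with spark-submit, this pays the cluster startup overhead a single time instead of four.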

A Spark pool cannot have 1 nodes; it must have between 3 and 200 nodes.

While running the installation script, there is another error, this time about the Spark autoscale configuration.

{"status":"Failed","error":{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.","details":[{"code":"Conflict","message":"{\r\n "status": "Failed",\r\n "error": {\r\n "code": "ResourceDeploymentFailure",\r\n "message": "The resource operation completed with terminal provisioning state 'Failed'.",\r\n "details": [\r\n {\r\n "code": "DeploymentFailed",\r\n "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",\r\n "details": [\r\n {\r\n "code": "Conflict",\r\n "message": "{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"DeploymentFailed\",\r\n \"message\": \"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.\",\r\n \"details\": [\r\n {\r\n \"code\": \"BadRequest\",\r\n \"message\": \"{\\r\\n \\\"error\\\": {\\r\\n \\\"code\\\": \\\"ValidationFailed\\\",\\r\\n \\\"message\\\": \\\"Spark pool request validation failed.\\\",\\r\\n \\\"details\\\": [\\r\\n {\\r\\n \\\"code\\\": \\\"NodeCountNotValid\\\",\\r\\n \\\"message\\\": \\\"The autoscale minimum node count is not valid. A Spark pool cannot have 1 nodes; it must have between 3 and 200 nodes.\\\"\\r\\n }\\r\\n ]\\r\\n }\\r\\n}\"\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n}"\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n}"}]}}

Setup.sh - Storage account name already taken prevents project setup

When setting up the solution with setup.sh, the deployment fails with a storage account name conflict. The error message is included below.

{"status":"Failed","error":{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.","details":[{"code":"Conflict","message":"{\r\n "status": "Failed",\r\n "error": {\r\n "code": "ResourceDeploymentFailure",\r\n "message": "The resource operation completed with terminal provisioning state 'Failed'.",\r\n "details": [\r\n {\r\n "code": "DeploymentFailed",\r\n "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",\r\n "details": [\r\n {\r\n "code": "Conflict",\r\n "message": "{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"DeploymentFailed\",\r\n \"message\": \"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.\",\r\n \"details\": [\r\n {\r\n \"code\": \"Conflict\",\r\n \"message\": \"{\\r\\n \\\"status\\\": \\\"Failed\\\",\\r\\n \\\"error\\\": {\\r\\n \\\"code\\\": \\\"ResourceDeploymentFailure\\\",\\r\\n \\\"message\\\": \\\"The resource operation completed with terminal provisioning state 'Failed'.\\\",\\r\\n \\\"details\\\": [\\r\\n {\\r\\n \\\"code\\\": \\\"DeploymentFailed\\\",\\r\\n \\\"message\\\": \\\"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.\\\",\\r\\n \\\"details\\\": [\\r\\n {\\r\\n \\\"code\\\": \\\"Conflict\\\",\\r\\n \\\"message\\\": \\\"{\\\\r\\\\n \\\\\\\"error\\\\\\\": {\\\\r\\\\n \\\\\\\"code\\\\\\\": \\\\\\\"StorageAccountAlreadyTaken\\\\\\\",\\\\r\\\\n \\\\\\\"message\\\\\\\": \\\\\\\"The storage account named synhnsrqjhqd is already taken.\\\\\\\"\\\\r\\\\n }\\\\r\\\\n}\\\"\\r\\n }\\r\\n ]\\r\\n }\\r\\n ]\\r\\n }\\r\\n}\"\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n}"\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n}"}]}}

Update README.md file with additional optional parameters

Since the release of the initial version of the script that deploys and configures the reference architecture, the README.md file has become outdated, specifically in the parameters section. The required and optional parameters in the README.md file need to be updated.

Let's clean up the version references for the pipeline

We have only one pipeline, but its name still follows the older v1/v2 nomenclature. Let's clean up this anomaly by removing the v2 references so the pipeline is referred to simply as custom-vision-model.

Need an option to create a Synapse workspace with a managed VNET

We need a way to create a Synapse workspace with a managed VNET. This enables the security architecture with managed private endpoints.

By design, DEP (Data Exfiltration Protection) does not support public channels for package deployment. So, to support the current deployment scenario without introducing a regression, DEP will be disabled by default.

Add test for `DEPLOY_PGSQL=false`

We are only testing DEPLOY_PGSQL=true; we should add another ADO test for the false condition, which would exercise the Synapse pipeline and deployment for a non-PGSQL setup.

Custom Vision model offline image issue with Protobuf

A protobuf issue introduced during the last build (triggered by the PR merge) causes the AI model run to fail. This issue needs to be addressed in a separate PR.


Traceback (most recent call last):
  File "./custom_vision.py", line 11, in <module>
    from predict import initialize, predict_image
  File "/predict.py", line 7, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py", line 98, in <module>
    from tensorflow_core import *
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/__init__.py", line 40, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 959, in _find_and_load_unlocked
  File "/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/core/framework/graph_pb2.py", line 16, in <module>
    from tensorflow.core.framework import node_def_pb2 as tensorflow_dot_core_dot_framework_dot_node__def__pb2
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/core/framework/node_def_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/core/framework/attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/core/framework/tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/core/framework/resource_handle_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/core/framework/tensor_shape_pb2.py", line 42, in <module>
    serialized_options=None, file=DESCRIPTOR),
  File "/usr/local/lib/python3.7/site-packages/google/protobuf/descriptor.py", line 560, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
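Until the image is rebuilt with regenerated protos or a pinned protobuf, the second workaround from the error message can be applied in code. A minimal sketch, assuming the container's Python entry point can be edited:

import os

# Force the pure-Python protobuf implementation before tensorflow (and its
# generated _pb2 modules) is imported; slower, per the error message, but it
# avoids the descriptor check that fails above.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import tensorflow as tf  # noqa: E402  (must come after the variable is set)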

Orbital Analytics end-to-end solution using downlinked .bin data

The proposed sample uses a sample GeoTIFF (.tif) file, which is downloaded by the deploy/scripts/copy_geotiff.sh script and afterwards converted to tiles. However, there does not seem to be any documentation on how to produce such a GeoTIFF from the Azure Orbital .bin file downloaded at the end of https://docs.microsoft.com/en-us/azure/orbital/downlink-aqua.

The instructions at https://docs.microsoft.com/en-us/azure/orbital/satellite-imagery-with-orbital-ground-station also do not address this topic, so there is a large gap between these two tutorials.

The only relevant instruction available covers using demodulation to reduce the .bin file size; it says nothing about producing the final product.

Use restrictive permissions for batch account rather than Contributor access

Problem
Currently we grant the managed identities Contributor access on the batch account. This access is very open-ended and could be a security issue.

Proposed Solution
Let's create a custom role with selective permissions on the batch account and assign that custom role to the managed identities as required. A sketch follows.
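As a sketch, the custom role could look like the following definition body (e.g. for az role definition create). The role name and action strings are illustrative; confirm them against the Microsoft.Batch operations list before use:

import json

custom_role = {
    "Name": "Batch Pool Operator (sample)",  # hypothetical role name
    "Description": "Minimal rights the pipeline needs on the batch account.",
    "Actions": [
        "Microsoft.Batch/batchAccounts/read",
        "Microsoft.Batch/batchAccounts/pools/read",
        "Microsoft.Batch/batchAccounts/pools/write",
    ],
    "NotActions": [],
    "AssignableScopes": ["/subscriptions/<subscription-id>"],  # placeholder
}

print(json.dumps(custom_role, indent=2))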

User documentation seems confusing when coming from pipeline in Synapse gallery

When a user lands on https://github.com/Azure/Azure-Orbital-Analytics-Samples/blob/main/deploy/gallery/instructions.md from the pipeline help, it suggests going to https://github.com/Azure/Azure-Orbital-Analytics-Samples/blob/main/deploy/README.md, which contains not only the infrastructure deployment documentation but also instructions for deploying the pipelines standalone.

All in all, this is a little confusing for a user of the pipeline. Let's simplify the documentation and make it helpful and easy to understand and use.

Implement post-deployment checks

Is your feature request related to a problem? Please describe.
Currently, there is no post-deployment sanity check to make sure the required components, such as the Spark pool and Batch pool nodes, are provisioned successfully.

Describe the solution you'd like
Add a validation script that confirms the components are provisioned successfully post-deployment; see the sketch at the end of this issue.

Describe alternatives you've considered
Currently, I don't know of any alternatives, but I will consider evaluating options in due course.

Additional context
No additional context
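A minimal sketch of such a check, assuming the azure-identity, azure-mgmt-synapse, and azure-mgmt-batch packages; every resource name below is a placeholder:

from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.batch import BatchManagementClient

SUB, RG = "<subscription-id>", "<resource-group>"
cred = DefaultAzureCredential()

# Fetch the provisioning state of the Synapse Spark pool and the Batch pool.
spark_pool = SynapseManagementClient(cred, SUB).big_data_pools.get(
    RG, "<workspace-name>", "<spark-pool-name>")
batch_pool = BatchManagementClient(cred, SUB).pool.get(
    RG, "<batch-account-name>", "<pool-name>")

for name, state in [("spark pool", spark_pool.provisioning_state),
                    ("batch pool", batch_pool.provisioning_state)]:
    print(f"{name}: {state}")
    assert "succeeded" in str(state).lower(), f"{name} did not provision successfully"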

Deploying via Shell Scripts does not set the POSTGRES_ADMIN_LOGIN_PASS

Running setup.sh or install.sh does not prompt the user for the PGSQL admin password as described in the deployment documentation.

Without this prompt, $POSTGRES_ADMIN_LOGIN_PASS is empty and the deployment fails with the error below (a fail-fast sketch follows it):

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.","details":[{"code":"PasswordNotComplex","message":"Password validation failed. The password does not meet policy requirements because it is not complex enough."}]}

Thanks

env_code is not unique for workflow run

Since environment_code is currently derived from the commit SHA, re-runs of the workflow hit errors because the Azure Key Vault is recreated with the same name. Ideally, env_code should be unique per run, so that cleanup of a previous workflow run does not interfere with the new run. A sketch of one approach follows.
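The sketch mixes the run id (or a timestamp) into the code so re-runs of the same SHA no longer collide. The environment variable names are illustrative and depend on the CI system:

import hashlib
import os
import time

sha = os.environ.get("COMMIT_SHA", "local")               # per-commit input
run_id = os.environ.get("RUN_ID", str(int(time.time())))  # per-run input
env_code = hashlib.sha1(f"{sha}-{run_id}".encode()).hexdigest()[:8]
print(env_code)  # unique per run, stable within a run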

Copy Xml/JSON: The specified resource does not exist

The Synapse pipeline was created by searching the gallery for "Spaceborne Data Analysis Master Pipeline" and setting the 3 required linked services accordingly:

  1. A linked service to the Azure Blob Storage account rawdata******
  2. A linked service to the Azure Data Lake Storage Gen2 account rawdata******
  3. A linked service to

While running the Custom Vision pipeline, the following error occurs (similar for the Copy XML and Copy JSON activities). It is unclear which resource does not exist:

{
"errorCode": "2200",
"message": "ErrorCode=UserErrorFileNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error Message: The specified resource does not exist. (ErrorCode: 404, Detail: The specified resource does not exist., RequestId: 039b4525-001a-0061-2f79-d3625a000000),Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.StorageException,Message=The specified resource does not exist.,Source=Microsoft.Azure.Storage.Common,'",
"failureType": "UserError",
"target": "Copy Xml",
"details": []
}

The activity uses the following path expression:
"@concat(pipeline().parameters.Prefix, '/', activity('Read Spec Document').output['runStatus'].output.sink.value[0]['resultsDirectory'], '/other'

The results directory from the Read Spec Document activity is "out" and the prefix is the container name, so the path should be "-test-container/out/other".

A hint about the configuration would be more than welcome, as the directory was not there and nothing changed after creating it manually. @jfrazee Do you think you could give a hint?

Use pre-provisioned batch account

Problem
We hit a quota issue while provisioning the pool on a newly provisioned batch account, and we have to go to Azure support to get the quota fixed before we can proceed. This delays testing.

Proposed Solution
Use a pre-provisioned batch account with enough quota to create new pools.
