Microsoft Cloud Workshop Azure Synapse Analytics and AI

License: MIT License

TSQL 2.78% PowerShell 86.64% Python 10.58%

mcw-azure-synapse-analytics-and-ai's Introduction

Azure Synapse Analytics and AI

This workshop is archived and no longer being maintained. Content is read-only.

Wide World Importers (WWI) has hundreds of brick-and-mortar stores and an online store where they sell a variety of products. They would like to gain business insights using historical, real-time, and predictive analytics using structured and unstructured data sources. In doing so, they want to enable their IT team of data engineers and data scientists to bring in and run complex queries over petabytes of structured data with billions of rows and unstructured enterprise operational data. At the same time, they want to enable business analysts and their IT team to share a single source of truth and have a single workspace to collaborate and work with enterprise data and enriched customer data. They want to accomplish this by minimizing the number of disparate services they use across ingest, transformation, querying, and storage so that their team of data engineers, data scientists, and database administrators can master one tool, and can build shared best practices for development, management, and monitoring.

October 2021

Target audience

Data engineer
Data scientist
Machine Learning engineer

Abstracts

Workshop

In this workshop, you will look at the process of creating an end-to-end solution using Azure Synapse Analytics. The workshop will cover data loading, data preparation, data transformation, and data serving, along with performing machine learning and handling both batch and real-time data.

At the end of this whiteboard design session, you will be better able to design and build a complete end-to-end advanced analytics solution using Azure Synapse Analytics.

Whiteboard design session

In this whiteboard design session, you will work in a group to look at the process of designing an end-to-end solution using Azure Synapse Analytics. The design session will cover data loading, data preparation, data transformation, and data serving, along with performing machine learning and handling both batch and real-time data.

At the end of this whiteboard design session, you will be better able to design and build a complete end-to-end advanced analytics solution using Azure Synapse Analytics.

Hands-on lab

In this hands-on lab, you will build end-to-end data analytics with a machine learning solution using Azure Synapse Analytics. The information will be presented in the context of a retail scenario. We will be heavily leveraging Azure Synapse Studio, a tool that conveniently unifies the most common data operations from ingestion, transformation, querying, and visualization.

Azure services and related products

Azure Synapse Analytics
Azure Storage and Azure Data Lake gen 2
Azure Stream Analytics
Azure Machine Learning
Azure App Service
Azure Purview
Event Hubs
IoT Hub
Power BI

Related references

Help & Support

We welcome feedback and comments from Microsoft SMEs & learning partners who deliver MCWs.

Having trouble?

First, verify you have followed all written lab instructions (including the Before the Hands-on lab document).
Next, submit an issue with a detailed description of the problem.
Do not submit pull requests. Our content authors will make all changes and submit pull requests for approval.

If you are planning to present a workshop, review and test the materials early! We recommend at least two weeks prior.

Please allow 5 - 10 business days for review and resolution of issues

mcw-azure-synapse-analytics-and-ai's People

Contributors

Stargazers

Watchers

Forkers

joelhulen samratbhatnagar oddjobe masayukiozawa nasunkar liqsword danieladolcan cloudmelon haroon921 ceakbey ryancrawcour sasever latamocptechteam rajatrakesh solliancenet rrios042 benstegink techyogillc prsayeen timahenning nivedv ctesta-oneillmsft francovp skugan123 taffywrinkle claudiusgonzo dem108 kokohar forestdengtech kreidoss asackmann yang-jiayi shkumesh alangulo dwnatwick wangyihaier annt-nguyen alfeuduran xctpro srushti-714 takeokams trial-data praveenksingh7 sherantonettego johanesalxd saimachi wongamanda keshava anushabc testjiho rabijo remc0000-zz moatwork anitpatel joaosalvadomicrosoft msukkarca kuby49 thivav duynguyendang pcoric seflaherty msworkshop harsh91274 enriquecatala kwanling stversch mindis sesham18 sergiocobosgarcia sidneyocirqueira turretin imsambit nilabjaball hurtn mallik-g testertesterson1004 iamuser1337 zekestarr mariuscend shekharhs91 mathmachado amitlals charlesxxiv datascienceunbound gustavo-devops-001 djuanes shingosakamoto kele087 ncode3 mleczkom abhishekvishal slsu0424 cro27 dbinsights chinmayapadhi proof-of-concept-poc inos-soft element824 rbackupx dhangerkapil

mcw-azure-synapse-analytics-and-ai's Issues

For next test/fix: Move exercise 5 to the end of the workshop

Renumber impacted exercises
Rename impacted deployment artifacts
Review content for any links/mentions to impacted exercise numbers.

Getting "Permission Denied" when Running Powershell Script in Lab Setup

When attempting to run the PowerShell Script that is part of the environment setup I get "Permission Denied".

xxxxx@Azure:~/Synapse-MCW/Hands-on lab/environment-setup/automation$ ./01-environment-setup.ps1
bash: ./01-environment-setup.ps1: Permission denied

This is Task 5, Step 2 in the Before the Hand-on Lab instructions.

Thanks in advance for any assistance!

VS Code issue

Not really familiar with Visual Studio Code. I got a lot of errors while executing Task #5 and couldn't start the workshop. I suggest adding extra details to these steps with screen shots.

Exercise 5, Task 3, Step 8 - Editing functions in the Azure portal is not supported for Linux Consumption Function Apps.

Exercise 5, Task 3, Step 8 says to select GetInvoiceData from the Functions listing in the Azure Portal. However, the portal does not list any functions on that screen and displays the following warning near the top of the page: "Editing functions in the Azure portal is not supported for Linux Consumption Function Apps". This means step 9 cannot be completed either (obtaining the function URL for later reference).

Is there an alternative method to obtain the function URL?

Install AzCopy as requirements

AzCopy is also needed to run .ps, would be helpful to add it in Requirments
with a reminder: add azcopy.exe to system path and relaunch PowerShell

Cannot insert the value NULL into column 'TransactionDateId', table. (Exercise 2, Task 2)

I am trying to execute this tutorial, to populate the sales table (Exercise 2, Task 2): https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI/blob/master/Hands-on%20lab/HOL%20step-by%20step%20-%20Azure%20Synapse%20Analytics%20and%20AI.md#task-2-populate-the-sale-table

I am getting the following error when trying to execute the pipeline, ASAMCW - Exercise 2 - Copy Sales Data:

{'message':'Job failed due to reason: at Sink 'sale': java.sql.BatchUpdateException: Cannot insert the value NULL into column 'TransactionDateId', table 'Distribution_27.dbo.Table_f7418eb0b0b94ec081fa7919842ff1f9_27'; column does not allow nulls. INSERT fails.\r\nThe statement has been terminated.. Details:at Sink 'sale': java.sql.BatchUpdateException: Cannot insert the value NULL into column 'TransactionDateId', table 'Distribution_27.dbo.Table_f7418eb0b0b94ec081fa7919842ff1f9_27'; column does not allow nulls. INSERT fails.\r\nThe statement has been terminated.','failureType':'UserError','target':'Data flow1','errorCode':'DFExecutorUserError'}

Invalid schema name

Exercise 5: Synapse Pipelines and Cognitive Search -> Task 4: Create the Synapse Pipeline -> 45

The table schema name should be: wwi_mcw (not dbo)

Additional enhancement request

Received via email:
Do we have the opportunity to add an example to use the Azure Machine Learning SDK in Spark? This is good for Synapse to integrate AML.

Thank you!

Mark Chen

Additional enhancements

James Serra requested:

Future steps in the lab that I would love to see:

Exercise building a PBI report
Exercise creating an on-demand database
Exercise creating a spark database
Exercise creating an Azure Synapse Link to CosmosDB

Exercise 7, Step 1, Task 9

The SQL Query returns this message: Timeout: The SQL query could not be performed within 30 seconds. The dataset service has a timeout limit of 30 seconds which is not affected by the query timeout value supplied. It is advised to skip validation when registering datasets for such cases. Once the dataset is registered the custom timeout value will become effective.

The same Query seems to run forever if you try it in the Synapse Workspace itself. Is there something wrong with the query? Is there an alternative method to populate the data?

Exercise 6 - Task 3: Making predictions with the registered models.

This is regarding the Exercise 6 - Task 3: Making predictions with the registered models. On Synapse we are not able to run the Predict function and its giving Model is not recognized. What could be the issue internal teams pointed out

If you running in Synapse, it needs to be whitelisted and the whitelisting is being opened across the regions. For other SQL environments, ONNX runtime is only supported in SQL Managed Instance and SQL Edge at this time. Please refer to this doc here.
https://docs.microsoft.com/en-us/sql/t-sql/queries/predict-transact-sql?view=sql-server-ver15

Request you to provide some input

Create another new SQL script and replace the contents with the following:

-- Retrieve the latest hex encoded ONNX model from the table
DECLARE @model varbinary(max) = (SELECT Model FROM [wwi_mcw].[ASAMCWMLModel] WHERE Id = (SELECT Top(1) max(ID) FROM [wwi_mcw].[ASAMCWMLModel]));

-- Run a prediction query
SELECT d., p.
FROM PREDICT(MODEL = @model, DATA = [wwi_mcw].[ProductPCA] AS d) WITH (prediction real) AS p;

###Error: Failed to execute query. Error: Parse error at line: 26, column: 14: Incorrect syntax near 'MODEL'.

Improve lab experience/Save time: Very Long load timings for Ex.2/Task 2 (load sales table)

This workshop has been working out for our customers very well, but wanted to suggest couple of enhancements to better help manage the lab timings & customer experience about the product.

The sales table load step Ex.2/Task 2 for ~667M rows takes up to 55-60 mins with 200 cDWU & 4+4 General purpose IR, if all goes well :).
We can probably enhance the lab by either:

Add a step to scale up to 1000 cDWU, and use 8+8 General Purpose IR will reduce to around 20-25 mins. And add a step to scale down AGAIN - in fact this will show customers practical scenarios (scale up/down) in real life, IMHO.

Reduce the number of rows to somewhere around 200M to allow the loads to be completed around 20 mins.

Task 2: Create the Azure Synapse Analytics workspace | Deployment Failure

Hello All:
In Before the HOL Task 2: Create the Azure Synapse Analytics workspace, the template runs into a failure due to the subscription needing to be registered with Microsoft. SQL as a resource provider. To give an example:

this may be good to add some sort of note here or outline steps for registering, before attempting the deployment.

Exercise 5: Additional steps to be specified

Exercise 5 Task 2 step 20
While running the file Hands-on lab/artifacts/pocformreader.py the following exception occurs for the line "quit()" and it can be ignored so we can add a note after step 20.
Exercise 5 Task 3 step 8
While deploying the function app the function GetInvoiceData is not deployed , there are additional steps to be performed in order to deploy the function as shown, and the steps are not mentioned in the guide
Azure function core tools is required for the lab but this isn't mentioned in the before hands on part

Before the HOL - Task 5 / Permission Denied

While running the Task 5 "Environment Setup", executing the command "./01-environment-setup.ps1" results in a "permission denied" error.

Issue with lab: AutoML Dataset creation times out/fails due to very large sales dataset (Ex.7/Step 9)

While creating a dataset for AutoML (Ex.7/Step 9), setting query timeout to 100 doesn't help. I had to change the timeout to 200 and update the SQL Query to add only 1 year to complete the step. Again, this could be a good practical use case where we can use data from 2019 to train, and use data from 2020 to predict, for example.

To fix this I usually ask students to add the WHERE condition, see below, and it works.

SELECT P.ProductId,P.Seasonality,S.TransactionDateId,COUNT(*) as TransactionItemsCount
FROM wwi_mcw.SaleSmall S
JOIN wwi_mcw.Product P ON S.ProductId = P.ProductId
where TransactionDateId between 01012019 and 12312019
GROUP BY P.ProductId ,P.Seasonality,S.TransactionDateId

100 DWUs tag for tracking cost

I believe as part of the environment setup a tag is created called "100 dwus" which references the dedicated SQL pool. The size of the dedicated SQL pool which is actually created is 500. Can someone please update the tag so that it does not lead to confusion? We ran this workshop with a customer and they were initially very concerned with the pricing being much higher than expected because of that discrepancy. Thanks!

Exercise 5, Task 3, Step 23 - Could not execute skill because Web Api skill response is invalid.

The cognitive search indexer is failing with the error: Could not execute skill because Web Api skill response is invalid. Web Api response has invalid content type 'text/plain'. Expected 'application/json'

Is there a place to specify the content type?

This is the full error:
This session was created from the following error:
Operation: Enrichment.WebApiSkill.#5
Message: Could not execute skill because Web Api skill response is invalid.
Details: Web Api response has invalid content type 'text/plain'. Expected 'application/json'
Document Key: localId=https%3a%2f%2fasastoreejw2020.blob.core.windows.net%2finvoices%2fTest%2fInvoice_6.pdf&documentKey=https%3a%2f%2fasastoreejw2020.blob.core.windows.net%2finvoices%2fTest%2fInvoice_6.pdf

Do you have any lab for the hot path in the architecture?

Do you have any lab for the hot path in the architecture? Can't seem to find it in this repository.

Exercise 2: Create and populate the supporting tables in the SQL Pool - Task 1: Create the sale table

There is no data in the colum 'Transaction ID', so the pipeline is not working as we create the table with NOT NULL for this field.
Is it normal ?
To be able to load the data I had to create the table with NULL for TransactionID

Issue in Exercise 3

In exercise 3, task 2, step 1, when selecting New Notebook getting it's asking to select from Load to DataFrame and New Spark table.

Exercise 3, task 2, step 6 got an error while running data_path.printSchema() command. It worked with df.printschema()

Issue:Exercise 5-Task3

Encountered msg below while publish function in VSCode.

Do I need to create a new project?

4 of services are published.

However, unable to fine the GetInvoiceData in Function. I assume that its because of the error above.

What shoud I do more?

Running individual SQL statement do not enforce Column level security

In Exercise 5: Security -> Task 1: Column level security -> Step 2

The script provided does not specify that in order to check column level security below 2 statements should be executed together:

EXECUTE AS USER ='DataAnalystMiami';
select TOP 100 * from wwi_mcw.CampaignAnalytics;

If I run just the select query by highlighting, I am able to see all the columns.

Excercise 6 - Task 2 Registering the models with Azure Synapse Analytics

Step 1: I couldn't locate https://github.com/microsoft/MCW-Azure-Synapse-Analytics-end-to-end-solution/raw/master/Hands-on lab/artifacts/convert-to-hex.ps1
I am not sure of instructions, Am i supposed to run the script? If yes How? I am assuming we are supposed to move to next step.
So when i run Step 4. I get below error instead of showing me the model
"Failed to execute query. Error: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: MalformedInputException: Input length = 1
Total execution time: 00:00:01.090"

Output from PySpark code not matching with that of SQL Serverless

In Exercise 3: Exploring raw parquet -> Task 2 -> Step 7

PySpark code needs changes to match output of SQL Serverless in the previous Task. For example in Serverless It is Avg Profit while in PySpark it is Avg Quantity

Updated Code here:

`from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

profitByDateProduct = (data_path.groupBy("TransactionDate", "ProductId")
.agg(
round(sum("ProfitAmount"),2).alias("(sum)Profit"),
round(avg("ProfitAmount"),2).alias("(avg)Profit"),
sum("Quantity").alias("(sum)Quantity")
).orderBy("TransactionDate", "ProductId")
)
profitByDateProduct.show(100)`

Pipeline fails in Task 6

Pipeline run fails when running Task 6 for Campaign Analytics. Error is Country is null and it fails unique constraint in DW. It looks like Dataflow is unable to correctly map undefined last two columns. When I remove skip row = 1, I see that schema is identified. But I understand whole exercise is not to enable debug.

Azure ML workspace is invalid

Exercise 7: Cell 41 (Train classifier using Auto ML)

Azure ML workspace name is invalid. This has to replace with the correct workspace name.

https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI/blob/master/Hands-on%20lab/environment-setup/automation/notebooks/ASAMCW%20-%20Exercise%207%20-%20Machine%20Learning.ipynb

Exercise 5 - Running indexer

Although the indexer was created successfully, it didn't find any projections in the PDF file like for example customer ID or product ID.

Could not generate projection from input '/document'. Check the 'source' or 'sourceContext' property of your projection in your skillset. =$(/document) ?map { "customerid": $(/document/formrecognizer/analyzeResult/readResults/0/lines/11/words/0/text), "filename": $(/document/metadata_storage_name), "productid": $(/document/formrecognizer/analyzeResult/readResults/0/lines/16/words/0/text), "productprice": $(/document/formrecognizer/analyzeResult/readResults/0/lines/17/words/0/text), "quantity": $(/document/formrecognizer/analyzeResult/readResults/0/lines/18/words/0/text), "totalcharges": $(/document/formrecognizer/analyzeResult/readResults/0/lines/19/words/0/text), "transactionid":

Lab Guide Updates

In Exercise 2, task 4(instruction 6) and task 6(instruction 7), While adding new integration dataset to synapse workspace, the dataset name has been changed from Azure Synapse Analytics (formerly SQL DW) to Azure Synapse Analytics. Both screenshot and instruction has to be updated.
The same thing has to be changed in Exercise 5, task 4(instruction 33), the resource looks like as shown in the below image.

In Exercise 2, task 2(instruction 14 & 17) and task 6(instruction 13), While creating data flows, For selecting source type and sink type in source settings there are 2 options for the datasets as shown in the below image. Both instruction and Screenshot has to be updated.

Purview Addition to HOL

cannot connect to sqlpool01 as deployed by scripts

I ran the pre HOL scripts and they completed successfully. The first create table fails as the schema does not exist. I created the schema manually. When I try to connect to sqlpool01 via the linked service that was created under the covers it fails with the following error message:
Cannot connect to SQL Database: 'asaworkspacelg68.sql.azuresynapse.net', Database: 'SQLPool01', User: 'asa.sql.admin'. Check the linked service configuration is correct, and make sure the SQL Database firewall allows the integration runtime to access.
Login failed for user 'asa.sql.admin'., SqlErrorNumber=18456,Class=14,State=1,
. Activity ID: 35f898aa-8ca2-4158-8dff-1397945cf8ce
This may happen if your data source only allows secured connections. If that's the case, please create a new workspace with virtual network option enabled.

I can explore sqlpool01 via the workspace under data and was able to create the table as above however I cannot create a linked service to the same SQL pool. Testing the linked service fails. Switching to password from AKV and typing in the password fails.

This workshop was scheduled to be delivered to a customer next week. One of their team has tried to run through it early and is also running into issues.

Documentation bug

Exercise 8: Monitoring -> Task 1: Workload importance

This should be "Exercise 8"

Task 2: Workload isolation

Task 3: Monitoring with Dynamic Management Views

Task 4: Orchestration Monitoring with the Monitor Hub

Task 5: Monitoring SQL Requests with the Monitor Hub

Exercise 6 - Task 3: Making predictions with the registered models.

I've tried to PREDICT task with 'White listed Pool'.

-- Retrieve the latest hex encoded ONNX model from the table
DECLARE @model varbinary(max) = (SELECT Model FROM [wwi_mcw].[ASAMCWMLModel] WHERE Id = (SELECT Top(1) max(ID) FROM [wwi_mcw].[ASAMCWMLModel]));

-- Run a prediction query
SELECT d.*, p.*
FROM PREDICT(MODEL = @model, DATA = [wwi_mcw].[ProductPCA] AS d) WITH (prediction real) AS p;

But this query was return with error below:

11:31:44

Started executing query at Line 1

Failed to execute query. Error: Message(s) from 'PREDICT' engine: The type of model column 'f00' is 'real'. Doesn't match the input column type.
Total execution time: 00:00:02.064

Have you any idea for resolve this issue?

Enhancement request

This lab is excellent resource to onboard internal and external Customers on Synapse Analytics. I have received some feedback through several conversations with Customers and it would be great to include exercises on below topics:

Delta Lake operations through Spark and SQL Serverless
Hive database operations through Spark and SQL Serverless
Using Spark/ SQL Serverless to populate Synapse tables
End to end security using Azure AD Managed Service Identity/Service Principal based auth. This is the top ask from Customers as no one wants to hard code SQL credentials. How do we integrate Synapse, Serverless, Spark, ADLS Gen2 and Pipelines without hard coding credentials anywhere.

No "Dataset" option in the data blade

There is no "Dataset" option in the data blade

Lab guide:

Invalid URL in the postman template file

file: https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI/blob/master/Hands-on%20lab/environment-setup/skillset/InvoiceKnowledgeStore.postman_collection.json

There is an extra "1" in url:raw

October 2021 suggestions

Empty SQL pool tables

My env set up went successfully but i dont see any data in tables -

[wwi_mcw].[ASAMCWMLModel]
[wwi_mcw].[Product]

I got only below error while setting env -

_Start the SQLPool01 SQL pool if needed.
Setup SQLPool01
Invoke-Sqlcmd: /home/arvind/Synapse-MCW1/Hands-on lab/environment-setup/automation/environment-automation/environment-automation.psm1:979
Line |
979 | … $result = Invoke-SqlCmd -Query $SQLQuery -ConnectionString $sqlConn …
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| The server principal 'asa.sql.workload01' already exists. The server principal 'asa.sql.workload02' already exists. Msg 15025, Level 16, State 1, Procedure , Line 1.

Invoke-Sqlcmd: /home/arvind/Synapse-MCW1/Hands-on lab/environment-setup/automation/environment-automation/environment-automation.psm1:979
Line |
979 | … $result = Invoke-SqlCmd -Query $SQLQuery -ConnectionString $sqlConn …
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| There is already a master key in the database. Please drop it before performing this statement. Msg 15578, Level 16, State 1, Procedure , Line 1._

Unable to import schema

When Completing Task 2 - Step 6 of the HOL. I am unable to import the schema for files within wwi-02/sale-small directory when using "From connection/store".

"An error occurred when invoking java, message: java.lang.OutOfMemoryError:Unable to retrieve Java exception. total entry:31 sun.misc.Unsafe.allocateMemory(Native Method) java.nio.DirectByteBuffer.(DirectByteBuffer.java:127) java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) org.apache.parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:102) org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:46) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readFully(DataInputStream.java:169) org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:204) org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:591) org.apache.parquet.column.impl.ColumnReaderImpl.access$300(ColumnReaderImpl.java:60) org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:540) org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:537) org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96) org.apache.parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:537) org.apache.parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:529) org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:641) org.apache.parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:357) org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:82) org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:77) org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:270) org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135) org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101) org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101) org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140) org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214) org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125) org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129) com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.(ParquetBatchReaderBridge.java:68) com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:63) com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22) . . Activity ID: ff763d59-580a-4d9c-bcab-47bb69b3a999"

When attempting to use "From sample file" (sale-small-20190101-snappy.parquet) I recieve the following error:

"Schema import failed: File size exceeded the limit: 2048 Kb."

As a workaround I have chosen "none".

Running Powershell Script in Lab Setup

In the lab setup – we are required to run a powershell script – to set up the storage account with the needed data to go through the exercises. However, it does not seem to work the way intended. I see the containers and such created but I don’t see any data. Is this something you can help with ?

https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI/blob/master/Hands-on%20lab/Before%20the%20HOL%20-%20Azure%20Synapse%20Analytics%20and%20AI.md#task-5-run-environment-setup-powershell-script .

Azure Synapse SQL pool : SKU

Do we really need DW500c (sqlServerSKU) for this workshop?

https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI/blob/master/Hands-on%20lab/environment-setup/automation/00-asa-workspace-core.json

Pricing:

Exercise 6: UniqueIdentifier types are not supported in external tables

Getting "UniqueIdentifier types are not supported in external tables" while creating "Sales" Temp View using Data Frame in the ML notebook.

Workshop Hands-on Lab - SME review and feedback

Please leave your lab review comments and feedback here.

HOL Exercise 7 Task 4 Step 2, PCAData not mentioned before

In HOL Exercise 4: Leverage Automated ML to create and deploy a Product Seasonality Classifier model. Step2,

"In the previous task, we registered our PCA dataframe (named pcadata) to use with Auto ML. Select pcadata from the list and select Next."

Did not see any mention that pcadata is registered.

Error when running 01-environment-setup.ps1

I have followed the environment setup successfully.

But when running the PowerShell script "01-environment-setup.ps1" from Cloud Shell, I get the following errors.. Can you please help? I tried deleting everything and doing the setup again, but still did not work.

Error below:

CLIInternalError: The command failed with an unexpected error. Here is the traceback:
'Response' object has no attribute 'status'
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/adal_authentication.py", line 103, in set_token
super(MSIAuthenticationWrapper, self).set_token()
File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_active_directory.py", line 598, in set_token
self.scheme, _, self.token = get_msi_token(self.resource, self.port, self.msi_conf)
File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_active_directory.py", line 486, in get_msi_token
result.raise_for_status()
File "/opt/az/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:50342/oauth2/token

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/knack/cli.py", line 215, in invoke
cmd_result = self.invocation.execute(args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 654, in execute
raise ex
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 718, in _run_jobs_serially
results.append(self._run_job(expanded_arg, cmd_copy))
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 711, in _run_job
six.reraise(*sys.exc_info())
File "/opt/az/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 688, in _run_job
result = cmd_copy(params)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 325, in call
return self.handler(*args, **kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/init.py", line 784, in default_command_handler
return op(**command_args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/profile/custom.py", line 75, in get_access_token
creds, subscription, tenant = profile.get_raw_token(subscription=subscription, resource=resource, tenant=tenant)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/_profile.py", line 644, in get_raw_token
creds = self._get_token_from_cloud_shell(resource)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/_profile.py", line 392, in _get_token_from_cloud_shell
auth = MSIAuthenticationWrapper(resource=resource)
File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_active_directory.py", line 592, in init
self.set_token()
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/adal_authentication.py", line 114, in set_token
.format(err.response.status, err.response.reason))
AttributeError: 'Response' object has no attribute 'status'
To open an issue, please run: 'az feedback'
CLIInternalError: The command failed with an unexpected error. Here is the traceback:
'Response' object has no attribute 'status'
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/adal_authentication.py", line 103, in set_token
super(MSIAuthenticationWrapper, self).set_token()
File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_active_directory.py", line 598, in set_token
self.scheme, _, self.token = get_msi_token(self.resource, self.port, self.msi_conf)
File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_active_directory.py", line 486, in get_msi_token
result.raise_for_status()
File "/opt/az/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:50342/oauth2/token

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/knack/cli.py", line 215, in invoke
cmd_result = self.invocation.execute(args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 654, in execute
raise ex
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 718, in _run_jobs_serially
results.append(self._run_job(expanded_arg, cmd_copy))
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 711, in _run_job
six.reraise(*sys.exc_info())
File "/opt/az/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 688, in _run_job
result = cmd_copy(params)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/init.py", line 325, in call
return self.handler(*args, **kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/init.py", line 784, in default_command_handler
return op(**command_args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/profile/custom.py", line 75, in get_access_token
creds, subscription, tenant = profile.get_raw_token(subscription=subscription, resource=resource, tenant=tenant)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/_profile.py", line 644, in get_raw_token
creds = self._get_token_from_cloud_shell(resource)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/_profile.py", line 392, in _get_token_from_cloud_shell
auth = MSIAuthenticationWrapper(resource=resource)
File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_active_directory.py", line 592, in init
self.set_token()
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/adal_authentication.py", line 114, in set_token
.format(err.response.status, err.response.reason))
AttributeError: 'Response' object has no attribute 'status'
To open an issue, please run: 'az feedback'
Invoke-RestMethod: /home/david/Synapse-MCW/Hands-on lab/environment-setup/automation/environment-automation/environment-automation.psm1:1446
Line |
1446 | … $result = Invoke-RestMethod -Uri $uri -Method $method -Body $body …
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| {"code":"BearerTokenNotFound","message":"BearerToken not found in request"}

Invoke-RestMethod: /home/david/Synapse-MCW/Hands-on lab/environment-setup/automation/environment-automation/environment-automation.psm1:1446
Line |
1446 | … $result = Invoke-RestMethod -Uri $uri -Method $method -Body $body …
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| {"code":"BearerTokenNotFound","message":"BearerToken not found in request"}

Error during environment setup

Hi,

I'm having this error during environment setup at this stage: https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI/blob/master/Hands-on%20lab/environment-setup/automation/environment-automation/environment-automation.psm1#L537

Seems its due to API endpoint issue (the version is hardcoded on line 535)?

Indexer: Could not execute skill because Web Api skill response is invalid (exercise 5, task 3.22)

Previous steps have been successful and I get status 201 for all 4 API calls in the collection "Create a KnowledgeStore".
Still running into this error for the indexer.

How do I get to the raw data? I am doing this workshop in my Synapse workspace in my subscription?

Exercise 6 Task1 - Library mismatch for AutoML

We have a customer report that this fails due to an error in the dependencies on the packages, was anything done for the lab to get the right versions of the packages?

OptionalDependencyMissingException : OptionalDependencyMissingException: Message: Incompatible/Missing packages found:
azureml-train-automl-runtime requires numpy<=1.16.2,>=1.16.0 but has numpy 1.18.5.;
azureml-train-automl-runtime requires onnxruntime==1.0.0 but has onnxruntime 0.4.0.;
azureml-defaults requires werkzeug==0.16.1 but has Werkzeug 0.16.0.
;azureml-automl-runtime requires nimbusml>=1.7.1 but has nimbusml 1.5.0.;
azureml-automl-runtime requires numpy<=1.16.2,>=1.16.0 but has numpy 1.18.5.;
azureml-automl-runtime requires onnxruntime==1.0.0 but has onnxruntime 0.4.0.;
azureml-core requires ruamel.yaml>0.16.7 but has ruamel.yaml 0.15.89.

Issue on Exercise 3 Task2

Hi,

Thank you for the HOL, it was great.
Below is the error when I try to load the parquet to dataset using Spark

NameError: name 'display' is not defined
Traceback (most recent call last):

NameError: name 'display' is not defined

microsoft / mcw-azure-synapse-analytics-and-ai Goto Github PK