AWS Glue and Athena Lab

Overview

This repository contains the steps to perform ETL on a dataset using AWS Glue and querying the data using Athena.

Task 1: Using an AWS Glue crawler with the GHCN-D dataset

Configure and create the AWS Glue crawler

Open the AWS Management Console and navigate to AWS Glue.
Under Databases, select Tables.
Click on Add tables using crawler.
Name the crawler "Weather" and keep the default settings for Tags.
Choose Add a data source and configure the following:
- Data source: S3
- Location of S3 data: In a different account
- S3 path: s3://noaa-ghcn-pds/csv/by_year/
- Subsequent crawler runs: Crawl all sub-folders
Choose Add an S3 data source.
For Existing IAM role, select "gluelab".
In the Output configuration section, choose Add database, name it "weatherdata", and create the database.
Select the weatherdata database.
In the Crawler schedule section, choose Frequency: On demand.
Review the crawler configuration and choose Create crawler.
Wait for the crawler status to change to Ready.

Run the crawler

On the Crawlers page, select the Weather crawler.
Choose Run and wait for the status to change to Ready.

Review the metadata

In the navigation pane, select Databases.
Choose the weatherdata database.
Click on the by_year table and review the metadata.

Edit the schema

From the Actions menu, choose Edit schema.
Change the column names according to the provided table.
Choose Update schema.

Task 2: Querying a table by using Athena

Configure an S3 bucket for Athena query results

In the AWS Glue console, navigate to Tables under Databases.
Click on the by_year table and choose Actions > View data.
Proceed to the Athena console and select Settings.
Manage the Location of query result and choose the data-science-bucket.
Save the settings.

Preview a table in Athena

Switch to the Athena Editor tab.
Set Data source to AwsDataCatalog and Database to weatherdata.
Choose the by_year table and select Preview Table to view the data.

Create a table for data after 1950

Retrieve the bucket name containing glue-1950-bucket.
In the Athena query editor, execute the provided query replacing with the bucket name.
Run the query and preview the results.

Run a query on the new table

Create a view for the maximum temperature reading.
Run a query to calculate the average maximum temperature for each year.

Task 3: Creating a CloudFormation template for an AWS Glue crawler

Find the IAM role ARN

Navigate to the IAM console and select Roles.
Locate and copy the ARN for the gluelab role.

Create a CloudFormation template

Open the AWS Cloud9 IDE and create a new file named gluecrawler.cf.yml.
Copy and paste the provided CloudFormation template into the file.
Replace with the copied IAM role ARN.
Save the file and validate the template using the AWS CLI.

Deploy the CloudFormation stack

Run the command to create the CloudFormation stack using the template.
Verify the stack creation and check the resources created.

Conclusion

By following these steps, you can perform ETL tasks with AWS Glue, query the data with Athena, and automate the creation of Glue crawlers using CloudFormation.

mouhamed-jinja / lab-7-etl-on-a-dataset-by-using-aws-glue Goto Github PK

lab-7-etl-on-a-dataset-by-using-aws-glue's Introduction