This repository contains the steps to perform ETL on a dataset using AWS Glue and querying the data using Athena.
- Open the AWS Management Console and navigate to AWS Glue.
- Under Databases, select Tables.
- Click on Add tables using crawler.
- Name the crawler "Weather" and keep the default settings for Tags.
- Choose Add a data source and configure the following:
- Data source: S3
- Location of S3 data: In a different account
- S3 path: s3://noaa-ghcn-pds/csv/by_year/
- Subsequent crawler runs: Crawl all sub-folders
- Choose Add an S3 data source.
- For Existing IAM role, select "gluelab".
- In the Output configuration section, choose Add database, name it "weatherdata", and create the database.
- Select the weatherdata database.
- In the Crawler schedule section, choose Frequency: On demand.
- Review the crawler configuration and choose Create crawler.
- Wait for the crawler status to change to Ready.
- On the Crawlers page, select the Weather crawler.
- Choose Run and wait for the status to change to Ready.
- In the navigation pane, select Databases.
- Choose the weatherdata database.
- Click on the by_year table and review the metadata.
- From the Actions menu, choose Edit schema.
- Change the column names according to the provided table.
- Choose Update schema.
- In the AWS Glue console, navigate to Tables under Databases.
- Click on the by_year table and choose Actions > View data.
- Proceed to the Athena console and select Settings.
- Manage the Location of query result and choose the data-science-bucket.
- Save the settings.
- Switch to the Athena Editor tab.
- Set Data source to AwsDataCatalog and Database to weatherdata.
- Choose the by_year table and select Preview Table to view the data.
- Retrieve the bucket name containing glue-1950-bucket.
- In the Athena query editor, execute the provided query replacing with the bucket name.
- Run the query and preview the results.
- Create a view for the maximum temperature reading.
- Run a query to calculate the average maximum temperature for each year.
- Navigate to the IAM console and select Roles.
- Locate and copy the ARN for the gluelab role.
- Open the AWS Cloud9 IDE and create a new file named gluecrawler.cf.yml.
- Copy and paste the provided CloudFormation template into the file.
- Replace with the copied IAM role ARN.
- Save the file and validate the template using the AWS CLI.
- Run the command to create the CloudFormation stack using the template.
- Verify the stack creation and check the resources created.
By following these steps, you can perform ETL tasks with AWS Glue, query the data with Athena, and automate the creation of Glue crawlers using CloudFormation.