- A good understanding of AWS services: S3, Redshift, Glue, Step Functions, and EventBridge.
- A good understanding of Python and SQL.
- Knowledge of AWS security best practices, including IAM (Identity and Access Management) roles and policies.
This project uses AWS services to process large amounts of daily flight data and load it into an Amazon Redshift data warehouse, which end users can then query for comprehensive analysis and reporting.
- The pipeline starts with an S3 bucket into which the daily flight data files are ingested.
- An EventBridge rule monitors the S3 bucket for new data arrivals and triggers the Step Function when a new file is detected.
- Step Functions orchestrates the subsequent steps in the pipeline, providing serverless workflow management.
- A Glue crawler crawls the data in S3, extracts metadata, and infers the schema, preparing the data for further processing.
- An Apache Spark Glue ETL job then performs data transformation, cleansing, and preparation on the flight data so that it is in the desired format for loading into the data warehouse.
- Based on the outcome of the Glue job, an email notification is published to the appropriate SNS topic (success or failure), informing stakeholders of the status of the data ingestion process.
- Creating S3 Bucket
- Creating Folders for Data Storage
- Viewing Data Inside Dimension Folder
- Starting the Redshift Cluster
- Creating Redshift Connection in AWS Glue
- Successful Redshift Connection
- Creating Glue Metadata Database
- Creating Tables in Redshift
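
The console steps above can also be scripted. The following boto3 sketch creates the bucket, the "folder" prefixes, and a fact table in Redshift through the Data API; the bucket name, cluster identifier, database user, and column list are illustrative assumptions, not the project's actual values.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "flights-data-pipeline-bucket"  # hypothetical bucket name
s3.create_bucket(Bucket=BUCKET)

# S3 has no real folders; zero-byte keys ending in "/" show up as
# folders in the console.
for prefix in ("daily_flights/", "dimension/"):
    s3.put_object(Bucket=BUCKET, Key=prefix)

# Create the fact table through the Redshift Data API, so no JDBC
# driver is needed locally. Cluster, database, user, and columns are
# assumptions for illustration.
rsd = boto3.client("redshift-data", region_name="us-east-1")
rsd.execute_statement(
    ClusterIdentifier="flights-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE TABLE IF NOT EXISTS flights_fact (
            carrier     VARCHAR(10),
            origin_city VARCHAR(100),
            dest_city   VARCHAR(100),
            dep_delay   INTEGER,
            arr_delay   INTEGER
        );
    """,
)
```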
- Glue Crawler for Redshift Metadata
- Glue Crawler for the Flights Fact Table
- Glue Crawler for Daily Flight Data
- Sample Data in the S3 Bucket
- Successful Crawler Run
- Metadata Table Created Successfully
- Crawlers Picked Up Metadata from Both the Redshift Tables and the S3 Data Files
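
For reference, here is roughly how two of these crawlers could be defined with boto3: one over the daily CSVs in S3 and one over the Redshift tables through the Glue connection. The crawler names, IAM role, catalog database, and JDBC include path are all assumptions.

```python
import boto3

glue = boto3.client("glue")

# Crawler over the raw daily CSVs in S3.
glue.create_crawler(
    Name="daily-flights-crawler",   # hypothetical name
    Role="GlueCrawlerRole",         # hypothetical IAM role
    DatabaseName="flights_db",      # the Glue metadata database
    Targets={"S3Targets": [
        {"Path": "s3://flights-data-pipeline-bucket/daily_flights/"},
    ]},
)

# Crawler over the Redshift tables, via the Glue connection created earlier.
glue.create_crawler(
    Name="redshift-tables-crawler",
    Role="GlueCrawlerRole",
    DatabaseName="flights_db",
    Targets={"JdbcTargets": [{
        "ConnectionName": "redshift-connection",
        "Path": "dev/public/%",  # database/schema/table include pattern
    }]},
)

glue.start_crawler(Name="daily-flights-crawler")
```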
- Filter to Show Only Flights Delayed by 60 Minutes or More at Departure
- Adding the Airports Dimension to the Data Catalog
- Join Operation on the Flight Data
- Schema Adjustment After the Join
- Second Join Operation for Destination Details
- Adjusting the Schema to Match the Destination Table
- Final Redshift Target Table
- Overview of the Complete Glue ETL Pipeline
- Enabling Job Bookmarking for Incremental Ingestion
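
The Glue Studio steps above (the delay filter, the two joins against the airports dimension, the schema adjustments, the Redshift sink, and job bookmarking) boil down to a script like the following. This is a minimal PySpark sketch, not the generated job itself; the database, table, and column names (flights_db, daily_flights, airports_dim, dep_delay, and so on) are assumptions.

```python
import sys
from awsglue.transforms import Filter, Join, ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required for job bookmarks

# Raw daily flight data crawled from S3 (names are assumptions).
flights = glue_context.create_dynamic_frame.from_catalog(
    database="flights_db",
    table_name="daily_flights",
    transformation_ctx="flights_src",  # bookmark key for incremental runs
)

# Keep only flights delayed by 60 minutes or more at departure.
delayed = Filter.apply(
    frame=flights,
    f=lambda row: row["dep_delay"] is not None and row["dep_delay"] >= 60,
)

# Airports dimension, also registered in the Data Catalog.
airports = glue_context.create_dynamic_frame.from_catalog(
    database="flights_db", table_name="airports_dim"
)

# First join: enrich with origin airport details, then trim the schema.
joined_origin = Join.apply(delayed, airports, "origin_airport_id", "airport_id")
origin = ApplyMapping.apply(
    frame=joined_origin,
    mappings=[
        ("carrier", "string", "carrier", "string"),
        ("dep_delay", "int", "dep_delay", "int"),
        ("city", "string", "origin_city", "string"),
        ("dest_airport_id", "int", "dest_airport_id", "int"),
    ],
)

# Second join: enrich with destination details and match the target table.
joined_dest = Join.apply(origin, airports, "dest_airport_id", "airport_id")
final = ApplyMapping.apply(
    frame=joined_dest,
    mappings=[
        ("carrier", "string", "carrier", "string"),
        ("dep_delay", "int", "dep_delay", "int"),
        ("origin_city", "string", "origin_city", "string"),
        ("city", "string", "dest_city", "string"),
    ],
)

# Load into Redshift via the catalog entry the crawler created.
glue_context.write_dynamic_frame.from_catalog(
    frame=final,
    database="flights_db",
    table_name="dev_public_flights_fact",
    redshift_tmp_dir="s3://flights-data-pipeline-bucket/temp/",
    transformation_ctx="redshift_sink",
)

job.commit()  # advances the bookmark so the next run only sees new files
```

Job bookmarking works here because each source and sink carries a transformation_ctx and the run is wrapped in job.init / job.commit; on the next run, Glue skips files it has already processed.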
- Enabling EventBridge Notifications for S3 Bucket
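
Turning on EventBridge delivery for the bucket is a one-call change:

```python
import boto3

s3 = boto3.client("s3")

# After this call, S3 sends all object-level events for the bucket
# (including uploads) to the default EventBridge event bus.
s3.put_bucket_notification_configuration(
    Bucket="flights-data-pipeline-bucket",  # hypothetical bucket name
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```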
- Adding the Crawler to the Step Function
- Checking the Crawler State
- Conditional Check on the Crawler State
- Waiting and Re-checking the Crawler Status
- Successful Glue Crawler Run
- Handling Task Failure Notifications
- Sending Success Notifications
- Sending Failure Notifications
- Overall Step Function Diagram
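
The crawler loop in the state machine (check the state, Choice, Wait, re-check) and the success and failure notifications are easiest to see as plain code. The boto3 sketch below mirrors that same logic outside Step Functions; the crawler, job, and topic names are assumptions.

```python
import time
import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

# Hypothetical resource names.
CRAWLER, JOB = "daily-flights-crawler", "flights-etl-job"
TOPIC = "arn:aws:sns:us-east-1:123456789012:flights-pipeline-status"

def run_pipeline():
    glue.start_crawler(Name=CRAWLER)

    # Choice + Wait loop: re-check until the crawler is back to READY.
    while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
        time.sleep(30)

    run_id = glue.start_job_run(JobName=JOB)["JobRunId"]
    while True:
        state = glue.get_job_run(JobName=JOB, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"):
            break
        time.sleep(30)

    # Publish to the success or failure topic depending on the outcome.
    subject = "Flight pipeline " + ("succeeded" if state == "SUCCEEDED" else "failed")
    sns.publish(TopicArn=TOPIC, Subject=subject, Message=f"Glue job finished: {state}")
```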
- Creating an EventBridge Rule
- Building the EventBridge Event Pattern
- The Event Pattern Built by EventBridge
- Editing the Event Pattern to Match CSV Files
- Adding a Target to the EventBridge Rule to Trigger AWS Step Functions
- Creating the EventBridge Rule for S3 Notifications
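
Expressed with boto3, the rule and its target look roughly like this; the rule name, bucket name, and both ARNs are placeholders. The target role must allow events.amazonaws.com to call states:StartExecution.

```python
import json
import boto3

events = boto3.client("events")

# Match only "Object Created" events for .csv files in the pipeline bucket.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["flights-data-pipeline-bucket"]},
        "object": {"key": [{"suffix": ".csv"}]},
    },
}

events.put_rule(Name="daily-flights-upload", EventPattern=json.dumps(pattern))

# Point the rule at the Step Functions state machine.
events.put_targets(
    Rule="daily-flights-upload",
    Targets=[{
        "Id": "flights-step-function",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:flights-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```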
- Uploading a File to the S3 Bucket
- Step Function Triggered
- Graph View of the Step Function Showing the Glue Crawler Step Running
- Glue Crawler Running
- Glue Job Started
- Step Function Execution Successful
- Step Function Success Notification
- Data Successfully Ingested into Redshift
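
A quick end-to-end smoke test, assuming the hypothetical names used throughout: upload one daily file and confirm that a new execution starts.

```python
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Upload a sample daily file; the EventBridge rule should fire and start
# a new state machine execution within a few seconds.
s3.upload_file("flights_2024_01_15.csv",  # hypothetical local file
               "flights-data-pipeline-bucket",
               "daily_flights/flights_2024_01_15.csv")

# Executions are returned most recent first.
executions = sfn.list_executions(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:flights-pipeline",
    maxResults=1,
)["executions"]
print(executions[0]["status"])  # RUNNING while the pipeline executes
```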
- Analytics on Delayed Flights
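
As an example of the analysis the warehouse now supports, the query below ranks origin cities by average departure delay, run through the Redshift Data API. Table and column names follow the assumptions used earlier.

```python
import boto3

rsd = boto3.client("redshift-data")

# Worst average departure delays by origin city (names are assumptions).
resp = rsd.execute_statement(
    ClusterIdentifier="flights-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        SELECT origin_city,
               COUNT(*)       AS delayed_flights,
               AVG(dep_delay) AS avg_delay_minutes
        FROM flights_fact
        GROUP BY origin_city
        ORDER BY avg_delay_minutes DESC
        LIMIT 10;
    """,
)
print(resp["Id"])  # poll get_statement_result with this Id for the rows
```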
- AWS SageMaker Integration: Expand the pipeline's analytics capabilities by integrating AWS SageMaker. This would allow machine learning models to run directly within the pipeline, enabling predictive analytics and more complex analysis tasks, for example predicting flight delays from historical data and external factors such as weather conditions or airport traffic.
- Pipeline Performance Tuning: Regularly review the performance of the AWS Glue jobs and Redshift queries, adjusting configurations such as DPU allocation in Glue and query optimization in Redshift to improve performance and reduce execution times. A sketch of the Glue side of this tuning follows below.
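
Glue job capacity can be adjusted programmatically. One caveat: UpdateJob replaces the whole job definition, so any field omitted from JobUpdate is reset to its default; fetch the current definition first and carry over what you want to keep. A hedged sketch with a hypothetical job name:

```python
import boto3

glue = boto3.client("glue")

# Fetch the current definition so required settings can be carried over.
job = glue.get_job(JobName="flights-etl-job")["Job"]  # hypothetical name

glue.update_job(
    JobName="flights-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": job.get("GlueVersion", "4.0"),
        "DefaultArguments": job.get("DefaultArguments", {}),
        # Scale the job: more (or larger) workers means more DPUs per run.
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
    },
)
```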