- A good understanding of AWS services: S3, Redshift, Glue, Step Functions, and EventBridge.
- A good understanding of Python and SQL.
- Knowledge of AWS security best practices, including IAM (Identity and Access Management) roles and policies.
This project uses AWS services to process large amounts of daily flight data and load it into an Amazon Redshift data warehouse, which end users can then query for comprehensive analysis and reporting.
- The pipeline starts with an S3 bucket into which the daily flight data files are ingested.
- An EventBridge rule monitors the S3 bucket for new data arrivals and triggers the Step Function when a new file is detected.
- Step Functions orchestrates the subsequent steps in the pipeline, providing serverless workflow management.
- A Glue crawler crawls the data in S3, extracts metadata, and infers the schema, preparing the data for further processing.
- An Apache Spark Glue ETL job then performs data transformation, cleansing, and preparation on the flight data so that it is in the desired format for loading into the data warehouse.
- Based on the outcome of the Glue job, an email notification is published to the appropriate SNS topic (success or failure), informing stakeholders of the status of the data ingestion process.
- Creating S3 Bucket
- Creating Folders for Data Storage
- Viewing Data Inside Dimension Folder
- Starting the Redshift Cluster
- Creating Redshift Connection in AWS Glue
- Successful Redshift Connection
- Creating Glue Metadata Database
- Creating Tables in Redshift
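
The console steps above can also be scripted. The following boto3 sketch creates the bucket, the "folder" prefixes, and a fact table in Redshift through the Data API; the bucket name, cluster identifier, database user, and column list are illustrative assumptions, not the project's actual values.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "flights-data-pipeline-bucket"  # hypothetical bucket name
s3.create_bucket(Bucket=BUCKET)

# S3 has no real folders; zero-byte keys ending in "/" show up as
# folders in the console.
for prefix in ("daily_flights/", "dimension/"):
    s3.put_object(Bucket=BUCKET, Key=prefix)

# Create the fact table through the Redshift Data API, so no JDBC
# driver is needed locally. Cluster, database, user, and columns are
# assumptions for illustration.
rsd = boto3.client("redshift-data", region_name="us-east-1")
rsd.execute_statement(
    ClusterIdentifier="flights-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE TABLE IF NOT EXISTS flights_fact (
            carrier     VARCHAR(10),
            origin_city VARCHAR(100),
            dest_city   VARCHAR(100),
            dep_delay   INTEGER,
            arr_delay   INTEGER
        );
    """,
)
```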
- Glue Crawler for Redshift Metadata
- Glue Crawler for the Flights Fact Table
- Glue Crawler for Daily Flight Data
- Sample Data in the S3 Bucket
- Successful Crawler Run
- Metadata Table Created Successfully
- Crawlers Picked Up Metadata from Both the Redshift Tables and the S3 Data Files
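
For reference, here is roughly how two of these crawlers could be defined with boto3: one over the daily CSVs in S3 and one over the Redshift tables through the Glue connection. The crawler names, IAM role, catalog database, and JDBC include path are all assumptions.

```python
import boto3

glue = boto3.client("glue")

# Crawler over the raw daily CSVs in S3.
glue.create_crawler(
    Name="daily-flights-crawler",   # hypothetical name
    Role="GlueCrawlerRole",         # hypothetical IAM role
    DatabaseName="flights_db",      # the Glue metadata database
    Targets={"S3Targets": [
        {"Path": "s3://flights-data-pipeline-bucket/daily_flights/"},
    ]},
)

# Crawler over the Redshift tables, via the Glue connection created earlier.
glue.create_crawler(
    Name="redshift-tables-crawler",
    Role="GlueCrawlerRole",
    DatabaseName="flights_db",
    Targets={"JdbcTargets": [{
        "ConnectionName": "redshift-connection",
        "Path": "dev/public/%",  # database/schema/table include pattern
    }]},
)

glue.start_crawler(Name="daily-flights-crawler")
```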
- Filter to Show Only Flights Delayed by 60 Minutes or More at Departure
- Adding the Airports Dimension to the Data Catalog
- Join Operation on the Flight Data
- Schema Adjustment After the Join
- Second Join Operation for Destination Details
- Adjusting the Schema to Match the Destination Table
- Final Redshift Target Table
- Overview of the Complete Glue ETL Pipeline
- Enabling Job Bookmarking for Incremental Ingestion
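
The Glue Studio steps above (the delay filter, the two joins against the airports dimension, the schema adjustments, the Redshift sink, and job bookmarking) boil down to a script like the following. This is a minimal PySpark sketch, not the generated job itself; the database, table, and column names (flights_db, daily_flights, airports_dim, dep_delay, and so on) are assumptions.

```python
import sys
from awsglue.transforms import Filter, Join, ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required for job bookmarks

# Raw daily flight data crawled from S3 (names are assumptions).
flights = glue_context.create_dynamic_frame.from_catalog(
    database="flights_db",
    table_name="daily_flights",
    transformation_ctx="flights_src",  # bookmark key for incremental runs
)

# Keep only flights delayed by 60 minutes or more at departure.
delayed = Filter.apply(
    frame=flights,
    f=lambda row: row["dep_delay"] is not None and row["dep_delay"] >= 60,
)

# Airports dimension, also registered in the Data Catalog.
airports = glue_context.create_dynamic_frame.from_catalog(
    database="flights_db", table_name="airports_dim"
)

# First join: enrich with origin airport details, then trim the schema.
joined_origin = Join.apply(delayed, airports, "origin_airport_id", "airport_id")
origin = ApplyMapping.apply(
    frame=joined_origin,
    mappings=[
        ("carrier", "string", "carrier", "string"),
        ("dep_delay", "int", "dep_delay", "int"),
        ("city", "string", "origin_city", "string"),
        ("dest_airport_id", "int", "dest_airport_id", "int"),
    ],
)

# Second join: enrich with destination details and match the target table.
joined_dest = Join.apply(origin, airports, "dest_airport_id", "airport_id")
final = ApplyMapping.apply(
    frame=joined_dest,
    mappings=[
        ("carrier", "string", "carrier", "string"),
        ("dep_delay", "int", "dep_delay", "int"),
        ("origin_city", "string", "origin_city", "string"),
        ("city", "string", "dest_city", "string"),
    ],
)

# Load into Redshift via the catalog entry the crawler created.
glue_context.write_dynamic_frame.from_catalog(
    frame=final,
    database="flights_db",
    table_name="dev_public_flights_fact",
    redshift_tmp_dir="s3://flights-data-pipeline-bucket/temp/",
    transformation_ctx="redshift_sink",
)

job.commit()  # advances the bookmark so the next run only sees new files
```

Job bookmarking works here because each source and sink carries a transformation_ctx and the run is wrapped in job.init / job.commit; on the next run, Glue skips files it has already processed.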
- Enabling EventBridge Notifications for S3 Bucket
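
Turning on EventBridge delivery for the bucket is a one-call change:

```python
import boto3

s3 = boto3.client("s3")

# After this call, S3 sends all object-level events for the bucket
# (including uploads) to the default EventBridge event bus.
s3.put_bucket_notification_configuration(
    Bucket="flights-data-pipeline-bucket",  # hypothetical bucket name
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```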
- Adding the Crawler to the Step Function
- Checking the Crawler State
- Conditional Check on the Crawler State
- Waiting and Re-checking the Crawler Status
- Successful Glue Crawler Run
- Handling Task Failure Notifications
- Sending Success Notifications
- Sending Failure Notifications
- Overall Step Function Diagram
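
The crawler loop in the state machine (check the state, Choice, Wait, re-check) and the success and failure notifications are easiest to see as plain code. The boto3 sketch below mirrors that same logic outside Step Functions; the crawler, job, and topic names are assumptions.

```python
import time
import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

# Hypothetical resource names.
CRAWLER, JOB = "daily-flights-crawler", "flights-etl-job"
TOPIC = "arn:aws:sns:us-east-1:123456789012:flights-pipeline-status"

def run_pipeline():
    glue.start_crawler(Name=CRAWLER)

    # Choice + Wait loop: re-check until the crawler is back to READY.
    while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
        time.sleep(30)

    run_id = glue.start_job_run(JobName=JOB)["JobRunId"]
    while True:
        state = glue.get_job_run(JobName=JOB, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"):
            break
        time.sleep(30)

    # Publish to the success or failure topic depending on the outcome.
    subject = "Flight pipeline " + ("succeeded" if state == "SUCCEEDED" else "failed")
    sns.publish(TopicArn=TOPIC, Subject=subject, Message=f"Glue job finished: {state}")
```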
- Creating an EventBridge Rule
- Building the EventBridge Event Pattern
- The Event Pattern Built by EventBridge
- Editing the Event Pattern to Match CSV Files
- Adding a Target to the EventBridge Rule to Trigger AWS Step Functions
- Creating the EventBridge Rule for S3 Notifications
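
Expressed with boto3, the rule and its target look roughly like this; the rule name, bucket name, and both ARNs are placeholders. The target role must allow events.amazonaws.com to call states:StartExecution.

```python
import json
import boto3

events = boto3.client("events")

# Match only "Object Created" events for .csv files in the pipeline bucket.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["flights-data-pipeline-bucket"]},
        "object": {"key": [{"suffix": ".csv"}]},
    },
}

events.put_rule(Name="daily-flights-upload", EventPattern=json.dumps(pattern))

# Point the rule at the Step Functions state machine.
events.put_targets(
    Rule="daily-flights-upload",
    Targets=[{
        "Id": "flights-step-function",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:flights-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```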
- Uploading a File to the S3 Bucket
- Step Function Triggered
- Graph View of the Step Function Showing the Glue Crawler Step Running
- Glue Crawler Running
- Glue Job Started
- Step Function Execution Successful
- Step Function Success Notification
- Data Successfully Ingested into Redshift
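
A quick end-to-end smoke test, assuming the hypothetical names used throughout: upload one daily file and confirm that a new execution starts.

```python
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Upload a sample daily file; the EventBridge rule should fire and start
# a new state machine execution within a few seconds.
s3.upload_file("flights_2024_01_15.csv",  # hypothetical local file
               "flights-data-pipeline-bucket",
               "daily_flights/flights_2024_01_15.csv")

# Executions are returned most recent first.
executions = sfn.list_executions(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:flights-pipeline",
    maxResults=1,
)["executions"]
print(executions[0]["status"])  # RUNNING while the pipeline executes
```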
- Analytics on Delayed Flights
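
As an example of the analysis the warehouse now supports, the query below ranks origin cities by average departure delay, run through the Redshift Data API. Table and column names follow the assumptions used earlier.

```python
import boto3

rsd = boto3.client("redshift-data")

# Worst average departure delays by origin city (names are assumptions).
resp = rsd.execute_statement(
    ClusterIdentifier="flights-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        SELECT origin_city,
               COUNT(*)       AS delayed_flights,
               AVG(dep_delay) AS avg_delay_minutes
        FROM flights_fact
        GROUP BY origin_city
        ORDER BY avg_delay_minutes DESC
        LIMIT 10;
    """,
)
print(resp["Id"])  # poll get_statement_result with this Id for the rows
```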
- AWS SageMaker Integration: Expand the pipeline's analytics capabilities by integrating AWS SageMaker. This would allow machine learning models to run directly within the pipeline, enabling predictive analytics and more complex analysis tasks, for example predicting flight delays from historical data and external factors such as weather conditions or airport traffic.
- Pipeline Performance Tuning: Regularly review the performance of the AWS Glue jobs and Redshift queries, adjusting configurations such as DPU allocation in Glue and query optimization in Redshift to improve performance and reduce execution times. A sketch of the Glue side of this tuning follows below.
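
Glue job capacity can be adjusted programmatically. One caveat: UpdateJob replaces the whole job definition, so any field omitted from JobUpdate is reset to its default; fetch the current definition first and carry over what you want to keep. A hedged sketch with a hypothetical job name:

```python
import boto3

glue = boto3.client("glue")

# Fetch the current definition so required settings can be carried over.
job = glue.get_job(JobName="flights-etl-job")["Job"]  # hypothetical name

glue.update_job(
    JobName="flights-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": job.get("GlueVersion", "4.0"),
        "DefaultArguments": job.get("DefaultArguments", {}),
        # Scale the job: more (or larger) workers means more DPUs per run.
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
    },
)
```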