Daily Flight Data Ingestion

An AWS data pipeline that ingests daily flight data and loads it into Amazon Redshift.

Prerequisites

  • A good understanding of AWS services: S3, Redshift, Glue, Step Functions, and EventBridge.
  • A good understanding of Python and SQL.
  • Knowledge of AWS security best practices, including IAM (Identity and Access Management) roles and policies.

Project Motivation

This project leverages AWS services to process large volumes of daily flight data and load them into a Redshift data warehouse, which end users can then query for comprehensive analysis and reporting.

Architecture Diagram

Architecture Diagram

AWS Glue Visual ETL Diagram

ETL Pipeline

AWS Step Function Diagram

AWS Step Function

Architecture Diagram Steps

  1. The pipeline starts with an S3 bucket where the daily flight data files are stored or ingested.
  2. An EventBridge rule monitors the S3 bucket for new data arrivals and triggers the Step Function when a new file is detected.
  3. Step Functions orchestrates and coordinates the subsequent steps in the pipeline, providing serverless workflow management.
  4. A Glue crawler crawls the data in S3, extracts metadata, and infers the schema, preparing the data for further processing.
  5. An Apache Spark Glue ETL job then performs data transformation, cleansing, and preparation on the flight data to ensure it is in the desired format for loading into the data warehouse.
  6. Based on the outcome of the Glue job, a notification is published to the appropriate SNS topic for either success or failure, informing stakeholders of the status of the data ingestion process.
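
The orchestration described above can be expressed as a Step Functions state machine. The snippet below is a minimal, illustrative sketch only: the crawler name, Glue job name, SNS topic ARN, and IAM role ARN are placeholders rather than values taken from this project, and the real state machine (see the AWS Step Function diagram above) may differ in detail.

```python
# Hedged sketch of the orchestration: crawl, wait for the crawler, run the Glue job,
# then publish a success or failure message to SNS. All names/ARNs are placeholders.
import json
import boto3

definition = {
    "Comment": "Crawl new flight data, run the Glue ETL job, then notify via SNS",
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "daily-flight-crawler"},  # placeholder crawler name
            "Next": "GetCrawlerState",
        },
        "GetCrawlerState": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
            "Parameters": {"Name": "daily-flight-crawler"},
            "Next": "CrawlerFinished",
        },
        "CrawlerFinished": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.Crawler.State", "StringEquals": "READY", "Next": "RunGlueJob"}
            ],
            "Default": "WaitForCrawler",
        },
        "WaitForCrawler": {"Type": "Wait", "Seconds": 30, "Next": "GetCrawlerState"},
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # .sync waits for the job
            "Parameters": {"JobName": "flight-etl-job"},  # placeholder job name
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:flight-pipeline-alerts",
                "Message": "Daily flight data ingested into Redshift successfully.",
            },
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:flight-pipeline-alerts",
                "Message": "Daily flight data ingestion failed; check the Glue job run logs.",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="flight-ingestion-state-machine",          # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/flight-step-functions-role",  # placeholder role
)
```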

Steps to Build the Pipeline

  1. Creating S3 Bucket

    • Initial setup of an S3 bucket to store data resources. Create S3 Bucket
  2. Creating Folders for Data Storage

    • Setting up folders in S3 to store the dimension data and the daily flight arrival and departure data. Create Folders in S3
  3. Viewing Data Inside Dimension Folder

    • Overview of data stored inside the dimension folder in S3. Data Inside Dimension Folder
  4. Starting the Redshift Cluster

    • Initiating the AWS Redshift cluster to handle and analyze data. Start Redshift Cluster
  5. Creating Redshift Connection in AWS Glue

    • Establishing a connection to AWS Redshift from AWS Glue. Create Redshift Connection in Glue
  6. Successful Redshift Connection

    • Confirmation of a successful connection setup to AWS Redshift. Redshift Connection Successful
  7. Creating Glue Metadata Database

    • Setting up a metadata database in AWS Glue. Create Glue Metadata Database
  8. Creating Tables in Redshift

    • Executing commands to create tables in the AWS Redshift database. Create Tables in Redshift
  9. Glue Crawler for Redshift Metadata

    • Configuring a Glue crawler to extract metadata from the dimensions table in Redshift. Setup Glue Crawler for Redshift Metadata
  10. Glue Crawler for Flights Fact Table

    • Setting up a Glue crawler for the flights fact table to organize and access flight data efficiently. Setup Glue Crawler for Flights Fact Table
  11. Glue Crawler for Daily Flight Data

    • Configuring a Glue crawler to scan and index daily flight data stored in S3. Setup Glue Crawler for Daily Flight Data
  12. Sample Data in S3 Bucket

    • Uploading sample data to the S3 bucket so the Glue crawler can infer the metadata. This sample can be deleted once the initial crawler run completes. Upload Sample Data to S3
  13. Successful Crawler Run

    • Confirmation of a successful run of the Glue crawler, indicating successful metadata extraction. Crawler Run Successful
  14. Metadata Table Created Successfully

    • Successful creation of metadata tables in AWS Glue, reflecting the structured data from Redshift and S3. Metadata Table Creation Successful
  15. Crawlers Picked Metadata from Both Redshift Tables and S3 Data Files

    • AWS Glue crawlers successfully extracted metadata from both Redshift tables and S3 data files. Crawlers Metadata Extraction
  16. Filter to Show Only Flights Delayed by 60 Minutes or More When Departing

    • Applying a filter in the Glue job so that only flights delayed by 60 minutes or more at departure are kept (see the PySpark sketch after this list). Filter Flights by Delay
  17. Adding Airport Dimension Data Catalog

    • Adding a data catalog for airport dimensions in Glue Visual ETL. Add Airport Dimension Data Catalog
  18. Join Operation on Flight Data

    • Performing a join operation between the filtered flights data and the airports CSV file to enrich flight information. Join Operation on Flight Data
  19. Schema Adjustment After Join

    • Modifying the schema of the joined data to match the pre-existing Redshift table for consistent data integration. Schema Adjustment After Join
  20. Second Join Operation for Destination Details

    • Executing a second join operation to integrate destination details into the flight data. Second Join Operation
  21. Adjusting Schema to Match Destination Table

    • Aligning the schema of the processed data to fit the Redshift destination table specifications. Schema Adjustment for Destination Table
  22. Final Redshift Target Table

    • Overview of the target Redshift table containing the fully processed and joined flight data. Redshift Target Table
  23. Overview of the Complete Glue ETL Pipeline

    • Visual representation of the entire data processing pipeline, from data ingestion to storage. Complete Pipeline Overview
  24. Enabling Job Bookmarking for Incremental Ingestion

    • Enabling job bookmarking in AWS Glue so that each run ingests only new data, with adjustments to worker settings for efficiency. Enable Job Bookmarking
  25. Enabling EventBridge Notifications for S3 Bucket

    • Setting up notifications via AWS EventBridge for the source S3 bucket to trigger processes based on data updates. Enable EventBridge Notifications
  26. Adding Crawler to Step Function

    • Incorporating a Glue crawler into the AWS Step Function to automate part of the data processing workflow. Add Crawler to Step Function
  27. Checking Crawler State

    • Utilizing a 'Get Crawler' function to check the operational state of the AWS Glue crawler. Check Crawler State
  28. Conditional Check for Crawler State

    • Implementing a conditional check within the step function to handle different crawler states. Conditional Check for Crawler
  29. Waiting and Re-checking Crawler Status

    • Setting a wait condition in the step function before re-checking the crawler status to ensure data readiness. Wait and Re-check Crawler Status
  30. Successful Glue Crawler Run

    • Proceeding to the next steps once the Glue crawler run is successful. Assume Crawler Success
  31. Handling Task Failure Notifications

    • Implementing AWS SNS to send notifications in case of task failures during the pipeline execution. Task Failure Notification Setup
  32. Sending Success Notifications

    • Sending success notifications via AWS SNS upon successful completion of the pipeline processes. Success Notification Sent
  33. Sending Failed Notification

    • Adding an SNS publish step for failure notifications. Failed Notification
  34. Overall Step Function Diagram

    • Displaying the overall AWS Step Function diagram, detailing the complete process flow and integrations. Step Function Diagram
  35. Creating an EventBridge Rule

    • Setting up an AWS EventBridge rule to manage events based on specific triggers within the pipeline. Create EventBridge Rule
  36. Building EventBridge Event Pattern

    • Constructing an event pattern in AWS EventBridge to filter and respond to S3 "Object Created" events effectively. EventBridge Event Pattern
  37. EventBridge Built Event

    • Overview of a built event in AWS EventBridge, showcasing the configured event responses. EventBridge Built Event
  38. Editing the Event Pattern for CSV Files

    • Modifying the event pattern in AWS EventBridge to match only CSV object keys (see the EventBridge sketch after this list), enhancing targeted event handling. Edit Event Pattern for CSV Files
  39. Adding Target to EventBridge Rule to Trigger AWS Step Functions

    • Adding a target to the AWS EventBridge rule to direct the event response actions. Add Target to EventBridge
  40. Creating EventBridge Rule for S3 Notifications

    • Establishing an EventBridge rule to listen for S3 create notifications and trigger the state machine. Create EventBridge Rule for S3
  41. Uploading a File to S3 Bucket

    • Uploading a new file to the S3 bucket to initiate the automated data handling and analytics pipeline. Upload File to S3
  42. Step Function Triggered

    • The AWS Step Function is triggered in response to new data being uploaded to S3. Step Function Started
  43. Graph View of Step Function Showing the Glue Crawler Step Running

    • Displaying the graph view of the AWS Step Function, illustrating the sequence of operations and decision points. Graph View of Step Function
  44. Glue Crawler Running

    • Showing the AWS Glue crawler in operation as it processes data inputs. Crawler Running
  45. Glue Job Started

    • Initiating an AWS Glue job to transform and load data according to predefined logic and parameters. Glue Job Started
  46. Step Function Execution Successful

    • Confirming the successful completion of the AWS Step Function execution, marking the end of the automated process. Step Function Successful
  47. Step Function Success Notification

    • Notification sent confirming the successful execution of the Step Function. Success Notification Email
  48. Data Successfully Ingested into Redshift

    • Verifying that data has been successfully ingested into the AWS Redshift data warehouse, ready for analysis. Data Ingested into Redshift
  49. Analytics on Delayed Flights

    • Displaying analytics on the number of flights delayed by at least one hour, demonstrating the data processing and analysis capabilities. Analytics on Delayed Flights
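
For reference, steps 16 through 24 correspond roughly to the PySpark script below. This is a minimal sketch, not the project's generated code: the catalog database, table, column, and connection names (flights_db, daily_flights, airports_dim, redshift-connection, depdelay, and so on) are assumptions for illustration. The actual job was built in Glue Visual ETL, which generates an equivalent PySpark script.

```python
# Sketch of the Glue ETL job (steps 16-24). Database, table, column, and connection
# names are illustrative placeholders, not values taken from this repository.
import sys
from awsglue.transforms import ApplyMapping, Filter, Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext())
job = Job(glue_ctx)
# Step 24: run the job with --job-bookmark-option job-bookmark-enable for incremental ingestion.
job.init(args["JOB_NAME"], args)

# Read the newly crawled daily flight data (transformation_ctx enables bookmarking)
# and the airport dimension table from the Glue Data Catalog.
daily_flights = glue_ctx.create_dynamic_frame.from_catalog(
    database="flights_db", table_name="daily_flights", transformation_ctx="daily_flights"
)
airports = glue_ctx.create_dynamic_frame.from_catalog(
    database="flights_db", table_name="airports_dim"
)

# Step 16: keep only flights delayed by 60 minutes or more at departure.
delayed = Filter.apply(
    frame=daily_flights,
    f=lambda row: row["depdelay"] is not None and row["depdelay"] >= 60,
)

# Steps 17-19: join with the airport dimension on the origin airport, then rename the
# joined fields so they describe the origin and the generic names are free for reuse.
with_origin = Join.apply(delayed, airports, "originairportid", "airport_id")
with_origin = ApplyMapping.apply(
    frame=with_origin,
    mappings=[
        ("carrier", "string", "carrier", "string"),
        ("depdelay", "long", "dep_delay", "long"),
        ("arrdelay", "long", "arr_delay", "long"),
        ("destairportid", "long", "destairportid", "long"),
        ("city", "string", "origin_city", "string"),
        ("state", "string", "origin_state", "string"),
    ],
)

# Steps 20-21: second join for destination details, then align with the Redshift table schema.
enriched = Join.apply(with_origin, airports, "destairportid", "airport_id")
final = ApplyMapping.apply(
    frame=enriched,
    mappings=[
        ("carrier", "string", "carrier", "string"),
        ("dep_delay", "long", "dep_delay", "long"),
        ("arr_delay", "long", "arr_delay", "long"),
        ("origin_city", "string", "origin_city", "string"),
        ("origin_state", "string", "origin_state", "string"),
        ("city", "string", "dest_city", "string"),
        ("state", "string", "dest_state", "string"),
    ],
)

# Step 22: load into the pre-created Redshift fact table through the Glue connection.
glue_ctx.write_dynamic_frame.from_jdbc_conf(
    frame=final,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "daily_flights_fact", "database": "dev"},
    redshift_tmp_dir="s3://airline-data-bucket/temp/",
)

job.commit()  # records the bookmark so the next run only processes new files
```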

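Similarly, the event wiring in steps 25 and 35 through 40 can be sketched with boto3. The bucket name, rule name, state machine ARN, and IAM role ARN below are placeholders, not values from this project.

```python
# Sketch of the EventBridge wiring (steps 25, 35-40): enable S3 EventBridge notifications,
# match "Object Created" events for .csv keys, and target the Step Functions state machine.
import json
import boto3

BUCKET = "airline-data-bucket"  # placeholder bucket name
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:flight-ingestion-state-machine"
)

# Step 25: turn on EventBridge notifications for the source bucket.
s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# Steps 36-38: event pattern that matches only newly created .csv objects in this bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": [BUCKET]},
        "object": {"key": [{"suffix": ".csv"}]},
    },
}

events = boto3.client("events")
events.put_rule(
    Name="daily-flight-csv-uploaded",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Steps 39-40: target the state machine; the role must allow states:StartExecution.
events.put_targets(
    Rule="daily-flight-csv-uploaded",
    Targets=[{
        "Id": "start-flight-pipeline",
        "Arn": STATE_MACHINE_ARN,
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-step-functions",
    }],
)
```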
Potential Next Steps

  1. AWS SageMaker Integration: Expand the analytics capabilities of the pipeline by integrating AWS SageMaker. This would allow machine learning models to run directly within the pipeline, enabling predictive analytics and more complex analysis, for example predicting flight delays from historical data and external factors such as weather conditions or airport traffic.
  2. Pipeline Performance Tuning: Regularly review the performance of the AWS Glue jobs and Redshift queries. Adjust configurations such as DPUs in Glue and query optimization in Redshift to enhance performance and reduce execution times.
