Data Pipeline with Apache Airflow

Description

This project demonstrates how to build an ETL data pipeline orchestrated with Apache Airflow, using tools such as External Task Sensors and File Sensors.

Process

  • DAG 1.- This DAG simulates loading data from a source database into the temp_data folder, creating an exact temporary copy of the data.

  • DAG 2.- This DAG contains four file sensors (see the sketch after this list):

    • Waiting db file: This sensor waits for the temp file in the temp_data folder; once that file arrives, it triggers DAG 3.
    • Waiting raw file: This sensor waits for the raw data file in the raw folder; once that file arrives, it triggers DAG 4.
    • Waiting clean file: This sensor waits for the clean data file in the clean folder; once that file arrives, it triggers DAG 5.
    • Waiting consumption file: This sensor waits for the consumption data file in the consumption folder; once that file arrives, it triggers DAG 6.
  • DAG 3.- This DAG executes the integration process, loading the data from the temp_data folder into the raw folder.

  • DAG 4.- This DAG executes the cleaning process, cleaning the raw data and saving it into the clean folder.

  • DAG 5.- This DAG executes the consumption process, applying business rules to the clean data for specific use cases and saving the result into the consumption folder.

  • DAG 6.- This DAG deletes the data in temp_data, since it is only a temporary stage and data does not need to be stored there, and it finalizes the whole process.
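
The sensor-and-trigger pattern behind DAG 2 can be sketched as follows. This is a minimal, hypothetical illustration rather than the project's actual code: the DAG ids, connection id, file name, and poke interval are all assumptions. A FileSensor polls for a file and, once it appears, a TriggerDagRunOperator fires the downstream DAG.

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="01_file_sensors",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Poll for the temp file dropped by DAG 1 (assumed filename).
    waiting_db_file = FileSensor(
        task_id="waiting_db_file",
        fs_conn_id="fs_default",              # filesystem connection pointing at the project root
        filepath="temp_data/data_sales.csv",  # assumed name of the temp file
        poke_interval=30,                     # re-check every 30 seconds
    )

    # Once the file lands, kick off the integration process (DAG 3).
    trigger_raw_process = TriggerDagRunOperator(
        task_id="trigger_raw_process",
        trigger_dag_id="02_raw_process",
    )

    waiting_db_file >> trigger_raw_process

The other three sensors follow the same pattern, each watching its own folder and triggering the next DAG in the chain.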

Requirements and Installation

Directories and project structure:

ETL_with_airflow/
        |---airflow/
                |---dags/
                        |---001_extract_from_db.py
                        |---01_file_sensors.py
                        |---02_raw_process.py
                        |---03_clean_process.py
                        |---04_consumption_process.py
                        |---05_delete_temp_data.py
                |---logs/
                |---plugins/
        |---data_source/
                |---data_sales.csv
        |---etl_process/
                |---data_integration/
                        |---integration.py
                |---clean_process/
                        |---extract.py
                        |---load_to_clean_stage.py
                        |---transformations_to_clean.py
                |---consumption_process/
                        |---extract.py
                        |---load_to_consumption_stage.py
                        |---transformations_to_consumption.py
        |---my_bucket/
                |---data/
                        |---raw/
                        |---clean/
                        |---consumption/
        |---temp_data/
        |---.gitignore
        |---docker-compose.yaml
        |---README.md
        |---newspaper.db
        |---requirements.txt

This project requires Python 3.6 or higher; check your Python version first.

Run the following command to create a virtual environment:

python -m venv venv

To activate the virtual environment, run on Windows:

venv\Scripts\activate

or on Linux/macOS:

source venv/bin/activate

The requirements.txt file lists all the Python libraries that the pipeline depends on; install them with:

pip install -r requirements.txt

Run Airflow

Navigate to the folder where the docker-compose.yaml is located and run:

docker-compose up
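
If the docker-compose.yaml is based on the official Airflow Docker template (an assumption, since the file's contents are not shown here), the metadata database must be initialized once before the first start:

docker-compose up airflow-init

After the services are running, the Airflow web UI is typically available at http://localhost:8080.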

Development

Data Architecture

DAGs' Workflow

Task Workflow

DAG 1: 001_extract_from_db.py

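Since the DAG source is not reproduced here, the following is a minimal sketch of what this extraction step might look like, assuming the source is the newspaper.db SQLite file in the project root; the table name and output filename are hypothetical placeholders.

import sqlite3
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_db():
    # Copy the source table verbatim into the temp_data staging folder.
    conn = sqlite3.connect("newspaper.db")
    df = pd.read_sql_query("SELECT * FROM sales", conn)  # "sales" is a hypothetical table name
    conn.close()
    df.to_csv("temp_data/data_sales.csv", index=False)   # assumed temp file name

with DAG(
    dag_id="001_extract_from_db",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_from_db", python_callable=extract_from_db)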

DAG 2: 01_file_sensors.py

DAG 3: 02_raw_process.py

DAG 4: 03_clean_process.py

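The etl_process/clean_process package contains three modules (extract.py, transformations_to_clean.py, load_to_clean_stage.py), which suggests an extract >> transform >> load task chain. The skeleton below is a hedged sketch: the function names imported from those modules are assumptions, not taken from the actual files.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical imports mirroring the etl_process/clean_process modules;
# the callable names are assumed.
from etl_process.clean_process.extract import extract
from etl_process.clean_process.transformations_to_clean import transform
from etl_process.clean_process.load_to_clean_stage import load

with DAG(
    dag_id="03_clean_process",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task

DAG 5 plausibly follows the same structure, wired to the etl_process/consumption_process modules instead.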

DAG 5: 04_consumption_process.py

DAG 6: 05_delete_temp_data.py

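A cleanup step like this one can be as simple as removing every file in temp_data. A minimal sketch, assuming the folder path is resolvable relative to the Airflow worker's working directory:

from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator

def delete_temp_data():
    # Remove every file in the temporary staging folder.
    for file in Path("temp_data").glob("*"):
        if file.is_file():
            file.unlink()

with DAG(
    dag_id="05_delete_temp_data",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="delete_temp_data", python_callable=delete_temp_data)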
