Brewery Data Pipeline - This project implements a data pipeline to fetch, transform, and persist brewery data from the Open Brewery DB API into a data lake, following the medallion architecture (bronze, silver, gold layers). The pipeline is orchestrated using Apache Airflow and runs within Docker containers, coordinated via Docker Compose.

Home Page: https://www.openbrewerydb.org/

ETL Data Pipeline for Breweries Data

A Portuguese version of this README is available here: README-PT

Description

  1. Extracts brewery data from the API endpoint https://api.openbrewerydb.org/breweries (a minimal extraction sketch follows this list).
  2. Transforms and cleans the data, then persists it in JSON and Parquet formats, including an aggregated view with the number of breweries per type and location.
  3. Loads the data into a PostgreSQL database for further querying.
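
A minimal sketch of what the extraction step might look like, assuming the API's page/per_page pagination parameters; the project's actual logic lives in scripts/main.py:

    import json
    import requests

    def extract_breweries(per_page: int = 200) -> list[dict]:
        """Fetch all breweries from the Open Brewery DB API, page by page."""
        breweries, page = [], 1
        while True:
            resp = requests.get(
                "https://api.openbrewerydb.org/breweries",
                params={"page": page, "per_page": per_page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                # An empty page means there is no more data to fetch.
                break
            breweries.extend(batch)
            page += 1
        return breweries

    if __name__ == "__main__":
        with open("breweries_raw.json", "w") as f:
            json.dump(extract_breweries(), f)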

Setup

Environment Variables

  1. Set the AWS key environment variables in the docker-compose files to enable writing data to S3 storage.
  2. Optionally, define the S3 bucket path that data lake files should be written to.

If these are not defined, the pipeline still writes the data locally inside the container but skips the upload to S3 cloud storage (see the sketch below).
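
A rough sketch of how that optional upload could work, assuming the keys are exposed as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY and the bucket name as S3_BUCKET (the variable names actually used in scripts/main.py may differ):

    import os
    import boto3

    def maybe_upload_to_s3(local_path: str, key: str) -> None:
        """Upload a local data lake file to S3 only when a bucket is configured."""
        bucket = os.getenv("S3_BUCKET")  # assumed variable name
        if not bucket:
            # No bucket configured: keep the file only in the container's local data lake.
            print(f"S3_BUCKET not set, keeping {local_path} locally")
            return
        # boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
        s3 = boto3.client("s3")
        s3.upload_file(local_path, bucket, key)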

Prerequisites

  • Git
  • Docker
  • Docker Compose

Steps to Run

  1. Clone the repository:
    git clone https://github.com/vitorjpc10/etl-breweries.git
  2. Move to the newly cloned repository:
    cd etl-breweries

ETL without Orchestrator (Python Docker Image)

  1. Build and run the Docker containers:

    docker-compose up --build
  2. The data will be extracted, transformed, and loaded into the PostgreSQL database based on the logic in scripts/main.py (a rough sketch of the load step appears after these steps).

  3. Once built, run the following command to execute the queries on the breweries table inside the PostgreSQL database container:

    docker exec -it etl-breweries-db-1 psql -U postgres -c "\i queries/queries.sql"

    Type \q in the terminal to exit the query output; there are two queries in total.
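
A rough sketch of the load step, under stated assumptions: the Parquet path, the "db" service name, the postgres/postgres credentials, and the JDBC driver version are all illustrative, and the actual implementation lives in scripts/main.py:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("load-breweries")
        # Pull in the PostgreSQL JDBC driver (version is an assumption).
        .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
        .getOrCreate()
    )

    # Read the transformed data (assumed path) and write it to the breweries table.
    df = spark.read.parquet("data/silver/breweries.parquet")

    (
        df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db:5432/postgres")  # assumed service/database names
        .option("dbtable", "breweries")
        .option("user", "postgres")
        .option("password", "postgres")  # assumed default credentials
        .mode("overwrite")
        .save()
    )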

ETL with Orchestrator (Apache Airflow)

  1. Move to the Airflow directory:

    cd airflow
  2. Build and run the Docker containers:

    docker-compose up airflow-init --build
    docker-compose up
  3. Once all containers are up, access the Airflow web UI at http://localhost:8080/ and trigger the etl_dag DAG (the username and password are both admin by default). A minimal sketch of such a DAG appears after these steps.

  4. Once the DAG run completes successfully, run the following command to execute the queries on the breweries table:

    docker exec -it airflow-postgres-1 psql -U airflow -c "\i queries/queries.sql"

    Type \q in the terminal to exit the query output; there are two queries in total.
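
A minimal sketch of what the etl_dag definition might look like; the task names, schedule, and imported callables are illustrative assumptions, not the project's actual code:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from scripts.main import extract, transform, load  # assumed module layout

    with DAG(
        dag_id="etl_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # assumed schedule
        catchup=False,
    ) as dag:
        # One task per ETL stage, run strictly in sequence.
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> transform_task >> load_task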

Assumptions and Design Decisions

  • The project uses Docker and Docker Compose for containerization and orchestration to ensure consistent development and deployment environments.
  • Docker volumes are utilized to persist PostgreSQL data, ensuring that the data remains intact even if the containers are stopped or removed.
  • The PostgreSQL database is selected for data storage due to its reliability, scalability, and support for SQL queries.
  • Pure Python, SQL, and PySpark are used for data manipulation to ensure lightweight and efficient data processing.
  • The SQL queries for generating reports are stored in separate files (e.g., queries.sql). This allows for easy modification of the queries and provides a convenient way to preview the results.
  • To generate the reports, the SQL queries are executed within the PostgreSQL database container. This approach simplifies the process and ensures that the queries can be easily run and modified as needed.
  • The extracted data is saved locally (in 'data' folders) and, optionally, to AWS S3, and is mounted into the containers; this covers both the raw data coming from the API and the transformed data, in JSON and Parquet formats. This setup offers simplicity (KISS principle) and flexibility, allowing easy access to the data.
  • An aggregate view with the number of breweries per type and location is created to provide insights into the data (a rough PySpark sketch of this aggregation follows this list).
  • Orchestration through Apache Airflow ensures task separation and establishes a framework for executing and monitoring the ETL process. It provides notification alerts for retries or task failures, enhancing the robustness of the pipeline.
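
As an illustration of that aggregate view, a PySpark group-by over brewery type and location might look like the following; the column names (brewery_type, state, city) follow the Open Brewery DB schema, and the input/output paths are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aggregate-breweries").getOrCreate()

    # Read the cleaned data (assumed path) and count breweries per type and location.
    silver = spark.read.parquet("data/silver/breweries.parquet")

    gold = (
        silver.groupBy("brewery_type", "state", "city")
        .agg(F.count("*").alias("brewery_count"))
        .orderBy(F.desc("brewery_count"))
    )

    # Persist the aggregated (gold-layer) view back to the data lake (assumed path).
    gold.write.mode("overwrite").parquet("data/gold/breweries_by_type_and_location.parquet")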

Airflow Sample DAG

(screenshot: img.png)

AWS S3 File Write Preview

(screenshot: img_1.png)
