
ELT themoviedb.org | Leveraging TMDB API


This product uses the TMDB API but is not endorsed or certified by TMDB.

This project leverages The Movie Database (TMDB) API to extract and load data into Google BigQuery, transforms the data using dbt, and visualizes insights using Evidence. The goal is to surface key movie and TV series trends that help media professionals and enthusiasts understand what content captures viewers' interest.

Project overview

This pipeline is designed to streamline the process of data extraction, loading, transformation, and reporting. It uses modern data engineering tools and practices to ensure scalability and reproducibility.

Architecture

BI

ER diagrams

Technologies used

  • dlt (Data Load Tool): For extracting and loading data into Google BigQuery.
  • dbt (Data Build Tool): For transforming data within BigQuery.
  • Evidence.dev: Code-driven alternative to drag-and-drop BI tools.
  • Docker: For containerization of the pipeline.
  • Prefect: For workflow orchestration.
  • Terraform: For Infrastructure as Code (IaC).
  • Google BigQuery: The Data Warehouse.
  • DuckDB: For local testing.

Only BigQuery? Why not Google Cloud Storage (GCS)?

I know using GCS is part of the evaluation criteria; however, I intentionally did not include it in this project for the following reasons:

  1. Data Volume: The data volume from the TMDB API is manageable within BigQuery without the need for intermediate storage.
  2. Complexity and Cost: Avoiding GCS simplifies the architecture and reduces the costs associated with storage and data transfer, which is especially beneficial for small to medium datasets.
  3. Misconceptions about the "Data Lake": New data engineers often believe that integrating cloud storage like Google Cloud Storage (GCS) or AWS S3 is a mandatory step in data pipelines. However, this is not always necessary and can sometimes introduce unnecessary complexity and costs. In scenarios where data can be directly ingested and processed by a data warehousing solution like BigQuery, bypassing intermediate cloud storage streamlines workflows and reduces overhead.

Getting Started

Prerequisites

Set up environment variables

Be sure to create a .env file, and ensure it is configured correctly for your dbt profiles.yml.
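A minimal sketch of what the .env file might contain; the variable names below are hypothetical, so align them with whatever your profiles.yml actually references (for example via dbt's env_var() function):

# Hypothetical variable names -- match them to your profiles.yml
GCP_PROJECT_ID=your-gcp-project-id
BIGQUERY_DATASET=tmdb_dev
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json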

Configurations

  • dbt Configuration: Ensure ~/.dbt/profiles.yml is correctly set up to connect to your BigQuery instance (a sketch follows this list).
  • dlt Configuration: Update secrets.toml under .dlt/ with your keys from themoviedb.org and Google BigQuery.
  • prefect Configuration: In prefect.yaml, be sure to change prefect.deployments.steps.set_working_directory to your own working directory.
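As a rough sketch of both files (the profile name, project ID, dataset, key path, and the [sources.tmdb] section name are all placeholders or assumptions; adapt them to your setup):

# ~/.dbt/profiles.yml -- BigQuery connection via a service-account key
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id        # placeholder
      dataset: tmdb_dev                   # placeholder
      keyfile: /path/to/credentials.json  # placeholder
      threads: 4

# .dlt/secrets.toml -- TMDB key plus BigQuery service-account credentials
[sources.tmdb]          # section name is an assumption; match your source
api_key = "YOUR_TMDB_API_KEY"

[destination.bigquery.credentials]
project_id = "your-gcp-project-id"
private_key = "-----BEGIN PRIVATE KEY-----..."
client_email = "service-account@your-gcp-project-id.iam.gserviceaccount.com"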

Terraform

I want to clarify the purpose and setup of Terraform within this project. The configuration files located in the terraform folder primarily ensure that the environment is correctly prepared, especially regarding the credentials file. Technically, it is there to make sure your keys are correct. That's it.

Fortunately, dlt handles the creation of the necessary datasets, and given the simplicity of this project, using Terraform isn't essential; still, it helps ensure that all system components are properly configured before running the pipeline.

If you decide to test it, you must update the default path of the "credentials_file" variable (go to terraform/variables.tf), or override it on the command line as sketched after the commands below.

# Move to terraform folder
cd terraform/

# init project
terraform init

# plan
terraform plan

# apply
terraform apply
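Alternatively, instead of editing variables.tf, you can override the variable at run time (the path below is a placeholder):

# Override the credentials_file variable on the command line
terraform plan -var="credentials_file=/path/to/credentials.json"
terraform apply -var="credentials_file=/path/to/credentials.json"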

Use of Makefile

You can refer to the help command for guidance on what commands are available and what each command does:

make help

Output:

Usage:
  make setup_uv                    - Instructions to install uv using a script, system package manager, or pipx
  make install_dependencies        - Installs Python dependencies using uv
  make create_venv                 - Creates a virtual environment using uv
  make activate_venv               - Instructions to activate the Python virtual environment
  make run_prefect_server          - Runs prefect localhost server
  make deploy_prefect              - Deploys Prefect flows
  make start_evidence              - Sets up and runs the evidence.dev project

This command displays all available options and their descriptions, making it easy to understand how to interact with the project using make commands.

Installation

  1. Clone the repository (and, if you like, run the Terraform checks described above):
git clone git@github.com:theDataFixer/de-zoomcamp-project.git
cd de-zoomcamp-project
  2. Install uv (an extremely fast Python package installer and resolver, written in Rust), then create and activate the virtual environment:
make setup_uv
make create_venv
make activate_venv
  3. Install the Python dependencies:
make install_dependencies

Usage

  • Start Prefect Server:
make run_prefect_server
  • Deploy Prefect Flows:
make deploy_prefect

After deploying, you'll get a message in the terminal telling you to start a worker with your chosen pool name; then go to localhost:4200 and run the workflow.
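For example (a sketch; substitute the pool name shown in your deploy output):

# Start a worker polling your work pool, then trigger the flow from the UI
prefect worker start --pool "default-pool"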

  • Start and use Evidence.dev:
make start_evidence
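Once it starts, the Evidence dev server is typically reachable at localhost:3000.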

NOTE:

If Prefect raises the error sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked, you should switch its backing database to PostgreSQL. Instructions here

In short, run:

docker run -d --name prefect-postgres \
  -v prefectdb:/var/lib/postgresql/data \
  -p 5432:5432 \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=yourTopSecretPassword \
  -e POSTGRES_DB=prefect \
  postgres:latest

And then:

prefect config set PREFECT_API_DATABASE_CONNECTION_URL="postgresql+asyncpg://postgres:yourTopSecretPassword@localhost:5432/prefect"
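After updating the connection URL, restart the Prefect server so it picks up the new database.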


Contact

Feel free to reach out to me if you have any questions, comments, suggestions, or feedback: theDataFixer.xyz
