
ELT themoviedb.org | Leveraging TMDB API


This product uses the TMDB API but is not endorsed or certified by TMDB.

This project leverages The Movie Database (TMDB) API to extract and load data into Google BigQuery, transforms the data using dbt, and visualizes insights using Evidence. The goal is to surface key movie and TV series trends that help media professionals and enthusiasts understand what content captures viewers' interest.

Project overview

This pipeline is designed to streamline the process of data extraction, loading, transformation, and reporting. It uses modern data engineering tools and practices to ensure scalability and reproducibility.

Architecture

BI

ER diagrams

Technologies used

  • dlt (Data Load Tool): For extracting and loading data into Google BigQuery.
  • dbt (Data Build Tool): For transforming data within BigQuery.
  • Evidence.dev: Code-driven alternative to drag-and-drop BI tools.
  • Docker: For containerization of the pipeline.
  • Prefect: For workflow orchestration.
  • Terraform: For Infrastructure as Code (IaC).
  • Google BigQuery: The Data Warehouse.
  • DuckDB: For local testing.

Only BigQuery? Why not Google Cloud Storage (GCS)?

I know using GCS is part of the evaluation criteria; however, I intentionally did not include it in this project for the following reasons:

  1. Data Volume: The data volume from the TMDB API is manageable within BigQuery without the need for intermediate storage.
  2. Complexity and Cost: Avoiding GCS simplifies the architecture and reduces the costs associated with storage and data transfer, which is especially beneficial for small to medium datasets.
  3. Misconceptions about the "Data Lake": New data engineers often believe that integrating cloud storage like Google Cloud Storage (GCS) or AWS S3 is a mandatory step in data pipelines. However, this is not always necessary and can sometimes introduce unnecessary complexity and costs. In scenarios where data can be directly ingested and processed by a data warehousing solution like BigQuery, bypassing intermediate cloud storage streamlines workflows and reduces overhead.

Getting Started

Prerequisites

Set up environment variables

Be sure to create a .env file, and ensure it is configured correctly for your dbt profiles.yml.
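A minimal sketch of what the .env file might contain; the variable names below are hypothetical, so align them with whatever your profiles.yml actually references (for example via dbt's env_var() function):

# Hypothetical variable names -- match them to your profiles.yml
GCP_PROJECT_ID=your-gcp-project-id
BIGQUERY_DATASET=tmdb_dev
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json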

Configurations

  • dbt Configuration: Ensure ~/.dbt/profiles.yml is correctly set up to connect to your BigQuery instance (a sketch follows this list).
  • dlt Configuration: Update secrets.toml under .dlt/ with your keys from themoviedb.org and Google BigQuery.
  • prefect Configuration: In prefect.yaml, be sure to change prefect.deployments.steps.set_working_directory to your own working directory.
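As a rough sketch of both files (the profile name, project ID, dataset, key path, and the [sources.tmdb] section name are all placeholders or assumptions; adapt them to your setup):

# ~/.dbt/profiles.yml -- BigQuery connection via a service-account key
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id        # placeholder
      dataset: tmdb_dev                   # placeholder
      keyfile: /path/to/credentials.json  # placeholder
      threads: 4

# .dlt/secrets.toml -- TMDB key plus BigQuery service-account credentials
[sources.tmdb]          # section name is an assumption; match your source
api_key = "YOUR_TMDB_API_KEY"

[destination.bigquery.credentials]
project_id = "your-gcp-project-id"
private_key = "-----BEGIN PRIVATE KEY-----..."
client_email = "service-account@your-gcp-project-id.iam.gserviceaccount.com"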

Terraform

I want to clarify the purpose and setup of Terraform within this project. The configuration files located in the terraform folder primarily ensure that the environment is correctly prepared, especially regarding the credentials file. Technically, it is there to make sure your keys are correct. That's it.

Fortunately, dlt handles the creation of the necessary datasets, and given the simplicity of this project, using Terraform isn't essential; still, it helps ensure that all system components are properly configured before running the pipeline.

If you decide to test it, you must update the default path of the "credentials_file" variable (go to terraform/variables.tf), or override it on the command line as sketched after the commands below.

# Move to terraform folder
cd terraform/

# init project
terraform init

# plan
terraform plan

# apply
terraform apply
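Alternatively, instead of editing variables.tf, you can override the variable at run time (the path below is a placeholder):

# Override the credentials_file variable on the command line
terraform plan -var="credentials_file=/path/to/credentials.json"
terraform apply -var="credentials_file=/path/to/credentials.json"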

Use of Makefile

You can refer to the help command for guidance on what commands are available and what each command does:

make help

Output:

Usage:
  make setup_uv                    - Instructions to install uv using a script, system package manager, or pipx
  make install_dependencies        - Installs Python dependencies using uv
  make create_venv                 - Creates a virtual environment using uv
  make activate_venv               - Instructions to activate the Python virtual environment
  make run_prefect_server          - Runs prefect localhost server
  make deploy_prefect              - Deploys Prefect flows
  make start_evidence              - Sets up and runs the evidence.dev project

This command displays all available options and their descriptions, making it easy to understand how to interact with the project using make commands.

Installation

  1. Clone the repository (and, if you like, run the Terraform checks described above):
git clone git@github.com:theDataFixer/de-zoomcamp-project.git
cd de-zoomcamp-project
  2. Install uv (an extremely fast Python package installer and resolver, written in Rust), then create and activate the virtual environment:
make setup_uv
make create_venv
make activate_venv
  3. Install the Python dependencies:
make install_dependencies

Usage

  • Start Prefect Server:
make run_prefect_server
  • Deploy Prefect Flows:
make deploy_prefect

After deploying, you'll get a message in the terminal telling you to start a worker with your chosen pool name; then go to localhost:4200 and run the workflow.
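For example (a sketch; substitute the pool name shown in your deploy output):

# Start a worker polling your work pool, then trigger the flow from the UI
prefect worker start --pool "default-pool"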

  • Start and use Evidence.dev:
make start_evidence
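Once it starts, the Evidence dev server is typically reachable at localhost:3000.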

NOTE:

If Prefect raises the error sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked, you should switch its backing database to PostgreSQL. Instructions here

In short, run:

docker run -d --name prefect-postgres \
  -v prefectdb:/var/lib/postgresql/data \
  -p 5432:5432 \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=yourTopSecretPassword \
  -e POSTGRES_DB=prefect \
  postgres:latest

And then:

prefect config set PREFECT_API_DATABASE_CONNECTION_URL="postgresql+asyncpg://postgres:yourTopSecretPassword@localhost:5432/prefect"
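After updating the connection URL, restart the Prefect server so it picks up the new database.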


Contact

Feel free to reach out to me if you have any questions, comments, suggestions, or feedback: theDataFixer.xyz
