A movies data platform on Google Cloud Platform (GCP), built with serverless services.
This platform automatically extracts and processes movie-related data from The Movie Database (TMDb) API, with the following steps:
- Data extraction from TMDb API and storage in a Cloud Storage bucket
- Data migration from the bucket to BigQuery tables
- Data transformation in BigQuery to extract specific information in new tables
- Data rendering with Looker Studio
These steps are deployed to GCP as Cloud Functions. A workflow, defined with the GCP Workflows tool, runs each function sequentially. A Cloud Scheduler job triggers the workflow every day at 9 am.
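For illustration, the extraction step can be sketched in Python as follows. The endpoint choice, blob path, and function names are assumptions for this sketch, not the repository's actual code:

```python
import json
import math

TMDB_URL = "https://api.themoviedb.org/3/movie/now_playing"
MOVIES_WANTED = 250
PAGE_SIZE = 20  # TMDb returns 20 results per page


def pages_needed(total, page_size=PAGE_SIZE):
    """Number of TMDb result pages required to collect `total` movies."""
    return math.ceil(total / page_size)


def extract_movies(api_key):
    """Fetch the most popular movies currently in theatres from TMDb."""
    import requests  # assumed installed with the project's dependencies

    movies = []
    for page in range(1, pages_needed(MOVIES_WANTED) + 1):
        resp = requests.get(TMDB_URL, params={"api_key": api_key, "page": page})
        resp.raise_for_status()
        movies.extend(resp.json()["results"])
    return movies[:MOVIES_WANTED]


def upload_to_bucket(movies, bucket_name):
    """Store the raw extract as a single JSON blob in Cloud Storage."""
    from google.cloud import storage  # assumed installed with the project

    blob = storage.Client().bucket(bucket_name).blob("raw/now_playing.json")
    blob.upload_from_string(json.dumps(movies), content_type="application/json")
```

Since TMDb serves 20 results per page, collecting 250 movies takes 13 paginated requests.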
Curated data rendering is done with Looker Studio. The following Looker Studio report is updated when new curated data is available and shows charts related to the 250 most popular movies currently in theatres:
The infrastructure is deployed continuously using Terraform and a GitHub Actions workflow.
Note: a high level of code quality is maintained by using the Isort, Black, Flake8, and Pylint linting tools together with pre-commit hooks.
This section explains how to reproduce the data platform with your own Google account. The instructions are written for Linux.
Clone the GitHub repository:
git clone https://github.com/barney11/tmdb-data-platform.git
Build a Python virtual environment and activate it:
sudo apt install python3.8 python3.8-dev python3.8-venv
python3.8 -m venv platform_venv
source platform_venv/bin/activate
Move into the repository:
cd tmdb-data-platform
Install the dependencies:
pip install -e .
Optional: if you plan to fork this project and keep the maximum code quality level, install the dev dependencies and activate pre-commit:
pip install -e .[dev]
pre-commit install
Create an account and generate an API key on The Movie Database: https://www.themoviedb.org/
Export your API key:
export TF_VAR_TMDB_API_KEY=<your-api-key>
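Before deploying, you can sanity-check that the key was exported and is accepted by TMDb. This is an optional sketch, assuming `requests` is installed; the `/configuration` endpoint is used here as a lightweight authenticated call:

```python
import os


def tmdb_auth_params(extra=None):
    """Merge the exported TMDb API key into request query parameters."""
    params = {"api_key": os.environ["TF_VAR_TMDB_API_KEY"]}
    params.update(extra or {})
    return params


def check_key():
    """Live check against TMDb; returns True if the key is accepted."""
    import requests  # assumed installed alongside the project

    resp = requests.get(
        "https://api.themoviedb.org/3/configuration",
        params=tmdb_auth_params(),
    )
    return resp.status_code == 200
```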
Make sure you have a valid Google account. Then go to the Google Cloud Platform console and create a new project. Export the project ID and the project number:
export TF_VAR_PROJECT_ID=<your-project-id>
export TF_VAR_PROJECT_NUMBER=<your-project-number>
Create a service account on GCP with the following roles:
- Storage Object Admin
- BigQuery Admin
- Cloud Functions Admin
- Workflows Admin
- Cloud Scheduler Admin
- Secret Manager Admin
- Service Account User
Create the JSON key file that corresponds to this service account, then export the path to this file:
export TF_VAR_GCP_CREDENTIALS=$(realpath <path/to/your/json/key/file>)
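As a quick optional check that the exported file really is a service-account key, the following sketch validates the fields that every GCP service-account JSON key contains:

```python
import json

# Fields present in every GCP service-account JSON key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_key(info):
    """Return the list of required fields missing from a parsed key file."""
    missing = [f for f in REQUIRED_FIELDS if f not in info]
    if info.get("type") not in (None, "service_account"):
        missing.append("type=service_account")
    return missing


def validate_key_file(path):
    """Parse the key file at `path` and report any missing fields."""
    with open(path) as fh:
        return validate_key(json.load(fh))
```

An empty result from `validate_key_file` means the file has the expected shape; it does not prove the key is still active on GCP.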
Download and install Terraform (v1.6.2):
curl -LO https://releases.hashicorp.com/terraform/1.6.2/terraform_1.6.2_linux_amd64.zip
unzip terraform_1.6.2_linux_amd64.zip
sudo mv terraform /usr/local/bin/
Choose names for:
- The BigQuery dataset
- The data storage bucket
- The Cloud Functions storage bucket
- The Terraform states storage bucket
Export those names:
export TF_VAR_DATASET_NAME="<your-dataset-name>"
export TF_VAR_DATA_STORAGE_BUCKET_NAME="<your-data-storage-bucket-name>"
export TF_VAR_CF_BUCKET_NAME="<your-cloud-functions-bucket-name>"
export TF_VAR_TF_STATES_BUCKET_NAME="<your-terraform-states-bucket-name>"
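To catch a missing export before running Terraform, a small optional helper (the variable list simply mirrors the exports above) could be:

```python
import os

# Environment variables the Terraform plan below expects to be exported.
REQUIRED_VARS = [
    "TF_VAR_TMDB_API_KEY",
    "TF_VAR_GCP_CREDENTIALS",
    "TF_VAR_PROJECT_ID",
    "TF_VAR_PROJECT_NUMBER",
    "TF_VAR_DATASET_NAME",
    "TF_VAR_DATA_STORAGE_BUCKET_NAME",
    "TF_VAR_CF_BUCKET_NAME",
    "TF_VAR_TF_STATES_BUCKET_NAME",
]


def missing_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED_VARS if not env.get(v)]
```

Running `missing_vars()` should return an empty list once every export above has been made.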
Move to the terraform directory:
cd platform/terraform
Initialize Terraform:
terraform init
Build the Terraform plan:
terraform plan \
-out=tfplan \
-var tmdb_api_key=$TF_VAR_TMDB_API_KEY \
-var gcp_credentials_json=$TF_VAR_GCP_CREDENTIALS \
-var project_id=$TF_VAR_PROJECT_ID \
-var project_number=$TF_VAR_PROJECT_NUMBER \
-var dataset_name=$TF_VAR_DATASET_NAME \
-var data_storage_bucket_name=$TF_VAR_DATA_STORAGE_BUCKET_NAME \
-var terraform_states_bucket_name=$TF_VAR_TF_STATES_BUCKET_NAME \
-var cloud_functions_bucket_name=$TF_VAR_CF_BUCKET_NAME
Deploy:
terraform apply "tfplan"
You can easily build a Looker Studio report with charts from the curated BigQuery tables.
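As a starting point for exploring those tables outside Looker Studio, here is a hedged sketch of querying a curated table with the BigQuery client; the table and column names are placeholders, not the project's actual schema:

```python
def top_movies_query(dataset, table, limit=10):
    """Build a query over a curated table (names are placeholders)."""
    return (
        f"SELECT title, popularity FROM `{dataset}.{table}` "
        f"ORDER BY popularity DESC LIMIT {limit}"
    )


def print_top_movies(dataset, table):
    """Run the query against BigQuery and print the results."""
    from google.cloud import bigquery  # assumed installed with the project

    client = bigquery.Client()
    for row in client.query(top_movies_query(dataset, table)).result():
        print(row.title, row.popularity)
```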
This project was developed while listening to the album "Looping", a collaboration between Rone and the Orchestre National de Lyon.