The "Cloud Data Vault with dbt and Airbyte" project is a data management solution for integrating, transforming, and orchestrating ELT processes with dbt (Data Build Tool) and Airbyte. It builds a cloud-based Data Vault to support efficient data workflows.
This project is part of the Data Warehouse Analyst course (OTUS). It demonstrates the integration and automation of data workflows using dbt and Airbyte, along with infrastructure management via Terraform and data visualization in Apache Superset.
- Successful integration of data sources using Airbyte.
- Data transformation and modeling in dbt.
- Automated deployment of infrastructure using Terraform.
- Visualization of data in Apache Superset dashboards.
- Presentation of actionable insights from the data.
- Enhanced data-driven decision-making.
- Improved data accessibility and usability.
- Automated ETL processes to save time and reduce errors.
- Terraform==1.4.6:
  - Tool for automating and managing infrastructure.
  - Used to deploy the cloud infrastructure required for the project.
- Airbyte==0.40.32:
  - Data integration platform for extracting and loading data from various sources.
  - Supports multiple connectors for popular data sources.
- dbt==1.5.3:
  - Handles data transformation and modeling in the data warehouse using the AutomateDV package.
  - Integrates with Airbyte for workflow automation.
- Apache Superset==3.1.3:
  - Data visualization tool.
  - Used to create dashboards and reports based on the data loaded by Airbyte and transformed by dbt.
```mermaid
graph TD
    A[Data Sources] -->|Extract| B[Airbyte]
    subgraph PostgreSQL Data Warehouse
        C[raw]
        I[stg]
        J[dds]
        K[mart]
    end
    B -->|Load| C
    C -->|dbt| I
    I -->|dbt| J
    J -->|dbt| K
    subgraph Docker Superset
        F[Apache Superset]
    end
    K -->|Query| F
    subgraph Docker Dev
        G[Terraform]
        H[dbt]
    end
    G -->|Yandex Cloud| B
    G -->|Yandex Cloud| C
```
- Data Sources: Fake e-commerce-like data.
- Load Frequency: Hourly.
- Source Data Formats: JSON.
- Data Structure: Data Vault.
```mermaid
erDiagram
    HUB_PRODUCTS {
        binary product_hk
        string brand_model
        timestamp load_date
        string record_source
    }
    HUB_USERS {
        binary user_hk
        string email
        timestamp load_date
        string record_source
    }
    LINK_PURCHASES {
        binary purchase_hk
        binary product_hk
        binary user_hk
        timestamp purchased_at
        timestamp added_to_cart_at
        timestamp returned_at
        timestamp effective_from
        timestamp load_date
        string record_source
    }
    SAT_PRODUCTS {
        binary product_hk
        binary hashdiff
        string make
        int year
        string model
        float price
        timestamp created_at
        timestamp effective_from
        timestamp load_date
        string record_source
    }
    SAT_USERS {
        binary user_hk
        binary hashdiff
        int age
        string name
        string email
        string title
        string gender
        string height
        int weight
        string language
        string telephone
        string blood_type
        string occupation
        string nationality
        string academic_degree
        timestamp created_at
        timestamp updated_at
        timestamp effective_from
        timestamp load_date
        string record_source
    }
    HUB_PRODUCTS ||--o{ SAT_PRODUCTS : "product_hk"
    HUB_USERS ||--o{ SAT_USERS : "user_hk"
    HUB_PRODUCTS ||--o{ LINK_PURCHASES : "product_hk"
    HUB_USERS ||--o{ LINK_PURCHASES : "user_hk"
```
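The hub, link, and satellite keys in the model above (`product_hk`, `user_hk`, `purchase_hk`, `hashdiff`) are hash-based, as is standard in Data Vault. A minimal Python sketch of the idea follows; in the project itself these columns are generated in SQL by AutomateDV macros, whose exact normalization and null handling differ, and the sample values here are purely illustrative:

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Simplified Data Vault hashing: trim and upper-case each input,
    join with '||', and take the MD5 hex digest."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hub hash keys derive from a single business key:
product_hk = hash_key("Tesla Model S")      # cf. HUB_PRODUCTS.product_hk
user_hk = hash_key("jane.doe@example.com")  # cf. HUB_USERS.user_hk

# A link hash key combines the parent business keys:
purchase_hk = hash_key("Tesla Model S", "jane.doe@example.com")

# A satellite hashdiff covers all descriptive attributes, so a change in
# any attribute yields a new hashdiff and therefore a new satellite row:
hashdiff = hash_key("Tesla", "2016", "Model S", "79999.0")
```

Because inputs are normalized before hashing, the same business key always produces the same hash key, which is what makes hubs and links idempotent across loads.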
- Fork this repository.
- Copy the template file `.env.template` to `.env` and add your credentials.
- Start the Docker containers:
  ```shell
  docker-compose up -d
  ```
- Enter the `infrastructure` container:
  ```shell
  docker exec -it infrastructure bash
  ```
- Configure the yc CLI:
  ```shell
  yc init
  ```
- Set environment variables:
  ```shell
  export YC_TOKEN=$(yc iam create-token)
  export YC_CLOUD_ID=$(yc config get cloud-id)
  export YC_FOLDER_ID=$(yc config get folder-id)
  export TF_VAR_folder_id=$(yc config get folder-id)
  export $(xargs < .env)
  ```
- Configure the Yandex Cloud Terraform provider:
  ```shell
  cd terraform/
  cp terraformrc ~/.terraformrc
  ```
- Run Terraform. At the end of the deployment you will see outputs that you will need in a moment:
  ```shell
  terraform init
  terraform validate
  terraform fmt
  terraform plan
  terraform apply
  ```
  ```
  Outputs:
  postgresql_cluster_id = "<id>"
  postgresql_dbname = "dbname" # from .env
  postgresql_host = "*.mdb.yandexcloud.net"
  postgresql_user = "user" # from .env
  yandex_compute_instance_nat_ip_address = "<ip>"
  yandex_iam_service_account_static_access_key = "<key>"
  yandex_iam_service_account_static_secret_key = <sensitive>
  ```
- Open the Airbyte instance at `{yandex_compute_instance_nat_ip_address}:8000` and log in with credentials:
  ```
  login: airbyte
  password: password
  ```
- Create a Faker source.
- Create a PostgreSQL destination with credentials:
  - host: `{postgresql_host}`
  - port: `6432`
  - username: `{postgresql_user}`
  - password: `{postgresql_password}` # from .env
  - schema: `public`
  - database: `{postgresql_dbname}`
- Set up a connection in Airbyte:
  - Choose a `Replication frequency`.
  - Change `Destination Namespace` to `Custom`.
  - Change `Namespace Custom Format` to `raw`.
  - Change `Sync mode` to `Incremental | Append`.
  - Click `Set up connection`.
- Set the `DBT_HOST` environment variable, move to the dbt folder, and check the connection:
  ```shell
  export DBT_HOST=$(terraform output -raw postgresql_host)
  cd ../dbt
  dbt debug
  ```
- Install dbt packages:
  ```shell
  dbt deps
  ```
- Run dbt models:
  ```shell
  dbt run
  ```
- Open Superset at http://localhost:8088/.
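The `Incremental | Append` sync mode chosen above reads only records whose cursor value is greater than the state saved after the previous sync, and appends them to the destination without updating or deleting existing rows. A rough, conceptual Python sketch of that behavior (illustrative only; the function and field names here are hypothetical, not Airbyte's API):

```python
from typing import Any

def incremental_append(
    source: list[dict[str, Any]],
    destination: list[dict[str, Any]],
    state: dict[str, Any],
    cursor_field: str = "updated_at",
) -> None:
    """Append only records newer than the saved cursor; never modify
    rows already in the destination (that is what 'Append' means)."""
    cursor = state.get("cursor")
    new_records = [
        r for r in source
        if cursor is None or r[cursor_field] > cursor
    ]
    destination.extend(new_records)
    if new_records:
        state["cursor"] = max(r[cursor_field] for r in new_records)

# First sync loads everything; the second picks up only newer rows.
dest: list[dict[str, Any]] = []
state: dict[str, Any] = {}
incremental_append([{"id": 1, "updated_at": 1}, {"id": 2, "updated_at": 2}], dest, state)
incremental_append([{"id": 2, "updated_at": 2}, {"id": 3, "updated_at": 3}], dest, state)
# dest now holds ids 1, 2, 3 and state["cursor"] == 3
```

This append-only behavior is why the dbt layers downstream of `raw` are responsible for deduplication and historization, which the Data Vault satellites handle via hashdiffs.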
- Total Purchases: 19 purchases were made, unchanged from the previous month.
- Income: Total income is $714K, an 8.1% decrease from the previous month.
- Returns: There were 8 returns, a 60% increase from the previous month.
- Items Added to Cart: Only 4 items were added to the cart, a 33.3% decrease from the previous month.
- Top Manufacturers:
  - US manufacturers dominate the top 5.
- Product Returns Analysis:
  - The number of returns increased significantly in 2024.
  - Lately, customers return purchases much sooner after buying.
- Product Price Distribution by Year of Manufacture:
  - Products manufactured between 2000 and 2009 are mostly cheaper.
  - Products manufactured before 2000 include both cheap and expensive models (likely related to condition and retro appeal).
- User Account Creation Over Time:
  - There is a general decline in user account creation over the years, with occasional peaks indicating surges of new users.
- Age Distribution of Users by Gender:
  - Users are primarily in the 40-60 age range.
  - Many users prefer not to disclose their gender.
  - There may be an issue with the option allowing users to choose the 'Fluid' gender.
- Users by Nationality and Academic Degree:
  - Our products are more in demand in Latin America and Africa.
  - Most of our customers hold M.Sc. or PhD degrees.
- Address Returns: Investigate the reasons behind the high return rate, especially for products from specific years of manufacture, and implement quality control measures.
- Enhance User Acquisition: Develop targeted campaigns to attract new users and re-engage existing ones, focusing on the peaks in user account creation.
- Leverage Demographic Insights: Tailor marketing and product offerings to the well-educated user base, potentially offering premium or specialized products.