This repository is for project 4 of the Udacity Data Engineer Nanodegree program, submitted in October 2020. For this project, we are working with Sparkify, a (fictitious) music streaming startup. We are tasked with building an ETL pipeline using Spark, which reads data from an S3 bucket in json format (both archives of songs and song listening log data) and transforms it into a set of relational tables (called a "data lake" for the purposes of the exercise), finally saving the results as parquet files, also in S3.
In order to be able to run this project, you need the following:

- AWS credentials (IAM user credentials, stored in the `dl.cfg` file; make sure you don't use any quotation marks)
- Python 3.7 (and an environment with `pyspark` installed)
- An S3 bucket where you can store the parquet files (specified as `output_data` in `main()` in `etl.py`)
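For reference, a minimal `dl.cfg` might look like the sketch below. The section header and key names are assumptions based on the standard project template, and the values are placeholders; note that the values carry no quotation marks:

```ini
[AWS]
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
```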
To actually run the ETL process, all you need to do is run `python3 etl.py` in a bash terminal.
This will run the script that does the following (minimal sketches of the main steps follow the list):

- Creates a `SparkSession`
- Reads the `song_data` json files from Udacity's S3 bucket (`s3://udacity-dend/song_data`) and processes them into a Spark dataframe. Then:
  - Processes the song data into two tables: `songs` and `artists`. These constitute the dimension tables that contain all information about possible songs and artists (the source is the Million Song Dataset)
  - Partitions the `songs` table by `year` and `artist`
  - Writes the `songs` and `artists` tables as parquet files to an output bucket on S3 (a private bucket; you must have permissions to write to it)
- Reads the `log_data` json files from Udacity's S3 bucket (`s3://udacity-dend/log_data`) and processes them into a Spark dataframe. Then:
  - Keeps only records of songs that were started (`page = NextSong`)
  - Processes the user data into a `users` Spark dataframe
  - Processes the unique timestamps (after transformation) into a `time` Spark dataframe
  - Reads the `song_data` json files again and joins them with the `log_data` on song name and song duration. From here, creates a `songplays` dataframe that is essentially a fact table for all the songs that were played on Sparkify
  - Finally, writes all dataframes to S3 as parquet files (the `time` and `songplays` tables are partitioned by `year` and `month`)
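First, a sketch of creating the `SparkSession`. The `hadoop-aws` package (and its version) is an assumption; some such package is needed for Spark to read from and write to S3:

```python
from pyspark.sql import SparkSession

# Minimal sketch: the hadoop-aws version is an assumption and may differ in etl.py.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()
```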
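Next, a sketch of the song-data step. The column names follow the Million Song Dataset schema, `output_data` stands in for your own bucket, and the `songs` partition column is assumed to be `artist_id`:

```python
# Read the song_data json files from Udacity's S3 bucket (glob path is an assumption).
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# songs and artists dimension tables, deduplicated.
songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates()
artists_table = song_df.select(
    "artist_id", "artist_name", "artist_location",
    "artist_latitude", "artist_longitude",
).dropDuplicates()

# Write both tables as parquet; songs is partitioned by year and artist.
songs_table.write.partitionBy("year", "artist_id").parquet(output_data + "songs/", mode="overwrite")
artists_table.write.parquet(output_data + "artists/", mode="overwrite")
```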
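Finally, a sketch of the log-data step that builds the `songplays` fact table. The log field names (`ts`, `song`, `length`, `userId`, and so on) are assumptions based on the standard Sparkify log schema:

```python
from pyspark.sql import functions as F

# Read the log data and keep only records of started songs.
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
log_df = log_df.filter(log_df.page == "NextSong")

# ts is in epoch milliseconds; convert it to a proper timestamp.
log_df = log_df.withColumn("start_time", (log_df.ts / 1000).cast("timestamp"))

# Re-read the song data and join on song name and song duration.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
songplays = (
    log_df.join(song_df, (log_df.song == song_df.title) & (log_df.length == song_df.duration))
    .select("start_time", "userId", "level", "song_id", "artist_id",
            "sessionId", "location", "userAgent")
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
)

# Write partitioned by year and month, as described above.
songplays.write.partitionBy("year", "month").parquet(output_data + "songplays/", mode="overwrite")
```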
All output of the ETL process is logged to the `datalake_udacity.log` file, with a custom logging level set to `INFO`. This helps troubleshoot future issues by logging errors and status updates.
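The logging setup amounts to something like the following sketch (the file name comes from the text above; the format string is an assumption):

```python
import logging

# Log to datalake_udacity.log at INFO level.
logging.basicConfig(
    filename="datalake_udacity.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("ETL process started")
```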