ETL pipeline for Song Play Analysis
-----------------------------------------------
Data Engineer for AI Applications Nanodegree
Bosch AI Talent Accelerator Scholarship Program
Table of Contents
A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
The startup wants to extract their logs
and songs
data from S3, process them using Spark, and load the data back into S3 as a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to. This project builds an ETL pipeline for a data lake hosted on S3. This project includes loading data from S3, processing the data into analytics tables using Spark, and loading the tables back into S3. The project defines dimension and fact tables for a star schema and creates an ETL pipeline that transforms data from JSON files present in an S3 bucket into these tables hosted on S3 for a particular analytic focus.
-
Dataset available in S3 bucket (For demo purpose subset of data also available in
log_data
andsong_data
folders):-
Song Dataset
: Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are file paths to two files in this dataset.song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file looks like.
-
Log Dataset
: These files are also in JSON format and contains user activity data from a music streaming app. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json
And below is an example of what the data in a log file looks like.
-
-
dl.cfg
contains config and access key variables (use your own access and secret key). -
etl.py
lodas data from the S3 bucket into a spark dataframe. extracts appropriate columns from the data and then saves the data into analytics tables in parquet file format on S3. -
README.md
provides details on the project.
-
Python 2.7 or greater
-
Install
PySpark
package in your virtual environment -
AWS Account
-
Set your AWS access and secret key in the
dl.cfg
file[AWS] AWS_ACCESS_KEY_ID = <your aws key> AWS_SECRET_ACCESS_KEY = <your aws secret>
-
Run the
etl.py
script to load the data from an S3 bucket, process it, and then save it back into an S3 bucket. -
You can execute one of the following command inside a python environment to run the
etl.py
$ python etl.py or $ python3 etl.py
Database schema consist five tables with the following fact and dimension tables:
-
Fact Table
songplays
: records in log data associated with song plays filter byNextSong
page value. The table contains songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location and user_agent columns.
-
Dimension Tables
-
users
: stores the user data available in the app. The table contains user_id, first_name, last_name, gender and level columns. -
songs
: contains songs data. The table consist of the following columns song_id, title, artist_id, year and duration. -
artists
: artists in the database. The table contains artist_id, name, location, latitude and longitude columns. -
time
: timestamps of records insongplays
broken down into specific units with the following columns start_time, hour, day, week, month, year and weekday.
-
The ETL pipeline follows the following procedure:
-
Create Spark Session to work with.
-
Process song data.
-
Load song data dataset from public S3 bucket.
-
Extract appropirate columns for song and artist dimension tables.
-
Remove duplicate records if present in the dataset.
-
Write the resulted data in partitions on S3 bucket in parquet file format.
-
-
Process log data.
-
Load log data dataset from public S3 bucket.
-
Extract appropirate columns for user and time dimension tables.
-
Join song and log data dataset and extract appropirate columns songplay fact table.
-
Remove duplicate records if present in the dataset.
-
Finally, write the resulted data in partitions on S3 bucket in parquet file format.
-