Project: Data Lake

Project overview

Sparkify offers a music streaming service through desktop and hand-held devices.
The music streaming startup has grown their user base and song database even more and want to move their data warehouse
to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a
directory with JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes them
using Spark, and loads the data back into S3 as a set of dimensional tables. You'll deploy this Spark process on a cluster using AWS.
This will allow their analytics team to continue finding insights in what songs their users are listening to.

Repository

Files to run project...

[ etl.py ] (Python 3 script):
Reads song and app log data from S3 hosted json files, transforms them to create five different tables, and writes
them to partitioned parquet files in table directories back to S3.
[ dl.cfg ] (config text file):
Contains user AWS credentials, utilised by etl.py script.
[ EMR_notebook.ipynb ] (Jupyter Notebook):
EMR Notebook of etl.py script for running on cluster via AWS console.

Running the project locally

(Prerequisite: ensure Python 3, Pyspark package and Apache Spark are all installed on local machine)

Download project Python script and config file, as listed above, to a local directory.
Add AWS 'Access Key ID & 'Secret Access Key' credentials to details to dl.cfg.
Within etl.py script, set 'output_data' variable to S3 bucket where parquet files are to be written.
Open your system CLI and change directory to where the project files are saved.
```
 C:\users\username>cd C:\users\username\path\to\project
```

Run Python script to create a spark session and process data... etl.py;

 C:\users\username>cd C:\users\username\path\to\project>python3 etl.py

Dataset

2No. datasets are available for ingest and processing, residing in AWS S3.

Song data: s3://udacity-dend/song_data
Log data: s3://udacity-dend/log_data

Song data

Song data resides in JSON format, with each file containing metadata about a specific song, and the song's artist.
Within Sparkify's file storage, song files are partitioned by the first three letters of each song's track ID.

Filepath example...

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

TRAABJL12903CDCF1A.json song file content...

{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}

Log data

User activity logs, collected via the Sparkify music streaming applications, also resides in JSON format.
Each file represents a single day and contains information about each user and their session details for that day. Within Sparkify's file storage, log files are partitioned by the month and year.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

2018-11-12-events.json log file content...

{
"artist":null,
"auth":"Logged In",
"firstName":"Celeste",
"gender":"F",
"itemInSession":0,
"lastName":"Williams",
"length":null,
"level":"free",
"location":"Klamath Falls, OR",
"method":"GET",
"page":"Home",
"registration":1541077528796.0,
"sessionId":438,
"song":null,
"status":200,
"ts":1541990217796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64)<br>
            AppleWebKit\/537.36 (KHTML, like Gecko)<br>
            Chrome\/37.0.2062.103 Safari\/537.36\"",
"userId":"53"
}

Table summary

Each of the five tables are written to parquet files in a separate analytics directory on S3.
Each table has its own folder within the directory.
Songplays table files are partitioned by year and month.
Songs table files are partitioned by year and then artist.
Time table files are partitioned by year and month.

Table Name	Description
songplays	Fact Table; Log data associated with song plays, filtered by user action 'Next Song'.
users	Dimension Table; Registered application users
songs	Dimension Table; Songs in music database
artists	Dimension Table; Artists in music database
time	Dimension Table; Timestamps of songplays records, broken down into specific units

massakera / data_lake-amazon_s3 Goto Github PK

data_lake-amazon_s3's Introduction

Project: Data Lake

Project overview

Repository

Files to run project...

Running the project locally

Dataset

Song data

Log data

Table summary

data_lake-amazon_s3's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent