The data_lake_and_etl_pipeline_on_aws_using_spark's intro from arfatmateen

Data Lake and ETL Pipeline on AWS using Spark

ETL pipeline for Song Play Analysis

-----------------------------------------------

Data Engineer for AI Applications Nanodegree
Bosch AI Talent Accelerator Scholarship Program

Table of Contents

About The Project
- Built With
Datasets
File Structure
How To Run
Database Schema Design
ETL Pipeline
Acknowledgments

About The Project

A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

The startup wants to extract their logs and songs data from S3, process them using Spark, and load the data back into S3 as a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to. This project builds an ETL pipeline for a data lake hosted on S3. This project includes loading data from S3, processing the data into analytics tables using Spark, and loading the tables back into S3. The project defines dimension and fact tables for a star schema and creates an ETL pipeline that transforms data from JSON files present in an S3 bucket into these tables hosted on S3 for a particular analytic focus.

(back to top)

Built With

(back to top)

Datasets

Dataset available in S3 bucket (For demo purpose subset of data also available in log_data and song_data folders):
- Song Dataset: Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are file paths to two files in this dataset.
```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
  And below is an example of what a single song file looks like.
- Log Dataset: These files are also in JSON format and contains user activity data from a music streaming app. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.
```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
  And below is an example of what the data in a log file looks like.

(back to top)

File Structure

dl.cfg contains config and access key variables (use your own access and secret key).
etl.py lodas data from the S3 bucket into a spark dataframe. extracts appropriate columns from the data and then saves the data into analytics tables in parquet file format on S3.
README.md provides details on the project.

(back to top)

How To Run

Prerequisites

Python 2.7 or greater
Install PySpark package in your virtual environment
AWS Account

Set your AWS access and secret key in the dl.cfg file

[AWS]
AWS_ACCESS_KEY_ID = <your aws key>
AWS_SECRET_ACCESS_KEY = <your aws secret>

Terminal commands

Run the etl.py script to load the data from an S3 bucket, process it, and then save it back into an S3 bucket.
You can execute one of the following command inside a python environment to run the etl.py
```
$ python etl.py
or
$ python3 etl.py
```

(back to top)

Database Schema Design

Database schema consist five tables with the following fact and dimension tables:

Fact Table
1. songplays: records in log data associated with song plays filter by NextSong page value. The table contains songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location and user_agent columns.

Dimension Tables
1. users: stores the user data available in the app. The table contains user_id, first_name, last_name, gender and level columns.
2. songs: contains songs data. The table consist of the following columns song_id, title, artist_id, year and duration.
3. artists: artists in the database. The table contains artist_id, name, location, latitude and longitude columns.
4. time: timestamps of records in songplays broken down into specific units with the following columns start_time, hour, day, week, month, year and weekday.

(back to top)

ETL Pipeline

The ETL pipeline follows the following procedure:

Create Spark Session to work with.
Process song data.
- Load song data dataset from public S3 bucket.
- Extract appropirate columns for song and artist dimension tables.
- Remove duplicate records if present in the dataset.
- Write the resulted data in partitions on S3 bucket in parquet file format.
Process log data.
- Load log data dataset from public S3 bucket.
- Extract appropirate columns for user and time dimension tables.
- Join song and log data dataset and extract appropirate columns songplay fact table.
- Remove duplicate records if present in the dataset.
- Finally, write the resulted data in partitions on S3 bucket in parquet file format.

(back to top)

Acknowledgments

(back to top)

arfatmateen / data_lake_and_etl_pipeline_on_aws_using_spark Goto Github PK

data_lake_and_etl_pipeline_on_aws_using_spark's Introduction

Data Lake and ETL Pipeline on AWS using Spark

About The Project

Built With

Datasets

File Structure

How To Run

Prerequisites

Terminal commands

Database Schema Design

ETL Pipeline

Acknowledgments

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent