This project builds an analytics schema for the song-play dataset of the startup Sparkify. It demonstrates processing and analyzing the data with Spark on Amazon EMR.
We will work with two datasets residing in S3 at the following paths:
1. Song Data: s3a://udacity-dend/song_data/*/*/*/*.json
2. Log Data: s3a://udacity-dend/log_data/*/*/*/*.json
The song dataset is a subset of the real Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
The log dataset consists of JSON log files generated by an event simulator based on the songs in the first dataset.
In this project we need to convert the raw timestamp column into a datetime, since the log data stores it as a Unix timestamp in milliseconds rather than a native PySpark datetime.
The data processing covers the two datasets, song_data and log_data.
Steps for the songs, artists, and users tables
- Read the data into a DataFrame using spark.read
- Extract the relevant columns into each table
- Write each table out with 'write.mode' set to overwrite
Steps for extracting the timestamp
- Read the log data with spark.read
- Convert the timestamp using a datetime UDF
- Add the converted timestamp to the table with '.withColumn'
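The conversion step can be sketched as a plain Python function that is then wrapped in a Spark UDF. The function name `get_datetime` and the column names are assumptions; the log data's `ts` field is a Unix timestamp in milliseconds.

```python
from datetime import datetime, timezone

def get_datetime(ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

# Example: a typical log timestamp.
dt = get_datetime(1541105830796)

# In the ETL this would be registered as a UDF and applied with withColumn,
# along the lines of (sketch, names are assumptions):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import TimestampType
#   get_dt = udf(get_datetime, TimestampType())
#   log_df = log_df.withColumn("start_time", get_dt(log_df.ts))
```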
- dl.cfg = configuration file
- etl.py = Python file for running the ETL process
- README.md = instructions and explanation for the project
- workingspace.ipynb = workspace for testing and checking code in a Python notebook
- Launch an EMR cluster
- Add your AWS access key and secret key to the config file
- Run the command `python3 etl.py`
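The config file mentioned above holds the AWS credentials. A minimal sketch of dl.cfg, assuming the common section and key names for this kind of setup (use your own values, never commit real keys):

```ini
[AWS]
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
```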