pyspark_s3_etl's Introduction

Sparkfy Song JSON Data ETL

Summary

    1. Project Overview
    2. Data
    3. Model
    4. Project Structure
    5. Execution

1. Project Overview

This project is an ETL pipeline over JSON data for a startup called Sparkfy. The startup wants a data lake where the JSON data is stored and queried in Parquet format. The ETL uses PySpark in local mode to process the files, reading the input from Amazon S3 and writing the output back to Amazon S3.

2. Data

There are two data directories, "log_data" and "song_data". The log_data directory contains log files of user requests to play songs.

Example (one JSON record per line):
{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"","userId":"39"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\"","userId":"8"}

The song_data directory contains files with song and artist metadata.

Example:
{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
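Each song_data record carries both song and artist fields. As a rough illustration, the sketch below splits the example record into a song row and an artist row using the column names listed in the Model section. It uses plain Python rather than Spark, and the renaming of `artist_name`, `artist_location`, `artist_latitude` and `artist_longitude` to the shorter artist column names is an assumption, as is the helper name `split_song_record`.

```python
import json

# The song_data example record from above.
raw = """{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}"""


def split_song_record(record):
    """Return (song_row, artist_row) built from one song_data record.

    Hypothetical helper: the field-to-column mapping is assumed, not taken from etl.py.
    """
    song = {
        key: record[key]
        for key in ("song_id", "title", "artist_id", "year", "duration")
    }
    artist = {
        "artist_id": record["artist_id"],
        "name": record["artist_name"],
        "location": record["artist_location"],
        "latitude": record["artist_latitude"],
        "longitude": record["artist_longitude"],
    }
    return song, artist


song, artist = split_song_record(json.loads(raw))
print(song)
print(artist)
```

JSON `null` values become `None` in Python, so the missing latitude and longitude survive the split unchanged.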

3. Model

This project builds a star schema with 4 dimension tables and 1 fact table. The schema can be consumed by running an AWS Glue crawler over the output directories and then querying the Parquet files with Amazon Athena.

Fact Table

table: songplays  
description: records in the log data associated with song plays, i.e. records with page NextSong (source: log_data)
columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
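The NextSong filter described above can be sketched as follows. This is plain Python rather than Spark, the three log lines are made-up records trimmed to a few fields, and the running-number `songplay_id` is an assumption, since the actual ID scheme is not documented here. The `song_id` and `artist_id` columns are left out because they are not present in the log records and presumably come from matching against song_data.

```python
import json

# Hypothetical log lines, trimmed to the fields used in this sketch.
log_lines = [
    '{"page": "Home", "userId": "39", "sessionId": 38, "level": "free", "ts": 1541105830796}',
    '{"page": "NextSong", "userId": "8", "sessionId": 139, "level": "free", "ts": 1541106106796}',
    '{"page": "NextSong", "userId": "8", "sessionId": 139, "level": "free", "ts": 1541106352796}',
]


def song_play_events(lines):
    """Yield parsed log records whose page is NextSong."""
    for line in lines:
        event = json.loads(line)
        if event.get("page") == "NextSong":
            yield event


# Map log field names to a subset of the songplays columns.
# start_time is kept as the raw millisecond timestamp in this sketch.
songplays = [
    {
        "songplay_id": index,  # assumed: simple running number
        "start_time": event["ts"],
        "user_id": event["userId"],
        "level": event["level"],
        "session_id": event["sessionId"],
    }
    for index, event in enumerate(song_play_events(log_lines))
]

for row in songplays:
    print(row)
```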

Dimension Tables

1. table: users
   description: users in the app (source: log_data)
   columns: user_id, first_name, last_name, gender, level

2. table: songs
   description: songs in the music database (source: song_data)
   columns: song_id, title, artist_id, year, duration

3. table: artists
   description: artists in the music database (source: song_data)
   columns: artist_id, name, location, latitude, longitude

4. table: time 
   description: timestamps of records in songplays broken down into specific units (source: log_data)
   columns: start_time, hour, day, week, month, year, weekday
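To make the time table concrete, here is a minimal sketch that breaks one `ts` value from the log example into the listed units. It is plain Python rather than Spark and rests on several assumptions: `ts` is treated as milliseconds since the Unix epoch, the result is expressed in UTC, `week` is the ISO week number, and `weekday` follows Python's convention where Monday is 0. The helper name `time_row` is hypothetical, and Spark's own date functions number weekdays differently.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break a millisecond epoch timestamp into the time table columns (UTC)."""
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),  # Monday is 0, Sunday is 6
    }


# ts taken from the first log_data example record.
row = time_row(1541105830796)
print(row)
```

For this timestamp the row describes Thursday, 1 November 2018, at hour 20 UTC.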

4. Project Structure

Project Files
  • etl.py: Retrieves JSON files from one S3 bucket, processes the data with Spark, and writes the output to another S3 bucket.
  • s3_iac_creation.ipynb: Jupyter notebook that helps you configure the output S3 bucket.
  • dl.cfg: Configuration file.
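The contents of dl.cfg are not described here. As an illustration only, the sketch below shows how an INI-style file of this kind could be read with Python's standard `configparser`. Every section name, key name and value in it is a hypothetical placeholder, not the project's actual configuration.

```python
import configparser

# Hypothetical stand-in for dl.cfg; the real section and key names are not documented.
EXAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = example-key-id
AWS_SECRET_ACCESS_KEY = example-secret

[S3]
INPUT_PATH = s3a://example-input-bucket/
OUTPUT_PATH = s3a://example-output-bucket/
"""

config = configparser.ConfigParser()
# A script reading the file from disk could call config.read("dl.cfg") instead.
config.read_string(EXAMPLE_CFG)

access_key_id = config["AWS"]["AWS_ACCESS_KEY_ID"]
output_path = config["S3"]["OUTPUT_PATH"]
print(output_path)
```

Keeping credentials in a separate file like this keeps them out of the script itself; the file should not be committed with real keys.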

5. Execution

You need pyspark installed on your machine. Once it is installed:

  • Create the IAM role and the bucket; the Jupyter notebook in the repository can help with this.
  • Run the ETL script: python etl.py

pyspark_s3_etl's People

Contributors

junioraze
