Giter Club home page Giter Club logo

data_lake_and_etl_pipeline_on_aws_using_spark's Introduction

LinkedIn Generic badge


Logo

Data Lake and ETL Pipeline on AWS using Spark

ETL pipeline for Song Play Analysis

-----------------------------------------------

Data Engineer for AI Applications Nanodegree
Bosch AI Talent Accelerator Scholarship Program


Table of Contents
  1. About The Project
  2. Datasets
  3. File Structure
  4. How To Run
  5. Database Schema Design
  6. ETL Pipeline
  7. Acknowledgments

About The Project

A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

The startup wants to extract their logs and songs data from S3, process them using Spark, and load the data back into S3 as a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to. This project builds an ETL pipeline for a data lake hosted on S3. This project includes loading data from S3, processing the data into analytics tables using Spark, and loading the tables back into S3. The project defines dimension and fact tables for a star schema and creates an ETL pipeline that transforms data from JSON files present in an S3 bucket into these tables hosted on S3 for a particular analytic focus.

(back to top)

Built With

  • AWS
  • Python
  • Spark
  • Jupyter
  • VSCode

(back to top)

Datasets

  • Dataset available in S3 bucket (For demo purpose subset of data also available in log_data and song_data folders):

    • Song Dataset: Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are file paths to two files in this dataset.

      song_data/A/B/C/TRABCEI128F424C983.json
      song_data/A/A/B/TRAABJL12903CDCF1A.json
      

      And below is an example of what a single song file looks like.

      Song Data


    • Log Dataset: These files are also in JSON format and contains user activity data from a music streaming app. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.

      log_data/2018/11/2018-11-12-events.json
      log_data/2018/11/2018-11-13-events.json
      

      And below is an example of what the data in a log file looks like.

      Log Data


(back to top)

File Structure

  1. dl.cfg contains config and access key variables (use your own access and secret key).

  2. etl.py lodas data from the S3 bucket into a spark dataframe. extracts appropriate columns from the data and then saves the data into analytics tables in parquet file format on S3.

  3. README.md provides details on the project.

(back to top)

How To Run

Prerequisites

  • Python 2.7 or greater

  • Install PySpark package in your virtual environment

  • AWS Account

  • Set your AWS access and secret key in the dl.cfg file

    [AWS]
    AWS_ACCESS_KEY_ID = <your aws key>
    AWS_SECRET_ACCESS_KEY = <your aws secret>
    

Terminal commands

  • Run the etl.py script to load the data from an S3 bucket, process it, and then save it back into an S3 bucket.

  • You can execute one of the following command inside a python environment to run the etl.py

    $ python etl.py
    or
    $ python3 etl.py
    

(back to top)

Database Schema Design

Database schema consist five tables with the following fact and dimension tables:

  • Fact Table

    1. songplays: records in log data associated with song plays filter by NextSong page value. The table contains songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location and user_agent columns.

  • Dimension Tables

    1. users: stores the user data available in the app. The table contains user_id, first_name, last_name, gender and level columns.

    2. songs: contains songs data. The table consist of the following columns song_id, title, artist_id, year and duration.

    3. artists: artists in the database. The table contains artist_id, name, location, latitude and longitude columns.

    4. time: timestamps of records in songplays broken down into specific units with the following columns start_time, hour, day, week, month, year and weekday.


    Sparkifydb ERD

(back to top)

ETL Pipeline

The ETL pipeline follows the following procedure:

  • Create Spark Session to work with.

  • Process song data.

    • Load song data dataset from public S3 bucket.

    • Extract appropirate columns for song and artist dimension tables.

    • Remove duplicate records if present in the dataset.

    • Write the resulted data in partitions on S3 bucket in parquet file format.

  • Process log data.

    • Load log data dataset from public S3 bucket.

    • Extract appropirate columns for user and time dimension tables.

    • Join song and log data dataset and extract appropirate columns songplay fact table.

    • Remove duplicate records if present in the dataset.

    • Finally, write the resulted data in partitions on S3 bucket in parquet file format.

(back to top)

Acknowledgments

(back to top)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.