Giter Club home page Giter Club logo

udacity-data-lake's Introduction

Data Lake with Spark

This repository is for project 4 of the Udacity Data Engineer Nanodegree program, submitted in October 2020. For this project, we are working with a (fictitious) music company called Sparkify -- a music streaming startup. We are tasked with building an ETL pipeline using Spark, which reads data from an S3 bucket in json format (both archives of songs and song listening log data) and transforms it into a relational database (called "data lake" for the purposes of the exercise), finally saving the results as parquet files also in S3.

How to run

Prerequisites

In order to be able to run this project, you need the following:

  1. AWS credentials (IAM user credentials, stored in the dl.cfg file; make sure you don't use any quotation marks)
  2. Python 3.7 (and an environment with pyspark installed)
  3. An S3 bucket where you can store the parquet files to (specified as output_data in main() in etl.py)

Run the script

To actually run the ETL process, all you need to do it run

python3 etl.py

in a bash terminal.

This will run the script that does the following:

  1. Creates a SparkSession
  2. Reads the song_data json files from Udacity's S3 bucket (s3://udacity-dend/song_data) and processes them into a Spark dataframe. Then:
    1. Processes the song data into two tables: songs and artists. These constitute the fact tables that contain all information about possible sonds and artists (the source is from the Million Song Dataset)
    2. Partitions the songs table by year and artist.
    3. Write the songs and artists tables as parquet files to an output bucket on S3 (private bucket, you must have permissions to write to it)
  3. Reads the log_data json files from Udacity's S3 bucket (s3://udacity-dend/log_data) and processes them to a Spark dataframe. Then:
    1. Keeps only records of songs that were started (page = NextSong)
    2. Process any user data into a users Spark dataframe
    3. Process the unique timestamps (after transformation) into a time Spark dataframe
    4. Read again the song_data json files, and join it with the log_data on song name and song duration. From here, create a songplays dataframe that is essentially a fact table for all the songs that were played on Sparkify.
    5. Finally, write all dataframes to S3 as parquet files (the time and songplays tables are partitioned by year and month)

Logging

All output of the ETL process will be looged ot the datalake_udacity.log file, with custom logging level to INFO. This helps troubleshoot future issues, by logging errors and status updates.

udacity-data-lake's People

Watchers

Tania Vasilikioti avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.