Giter Club home page Giter Club logo

sparkify-data-pipeline's Introduction

Star Badge Open Source Love

Sparkify - Data Pipelines

Introduction

A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and come to the conclusion that the best tool to achieve this is Apache Airflow.

The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.

Dataset

Note that the actual data (in JSON) used in this project is a subset of original dataset preprocessed by the course. The provided data resides in AWS S3 (publically available).

  1. Song data from Million Song Dataset
  2. User activity data from Event Simulator based on Million Song Dataset

ELT Process

The tool used for scheduling and orchestrationg ELT is Apache Airflow.

'Airflow is a platform to programmatically author, schedule and monitor workflows.'

Source: Apache Foundation

The schema of the ELT is this DAG:

DAG

Sources

The sources are the same than previous projects:

  • Log data: s3://udacity-dend/log_data
  • Song data: s3://udacity-dend/song_data

Destinations

Data is inserted into Amazon Redshift Cluster. The goal is populate an star schema:

  • Fact Table:

    • songplays
  • Dimension Tables

    • users - users in the app
    • songs - songs in music database
    • artists - artists in music database
    • time - timestamps of records in songplays broken down into specific units

By the way we need two staging tables:

  • Stage_events
  • Stage_songs
Prerequisite

Tables must be created in Redshift before executing the DAG workflow. The create tables script can be found in:

create_tables.sql

Project Structure

  • /
    • create_tables.sql - Contains the DDL for all tables used in this projecs
  • dags
    • udac_example_dag.py - The DAG configuration file to run in Airflow
  • plugins
    • operators
      • stage_redshift.py - Operator to read files from S3 and load into Redshift staging tables
      • load_fact.py - Operator to load the fact table in Redshift
      • load_dimension.py - Operator to read from staging tables and load the dimension tables in Redshift
      • data_quality.py - Operator for data quality checking
    • helpers
      • sql_queries - Redshift statements used in the DAG

Data Quality Checks

In order to ensure the tables were loaded, a data quality checking is performed to count the total records each table has. If a table has no rows then the workflow will fail and throw an error message.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.