Giter Club home page Giter Club logo

data_lake-amazon_s3's Introduction

Project: Data Lake

Project overview

Sparkify offers a music streaming service through desktop and hand-held devices.
The music streaming startup has grown their user base and song database even more and want to move their data warehouse
to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a
directory with JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes them
using Spark, and loads the data back into S3 as a set of dimensional tables. You'll deploy this Spark process on a cluster using AWS.
This will allow their analytics team to continue finding insights in what songs their users are listening to.

Repository

Files to run project...

  • [ etl.py ] (Python 3 script):
    Reads song and app log data from S3 hosted json files, transforms them to create five different tables, and writes
    them to partitioned parquet files in table directories back to S3.

  • [ dl.cfg ] (config text file):
    Contains user AWS credentials, utilised by etl.py script.

  • [ EMR_notebook.ipynb ] (Jupyter Notebook):
    EMR Notebook of etl.py script for running on cluster via AWS console.

Running the project locally

(Prerequisite: ensure Python 3, Pyspark package and Apache Spark are all installed on local machine)

  1. Download project Python script and config file, as listed above, to a local directory.

  2. Add AWS 'Access Key ID & 'Secret Access Key' credentials to details to dl.cfg.

  3. Within etl.py script, set 'output_data' variable to S3 bucket where parquet files are to be written.

  4. Open your system CLI and change directory to where the project files are saved.

     C:\users\username>cd C:\users\username\path\to\project
    
  5. Run Python script to create a spark session and process data... etl.py;

     C:\users\username>cd C:\users\username\path\to\project>python3 etl.py
    

Dataset

2No. datasets are available for ingest and processing, residing in AWS S3.

Song data: s3://udacity-dend/song_data
Log data: s3://udacity-dend/log_data

Song data

Song data resides in JSON format, with each file containing metadata about a specific song, and the song's artist.
Within Sparkify's file storage, song files are partitioned by the first three letters of each song's track ID.

Filepath example...

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

TRAABJL12903CDCF1A.json song file content...

{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}

Log data

User activity logs, collected via the Sparkify music streaming applications, also resides in JSON format.
Each file represents a single day and contains information about each user and their session details for that day. Within Sparkify's file storage, log files are partitioned by the month and year.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

2018-11-12-events.json log file content...

{
"artist":null,
"auth":"Logged In",
"firstName":"Celeste",
"gender":"F",
"itemInSession":0,
"lastName":"Williams",
"length":null,
"level":"free",
"location":"Klamath Falls, OR",
"method":"GET",
"page":"Home",
"registration":1541077528796.0,
"sessionId":438,
"song":null,
"status":200,
"ts":1541990217796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64)<br>
            AppleWebKit\/537.36 (KHTML, like Gecko)<br>
            Chrome\/37.0.2062.103 Safari\/537.36\"",
"userId":"53"
}

Table summary

Each of the five tables are written to parquet files in a separate analytics directory on S3.
Each table has its own folder within the directory.
Songplays table files are partitioned by year and month.
Songs table files are partitioned by year and then artist.
Time table files are partitioned by year and month.

Table Name Description
songplays Fact Table; Log data associated with song plays, filtered by user action 'Next Song'.
users Dimension Table; Registered application users
songs Dimension Table; Songs in music database
artists Dimension Table; Artists in music database
time Dimension Table; Timestamps of songplays records, broken down into specific units

data_lake-amazon_s3's People

Contributors

mrbena avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.