This project builds an analytics schema for the song-play dataset of the startup Sparkify. It demonstrates processing and analyzing the data with Spark on Amazon EMR.
We will work with two datasets residing in S3 at the following paths:
1. Song Data: s3a://udacity-dend/song_data/*/*/*/*.json
2. Log Data: s3a://udacity-dend/log_data/*/*/*/*.json
The song dataset is a subset of the real Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
The log dataset consists of JSON log files generated by an event simulator based on the songs in the first dataset.
In this project we need to convert the raw timestamp column into a datetime, since the log data stores it as a Unix timestamp in milliseconds rather than a native PySpark datetime.
The data processing covers the two datasets, song_data and log_data.
Steps for the songs, artists, and users tables
- Read the data into a DataFrame using spark.read
- Extract the relevant columns into each table
- Write each table out with 'write.mode' set to overwrite
Steps for extracting the timestamp
- Read the log data with spark.read
- Convert the timestamp using a datetime UDF
- Add the converted timestamp to the table with '.withColumn'
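The conversion step can be sketched as a plain Python function that is then wrapped in a Spark UDF. The function name `get_datetime` and the column names are assumptions; the log data's `ts` field is a Unix timestamp in milliseconds.

```python
from datetime import datetime, timezone

def get_datetime(ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

# Example: a typical log timestamp.
dt = get_datetime(1541105830796)

# In the ETL this would be registered as a UDF and applied with withColumn,
# along the lines of (sketch, names are assumptions):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import TimestampType
#   get_dt = udf(get_datetime, TimestampType())
#   log_df = log_df.withColumn("start_time", get_dt(log_df.ts))
```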
- dl.cfg = configuration file
- etl.py = Python file for running the ETL process
- README.md = instructions and explanation for the project
- workingspace.ipynb = workspace for testing and checking code in a Python notebook
- Launch an EMR cluster
- Add your AWS access key and secret key to the config file
- Run the command `python3 etl.py`
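The config file mentioned above holds the AWS credentials. A minimal sketch of dl.cfg, assuming the common section and key names for this kind of setup (use your own values, never commit real keys):

```ini
[AWS]
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
```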