This repository contains the final project for a Big Data course: an analysis of a large movie dataset.
- Data Source: The Movies Dataset from Kaggle (more than 2 GB)
- Project Objective: Find which characteristics are common to financially successful movies
- Framework: Data processing and analysis using PySpark, visualization using Tableau; feature analysis based on genre, popularity, runtime, and rating
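
The feature analysis described above can be illustrated with a small sketch. It is written in plain Python with the standard library only, as a stand-in for the PySpark aggregations used in the project, so it runs without a Spark installation. The sample rows, the field names, and the use of the revenue-to-budget ratio as a measure of financial success are assumptions made for illustration; they are not taken from the project code.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical sample rows; the real dataset has many more rows and columns.
MOVIES = [
    {"title": "A", "genre": "Action", "runtime": 120, "budget": 100.0, "revenue": 300.0},
    {"title": "B", "genre": "Action", "runtime": 110, "budget": 50.0, "revenue": 100.0},
    {"title": "C", "genre": "Drama", "runtime": 95, "budget": 20.0, "revenue": 30.0},
    {"title": "D", "genre": "Drama", "runtime": 130, "budget": 0.0, "revenue": 10.0},
]


def return_ratio(movie):
    """Revenue divided by budget, or None when the budget is missing or zero."""
    if movie["budget"] <= 0:
        return None
    return movie["revenue"] / movie["budget"]


def mean_return_by_genre(movies):
    """Average revenue/budget ratio per genre, skipping movies without a usable budget."""
    ratios = defaultdict(list)
    for movie in movies:
        ratio = return_ratio(movie)
        if ratio is not None:
            ratios[movie["genre"]].append(ratio)
    return {genre: mean(values) for genre, values in ratios.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    print(mean_return_by_genre(MOVIES))
    # Correlation between a numeric feature (runtime) and revenue.
    runtimes = [m["runtime"] for m in MOVIES]
    revenues = [m["revenue"] for m in MOVIES]
    print(pearson(runtimes, revenues))
```

The same two steps (a per-genre average and a correlation between a numeric feature and revenue) map onto grouped aggregations in PySpark. Rows with a zero budget are skipped here because the ratio is undefined for them.
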
Course Description: An introduction to the storage, retrieval, analysis, and display of data sets so large and complex that traditional data processing and analysis applications cannot readily be used. Topics include big data management, data architecture for hosting big data, big data retrieval languages, parallel computing methods, big data analytical methods, and data visualization.
Software: Depending on the technology on hand, we might use Bash, Cloudera (through an AWS server), Python, Tableau, and AWS tools. Bash tools, Python, and Cloudera are available, and instructions are provided.
Course Instructor: Hossein Amini