This project was created on 2021-02-09.
This project explores the TMDb movie dataset. TMDb is a popular, user-editable database for movies and TV shows. The goal is to identify the features that best predict a movie's return on investment (ROI).
This work was completed as an assignment for the Udacity Data Analyst Nanodegree.
Explore TMDb Movie Dataset.ipynb
: The Jupyter notebook for wrangling and analyzing the data.
tmdb-movies.csv
: The movie dataset.
One more file, MovieData.csv, is not included in the repository. It was used to fill in the missing revenue values and was provided by The Numbers, a company that tracks financial movie data. Because it is a proprietary dataset, I could not share it on GitHub. However, you can obtain it for free by filling out their form; they will email you a Dropbox link and a password to download it.
To run the notebook, you will need python, numpy, pandas, matplotlib, and seaborn installed. The notebook also uses urllib and json, which ship with Python's standard library and need no separate installation. You can install these libraries individually or with Anaconda, a Python distribution with a focus on data science. If you're interested in Anaconda, you can follow their installation guide.
You will also need an API key for the OMDb API, a web service that offers a wealth of information about movies. This project uses it to fill missing values. You can get a free key by creating an account; the key is sent to the email address you used for registration.
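As a rough illustration of how a lookup against the OMDb API might work with urllib and json, here is a minimal sketch. It is not the notebook's actual code: the helper names (`build_omdb_url`, `parse_runtime`, `fetch_movie`) are hypothetical, and the `apikey`, `t`, and `y` query parameters and the `Runtime` response field are assumptions based on OMDb's public documentation. Runtime is used only as an example of a field that could be filled.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint, taken from OMDb's public documentation.
OMDB_URL = "https://www.omdbapi.com/"


def build_omdb_url(api_key, title, year=None):
    """Build a title-lookup URL (hypothetical helper, not from the notebook)."""
    params = {"apikey": api_key, "t": title}
    if year is not None:
        params["y"] = str(year)
    return OMDB_URL + "?" + urlencode(params)


def parse_runtime(payload):
    """Return the runtime in minutes, or None if it is missing.

    Assumes the response stores runtime as text such as "136 min".
    """
    text = payload.get("Runtime", "N/A")
    minutes = text.split(" ")[0]
    return int(minutes) if minutes.isdigit() else None


def fetch_movie(api_key, title, year=None):
    """Query the API and decode the JSON response (requires network access)."""
    url = build_omdb_url(api_key, title, year)
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a valid key, `parse_runtime(fetch_movie(key, "The Matrix", 1999))` would return the runtime as an integer if the service reports one, and `None` otherwise.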
This dataset contains information about 10866 movies and has 21 columns. The missing values were filled using both the OMDb API and MovieData.csv. The columns selected for analysis are: id, popularity, budget, revenue, original_title, runtime, genres, release_date, vote_count, vote_average, release_year, and roi.
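The roi column does not come with a definition here, so the sketch below assumes the common formula (revenue - budget) / budget. The `add_roi` helper is hypothetical, the notebook may compute the value differently, and the sample rows are toy values rather than data from tmdb-movies.csv.

```python
import pandas as pd


def add_roi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an roi column (assumed formula, see above)."""
    out = df.copy()
    # Treat a non-positive budget as unknown so the division yields NaN, not inf.
    budget = out["budget"].where(out["budget"] > 0)
    out["roi"] = (out["revenue"] - budget) / budget
    return out


# Toy rows for illustration only.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 50, 0],
    "revenue": [300, 25, 10],
})
result = add_roi(sample)
print(result[["original_title", "roi"]])
```

Masking non-positive budgets is a design choice in this sketch: it leaves the ROI undefined for those rows instead of producing infinite values that would distort later plots or correlations.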
Thanks to the One Million Arab Coders' initiative for offering me a chance to learn data science.
Thanks to Udacity for their great content.
Thanks to TMDb for providing their data to students.
Thanks to OMDb API for their free API keys.
Thanks to The Numbers for sharing their data.