imdb's Introduction

Marketing, Movie Hype and Movie Ratings

We've all got used to looking at average review scores on rating sites, whether for restaurants, wines or movies. Sites like imdb.com are sitting on a huge dataset on human preferences. Lots of work has been done in the field of Recommender Systems which looks at how we can make predictions about what people will like given ratings they have already provided, but I haven't seen anything great which looks at these data from a cognitive science angle, which tries to form a theory-based understanding of why individuals give the ratings they do, and how things like expectations and previous experience affect ratings.

If you know about such work, please let me know. As a way of exploring this kind of data, I thought I'd look at the IMDB movie ratings and try to answer some simple questions:

  • How long after a movie has been released does the average rating stabilise?
  • Does the (marketing) budget of the movie "boost" average ratings in the weeks immediately after release?

Summer movies

I went to the-numbers.com and got data on 2023 releases and their production budgets. I took the top 20 movies, by budget, with release dates after 12 April 2023 (when I started collecting rating data). Here are the top six:

released    title                                         budget (USD)
2023-05-19  Fast X                                        340000000
2023-06-30  Indiana Jones and the Dial of Destiny         300000000
2023-07-12  Mission: Impossible Dead Reckoning Part One   290000000
2023-05-26  The Little Mermaid                            250000000
2023-05-05  Guardians of the Galaxy Vol 3                 250000000
2023-06-16  The Flash                                     200000000
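
The selection step described above (filter by release date, then rank by budget) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `top_by_budget` and the tuple layout are hypothetical, and the rows are the six shown in the table rather than the full the-numbers.com list.

```python
from datetime import date

# Rows copied from the table above: (release date, title, production budget in USD)
releases = [
    (date(2023, 5, 19), "Fast X", 340_000_000),
    (date(2023, 6, 30), "Indiana Jones and the Dial of Destiny", 300_000_000),
    (date(2023, 7, 12), "Mission: Impossible Dead Reckoning Part One", 290_000_000),
    (date(2023, 5, 26), "The Little Mermaid", 250_000_000),
    (date(2023, 5, 5), "Guardians of the Galaxy Vol 3", 250_000_000),
    (date(2023, 6, 16), "The Flash", 200_000_000),
]


def top_by_budget(rows, released_after, n):
    """Return the n biggest-budget rows released strictly after released_after."""
    eligible = [row for row in rows if row[0] > released_after]
    # sorted() is stable, so rows with equal budgets keep their input order
    return sorted(eligible, key=lambda row: row[2], reverse=True)[:n]


top = top_by_budget(releases, date(2023, 4, 12), n=3)
for released, title, budget in top:
    print(released.isoformat(), title, budget)
```

With the full list of 2023 releases as input, `n=20` would correspond to the selection used here.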

Average rating vs time

Next I plotted the average rating each movie had the first time it appeared in the IMDB data, and on every day since. Here's the plot:

From this you can see two things:

  • Movie ratings fluctuate over time, sometimes bouncing up, but mostly dropping down
  • Ratings tend to stabilise after about 90 days
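
The "about 90 days" figure was read off the plot. One way to put a number on it is sketched below, assuming a hypothetical definition of "stable": the first day from which every later rating stays within a chosen tolerance of the final rating. The function name, the tolerance of 0.1 and the example series are all assumptions made for illustration, not part of the original analysis.

```python
def stabilisation_day(series, tolerance=0.1):
    """Return the first day from which every later rating stays within
    `tolerance` of the final rating.

    `series` is a list of (days_since_release, average_rating) pairs,
    sorted by day.
    """
    final_rating = series[-1][1]
    stable_from = series[-1][0]
    # Walk backwards from the latest observation until a rating falls
    # outside the tolerance band around the final value.
    for day, rating in reversed(series):
        if abs(rating - final_rating) > tolerance:
            break
        stable_from = day
    return stable_from


# Made-up illustrative series, not real IMDB data
example = [(0, 7.9), (10, 7.6), (30, 7.3), (60, 7.15), (90, 7.1), (120, 7.1)]
print(stabilisation_day(example))
```

A tighter tolerance gives a later stabilisation day, so any single number depends on that choice.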

I can think of two factors which might cause movie ratings to drop down. One is that marketing and other hype associated with the movie makes people think that the movie is better than it is. The other is that the people who tend to like the movie the most - the kind of people who think they'll particularly like this movie - are the ones who go to see it first. Over time, the movie gathers more ratings from people less keen to see it, so the ratings tend to drop.

Both factors could be true, but notice that marketing can push ratings in opposite directions. If marketing confuses you about what you actually enjoyed after you've experienced it, the boost in movie ratings will be higher for more heavily marketed movies. If marketing confuses you about what you are going to enjoy, it will persuade people who won't enjoy a movie to see it sooner, and so the movie ratings will be lower than they would be otherwise.

Budgets and marketing hype in ratings

I calculate the "boost" for each movie - simply the difference between the highest average rating a movie has received and its most recent rating (the most recent rating combines all votes ever received, so it is assumed to be the most accurate). I can then plot this boost against the budget of each movie (assuming that a consistent proportion of each budget is spent on marketing, or at least that bigger budget movies have bigger marketing budgets).

From this plot you can see that

  • Most movies get a boost of around 0.5
  • Movies with fewer ratings are more likely to have unstable averages (and larger boosts)
  • There is no obvious relation between movie budget and boost
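
The boost calculation described above (highest average rating seen, minus the most recent one) can be written in a couple of lines. This is a minimal sketch, not the code from the repo: the function name `rating_boost` is hypothetical and the rating histories are made-up values for illustration.

```python
def rating_boost(ratings):
    """Boost = highest average rating ever observed minus the most recent one.

    `ratings` holds one movie's daily average ratings in chronological order.
    """
    return max(ratings) - ratings[-1]


# Made-up illustrative values, not real IMDB data
history = {
    "Movie A": [7.8, 7.9, 7.6, 7.4, 7.4],
    "Movie B": [6.1, 6.0, 6.0, 6.0],
}
# Round to 2 decimals to hide floating-point noise in the subtraction
boosts = {title: round(rating_boost(r), 2) for title, r in history.items()}
print(boosts)
```

By construction the boost is never negative: a movie whose rating has only ever risen has a boost of zero.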

Conclusion

The lack of relation between budget and initial boost in ratings could be for a few reasons. It could be that, as mentioned above, marketing can pull average ratings in two directions and these effects cancel out; or it could be that all of these movies have large enough marketing budgets that you can't see any difference between them in the effect of marketing; or it could be that marketing has no effect; or it could be that the marketing budget varies independently from the production budget.

As a movie consumer, it is possible to draw one conclusion: if you are looking at IMDB within 90 days of something being released, you should mentally subtract ~0.5 from the average rating before using it to decide whether to see it or not.

I'm thinking about what else to do with these data, so feedback is welcome, by email or to @tomstafford

I'm also interested in other large datasets of rating/preference data, so if you work on these please get in touch.


Colophon

This project uses a cron job to grab the daily data, and python (including pandas library) for subsequent data munging. Visualisations done using matplotlib.

IMDB data is available from https://developer.imdb.com/non-commercial-datasets/ under these license terms

Information courtesy of IMDb (https://www.imdb.com). Used with permission.

This project, including plots, is CC-BY Tom Stafford

Code and a few more details are available in the repo (but not the data, since it is not mine for onwards sharing)

Director's Cut

Did you notice how I used an occlusion cue in the first line plot? This takes advantage of our visual system's natural expertise in perceiving depth to make overlapping lines appear less confused. The code is really hacky, you just plot a slightly thicker white line before you plot the main (coloured) data line:

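# Draw a slightly thicker white line first, so it masks any lines underneath
plt.plot(df['days'], df['averageRating'], '-', lw=3, color='white')
# Then draw the coloured data line on top of the white one
plt.plot(df['days'], df['averageRating'], '-', lw=2, color=moviecolor)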

Without this thicker white line you get this plot. Compare it to the version above to see the difference it makes to visual confusion:

Update 2023-08-30

I realised I could further improve the plot by having the lines layered so earlier releases/longer lines were at the back. Bonus: Barbie is now in pink.
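
One way to get this layering is to give each movie a drawing order based on its release date. The sketch below is an assumption about how it could be done, not the code from the repo: `layer_order` is a hypothetical helper, and it relies on matplotlib's `zorder` keyword, where lower values are drawn further back.

```python
from datetime import date


def layer_order(release_dates):
    """Map each title to a zorder: the earliest release gets the lowest
    value, so it is drawn furthest back."""
    ordered = sorted(release_dates, key=release_dates.get)
    return {title: z for z, title in enumerate(ordered, start=1)}


# Release dates taken from the table earlier in this README
release_dates = {
    "Fast X": date(2023, 5, 19),
    "The Flash": date(2023, 6, 16),
    "Guardians of the Galaxy Vol 3": date(2023, 5, 5),
}
zorders = layer_order(release_dates)
print(zorders)

# Each value could then be passed to matplotlib, for example:
# plt.plot(days, ratings, '-', lw=2, color=moviecolor, zorder=zorders[title])
```

To keep the occlusion cue working, the white line and the coloured line for a movie would share the same zorder, with the white one plotted first.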

Old plot, for comparison:

imdb's Issues

Suggestion: An alternative way to obtain historical imdb ratings, via WaybackMachine

This is a great, delightful result!

I'm outlining a suggestion to obtain the data. Please let me know if this has been done before, if there's a better way or there are issues with the outlined suggestion.

General

Purpose: This helps, to some extent, with obtaining the IMDB ratings data, for those who may want to reproduce the results.

Approach: One can use the Internet Archive Wayback Machine to download snapshots of the files.

Potential issues: Not every day is guaranteed to have a snapshot.

Detailed approach

Requirements

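CLI: aria2 (curl or wget would also work for the download step)

Python environment: the wayback, tqdm and pandas packages, which get-links.py imports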

Organization

The scripts below also assume that a data/raw folder exists. After running them, it will contain:

  • links.csv: CSV file with links to Wayback Machine snapshots, and corresponding output files
  • links.aria: the above file converted into an aria2 input file
  • all the snapshots, named with the format YYYYMMDD-HHMMSS-title.ratings.tsv.gz, in which YYYYMMDD-HHMMSS comes from the Wayback snapshot's timestamp.

Steps

Step 1: Obtain snapshot links

Below is the content of get-links.py

# get-links.py

import os
import wayback
import datetime
from tqdm import tqdm
import pandas as pd

URL = "https://datasets.imdbws.com/title.ratings.tsv.gz"     # URL to find snapshot 
DATA_DIRECTORY = 'data/raw'                                  # where data will be stored
FILENAME_FORMAT = os.path.join(
    DATA_DIRECTORY, 
    '%Y%m%d-%H%M%S-title.ratings.tsv.gz'
)

start_date = datetime.datetime(2023,4,12)                   # from date to find snapshot 
today_date = datetime.datetime.now()
num_days = (today_date - start_date).days

client = wayback.WaybackClient()
links = []

pbar = tqdm(total=num_days, desc='Expected progress')

for r in client.search(URL, from_date=start_date):
    links.append(dict(
        url = r.raw_url, 
        file = r.timestamp.strftime(FILENAME_FORMAT)
    ))
    pbar.update(1)

pbar.close()

pd.DataFrame(links).to_csv(
    os.path.join(DATA_DIRECTORY, 'links.csv'), 
    index=False
)

Run with python get-links.py (pretty quick)

This will produce data/raw/links.csv with the following content

url,file
https://web.archive.org/web/20230412003412id_/https://datasets.imdbws.com/title.ratings.tsv.gz,data/raw/20230412-003412-title.ratings.tsv.gz
https://web.archive.org/web/20230413003404id_/https://datasets.imdbws.com/title.ratings.tsv.gz,data/raw/20230413-003404-title.ratings.tsv.gz
...
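
Since not every day is guaranteed to have a snapshot, it may be worth checking for gaps before downloading. The sketch below assumes the 14-digit Wayback timestamps have already been pulled out of the URLs in links.csv; the function name `missing_days` is hypothetical.

```python
from datetime import date, datetime, timedelta


def missing_days(timestamps, start, end):
    """Return the dates in [start, end] that have no snapshot."""
    have = {datetime.strptime(ts, "%Y%m%d%H%M%S").date() for ts in timestamps}
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in all_days if d not in have]


# Timestamps in the Wayback URL format. 14 April is left out on purpose
# to show what a gap looks like; it is not actually missing.
stamps = ["20230412003412", "20230413003404", "20230415003407"]
gaps = missing_days(stamps, date(2023, 4, 12), date(2023, 4, 15))
print(gaps)
```

Days listed in the result would need to be interpolated or simply left as gaps in any time series built from the snapshots.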

Step 2: Convert to aria2 input format

At this point one can use aria2, curl, wget to download.

To use aria2, the file links.csv can be converted to a text file formatted for aria2 input:

Run the following code:

sed 1,1d data/raw/links.csv |\
  sed -E 's/([^,]*),(.*)/\1\n  out=\2/' \
  > data/raw/links.aria 

The content of links.aria will look like this:

https://web.archive.org/web/20230412003412id_/https://datasets.imdbws.com/title.ratings.tsv.gz
  out=data/raw/20230412-003412-title.ratings.tsv.gz
https://web.archive.org/web/20230413003404id_/https://datasets.imdbws.com/title.ratings.tsv.gz
  out=data/raw/20230413-003404-title.ratings.tsv.gz
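
The sed command above relies on `\n` in the replacement being turned into a newline, which GNU sed does but other sed implementations may not. A portable alternative is to do the same conversion in Python. This is a sketch using only the standard library; the function name `csv_to_aria2` is hypothetical.

```python
import csv
import io


def csv_to_aria2(csv_text):
    """Convert 'url,file' CSV text into aria2 input-file text.

    Each URL goes on its own line, followed by an indented out= option
    naming the file it should be saved as.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(row["url"])
        lines.append(f"  out={row['file']}")
    return "\n".join(lines) + "\n"


sample = (
    "url,file\n"
    "https://web.archive.org/web/20230412003412id_/"
    "https://datasets.imdbws.com/title.ratings.tsv.gz,"
    "data/raw/20230412-003412-title.ratings.tsv.gz\n"
)
print(csv_to_aria2(sample), end="")
```

To use it on the real file, read data/raw/links.csv into a string, pass it to the function, and write the result to data/raw/links.aria.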

Step 3: Download

# download files from `links.aria`
aria2c -c -i data/raw/links.aria -j 4

# optional
aria2c https://datasets.imdbws.com/title.basics.tsv.gz -d data/raw

Here's how data/raw will look:

data/raw
├── 20230412-003412-title.ratings.tsv.gz
├── 20230413-003404-title.ratings.tsv.gz
├── 20230414-003406-title.ratings.tsv.gz
├── 20230415-003407-title.ratings.tsv.gz
├── 20230416-003534-title.ratings.tsv.gz
├── 20230417-003525-title.ratings.tsv.gz
├── 20230418-003405-title.ratings.tsv.gz
├── 20230419-003405-title.ratings.tsv.gz
├── 20230420-003424-title.ratings.tsv.gz
...

├── links.aria
├── links.csv
└── title.basics.tsv.gz
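
Once the snapshots are downloaded, each one can be turned into a data point for a given title. The sketch below is one possible approach using only the standard library. It assumes the file name format described above and the tconst, averageRating and numVotes columns of title.ratings.tsv.gz; the function names are hypothetical, and the demo writes a tiny made-up file rather than reading real IMDB data.

```python
import csv
import gzip
import os
import tempfile
from datetime import datetime


def snapshot_time(path):
    """Parse the Wayback timestamp from a YYYYMMDD-HHMMSS-title.ratings.tsv.gz name."""
    name = os.path.basename(path)
    return datetime.strptime(name[:15], "%Y%m%d-%H%M%S")


def rating_in_snapshot(path, tconst):
    """Return (averageRating, numVotes) for one title, or None if it is absent."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["tconst"] == tconst:
                return float(row["averageRating"]), int(row["numVotes"])
    return None


# Self-contained demo on a tiny made-up file (values are not real IMDB data)
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "20230412-003412-title.ratings.tsv.gz")
    with gzip.open(demo, "wt", encoding="utf-8") as handle:
        handle.write("tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2000\n")
    when = snapshot_time(demo)
    found = rating_in_snapshot(demo, "tt0000001")
    missing = rating_in_snapshot(demo, "tt9999999")
    print(when, found, missing)
```

Looping over every file in data/raw and collecting the (time, rating) pairs for one title gives the kind of time series plotted in the README.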
