Marketing, Movie Hype and Movie Ratings

We've all got used to looking at average review scores on rating sites, whether for restaurants, wines or movies. Sites like imdb.com are sitting on a huge dataset on human preferences. Lots of work has been done in the field of Recommender Systems which looks at how we can make predictions about what people will like given ratings they have already provided, but I haven't seen anything great which looks at these data from a cognitive science angle, which tries to form a theory-based understanding of why individuals give the ratings they do, and how things like expectations and previous experience affect ratings.

If you know about such work, please let me know. As a way of exploring this kind of data, I thought I'd look at the IMDB movie ratings and try to answer some simple questions:

How long after a movie has been released does the average rating stabilise?
Does the (marketing) budget of the movie "boost" average ratings in the weeks immediately after release?

Summer movies

I went to the-numbers.com and got data on 2023 releases and their production budgets. I took the top 20 movies, by budget, with release dates after 12 April 2023 (when I started collecting rating data). Here are the top six:

released	title	budget
2023-05-19	Fast X	340000000
2023-06-30	Indiana Jones and the Dial of Destiny	300000000
2023-07-12	Mission: Impossible Dead Reckoning Part One	290000000
2023-05-26	The Little Mermaid	250000000
2023-05-05	Guardians of the Galaxy Vol 3	250000000
2023-06-16	The Flash	200000000

Average rating vs time

Next I plotted the average rating each movie had on the day it first appeared in the IMDB data, and on every day since. Here's the plot:

From this you can see two things:

Movie ratings fluctuate over time, sometimes bouncing up, but mostly dropping down
Ratings tend to stabilise after about 90 days
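
To put every movie on the same time axis, each rating needs to be expressed as days since the movie first appeared in the data. The following is a minimal sketch of that step, not the code from the repo. It assumes a long-format table with one row per movie per daily snapshot; `tconst` and `averageRating` are column names from the IMDB ratings file, while the `date` column, the function name and the example numbers are assumptions made up for illustration.

```python
import pandas as pd


def add_days_since_first_seen(ratings: pd.DataFrame) -> pd.DataFrame:
    """Add a 'days' column: days elapsed since each title's first snapshot."""
    out = ratings.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Earliest snapshot date per title, broadcast back onto every row
    first_seen = out.groupby("tconst")["date"].transform("min")
    out["days"] = (out["date"] - first_seen).dt.days
    return out.sort_values(["tconst", "days"]).reset_index(drop=True)


# Made-up numbers, for illustration only
example = pd.DataFrame({
    "tconst": ["tt_a", "tt_a", "tt_a", "tt_b", "tt_b"],
    "date": ["2023-05-19", "2023-05-20", "2023-05-22", "2023-06-30", "2023-07-01"],
    "averageRating": [6.4, 6.2, 6.0, 7.1, 6.9],
})
timeline = add_days_since_first_seen(example)
print(timeline[["tconst", "days", "averageRating"]])
```

A missing daily snapshot simply leaves a gap in `days` (as between day 1 and day 3 for the first title here), which a line plot will bridge.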

I can think of two factors which might cause movie ratings to drop. One is that marketing and other hype associated with the movie makes people think that the movie is better than it is. The other is that the people who tend to like the movie the most (the kind of people who expect to particularly like this movie) are the ones who go to see it first. Over time, the movie gathers more ratings from people less keen to see it, so the ratings tend to drop. Both factors could be true, but notice that marketing would push ratings in opposite directions under the two accounts. If marketing confuses you about what you actually enjoyed after you've experienced it, the boost in movie ratings will be higher for more heavily marketed movies. If marketing confuses you about what you are going to enjoy, it will persuade people who won't enjoy a movie to see it sooner, and so the early ratings will be lower than they would be otherwise.

Budgets and marketing hype in ratings

I calculate the "boost" for each movie: simply the difference between the highest average rating a movie has received and the most recent rating (this is the rating which combines all votes ever received, so it is assumed to be the most accurate). I can then plot this boost against the budget of each movie (assuming that a consistent proportion of each budget is spent on marketing, or at least that bigger budget movies have bigger marketing budgets).

From this plot you can see that

Most movies get a boost of around 0.5
Movies with fewer ratings are more likely to have unstable averages (and larger boosts)
There is no obvious relation between movie budget and boost
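
The boost calculation described above (highest average rating seen minus the most recent one) can be sketched as follows. This is an illustrative sketch, not the repo's code: it assumes a table with `tconst`, `days` and `averageRating` columns, and the function name and example numbers are made up.

```python
import pandas as pd


def rating_boost(timeline: pd.DataFrame) -> pd.DataFrame:
    """Per title: highest average rating seen minus the most recent one."""
    # Sort so that the last row within each title is the latest snapshot
    ordered = timeline.sort_values(["tconst", "days"])
    grouped = ordered.groupby("tconst")["averageRating"]
    summary = pd.DataFrame({
        "peak": grouped.max(),
        "latest": grouped.last(),
    })
    # IMDB ratings have one decimal place, so round away float noise
    summary["boost"] = (summary["peak"] - summary["latest"]).round(1)
    return summary


# Made-up numbers, for illustration only
example = pd.DataFrame({
    "tconst": ["tt_a"] * 3 + ["tt_b"] * 3,
    "days": [0, 1, 2, 0, 1, 2],
    "averageRating": [6.5, 6.7, 6.1, 7.0, 6.8, 6.8],
})
boosts = rating_boost(example)
print(boosts)
```

Because the boost is measured from the peak rather than from the first day, it is never negative; a movie whose rating only ever climbs gets a boost of zero.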

Conclusion

The lack of relation between budget and initial boost in ratings could be for a few reasons. It could be that, as mentioned above, marketing can pull average ratings in two directions and these effects cancel out; or it could be that all of these movies have large enough marketing budgets that you can't see any difference between them in the effect of marketing; or it could be that marketing has no effect; or it could be that the marketing budget varies independently from the production budget.

As a movie consumer, it is possible to draw one conclusion: if you are looking at IMDB within 90 days of something being released, you should mentally subtract ~0.5 from the average rating before using it to decide whether to see it or not.

I'm thinking about what else to do with these data, so feedback is welcome, by email or to @tomstafford

I'm also interested in other large datasets of rating/preference data, so if you work on these please get in touch.

Colophon

This project uses a cron job to grab the daily data, and python (including pandas library) for subsequent data munging. Visualisations done using matplotlib.

IMDB data is available from https://developer.imdb.com/non-commercial-datasets/ under these license terms

Information courtesy of IMDb (https://www.imdb.com). Used with permission.

This project, including plots, is CC-BY Tom Stafford

Code and a few more details are available in the repo (but not the data, since it is not mine for onwards sharing)

Director's Cut

Did you notice how I used an occlusion cue in the first line plot? This takes advantage of our visual system's natural expertise in perceiving depth to make overlapping lines appear less confused. The code is really hacky, you just plot a slightly thicker white line before you plot the main (coloured) data line:

plt.plot(df['days'],df['averageRating'],'-',lw=3,color='white')
plt.plot(df['days'],df['averageRating'],'-',lw=2,color=moviecolor)

Without this thicker white line you get the plot below. Compare it to the version above to see the difference it makes to visual confusion:

Update 2023-08-30

I realised I could further improve the plot by having the lines layered so earlier releases/longer lines were at the back. Bonus: Barbie is now in pink.
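
One way to get this layering is to draw the lines in order of length, longest first, keeping each white underlay paired with its own coloured line. The sketch below is an assumption about how it could be done rather than the code in the repo; the series names, numbers, colours and output filename are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up series for illustration: name -> (days, averageRating, colour)
series = {
    "short run": ([0, 1, 2], [7.0, 6.8, 6.7], "tab:blue"),
    "long run": ([0, 1, 2, 3, 4, 5], [6.5, 6.4, 6.2, 6.1, 6.1, 6.0], "tab:orange"),
    "medium run": ([0, 1, 2, 3], [7.4, 7.1, 7.0, 6.9], "tab:green"),
}

# Longest line first, so it ends up at the back
draw_order = sorted(series, key=lambda name: len(series[name][0]), reverse=True)

fig, ax = plt.subplots()
for layer, name in enumerate(draw_order):
    days, rating, colour = series[name]
    # The white underlay and the coloured line share a zorder; the underlay is
    # added first, so it is drawn just beneath its own coloured line.
    ax.plot(days, rating, "-", lw=3, color="white", zorder=2 + layer)
    ax.plot(days, rating, "-", lw=2, color=colour, zorder=2 + layer, label=name)

ax.set_xlabel("days")
ax.set_ylabel("averageRating")
ax.legend()
fig.savefig("layered_lines.png")
```

Setting `zorder` explicitly makes the stacking independent of the order of the `plot` calls, so a line can be moved forwards or backwards without reshuffling the loop.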

Old plot, for comparison:

Suggestion: An alternative way to obtain historical imdb ratings, via WaybackMachine

This is a delightful result!

I'm outlining a suggestion to obtain the data. Please let me know if this has been done before, if there's a better way or there are issues with the outlined suggestion.

General

Purpose: This helps, to some extent, with obtaining the IMDB ratings data, for those who may want to reproduce the results.

Approach: One can use the Internet Archive Wayback Machine to download snapshots of the files.

Potential issues: Not every day is guaranteed to have a snapshot.

Detailed approach

Requirements

CLI:

aria2 (or wget/curl)

Python environment:

pandas
tqdm
wayback

Organization

Also the scripts below assume there exists a data/raw folder, which contains:

links.csv: CSV file with links to Wayback machine snapshots, and corresponding output files
links.aria: the above file converted into aria2 input file
all the snapshots with this file format YYYYMMDD-HHMMSS-title.ratings.tsv.gz, in which YYYYMMDD-HHMMSS is from the Wayback snapshot's timestamps.
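
Since not every day is guaranteed to have a snapshot, it can be useful to check which days are missing once the files are downloaded. The sketch below parses the timestamp back out of each file name using the format described above and lists the calendar days with no file. The function names are hypothetical, and the example paths are only for illustration.

```python
import datetime
import os

# Same pattern as the file names above: YYYYMMDD-HHMMSS-title.ratings.tsv.gz
FILENAME_FORMAT = "%Y%m%d-%H%M%S-title.ratings.tsv.gz"


def snapshot_time(path):
    """Recover the Wayback timestamp from a snapshot file name."""
    return datetime.datetime.strptime(os.path.basename(path), FILENAME_FORMAT)


def missing_days(paths):
    """Return the dates between the first and last snapshot that have no file."""
    seen = {snapshot_time(p).date() for p in paths}
    if not seen:
        return []
    first, last = min(seen), max(seen)
    span = (last - first).days
    all_days = (first + datetime.timedelta(days=i) for i in range(span + 1))
    return [day for day in all_days if day not in seen]


# Illustrative paths only; in practice pass glob.glob("data/raw/*-title.ratings.tsv.gz")
example_paths = [
    "data/raw/20230412-003412-title.ratings.tsv.gz",
    "data/raw/20230413-003404-title.ratings.tsv.gz",
    "data/raw/20230416-003534-title.ratings.tsv.gz",
]
gaps = missing_days(example_paths)
print(gaps)
```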

Steps

Step 1: Obtain snapshot links

Below is content of get-links.py

# get-links.py

import os
import wayback
import datetime
from tqdm import tqdm
import pandas as pd

URL = "https://datasets.imdbws.com/title.ratings.tsv.gz"     # URL to find snapshot 
DATA_DIRECTORY = 'data/raw'                                  # where data will be stored
FILENAME_FORMAT = os.path.join(
    DATA_DIRECTORY, 
    '%Y%m%d-%H%M%S-title.ratings.tsv.gz'
)

start_date = datetime.datetime(2023,4,12)                   # from date to find snapshot 
today_date = datetime.datetime.now()
num_days = (today_date - start_date).days

client = wayback.WaybackClient()
links = []

pbar = tqdm(total=num_days, desc='Expected progress')

for r in client.search(URL, from_date=start_date):
    links.append(dict(
        url = r.raw_url, 
        file = r.timestamp.strftime(FILENAME_FORMAT)
    ))
    pbar.update(1)

pbar.close()

pd.DataFrame(links).to_csv(
    os.path.join(DATA_DIRECTORY, 'links.csv'), 
    index=False
)

Run with python get-links.py (pretty quick)

This will produce data/raw/links.csv with the following content

url,file
https://web.archive.org/web/20230412003412id_/https://datasets.imdbws.com/title.ratings.tsv.gz,data/raw/20230412-003412-title.ratings.tsv.gz
https://web.archive.org/web/20230413003404id_/https://datasets.imdbws.com/title.ratings.tsv.gz,data/raw/20230413-003404-title.ratings.tsv.gz
...

Step 2: Convert to `aria2` input format

At this point one can use aria2, curl or wget to download.

To use aria2, the file links.csv can be converted to a text file formatted for aria2 input:

Run the following code:

sed 1,1d data/raw/links.csv |\
  sed -E 's/([^,]*),(.*)/\1\n  out=\2/' \
  > data/raw/links.aria

The content of links.aria will look like this:

https://web.archive.org/web/20230412003412id_/https://datasets.imdbws.com/title.ratings.tsv.gz
  out=data/raw/20230412-003412-title.ratings.tsv.gz
https://web.archive.org/web/20230413003404id_/https://datasets.imdbws.com/title.ratings.tsv.gz
  out=data/raw/20230413-003404-title.ratings.tsv.gz
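
The `\n` in the sed replacement may not be interpreted as a newline by every sed implementation, so here is a hedged Python alternative that should produce the same layout from links.csv. It is a sketch with a hypothetical function name, and it relies only on the `url` and `file` column headers shown above.

```python
import csv
import io


def csv_to_aria2(csv_text):
    """Turn 'url,file' CSV rows into aria2 input: a URL line, then an indented out= line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        lines.append(row["url"])
        lines.append("  out=" + row["file"])
    return "".join(line + "\n" for line in lines)


# Two rows copied from the links.csv example above
example_csv = (
    "url,file\n"
    "https://web.archive.org/web/20230412003412id_/https://datasets.imdbws.com/title.ratings.tsv.gz,"
    "data/raw/20230412-003412-title.ratings.tsv.gz\n"
    "https://web.archive.org/web/20230413003404id_/https://datasets.imdbws.com/title.ratings.tsv.gz,"
    "data/raw/20230413-003404-title.ratings.tsv.gz\n"
)
aria_text = csv_to_aria2(example_csv)
print(aria_text, end="")

# To convert the real file:
# with open("data/raw/links.csv") as src, open("data/raw/links.aria", "w") as dst:
#     dst.write(csv_to_aria2(src.read()))
```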

Step 3: Download

# download files from `links.aria`
aria2c -c -i data/raw/links.aria -j 4

# optional
aria2c https://datasets.imdbws.com/title.basics.tsv.gz -d data/raw

Here's how data/raw will look:

data/raw
├── 20230412-003412-title.ratings.tsv.gz
├── 20230413-003404-title.ratings.tsv.gz
├── 20230414-003406-title.ratings.tsv.gz
├── 20230415-003407-title.ratings.tsv.gz
├── 20230416-003534-title.ratings.tsv.gz
├── 20230417-003525-title.ratings.tsv.gz
├── 20230418-003405-title.ratings.tsv.gz
├── 20230419-003405-title.ratings.tsv.gz
├── 20230420-003424-title.ratings.tsv.gz
...

├── links.aria
├── links.csv
└── title.basics.tsv.gz
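
Once the snapshots are downloaded, they can be stacked into one table with a row per title per snapshot. The following is a minimal sketch under a few assumptions: the function name and the `date` column are made up, and filtering to a list of title IDs is just one way to keep memory use down, since each snapshot covers every rated title on IMDB. The `tconst`, `averageRating` and `numVotes` columns come from the ratings file itself.

```python
import glob
import os

import pandas as pd


def load_snapshots(data_directory, title_ids):
    """Stack daily rating snapshots into one long table for the given titles."""
    frames = []
    pattern = os.path.join(data_directory, "*-title.ratings.tsv.gz")
    for path in sorted(glob.glob(pattern)):
        # pandas infers gzip compression from the .gz extension
        snapshot = pd.read_csv(path, sep="\t")
        snapshot = snapshot[snapshot["tconst"].isin(title_ids)].copy()
        stamp = os.path.basename(path)[:15]  # the "YYYYMMDD-HHMMSS" prefix
        snapshot["date"] = pd.to_datetime(stamp, format="%Y%m%d-%H%M%S").normalize()
        frames.append(snapshot)
    if not frames:
        return pd.DataFrame(columns=["tconst", "averageRating", "numVotes", "date"])
    return pd.concat(frames, ignore_index=True)


# Usage, with a placeholder ID; look up real ones in title.basics.tsv.gz:
# ratings = load_snapshots("data/raw", ["tt0000001"])
```

The optional title.basics.tsv.gz download is what maps movie titles to their `tconst` IDs.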
