Giter Club home page Giter Club logo

populate-movies's Introduction

Populates a full database of movies from TMDB and IMDB into Postgres.

Populates a high quality database of movies from TMDB, IMDB, and Rotten Tomatoes.

Build Status MIT License Prettier Code Formatting

Intro

This project resolves a full set of movies from TMDB and IMDB in batches, storing them into a Postgres database using Prisma.

This includes difficult-to-access data like IMDB ratings, Rotten Tomatoes audience & critic scores, and YouTube trailers.

Everything is handled including data fetching, normalization, cleaning, and filtering. All scripts are idempotent, so you should be able to re-run them whenever you want to refresh your data.

Movie Schema

Movies are transformed into the following format, which combines metadata from TMDB, IMDB, Rotten Tomatoes, and FlickMetrix.

{
  "tmdbId": 680,
  "imdbId": "tt0110912",
  "title": "Pulp Fiction",
  "originalTitle": "Pulp Fiction",
  "language": "en",
  "releaseYear": 1994,
  "releaseDate": "1994-09-10",
  "genres": ["thriller", "crime", "drama"],
  "plot": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
  "runtime": 154,
  "adult": false,
  "budget": "8000000",
  "revenue": "213928762",
  "homepage": "https://www.miramax.com/movie/pulp-fiction/",
  "mpaaRating": "R",
  "keywords": ["nonlinear timeline", "overdose", "drug use", "drug overdose"],
  "countriesOfOrigin": ["United States"],
  "languages": ["English", "Spanish", "French"],
  "cast": ["John Travolta", "Samuel L. Jackson", "Uma Thurman", "Bruce Willis"],
  "director": "Quentin Tarantino",
  "production": "Miramax Films",
  "awardsSummary": "Won 1 Oscar. Another 63 wins & 47 nominations.",
  "status": "released",
  "posterUrl": "https://image.tmdb.org/t/p/w780/fIE3lAGcZDV1G6XM5KmuWnNsPp1.jpg",
  "posterWidth": 2000,
  "posterHeight": 3000,
  "backdropUrl": "https://image.tmdb.org/t/p/w1280/suaEOtk1N1sgg2MTM7oZd2cfVp3.jpg",
  "backdropWidth": 2631,
  "backdropHeight": 1480,
  "trailerUrl": "https://youtube.com/watch?v=tGpTpVyI_OQ",
  "trailerYouTubeId": "tGpTpVyI_OQ",
  "imdbRating": 8.9,
  "imdbVotes": 2033927,
  "imdbType": "movie",
  "tmdbPopularity": 74.051,
  "tmdbRating": 8.491,
  "tmdbVotes": 24004,
  "metacriticRating": 94,
  "metacriticVotes": 24,
  "rtCriticRating": 93,
  "rtCriticVotes": 103,
  "rtAudienceRating": 93,
  "rtAudienceVotes": 1120247,
  "rtUrl": "https://www.rottentomatoes.com/m/pulp_fiction",
  "letterboxdScore": 86,
  "letterboxdVotes": 1102425,
  "flickMetrixScore": 90,
  "flickMetrixId": 110912,
  "foreign": false,
  "relevancyScore": 1883295.240641344,
  "imdbCustomPopularity": 2033927,
  "createdAt": "2022-11-07T03:51:42.276Z",
  "updatedAt": "2022-11-07T03:51:42.276Z"
}

In addition to basic movie metadata, most movies also include:

  • Ratings
    • IMDB ratings
    • Rotten Tomatoes critic and audience scores
    • Metacritic ratings
    • Letterboxd ratings
    • FlickMetrix ratings
  • Media
    • YouTube trailers
    • Poster images
    • Backdrop images

We also compute some useful custom fields like foreign, which is a flag for whether or not most Western audience members would recognize the movie. This includes most English movies as well as many of the most popular foreign films.

This format aims to be as compact as possible while still including the most important metadata, so for certain fields like the trailer, we choose the best possible YouTube video according to a set of heuristics and ignore any other related videos.

This may result in less metadata in your database for fallbacks and such, but the project is explicitly designed to be re-run regularly so your data remains valid and up-to-date, relying on these third-parties to deal with data drift.

Prerequisites

Store your TMDB_BEARER_TOKEN and DATABASE_URL in a local .env file.

Steps

Once we have our environment setup, we can start processing movies. Most of the processing steps break things up into batches (defaults to 32k movies per batch) in order to allow for incremental processing and easier debugging along the way.

All scripts are idempotent, so you can run these steps repeatedly and expect up-to-date results.

Before getting started, make sure you've run pnpm install to initialize the Node.js project.

1. Download Data Dumps

We'll start off by running npx tsx src/download-data-dumps.tsx, which downloads the following data dumps to seed our scripts:

The output will be stored in the data/ directory and is ignored by git.

2. Populate TMDB Movies

Next, we'll run npx tsx src/populate-tmdb-movie-dump.ts which populates each of the TMDB movie IDs with its corresponding TMDB movie details and stores the results into a series of batched JSON files out/tmdb-0.json, out/tmdb-1.json, etc. (takes ~1 hour)

The result is ~655k movies in 24 batches.

3. Populate Flick Metrix Movies (Optional)

The next optional step is to download movie metadata using Flick Metrix's private API. This is really only necessary if you want Rotten Tomatoes scores. (takes ~3 minutes)

Run npx tsx src/populate-flick-metrix-movies which will fetch ~70k movies and store the results into out/flick-metrix-movies.json. This optional metadata will be used by src/process-movies.ts if it exists and will be ignored if it doesn't.

4. Process Movies

Next, we'll run npx tsx src/process-movies.ts which takes all of the previously resolved TMDB movies and transforms them to a normalized schema, adding in IMDB ratings from our partial IMDB data dump. This will output normalized JSON movies in batched JSON files out/movies-0.json, out/movies-1.json, etc. (takes ~30 seconds)

This script also filters movies which are unlikely to be relevant for most use cases:

  • filters adult movies
  • filters movies which do not have a valid IMDB id (~40%)
  • filters movies which do not have a valid YouTube trailer (~58%)
  • filters movies which are too far in the future (~1.5%)
  • filters movies which are too short (< 30min)
  • filters music videos
  • filters tv series and episodes
  • filters live concerts
  • optionally computes LQIP placeholder images for all movie posters and backdrops
    • caches the results in redis
    • this is controlled by enablePreviewImages in src/lib/config.ts
    • if you have it enabled, set REDIS_HOST and REDIS_PASSWORD
    • I just use a local redis instance (brew install redis)
  • adds additional IMDB info from any previous populate-imdb-movies cache (if out/imdb-movies.json exists)
  • adds additional Rotten Tomatoes info from any previous populate-rt-movies cache (if out/rt-movies.json exists)
  • adds additional Flick Metrix info from any previous populate-flick-metrix-movies cache (if out/flick-metrix-movies.json exists)

NOTE: this is the most important step, and you will likely find yourself running it multiple times. It is pretty quick, though, since it doesn't require any network access. It is essentially just aggregating all of the data we've downloaded in the other steps into a normalized format and performing some filtering.

The result is ~73k movies.

5. Populate Rotten Tomatoes Movies (Optional)

The next optional step is to download additional RT info for each movie, using a cheerio-based scraper. We self-impose a strict rate-limit on the scraping, so this step takes a long time to run and requires a solid internet connection with minimal interruptions. (takes ~4 hours)

Currently, the only way this script will work is if you've previously downloaded all Flick Metrix movies.

This step is important if you want up-to-date Rotten Tomatoes critic and audience scores, along with a criticsConsensus description.

If you want to proceed with this step, you can run npx tsx src/populate-rt-movies.tsx which will read in our normalized movies from out/movie-0.json, out/movie-1.json, etc, scrape each RT movie and store the results to out/rt-movies.json.

Once this step finishes, you'll need to re-run npx tsx src/process-movies.ts, which will now take into account all of the extra RT metadata that was downloaded to our local cache.

6. Populate IMDB Movies (Optional)

The next optional step is to download additional IMDB info for each movie, using a cheerio-based scraper called movier. We self-impose a strict rate-limit on the scraping, so this step takes a long time to run and requires a solid internet connection with minimal interruptions. (takes ~12 hours)

NOTE: see IMDB's personal and non-commercial licensing before proceeding. This step is optional for a reason. IMDB does have an official API, but it is extremely expensive to use (starting at $50k + usage-based billing).

If you want to proceed with this step, you'll can run npx tsx src/populate-imdb-movies.ts which will read in our normalized movies from out/movie-0.json, out/movie-1.json, etc, scrape each IMDB movie individually with movier and store the results to out/imdb-movies.json.

If you are running into rate-limit issues (likely 503 errors), then you'll need to adjust the rate-limit in src/lib/imdb.ts.

Once this step finishes, you'll need to re-run npx tsx src/process-movies.ts, which will now take into account all of the extra IMDB metadata that was downloaded to our local cache.

7. Upsert Movies into Prisma

The final step is to upsert all of the movies into your Prisma database. (takes ~30 seconds)

Make sure that you have DATABASE_URL set to a Postgres instance in your .env file and then run npx prisma db push to sync the Prisma schema with your database (as well as generating the prisma client locally in node_modules). You can alternatively use prisma db migrate and prisma generate if you prefer that workflow.

NOTE: this script defaults to upserting movies into your database, so it shouldn't be destructive, but if you'd rather clear your movies table first, you can set DROP_MOVIES=true — which will drop the movies table and perform upserts in batches. This method is much faster, but be careful with it so that you don't accidentally wipe out movie data.

NOTE if you set DRY_RUN=true, this script will not make any changes to your database. Instead, it will loop through all movies that would've been inserted and print the diffs — which can be extremely useful for making sure that your upsert will work as intended.

Now you should be ready to run npx tsx src/upsert-movies-to-db.ts which will run through out/movies-0.json, out/movies-1.json, etc and insert each movie into the database.

8. Query Movies from Prisma

You should now have a full Postgres database of movies, complete with the most important metadata from TMDB and IMDB. Huzzah!

You can run npx tsx scripts/scratch.ts to run an example Prisma query.

Stats

  • ~750k "movies" in TMDB
  • ~73k movies after resolving and filtering
    • 8.6k documentaries
    • 44k english movies
    • 34k movies made in the U.S.
      • (80% of these are made exclusively in the U.S.)
    • 4.3k movies made in India
      • (99% of these are made exclusively in India)
    • 1.3k movies made in China
      • (40% of these are made exclusively in China)
  • IMDB stats
    • 30k movies have at least 1k votes on IMDB
    • 40k movies have an IMDB rating of at least 6
    • 29k movies have an IMDB rating of at least 6.5
    • 17k movies have an IMDB rating of at least 7
    • 3k movies have an IMDB rating of at least 8
    • 300 movies have an IMDB rating of at least 9

(all stats are approximate)

Database Notes

The resulting movie dataset is ~130MB (about half of which is Postgres indices) and should fit in most free-tier Postgres instances.

Some fields like budget and revenue use strings instead of numbers, because they can overflow Postgres integers.

If you want to use a different type of database, it should be pretty easy since the majority of the processing happens with local JSON files. Note that Prisma supports several popular databases, including MongoDB, MySQL, and SQLite.

License

MIT © Travis Fischer

Support my open source work by following me on twitter twitter

TMDB         IMDB         Rotten Tomatoes

populate-movies's People

Contributors

transitive-bullshit avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

populate-movies's Issues

ENOENT: No such file or directory, open data/tmdb_dump_movie_ids.json

Hello when I run: npx tsx src/download-data-dumps.tsx
I received the following error:

Could not find module '/home/ubuntu/transitive/populate-movies/src/download-data-dumps.tsx

Then I run npx tsx src/download-data-dumps.ts

And it seems to work, but when I run the next command: npx tsx src/populate-tmdb-movie-dump.ts
I get the following error: Error: ENOENT: No such file or directory, open 'data/tmdb_dump_movie_ids.json

Populating IMDB movies gives 403 forbidden for all movies

d:\populate-movies-main>npx tsx src/populate-imdb-movies.ts
(node:31032) ExperimentalWarning: Custom ESM Loaders is an experimental feature and might change at any time
(Use node --trace-warnings ... to show where the warning was created)
loading IMDB movies from cache (out/imdb-movies.json)
warn: unable to load IMDB movie cache (out/imdb-movies.json)
You can safely ignore this warning if you haven't run populate-imdb-movies.ts.
[Error: ENOENT: no such file or directory, open 'd:\populate-movies-main\out\imdb-movies.json'] {
errno: -4058,
code: 'ENOENT',
syscall: 'open',
path: 'd:\populate-movies-main\out\imdb-movies.json'
}

populating 29997 movies in batch 0 (out/movies-0.json)

0:0 imdb tt13345606 (2023) Evil Dead Rise
0:1 imdb tt10954600 (2023) Ant-Man and the Wasp: Quantumania
0:2 imdb tt6718170 (2023) The Super Mario Bros. Movie
0:3 imdb tt13375076 (2023) The Pope's Exorcist
imdb error tt13375076 The Pope's Exorcist Error: failed to get the result from IMDB : Request failed with status code 403
imdb error tt13345606 Evil Dead Rise Error: failed to get the result from IMDB : Request failed with status code 403
imdb error tt6718170 The Super Mario Bros. Movie Error: failed to get the result from IMDB : Request failed with status code 403
imdb error tt10954600 Ant-Man and the Wasp: Quantumania Error: failed to get the result from IMDB : Request failed with status code 403
0:3 imdb tt13375076 (2023) The Pope's Exorcist
0:0 imdb tt13345606 (2023) Evil Dead Rise
0:2 imdb tt6718170 (2023) The Super Mario Bros. Movie
0:1 imdb tt10954600 (2023) Ant-Man and the Wasp: Quantumania
imdb error tt13375076 The Pope's Exorcist Error: failed to get the result from IMDB : Request failed with status code 403
imdb error tt13345606 Evil Dead Rise Error: failed to get the result from IMDB : Request failed with status code 403
imdb error tt6718170 The Super Mario Bros. Movie Error: failed to get the result from IMDB : Request failed with status code 403
imdb error tt10954600 Ant-Man and the Wasp: Quantumania Error: failed to get the result from IMDB : Request failed with status code 403

Terminate batch job (Y/N)? y

populate-rt-movies.ts not processing movies-##.json

I'm not sure what else to provide as far as information. All previous processes seem to have been completed correctly. The movies-#.json and tmdb-#.json files are present in /old. At the end rt-movies.json is present, but empty other than brackets. Process completed in about 10 seconds. Thanks for your help, I'm very excited to be able to get this process working.

Output:
d:\populate-movies-main>npx tsx src/populate-rt-movies.ts
(node:16868) ExperimentalWarning: Custom ESM Loaders is an experimental feature and might change at any time
(Use node --trace-warnings ... to show where the warning was created)
loading RT movies from cache (out/rt-movies.json)
loading OMDB movies from cache (out/omdb-movies.json)
warn: unable to load OMDB movie cache (out/omdb-movies.json)
You can safely ignore this warning if you haven't run populate-omdb-movies.ts.
[Error: ENOENT: no such file or directory, open 'd:\populate-movies-main\out\omdb-movies.json'] {
errno: -4058,
code: 'ENOENT',
syscall: 'open',
path: 'd:\populate-movies-main\out\omdb-movies.json'
}
loaded 0 RT movies from cache (out/rt-movies.json)
TODO

populating 21231 movies in batch 3 (out/movies-3.json)

batch 3 done { numMovies: 21231, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 20427 movies in batch 4 (out/movies-4.json)

batch 4 done { numMovies: 20427, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 20366 movies in batch 5 (out/movies-5.json)

batch 5 done { numMovies: 20366, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 15642 movies in batch 6 (out/movies-6.json)

batch 6 done { numMovies: 15642, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 18466 movies in batch 7 (out/movies-7.json)

batch 7 done { numMovies: 18466, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 13063 movies in batch 8 (out/movies-8.json)

batch 8 done { numMovies: 13063, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 14762 movies in batch 9 (out/movies-9.json)

batch 9 done { numMovies: 14762, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 12610 movies in batch 10 (out/movies-10.json)

batch 10 done { numMovies: 12610, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 9826 movies in batch 11 (out/movies-11.json)

batch 11 done { numMovies: 9826, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 10885 movies in batch 12 (out/movies-12.json)

batch 12 done { numMovies: 10885, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 10982 movies in batch 13 (out/movies-13.json)

batch 13 done { numMovies: 10982, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 9905 movies in batch 14 (out/movies-14.json)

batch 14 done { numMovies: 9905, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 9733 movies in batch 15 (out/movies-15.json)

batch 15 done { numMovies: 9733, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 7993 movies in batch 16 (out/movies-16.json)

batch 16 done { numMovies: 7993, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 7382 movies in batch 17 (out/movies-17.json)

batch 17 done { numMovies: 7382, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 5289 movies in batch 18 (out/movies-18.json)

batch 18 done { numMovies: 5289, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 4911 movies in batch 19 (out/movies-19.json)

batch 19 done { numMovies: 4911, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 5047 movies in batch 20 (out/movies-20.json)

batch 20 done { numMovies: 5047, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 4802 movies in batch 21 (out/movies-21.json)

batch 21 done { numMovies: 4802, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 4950 movies in batch 22 (out/movies-22.json)

batch 22 done { numMovies: 4950, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

populating 4494 movies in batch 23 (out/movies-23.json)

batch 23 done { numMovies: 4494, numRTMoviesDownloaded: 0, numRTMoviesTotal: 0 }

done {
numMoviesTotal: 232766,
numRTMoviesDownloadedTotal: 0,
numRTMoviesTotal: 0
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.