
movielens-imdb's Introduction

MovieLens-IMDB

How to match the MovieLens dataset and the IMDB dataset?

ML-100K, ML-1M, ML-10M

For these three datasets, the movies have to be matched by title and release year. As the README of ML-10M states: "Movie titles, by policy, should be entered identically to those found in IMDB, including year of release. However, they are entered manually, so errors and inconsistencies may exist."

To fix these inconsistencies, we manually matched the MovieLens movies that cannot be found in IMDB under their original title. For example, a movie titled jungle2jungle in MovieLens is titled jungle 2 jungle in IMDB. The manually fixed titles are listed in movielens/statistics/manually_fixed_title_name.
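The kind of title mismatch described above can often be caught automatically before falling back to manual fixes. The following is a minimal sketch, not the logic used in preprocess_movie_imdb.py: the helper names and the one-year tolerance are hypothetical, and it assumes MovieLens titles follow the "Title (Year)" pattern with a trailing article such as ", The".

```python
import re

# Hypothetical helpers for illustration; not taken from preprocess_movie_imdb.py.
TITLE_YEAR_RE = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")
ARTICLE_RE = re.compile(r"^(?P<body>.*), (?P<article>the|a|an)$", re.IGNORECASE)


def split_title_year(raw):
    """Split 'Title (1997)' into ('Title', 1997); the year is None if absent."""
    match = TITLE_YEAR_RE.match(raw.strip())
    if match is None:
        return raw.strip(), None
    return match.group("title"), int(match.group("year"))


def normalize_title(title):
    """Lowercase, move a trailing article to the front, keep only a-z and 0-9."""
    title = title.strip().lower()
    match = ARTICLE_RE.match(title)
    if match:
        title = f"{match.group('article')} {match.group('body')}"
    # Note: this also drops accented letters, which is a simplification.
    return re.sub(r"[^a-z0-9]", "", title)


def same_movie(ml_raw, imdb_title, imdb_year, year_tolerance=1):
    """Compare a raw MovieLens title with an IMDB title and year."""
    ml_title, ml_year = split_title_year(ml_raw)
    if normalize_title(ml_title) != normalize_title(imdb_title):
        return False
    if ml_year is None or imdb_year is None:
        return True
    return abs(ml_year - imdb_year) <= year_tolerance


print(same_movie("jungle2jungle (1997)", "jungle 2 jungle", 1997))
```

Titles that still fail to match after this normalization are the ones that would need a manual entry.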

For convenience, we provide a script that matches the directors and writers of each movie.

Step1: Download the MovieLens dataset (https://grouplens.org/datasets/movielens/) and save it under movielens/raw/, in a directory named after the dataset (for example movielens/raw/ML-100K if you use the ML-100K dataset). Download the IMDB dataset files (https://datasets.imdbws.com/), save them in _IMDB/, and then unzip all of them.
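Once the IMDB files are unzipped, they are tab-separated text with a header row. The sketch below shows one way to read such a file with the standard library. It assumes the column layout of IMDB's title.basics file and the `\N` missing-value marker, and the function names are hypothetical. The sample rows are inlined so the example runs without the real download; the second row is invented.

```python
import csv
import io


def read_imdb_tsv(handle):
    """Yield one dict per row, mapping IMDB's missing-value marker to None."""
    # QUOTE_NONE: treat quote characters in titles as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}


def index_titles(rows, title_types=("movie", "tvmovie", "tvminiseries", "video")):
    """Map tconst -> (primaryTitle, startYear or None) for the chosen title types."""
    index = {}
    for row in rows:
        if row["titleType"].lower() not in title_types:
            continue
        year = int(row["startYear"]) if row["startYear"] else None
        index[row["tconst"]] = (row["primaryTitle"], year)
    return index


# Inline sample; the tvEpisode row is a made-up placeholder.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0114709\tmovie\tToy Story\tToy Story\t0\t1995\t\\N\t81\tAdventure,Animation,Comedy\n"
    "tt0000000\ttvEpisode\tPlaceholder Episode\tPlaceholder Episode\t0\t\\N\t\\N\t\\N\t\\N\n"
)

titles = index_titles(read_imdb_tsv(io.StringIO(SAMPLE)))
print(titles)
```

For a real file, pass an open text handle instead of the `io.StringIO` sample, for example one opened with `encoding="utf-8"` and `newline=""`.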

Step2: Run the script

python preprocess_movie_imdb.py

You can obtain a heterogeneous graph with the network schema as follows:

[Figure: network_schema]

Besides directors and writers, you can obtain other kinds of knowledge, such as editors, producers, actors, cinematographers, and composers. People also come with basic information such as birth year, death year, primary profession, and the movies they are known for.
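To illustrate how movie-person relations can become edges of a heterogeneous graph, here is a minimal sketch. It assumes crew rows shaped like IMDB's title.crew file (a `tconst` column plus comma-separated `directors` and `writers` columns). The function name, the relation labels, and all ids in the sample are invented for the example and do not come from the repository.

```python
def crew_edges(crew_rows, wanted_tconsts):
    """Return (movie, relation, person) triples for directors and writers."""
    edges = []
    for row in crew_rows:
        if row["tconst"] not in wanted_tconsts:
            continue
        for relation, column in (("directed_by", "directors"), ("written_by", "writers")):
            value = row[column]
            # Skip empty cells and IMDB's missing-value marker.
            if value in (None, "", r"\N"):
                continue
            edges.extend((row["tconst"], relation, person) for person in value.split(","))
    return edges


# Invented toy rows; real ids look like tt0114709 (titles) and nm0799875 (people).
crew_rows = [
    {"tconst": "tt_a", "directors": "nm_x", "writers": "nm_x,nm_y"},
    {"tconst": "tt_b", "directors": "nm_z", "writers": r"\N"},
    {"tconst": "tt_c", "directors": "nm_z", "writers": "nm_z"},
]

edges = crew_edges(crew_rows, wanted_tconsts={"tt_a", "tt_b"})
for edge in edges:
    print(edge)
```

Other roles (editor, producer, actor, and so on) could be added as further relation types in the same triple format.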

ML-20M

This dataset already includes links to other sources. As indicated in the README file (http://files.grouplens.org/datasets/movielens/ml-20m-README.html): "Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format: movieId,imdbId,tmdbId."
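A short sketch of how links.csv could be turned into IMDB title ids. It assumes that the imdbId column holds the numeric part of an IMDB title id, so that the full id is "tt" followed by the number zero-padded to at least seven digits. The function names are hypothetical, and the second sample row deliberately drops its leading zero to show the padding.

```python
import csv
import io


def imdb_id_to_tconst(imdb_id):
    """Turn a numeric imdbId into an IMDB title id such as 'tt0114709'."""
    digits = str(imdb_id).strip()
    if not digits.isdigit():
        raise ValueError(f"unexpected imdbId: {imdb_id!r}")
    # zfill pads short ids and leaves longer ones unchanged.
    return "tt" + digits.zfill(7)


def load_links(handle):
    """Map movieId -> tconst, skipping rows with an empty imdbId."""
    reader = csv.DictReader(handle)
    return {
        int(row["movieId"]): imdb_id_to_tconst(row["imdbId"])
        for row in reader
        if row["imdbId"]
    }


SAMPLE = "movieId,imdbId,tmdbId\n1,0114709,862\n2,113497,8844\n"

links = load_links(io.StringIO(SAMPLE))
print(links)
```

Reading the column as a string rather than an integer keeps any leading zeros intact, so the padding only matters if they were lost earlier.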

However, we have not checked these identifiers, so we do not know whether they contain any inconsistencies.

movielens-imdb's People

Contributors

jennyzhang0215


movielens-imdb's Issues

Bugs From "preprocess_movie_imdb.py" file

Hi, thanks for your code. When I ran the "preprocess_movie_imdb.py" file, it failed at line 638 on the assertion "assert len(imdb_id) == 9".

I then commented out this line and tried to continue running the code, but it reported another error:
"
found1652/1681, dropped: 8 (not matched) + 3 (>2 titles)
# mapped (director): 12621, # mapped (writer): 12518
Traceback (most recent call last):
  File "preprocess_movie_imdb.py", line 682, in <module>
    year_diff_thres=4, chosen_title_type_l=["movie", "tvmovie", "tvminiseries", "video"], COS_SIM_THRES=0.3)
  File "preprocess_movie_imdb.py", line 664, in match
    ["writer_id", "name"])
  File "preprocess_movie_imdb.py", line 496, in gen_unique_info
    unique_l = [[id, people_id2name_dic[id]] for id in unique_.values]
  File "preprocess_movie_imdb.py", line 496, in <listcomp>
    unique_l = [[id, people_id2name_dic[id]] for id in unique_.values]
KeyError: 'nm0799875'
"

Could you help me solve this problem? Or, if you have the processed data for MovieLens 10M/20M, could you send it to me? My email is [email protected].
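A possible workaround for both errors, offered as an unverified sketch and not as the repository's fix. It assumes the length assertion fails because some IMDB title ids have more than seven digits, and that the KeyError means the person id is missing from the name table that was loaded. The function names below are hypothetical, and the name table is toy data.

```python
import re

# Assumed shape of an IMDB title id: "tt" followed by seven or more digits.
TCONST_RE = re.compile(r"tt\d{7,}")


def is_valid_tconst(imdb_id):
    """Looser replacement for a fixed-length check such as len(imdb_id) == 9."""
    return TCONST_RE.fullmatch(imdb_id) is not None


def lookup_names(person_ids, people_id2name_dic):
    """Return ([id, name] pairs, ids without a name) instead of raising KeyError."""
    found, missing = [], []
    for person_id in person_ids:
        name = people_id2name_dic.get(person_id)
        if name is None:
            missing.append(person_id)
        else:
            found.append([person_id, name])
    return found, missing


# Toy name table; "nm_x" is invented, nm0799875 is the id from the traceback.
names = {"nm_x": "Person X"}
found, missing = lookup_names(["nm_x", "nm0799875"], names)
print(found, missing)
```

Reporting the missing ids rather than silently dropping them makes it easier to tell whether the name file is incomplete or out of date relative to the crew file.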
