
spark-movie-lens's People

Contributors

bitdeli-chef, danionescu0, fulcommit, hoarf, jadianes, marius92mc, saurfang


spark-movie-lens's Issues

update the model for new users and new movies

Thanks for your open-source code. I have a small question about updating the model for new users and new movies.
In engine.py, when we add a rating, the engine calls self.__train_model(), which means all ratings are recomputed from scratch. Do you know how to augment the model with new ratings incrementally?
Any pointers would help. Thank you very much.
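For context, here is a plain-Python sketch of what the engine currently does (the names self.ratings_RDD and __train_model come from engine.py; the list-based code only mimics the RDD union, since, as far as I know, MLlib's MatrixFactorizationModel has no public incremental-update API):

```python
# Plain-Python stand-in for the current behavior: every new batch of
# ratings is unioned with the existing ones, and ALS is then retrained
# from scratch on the full set (tuple shape assumed from engine.py).
existing = [(1, 10, 4.0), (2, 10, 3.5)]   # (user_id, movie_id, rating)
new_batch = [(3, 10, 5.0)]

# equivalent of self.ratings_RDD = self.ratings_RDD.union(new_ratings_RDD)
all_ratings = existing + new_batch

# __train_model() would now refit ALS on all 3 ratings
print(len(all_ratings))  # 3
```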

Getting individual ratings

Currently, the example code looks like:

my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
individual_movie_rating_RDD.take(1)

Should it be this instead?

my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(my_movie)
individual_movie_rating_RDD.collect()

logic error in function "get_top_ratings" when getting "user_unrated_movies_RDD"

In engine.py, function get_top_ratings contains:

user_unrated_movies_RDD = self.movies_RDD.filter(lambda rating: not rating[1]==user_id).map(lambda x: (user_id, x[0]))

Each element of self.movies_RDD is (movie_id, movie_title, movie_category), so rating[1] is the movie title, not a user id. I suspect self.movies_RDD should be self.ratings_RDD. Please check.
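To illustrate the point, here is a plain-Python analogue (no Spark needed) of building the unrated-movie pairs from ratings triples; the names mirror engine.py, but the set-based logic below is my own sketch of the intended behavior:

```python
# Each ratings element is (user_id, movie_id, rating); we keep the movies
# this user has NOT rated and pair each with the user id.
ratings = [(1, 10, 4.0), (1, 20, 3.0), (2, 30, 5.0)]
user_id = 1

rated = {m for (u, m, _) in ratings if u == user_id}   # {10, 20}
all_movie_ids = {m for (_, m, _) in ratings}           # {10, 20, 30}
user_unrated = [(user_id, m) for m in all_movie_ids - rated]

print(user_unrated)  # [(1, 30)]
```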

duplicates of new user unrated movies passed to predict

new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0])))

The list of unrated movies contains duplicates:

print(new_user_unrated_movies_RDD.take(10))
[(0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

Should a .distinct() be added?

new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0]))).distinct()
print(new_user_unrated_movies_RDD.take(10))
[(0, 378), (0, 1934), (0, 3282), (0, 5606), (0, 862), (0, 2146), (0, 3766), (0, 1330), (0, 2630), (0, 4970)]

The predict function that receives new_user_unrated_movies_RDD:

# Use the input RDD, new_user_unrated_movies_RDD, with new_ratings_model.predictAll() to predict new ratings for the movies
new_user_recommendations_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
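A plain-Python analogue of what .distinct() does to those pairs (the data shape is taken from the take(10) output above):

```python
# Duplicate (user, movie) pairs collapse to one entry each, so predictAll
# is not asked to score the same movie repeatedly.
pairs = [(0, 1), (0, 1), (0, 1), (0, 5)]
deduped = sorted(set(pairs))   # equivalent of .distinct() on the RDD
print(deduped)  # [(0, 1), (0, 5)]
```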

localhost:5432/0/ratings/top/10

spark-movie-lens/engine.py", line 80, in get_top_ratings
user_unrated_movies_rdd = self.movies_rdd.filter(lambda rating: not rating[1] == user_id)\

AttributeError: RecommendationEngine instance has no attribute 'movies_rdd'

real-time recommend

I want to build a real-time recommender system. How can I achieve that with your spark-movie-lens project? Can you give me some suggestions? Thank you very much.

StackOverflow

Getting a StackOverflowError while running the application engine:
An error occurred while calling o90.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 49, 192.168.110.130): java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2846)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1455)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invok

engine.iteration

Hi jadianes,
Thank you for all your work; it has really helped me. But when the number of iterations in the engine is greater than 5, it causes an error on my system, and I don't think 5 iterations are enough for good recommendations. Can you suggest a way to fix it? The error is:

File "F:\bitirme\spark-2.0.1-bin-hadoop2.7\python\pyspark\mllib\common.py", line 123, in callJavaFunc
17/04/24 22:58:09 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-7,5,main]
java.lang.StackOverflowError
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:147)
at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:52)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)


ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(5,1493063889491,JobFailed(org.apache.spark.SparkException: Job 5 cancelled because SparkContext was shut down)

my system specifications
i7 2nd gen
6 GB RAM
SSD 550/440 MB/s
ASUS N53SV laptop
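One commonly suggested remedy for StackOverflowError at higher ALS iteration counts is enabling checkpointing, which periodically materializes the RDD lineage instead of replaying it. SparkContext.setCheckpointDir is a real Spark API; the stand-in class below only demonstrates the call shape so the sketch runs without a Spark install:

```python
def configure_checkpointing(sc, path="checkpoint/"):
    """Give Spark a checkpoint directory so long ALS lineages are
    periodically truncated instead of replayed (sc: SparkContext)."""
    sc.setCheckpointDir(path)

# Minimal stand-in so this sketch runs without pyspark installed:
class FakeSparkContext:
    def setCheckpointDir(self, p):
        self.checkpoint_dir = p

ctx = FakeSparkContext()
configure_checkpointing(ctx)
print(ctx.checkpoint_dir)  # checkpoint/
```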

New User Ratings

Hi Jose, Great job !
I'm new to GitHub, so please pardon me if this should not be reported as an issue, but I wanted to bring to your attention the ratings we provide to the complete dataset for a new user.

The range for new-user ratings seems to be [0, 10], and when the recommendation engine makes predictions it produces ratings in a similar range. Shouldn't it be the [0, 5] range? When I supply ratings in [0, 5], it predicts movie ratings in [0, 5], but the predictions are drastically different from before. Am I missing something here?
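If the new-user ratings really were entered on a 0-10 scale, halving them maps onto MovieLens's 0-5 scale. A trivial sketch (the repo itself does no such rescaling; this only illustrates the range question):

```python
# Map ratings given on a [0, 10] scale onto MovieLens's [0, 5] scale.
raw = [8, 9, 10, 6]
rescaled = [r / 2.0 for r in raw]
print(rescaled)  # [4.0, 4.5, 5.0, 3.0]
```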

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist

I am using the same notebook on Cloudera's quickstart VM with Anaconda installed. I have made no other changes.

On this step:

small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]

it gives an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-21-61849ee50ee7> in <module>()
----> 1 small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1265         """
   1266         items = []
-> 1267         totalParts = self.getNumPartitions()
   1268         partsScanned = 0
   1269 

/usr/lib/spark/python/pyspark/rdd.py in getNumPartitions(self)
    354         2
    355         """
--> 356         return self._jrdd.partitions().size()
    357 
    358     def filter(self, f):

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o108.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/datasets/ml-latest-small/ratings.csv
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

I checked the previous step:

small_ratings_raw_data

This gives the result:

/home/cloudera/datasets/ml-latest-small/ratings.csv MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:-2

Could you please help me with this?
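The traceback shows Spark resolving the relative path against HDFS (hdfs://quickstart.cloudera:8020/...), where the file does not exist. Two common fixes: copy the file into HDFS with hdfs dfs -put, or pass an explicit file:// URI so textFile reads the VM's local filesystem. A sketch of the latter (the path is taken from the traceback):

```python
# Prefix the local path with file:// so Spark does not resolve it
# against the default filesystem (HDFS on the Cloudera quickstart VM).
local_path = "/home/cloudera/datasets/ml-latest-small/ratings.csv"
uri = "file://" + local_path
print(uri)  # file:///home/cloudera/datasets/ml-latest-small/ratings.csv
```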

Unable to proceed past stage 7.0 (OutOfMemoryError: Java heap space)

py4j.protocol.Py4JJavaError: An error occurred while calling o96.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 7.0 failed 1 times, most recent failure: Lost task 5.0 in stage 7.0 (TID 54, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

Unable to proceed further. Any help is much appreciated!
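Heap-space failures during trainALSModel on the complete dataset are usually addressed by giving the driver and executors more memory. Assuming the server is launched through spark-submit (the values below are examples to tune, not recommendations):

```shell
# Raise driver and executor heap sizes before retraining on the full data.
spark-submit --driver-memory 4g --executor-memory 4g server.py
```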

self.seed

FYI:

spark-movie-lens/engine.py", line 121
    self.seed = 5L
SyntaxError: invalid syntax

Python 3.5 does not support this.
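Python 3 removed the long type and its L literal suffix (plain int is now unbounded), so the fix is simply to drop the suffix:

```python
# In engine.py:  self.seed = 5L  ->  self.seed = 5
seed = 5
print(type(seed).__name__)  # int
```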

Invalid Syntax in engine.py - line 115, self.seed = 5L

(C:\Program Files\Anaconda3) F:\Data Science\movielens\spark-movie-lens>server.py
Traceback (most recent call last):
File "F:\Data Science\movielens\spark-movie-lens\server.py", line 3, in
from app import create_app
File "F:\Data Science\movielens\spark-movie-lens\app.py", line 5, in
from engine import RecommendationEngine
File "F:\Data Science\movielens\spark-movie-lens\engine.py", line 115
self.seed = 5L
^
SyntaxError: invalid syntax

Importing Spark

Hello,
As I was following the guide, I found that the variable sc was not defined; I figured it belongs to Spark.
However, I don't know how to configure Spark to run the notebook.
I'm on Windows; any help?
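One way to make sc available is to launch the notebook through the pyspark driver, which predefines a SparkContext named sc. The paths below are placeholders (on Windows, use set instead of export and adjust the py4j zip name to the installed version):

```shell
export SPARK_HOME=/path/to/spark
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$SPARK_HOME/bin/pyspark   # opens the notebook with sc predefined
```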

GETing top recommendations shows the same movie

Hello,

I've managed to run the project locally, and the output from getting the top recommendations shows the same movie repeated.
Does anyone else experience the same behavior?
Note that I ran it with the exact source files from this repo.

The output looks like this:
"[["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30]]"

Thank you.
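The root cause is likely duplicate rows entering the title join, but as a downstream workaround the duplicates can be collapsed by title. A plain-Python sketch using the [title, rating, count] shape from the output above (the second title is made up for illustration):

```python
# Keep only the first entry per movie title.
recs = [["The War (2007)", 8.836, 30],
        ["The War (2007)", 8.836, 30],
        ["Other Movie (1999)", 8.1, 40]]

seen, deduped = set(), []
for title, rating, count in recs:
    if title not in seen:
        seen.add(title)
        deduped.append([title, rating, count])

print(len(deduped))  # 2
```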
