
spark-movie-lens's People

Contributors

bitdeli-chef, danionescu0, fulcommit, hoarf, jadianes, marius92mc, saurfang


spark-movie-lens's Issues

update the model for new users and new movies

Thanks for your open-source code. I have a small question about updating the model for new users and new movies.
In engine.py, when we add a rating, the engine calls self.__train_model(), which means all ratings are recomputed from scratch. Do you know how to augment the model with new ratings incrementally?
Any pointers would help. Thank you very much.
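For context, here is a plain-Python sketch of what the engine currently does (the names self.ratings_RDD and __train_model come from engine.py; the list-based code only mimics the RDD union, since, as far as I know, MLlib's MatrixFactorizationModel has no public incremental-update API):

```python
# Plain-Python stand-in for the current behavior: every new batch of
# ratings is unioned with the existing ones, and ALS is then retrained
# from scratch on the full set (tuple shape assumed from engine.py).
existing = [(1, 10, 4.0), (2, 10, 3.5)]   # (user_id, movie_id, rating)
new_batch = [(3, 10, 5.0)]

# equivalent of self.ratings_RDD = self.ratings_RDD.union(new_ratings_RDD)
all_ratings = existing + new_batch

# __train_model() would now refit ALS on all 3 ratings
print(len(all_ratings))  # 3
```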

Getting individual ratings

Currently, the example code looks like:

my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
individual_movie_rating_RDD.take(1)

Should it be this instead?

my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(my_movie)
individual_movie_rating_RDD.collect()

logic error in function "get_top_ratings" when getting "user_unrated_movies_RDD"

In engine.py, function get_top_ratings contains:

user_unrated_movies_RDD = self.movies_RDD.filter(lambda rating: not rating[1]==user_id).map(lambda x: (user_id, x[0]))

Each element of self.movies_RDD is (movie_id, movie_title, movie_category), so rating[1] is the movie title, not a user id. I suspect self.movies_RDD should be self.ratings_RDD. Please check.
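To illustrate the point, here is a plain-Python analogue (no Spark needed) of building the unrated-movie pairs from ratings triples; the names mirror engine.py, but the set-based logic below is my own sketch of the intended behavior:

```python
# Each ratings element is (user_id, movie_id, rating); we keep the movies
# this user has NOT rated and pair each with the user id.
ratings = [(1, 10, 4.0), (1, 20, 3.0), (2, 30, 5.0)]
user_id = 1

rated = {m for (u, m, _) in ratings if u == user_id}   # {10, 20}
all_movie_ids = {m for (_, m, _) in ratings}           # {10, 20, 30}
user_unrated = [(user_id, m) for m in all_movie_ids - rated]

print(user_unrated)  # [(1, 30)]
```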

duplicates of new user unrated movies passed to predict

new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0])))

The list of unrated movies contains duplicates:

print(new_user_unrated_movies_RDD.take(10))
[(0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

Should a .distinct() be added?

new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0]))).distinct()
print(new_user_unrated_movies_RDD.take(10))
[(0, 378), (0, 1934), (0, 3282), (0, 5606), (0, 862), (0, 2146), (0, 3766), (0, 1330), (0, 2630), (0, 4970)]

The predict function that receives new_user_unrated_movies_RDD:

# Use the input RDD, new_user_unrated_movies_RDD, with new_ratings_model.predictAll() to predict new ratings for the movies
new_user_recommendations_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
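A plain-Python analogue of what .distinct() does to those pairs (the data shape is taken from the take(10) output above):

```python
# Duplicate (user, movie) pairs collapse to one entry each, so predictAll
# is not asked to score the same movie repeatedly.
pairs = [(0, 1), (0, 1), (0, 1), (0, 5)]
deduped = sorted(set(pairs))   # equivalent of .distinct() on the RDD
print(deduped)  # [(0, 1), (0, 5)]
```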

localhost:5432/0/ratings/top/10

spark-movie-lens/engine.py", line 80, in get_top_ratings
user_unrated_movies_rdd = self.movies_rdd.filter(lambda rating: not rating[1] == user_id)\

AttributeError: RecommendationEngine instance has no attribute 'movies_rdd'

real-time recommend

I want to build a real-time recommender system. How can I achieve that with your spark-movie-lens project? Can you give me some suggestions? Thank you very much.

StackOverflow

Getting a StackOverflowError while running the application engine:
An error occurred while calling o90.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 49, 192.168.110.130): java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2846)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1455)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invok

engine.iteration

Hi jadianes,
Thank you for all your work; it has really helped me. But when the number of iterations in the engine is greater than 5, it causes an error on my system, and I don't think 5 iterations are enough for good recommendations. Can you suggest a way to fix it? The error is:

File "F:\bitirme\spark-2.0.1-bin-hadoop2.7\python\pyspark\mllib\common.py", line 123, in callJavaFunc
17/04/24 22:58:09 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-7,5,main]
java.lang.StackOverflowError
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:147)
at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:52)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)


ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(5,1493063889491,JobFailed(org.apache.spark.SparkException: Job 5 cancelled because SparkContext was shut down)

my system specifications
i7 2nd gen
6 GB RAM
SSD 550/440 MB/s
ASUS N53SV laptop
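One commonly suggested remedy for StackOverflowError at higher ALS iteration counts is enabling checkpointing, which periodically materializes the RDD lineage instead of replaying it. SparkContext.setCheckpointDir is a real Spark API; the stand-in class below only demonstrates the call shape so the sketch runs without a Spark install:

```python
def configure_checkpointing(sc, path="checkpoint/"):
    """Give Spark a checkpoint directory so long ALS lineages are
    periodically truncated instead of replayed (sc: SparkContext)."""
    sc.setCheckpointDir(path)

# Minimal stand-in so this sketch runs without pyspark installed:
class FakeSparkContext:
    def setCheckpointDir(self, p):
        self.checkpoint_dir = p

ctx = FakeSparkContext()
configure_checkpointing(ctx)
print(ctx.checkpoint_dir)  # checkpoint/
```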

New User Ratings

Hi Jose, Great job !
I'm new to GitHub, so please pardon me if this should not be reported as an issue, but I wanted to bring to your attention the ratings we provide to the complete dataset for a new user.

The range for new-user ratings seems to be [0, 10], and when the recommendation engine makes predictions it produces ratings in a similar range. Shouldn't it be the [0, 5] range? When I supply ratings in [0, 5], it predicts movie ratings in [0, 5], but the predictions are drastically different from before. Am I missing something here?
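If the new-user ratings really were entered on a 0-10 scale, halving them maps onto MovieLens's 0-5 scale. A trivial sketch (the repo itself does no such rescaling; this only illustrates the range question):

```python
# Map ratings given on a [0, 10] scale onto MovieLens's [0, 5] scale.
raw = [8, 9, 10, 6]
rescaled = [r / 2.0 for r in raw]
print(rescaled)  # [4.0, 4.5, 5.0, 3.0]
```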

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist

I am using the same notebook on Cloudera's quickstart VM with Anaconda installed. I have made no other changes.

On this step:

small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]

it gives an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-21-61849ee50ee7> in <module>()
----> 1 small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1265         """
   1266         items = []
-> 1267         totalParts = self.getNumPartitions()
   1268         partsScanned = 0
   1269 

/usr/lib/spark/python/pyspark/rdd.py in getNumPartitions(self)
    354         2
    355         """
--> 356         return self._jrdd.partitions().size()
    357 
    358     def filter(self, f):

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o108.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/datasets/ml-latest-small/ratings.csv
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

I checked the previous step:

small_ratings_raw_data

This gives the result:

/home/cloudera/datasets/ml-latest-small/ratings.csv MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:-2

Could you please help me with this?
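The traceback shows Spark resolving the relative path against HDFS (hdfs://quickstart.cloudera:8020/...), where the file does not exist. Two common fixes: copy the file into HDFS with hdfs dfs -put, or pass an explicit file:// URI so textFile reads the VM's local filesystem. A sketch of the latter (the path is taken from the traceback):

```python
# Prefix the local path with file:// so Spark does not resolve it
# against the default filesystem (HDFS on the Cloudera quickstart VM).
local_path = "/home/cloudera/datasets/ml-latest-small/ratings.csv"
uri = "file://" + local_path
print(uri)  # file:///home/cloudera/datasets/ml-latest-small/ratings.csv
```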

Unable to proceed past stage 7.0 (OutOfMemoryError: Java heap space)

py4j.protocol.Py4JJavaError: An error occurred while calling o96.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 7.0 failed 1 times, most recent failure: Lost task 5.0 in stage 7.0 (TID 54, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

Unable to proceed further. Any help is much appreciated!
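Heap-space failures during trainALSModel on the complete dataset are usually addressed by giving the driver and executors more memory. Assuming the server is launched through spark-submit (the values below are examples to tune, not recommendations):

```shell
# Raise driver and executor heap sizes before retraining on the full data.
spark-submit --driver-memory 4g --executor-memory 4g server.py
```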

self.seed

FYI:

spark-movie-lens/engine.py", line 121
    self.seed = 5L
SyntaxError: invalid syntax

Python 3.5 does not support this.
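Python 3 removed the long type and its L literal suffix (plain int is now unbounded), so the fix is simply to drop the suffix:

```python
# In engine.py:  self.seed = 5L  ->  self.seed = 5
seed = 5
print(type(seed).__name__)  # int
```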

Invalid Syntax in engine.py - line 115, self.seed = 5L

(C:\Program Files\Anaconda3) F:\Data Science\movielens\spark-movie-lens>server.py
Traceback (most recent call last):
File "F:\Data Science\movielens\spark-movie-lens\server.py", line 3, in
from app import create_app
File "F:\Data Science\movielens\spark-movie-lens\app.py", line 5, in
from engine import RecommendationEngine
File "F:\Data Science\movielens\spark-movie-lens\engine.py", line 115
self.seed = 5L
^
SyntaxError: invalid syntax

Importing Spark

Hello,
As I was following the guide, I found that the variable sc was not defined; I figured it belongs to Spark.
However, I don't know how to configure Spark to run the notebook.
I'm on Windows; any help?
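One way to make sc available is to launch the notebook through the pyspark driver, which predefines a SparkContext named sc. The paths below are placeholders (on Windows, use set instead of export and adjust the py4j zip name to the installed version):

```shell
export SPARK_HOME=/path/to/spark
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$SPARK_HOME/bin/pyspark   # opens the notebook with sc predefined
```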

GETing top recommendations shows the same movie

Hello,

I've managed to run the project locally, and the output from getting the top recommendations shows the same movie repeated.
Does anyone else experience the same behavior?
Note that I ran it with the exact source files from this repo.

The output looks like this:
"[["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30]]"

Thank you.
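The root cause is likely duplicate rows entering the title join, but as a downstream workaround the duplicates can be collapsed by title. A plain-Python sketch using the [title, rating, count] shape from the output above (the second title is made up for illustration):

```python
# Keep only the first entry per movie title.
recs = [["The War (2007)", 8.836, 30],
        ["The War (2007)", 8.836, 30],
        ["Other Movie (1999)", 8.1, 40]]

seen, deduped = set(), []
for title, rating, count in recs:
    if title not in seen:
        seen.add(title)
        deduped.append([title, rating, count])

print(len(deduped))  # 2
```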
