
funk-svd's Introduction

⚡ funk-svd

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize contest.

Numba is used to speed up the algorithm, enabling it to run more than 10 times faster than Surprise's Cython implementation (cf. the benchmark notebook).

| Movielens 20M | RMSE | MAE  | Time          |
|:--------------|:----:|:----:|:-------------:|
| Surprise      | 0.88 | 0.68 | 10 min 40 sec |
| Funk-svd      | 0.88 | 0.68 | 42 sec        |

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75

Funk SVD for recommendation in a nutshell

We have a huge sparse rating matrix R, storing the known ratings r_ui for a set of users u and items i; most of its entries are unknown.

The idea is to estimate the unknown ratings by factorizing the rating matrix into two much smaller matrices representing user and item characteristics:

    R ≈ P · Qᵀ

We call P and Q the user and item latent factor matrices. Then, by applying the dot product between both matrices we can reconstruct our rating matrix. The trick is that the empty cells will now contain estimated ratings.
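The reconstruction step above can be sketched with NumPy, using tiny made-up factor matrices (all values below are illustrative, not learned):

```python
import numpy as np

# Tiny made-up latent factors: 4 users and 3 items, 2 factors each.
P = np.array([[0.5, 1.0],
              [1.2, 0.3],
              [0.8, 0.8],
              [0.1, 1.5]])   # user factors, shape (4, 2)
Q = np.array([[1.0, 0.5],
              [0.2, 1.1],
              [0.9, 0.9]])   # item factors, shape (3, 2)

# The dot product of the two factor matrices rebuilds a full rating
# matrix: cells that were empty in the sparse matrix now hold
# estimated ratings.
R_hat = P @ Q.T
print(R_hat.shape)  # (4, 3)
```

Each cell R_hat[u, i] is simply the dot product of user row u with item row i.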

In order to get more accurate results, the global average rating mu as well as the user bias b_u and item bias b_i are used in addition. The parameters are learned by minimizing the regularized squared error over the known ratings:

    min  Σ_{(u,i) ∈ K} (r_ui − (mu + b_u + b_i + p_u · q_i))² + λ (b_u² + b_i² + ‖p_u‖² + ‖q_i‖²)

where K stands for the set of known ratings.

Then, we can estimate any rating by applying:

    r̂_ui = mu + b_u + b_i + p_u · q_i

The learning step consists in performing SGD where, for each known rating, the prediction error e_ui = r_ui − r̂_ui is computed and the biases and latent factors are updated as follows:

    b_u ← b_u + α (e_ui − λ b_u)
    b_i ← b_i + α (e_ui − λ b_i)
    p_u ← p_u + α (e_ui · q_i − λ p_u)
    q_i ← q_i + α (e_ui · p_u − λ q_i)

where α (alpha) is the learning rate and λ (lambda) is the regularization term.
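A minimal sketch of one such update for a single rating; the default lr and reg values are illustrative, not the library's:

```python
import numpy as np

def sgd_step(r, pu, qi, bu, bi, mu, lr=0.005, reg=0.02):
    """One Funk-SVD update for a single known rating r.

    pu, qi: the user's and item's latent factor vectors.
    bu, bi: the user and item biases.  mu: global mean rating.
    lr is alpha (learning rate), reg is lambda (regularization).
    """
    err = r - (mu + bu + bi + pu @ qi)        # prediction error e_ui
    bu = bu + lr * (err - reg * bu)           # bias updates
    bi = bi + lr * (err - reg * bi)
    pu_new = pu + lr * (err * qi - reg * pu)  # factor updates; the qi update
    qi_new = qi + lr * (err * pu - reg * qi)  # deliberately uses the old pu
    return pu_new, qi_new, bu, bi
```

One pass of this update over all known ratings is one epoch; training repeats it n_epochs times (or until early stopping triggers).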


License

MIT license, see here.

funk-svd's People

Contributors

gbolmier, guedes-joaofelipe, maxhalford, sicot-f, sicotfre


funk-svd's Issues

(Feature) Recommend top N items given a user

Querying every item on the table would be inefficient. There could be a way to make this process simpler by multiplying a user's vector with the whole item matrix.
Guessing: `np.dot(self.pu_[u_ix], self.qi_) + self.bi_ + self.bu_[u_ix]`
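This can indeed be done in one vectorized pass over the item factor matrix. Below is a minimal sketch; the parameter names mirror the fitted model's arrays (user factors, item factors, biases, global mean) but are illustrative stand-ins, not the library's API. Note that the guess above omits the global mean, which the full prediction includes.

```python
import numpy as np

def top_n(u_ix, pu, qi, bu, bi, mu, n=10, seen=None):
    """Return indices of the n best-scoring items for user u_ix.

    pu: user factors (n_users, k); qi: item factors (n_items, k);
    bu, bi: bias vectors; mu: global mean rating.
    """
    # One score per item, computed in a single vectorized pass.
    scores = mu + bu[u_ix] + bi + qi @ pu[u_ix]
    if seen is not None:
        scores[list(seen)] = -np.inf   # mask already-rated items
    return np.argsort(scores)[::-1][:n]
```

Ranking this way avoids querying items one at a time: the whole score vector comes from a single matrix-vector product.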

Documentation: Difference between this and "regular" SVD

As Wikipedia suggests, Funk's method does not use the regular three-matrix form of SVD, but rather a modified two-matrix factorization (akin to NMF) with faster convergence and speed.

With reference to Funk: R ≈ P · Qᵀ

With reference to SVD: M = U · Σ · Vᵀ

Would it be good to explain how the optimization differs from NMF?

TypeError: 'Series' objects are mutable, thus they cannot be hashed

When run benchmark.ipynb, got the following error:


TypeError Traceback (most recent call last)
in <module>
8 pool = mp.Pool(processes=n_process)
9
---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)]
11
12 results = [

in <listcomp>(.0)
8 pool = mp.Pool(processes=n_process)
9
---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)]
11
12 results = [

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
3467 kwargs["level"] = kwargs.pop("level", 0) + 1
3468 kwargs["target"] = None
-> 3469 res = self.eval(expr, **kwargs)
3470
3471 try:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
3597 kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
3598
-> 3599 return _eval(expr, inplace=inplace, **kwargs)
3600
3601 def select_dtypes(self, include=None, exclude=None) -> DataFrame:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
345 eng = ENGINES[engine]
346 eng_inst = eng(parsed_expr)
--> 347 ret = eng_inst.evaluate()
348
349 if parsed_expr.assigner is None:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in evaluate(self)
71
72 # make sure no names in resolvers and locals/globals clash
---> 73 res = self._evaluate()
74 return reconstruct_object(
75 self.result_type, res, self.aligned_axes, self.expr.terms.return_type

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _evaluate(self)
111 env = self.expr.env
112 scope = env.full_scope
--> 113 _check_ne_builtin_clash(self.expr)
114 return ne.evaluate(s, local_dict=scope)
115

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _check_ne_builtin_clash(expr)
27 Terms can contain
28 """
---> 29 names = expr.names
30 overlap = names & _ne_builtins
31

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/expr.py in names(self)
823 """
824 if is_term(self.terms):
--> 825 return frozenset([self.terms.name])
826 return frozenset(term.name for term in com.flatten(self.terms))
827

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __hash__(self)
1783
1784 def __hash__(self) -> int:
-> 1785 raise TypeError(
1786 f"{repr(type(self).__name__)} objects are mutable, "
1787 f"thus they cannot be hashed"

TypeError: 'Series' objects are mutable, thus they cannot be hashed
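The error comes from pandas' numexpr engine trying to hash a Series while evaluating the `df.query` expression. A workaround sketch (with a toy DataFrame standing in for the one in benchmark.ipynb) is to skip `query` and use plain boolean indexing:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ratings DataFrame in benchmark.ipynb.
df = pd.DataFrame({'u_id': [1, 1, 2, 3, 4], 'rating': [5, 3, 4, 2, 1]})
users = df['u_id'].unique()
n_process = 2

# Plain boolean indexing never goes through df.query's numexpr
# engine, so no pandas Series ever needs to be hashed.
df_splitted = [df[df['u_id'].isin(users_subset)]
               for users_subset in np.array_split(users, n_process)]
print([len(part) for part in df_splitted])  # [3, 2]
```

Passing `engine='python'` to `df.query` should also sidestep numexpr, at some cost in speed.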

verbose mode

Hi. It would be nice for svd.fit() to have a verbose mode where the train and validation errors are printed at every iteration. Currently only the validation error is printed, and even that only when early_stopping=True.

Documentation: Dataset shape requirements

For SVD to be usable in other settings, it would be useful to document the accepted data format and the lifecycle of the SVD object.

  1. Accepted inputs: from the MovieLens usage we can infer a user-item-rating format for training, while prediction uses a user-item format. These format specifications should be documented explicitly for alternative usage, since DataFrame columns can easily be misordered.
  2. Expanding the dataset: since the ID hash tables are built inside the SVD object, every time the data grows the SVD object has to be rebuilt from scratch instead of extending the _preprocess_data fields accordingly.
  3. Re-fitting: if new data arrives, the SVD latent factor matrices and bias vectors should be updated from it. However, whether the update at time step X should also include the data from time steps 1 to X-1 is not documented.
  4. Caching to an external object: since the SVD sometimes has to go offline, caching the fitted object is necessary. It would be useful to know whether it is better to store the data as CSVs (for readability) or as Pickle/DataTables (for compactness and exact reconstruction).
  5. Explanation of the approach: it should be noted that, unlike other repos that operate on dense or sparse matrices, this one follows a machine-learning style: it updates "by the entry" rather than "by the table". An explanation of the validation/test split is also needed (especially in a practical, non-evaluation setting). https://www.wikiwand.com/en/Training,_validation,_and_test_sets
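On point 4, a minimal sketch of caching via pickle, using an in-memory buffer and a stand-in object with made-up attribute names; a real fitted SVD instance would be dumped to a file the same way:

```python
import io
import pickle

class TinyModel:
    """Stand-in for a fitted SVD object; any picklable object works."""
    def __init__(self):
        self.pu_ = [[0.1, 0.2]]       # illustrative attribute names
        self.global_mean_ = 3.5

model = TinyModel()

# One dump captures the whole fitted state (factors, biases, id
# mappings); a CSV export would force rebuilding the object by hand.
buf = io.BytesIO()                    # a real deployment would use a file
pickle.dump(model, buf)

buf.seek(0)
restored = pickle.load(buf)
print(restored.global_mean_)  # 3.5
```

Pickle round-trips the exact object; CSVs remain preferable only when the factors must be human-readable or consumed by other tools.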
