
funk-svd's Introduction

⚡ funk-svd

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize contest.

Numba is used to speed up the algorithm, enabling it to run more than 10 times faster than Surprise's Cython implementation (cf. the benchmark notebook).

| Movielens 20M | RMSE | MAE  | Time          |
|:--------------|:----:|:----:|:-------------:|
| Surprise      | 0.88 | 0.68 | 10 min 40 sec |
| Funk-svd      | 0.88 | 0.68 | 42 sec        |

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75

Funk SVD for recommendation in a nutshell

We have a huge sparse rating matrix R, storing the known ratings r_ui for a set of users u and items i; most of its entries are unknown.

The idea is to estimate the unknown ratings by factorizing the rating matrix into two much smaller matrices representing user and item characteristics:

    R ≈ P · Qᵀ

We call P and Q the user and item latent factor matrices. Then, by applying the dot product between both matrices we can reconstruct our rating matrix. The trick is that the empty cells will now contain estimated ratings.
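The reconstruction step above can be sketched with NumPy, using tiny made-up factor matrices (all values below are illustrative, not learned):

```python
import numpy as np

# Tiny made-up latent factors: 4 users and 3 items, 2 factors each.
P = np.array([[0.5, 1.0],
              [1.2, 0.3],
              [0.8, 0.8],
              [0.1, 1.5]])   # user factors, shape (4, 2)
Q = np.array([[1.0, 0.5],
              [0.2, 1.1],
              [0.9, 0.9]])   # item factors, shape (3, 2)

# The dot product of the two factor matrices rebuilds a full rating
# matrix: cells that were empty in the sparse matrix now hold
# estimated ratings.
R_hat = P @ Q.T
print(R_hat.shape)  # (4, 3)
```

Each cell R_hat[u, i] is simply the dot product of user row u with item row i.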

In order to get more accurate results, the global average rating mu as well as the user bias b_u and item bias b_i are used in addition. The parameters are learned by minimizing the regularized squared error over the known ratings:

    min  Σ_{(u,i) ∈ K} (r_ui − (mu + b_u + b_i + p_u · q_i))² + λ (b_u² + b_i² + ‖p_u‖² + ‖q_i‖²)

where K stands for the set of known ratings.

Then, we can estimate any rating by applying:

    r̂_ui = mu + b_u + b_i + p_u · q_i

The learning step consists in performing SGD where, for each known rating, the prediction error e_ui = r_ui − r̂_ui is computed and the biases and latent factors are updated as follows:

    b_u ← b_u + α (e_ui − λ b_u)
    b_i ← b_i + α (e_ui − λ b_i)
    p_u ← p_u + α (e_ui · q_i − λ p_u)
    q_i ← q_i + α (e_ui · p_u − λ q_i)

where α (alpha) is the learning rate and λ (lambda) is the regularization term.
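A minimal sketch of one such update for a single rating; the default lr and reg values are illustrative, not the library's:

```python
import numpy as np

def sgd_step(r, pu, qi, bu, bi, mu, lr=0.005, reg=0.02):
    """One Funk-SVD update for a single known rating r.

    pu, qi: the user's and item's latent factor vectors.
    bu, bi: the user and item biases.  mu: global mean rating.
    lr is alpha (learning rate), reg is lambda (regularization).
    """
    err = r - (mu + bu + bi + pu @ qi)        # prediction error e_ui
    bu = bu + lr * (err - reg * bu)           # bias updates
    bi = bi + lr * (err - reg * bi)
    pu_new = pu + lr * (err * qi - reg * pu)  # factor updates; the qi update
    qi_new = qi + lr * (err * pu - reg * qi)  # deliberately uses the old pu
    return pu_new, qi_new, bu, bi
```

One pass of this update over all known ratings is one epoch; training repeats it n_epochs times (or until early stopping triggers).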


License

MIT license, see here.

funk-svd's People

Contributors

gbolmier, guedes-joaofelipe, maxhalford, sicot-f, sicotfre


funk-svd's Issues

(Feature) Recommend top N items given a user

Querying every item on the table would be inefficient. There could be a way to make this process simpler by multiplying a user's vector with the whole item matrix.
Guessing: `np.dot(self.pu_[u_ix], self.qi_) + self.bi_ + self.bu_[u_ix]`
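This can indeed be done in one vectorized pass over the item factor matrix. Below is a minimal sketch; the parameter names mirror the fitted model's arrays (user factors, item factors, biases, global mean) but are illustrative stand-ins, not the library's API. Note that the guess above omits the global mean, which the full prediction includes.

```python
import numpy as np

def top_n(u_ix, pu, qi, bu, bi, mu, n=10, seen=None):
    """Return indices of the n best-scoring items for user u_ix.

    pu: user factors (n_users, k); qi: item factors (n_items, k);
    bu, bi: bias vectors; mu: global mean rating.
    """
    # One score per item, computed in a single vectorized pass.
    scores = mu + bu[u_ix] + bi + qi @ pu[u_ix]
    if seen is not None:
        scores[list(seen)] = -np.inf   # mask already-rated items
    return np.argsort(scores)[::-1][:n]
```

Ranking this way avoids querying items one at a time: the whole score vector comes from a single matrix-vector product.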

Documentation: Difference between this and "regular" SVD

As Wikipedia suggests, Funk's method does not use the regular three-matrix form of SVD, but rather a modified two-matrix factorization (akin to NMF) with faster convergence and speed.

With reference to Funk: R ≈ P · Qᵀ

With reference to SVD: M = U · Σ · Vᵀ

Would it be good to explain how the optimization differs from NMF?

TypeError: 'Series' objects are mutable, thus they cannot be hashed

When run benchmark.ipynb, got the following error:


TypeError Traceback (most recent call last)
in <module>
8 pool = mp.Pool(processes=n_process)
9
---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)]
11
12 results = [

in <listcomp>(.0)
8 pool = mp.Pool(processes=n_process)
9
---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)]
11
12 results = [

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
3467 kwargs["level"] = kwargs.pop("level", 0) + 1
3468 kwargs["target"] = None
-> 3469 res = self.eval(expr, **kwargs)
3470
3471 try:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
3597 kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
3598
-> 3599 return _eval(expr, inplace=inplace, **kwargs)
3600
3601 def select_dtypes(self, include=None, exclude=None) -> DataFrame:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
345 eng = ENGINES[engine]
346 eng_inst = eng(parsed_expr)
--> 347 ret = eng_inst.evaluate()
348
349 if parsed_expr.assigner is None:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in evaluate(self)
71
72 # make sure no names in resolvers and locals/globals clash
---> 73 res = self._evaluate()
74 return reconstruct_object(
75 self.result_type, res, self.aligned_axes, self.expr.terms.return_type

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _evaluate(self)
111 env = self.expr.env
112 scope = env.full_scope
--> 113 _check_ne_builtin_clash(self.expr)
114 return ne.evaluate(s, local_dict=scope)
115

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _check_ne_builtin_clash(expr)
27 Terms can contain
28 """
---> 29 names = expr.names
30 overlap = names & _ne_builtins
31

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/expr.py in names(self)
823 """
824 if is_term(self.terms):
--> 825 return frozenset([self.terms.name])
826 return frozenset(term.name for term in com.flatten(self.terms))
827

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __hash__(self)
1783
1784 def __hash__(self) -> int:
-> 1785 raise TypeError(
1786 f"{repr(type(self).__name__)} objects are mutable, "
1787 f"thus they cannot be hashed"

TypeError: 'Series' objects are mutable, thus they cannot be hashed
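The error comes from pandas' numexpr engine trying to hash a Series while evaluating the `df.query` expression. A workaround sketch (with a toy DataFrame standing in for the one in benchmark.ipynb) is to skip `query` and use plain boolean indexing:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ratings DataFrame in benchmark.ipynb.
df = pd.DataFrame({'u_id': [1, 1, 2, 3, 4], 'rating': [5, 3, 4, 2, 1]})
users = df['u_id'].unique()
n_process = 2

# Plain boolean indexing never goes through df.query's numexpr
# engine, so no pandas Series ever needs to be hashed.
df_splitted = [df[df['u_id'].isin(users_subset)]
               for users_subset in np.array_split(users, n_process)]
print([len(part) for part in df_splitted])  # [3, 2]
```

Passing `engine='python'` to `df.query` should also sidestep numexpr, at some cost in speed.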

verbose mode

Hi. It would be nice for svd.fit() to have a verbose mode where the train and validation errors are printed at every iteration. Currently only the validation error is printed, and even that only when early_stopping=True.

Documentation: Dataset shape requirements

For SVD to be usable in other settings, it would be useful to document the accepted data format and the lifecycle of the SVD object.

  1. Accepted inputs: from the MovieLens usage we can infer a user-item-rating format for training, while prediction uses a user-item format. These format specifications should be documented explicitly for alternative usage, since DataFrame columns can easily be misordered.
  2. Expanding the dataset: since the ID hash tables are built inside the SVD object, every time the data grows the SVD object has to be rebuilt from scratch instead of extending the _preprocess_data fields accordingly.
  3. Re-fitting: if new data arrives, the SVD latent factor matrices and bias vectors should be updated from it. However, whether the update at time step X should also include the data from time steps 1 to X-1 is not documented.
  4. Caching to an external object: since the SVD sometimes has to go offline, caching the fitted object is necessary. It would be useful to know whether it is better to store the data as CSVs (for readability) or as Pickle/DataTables (for compactness and exact reconstruction).
  5. Explanation of the approach: it should be noted that, unlike other repos that operate on dense or sparse matrices, this one follows a machine-learning style: it updates "by the entry" rather than "by the table". An explanation of the validation/test split is also needed (especially in a practical, non-evaluation setting). https://www.wikiwand.com/en/Training,_validation,_and_test_sets
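On point 4, a minimal sketch of caching via pickle, using an in-memory buffer and a stand-in object with made-up attribute names; a real fitted SVD instance would be dumped to a file the same way:

```python
import io
import pickle

class TinyModel:
    """Stand-in for a fitted SVD object; any picklable object works."""
    def __init__(self):
        self.pu_ = [[0.1, 0.2]]       # illustrative attribute names
        self.global_mean_ = 3.5

model = TinyModel()

# One dump captures the whole fitted state (factors, biases, id
# mappings); a CSV export would force rebuilding the object by hand.
buf = io.BytesIO()                    # a real deployment would use a file
pickle.dump(model, buf)

buf.seek(0)
restored = pickle.load(buf)
print(restored.global_mean_)  # 3.5
```

Pickle round-trips the exact object; CSVs remain preferable only when the factors must be human-readable or consumed by other tools.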
