dawenl / vae_cf
Variational autoencoders for collaborative filtering
License: Apache License 2.0
First of all, thanks for a great paper and code! I wanted to ask: is there a way to normalize the predicted values, i.e. the un-normalized multinomial probabilities?
Would you publish the code for the β selection described in Section 2.2 of your paper?
Or, failing that, could you give details of the β training (annealing) procedure?
I don't understand how to train with the validation set.
Thanks a lot for the Jupyter notebook :)
Do you plan to prepare one or several Python files? It would make it easier to run it on remote GPUs.
One possible start: https://stackoverflow.com/a/19779226/827989
Hi, I read your paper and ran your experiment code in detail; it is really nice work. Now I am running some experiments of my own and have a question about how to split the data. In your paper and code, users are split into training/validation/test sets: the model is trained on the entire history of the training users, and at evaluation time it takes part of the history of each held-out user as input.
But Neural Collaborative Filtering (NCF) splits the data at the (user, item) level, so I wonder how you split the data for NCF at the user level, since NCF learns an embedding for each user from the data. My guess is that during training we feed the entire history of the training users to learn their embeddings, and at test time we use part of each test user's history to learn that user's embedding, then use it to predict the rest of their history. But the training users' embeddings cannot influence the test users, so doesn't that make the training user set pointless?
The data-splitting method in your paper is different from the recommendation papers I have read: most of them use part of every user's history to learn embeddings for all users and evaluate on the remaining history of those same users. Hence my question.
I am sincerely looking forward to your reply.
Hi, I have two questions about the paper that I can't work out on my own; I hope you can give some detail or explanation about them.
The first concerns the assumption that the multinomial distribution is better suited to ranking metrics. Viewed another way, a multinomial distribution imposes a limited budget of probability mass, so the purchases of different goods are mutually exclusive. But in some situations purchases are not exclusive: buying a mobile phone and buying a phone case, for example, are not mutually exclusive.
The second concerns the experiment in Table 4, which compares the performance of different likelihood functions. As far as I know, most collaborative filtering methods that use Gaussian or logistic likelihoods rely on negative sampling or weighting. In equation (3) you show the Gaussian likelihood with c_{ui}, which effectively emphasizes only the entries equal to 1, and likewise in equation (4). But negative sampling, arguably the most important trick in recommendation, does not appear in your Gaussian, logistic, or multinomial likelihoods. NCF [1], CVAE [2], and many other methods use negative sampling (with Gaussian or logistic likelihoods), which boosts their performance, and I suspect the multinomial likelihood cannot accommodate negative sampling because of its mathematical form. So I wonder: can the multinomial likelihood beat a logistic likelihood with negative sampling? And did you run NCF with negative sampling?
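To make my question concrete, here is a minimal numpy sketch of the multinomial log-likelihood as I read it from the paper (my own code, not yours): because all items compete for the same probability mass, raising the score of the observed items implicitly pushes down every other item, with no separate negative-sampling step.

```python
import numpy as np

def multinomial_log_lik(logits, x):
    """Multinomial log-likelihood: sum_i x_i * log softmax(logits)_i.
    Items compete for a fixed budget of probability mass, so there is
    no explicit negative sampling -- unobserved items are penalized
    implicitly through the softmax normalizer."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log softmax
    return float(np.sum(x * log_probs))

x = np.array([0., 1., 0., 1.])       # user clicked items 1 and 3
better = np.array([0., 2., 0., 2.])  # higher scores on clicked items
worse = np.array([2., 0., 2., 0.])   # higher scores on unclicked items
assert multinomial_log_lik(better, x) > multinomial_log_lik(worse, x)
```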
[1] Xiangnan He et al., "Neural Collaborative Filtering," WWW 2017.
[2] "Collaborative Variational Autoencoder for Recommender Systems," KDD 2017.
Hello,
I noticed that you were not using the standard CDAE in your paper; would it be possible to share the implementation you used for CDAE?
I was looking for a CDAE implementation that can be plugged into your data-splitting scheme.
Thanks
I am proposing a modification to the filter_triplets function.
The proposed modification is to change:
if min_sc > 0:
    itemcount = get_count(tp, 'movieId')
    tp = tp[tp['movieId'].isin(itemcount.index[itemcount >= min_sc])]
if min_uc > 0:
    usercount = get_count(tp, 'userId')
    tp = tp[tp['userId'].isin(usercount.index[usercount >= min_uc])]
to:
if min_sc > 0:
    itemcount = get_count(tp, 'movieId')
    tp = tp[tp['movieId'].isin(itemcount[itemcount >= min_sc]['movieId'])]  # Modified
if min_uc > 0:
    usercount = get_count(tp, 'userId')
    tp = tp[tp['userId'].isin(usercount[usercount >= min_uc]['userId'])]  # Modified
The reason for this change is that 'movieId' and 'userId' values may not align perfectly with the index due to possible gaps or mismatches.
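To illustrate when each form works, here is a small self-contained check on toy data (my own example, assuming get_count is a groupby-based counter; not the actual dataset):

```python
import pandas as pd

tp = pd.DataFrame({'movieId': [10, 10, 20, 30, 30, 30]})

# If get_count returns a groupby(...).size() Series, its *index*
# already holds the actual movieId values, so the original
# `itemcount.index[itemcount >= min_sc]` form is correct:
itemcount = tp.groupby('movieId').size()
keep = itemcount.index[itemcount >= 2]
assert list(keep) == [10, 30]

# If get_count were instead changed to return a DataFrame (e.g. via
# .reset_index()), the ids live in a column, and a column-based
# lookup like the proposed modification would be the one that works:
itemcount_df = itemcount.reset_index(name='count')
keep_df = itemcount_df[itemcount_df['count'] >= 2]['movieId']
assert list(keep_df) == [10, 30]
```

So whether the modification is needed depends on what get_count actually returns in the version being run.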
Thank you for considering this adjustment.
Best regards,
JunHyuck
Hi Dawen,
Sorry to bother you. I am not familiar with this specific task setting, but I am interested in applying variational autoencoders to it. It would be really nice if you could help me understand the task when you are free.
As far as I understand, the VAE tries to reconstruct its input (in this task, a user-item vector such as [0,0,0,1,0,1], where 1 indicates the user watched that particular movie and 0 indicates they did not). I see the log-softmax layer at the end of the decoder, so the output of the VAE is a log-probability distribution over all items for one user.
My question is: at evaluation time, how can the VAE make a prediction (pred_val) if the decoder simply reconstructs the input from the latent space? I think I am misunderstanding vad_data_tr and vad_data_te.
for bnum, st_idx in enumerate(range(0, N_vad, batch_size_vad)):
    end_idx = min(st_idx + batch_size_vad, N_vad)
    X = vad_data_tr[idxlist_vad[st_idx:end_idx]]
    if sparse.isspmatrix(X):
        X = X.toarray()
    X = X.astype('float32')
    pred_val = sess.run(logits_var, feed_dict={dae.input_ph: X})
    # exclude examples from training and validation (if any)
    pred_val[X.nonzero()] = -np.inf
    ndcg_dist.append(NDCG_binary_at_k_batch(pred_val, vad_data_te[idxlist_vad[st_idx:end_idx]]))
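To check my understanding, here is a toy numpy version of just the masking step (my own sketch, not the repo's code): the model folds in the tr part of each held-out user's history, then masks the already-seen items with -inf so that only unseen items can appear in the ranking evaluated against the te part.

```python
import numpy as np

# Toy scores for 2 users over 5 items (stand-in for the logits output)
pred_val = np.array([[0.9, 0.1, 0.8, 0.2, 0.5],
                     [0.3, 0.7, 0.4, 0.6, 0.1]])
# Binary fold-in history (stand-in for vad_data_tr): items already seen
X = np.array([[1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0]], dtype=float)

# Seen items can never appear in the top-k ranking
pred_val[X.nonzero()] = -np.inf

top1 = pred_val.argmax(axis=1)
assert list(top1) == [2, 3]  # only unseen items get ranked
```

Is that the right reading, i.e. the "prediction" is the ranking of items the user has not yet interacted with?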
Thanks for your time.
Hi all, during training the Recall measure always returns NaN. Does that mean the model could not learn anything, i.e. that it could not produce a latent vector resembling the input vector? Even when I compute the recall measure on the training data it is still NaN :) . Is that equivalent to a training accuracy of 0?
And finally, does that mean this model is useless??
Thank you
Hi Dawen,
I am writing to ask whether you ever encountered the following error, shown in the snapshot attached to this message. It appeared after I converted your Jupyter notebook into a Python file named old_code.py and tried to run it.
After filtering, there are 9990682 watching events from 136677 users and 20720 movies (sparsity: 0.353%)
0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
7000 users sampled
8000 users sampled
9000 users sampled
0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
7000 users sampled
8000 users sampled
9000 users sampled
Traceback (most recent call last):
  File "old_code.py", line 250, in <module>
    train_data = numerize(train_plays)
  File "old_code.py", line 244, in numerize
    return pd.DataFrame(data={'uid': uid, 'sid': sid}, columns=['uid', 'sid'])
  File "C:\Software\Anaconda\lib\site-packages\pandas\core\frame.py", line 348, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Software\Anaconda\lib\site-packages\pandas\core\frame.py", line 439, in _init_dict
    index = extract_index(arrays[~missing])
  File "C:\Software\Anaconda\lib\site-packages\pandas\core\frame.py", line 7349, in extract_index
    raw_lengths.append(len(v))
TypeError: object of type 'map' has no len()
Apologies for the inconvenience, and thank you for your attention! :)
Hi, I have a question about the l2 normalization of the input.
In the q_graph method of Multi_VAE and the forward_pass method of Multi_DAE, why do you apply l2 normalization to the input vector?
I don't understand the purpose of that normalization.
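Just so we are talking about the same operation, here is my reading of it as a standalone sketch (my own code, not the repo's): each user's binary history row is scaled to unit l2 norm, so users with very different activity levels feed inputs of comparable magnitude into the encoder.

```python
import numpy as np

def l2_normalize_rows(x, eps=1e-12):
    """Scale each user's input vector to unit l2 norm, so heavy users
    (many interactions) and light users produce inputs of the same
    scale for the network."""
    norms = np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
    return x / np.maximum(norms, eps)

heavy = np.array([[1., 1., 1., 1.]])  # user with 4 interactions
light = np.array([[1., 0., 0., 0.]])  # user with 1 interaction
hn, ln = l2_normalize_rows(heavy), l2_normalize_rows(light)
assert np.isclose(np.linalg.norm(hn), 1.0)
assert np.isclose(np.linalg.norm(ln), 1.0)
```

Is controlling for user activity level the intended purpose, or is there another reason?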
Sorry to bother you, Thank you!
Hi all,
Is there a way to get the hit ratio along with NDCG from the code? Any help would be appreciated.
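In case it helps clarify what I mean, here is the rough shape I had in mind, modeled loosely on the repo's NDCG/Recall batch metrics (HR_at_k_batch is a hypothetical helper I wrote, not something from the repo):

```python
import numpy as np

def HR_at_k_batch(X_pred, heldout_batch, k=10):
    """Hit ratio @ k: for each user, 1.0 if any held-out item appears
    in the top-k predictions, else 0.0 (dense toy version)."""
    topk = np.argsort(-X_pred, axis=1)[:, :k]  # indices of top-k items
    hits = np.zeros(X_pred.shape[0])
    for u in range(X_pred.shape[0]):
        hits[u] = float(heldout_batch[u, topk[u]].sum() > 0)
    return hits

X_pred = np.array([[0.9, 0.2, 0.8, 0.1],
                   [0.1, 0.2, 0.3, 0.9]])
heldout = np.array([[0, 0, 1, 0],
                    [1, 0, 0, 0]])
hr = HR_at_k_batch(X_pred, heldout, k=2)
assert list(hr) == [1.0, 0.0]
```

Would something along these lines slot into the existing evaluation loop next to NDCG_binary_at_k_batch?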
Thanks
Hi, nice work applying variational autoencoders to recommendation. However, I am confused about the data-split method.
In the preprocessing.py,
unique_uid = user_activity.index
so unique_uid is the index of the active users rather than the actual uid (user_activity['userId']). Owing to the earlier filter operation, some userId values are removed, so some valid userId values near the end will not be considered if we adopt the index of user_activity rather than the actual uid. I guess this might be an error, or is there some other meaning to it?
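One sanity check I ran on a toy frame (my own example, assuming user_activity is built with groupby('userId').size()): in that case the Series index holds the actual userId values even after filtering, not positional row numbers.

```python
import pandas as pd

tp = pd.DataFrame({'userId': [5, 5, 9, 42, 42, 42]})
user_activity = tp.groupby('userId').size()
active = user_activity[user_activity >= 2]

# The index carries the real userId values (5 and 42 survive the
# filter; 9 is dropped), not positions 0..n-1, so taking `.index`
# would be safe under this assumption.
assert list(active.index) == [5, 42]
```

So my confusion may come down to whether user_activity is constructed this way or differently. Could you clarify?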
Looking forward to your reply, Thanks.
Best.
Do you have the code for downloading and preprocessing the Netflix and Million Song Datasets?
Thanks for the clear code.
Hello,
I ran your code with Python 3.5+ and it throws several errors. I suppose it is meant to be run with Python 2.7?
Some lines that could be edited for Python 3.5+:
uid = map(lambda x: profile2id[x], tp['userId'])
sid = map(lambda x: show2id[x], tp['movieId'])
Should be:
uid = list(map(lambda x: profile2id[x], tp['userId']))
sid = list(map(lambda x: show2id[x], tp['movieId']))
idxlist = range(N) should be idxlist = np.arange(N)
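The list() wrapper matters because map() became lazy in Python 3: it returns an iterator with no len(), which is exactly what pandas chokes on when building the DataFrame. A minimal demonstration:

```python
# Under Python 3, map() is lazy and has no len(); list() restores
# the eager Python 2 behaviour that the original code assumed.
ids = {'a': 0, 'b': 1}
uid = map(lambda x: ids[x], ['a', 'b', 'a'])
try:
    len(uid)
    raised = False
except TypeError:
    raised = True
assert raised
assert list(map(lambda x: ids[x], ['a', 'b', 'a'])) == [0, 1, 0]
```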
Hi,
I am playing around with the vae_cf implementation, and I tried it on my own dataset instead of ML-20M. It has the same structure (userId, itemId, rating), and I made sure the data types are the same.
I left the pre-processing part as it is and am using Google Colab to run the code. With the ML-20M dataset everything works just fine; with my dataset, however, I get NaN values in the ndcg_dist list.
Here I copy/paste the error I get:
InvalidArgumentError                      Traceback (most recent call last)
in ()
     82     ndcgs_vad.append(ndcg_)
     83     print("printing ndcg_var: ", ndcg_var)
---> 84     merged_valid_val = sess.run(merged_valid, feed_dict={ndcg_var: ndcg_, ndcg_dist_var: ndcg_dist})
     85     print("printing feed_dict:", feed_dict)
     86     summary_writer.add_summary(merged_valid_val, epoch)

3 frames
/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/client/session.pyc in _do_call(self, fn, *args)
   1382                 '\nsession_config.graph_options.rewrite_options.'
   1383                 'disable_meta_optimizer = True')
-> 1384             raise type(e)(node_def, op, message)
   1385
   1386     def _extend_graph(self):

InvalidArgumentError: Nan in summary histogram for: ndcg_at_k_hist_validation
  [[node ndcg_at_k_hist_validation (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Does anyone have a clue what is causing this issue?
I would really appreciate any kind of advice and suggestions you may give me.
Thank you very much.
Have a great day.
Hi, I am getting an error while running the notebook at the line where we import:
from tensorflow.contrib.layers import apply_regularization, l2_regularizer
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
in
     15
     16 import tensorflow as tf
---> 17 from tensorflow.contrib.layers import apply_regularization, l2_regularizer
     18
     19 import bottleneck as bn
ModuleNotFoundError: No module named 'tensorflow.contrib'
Please help