dawenl / vae_cf
Variational autoencoders for collaborative filtering
License: Apache License 2.0
First of all, thanks for a great paper and code! I wanted to ask: is there a way to normalize the predicted values, i.e. the un-normalized multinomial probabilities?
Would you publish the code for the β selection described in Section 2.2 of your paper?
Or, failing that, could you give details of the β training (annealing) procedure?
I don't understand how to train with the validation set.
Thanks a lot for the Jupyter notebook :)
Do you plan to prepare one or several Python files? It would make it easier to run it on remote GPUs.
One possible start: https://stackoverflow.com/a/19779226/827989
Hi, I read your paper and ran your experiment code in detail; it is really nice work. Now I am running some experiments of my own and have a question about how to split the data. In your paper and code, users are split into training/validation/test sets: the model is trained on the entire history of the training users, and at evaluation time it takes part of the history of each held-out user as input.
But Neural Collaborative Filtering (NCF) splits the data at the (user, item) level, so I wonder how you split the data for NCF at the user level, since NCF learns an embedding for each user from the data. My guess is that during training we feed the entire history of the training users to learn their embeddings, and at test time we use part of each test user's history to learn that user's embedding, then use it to predict the rest of their history. But the training users' embeddings cannot influence the test users, so doesn't that make the training user set pointless?
The data-splitting method in your paper is different from the recommendation papers I have read: most of them use part of every user's history to learn embeddings for all users and evaluate on the remaining history of those same users. Hence my question.
I am sincerely looking forward to your reply.
Hi, I have two questions about the paper that I can't work out on my own; I hope you can give some detail or explanation about them.
The first concerns the assumption that the multinomial distribution is better suited to ranking metrics. Viewed another way, a multinomial distribution imposes a limited budget of probability mass, so the purchases of different goods are mutually exclusive. But in some situations purchases are not exclusive: buying a mobile phone and buying a phone case, for example, are not mutually exclusive.
The second concerns the experiment in Table 4, which compares the performance of different likelihood functions. As far as I know, most collaborative filtering methods that use Gaussian or logistic likelihoods rely on negative sampling or weighting. In equation (3) you show the Gaussian likelihood with c_{ui}, which effectively emphasizes only the entries equal to 1, and likewise in equation (4). But negative sampling, arguably the most important trick in recommendation, does not appear in your Gaussian, logistic, or multinomial likelihoods. NCF [1], CVAE [2], and many other methods use negative sampling (with Gaussian or logistic likelihoods), which boosts their performance, and I suspect the multinomial likelihood cannot accommodate negative sampling because of its mathematical form. So I wonder: can the multinomial likelihood beat a logistic likelihood with negative sampling? And did you run NCF with negative sampling?
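To make my question concrete, here is a minimal numpy sketch of the multinomial log-likelihood as I read it from the paper (my own code, not yours): because all items compete for the same probability mass, raising the score of the observed items implicitly pushes down every other item, with no separate negative-sampling step.

```python
import numpy as np

def multinomial_log_lik(logits, x):
    """Multinomial log-likelihood: sum_i x_i * log softmax(logits)_i.
    Items compete for a fixed budget of probability mass, so there is
    no explicit negative sampling -- unobserved items are penalized
    implicitly through the softmax normalizer."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log softmax
    return float(np.sum(x * log_probs))

x = np.array([0., 1., 0., 1.])       # user clicked items 1 and 3
better = np.array([0., 2., 0., 2.])  # higher scores on clicked items
worse = np.array([2., 0., 2., 0.])   # higher scores on unclicked items
assert multinomial_log_lik(better, x) > multinomial_log_lik(worse, x)
```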
[1] Xiangnan He et al., "Neural Collaborative Filtering," WWW 2017.
[2] "Collaborative Variational Autoencoder for Recommender Systems," KDD 2017.
Hello,
I noticed that you were not using the standard CDAE in your paper; would it be possible to share the implementation you used for CDAE?
I was looking for a CDAE implementation that can be plugged into your data-splitting scheme.
Thanks
I am proposing a modification to the filter_triplets function.
The proposed modification is to change:
if min_sc > 0:
    itemcount = get_count(tp, 'movieId')
    tp = tp[tp['movieId'].isin(itemcount.index[itemcount >= min_sc])]
if min_uc > 0:
    usercount = get_count(tp, 'userId')
    tp = tp[tp['userId'].isin(usercount.index[usercount >= min_uc])]
to:
if min_sc > 0:
    itemcount = get_count(tp, 'movieId')
    tp = tp[tp['movieId'].isin(itemcount[itemcount >= min_sc]['movieId'])]  # Modified
if min_uc > 0:
    usercount = get_count(tp, 'userId')
    tp = tp[tp['userId'].isin(usercount[usercount >= min_uc]['userId'])]  # Modified
The reason for this change is that 'movieId' and 'userId' values may not align perfectly with the index due to possible gaps or mismatches.
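To illustrate when each form works, here is a small self-contained check on toy data (my own example, assuming get_count is a groupby-based counter; not the actual dataset):

```python
import pandas as pd

tp = pd.DataFrame({'movieId': [10, 10, 20, 30, 30, 30]})

# If get_count returns a groupby(...).size() Series, its *index*
# already holds the actual movieId values, so the original
# `itemcount.index[itemcount >= min_sc]` form is correct:
itemcount = tp.groupby('movieId').size()
keep = itemcount.index[itemcount >= 2]
assert list(keep) == [10, 30]

# If get_count were instead changed to return a DataFrame (e.g. via
# .reset_index()), the ids live in a column, and a column-based
# lookup like the proposed modification would be the one that works:
itemcount_df = itemcount.reset_index(name='count')
keep_df = itemcount_df[itemcount_df['count'] >= 2]['movieId']
assert list(keep_df) == [10, 30]
```

So whether the modification is needed depends on what get_count actually returns in the version being run.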
Thank you for considering this adjustment.
Best regards,
JunHyuck
Hi Dawen,
Sorry to bother you. I am not familiar with this specific task setting, but I am interested in applying variational autoencoders to it. It would be really nice if you could help me understand the task when you are free.
As far as I understand, the VAE tries to reconstruct its input (in this task, a user-item vector such as [0,0,0,1,0,1], where 1 indicates the user watched that particular movie and 0 indicates they did not). I see the log-softmax layer at the end of the decoder, so the output of the VAE is a log-probability distribution over all items for one user.
My question is: at evaluation time, how can the VAE make a prediction (pred_val) if the decoder simply reconstructs the input from the latent space? I think I am misunderstanding vad_data_tr and vad_data_te.
for bnum, st_idx in enumerate(range(0, N_vad, batch_size_vad)):
    end_idx = min(st_idx + batch_size_vad, N_vad)
    X = vad_data_tr[idxlist_vad[st_idx:end_idx]]
    if sparse.isspmatrix(X):
        X = X.toarray()
    X = X.astype('float32')
    pred_val = sess.run(logits_var, feed_dict={dae.input_ph: X})
    # exclude examples from training and validation (if any)
    pred_val[X.nonzero()] = -np.inf
    ndcg_dist.append(NDCG_binary_at_k_batch(pred_val, vad_data_te[idxlist_vad[st_idx:end_idx]]))
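To check my understanding, here is a toy numpy version of just the masking step (my own sketch, not the repo's code): the model folds in the tr part of each held-out user's history, then masks the already-seen items with -inf so that only unseen items can appear in the ranking evaluated against the te part.

```python
import numpy as np

# Toy scores for 2 users over 5 items (stand-in for the logits output)
pred_val = np.array([[0.9, 0.1, 0.8, 0.2, 0.5],
                     [0.3, 0.7, 0.4, 0.6, 0.1]])
# Binary fold-in history (stand-in for vad_data_tr): items already seen
X = np.array([[1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0]], dtype=float)

# Seen items can never appear in the top-k ranking
pred_val[X.nonzero()] = -np.inf

top1 = pred_val.argmax(axis=1)
assert list(top1) == [2, 3]  # only unseen items get ranked
```

Is that the right reading, i.e. the "prediction" is the ranking of items the user has not yet interacted with?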
Thanks for your time.
Hi all, during training the Recall measure always returns NaN. Does that mean the model could not learn anything, i.e. that it could not produce a latent vector resembling the input vector? Even when I compute the recall measure on the training data it is still NaN :) . Is that equivalent to a training accuracy of 0?
And finally, does that mean this model is useless??
Thank you
Hi Dawen,
I am writing to ask whether you ever encountered the following error, shown in the snapshot attached to this message. It appeared after I converted your Jupyter notebook into a Python file named old_code.py and tried to run it.
After filtering, there are 9990682 watching events from 136677 users and 20720 movies (sparsity: 0.353%)
0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
7000 users sampled
8000 users sampled
9000 users sampled
0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
7000 users sampled
8000 users sampled
9000 users sampled
Traceback (most recent call last):
  File "old_code.py", line 250, in <module>
    train_data = numerize(train_plays)
  File "old_code.py", line 244, in numerize
    return pd.DataFrame(data={'uid': uid, 'sid': sid}, columns=['uid', 'sid'])
  File "C:\Software\Anaconda\lib\site-packages\pandas\core\frame.py", line 348, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Software\Anaconda\lib\site-packages\pandas\core\frame.py", line 439, in _init_dict
    index = extract_index(arrays[~missing])
  File "C:\Software\Anaconda\lib\site-packages\pandas\core\frame.py", line 7349, in extract_index
    raw_lengths.append(len(v))
TypeError: object of type 'map' has no len()
Apologies for the inconvenience, and thank you for your attention! :)
Hi, I have a question about the l2 normalization of the input.
In the q_graph method of Multi_VAE and the forward_pass method of Multi_DAE, why do you apply l2 normalization to the input vector?
I don't understand the purpose of that normalization.
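Just so we are talking about the same operation, here is my reading of it as a standalone sketch (my own code, not the repo's): each user's binary history row is scaled to unit l2 norm, so users with very different activity levels feed inputs of comparable magnitude into the encoder.

```python
import numpy as np

def l2_normalize_rows(x, eps=1e-12):
    """Scale each user's input vector to unit l2 norm, so heavy users
    (many interactions) and light users produce inputs of the same
    scale for the network."""
    norms = np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
    return x / np.maximum(norms, eps)

heavy = np.array([[1., 1., 1., 1.]])  # user with 4 interactions
light = np.array([[1., 0., 0., 0.]])  # user with 1 interaction
hn, ln = l2_normalize_rows(heavy), l2_normalize_rows(light)
assert np.isclose(np.linalg.norm(hn), 1.0)
assert np.isclose(np.linalg.norm(ln), 1.0)
```

Is controlling for user activity level the intended purpose, or is there another reason?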
Sorry to bother you, Thank you!
Hi all,
Is there a way to get the hit ratio along with NDCG from the code? Any help would be appreciated.
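In case it helps clarify what I mean, here is the rough shape I had in mind, modeled loosely on the repo's NDCG/Recall batch metrics (HR_at_k_batch is a hypothetical helper I wrote, not something from the repo):

```python
import numpy as np

def HR_at_k_batch(X_pred, heldout_batch, k=10):
    """Hit ratio @ k: for each user, 1.0 if any held-out item appears
    in the top-k predictions, else 0.0 (dense toy version)."""
    topk = np.argsort(-X_pred, axis=1)[:, :k]  # indices of top-k items
    hits = np.zeros(X_pred.shape[0])
    for u in range(X_pred.shape[0]):
        hits[u] = float(heldout_batch[u, topk[u]].sum() > 0)
    return hits

X_pred = np.array([[0.9, 0.2, 0.8, 0.1],
                   [0.1, 0.2, 0.3, 0.9]])
heldout = np.array([[0, 0, 1, 0],
                    [1, 0, 0, 0]])
hr = HR_at_k_batch(X_pred, heldout, k=2)
assert list(hr) == [1.0, 0.0]
```

Would something along these lines slot into the existing evaluation loop next to NDCG_binary_at_k_batch?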
Thanks
Hi, nice work applying variational autoencoders to recommendation. However, I am confused about the data-split method.
In the preprocessing.py,
unique_uid = user_activity.index
so unique_uid is the index of the active users rather than the actual uid (user_activity['userId']). Owing to the earlier filter operation, some userId values are removed, so some valid userId values near the end will not be considered if we adopt the index of user_activity rather than the actual uid. I guess this might be an error, or is there some other meaning to it?
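One sanity check I ran on a toy frame (my own example, assuming user_activity is built with groupby('userId').size()): in that case the Series index holds the actual userId values even after filtering, not positional row numbers.

```python
import pandas as pd

tp = pd.DataFrame({'userId': [5, 5, 9, 42, 42, 42]})
user_activity = tp.groupby('userId').size()
active = user_activity[user_activity >= 2]

# The index carries the real userId values (5 and 42 survive the
# filter; 9 is dropped), not positions 0..n-1, so taking `.index`
# would be safe under this assumption.
assert list(active.index) == [5, 42]
```

So my confusion may come down to whether user_activity is constructed this way or differently. Could you clarify?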
Looking forward to your reply, Thanks.
Best.
Do you have the code for downloading and preprocessing the Netflix and Million Song Datasets?
Thanks for the clear code.
Hello,
I ran your code with Python 3.5+ and it throws several errors. I suppose it is meant to be run with Python 2.7?
Some lines that could be edited for Python 3.5+:
uid = map(lambda x: profile2id[x], tp['userId'])
sid = map(lambda x: show2id[x], tp['movieId'])
Should be:
uid = list(map(lambda x: profile2id[x], tp['userId']))
sid = list(map(lambda x: show2id[x], tp['movieId']))
idxlist = range(N) should be idxlist = np.arange(N)
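The list() wrapper matters because map() became lazy in Python 3: it returns an iterator with no len(), which is exactly what pandas chokes on when building the DataFrame. A minimal demonstration:

```python
# Under Python 3, map() is lazy and has no len(); list() restores
# the eager Python 2 behaviour that the original code assumed.
ids = {'a': 0, 'b': 1}
uid = map(lambda x: ids[x], ['a', 'b', 'a'])
try:
    len(uid)
    raised = False
except TypeError:
    raised = True
assert raised
assert list(map(lambda x: ids[x], ['a', 'b', 'a'])) == [0, 1, 0]
```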
Hi,
I am playing around with the vae_cf implementation, and I tried it on my own dataset instead of ML-20M. It has the same structure (userId, itemId, rating), and I made sure the data types are the same.
I left the pre-processing part as it is and am using Google Colab to run the code. With the ML-20M dataset everything works just fine; with my dataset, however, I get NaN values in the ndcg_dist list.
Here I copy/paste the error I get:
InvalidArgumentError                      Traceback (most recent call last)
in ()
     82     ndcgs_vad.append(ndcg_)
     83     print("printing ndcg_var: ", ndcg_var)
---> 84     merged_valid_val = sess.run(merged_valid, feed_dict={ndcg_var: ndcg_, ndcg_dist_var: ndcg_dist})
     85     print("printing feed_dict:", feed_dict)
     86     summary_writer.add_summary(merged_valid_val, epoch)

3 frames
/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/client/session.pyc in _do_call(self, fn, *args)
   1382                 '\nsession_config.graph_options.rewrite_options.'
   1383                 'disable_meta_optimizer = True')
-> 1384             raise type(e)(node_def, op, message)
   1385
   1386     def _extend_graph(self):

InvalidArgumentError: Nan in summary histogram for: ndcg_at_k_hist_validation
  [[node ndcg_at_k_hist_validation (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Does anyone have a clue what is causing this issue?
I would really appreciate any kind of advice and suggestions you may give me.
Thank you very much.
Have a great day.
Hi, I am getting an error while running the notebook at the line where we import:
from tensorflow.contrib.layers import apply_regularization, l2_regularizer
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
in
     15
     16 import tensorflow as tf
---> 17 from tensorflow.contrib.layers import apply_regularization, l2_regularizer
     18
     19 import bottleneck as bn
ModuleNotFoundError: No module named 'tensorflow.contrib'
Please help