github-hongweizhang / prediction-flow

Deep-learning-based CTR models implemented in PyTorch

License: MIT License

Python 100.00%

pytorch ctr ctr-prediction deep-learning deepfm deepinterestnetwork dnn deepneuralnetworks din recommendation

prediction-flow's Introduction

prediction-flow

prediction-flow is a Python package providing modern deep-learning-based CTR models. The models are implemented in PyTorch.

how to use

Install using pip.

pip install prediction-flow

feature

how to define feature

All feature types take two parameters, name and column_flow. The name parameter is used to look up the column's raw data in the input data frame. The column_flow parameter is a single transformer or a list of transformers. The transformers pre-process the column data before the model is trained.
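
As a rough illustration of how a column_flow can be either one transformer or a list of them, the sketch below applies a column_flow to one column of raw values. It is not prediction-flow's implementation: `apply_column_flow`, `FillNa`, and `Lowercase` are hypothetical names invented here, and only a scikit-learn style `fit`/`transform` convention is assumed.

```python
class FillNa:
    """Hypothetical transformer: replace None with a fixed value."""

    def __init__(self, value):
        self.value = value

    def fit(self, values):
        return self

    def transform(self, values):
        return [self.value if v is None else v for v in values]


class Lowercase:
    """Hypothetical transformer: lower-case every string."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [v.lower() for v in values]


def apply_column_flow(values, column_flow):
    """Apply None, one transformer, or a list of transformers in order."""
    if column_flow is None:
        return list(values)
    if not isinstance(column_flow, (list, tuple)):
        column_flow = [column_flow]
    for transformer in column_flow:
        values = transformer.fit(values).transform(values)
    return list(values)


raw = ["Comedy", None, "DRAMA"]
single = apply_column_flow(raw, FillNa("__UNKNOWN__"))
chained = apply_column_flow(raw, [FillNa("__UNKNOWN__"), Lowercase()])
```

Here `single` only fills the missing value, while `chained` fills it and then lower-cases every entry, which mirrors passing a list as column_flow.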

dense number feature

Number('age', StandardScaler())
Number('ctr', None)

sparse category feature

Category('movieId', CategoryEncoder(min_cnt=1))

var length sequence feature

Sequence('genres', SequenceEncoder(sep='|', min_cnt=1))

transformer

The following transformers are provided now.

transformer	supported feature type	detail
StandardScaler	Number	Wrapper of scikit-learn's StandardScaler. Null value must be filled in advance.
LogTransformer	Number	Log scaler. Null value must be filled in advance.
CategoryEncoder	Category	Converting str value to int. Null value must be filled in advance using '__UNKNOWN__'.
SequenceEncoder	Sequence	Converting sequence str value to int. Null value must be filled in advance using '__UNKNOWN__'.
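
To make the table more concrete, here is a minimal sketch of what "converting str value to int" with a `min_cnt` threshold could look like. It reflects the general technique only and is not the package's actual code: the helper names `fit_vocab`, `encode_category`, and `encode_sequence` are hypothetical, and reserving index 0 for `'__UNKNOWN__'` is an assumption of this sketch.

```python
from collections import Counter

UNKNOWN = "__UNKNOWN__"


def fit_vocab(values, min_cnt=1):
    """Map each value seen at least min_cnt times to an int index.

    Index 0 is reserved for UNKNOWN (an assumption of this sketch).
    """
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= min_cnt and v != UNKNOWN)
    vocab = {UNKNOWN: 0}
    for idx, value in enumerate(kept, start=1):
        vocab[value] = idx
    return vocab


def encode_category(values, vocab):
    """Encode single string values; unseen or rare values map to UNKNOWN."""
    return [vocab.get(v, vocab[UNKNOWN]) for v in values]


def encode_sequence(values, vocab, sep="|"):
    """Split each string on sep and encode every token."""
    return [[vocab.get(tok, vocab[UNKNOWN]) for tok in v.split(sep)]
            for v in values]


# Nulls are filled with '__UNKNOWN__' before encoding, as the table requires.
movie_ids = ["m1", "m2", None, "m1"]
filled = [UNKNOWN if v is None else v for v in movie_ids]
vocab = fit_vocab(filled, min_cnt=2)  # only "m1" appears at least twice
encoded = encode_category(filled, vocab)

genre_vocab = fit_vocab(["Action", "Comedy", "Drama"])
genres = encode_sequence(["Action|Comedy", "Drama"], genre_vocab)
```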

model

model	reference
DNN	-
Wide & Deep	[DLRS 2016] Wide & Deep Learning for Recommender Systems
DeepFM	[IJCAI 2017] DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
DIN	[KDD 2018] Deep Interest Network for Click-Through Rate Prediction
DNN + GRU + GRU + Attention	[AAAI 2019] Deep Interest Evolution Network for Click-Through Rate Prediction
DNN + GRU + AIGRU	[AAAI 2019] Deep Interest Evolution Network for Click-Through Rate Prediction
DNN + GRU + AGRU	[AAAI 2019] Deep Interest Evolution Network for Click-Through Rate Prediction
DNN + GRU + AUGRU	[AAAI 2019] Deep Interest Evolution Network for Click-Through Rate Prediction
DIEN	[AAAI 2019] Deep Interest Evolution Network for Click-Through Rate Prediction
OTHER	TODO

example

movielens-1M

This dataset is used only to check that the code runs; the accuracy figures are not meaningful.

Prepare the dataset: preprocess.ipynb
Run the model: movielens-1m.ipynb

amazon

Prepare the dataset: prepare_neg.ipynb
Run the model: amazon.ipynb
An example using pytorch-lightning: amazon-lightning.ipynb

accuracy

acknowledgement and reference

Following the design of DeepCTR, the features are divided into dense (class Number), sparse (class Category), and sequence (class Sequence) types.

prediction-flow's Issues

implement embedding weight sharing

pip install failed due to pandas issue

When I install with pip, an error occurs saying that the pandas build failed...
However, my pandas is up to date. Is there a way to use the package via "git clone"?

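Out of memory

I am training the DNN model on my own dataset, but memory usage grows with every epoch, and after about 10 epochs memory runs out. How can I solve this?

Memory: 16 GB
Dataset size: 400 MB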

Incremental training

I am interested in using the DIEN model. What strategies are there for incremental training, so that the model can fit new user interactions and new items without being retrained from scratch?

I read the model's paper, but it only discusses how to reduce latency for model serving, not how to do incremental training.

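Have you fully reproduced the movielen-20m experiment from the DIN paper?

I have recently been running experiments for a paper and trying to reproduce the DIN movielen experiment, but my results differ greatly from those in the paper. The main problem is that I am unsure how the dataset should be preprocessed (assigning labels and generating the user behavior history sequences). Have you run this experiment in full?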

RuntimeError: Expected object of scalar type Float but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'

Traceback (most recent call last):
File "D:/项目/CTR/video-click-contest/src/model/Flow_DeepFM.py", line 272, in
fit(10, model, loss_func, optimizer, train_loader, valid_loader, notebook=True, auxiliary_loss_rate=0.1)
File "D:\项目\CTR\video-click-contest\prediction_flow\pytorch\functions.py", line 49, in fit
pred = model(batch)
File "D:\Programs\Anaconda\envs\python3.6\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "D:\项目\CTR\video-click-contest\prediction_flow\pytorch\deepfm.py", line 132, in forward
linear_concat = torch.cat(number_inputs, dim=1)
RuntimeError: Expected object of scalar type Float but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'

This error is raised when running on my own dataset.
The way the model inputs are built should be fine.
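
The error message indicates that one of the tensors passed to torch.cat has an integer (Long) dtype while the others are Float, which commonly happens when a numeric column holds integers. A minimal sketch of one possible workaround is to cast each tensor to float before concatenating. It assumes the number inputs are plain tensors; the variable names and values are illustrative and not taken from deepfm.py.

```python
import torch

# Illustrative inputs: one float column and one integer column.
age = torch.tensor([[0.5], [1.2]])   # dtype float32
clicks = torch.tensor([[3], [7]])    # dtype int64 (Long)

number_inputs = [age, clicks]

# Cast everything to float32 so torch.cat sees a single dtype.
linear_concat = torch.cat([t.float() for t in number_inputs], dim=1)
```

Alternatively, converting the column to a float type in the input data frame before building the data loader should avoid the need for a cast inside the model.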

Possibility of adding DICE to replace prelu? Also, a small bug for the GPU implementation for newer PyTorch versions

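Hi,

Thanks for the code. I would like to ask whether the DICE activation function will be added to DIN/DIEN in the future to replace the PReLU that torch itself provides.

A small issue I have found so far when running on GPU (torch 1.9.0):
In DIEN.py, because of new behaviour introduced by the PyTorch version update, the following line needs to be added
keys_length=keys_length.to(torch.device("cpu"))
to satisfy the changed requirement in torch (it needs a non-CUDA, CPU long tensor). Afterwards, in interest.py, add
keys_length=keys_length.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
to move it back to the GPU.

Thank you very much!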
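
The workaround described in this issue can be sketched in isolation as follows. This is not the code from DIEN.py or interest.py; the tensor shapes and the GRU are made up for illustration. It only shows that the lengths tensor is moved to the CPU for the packing call, while the padded input stays on whichever device is in use.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Batch of 2 padded sequences, max length 3, embedding size 4.
keys = torch.randn(2, 3, 4, device=device)
keys_length = torch.tensor([3, 1], device=device)

# Recent PyTorch versions expect `lengths` as a 1D CPU int64 tensor,
# even when the padded input itself lives on the GPU.
packed = pack_padded_sequence(
    keys, keys_length.cpu(), batch_first=True, enforce_sorted=False)

gru = torch.nn.GRU(input_size=4, hidden_size=4, batch_first=True).to(device)
output, _ = gru(packed)
unpacked, lengths = pad_packed_sequence(output, batch_first=True)
```

Passing a CPU copy only to `pack_padded_sequence` leaves the original `keys_length` on its device, so other code that uses it on the GPU does not need a second transfer.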

Why not use a hash encoder?

How to deal with multiple labels?

I ran into trouble implementing the ESMM model on top of prediction-flow: it has two labels, click and post-click conversion. It seems that create_dataloader_fn would need to be rewritten?

how to deal with the uid?

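I added my own pretrained embedding vectors via nn.Embedding.from_pretrained(). How can I keep the embedding unchanged during model training?

When building the embedding, I added my own pretrained embedding vectors via nn.Embedding.from_pretrained(). How can I keep the embedding from being changed during model training?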
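
For reference, `nn.Embedding.from_pretrained` takes a `freeze` argument that defaults to `True`, in which case the weight does not receive gradient updates. The following is a minimal sketch, assuming the embedding is used as an ordinary submodule. How prediction-flow wires embeddings internally is not shown here, and the toy model and random vectors are invented for illustration.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(5, 3)  # stand-in for real pretrained vectors

# freeze=True (the default) sets requires_grad=False on the weight.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

model = nn.Sequential(embedding, nn.Linear(3, 1))

# Pass only trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

before = embedding.weight.clone()
loss = model(torch.tensor([0, 2, 4])).sum()
loss.backward()
optimizer.step()
```

After the optimizer step, `embedding.weight` still equals `before`, while the linear layer's weight and bias have been updated.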

I noticed that the DIEN paper reaches 84 on its book dataset. Why can I still not reach 84 even though I use the auxiliary loss (Auxloss)?

attendance one key and many values

pairs=[{'ad': 'q_topic_1', 'pos_hist': 'm_interested_topics'},
{'ad': 'q_topic_2', 'pos_hist': 'm_interested_topics'},
{'ad': 'q_topic_3', 'pos_hist': 'm_interested_topics'},
{'ad': 'q_topic_4', 'pos_hist': 'm_interested_topics'},
{'ad': 'q_topic_5', 'pos_hist': 'm_interested_topics'}
],

RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

Describe the bug
When I run the DIEN model, I get this error. I tried lengths.cpu() and lengths.to('cpu'), but neither works. Could you provide a solution for this?

DIEN_augru
HBox(children=(FloatProgress(value=0.0, description='training routine', max=2.0, style=ProgressStyle(descripti…
HBox(children=(FloatProgress(value=0.0, description='train', max=8486.0, style=ProgressStyle(description_width…
HBox(children=(FloatProgress(value=0.0, description='valid', max=947.0, style=ProgressStyle(description_width=…
GPU is available, transfer model to GPU.
Traceback (most recent call last):

  File "<ipython-input-47-cf075d4611bd>", line 18, in <module>
    scores1, model_loss_curves1 = run(models)

  File "<ipython-input-47-cf075d4611bd>", line 9, in run
    train_loader, valid_loader, notebook=True, auxiliary_loss_rate=1)

  File "/home/hojun/anaconda3/envs/ai/lib/python3.6/site-packages/prediction_flow/pytorch/functions.py", line 57, in fit
    pred = model(batch)

  File "/home/hojun/anaconda3/envs/ai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)

  File "/home/hojun/anaconda3/envs/ai/lib/python3.6/site-packages/prediction_flow/pytorch/dien.py", line 100, in forward
    query, pos_hist, keys_length, neg_hist))

  File "/home/hojun/anaconda3/envs/ai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)

  File "/home/hojun/anaconda3/envs/ai/lib/python3.6/site-packages/prediction_flow/pytorch/nn/interest.py", line 235, in forward
    enforce_sorted=False)

  File "/home/hojun/anaconda3/envs/ai/lib/python3.6/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)

RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

Additional context
PyTorch v1.7.1

What AUC does DeepFM reach on movielens-1M?

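What AUC does your DeepFM reach on movielens-1M? I see other people online routinely reporting around 0.88, whereas my own implementation only gets an AUC of about 0.77 to 0.85 on the test set. Perhaps my parameters are not well tuned.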

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

An error is raised when running movielens-1m.ipynb:
Traceback (most recent call last):
File "D:/项目/CTR/prediction-flow/examples/movielens/movielens-1m.py", line 118, in
fit(10, model, loss_func, optimizer, train_loader, valid_loader, notebook=True, auxiliary_loss_rate=0.1)
File "D:\项目\CTR\prediction-flow\prediction_flow\pytorch\functions.py", line 57, in fit
pred = model(batch)
File "D:\Programs\Anaconda\envs\python3.6\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "D:\项目\CTR\prediction-flow\prediction_flow\pytorch\interest_net.py", line 200, in forward
feature.name](x[feature.name])
File "D:\Programs\Anaconda\envs\python3.6\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "D:\Programs\Anaconda\envs\python3.6\lib\site-packages\torch\nn\modules\sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "D:\Programs\Anaconda\envs\python3.6\lib\site-packages\torch\nn\functional.py", line 1467, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

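Is there a problem with the movielens data processing?

With this processing, each user ends up with only one sample, which makes the dataset far too small, doesn't it?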