
CTR prediction models for ad recommendation, based on deep learning

Home Page: https://github.com/qiaoguan/deep-ctr-prediction

Python 99.19% Shell 0.81%
ctr ctr-prediction ad-algorithm deep-ctr recommendation-algorithms esmm deep-interest-network deepfm deep-cross resnet xdeepfm afm factorization-machine transformer recommendation recommendation-algorithm fibinet

deep-ctr-prediction's Introduction

deep-ctr-prediction

Some DNN models for advertising algorithms (CTR prediction):

  • wide&deep (see official/wide_deep)

  • deep&cross

  • deepfm

  • ESMM

  • Deep Interest Network

  • ResNet

  • xDeepFM

  • AFM(Attentional FM)

  • Transformer

  • FiBiNET

All models are built on the tf.estimator API. Data is stored in TFRecord format (a dictionary of key/value pairs) and read through the tf.data Dataset API to speed up I/O, which makes the code suitable for industrial use. Feature engineering is defined in input_fn and the model in model_fn, keeping feature code and model code completely separate: to change the features, edit only input_fn; to change the model, edit only model_fn. Data is stored on Hadoop by default, but can be kept locally if needed. For feature engineering and data processing, see Google's open-source wide&deep model (which does not use the TFRecord format; code is under official/wide_deep).
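The input_fn/model_fn contract described above can be sketched in plain Python. TensorFlow is stubbed out here so the separation is visible; the threshold "model" and field names are made up for illustration:

```python
# Minimal sketch of the input_fn / model_fn separation, with TensorFlow
# calls replaced by plain Python. Only the contract matters here:
# input_fn produces (features, labels); model_fn consumes them plus a mode.

def input_fn(records):
    """Feature engineering lives here: raw records -> (features, labels)."""
    features = {"age": [r["age"] for r in records]}   # feature dict
    labels = [r["label"] for r in records]            # click labels
    return features, labels

def model_fn(features, labels, mode):
    """Model definition lives here: consume features, return predictions/loss."""
    # Stand-in "model": predict a click when age is above a threshold.
    preds = [1 if a > 30 else 0 for a in features["age"]]
    if mode == "predict":
        return {"predictions": preds}
    loss = sum(abs(p - y) for p, y in zip(preds, labels))  # toy 0/1 loss
    return {"predictions": preds, "loss": loss}

records = [{"age": 46, "label": 1}, {"age": 24, "label": 0}]
features, labels = input_fn(records)
spec = model_fn(features, labels, mode="train")
```

Because model_fn never sees raw records and input_fn never sees model internals, either side can be swapped independently, which is the point of the design above.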

Requirements

  • TensorFlow 1.10

References

[1] Heng-Tze Cheng, Levent Koc et al. "Wide & Deep Learning for Recommender Systems," In 1st Workshop on Deep Learning for Recommender Systems, 2016.

[2] Huifeng Guo et al. "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction," In IJCAI, 2017.

[3] Ruoxi Wang et al. "Deep & Cross Network for Ad Click Predictions," In ADKDD, 2017.

[4] Xiao Ma et al. "Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate," In SIGIR, 2018.

[5] Guorui Zhou et al. "Deep Interest Network for Click-Through Rate Prediction," In KDD, 2018.

[6] Kaiming He et al. "Deep Residual Learning for Image Recognition," In CVPR, 2016.

[7] Jianxun Lian et al. "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems," In KDD, 2018.

[8] Jun Xiao et al. "Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks," In IJCAI, 2017.

[9] Ashish Vaswani et al. "Attention Is All You Need," In NIPS, 2017.

[10] Tongwen Huang et al. "FiBiNET: Combining Feature Importance and Bilinear Feature Interaction for Click-Through Rate Prediction," In RecSys, 2019.

deep-ctr-prediction's People

Contributors

qiaoguan


deep-ctr-prediction's Issues

Why does the ESMM training loss keep oscillating around 400?

Training log (each line was duplicated by a second logging handler; duplicates removed):
INFO:tensorflow:global_step/sec: 12.3102
INFO:tensorflow:loss = 472.53354, step = 38300 (8.123 sec)
INFO:tensorflow:global_step/sec: 13.4449
INFO:tensorflow:loss = 496.88828, step = 38400 (7.439 sec)
INFO:tensorflow:global_step/sec: 13.5721
INFO:tensorflow:loss = 494.92902, step = 38500 (7.368 sec)
INFO:tensorflow:global_step/sec: 10.366
INFO:tensorflow:loss = 477.77087, step = 38600 (9.647 sec)
INFO:tensorflow:global_step/sec: 13.6566
INFO:tensorflow:loss = 469.46252, step = 38700 (7.322 sec)
INFO:tensorflow:global_step/sec: 14.8222
INFO:tensorflow:loss = 505.62067, step = 38800 (6.746 sec)
INFO:tensorflow:global_step/sec: 14.7337
INFO:tensorflow:loss = 508.70572, step = 38900 (6.788 sec)
INFO:tensorflow:global_step/sec: 14.245
INFO:tensorflow:loss = 481.75873, step = 39000 (7.019 sec)
INFO:tensorflow:global_step/sec: 14.1653
INFO:tensorflow:loss = 492.90146, step = 39100 (7.060 sec)
INFO:tensorflow:global_step/sec: 13.9005
INFO:tensorflow:loss = 481.75992, step = 39200 (7.194 sec)
INFO:tensorflow:global_step/sec: 13.2426
INFO:tensorflow:loss = 490.50104, step = 39300 (7.551 sec)
INFO:tensorflow:global_step/sec: 10.1131
INFO:tensorflow:loss = 478.1643, step = 39400 (9.888 sec)
INFO:tensorflow:global_step/sec: 10.4625
INFO:tensorflow:loss = 469.9007, step = 39500 (9.559 sec)
INFO:tensorflow:global_step/sec: 14.7374
INFO:tensorflow:loss = 481.07245, step = 39600 (6.785 sec)
INFO:tensorflow:global_step/sec: 11.0079

Final AUC:
ctr_accuracy: 0.62085813
ctr_auc: 0.6696427
cvr_accuracy: 0.9009072
cvr_auc: 0.67000365
global_step: 40000
loss: 488.3895

about tf-serving

Hi,
After saving the SavedModel following the official guide, I reloaded it with tensorflow-model-server, but when I send a request with requests I get the error below:

>>> headers
{'content-type': 'application/json'}
>>> data
'{"signature_name": "serving_default", "instances": [{"age": [46.0], "education_num": [10.0], "capital_gain": [7688.0], "capital_loss": [0.0], "hours_per_week": [38.0]}, {"age": [24.0], "education_num": [13.0], "capital_gain": [0.0], "capital_loss": [0.0], "hours_per_week": [50.0]}]}'
>>> jp=requests.post(url2,data=data,headers=headers)
>>> jp.text
'{\n    "error": "Failed to process element: 0 key: age of \'instances\' list. Error: INVALID_ARGUMENT: JSON object: does not have named input: age"\n}'
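For reference, the "does not have named input: age" error means the keys inside each instance must exactly match the input names of the SavedModel's serving signature (listable with `saved_model_cli show --dir <export_dir> --all`); if the model was exported with a parsing serving input receiver, the signature may expect a single serialized Example input rather than named feature keys, which would produce exactly this error. A minimal sketch of building the row-format payload, feature names mirroring the question:

```python
import json

# Row ("instances") format for the TF Serving REST API. Each instance's
# keys must match the serving signature's input names exactly; the names
# below simply mirror the request in the question above.
instances = [
    {"age": [46.0], "education_num": [10.0], "capital_gain": [7688.0],
     "capital_loss": [0.0], "hours_per_week": [38.0]},
    {"age": [24.0], "education_num": [13.0], "capital_gain": [0.0],
     "capital_loss": [0.0], "hours_per_week": [50.0]},
]
payload = json.dumps({"signature_name": "serving_default",
                      "instances": instances})

# The server parses this back on the other end:
decoded = json.loads(payload)
```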

ESMM does not run at all

The first error: in esmm.py, the line above user_id (around line 46) is missing a comma.
Has this code actually been run?

din dataset

Could you provide the training data used by the DIN model?

Prediction code fails to run

user_id = features['user_id']
click_label = features['label']
conversion_label = features['is_conversion']

if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'ctr_preds': ctr_preds,
        'cvr_preds': cvr_preds,
        'ctcvr_preds': ctcvr_preds,
        'user_id': user_id,
        'click_label': click_label,
        'conversion_label': conversion_label
    }
The code above is lines 37-50 of ESMM. From the logic, click_label and conversion_label are the prediction targets, so shouldn't they correspond to tensors in the network? Why are they read directly from features? My understanding is that these two labels should be converted from ctr_preds and ctcvr_preds respectively.
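For context on the question above: in the ESMM paper the network only outputs probabilities, with the defining identity pCTCVR = pCTR * pCVR, and one common reason to read labels from features in PREDICT mode is to export ground truth alongside predictions for offline scoring. A stdlib-only sketch of the decomposition (logit values are made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

ctr_logit, cvr_logit = 0.8, -1.2      # made-up logits
ctr_preds = sigmoid(ctr_logit)        # pCTR:   P(click | impression)
cvr_preds = sigmoid(cvr_logit)        # pCVR:   P(conversion | click)
ctcvr_preds = ctr_preds * cvr_preds   # pCTCVR = pCTR * pCVR

def log_loss(label, p):
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# ESMM trains on the two labels observable over the entire impression space:
# click (supervising pCTR) and click-and-convert (supervising pCTCVR).
# pCVR itself is never supervised directly; that is how the paper avoids
# sample selection bias. Labels here are ground truth from the input data.
click_label, conversion_label = 1, 0
loss = log_loss(click_label, ctr_preds) + log_loss(conversion_label, ctcvr_preds)
```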

DIN sequence features

Hi, a question about how you build the attention part for sequence features: you wrote din_feature_column.py, so did you intend to build sequence features via feature_column? But the actual implementation looks like the code below.
last_click_creativeid = tf.string_to_hash_bucket_fast(features["user_click_creatives_att"], 200000)
creativeid_embeddings = tf.get_variable(name="attention_creativeid_embeddings", dtype=tf.float32,
                                        shape=[200000, 20])
last_click_creativeid_emb = tf.nn.embedding_lookup(creativeid_embeddings, last_click_creativeid)
att_creativeid = tf.string_to_hash_bucket_fast(features["creative_id_att"], 200000)
creativeid_emb = tf.nn.embedding_lookup(creativeid_embeddings, att_creativeid)
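A plain-Python sketch of the hash-then-lookup pattern in that snippet: each string id is hashed into one of N buckets indexing a shared embedding table, so the user's clicked creatives and the candidate creative index the same table and their embeddings are comparable in the attention unit. Here md5 stands in for tf.string_to_hash_bucket_fast and the sizes are illustrative:

```python
import hashlib

num_buckets, dim = 16, 4  # illustrative; the repo uses 200000 buckets, dim 20

def hash_bucket(s, buckets):
    # deterministic stand-in for tf.string_to_hash_bucket_fast
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % buckets

# one shared embedding table, one row per hash bucket
embedding_table = [[(row * dim + col) * 0.01 for col in range(dim)]
                   for row in range(num_buckets)]

# both sides of the attention unit look up the SAME table
last_click_emb = embedding_table[hash_bucket("creative_123", num_buckets)]
candidate_emb = embedding_table[hash_bucket("creative_456", num_buckets)]
```

The trade-off of hashing (vs. a vocabulary file) is that distinct ids can collide in a bucket; a large bucket count keeps collisions rare.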

attention unit

Hi, a question: the attention unit uses a max_seq_len. Does that mean the input sequences are padded, since behavior streams vary in length?
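A common way to handle variable-length behavior sequences with a fixed max_seq_len is padding plus masking, so padded positions receive effectively zero attention weight; whether this repo does exactly that is for the author to confirm. A stdlib sketch (scores are made up):

```python
import math

max_seq_len = 5
scores = [2.0, 1.0, 0.5]                 # attention scores for real positions
seq_len = len(scores)
padded = scores + [0.0] * (max_seq_len - seq_len)
mask = [1.0] * seq_len + [0.0] * (max_seq_len - seq_len)

# masked softmax: padded positions get a huge negative score so that,
# after normalization, their weight underflows to ~0
masked = [s if m else -1e9 for s, m in zip(padded, mask)]
peak = max(masked)                        # subtract max for stability
exps = [math.exp(s - peak) for s in masked]
total = sum(exps)
weights = [e / total for e in exps]
```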

Meaning of the DIN sample data

Could you explain the dataset used in the DIN code, in particular product_id_att, creative_id_att, user_click_products_att, and user_click_creatives_att? The official code does not use this many features, and yours looks more practical, so please explain their meaning and data format. Thanks.

Environment setup

Could you share your working environment: Python version, TensorFlow version, and the Estimator version?

Feature engineering

Could you explain the reasoning behind this part of the feature engineering? In particular, what is the point of bucketizing with the two lists below?
DayShowSegs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 41, 42, 44, 46, 47, 49, 51, 54, 56, 59, 61, 65, 68, 72, 76, 81, 86, 92, 100, 109, 120, 134, 153, 184, 243, 1195]
DayClickSegs = [1, 2, 3, 6, 23]
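For context, boundary lists like these are typically quantile boundaries: each raw count is mapped to a bucket index, turning a long-tailed numeric feature into a categorical one, which is why the boundaries are dense at small values and sparse in the tail. A stdlib sketch using the shorter list:

```python
import bisect

DayClickSegs = [1, 2, 3, 6, 23]  # boundary list from the question above

def bucketize(value, boundaries):
    # bucket index = number of boundaries <= value
    return bisect.bisect_right(boundaries, value)

# 0 clicks -> bucket 0, 1 -> 1, 4 -> 3, 100 -> 5 (the open-ended tail bucket)
buckets = [bucketize(v, DayClickSegs) for v in (0, 1, 4, 100)]
```

With quantile boundaries each bucket carries roughly equal mass, so the embedding learned for each bucket sees a comparable amount of training data.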

Some questions about model export

Hi, I want to load the trained model file on its own and use it for prediction, but with tf.estimator I cannot find the input interface. From what I have read, tf.estimator.export.RegressionOutput (as used in your code) defines the output interface, but how do I define the input interface, i.e. sess.run(output, feed_dict=???)? Which tensor is output, and how should feed_dict be filled?

Code implementation questions

In esmm.py, under PREDICT mode there is:
export_outputs = {
    'regression': tf.estimator.export.RegressionOutput(predictions['cvr_preds'])  # needed for online serving
}
What is this export actually used for? train.py already exports the model, so I do not quite see the purpose of export_outputs here.

Another question:
session_config = tf.ConfigProto(device_count={'GPU': 1, 'CPU': 10},
                                inter_op_parallelism_threads=10,
                                intra_op_parallelism_threads=10
                                # log_device_placement=True
                                )
Is there any theoretical basis for these thread settings, or are they empirical?

Questions about generating TFRecords at industrial scale

The usual approach is to generate TFRecords with Spark and pull them to a local GPU machine. But when the CTR dataset is small (under 20 million rows and under 50 features) and fits in memory, tf.data.TFRecordDataset turns out to be very slow.

Reading the data into memory with pandas and building a generator with tf.keras.utils.Sequence takes about 20 ms/step, while tf.data.TFRecordDataset takes about 2 s/step, with 21 steps per epoch.

Does the author have any advice, for example on how the TFRecords should be laid out internally? Currently each record holds n features (keys). Thanks.

Out of memory

The program hangs at this point, memory usage then explodes, and it finally exits. Has the author run into anything similar? (screenshot attached)

x-deeplearning ESMM AUC calculation bug

x-deeplearning computes AUC per batch and discards batches that contain only negatives as invalid, which differs from TensorFlow's approach. This can significantly distort the AUC, because a batch with no positives still affects the global FP count; AUC should be computed globally.
See: alibaba/x-deeplearning#355
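The bias described above can be reproduced in a few lines: with the rank-based (Mann-Whitney) AUC estimate, averaging per-batch AUC while discarding all-negative batches ignores negatives that still matter globally (values are made up):

```python
def auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC; None when undefined,
    like the 'invalid' all-negative batch in the issue above."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 0, 0]
scores = [0.6, 0.4, 0.7, 0.2]

batch1 = auc(labels[:2], scores[:2])   # the batch with the positive: 1.0
batch2 = auc(labels[2:], scores[2:])   # all-negative batch: None, discarded
global_auc = auc(labels, scores)       # 2/3: the discarded negatives still
                                       # contribute comparisons globally
```

Here the per-batch scheme reports a perfect 1.0 while the global AUC is 2/3, because the hard negative (score 0.7) lives entirely inside the discarded batch.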

Questions about the DeepFM model

  1. build_model_columns seems to return only deep_columns; are the other features simply unused in DeepFM?
  2. For multi-valued features such as tags, how should they be stored and processed with TFRecords?
