dkn's Introduction

DKN

This repository is the implementation of DKN (arXiv):

DKN: Deep Knowledge-Aware Network for News Recommendation
Hongwei Wang, Fuzheng Zhang, Xing Xie, Minyi Guo
The Web Conference 2018 (WWW 2018)

DKN is a deep knowledge-aware network that takes advantage of knowledge graph representations in news recommendation. The main components of DKN are a KCNN module and an attention module:

  • The KCNN module learns semantic-level and knowledge-level representations of news jointly. Its multiple channels and the alignment of words with entities enable KCNN to combine information from heterogeneous sources.
  • The attention module models the different impacts of a user's diverse historical interests on the current candidate news.
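
For intuition, here is a minimal NumPy sketch of the attention-based aggregation described above. It is illustrative only, not the repository's TensorFlow implementation, and all names are invented:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def user_embedding(clicked_news_emb, candidate_emb):
    """clicked_news_emb: (n_history, dim) KCNN embeddings of clicked news;
    candidate_emb: (dim,) KCNN embedding of the candidate news."""
    # attention weight of each historical click with respect to the candidate
    weights = softmax(clicked_news_emb @ candidate_emb)
    # the user representation is the attention-weighted sum of history embeddings
    return weights @ clicked_news_emb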

Files in the folder

  • data/
    • kg/
      • Fast-TransX: an efficient implementation of TransE and its extended models for Knowledge Graph Embedding (from https://github.com/thunlp/Fast-TransX);
      • kg.txt: knowledge graph file;
      • kg_preprocess.py: pre-process the knowledge graph and output knowledge embedding files for DKN;
      • prepare_data_for_transx.py: generate the required input files for Fast-TransX;
    • news/
      • news_preprocess.py: pre-process the news dataset;
      • raw_test.txt: raw test data file;
      • raw_train.txt: raw train data file;
  • src/: implementations of DKN.

Note: Due to the privacy policy of Bing News and file size limits on GitHub, the raw dataset and knowledge graph released in this repository are only a small sample of the original ones reported in the paper.

Format of input files

  • raw_train.txt and raw_test.txt:
    user_id[TAB]news_title[TAB]label[TAB]entity_info
    for each line, where news_title is a list of words w1 w2 ... wn, and entity_info is a list of entity id / entity name pairs: entity_id_1:entity_name_1;entity_id_2:entity_name_2;...
  • kg.txt:
    head[TAB]relation[TAB]tail
    for each line, where head and tail are entity ids and relation is the relation id.
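
As an illustration, a hypothetical parser for one line of raw_train.txt or raw_test.txt, assuming exactly the format above (the function name and return shape are invented):

def parse_line(line):
    user_id, news_title, label, entity_info = line.rstrip('\n').split('\t')
    words = news_title.split(' ')
    # each pair is entity_id:entity_name; split on the first ':' only,
    # in case an entity name itself contains a colon
    entities = [pair.split(':', 1) for pair in entity_info.split(';')]
    return user_id, words, int(label), entities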

Required packages

The code has been tested under Python 3.6.5, with the following packages (along with their dependencies) installed:

  • tensorflow-gpu == 1.4.0
  • numpy == 1.14.5
  • sklearn == 0.19.1
  • pandas == 0.23.0
  • gensim == 3.5.0

Running the code

$ cd data/news
$ python news_preprocess.py
$ cd ../kg
$ python prepare_data_for_transx.py
$ cd Fast-TransX/transE/ (note: you can also choose other KGE methods)
$ g++ transE.cpp -o transE -pthread -O3 -march=native
$ ./transE
$ cd ../..
$ python kg_preprocess.py
$ cd ../../src
$ python main.py (note: use -h to check optional arguments)

dkn's Issues

Request for help

Dr. Wang, hello! Is reducing batch_size done by modifying it here in main? I changed it to 1 here and still get an OOM:
parser.add_argument('--batch_size', type=int, default=1, help='number of samples in one batch')

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[312030,10,50] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [[node kcnn/embedding_lookup_1 (defined at D:\Jupyter_folder\DKN\src\dkn.py:94) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Here is another error I get:
Traceback (most recent call last):
  File "D:/Jupyter_folder/DKN/src/main.py", line 41, in <module>
    train(args, train_data, test_data)
  File "D:\Jupyter_folder\DKN\src\train.py", line 36, in train
    train_auc = model.eval(sess, get_feed_dict(model, train_data, 0, train_data.size))
  File "D:\Jupyter_folder\DKN\src\dkn.py", line 162, in eval
    labels, scores = sess.run([self.labels, self.scores], feed_dict)
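
The call get_feed_dict(model, train_data, 0, train_data.size) in this traceback suggests the OOM happens while evaluating the entire training set in a single session run, in which case batch_size alone will not help. A hedged sketch of batched evaluation, assuming the eval/get_feed_dict interfaces visible above:

import numpy as np

def eval_in_batches(model, sess, data, batch_size=1024):
    aucs = []
    for start in range(0, data.size, batch_size):
        end = min(start + batch_size, data.size)
        aucs.append(model.eval(sess, get_feed_dict(model, data, start, end)))
    # note: the mean of per-batch AUCs only approximates the full-set AUC;
    # collecting all labels/scores and computing AUC once would be exact
    return float(np.mean(aucs))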

Resource exhausted: OOM when allocating tensor

Dr. Wang, thank you so much for your wonderful work.
When I run the last step, python main.py, an error occurred:

2019-07-15 17:38:14.500279: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

My GPU information is:
2019-07-15 17:37:54.354919: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:05:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019-07-15 17:37:54.538331: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB

Other information printed before the error occurs:
2019-07-15 17:38:14.475725: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 2265139200 totalling 2.11GiB
2019-07-15 17:38:14.480071: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:678] Sum Total of in-use chunks: 5.72GiB
2019-07-15 17:38:14.484854: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:680] Stats:
Limit: 9244818801
InUse: 6142976256
MaxInUse: 6369439488
NumAllocs: 16017
MaxAllocSize: 2265139200

I wonder how I can solve this problem. Thank you.

.vec not found

I am new to deep learning, and I am getting the error OSError: TransH_entity2vec_50.vec not found.
Prior to this step, everything ran smoothly.
Any help is appreciated.

Quite confused on the file data/kg/kg.txt

Hi,
Thanks for publishing this wonderful project and outstanding paper related to recommendation systems.

As I am currently studying your DKN code after reading your paper, some confusion crossed my mind, especially about ./data/kg/kg.txt.
Although you describe that head and tail are entity ids and relation is a relation id, I still don't know how the relation ids are generated, since the entity ids are generated by the earlier preprocessing functions.

I would be grateful if you could explain how the relation ids are generated. Looking forward to your response.

Big thanks!
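
One common convention (purely a guess here, not confirmed as what this repository does) is to enumerate distinct relation strings in order of first appearance while building the id files:

relation2id = {}
for head, relation, tail in triples:  # triples parsed from the raw KG
    if relation not in relation2id:
        relation2id[relation] = len(relation2id)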

Results

Hi, I tried to run the code you provided and compared the results with those reported in your paper.
The results I got are not the same as those mentioned in the paper.

Am I doing anything wrong?
Below is a snapshot of the generated results, whereas your paper reports an AUC of 65.9 +/- 1.2.

[result screenshot not included]

How is a user's clicked-news history split out of the training set during training?

Hello Professor Wang. In the code, during training, 30 records from the training set are used as a user's clicked-news history. Why split it this way? Is 30 a hyperparameter found by experiment?
1. During training, why not hold out one record as the candidate and treat all the others as history, repeating this process until every record in the training set has served as a candidate?
2. During training, are only a user's records with label 1 treated as history, or are records with label 0 treated as history too?
3. How is the candidate set generated?
4. Have you compared the effect of different history lengths on the model?
5. Does the test set use the same history as the training set, or is a user's training-set history used at test time?
Thank you, Professor.

There are no parameters for the paper's MLP part

Dear authors,

I found that the code implementation is slightly different from the structure presented in the paper (Figure 3). In the paper, both the output of the attention and the final output are produced by a two-layer MLP. However, in the code dkn.py, they are implemented as:

attention_weights = tf.reduce_sum(clicked_embeddings * news_embeddings_expanded, axis=-1)

and

self.scores_unnormalized = tf.reduce_sum(user_embeddings * news_embeddings, axis=1)

Each of these is only an inner product between two vectors, and there are no parameters learned over the concatenation. My questions are:

  1. Why does this work? Can the model train well by only updating embeddings, without MLP weights?
  2. Why is the code implementation different from the paper?
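
For reference, a hedged sketch of what a two-layer MLP over the concatenation (as in the paper's Figure 3) might look like in the same TF 1.x style; the hidden size is illustrative, and this is not code from the repository:

import tensorflow as tf

concat = tf.concat([user_embeddings, news_embeddings], axis=1)
hidden = tf.layers.dense(concat, 64, activation=tf.nn.relu)
scores_unnormalized = tf.squeeze(tf.layers.dense(hidden, 1), axis=1)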

All the scores are greater than 0.5

Hello Professor Wang. When evaluating the model, I printed the predicted probabilities (scores) obtained after the sigmoid transformation and found that they are all greater than 0.5. What threshold does the model use to separate positive and negative examples? (It does not seem to be 0.5?)

For example, the scores from one epoch:
[0.5778407 0.61155593 0.57902104 0.6150518 0.61513776 0.605235
0.5789735 0.57939625 0.58373684 0.5776857 0.5788436 0.5802991
0.58303684 0.57765114 0.58011097 0.5782152 0.58273846 0.59276307
0.5777418 0.5781118 0.58053976 0.5770587 0.5848505 0.6206129
0.5796546 0.5936298 0.5800667 0.5841336 0.5985693 0.5795024
0.57965916 0.57965916 0.5940067 0.58897567 0.581066 0.5788839
0.59576434 0.5785835 0.5783613 0.5984433 0.5783426 0.5845572
0.5854817 0.5777896 0.5783829 0.58128166 0.5843682 0.57781625
0.58254665 0.5966302 0.592908 0.57927406 0.57772964 0.5805329
0.5793939 0.5789313 0.580673 0.62930167 0.57700074 0.5771642
0.5793654 0.5801946 0.5798775 0.57729614 0.5778348 0.5773217
0.5789409 0.5803839 0.58697975 0.57710063 0.5773818 0.57703525
0.57772624 0.58213145 0.58688086 0.58625937 0.5790521 0.583711
0.6132932 0.6129931 0.5814061 0.5788394 0.57913923 0.61161196
0.5792562 0.5790782 0.6079845 0.5920574 0.5793936 0.578782
0.584119 0.5926744 0.5794713 0.6049938 0.6050354 0.60463595
0.6047081 0.61172765 0.619973 0.60173035 0.6009864 0.58058125
0.60348684 0.5824682 0.57886064 0.5884625 0.5974402 0.578675
0.578743 0.57945085 0.5773963 0.57785344 0.5776528 0.57766426
0.577414 0.5773697 0.5899348 0.5813688 0.5976933 0.58020246
0.5853191 0.5883258 0.5884458 0.57786816 0.57844496 0.61082983
0.58969355 0.59313047 0.5914415 0.61171293 0.5790724 0.58127683
0.5841514 0.58132464 0.58002347 0.57958114 0.57698256 0.57736045
0.5829809 0.5780082 0.5776665 0.58005613 0.5967956 0.5783363
0.5775785 0.57833403 0.6010081 0.57743883 0.5820395 0.57780135
0.57779527 0.5832522 0.5774031 0.5813955 0.64714086 0.5897089
0.58454174 0.5777361 0.57789356 0.5920431 0.58439016 0.5799597
0.5790718 0.57887685 0.58060193 0.58858824 0.58267885 0.57806975
0.59110266 0.59110266 0.6118623 0.5787403 0.58097917 0.579126
0.58345675 0.5919414 0.5799597 0.57892144 0.6185592 0.5814157
0.5828195 0.5790548 0.57929236 0.5790717 0.5797003 0.58137774
0.58160037 0.5781356 0.5776759 0.57675505 0.5767635 0.5821508
0.5782573 0.57824135 0.58539546 0.577819 0.57790375 0.583094
0.6229433 0.5897612 0.5794023 0.5936962 0.59367275 0.59711105
0.57778823 0.5777652 0.6049461 0.5809651 0.57757777 0.5777028
0.58331954 0.57804024 0.5784547 0.5791342 0.58031636 0.6046847
0.60520047 0.5937593 0.58349776 0.58918655 0.585288 0.57771075
0.57864183 0.5796051 0.5799217 0.57687336 0.58341265 0.5785788
0.5789991 0.5795324 0.5821006 0.5966433 0.57749677 0.57777035
0.5910414 0.5773269 0.58567226 0.5843944 0.5774361 0.5958311
0.57861346 0.5779085 0.5836908 0.5775305 0.5810351 0.58814144
0.5789568 0.57872987 0.5776423 0.6053786 0.5790132 0.5928653
0.6052447 0.61543584 0.58458525 0.5783247 0.5830784 0.5811673
0.57899153 0.5809654 0.5809654 0.58994776 0.6330279 0.61973125
0.58105934 0.6396998 0.5890624 0.5808206 0.63922584 0.61993045
0.652699 0.58934414 0.6063602 0.61180705 0.5771982 0.5777954
0.5781886 0.5786163 0.5782559 0.5784811 0.5793637 0.5890808
0.5830033 0.58249104 0.60447097 0.57843167 0.57877594 0.57799184
0.59080225 0.57873803 0.5795959 0.5849787 0.5797996 0.579682
0.5805923 0.5797924 0.6207965 0.6172974 0.579682 0.5796384
0.5796296 0.5794636 0.58130836 0.61685055 0.581454 0.5855957
0.58034426 0.57981044 0.5810036 0.5794597 0.57951933 0.5829626
0.5786414 0.6041503 0.5790655 0.57868075 0.58829325 0.5786846
0.5817085 0.57936454 0.60755956 0.57873625 0.5784768 0.58369076
0.58381605 0.5836887 0.5858248 0.6171568 0.5904521 0.61717606
0.57917815 0.585089 0.5946311 0.57941866 0.5956347 0.5800357
0.57921237 0.57977897 0.57950985 0.59980136 0.5794211 0.5854362
0.57901037 0.5847996 0.5779387 0.57809323 0.5930451 0.58091354
0.5784779 0.578267 0.58326787 0.5783494 0.57879704 0.57847404
0.6102496 0.6100861 0.57847404 0.5784781 0.5782633 0.57895887
0.57854944 0.5831228 0.5891479 0.5931571 0.5818151 0.5823464
0.5790161 0.5795622 0.57984537 0.578698 0.5784987 0.58045685
0.57848155 0.5787312 0.57871133 0.5899911 0.5964079 0.5974127
0.6086674 0.59137857 0.57906425 0.58017033 0.62536037 0.5906543
0.57994765 0.5836277 0.6060708 0.60615116 0.5792676 0.5819194
0.5815447 0.57876724 0.57835925 0.57756114 0.58080244 0.5824134
0.579684 0.579288 0.59036326 0.5790007 0.57904476 0.58419985
0.5783132 0.57883286 0.6001782 0.57851803 0.57872695 0.5791408
0.5799836 0.57987076 0.5796064 0.5799593 0.57966936 0.59356964
0.58866215 0.5802649 0.58128166 0.58034664 0.5771053 0.5780635
0.57781297 0.58299184 0.5775686 0.5819763 0.57897043 0.5776775
0.5777392 0.5810676 0.57942575 0.5794476 0.5778632 0.59433913
0.5929802 0.6113254 0.58432055 0.5885272 0.5841351 0.57888293
0.58728164 0.57860404 0.5781007 0.57765305 0.5887117 0.584502
0.58222306 0.5774401 0.58222306 0.57812953 0.64316356 0.581735
0.58159864 0.57851267 0.5795053 0.5944862 0.5847298 0.57949877
0.58130103 0.58020645 0.5800007 0.5978961 0.5795877 0.5803144 ]
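
A side observation (not an official answer): the reported metric is AUC, which depends only on the ranking of the scores and needs no threshold at all. With the sklearn package already listed under Required packages, it can be computed as:

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(labels, scores)  # rank-based, so a uniform shift above 0.5 does not affect it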

Dataset

Hi, thanks for providing the code.

Could you please provide the complete dataset? Is there a link, or would it be possible to email it?

Two points of confusion

During data processing, if the click-history length is set to 30, the final data structure has one row per sample in the format:
user_id, the user's click-history titles (a 30x10 2-D array), the user's click-history entities (a 30x10 2-D array), news title, news entities, label
I have the following doubts:
1. If a user has fewer than 30 records, the code appears to pad up to 30 by repeating news items. Won't this add error during training?
2. Training splits the whole training set into many batches, and I noticed that the attention part reshapes clicked_words to (batch_size*max_click_history, max_title_length). Doesn't this mistakenly replace one user's click history with the click histories of every user in the batch?
@hwwang55

entity embeddings

Thank you for sharing the code and papers. I've been doing sequence tagging recently. Could you share the entity embeddings you've trained? Thanks.

Question about DKN/data/kg/kg_preprocess.py

def get_neighbors_for_entity(file):
    reader = open(file, encoding='utf-8')
    entity2neighbor_map = {}
    for line in reader:
        array = line.strip().split('\t')
        if len(array) != 3:  # to skip the first line in triple2id.txt
            continue
        head = int(array[0])
        tail = int(array[1])

In DKN/data/kg/kg_preprocess.py, should line 30 be tail = int(array[2])?

the l2_loss variable needs to be declared as non-trainable

dkn.py line 141

self.l2_loss needs to be declared as non-trainable, i.e.,
self.l2_loss = tf.Variable(tf.constant(0., dtype=tf.float32), trainable=False)

In the current version, l2_loss is itself updated by the optimizer: since it enters the loss additively, its gradient is 1, and gradient descent pushes it down without bound, which is why it reaches negative values. The other model parameters are still updated correctly, though.

Some problems with the code

def get_local_word2entity(entities):  # format: entity_id:entity_name
    """
    Given the entities information in one line of the dataset, construct a map from word to entity index
    E.g., given entities = 'id_1:Harry Potter;id_2:England', return a map = {'harry':index_of(id_1),
    'potter':index_of(id_1), 'england': index_of(id_2)}
    :param entities: entities information in one line of the dataset
    :return: a local map from word to entity index
    """
    local_map = {}

    for entity_pair in entities.split(';'):
        entity_id = entity_pair[:entity_pair.index(':')]
        entity_name = entity_pair[entity_pair.index(':') + 1:]

        # remove non-character words and transform words to lower case
        entity_name = PATTERN1.sub(' ', entity_name)
        entity_name = PATTERN2.sub(' ', entity_name).lower()
        # constructing map: word -> entity_index
        for w in entity_name.split(' '):
            entity_index = entity2index[entity_id]
            local_map[w] = entity_index  # problem here: if different entities share the same word, the later entity_index overwrites the earlier one -- what is the intent?

    return local_map
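
If keeping every candidate were desired, a hypothetical variant could map each word to a list of entity indices instead (an illustration of the alternative only, not a claim about the intended behavior; the PATTERN1/PATTERN2 cleanup from the original is omitted for brevity):

from collections import defaultdict

local_map = defaultdict(list)
for entity_pair in entities.split(';'):
    entity_id, entity_name = entity_pair.split(':', 1)
    for w in entity_name.lower().split(' '):
        local_map[w].append(entity2index[entity_id])  # keep every entity index a word maps to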

About the knowledge graph used

Are the entity_id values in raw_train.txt and the entity_id and relation_id values in kg.txt all from the Microsoft Satori knowledge graph? How can I download this knowledge graph?

F1 score

Hello Professor Wang:
I have two questions.
1. Why does the code only compute AUC, and not the F1 score mentioned in the paper?
2. Do the label values in the training and test sets distinguish clicked from not-clicked?
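
In case it helps, a minimal sketch of computing F1 with the sklearn package from Required packages. Note that F1, unlike AUC, needs binary predictions, and the 0.5 threshold below is an assumption rather than the repository's choice:

import numpy as np
from sklearn.metrics import f1_score

preds = (np.asarray(scores) >= 0.5).astype(int)  # threshold choice is an assumption
f1 = f1_score(labels, preds)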

About the kg.txt

Thanks for providing the code.
I have two questions:
1. What is the source of the knowledge graph (kg.txt)? Is it, as the paper says, extracted from the Bing server?
2. I've checked the other issues, and I am wondering whether this is the full version of the dataset.
If it is withheld because of the privacy policy, I totally understand; I just want to make sure.

Finally, thanks again~
@hwwang55

TransE

When I compile transE.cpp, I get a warning [screenshot not included]. I ran transE.exe anyway and got the result shown [screenshot not included]. The epoch count is greater than 5000. Is this normal? Do I have a runtime error?

The C++ program does not run

I'd like to ask why running g++ transE.cpp -o transE -pthread -O3 -march=native gives no visible response and no error; I don't know the reason. I first ran transE: no result file was generated, and no error was reported.
(image_text_match_tf_1.15) [user@tianhu-worker-5 Fast-TransX]$ cd transE
(image_text_match_tf_1.15) [user@tianhu-worker-5 transE]$ g++ transE.cpp -o transE -pthread -O3 -march=native
(image_text_match_tf_1.15) [user@tianhu-worker-5 transE]$

Then I tried transD: a transD binary was generated, but no TransD_relation2vec_50.vec file was produced. What could be the reason?
(image_text_match_tf_1.15) [user@tianhu-worker-5 kg]$ cd Fast-TransX/transD/
(image_text_match_tf_1.15) [user@tianhu-worker-5 transD]$ g++ transD.cpp -o transD -pthread -O3 -march=native
(image_text_match_tf_1.15) [user@tianhu-worker-5 transD]$ ls

how do you implement the procedure of entity linking

Dr. Wang, thank you so much for your wonderful work combining KGs and recommender systems. It seems that the disambiguation procedure based on entity linking is not implemented in this code. Does kg.txt represent a KG that has already been disambiguated?

Some questions about DKN

Thanks for this great project. When I compile transE.cpp (and the others), there is a warning: "no return statement in function returning non-void". I read the code and found that the functions transetrainMode() and train_transe() have no return statement, although their return type is void*. Because of this warning, there is no output file (TransE_relation2vec_50.vec). How can I solve this problem?
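
The usual fix for this warning is to add an explicit return to each void* thread function (a hedged C++ sketch; the function name follows the issue above, and whether the missing return alone explains the missing output file is not certain):

// a pthread worker declared as returning void* must return a value; NULL is conventional
void* train_transe(void* arg) {
    /* ... existing training loop ... */
    return NULL;
}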
