davidlvxin / transc

Source code and datasets of EMNLP2018 paper: "Differentiating Concepts and Instances for Knowledge Graph Embedding".


transc's Introduction

Differentiating Concepts and Instances for Knowledge Graph Embedding

Source code and datasets of EMNLP2018 paper: "Differentiating Concepts and Instances for Knowledge Graph Embedding". https://arxiv.org/abs/1811.04588

News

Thanks to Shengding Hu for providing the PyTorch version of the training code.

Dataset

We provide the YAGO39K and M-YAGO39K datasets used in this work in the data/YAGO39K directory. The files are described below:

data/YAGO39K/Train

instance2id.txt, concept2id.txt, relation2id.txt: the id of every instance, concept and relation in this dataset.
triple2id.txt: normal triples represented as ids with format [head_instance_id tail_instance_id relation_id].
instanceOf2id.txt: instanceOf triples represented as ids with format [instance_id concept_id].
subClassOf2id.txt: subClassOf triples represented as ids with format [sub_concept_id concept_id].
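
As a minimal sketch of reading these files, the Python snippet below parses whitespace-separated id lines into tuples. The function name read_id_tuples is hypothetical, and the code assumes that any line with a different number of fields (for example a leading count line, if the files have one) can be skipped.

```python
from typing import Iterable, List, Tuple


def read_id_tuples(lines: Iterable[str], width: int) -> List[Tuple[int, ...]]:
    """Parse whitespace-separated integer ids, keeping only lines with `width` fields."""
    rows = []
    for line in lines:
        fields = line.split()
        # Assumption: lines with another field count (e.g. a header) are not data.
        if len(fields) != width:
            continue
        rows.append(tuple(int(f) for f in fields))
    return rows


# Illustrative in-memory content; real use would pass an open file object instead.
sample_triples = ["2", "0 1 5", "3 4 5"]
triples = read_id_tuples(sample_triples, width=3)  # (head, tail, relation)

sample_instance_of = ["0 7", "3 7"]
instance_of = read_id_tuples(sample_instance_of, width=2)  # (instance, concept)
```

Note the column order for normal triples: head and tail come first and the relation id comes last.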

data/YAGO39K/Valid

triple2id_positive.txt, instanceOf2id_positive.txt, subClassOf2id_positive.txt: positive triples from the YAGO dataset. Their formats are the same as those in data/YAGO39K/Train.
triple2id_negative.txt, instanceOf2id_negative.txt, subClassOf2id_negative.txt: negative triples that we generated for the Triple Classification experiments.

data/YAGO39K/Test

The test set. Its files follow the same layout as those in data/YAGO39K/Valid.

data/YAGO39K/M-Valid

The validation set for M-YAGO39K. The formats of all files are the same as those in data/YAGO39K/Valid.

data/YAGO39K/M-Test

The test set for M-YAGO39K. The formats of all files are the same as those in data/YAGO39K/Test.

Code

The source code for this work is in the src/ folder.

Compile

cd src
make

Train

cd src
./transC

You can add the following optional parameters when running transC:

-dim: the vector size.
-margin: the margin for normal triples.
-margin_ins: the margin for instanceOf triples.
-margin_sub: the margin for subClassOf triples.
-data: the dataset that you want to use. Only YAGO39K is available in the current version.
-l1flag: 1 or 0; 1 means using L1 loss and 0 means using L2 loss.
-bern: 1 or 0; 1 means using the "bern" strategy and 0 means using the "unif" strategy.
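
To illustrate how -margin, -l1flag and -bern typically interact in translation-based embedding models, here is a minimal sketch. It assumes a TransE-style score on h + r - t for normal triples and the Bernoulli corruption rule introduced with TransH. All function names are hypothetical, and this is not the repository's training code.

```python
from typing import Sequence


def translation_score(h: Sequence[float], r: Sequence[float],
                      t: Sequence[float], l1flag: bool) -> float:
    """Dissimilarity of h + r and t: L1 norm if l1flag, else squared L2 norm."""
    diffs = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if l1flag:
        return sum(abs(d) for d in diffs)
    return sum(d * d for d in diffs)


def margin_loss(pos_score: float, neg_score: float, margin: float) -> float:
    """Hinge loss: zero once the negative scores at least `margin` above the positive."""
    return max(0.0, margin + pos_score - neg_score)


def bern_head_probability(tails_per_head: float, heads_per_tail: float) -> float:
    """'bern' strategy: probability of corrupting the head rather than the tail."""
    return tails_per_head / (tails_per_head + heads_per_tail)


# Toy 2-dimensional vectors, chosen only for illustration.
pos = translation_score([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], l1flag=True)
neg = translation_score([0.0, 1.0], [1.0, 0.0], [0.0, 0.0], l1flag=True)
loss = margin_loss(pos, neg, margin=1.0)
```

Under this reading, the "unif" strategy corresponds to always using a head-corruption probability of 0.5, while "bern" adapts it per relation.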

After training, the vectors of entities, concepts and relations are stored in the vector/ folder.

Python Code

There is also Python code for the training part in py_version/. The Python version can be run with:

python -m py_version.transc

The outputs are stored in the same format and location as those of the C++ version.

Experiment

Link Prediction

For the Link Prediction experiment, run:

cd src
./test_link_prediction

You can add the following optional parameters when running test_link_prediction:

-dim: the vector size.
-data: the dataset that you want to use. Only YAGO39K is available in the current version.
-threads: how many threads you want to use.
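
Link prediction is commonly evaluated by ranking the correct entity among all candidates. The sketch below computes mean reciprocal rank and Hits@N from a list of ranks. The README does not state which metrics test_link_prediction reports, so treat the metric choice and the function names as assumptions.

```python
from typing import List


def rank_of(correct_score: float, other_scores: List[float]) -> int:
    """1-based rank of the correct entity; a lower score means a better match."""
    return 1 + sum(1 for s in other_scores if s < correct_score)


def mean_reciprocal_rank(ranks: List[int]) -> float:
    """Average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks: List[int], n: int) -> float:
    """Fraction of queries whose correct entity is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Two toy queries: scores of the correct entity vs. the other candidates.
ranks = [rank_of(0.2, [0.5, 0.1, 0.9]), rank_of(0.1, [0.5, 0.3])]
mrr = mean_reciprocal_rank(ranks)
hits1 = hits_at(ranks, 1)
```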

Normal Triple Classification

For the normal Triple Classification experiment, run:

cd src
./test_classification_normal

You can add the following optional parameters when running test_classification_normal:

-dim: the vector size.
-data: the dataset that you want to use. Only YAGO39K is available in the current version.
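
Triple classification is usually framed as thresholding a triple's score, with the threshold chosen on the validation positives and negatives and then applied to the test set. The following is a minimal sketch of that procedure under the assumption that lower scores indicate more plausible triples. The function names and toy scores are illustrative, not taken from the repository.

```python
from typing import List, Tuple


def accuracy(pos: List[float], neg: List[float], threshold: float) -> float:
    """A triple is predicted positive when its score is below the threshold."""
    correct = sum(1 for s in pos if s < threshold)
    correct += sum(1 for s in neg if s >= threshold)
    return correct / (len(pos) + len(neg))


def best_threshold(pos: List[float], neg: List[float]) -> Tuple[float, float]:
    """Try each observed score as a threshold and keep the most accurate one."""
    candidates = sorted(set(pos + neg))
    best = max(candidates, key=lambda th: accuracy(pos, neg, th))
    return best, accuracy(pos, neg, best)


# Toy validation and test scores, for illustration only.
valid_pos, valid_neg = [0.1, 0.4, 0.3], [0.9, 0.35, 0.8]
threshold, valid_acc = best_threshold(valid_pos, valid_neg)
test_acc = accuracy([0.2, 0.6], [0.7, 0.95], threshold)
```

A common refinement is to pick a separate threshold per relation. Whether the repository does so is not stated here.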

InstanceOf and SubClassOf Triple Classification

For the instanceOf and subClassOf Triple Classification experiments, run:

cd src
./test_classification_isA

You can add the following optional parameters when running test_classification_isA:

-dim: the vector size.
-data: the dataset that you want to use. Only YAGO39K is available in the current version.
-mix: 1 or 0; 1 means running this experiment on M-YAGO39K and 0 means running it on YAGO39K.
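
In the paper, each concept is represented as a sphere, and an instanceOf triple is considered to hold when the instance vector falls inside the concept's sphere. The sketch below scores that relationship as the distance to the sphere center minus the radius. The function name and toy vectors are illustrative and are not taken from the repository.

```python
import math
from typing import Sequence


def instance_of_score(instance: Sequence[float], center: Sequence[float],
                      radius: float) -> float:
    """Distance from the instance to the sphere center, minus the radius.

    A value <= 0 means the instance lies inside (or on) the concept sphere.
    """
    distance = math.sqrt(sum((i - c) ** 2 for i, c in zip(instance, center)))
    return distance - radius


# Toy 2-dimensional concept sphere with center (1, 0) and radius 2.
inside = instance_of_score([1.0, 1.0], [1.0, 0.0], 2.0)
outside = instance_of_score([4.0, 4.0], [1.0, 0.0], 2.0)
```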

Cite

If you use the code, please cite this paper:

Xin Lv, Lei Hou, Juanzi Li, Zhiyuan Liu. Differentiating Concepts and Instances for Knowledge Graph Embedding. The Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).


transc's Issues

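Segmentation fault (core dumped) problem

Hello! I learned a great deal from reading your paper.
However, I ran into a small problem while reproducing the code.
When running the three experiments (Link Prediction and the others), I get the following error:
Segmentation fault (core dumped). It seems to occur in void prepare() in test_link_prediction.cpp, at the statement "tmp = fscanf(fin, "%f", &entityVec[last + j]);". Is there a good way to fix this?
Any advice would be appreciated!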

vector file

I'm trying to run your link prediction test, but it crashes when it tries to open the vector file. Where can I find it, please?

Have you tried NELL KG?

Dear Authors,

Today I tried running TransC on NELL KG (http://rtw.ml.cmu.edu/rtw/) and found that it needs a lot of time to train the model. I started training 7 hours ago on a machine with 95 GB of RAM, and TransC is still training as I write this issue.
The NELL KG that I fed to the training process is small (38,806 triples), with 1188 concepts and 895 relations.
My questions are as follows:

Have you tested TransC on NELL KG?
Do the lengths of the concept labels and relation labels affect the training process of TransC? For instance, one of the concept labels in NELL KG is academicfieldusedbyeconomicsector and one of the relation labels is inverseofastronautwasonamissionwithastronaut.

Thank you very much.
Kemas Wiharja

Maximum dataset size?

Hello,
I've successfully run your code on a truncated version of my dataset, but I am having issues running it on the full version. Do you have a rough estimate of the maximum dataset size that I should attempt?
Thanks

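Hello, will a PyTorch version be released soon?

Dear author,
The approach in your paper is novel and very inspiring, and I would like to use it as a baseline for comparison. Will a PyTorch version be released soon, or is there any plan to release one in the near term? While browsing the issues, I saw that in 2018 you said a PyTorch version would be released later.

Wishing you a pleasant life and success in your research!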

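Hello, a question about the batch setting in the code

Dear author,
I see that both the C++ and the Python code define batches, but training still appears to use stochastic gradient descent. I do not quite understand the purpose of setting batches here. Could you offer some guidance?

Wishing you a pleasant life and success in your work.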

Provide Python interface

Thanks for sharing the code. Is it possible to provide a Python 3 interface to it all?

The cut parameter

Hello, what does the parameter cut in transC.cpp mean?

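Question about the YAGO39K dataset

Hello, I have the following questions about the dataset:

According to the dataset construction method described in the paper, every instance should belong to at least one concept, but my statistics show that a few hundred instances do not belong to any concept.
The paper states that the Train set of YAGO39K has 39 relations and 46110 concepts, but the actual dataset has only 37 relations and 46109 concepts.
Is there a story behind this? Is the second point the cause of the first? If not, how did you handle instances that have no concept? Thank you, and I look forward to your reply!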

Add Arxiv link

Please link https://arxiv.org/abs/1811.04588 in the readme.

Hello, is the M-YAGO39 dataset available?

Something is wrong with my results; I would appreciate some guidance

Instances with multiple concepts

Hi,

Thank you for the nice work! It's extremely inspiring to learn about the difference between instances and concepts. I am wondering whether TransC can learn embeddings for instances with more than one concept. For example, <Gary_Shelton> can be both a "soccer player" and a "person" at the same time. I don't understand from the paper how TransC handles multiple concepts. Could you kindly explain a bit? Thank you so much.

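Hello, I have a question about the loss function for subClassOf triples

Dear author,
Thank you for taking the time to answer questions. While reading your paper, I noticed that two loss functions are used in total across the three different subClassOf cases, but in the code only the more general one is used in the end. Was this choice based on experimental results, or is it because the loss proposed for the first two cases also covers the third case, so the loss that the paper proposes specifically for the third case was not used in actual training? Thanks again!

Wishing you a pleasant life and success in your research!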