nju-websoft / bootea

Bootstrapping Entity Alignment with Knowledge Graph Embedding, IJCAI 2018

License: MIT License

Python 100.00%

knowledge-graph-embedding entity-alignment

bootea's Introduction

BootEA

Source code and datasets for IJCAI-2018 paper "Bootstrapping Entity Alignment with Knowledge Graph Embedding".

Dataset

We use two datasets, DBP15K and DWY100K. DBP15K can be found here, while DWY100K is described below.

Id files of DWY100K

Folder "dataset/DWY100K/" contains the id files of DWY100K.

The subfolder "mapping/0_3" contains the id files used by BootEA and MTransE, while the subfolder "sharing/0_3" is for JAPE and IPTransE. The two datasets use 30% of the reference entity alignment as seeds. The id files in "sharing/0_3" follow the idea of parameter sharing, in which the two aligned entities of each seed pair share the same id; the id files in "mapping/0_3" do not.
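The difference between the two id schemes can be sketched as follows. This is only an illustration under assumptions, not the script used to generate the released files: the function name assign_ids, the toy entity names, and the choice to number KG2 entities after KG1 entities are all hypothetical.

```python
def assign_ids(kg1_entities, kg2_entities, seed_pairs, sharing):
    """Assign integer ids to the entities of two KGs (illustrative only).

    sharing=False ("mapping" style): every entity gets its own id.
    sharing=True  ("sharing" style): the two entities of each seed pair
    receive the same id.
    """
    ids1 = {entity: i for i, entity in enumerate(kg1_entities)}
    ids2 = {}
    # Map each KG2 seed entity to its aligned KG1 entity (sharing only).
    seed_of = {e2: e1 for e1, e2 in seed_pairs} if sharing else {}
    next_id = len(ids1)
    for entity in kg2_entities:
        if entity in seed_of:
            ids2[entity] = ids1[seed_of[entity]]  # reuse the aligned KG1 id
        else:
            ids2[entity] = next_id
            next_id += 1
    return ids1, ids2


# Hypothetical toy data: one seed pair out of two entities per KG.
kg1 = ["dbp:Paris", "dbp:Berlin"]
kg2 = ["wd:Q90", "wd:Q64"]
seeds = [("dbp:Paris", "wd:Q90")]

mapping_ids = assign_ids(kg1, kg2, seeds, sharing=False)
sharing_ids = assign_ids(kg1, kg2, seeds, sharing=True)
```

With sharing enabled, the seed pair collapses into a single id, so both entities are represented by one embedding; without it, the seed pair is kept as two separate ids that the model has to bring together.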

The subfolder "mapping/0_3" includes the following files:

ent_ids_1: entity ids in the source KG;
ent_ids_2: entity ids in the target KG;
ref_ent_ids: entity alignment for testing, list of pairs like (e_s \t e_t);
sup_ent_ids: seed entity alignment (training data);
rel_ids_1: relation ids in the source KG;
rel_ids_2: relation ids in the target KG;
triples_1: relation triples in the source KG;
triples_2: relation triples in the target KG;
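The files above can be read with a few lines of Python. The sketch below assumes that every file is tab-separated, that the ent_ids and rel_ids files hold "id \t name" lines, and that ref_ent_ids, sup_ent_ids and the triples files hold integer ids only; the helper names read_id_file and read_int_tuples are hypothetical. A tiny temporary folder stands in for the real data so that the example runs on its own.

```python
import os
import tempfile


def read_id_file(path):
    """Read 'id<TAB>name' lines into a dict {id: name}."""
    id2name = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            item_id, name = line.split("\t")
            id2name[int(item_id)] = name
    return id2name


def read_int_tuples(path, width):
    """Read integer tuples: width=2 for alignment pairs, 3 for triples."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if len(parts) != width:
                raise ValueError("expected %d fields: %r" % (width, line))
            rows.append(tuple(int(p) for p in parts))
    return rows


# Toy stand-in for a "mapping/0_3" folder (contents are made up).
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "ent_ids_1"), "w", encoding="utf-8") as f:
        f.write("0\thttp://dbpedia.org/resource/A\n")
        f.write("1\thttp://dbpedia.org/resource/B\n")
    with open(os.path.join(folder, "triples_1"), "w", encoding="utf-8") as f:
        f.write("0\t0\t1\n")
    entities = read_id_file(os.path.join(folder, "ent_ids_1"))
    triples = read_int_tuples(os.path.join(folder, "triples_1"), width=3)
```

For the real data, point the two helpers at a folder such as "dataset/DWY100K/dbp_yg/mapping/0_3" instead of the temporary one.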

The subfolder "sharing/0_3" includes the following additional files:

attr_ids_1: attribute ids in the source KG;
attr_ids_2: attribute ids in the target KG;
attr_range_type_1: attribute ranges in the source KG, list of pairs like (attribute id \t range code);
attr_range_type_2: attribute ranges in the target KG;
ent_attrs_1: entity attributes in the source KG;
ent_attrs_2: entity attributes in the target KG;
ref_ents: seed entity alignment denoted by URIs (training data);

Raw data of DWY100K

File "dataset/DWY100K_raw_data.zip" is the raw data of DWY100K, where each entity, relation or attribute is represented by a URI. Each dataset has the following files:

ent_links: all the entity links without training/test splits;
triples_1: relation triples in the source KG, list of triples like (h \t r \t t);
triples_2: relation triples in the target KG;
attr_triples_1: attribute triples in the source KG;
attr_triples_2: attribute triples in the target KG;
uri_attr_range_type_1: attribute ranges in the source KG, list of pairs like (attribute uri \t range code);
uri_attr_range_type_2: attribute ranges in the target KG;
attrs_1: entity attributes in the source KG;
attrs_2: entity attributes in the target KG;
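Because ent_links comes without a split, a training/test split has to be drawn before the raw data can be used. The sketch below shows one way to take 30% of the links as seeds; the function name split_links and the random seed are assumptions, and a split drawn this way will not reproduce the one in the released id files.

```python
import random


def split_links(links, seed_ratio=0.3, rng_seed=0):
    """Shuffle entity links and split them into (seeds, test)."""
    shuffled = list(links)
    random.Random(rng_seed).shuffle(shuffled)
    n_seeds = round(len(shuffled) * seed_ratio)
    return shuffled[:n_seeds], shuffled[n_seeds:]


# Hypothetical toy links; real ones would be the URI pairs in ent_links.
links = [("s%d" % i, "t%d" % i) for i in range(10)]
seeds, test = split_links(links)
```

Fixing rng_seed keeps the split stable across runs, which matters when results from several models are compared on the same seeds.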

Code

Folder "code" contains all the code of BootEA, in which:

"AlignE.py" is the implementation of AlignE;
"BootEA.py" is the implementation of BootEA;
"param.py" is the config file.

Dependencies

Python 3
Tensorflow 1.x
Scipy
Numpy
Graph-tool or igraph or NetworkX

If you fail to install Graph-tool, we suggest setting "self.heuristic = False" in param.py, which allows BootEA to run with igraph instead of Graph-tool. If you have trouble installing igraph, you can use NetworkX by modifying lines 186-189 of train_bp.py and replacing "mwgm_graph_tool" and "mwgm_igraph" with "mwgm_networkx". Note that igraph and NetworkX are much slower than Graph-tool!
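To see in advance which of the three graph libraries is available, the installed modules can be probed without importing them. This is a sketch and not part of the repository: the function names module_installed and pick_matching_backend are hypothetical, and the last line only mirrors the advice above about "self.heuristic".

```python
import importlib.util


def module_installed(name):
    """Return True if a top-level module can be found (it is not imported)."""
    return importlib.util.find_spec(name) is not None


def pick_matching_backend(is_available=module_installed):
    """Return the first available backend, fastest first, or None."""
    for module_name in ("graph_tool", "igraph", "networkx"):
        if is_available(module_name):
            return module_name
    return None


backend = pick_matching_backend()
# Following the advice above, heuristic = False is the suggested setting
# in param.py whenever Graph-tool is not installed.
heuristic = backend == "graph_tool"
```

If the result is "networkx", the change to train_bp.py described above is still needed; the probe only tells you which case applies.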

If you have any difficulty or questions in running the code or reproducing the experimental results, please email [email protected] and [email protected].

Citation

If you use this model or code, please cite it as follows:

@inproceedings{BootEA,
  author    = {Zequn Sun and Wei Hu and Qingheng Zhang and Yuzhong Qu},
  title     = {Bootstrapping Entity Alignment with Knowledge Graph Embedding},
  booktitle = {IJCAI},
  pages     = {4396--4402},
  year      = {2018}
}

bootea's Issues

Runtime problem

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 3.73 GiB for an array with shape (10000, 100000) and data type float32
This error occurs while the code is running. How can I solve it?
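The size in the error message follows directly from the array shape and data type. The arithmetic below reproduces it and, as a hypothetical mitigation that is not taken from the repository, estimates how many rows would fit into a given memory budget if such a matrix were computed in row blocks. Both function names are assumptions.

```python
def array_gib(shape, bytes_per_item=4):
    """Size in GiB of a dense array (float32 uses 4 bytes per item)."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * bytes_per_item / 2**30


def rows_per_block(n_cols, budget_gib, bytes_per_item=4):
    """Largest number of full rows that fits into the memory budget."""
    return int(budget_gib * 2**30 // (n_cols * bytes_per_item))


size = array_gib((10000, 100000))            # about 3.73 GiB
block_rows = rows_per_block(100000, budget_gib=1.0)
```

With a budget of 1 GiB, 2684 rows of 100000 float32 values fit at once, so the 10000 rows would need four blocks.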

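Question about the entity names of the alignment seeds

Hello, I am reproducing MultiKE on DBP15k.

However, the datasets used in the MultiKE paper are DBP-WD and DBP-YG, so I decided to reproduce it on these two datasets first and then on DBP15k.

(These two datasets come from this repository, which is why I am posting this issue here.)

While reproducing it, I found that the code lacks the part that obtains embeddings from entity names, so I used BERT. As a result, without training the network at all, the embeddings produced by BERT alone gave an alignment accuracy of 99%.

At first I thought BERT was simply that powerful, but I soon found that I was wrong.

I ran an experiment based on entity names only: when two entity names are equal, the two entities are considered aligned. No model is involved; it is just some if-else logic. The result is surprising: 100%.

The code is as follows: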

# 1. Build the id/name maps: KG1 id -> name, KG2 name -> id.
kg1_id2name = {}
kg2_name2id = {}
with open("./dataset/DWY100K/dbp_yg/mapping/0_3/ent_ids_1", encoding="utf-8") as f:
    for line in f:
        ent_id, name = line.strip().split("\t")
        kg1_id2name[int(ent_id)] = name
with open("./dataset/DWY100K/dbp_yg/mapping/0_3/ent_ids_2", encoding="utf-8") as f:
    for line in f:
        ent_id, name = line.strip().split("\t")
        kg2_name2id[name] = int(ent_id)

# 2. Load the test alignment: one (KG1 id, KG2 id) pair per line.
left_entities = []
right_entities = []
with open("./dataset/DWY100K/dbp_yg/mapping/0_3/ref_ent_ids", encoding="utf-8") as f:
    for line in f:
        a, b = line.split("\t")
        left_entities.append(int(a))
        right_entities.append(int(b))

# 3. Predict: drop everything up to "resource/" from the KG1 name and
#    look the remainder up among the KG2 names (-1 means no match).
my_right_entities = []
for left_entity in left_entities:
    left_entity_name = kg1_id2name[left_entity]
    idx = left_entity_name.rindex("resource/")
    short_name = left_entity_name[idx + len("resource/"):]
    my_right_entities.append(kg2_name2id.get(short_name, -1))

# 4. Result: share of test pairs whose predicted KG2 id is correct.
true_count = 0
for right_entity, my_right_entity in zip(right_entities, my_right_entities):
    if right_entity == my_right_entity:
        true_count += 1

print("true/all = %.2f%%" % (true_count / len(left_entities) * 100))

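(Screenshot of the result)

In other words, if the alignment seeds are aligned using only the information in the entity names, the accuracy is bound to be 100%.

This is very strange.

Am I using the dataset correctly? Have the entity names of the alignment seeds been leaked? How should I use these two datasets correctly?

I look forward to your clarification.

Question about the bootstrapping data

Hello, I am very interested in the method you proposed. We reproduced the experiments and found that the bootstrapping process selects new aligned entity pairs from ref_pairs. Shouldn't they be selected from the unaligned entities instead?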

Can you tell me about the description of DWY100k?

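The result file

I would like to ask why the results in the result file only align the entities in ref-pairs. I would also like to ask why the reported precision and recall decrease with the number of iterations.

A doubt about the data

Hello, I have recently been studying your BootEA code with admiration, and I have a small question about the data it uses: what is the difference between ref_ent_ids and sup_ent_ids? Both of them seem to be alignment seeds.

MemoryError

It looks like the memory is full. In the Task Manager, the GPU does not seem to be running. What is the problem? My other code runs on the GPU without trouble. How can I solve this?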

Question about the baseline IPTransE results in paper

Hi, Zequn.
I tried to use https://github.com/thunlp/IEAJKE (IPTransE) to reproduce the results of IPTransE on DBP15Kzh-en, but my results are much worse than those reported in your BootEA paper (my results are Hit@1: 0.1941, Hit@10: 0.4685, MRR: 0.2833). I only converted DBP15Kzh-en to the data format that IPTransE needs; no parameters were modified.
Could you share some advice on this? (Or would you kindly share the code and input dataset that you used?)
Thanks!

About AlignE with different proportions of seeds

Hi, could you please share the results of AlignE under different proportions of seeds? The charts in the paper only show the results of BootEA.