
kgt5's People

Contributors

adrianks, apoorvumang


kgt5's Issues

1to1 mapping split issues

Hi Apoorv,

There already was a question about one-to-one mapping between entities and relations in #5. You shared a file with the suggested one-to-one mapping. But the issue is that one entity has no mapping there, and about 190 entities have duplicated names. Examples of duplicated names are john mclachlan, xing kong, belle vue, etc. The id with a missing name is Q11159396.
In the original paper you suggested using the following scheme of resolving such issues with duplicated names:

However, multiple entities can have identical canonical mentions; we disambiguate such entities by appending the name with their 1-line description if available. In all other cases of identical canonical mentions we extend each mention with a unique id.

However, while 1-line descriptions were provided for some entities, for others there was no such option, and there is no disambiguation scheme at all in the file you shared.


Here I provide a file with an improved one-to-one mapping, without the missing element and with entities disambiguated by the scheme you described in the paper (e.g. for id Q8044536 the duplicated name becomes xing kong_8044536).

Anyway, I got somewhat confused by the mapping you shared. It also contains more entities (>300k more) than wikidata5m has overall. Is the file you shared the same one you used for verbalisation in your work on KGT5? And did I understand and apply your disambiguation scheme correctly?
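For reference, the disambiguation scheme described in the paper could be sketched roughly like this (a minimal sketch; the function name and data layout are assumptions, not the authors' actual code):

```python
from collections import Counter

def disambiguate(entities):
    """entities: dict mapping Wikidata id -> (name, one_line_description or None).
    Returns dict id -> unique surface form, per the paper's scheme:
    duplicated names get the 1-line description appended if available,
    otherwise the numeric part of the id."""
    counts = Counter(name for name, _ in entities.values())
    mapping = {}
    for eid, (name, desc) in entities.items():
        if counts[name] > 1:
            # description first, numeric id as fallback
            suffix = desc if desc else eid.lstrip("Q")
            mapping[eid] = f"{name}_{suffix}"
        else:
            mapping[eid] = name
    return mapping
```

Applied to the example above, Q8044536 with no description would become "xing kong_8044536", while unique names stay untouched.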

Thank you in advance.

MRR calculation

Hi @apoorvumang!
In issue #8 you mentioned that you used code from eval_accelerate.py to calculate the metrics reported in the paper. It is still not clear to me how you calculated MRR, because all the functions in eval_accelerate.py seem to be related to Hits@k metrics. Nevertheless, you reported MRR values for both Wikidata5M and Wikidata90M. Can you please elaborate on how one can evaluate the model in terms of the MRR metric? From what I found in evaluate.py, it seems infeasible to run model inference #of entities (5M) * len(test dataset) times.
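For what it's worth, once a rank of the true entity is available for each test query (e.g. estimated from the top-k sampled candidates, with entities outside the sample assigned a rank beyond k), MRR and Hits@k reduce to simple averages. A minimal sketch (hypothetical helper names, not the repo's code):

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose true entity ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, ranks [1, 3, 10, 2] give Hits@1 = 0.25 and Hits@10 = 1.0, while the MRR averages the reciprocal ranks of those four queries.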

Thank you for your support.

Pretrained model and Chinese data

Hello, this work is terrific, and I am happy to have found it. But I have some questions. Can I use your model on Chinese triple data? If so, do I need to train your model again?

Answering questions


I followed your advice and pushed into the model
input = "predict answer: what can cause a tsunami"
and then
out = topkSample(input, model, tokenizer, num_samples=5)
But as you can see, the results are extremely poor, while in your notebooks this approach showed good results. Can you please help me with this issue?

Question about KGC datasets

Hi,

Thanks for the great work! Could you also share the KGC datasets FB15k-237 and WN18RR (in the same format as wikidata5m)? BTW, I also see the dataset codex-m in the shared data but did not find its results in your paper. Did you also run experiments on codex-m?

Verbalization procedure question

Thank you for the great work! I still have a question:
For example, take the initial question: What instrument did Jimi Hendrix play?
As stated in the article, the model requires it to be verbalized into the (s, p, ?) form, giving: Jimi Hendrix | instrument
where Jimi Hendrix is the subject and instrument is the relation, and only the verbalized question can be pushed into the topkSample function.

My question: do you provide code for this verbalization procedure?
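A minimal sketch of what such a verbalization step might look like (the helper name and the "predict tail:" task prefix are assumptions based on the train.txt format quoted in other issues, not the authors' code):

```python
def verbalize_query(subject, relation):
    """Render an (s, p, ?) query in the 'subject | relation' text form,
    prefixed with the task marker seen in the training data."""
    return f"predict tail: {subject} | {relation}"
```

The output string would then be fed to the model the same way as any other input, e.g. via topkSample.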

Implementing the KGT5 pipeline for my own constructed KG

Hi Apoorv
Great work with KGT5 model.

I basically want to implement the entire KGT5 pipeline for my own constructed KG. I published a paper (Knowledge Graph – Deep Learning: A Case Study in Question Answering in Aviation Safety Domain) in LREC 2022 where I contributed the Aviation KG and showed results with a KG+DL QA system. Our combined QA system performed better than the individually constructed DLQA and KGQA systems. Following your paper, I can handle the training on triples, but I need help with fine-tuning the model, as the code in your other branch is not clean and hard to follow. It would be great if a README could be provided.

KGQA code

Thanks. For question answering, would you be willing to share the data processing steps and the code for fine-tuning and inference on the QA pairs?

KGQA data split

Hi Apoorv,

Thank you for the nice work on KGT5.

I have a question: how can we find the data split for KGQA? I went through the link for downloading datasets, but the KGQA dataset splits mentioned in the paper are not included. I wonder whether I missed something, or whether they have not been released yet?

Thanks and regards.

kgqa

Hello, may I ask whether the explanation for KGQA can be updated?

Tokenizer questions

In the paper you explicitly mentioned that you trained a BPE tokenizer for your experiments. However, in the code of dataset.py you used T5TokenizerFast, which is based on Unigram. Moreover, you used a pretrained tokenizer in the code.

Could you please clarify which tokenizer configurations were used in your experiments for their reproducibility?

And could you please also specify the vocabulary size for WN18RR, FB15k-237, and YAGO3-10, as there is no info about these datasets in the paper?

General Question

Hi Apoorv,

It was great to read the KGT5 paper.
If I understand correctly, during the fine-tuning phase on the QA dataset, we do not use any (retrieved) knowledge graph (subgraph), i.e. we just use the T5 model to answer the question.
Other works such as QA-GNN/GreaseLM use a retrieved knowledge graph along with a language model, reasoning over the two modalities to answer the question.
Do you think we can do something similar with KGT5, or have you tried it in any of your experiments?

Thanks

A bug when creating entity_strings.txt for wikidata5m

When I create entity_strings.txt for wikidata5m, it reports an error.

python /home/zjj/kgt5/data/get_unique_entities.py --dataset wikidata5m
285780it [00:00, 596976.79it/s]
Traceback (most recent call last):
  File "/home/zjj/kgt5/data/get_unique_entities.py", line 26, in <module>
    unique_entities.add(split_sentence[1].strip())
IndexError: list index out of range

When I print the offending line from train.txt, it is
predict tail: creation | destruction | instance of | bonus tracks

So how do I solve this? Just skip this line?
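Assuming the script splits each line on some delimiter and then indexes the result, one defensive workaround is to skip (and log) lines whose split does not yield the expected fields, e.g. because an entity name itself contains the delimiter. A sketch, not the actual get_unique_entities.py code; the delimiter and expected layout are assumptions:

```python
def collect_unique_entities(lines, delimiter="|"):
    """Collect tail-entity strings from verbalized lines, skipping lines
    whose split yields fewer fields than expected instead of crashing."""
    unique_entities = set()
    skipped = []
    for line in lines:
        parts = line.split(delimiter)
        if len(parts) < 2:
            skipped.append(line)  # log for later inspection
            continue
        unique_entities.add(parts[1].strip())
    return unique_entities, skipped
```

Whether skipping is acceptable, or whether such lines should be re-parsed on the last delimiter instead, depends on how the training file was generated.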

Training time of KGT5

Could you please provide an estimate of the training time of the model, say on the FB15k-237 dataset, along with the number of triples in the train set and the number of epochs?

If FB15k-237 is difficult, an estimate of how much time it takes to train KGT5 on 100 triples would also help.

Questions about Link Prediction Task

Nice work. However, I have some questions about the link prediction task on the Wikidata5m dataset.
With your training code, I trained the model for 4M steps, and the Hits@1 accuracy is only 0.22, which is far lower than the 0.267 reported in your paper. The code I ran is main_accelerate.py, and the command I used for training is:
CUDA_VISIBLE_DEVICES=0 python3 main_accelerate.py \
    --save_prefix wd5m-1gpu \
    --model_size small --dataset wikidata5m \
    --batch_size 64 --save_steps 5000 \
    --loss_steps 500
Can you provide more details about how you obtain such a result of 0.267? Thanks.

Question about QA-fine-tuning

Hi Apoorv, nice work. I have some issues with the QA fine-tuning.
I experimented with the MetaQA dataset using the code under the apoorv-dump branch with the following training details:

  1. model_size: T5-small
  2. checkpoint: 3330000.pt (KGC results on Wikidata5M: 21.6 Hits@1)
  3. epochs: 60, batch size: 64
  4. INPUT: "predict answer: topic entity tokens | question tokens with NE |" OUTPUT: "answer tokens"

However, the best accuracy of my model on the qa_test set was only 40.7%/12.9%/26.6% (1-hop/2-hop/3-hop).
Am I missing some details in the experiment that make it less accurate? Please let me know. It would be great if you could give me a high-accuracy checkpoint.

One-to-one mapping between an entity/relation and its textual representation

I want to ask whether the current Wikidata5m dataset has been converted into a format with a one-to-one mapping between an entity/relation and its textual representation. If so, could you provide a mapping file between the textual representation and the original entity ID, such as Donald John Trump | Q22686? Thanks!
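If such a file existed in a per-line "name | Qid" format, it could be read with something like the following (a hypothetical sketch; splitting on the last "|" via rpartition tolerates names that themselves contain "|"):

```python
def load_mapping(lines):
    """Parse 'textual name | Qid' lines into a name -> id dict."""
    name2id = {}
    for line in lines:
        name, sep, eid = line.rpartition("|")
        if not sep:
            continue  # skip lines without a separator
        name2id[name.strip()] = eid.strip()
    return name2id
```

The actual file layout (if shared) might of course differ, e.g. tab-separated or id-first.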

Question about training step

Hi Apoorv
Great work with KGT5 model.

I followed your code to train T5 from scratch. However, it makes no progress after starting fresh. Have you encountered such a problem, or could you give me some suggestions?
Starting fresh 0%| | 0/166748 [00:00<?, ?batches/s]

This is the command I use to start training:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
    --nproc_per_node 4 \
    main_accelerate.py \
    --save_prefix wd5m \
    --model_size small --dataset wikidata5m \
    --batch_size 64 --save_steps 5000 \
    --loss_steps 500

Inference time when calculating MRR

Hi, @apoorvumang, this work is so great!

I tried to reproduce the reported MRR, MR, and Hits@N results on Wikidata5M with evaluate.py, but I found it takes quite a long time. Could you please provide an estimate of the evaluation time for calculating MRR?

Thanks!

KGC data split

Hi Apoorv,

Thank you for sharing the KGT5 repo. Sorry for bothering you, but could you please tell me whether there is a link to the dataset split you used for the link prediction task mentioned in the paper? I can't find it in previous issues, and the link from the repo leads to a dataset with the same number of triples as the original wikidata5m.

Thank you in advance.

Questions about training in link prediction

Nice work, I have some questions about the training process of KGT5.
(1) 1 vs All method: the paper says the KGT5 model is trained using the 1 vs All method, but the loss function in main_accelerate.py is plain cross-entropy.
(2) Link prediction: I used the provided code to perform link prediction on the WN18RR dataset, and Hits@1 is only 0.108.
Can you provide more details on training and experimental setup?

Questions about link prediction

Thanks for your ingenious approach! While reproducing the link prediction results, I am wondering where the model calculates Hits@1 and Hits@10 on Wikidata5M? I only found loss and accuracy. Looking forward to your reply.
