
kgt5's People

Contributors

adrianks, apoorvumang


kgt5's Issues

1to1 mapping split issues

Hi Apoorv,

There already was a question about one-to-one mapping between entities and relations in #5. You shared a file with the suggested one-to-one mapping. But the issue is that one entity has no mapping there, and about 190 entities have duplicated names. Examples of duplicated names are john mclachlan, xing kong, belle vue, etc. The id with a missing name is Q11159396.
In the original paper you suggested using the following scheme of resolving such issues with duplicated names:

However, multiple entities can have identical canonical mentions; we disambiguate such entities by appending the name with their 1-line description if available. In all other cases of identical canonical mentions we extend each mention with a unique id.

However, while 1-line descriptions were provided for some entities, for others there was no such option, and there is no disambiguation scheme at all in the file you shared.


Here I provide a file with an improved one-to-one mapping, without the missing element and with entities disambiguated by the scheme you described in the paper (e.g. for id Q8044536 the duplicated name becomes xing kong_8044536).

Anyway, I got somewhat confused by the mapping you shared. It also contains more entities (>300k more) than wikidata5m has overall. Is the file you shared the same one you used for verbalisation in your work on KGT5? And did I understand and apply your disambiguation scheme correctly?
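For reference, the disambiguation scheme described in the paper could be sketched roughly like this (a minimal sketch; the function name and data layout are assumptions, not the authors' actual code):

```python
from collections import Counter

def disambiguate(entities):
    """entities: dict mapping Wikidata id -> (name, one_line_description or None).
    Returns dict id -> unique surface form, per the paper's scheme:
    duplicated names get the 1-line description appended if available,
    otherwise the numeric part of the id."""
    counts = Counter(name for name, _ in entities.values())
    mapping = {}
    for eid, (name, desc) in entities.items():
        if counts[name] > 1:
            # description first, numeric id as fallback
            suffix = desc if desc else eid.lstrip("Q")
            mapping[eid] = f"{name}_{suffix}"
        else:
            mapping[eid] = name
    return mapping
```

Applied to the example above, Q8044536 with no description would become "xing kong_8044536", while unique names stay untouched.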

Thank you in advance.

MRR calculation

Hi @apoorvumang!
In issue #8 you mentioned that you used code from eval_accelerate.py to calculate the metrics reported in the paper. It is still not clear to me how you calculated MRR, because all the functions in eval_accelerate.py seem to be related to Hits@k metrics. Nevertheless, you reported MRR values for both Wikidata5M and Wikidata90M. Can you please elaborate on how one can evaluate the model in terms of the MRR metric? From what I found in evaluate.py, it seems infeasible to run model inference #of entities (5M) * len(test dataset) times.
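For what it's worth, once a rank of the true entity is available for each test query (e.g. estimated from the top-k sampled candidates, with entities outside the sample assigned a rank beyond k), MRR and Hits@k reduce to simple averages. A minimal sketch (hypothetical helper names, not the repo's code):

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose true entity ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, ranks [1, 3, 10, 2] give Hits@1 = 0.25 and Hits@10 = 1.0, while the MRR averages the reciprocal ranks of those four queries.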

Thank you for your support.

Pretrained model and Chinese data

Hello, this work is terrific, and I am happy to have found it. But I have some questions. Can I use your model on Chinese triple data? If so, do I need to train your model again?

Answering questions


I followed your advice and pushed into the model
input = "predict answer: what can cause a tsunami"
and then
out = topkSample(input, model, tokenizer, num_samples=5)
But as you can see, the results are extremely poor, while in your notebooks this approach showed good results. Can you please help me with this issue?

Question about KGC datasets

Hi,

Thanks for the great work! Could you also share the KGC datasets FB15k-237 and WN18RR (in the same format as wikidata5m)? BTW, I also see the dataset codex-m in the shared data but did not find its results in your paper. Did you also run experiments on codex-m?

Verbalization procedure question

Thank you for the great work! I still have a question:
For example, take the initial question: What instrument did Jimi Hendrix play?
As stated in the article, the model requires it to be verbalized into the (s, p, ?) form, giving: Jimi Hendrix | instrument
where Jimi Hendrix is the subject and instrument is the relation, and only the verbalized question can be pushed into the topkSample function.

My question: do you provide code for this verbalization procedure?
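A minimal sketch of what such a verbalization step might look like (the helper name and the "predict tail:" task prefix are assumptions based on the train.txt format quoted in other issues, not the authors' code):

```python
def verbalize_query(subject, relation):
    """Render an (s, p, ?) query in the 'subject | relation' text form,
    prefixed with the task marker seen in the training data."""
    return f"predict tail: {subject} | {relation}"
```

The output string would then be fed to the model the same way as any other input, e.g. via topkSample.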

Implementing the KGT5 pipeline for my own constructed KG

Hi Apoorv
Great work with KGT5 model.

I basically want to implement the entire KGT5 pipeline for my own constructed KG. I published a paper (Knowledge Graph – Deep Learning: A Case Study in Question Answering in Aviation Safety Domain) in LREC 2022 where I contributed the Aviation KG and showed results with a KG+DL QA system. Our combined QA system performed better than the individually constructed DLQA and KGQA systems. Following your paper, I can handle the training on triples, but I need help with fine-tuning the model, as the code in your other branch is not clean and hard to follow. It would be great if a README could be provided.

KGQA code

Thanks. For question answering, would you be willing to share the data processing steps and the code for fine-tuning and inference on the QA pairs?

KGQA data split

Hi Apoorv,

Thank you for the nice work on KGT5.

I have a question: how can we find the data split for KGQA? I went through the link for downloading datasets, but the KGQA dataset splits mentioned in the paper are not included. I wonder whether I missed something, or whether they have not been released yet?

Thanks and regards.

kgqa

Hello, may I ask whether the explanation for KGQA can be updated?

Tokenizer questions

In the paper you explicitly mentioned that you trained a BPE tokenizer for your experiments. However, in the code of dataset.py you used T5TokenizerFast, which is based on Unigram. Moreover, you used a pretrained tokenizer in the code.

Could you please clarify which tokenizer configurations were used in your experiments for their reproducibility?

And could you please also specify the vocabulary size for WN18RR, FB15k-237, and YAGO3-10, as there is no info about these datasets in the paper?

General Question

Hi Apoorv,

It was great to read the KGT5 paper.
If I understand correctly, during the fine-tuning phase on the QA dataset, we do not use any (retrieved) knowledge graph (subgraph), i.e. we just use the T5 model to answer the question.
Other works such as QA-GNN/GreaseLM use a retrieved knowledge graph along with a language model, reasoning over the two modalities to answer the question.
Do you think we can do something similar with KGT5, or have you tried it in any of your experiments?

Thanks

A bug when creating entity_strings.txt for wikidata5m

When I create entity_strings.txt for wikidata5m, it reports an error.

python /home/zjj/kgt5/data/get_unique_entities.py --dataset wikidata5m
285780it [00:00, 596976.79it/s]
Traceback (most recent call last):
  File "/home/zjj/kgt5/data/get_unique_entities.py", line 26, in <module>
    unique_entities.add(split_sentence[1].strip())
IndexError: list index out of range

When I print the offending line from train.txt, it is
predict tail: creation | destruction | instance of | bonus tracks

So how do I solve this? Just skip this line?
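Assuming the script splits each line on some delimiter and then indexes the result, one defensive workaround is to skip (and log) lines whose split does not yield the expected fields, e.g. because an entity name itself contains the delimiter. A sketch, not the actual get_unique_entities.py code; the delimiter and expected layout are assumptions:

```python
def collect_unique_entities(lines, delimiter="|"):
    """Collect tail-entity strings from verbalized lines, skipping lines
    whose split yields fewer fields than expected instead of crashing."""
    unique_entities = set()
    skipped = []
    for line in lines:
        parts = line.split(delimiter)
        if len(parts) < 2:
            skipped.append(line)  # log for later inspection
            continue
        unique_entities.add(parts[1].strip())
    return unique_entities, skipped
```

Whether skipping is acceptable, or whether such lines should be re-parsed on the last delimiter instead, depends on how the training file was generated.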

Training time of KGT5

Could you please provide an estimate of the training time of the model, say on the FB15k-237 dataset, along with the number of triples in the train set and the number of epochs?

If FB15k-237 is difficult, an estimate of how much time it takes to train KGT5 on 100 triples would also help.

Questions about Link Prediction Task

Nice work. However, I have some questions about the link prediction task on the Wikidata5m dataset.
With your training code, I trained the model for 4M steps, and the Hits@1 accuracy is only 0.22, which is far lower than the 0.267 reported in your paper. The code I ran is main_accelerate.py, and the command I used for training is:
CUDA_VISIBLE_DEVICES=0 python3 main_accelerate.py \
    --save_prefix wd5m-1gpu \
    --model_size small --dataset wikidata5m \
    --batch_size 64 --save_steps 5000 \
    --loss_steps 500
Can you provide more details about how you obtain such a result of 0.267? Thanks.

Question about QA-fine-tuning

Hi Apoorv, nice work. I have some issues with the QA fine-tuning.
I experimented with the MetaQA dataset using the code under the apoorv-dump branch with the following training details:

  1. model_size: T5-small
  2. checkpoint: 3330000.pt (KGC results on Wikidata5M: 21.6 Hits@1)
  3. epochs: 60, batch size: 64
  4. INPUT: "predict answer: topic entity tokens | question tokens with NE |" OUTPUT: "answer tokens"

However, the best accuracy of my model on the qa_test set was only 40.7%/12.9%/26.6% (1-hop/2-hop/3-hop).
Am I missing some details in the experiment that make it less accurate? Please let me know. It would be great if you could give me a high-accuracy checkpoint.

One-to-one mapping between an entity/relation and its textual representation

I want to ask whether the current Wikidata5m dataset has been converted into a format with a one-to-one mapping between an entity/relation and its textual representation. If so, could you provide a mapping file between the textual representation and the original entity ID, such as Donald John Trump | Q22686? Thanks!
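If such a file existed in a per-line "name | Qid" format, it could be read with something like the following (a hypothetical sketch; splitting on the last "|" via rpartition tolerates names that themselves contain "|"):

```python
def load_mapping(lines):
    """Parse 'textual name | Qid' lines into a name -> id dict."""
    name2id = {}
    for line in lines:
        name, sep, eid = line.rpartition("|")
        if not sep:
            continue  # skip lines without a separator
        name2id[name.strip()] = eid.strip()
    return name2id
```

The actual file layout (if shared) might of course differ, e.g. tab-separated or id-first.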

Question about training step

Hi Apoorv
Great work with KGT5 model.

I followed your code to train T5 from scratch. However, it makes no progress after starting fresh. Have you encountered such a problem, or could you give me some suggestions?
Starting fresh 0%| | 0/166748 [00:00<?, ?batches/s]

This is the command I use to start training:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
    --nproc_per_node 4 \
    main_accelerate.py \
    --save_prefix wd5m \
    --model_size small --dataset wikidata5m \
    --batch_size 64 --save_steps 5000 \
    --loss_steps 500

Inference time when calculating MRR

Hi, @apoorvumang, this work is so great!

I tried to reproduce the reported MRR, MR, and Hits@N results on Wikidata5M with evaluate.py, but I found it takes quite a long time. Could you please provide an estimate of the evaluation time for calculating MRR?

Thanks!

KGC data split

Hi Apoorv,

Thank you for sharing the KGT5 repo. Sorry for bothering you, but could you please tell me whether there is a link to the dataset split you used for the link prediction task mentioned in the paper? I can't find it in previous issues, and the link from the repo leads to a dataset with the same number of triples as the original wikidata5m.

Thank you in advance.

Questions about training in link prediction

Nice work, I have some questions about the training process of KGT5.
(1) 1 vs All method: the paper says the KGT5 model is trained using the 1 vs All method, but the loss function in main_accelerate.py is plain cross-entropy.
(2) Link prediction: I used the provided code to perform link prediction on the WN18RR dataset, and Hits@1 is only 0.108.
Can you provide more details on training and experimental setup?

Questions about link prediction

Thanks for your ingenious approach! While reproducing the link prediction results, I am wondering where the model calculates Hits@1 and Hits@10 on Wikidata5M? I only found loss and accuracy. Looking forward to your reply.
