yuhaozhang / tacred-relation
PyTorch implementation of the position-aware attention model for relation extraction
License: Other
When I load the embedding matrix with emb_matrix = np.load(emb_file), its shape is (39, 300). When I then try to train, I get the error RuntimeError: Sizes of tensors must match except in dimension 1. Got 9 and 10 (The offending index is 0). Just before the error, the shape of emb_matrix is torch.Size([50, 10, 300]). I suspect this is somehow because 10 does not divide 39, but being new to this code I don't quite follow. What would I need to do to get it to train?
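One common cause of this kind of shape mismatch is that the vocabulary and the embedding matrix were built at different times, so the number of embedding rows no longer matches the vocabulary size. A minimal diagnostic sketch (the function name and shapes are illustrative, not the repo's API):

```python
import numpy as np

def check_embeddings(emb_matrix, vocab):
    """Return True when every vocab id maps to an embedding row."""
    n_rows = emb_matrix.shape[0]
    return n_rows == len(vocab)

# Toy illustration: a 39-row matrix cannot serve a 50-word vocabulary,
# so vocab and embeddings must be rebuilt together from the same run.
emb = np.zeros((39, 300))
vocab = [f"w{i}" for i in range(50)]
print(check_embeddings(emb, vocab))  # False
```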
Hi, I noticed that there seem to be two versions of TACRED: some papers use a TACRED dataset with a total size of 119,474 instead of 106,264. Is it possible to make a fair comparison between models using these two different sizes of TACRED? And how can I get the version with a total size of 119,474?
Evaluating on dev set...
Precision (micro): 100.000%
Recall (micro): 0.000%
F1 (micro): 0.000%
epoch 28: train_loss = 0.757685, dev_loss = 8.513657, dev_f1 = 0.0000
model saved to ./saved_models/00/checkpoint_epoch_28.pt
Evaluating on dev set...
Precision (micro): 100.000%
Recall (micro): 0.000%
F1 (micro): 0.000%
epoch 29: train_loss = 0.749475, dev_loss = 8.528827, dev_f1 = 0.0000
model saved to ./saved_models/00/checkpoint_epoch_29.pt
Evaluating on dev set...
Precision (micro): 100.000%
Recall (micro): 0.000%
F1 (micro): 0.000%
epoch 30: train_loss = 0.732781, dev_loss = 8.541955, dev_f1 = 0.0000
model saved to ./saved_models/00/checkpoint_epoch_30.pt
Training ended with 30 epochs.
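The pattern in the log above (100% precision, 0% recall) typically means the model is predicting no_relation for every dev example. Since the TACRED scorer excludes the negative class, zero predicted positives gives zero false positives, and precision is conventionally reported as 100%. A sketch of that scoring convention (the 100%-on-empty convention is an assumption about the scorer, stated here explicitly):

```python
# Micro P/R/F1 with the negative class excluded, TACRED-style.
def micro_prf(golds, preds, neg="no_relation"):
    tp = sum(1 for g, p in zip(golds, preds) if p != neg and p == g)
    pred_pos = sum(1 for p in preds if p != neg)
    gold_pos = sum(1 for g in golds if g != neg)
    prec = tp / pred_pos if pred_pos else 1.0   # empty predictions -> 100%
    rec = tp / gold_pos if gold_pos else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# A model that predicts only no_relation scores P=100%, R=0%, F1=0%.
print(micro_prf(["per:title", "no_relation"], ["no_relation", "no_relation"]))
# (1.0, 0.0, 0.0)
```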
Hi Yuhao,
First of all, congratulations on achieving new state-of-the-art results in the field with a completely novel technique.
Out of curiosity, I wanted to reproduce the results of the paper you published but could not find the dataset. I checked the LDC new-corpora page (https://www.ldc.upenn.edu/new-corpora) but could not find any reference to the dataset you mentioned.
It would be a great help if you could confirm whether the dataset is available on LDC, and if not, what the tentative release date would be.
Cheers,
Apoorv
In file model/layers.py, line 25, there is an error: lens should be x_lens.
Is there a script that can convert a plain txt file into the TACRED data format, so that it can then be used for prediction with a pre-trained model?
Thanks and God bless,
Hi, a quick question. I'm seeing:
Precision (micro): 100.000%
Recall (micro): 0.000%
F1 (micro): 0.000%
How can this situation be resolved?
Hi, Thanks for the interesting paper and code. Do you happen to have some pretrained models for us to try out? Thanks again!
Hi,
I see that some information provided by Stanford CoreNLP is not included in the dataset. Can you tell me which version of CoreNLP was used for preprocessing (3.8, 3.9, 4.0, ...)? I want to rerun the preprocessing without running into tokenization mismatches.
Thanks in advance!
Hi, I just finished training the model on the sample dataset, but when I try to predict on some random text, it throws the following error:
# after evaluation ended, I loaded the model and added this call to print the predictions
print (model.predict("Youth minister and Street General, Charles Ble Goude, who is under UN sanctions for acts of violence by street militias, including beatings, rapes and extrajudicial killings, vows to fight for Ivory Coast's sovereignty"))
line 161, in forward
seq_lens = list(masks.data.eq(constant.PAD_ID).long().sum(1).squeeze())
AttributeError: 'str' object has no attribute 'data'
How do I solve this issue?
Thank you in advance.
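The AttributeError happens because forward() computes sequence lengths from a padding-mask tensor, so it cannot take a raw string: the text must first go through the same preprocessing as the training data (tokenization, entity masking, vocab id lookup) and be batched into tensors. A minimal numpy sketch of what the failing line expects as input (PAD_ID = 0 and the toy ids are assumptions):

```python
import numpy as np

PAD_ID = 0
# Token ids after vocab lookup, padded to a fixed length.
tokens = np.array([[5, 9, 3, PAD_ID, PAD_ID]])
# The model derives per-sequence lengths by counting non-pad positions;
# a plain string has no such structure, hence the AttributeError.
seq_lens = (tokens != PAD_ID).sum(axis=1)
print(seq_lens.tolist())  # [3]
```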
I have a small point of confusion about the metrics. It seems you are using micro averaging for precision, recall, and F1. Would it make sense to use macro averaging instead? To my knowledge, SemEval uses macro averaging, so I am wondering whether there was a specific reason you chose micro.
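For context, micro averaging pools all decisions (so frequent classes dominate), while macro averaging weights every class equally. A toy sketch in a generic single-label setting (note this ignores TACRED's exclusion of no_relation from scoring):

```python
def per_class_f1(golds, preds, cls):
    """F1 of a single class from raw label lists."""
    tp = sum(g == p == cls for g, p in zip(golds, preds))
    fp = sum(p == cls and g != cls for g, p in zip(golds, preds))
    fn = sum(g == cls and p != cls for g, p in zip(golds, preds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

golds = ["a", "a", "a", "a", "b"]
preds = ["a", "a", "a", "a", "a"]   # ignores the rare class entirely
macro = sum(per_class_f1(golds, preds, c) for c in {"a", "b"}) / 2
# In single-label classification, micro F1 over all classes equals accuracy.
micro = sum(g == p for g, p in zip(golds, preds)) / len(golds)
print(round(micro, 2), round(macro, 2))  # 0.8 0.44
```

The gap (0.8 vs 0.44) shows why the choice matters on an imbalanced label set like TACRED's.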
In section 4.2 of the paper, you say this preprocessing step helps (1) provide the model with entity type information, and (2) prevent the model from overfitting its predictions to specific entities. But I think it might instead encourage overfitting: the original subject and object entities are highly varied, while after this preprocessing they are masked to a small set of special tokens, which are far less diverse than the original tokens. So why does this preprocessing step prevent the model from overfitting?
Another question: if I feed new text to this model, I first have to recognize named entities in the text. How can I do that without Stanford CoreNLP? And can this model be applied to other datasets or real-life scenarios?
Thanks and God bless.
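On the overfitting point: the masking collapses many distinct surface strings into a handful of type tokens, so the model cannot memorize that, say, a specific person name implies a specific relation; it can only learn from the entity types and context. A sketch of the masking step the paper describes (0-based inclusive spans and the exact token spellings are illustrative assumptions):

```python
def mask_entities(tokens, subj_span, subj_type, obj_span, obj_type):
    """Replace subject/object spans with their NER-type placeholder tokens."""
    out = list(tokens)
    s0, s1 = subj_span
    o0, o1 = obj_span
    for i in range(s0, s1 + 1):
        out[i] = f"SUBJ-{subj_type}"
    for i in range(o0, o1 + 1):
        out[i] = f"OBJ-{obj_type}"
    return out

toks = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
print(mask_entities(toks, (0, 1), "PERSON", (5, 5), "LOCATION"))
# ['SUBJ-PERSON', 'SUBJ-PERSON', 'was', 'born', 'in', 'OBJ-LOCATION']
```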
Hi Yuhao,
Is there an annotation guideline for the TACRED dataset? I'm curious about how you define the object entity types and relations.
Thank you in advance.
I am trying to work with a different dataset. I can understand the other keys in the dictionary, but I am clueless about stanford_head. Could you suggest how to generate it?
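For reference, the `stanford_head` field appears to follow the common CoNLL convention (an assumption worth double-checking against the data): for each token, it stores the 1-based index of that token's dependency head, with 0 marking the root. Any dependency parser (Stanford CoreNLP, or its Python successor stanza) can produce it. A hand-built toy illustration of the encoding:

```python
sentence = ["She", "lives", "in", "Paris"]
# Toy parse: "lives" is the root; "She" and "in" attach to "lives";
# "Paris" attaches to "in". (Illustrative, not a real parser output.)
deps = {"She": "lives", "lives": None, "in": "lives", "Paris": "in"}
stanford_head = [0 if deps[w] is None else sentence.index(deps[w]) + 1
                 for w in sentence]
print(stanford_head)  # [2, 0, 2, 3]
```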
Hi,
I followed your instructions, but I cannot reach the 0.65 F1 score.
Final Score:
Precision (micro): 62.869%
Recall (micro): 55.759%
F1 (micro): 59.101%
Evaluation ended.
In the original paper, you conduct an experiment on the SemEval 2010 Task 8 dataset, but I don't see that experiment in this code. Could you share your experiment code for this dataset? Thank you!
tacred-relation/data/loader.py
Line 83 in 18221ef
Why do RNN operations become easier when the batch is sorted by length in descending order?
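The usual reason: older PyTorch versions of `torch.nn.utils.rnn.pack_padded_sequence` required lengths sorted in descending order (newer versions accept `enforce_sorted=False`). With a descending sort, the sequences still active at every timestep form a contiguous prefix of the batch, so the RNN can run on a shrinking dense sub-batch instead of masking padded positions. A pure-Python sketch of the per-timestep batch sizes a packed sequence stores:

```python
# Lengths must be non-increasing after the sort.
lengths = [5, 3, 2]
# batch_sizes[t] = how many sequences are still active at timestep t;
# sorting guarantees these are always the first batch_sizes[t] rows.
batch_sizes = [sum(1 for L in lengths if L > t) for t in range(max(lengths))]
print(batch_sizes)  # [3, 3, 2, 1, 1]
```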
I keep getting this error whenever I run the training script and I have no idea how to fix it. This is what the exact error looks like:
Traceback (most recent call last):
File "train.py", line 117, in <module>
loss = model.update(batch)
File "rnn.py", line 50, in update
logits, _ = self.model(inputs)
File "/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "rnn.py", line 185, in forward
inputs = self.drop(torch.cat(inputs, dim=2)) # add dropout to input
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 66 and 67 in dimension 1 at /Users/distiller/project/conda/conda-bld/pytorch_1556653464916/work/aten/src/TH/generic/THTensor.cpp:711
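The concatenation in forward() joins several per-token feature tensors along the last dimension, so they must all agree on the sequence length (dimension 1). A common cause of "Got 66 and 67 in dimension 1" is two feature streams padded to different lengths for the same batch. A numpy illustration of the constraint (the shapes are illustrative):

```python
import numpy as np

words = np.zeros((50, 66, 300))   # (batch, seq_len, emb_dim)
pos   = np.zeros((50, 67, 30))    # padded one token longer -> mismatch
try:
    np.concatenate([words, pos], axis=2)
    ok = True
except ValueError:
    ok = False
    print("seq_len mismatch in dim 1:", words.shape[1], "vs", pos.shape[1])
```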