
two-are-better-than-one's Introduction

README

Code associated with the paper Two are Better Than One: Joint Entity and Relation Extraction with Table-Sequence Encoders, at EMNLP 2020

If you find this code useful in your research, please consider citing:

@inproceedings{wang2020two,
  title={Two Are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders},
  author={Wang, Jue and Lu, Wei},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={1706--1721},
  year={2020}
}

Navi

Resources

The datasets are available in "./datasets/". Due to copyright issues, we cannot directly release the ACE datasets; instead, their pre-processing scripts are provided in "./datasets/".

The word vectors for each dataset are included in "./wv/".

The contextualized word embeddings and attention weights are not included (they are too large). We use the "transformers" and "flair" libraries to generate them locally. Please refer to "./gens/gen_*.py".
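
For reference, the sketch below shows a minimal flair-based embedding pass (flair 0.4.x API). It is illustrative only: the repo's actual generation logic lives in "./gens/gen_*.py", and the checkpoint name here is just an example.

    # Illustrative only; not the repo's generation script.
    from flair.data import Sentence
    from flair.embeddings import BertEmbeddings

    embedding = BertEmbeddings("bert-base-cased")           # example checkpoint
    sentence = Sentence("Leonardo da Vinci was born in Italy .")
    embedding.embed(sentence)                                # attaches embeddings to each token
    for token in sentence:
        print(token.text, tuple(token.embedding.shape))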

Model Related

The model is defined in "./models/joint_models.py".

The basic layers are defined in "./layers/". Specifically, the MD-RNN (in particular its GRU variant) is defined in "./layers/encodings/mdrnns/gru.py".
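
To give a rough idea of a multi-dimensional GRU, the toy cell below updates each table cell (i, j) from its input and the hidden states of its left and upper neighbours. It is a simplified sketch, not the implementation in "./layers/encodings/mdrnns/gru.py".

    import torch
    import torch.nn as nn

    class ToyMDGRUCell(nn.Module):
        """Simplified 2D GRU-style cell, for illustration only."""
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.gates = nn.Linear(input_dim + 2 * hidden_dim, 3 * hidden_dim)
            self.cand = nn.Linear(input_dim + 2 * hidden_dim, hidden_dim)

        def forward(self, x, h_left, h_up):
            # x: input of cell (i, j); h_left, h_up: hidden states of (i, j-1) and (i-1, j)
            ctx = torch.cat([x, h_left, h_up], dim=-1)
            z, r_left, r_up = torch.sigmoid(self.gates(ctx)).chunk(3, dim=-1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r_left * h_left, r_up * h_up], dim=-1)))
            h_prev = 0.5 * (h_left + h_up)    # naive mixing of the two predecessor states
            return (1 - z) * h_prev + z * h_tilde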

The metric-related code is located in "./data/joint_data.py". The micro-averaged F1 is calculated in JointTrainer._get_metrics; the macro-averaged F1 is calculated in JointTrainerMacroF1._get_metrics.
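
For clarity, the difference between the two averaging schemes can be sketched as follows (illustrative only; not the repo's exact code):

    # Micro F1 pools (tp, fp, fn) counts over all classes; macro F1 averages per-class F1.
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def micro_f1(counts):
        tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
        return f1(tp, fp, fn)

    def macro_f1(counts):
        return sum(f1(*c) for c in counts.values()) / len(counts)

    counts = {'PER': (50, 10, 5), 'ORG': (20, 5, 10), 'LOC': (5, 2, 8)}  # toy counts
    print(micro_f1(counts), macro_f1(counts))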

Dependencies

  • python3
  • pytorch 1.4.0
  • transformers 2.9.1
  • flair 0.4.5
  • gpustat

Quick Start (Training)

  1. Generate contextualized embeddings (and attention weights)

    python gens/gen_bert.py \
        --model albert-xxlarge-v1 \
        --dataset ACE05 \
        --save_attention 1 \
        --save_path ./wv/albert.ace05_with_heads.pkl

    where

    • for BERT or ALBERT checkpoints, use "gens/gen_bert.py"; for RoBERTa, use "gens/gen_roberta.py";
    • "--model" is the checkpoint name;
    • "--dataset" is the name of the dataset for which contextualized embeddings are prepared;
    • "--save_attention" controls whether attention weights are saved; note that enabling it makes the output file very large;
    • "--save_path" is the path the features are saved to (a snippet for inspecting the saved file is given after step 2).
  2. train model!

    python -u train.py \
        --num_layers 3 \
        --batch_size 24 \
        --evaluate_interval 500 \
        --dataset ACE05 \
        --pretrained_wv ./wv/glove.6B.100d.ace05.txt  \
        --max_epoches 5000 \
        --max_steps 20000 \
        --model_class JointModel  \
        --model_write_ckpt ./ckpts/my_model \
        --crf None \
        --optimizer adam \
        --lr 0.001 \
        --tag_form iob2 \
        --cased 0 \
        --token_emb_dim 100 \
        --char_emb_dim 30 \
        --char_encoder lstm \
        --lm_emb_dim 4096 \
        --head_emb_dim 768 \
        --lm_emb_path ./wv/albert.ace05_with_heads.pkl \
        --hidden_dim 200 \
        --ner_tag_vocab_size 32 \
        --re_tag_vocab_size 64 \
        --vocab_size 15000 \
        --dropout 0.5 \
        --grad_period 1

    The default parameters above can be used to reproduce the results reported in the paper on ACE05.
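
Before training, it can be useful to check what the generated embedding file contains, e.g. to confirm the values for "--lm_emb_dim" and "--head_emb_dim". The exact structure of the pickle is not documented here, so the snippet below simply walks the loaded object and reports any array shapes it finds:

    import pickle
    import numpy as np
    import torch

    def describe(obj, prefix="root", depth=0, max_depth=3):
        # Print the shapes of array-like objects nested inside `obj`.
        if depth > max_depth:
            return
        if isinstance(obj, (np.ndarray, torch.Tensor)):
            print(f"{prefix}: {tuple(obj.shape)}")
        elif isinstance(obj, dict):
            for k, v in list(obj.items())[:3]:        # sample a few keys
                describe(v, f"{prefix}[{k!r}]", depth + 1, max_depth)
        elif isinstance(obj, (list, tuple)) and obj:
            describe(obj[0], f"{prefix}[0]", depth + 1, max_depth)

    with open("./wv/albert.ace05_with_heads.pkl", "rb") as f:
        describe(pickle.load(f))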

Quick Start (Inference on custom text)

Please refer to the inference notebook. Note: it is not rigorously tested and may be buggy. If you encounter any issues, don't hesitate to report them.

Training Arguments

In this section, we illustrate arguments for "train.py".

Dataset ("--dataset") needs to be specified before training. The preset data sets are:

  • "ACE04_{i}" (where {i} = 0,...,4, for 5-fold cross validation)
  • "ACE05"
  • "CoNLL04"
  • "ADE_{i}" (where {i} = 0,...,9, for 10-fold cross validation)

For each dataset, we prepare a subset of GloVe word vectors, which should be specified in "--pretrained_wv":

  • "./wv/glove.6B.100d.ace04.txt"
  • "./wv/glove.6B.100d.ace05.txt"
  • "./wv/glove.6B.100d.conll04.txt"
  • "./wv/glove.6B.100d.ade.txt"

Each word is mapped to a 100d vector ("--token_emb_dim 100").

Using contextualized word embeddings (and attention weights) is recommended. The path of the pre-computed contextualized word embeddings (and attention weights) needs to be specified via "--lm_emb_path".

"--lm_emb_dim" should be the exact emb size of vecters stored in "lm_emb_path", set it to 0 when you do not want use contextualized embeddings;

"head_emb_dim" is the size of attention weights of language model, it should be exactly equals to (number of layers of language model * number of heads of language model), where the language model is what stored in "lm_emb_path", set it to 0 when you do not want use attention weights.

"--grad_period" is tricky, the optimizer will accumulate gradients for "grad_period" training steps before updating the parameters. The memory is not a problem, it should be set to 1; otherwises, reduce the batch size, and try to set grad_period to 2 or 3 to simulate big batch size.

Possible Problems

If something goes wrong, we suggest checking the following items:

  • "ner_tag_vocab_size" should be larger than the number of entity tags (2 * number of entity classes + 1);
  • "re_tag_vocab_size" should be larger than the number of relation tags (2 * number of relation classes + 1);
  • "vocab_size" should be larger than the vocab size in "pretrained_wv";
  • "token_emb_dim" should be the exact emb size of vecters stored in "pretrained_wv";
  • "lm_emb_dim" should be the exact emb size of vecters stored in "lm_emb_path", set it to 0 when you do not want use contextualized embeddings;
  • "head_emb_dim" is the size of attention weights of language model, it should be exactly equals to (number of layers of language model * number of heads of language model), where the language model is what stored in "lm_emb_path", set it to 0 when you do not want use attention weights.

This software does not optimize memory usage, so OOM is likely to occur if you are not using a server designed for deep learning. Normally, a GPU with 32GB of memory is required to run the default setting. Some suggestions for the OOM case:

  • reduce the batch size :)
  • reduce the hidden dim :)
  • "grad_period" is used to perform "Gradient Accumulation", so batch_size * grad_period is the effective batch size. However, the training time will be grad_period times longer than usual.

Examples

ALBERT($+x^{\ell}+T^{\ell}$), 24 batch size, 3 layers, ACE05. (requires 32G memory)

python -u train.py \
    --num_layers 3 \
    --batch_size 24 \
    --evaluate_interval 500 \
    --dataset ACE05 \
    --pretrained_wv ./wv/glove.6B.100d.ace05.txt  \
    --max_epoches 5000 \
    --max_steps 20000 \
    --model_class JointModel  \
    --model_write_ckpt ./ckpts/my_model \
    --crf None \
    --optimizer adam \
    --lr 0.001 \
    --tag_form iob2 \
    --cased 0 \
    --token_emb_dim 100 \
    --char_emb_dim 30 \
    --char_encoder lstm \
    --lm_emb_dim 4096 \
    --head_emb_dim 768 \
    --lm_emb_path ./wv/albert.ace05_with_heads.pkl \
    --hidden_dim 200 \
    --ner_tag_vocab_size 32 \
    --re_tag_vocab_size 64 \
    --vocab_size 15000 \
    --dropout 0.5 \
    --grad_period 1

ALBERT($+x^{\ell}$), 24 batch size, 3 layers, ACE05. (requires 32G memory)

python -u train.py \
    --num_layers 3 \
    --batch_size 24 \
    --evaluate_interval 500 \
    --dataset ACE05 \
    --pretrained_wv ./wv/glove.6B.100d.ace05.txt  \
    --max_epoches 5000 \
    --max_steps 20000 \
    --model_class JointModel  \
    --model_write_ckpt ./ckpts/my_model \
    --crf None \
    --optimizer adam \
    --lr 0.001 \
    --tag_form iob2 \
    --cased 0 \
    --token_emb_dim 100 \
    --char_emb_dim 30 \
    --char_encoder lstm \
    --lm_emb_dim 4096 \
    --head_emb_dim 0 \
    --lm_emb_path ./wv/albert.ace05.pkl \
    --hidden_dim 200 \
    --ner_tag_vocab_size 32 \
    --re_tag_vocab_size 64 \
    --vocab_size 15000 \
    --dropout 0.5 \
    --grad_period 1

ALBERT($+x^{\ell}+T^{\ell}$), 24 batch size, 2 layers, ACE05. (requires 24G memory)

python -u train.py \
    --num_layers 2 \
    --batch_size 24 \
    --evaluate_interval 500 \
    --dataset ACE05 \
    --pretrained_wv ./wv/glove.6B.100d.ace05.txt  \
    --max_epoches 5000 \
    --max_steps 20000 \
    --model_class JointModel  \
    --model_write_ckpt ./ckpts/my_model \
    --crf None \
    --optimizer adam \
    --lr 0.001 \
    --tag_form iob2 \
    --cased 0 \
    --token_emb_dim 100 \
    --char_emb_dim 30 \
    --char_encoder lstm \
    --lm_emb_dim 4096 \
    --head_emb_dim 768 \
    --lm_emb_path ./wv/albert.ace05_with_heads.pkl \
    --hidden_dim 200 \
    --ner_tag_vocab_size 32 \
    --re_tag_vocab_size 64 \
    --vocab_size 15000 \
    --dropout 0.5 \
    --grad_period 1

ALBERT($+x^{\ell}+T^{\ell}$), (12 batch size * 2 grad period = 24 effective batch size), 2 layers, ACE05. (requires 11G memory)

python -u train.py \
    --num_layers 2 \
    --batch_size 12 \
    --evaluate_interval 1000 \
    --dataset ACE05 \
    --pretrained_wv ./wv/glove.6B.100d.ace05.txt  \
    --max_epoches 5000 \
    --max_steps 20000 \
    --model_class JointModel  \
    --model_write_ckpt ./ckpts/my_model \
    --crf None \
    --optimizer adam \
    --lr 0.001 \
    --tag_form iob2 \
    --cased 0 \
    --token_emb_dim 100 \
    --char_emb_dim 30 \
    --char_encoder lstm \
    --lm_emb_dim 4096 \
    --head_emb_dim 768 \
    --lm_emb_path ./wv/albert.ace05_with_heads.pkl \
    --hidden_dim 200 \
    --ner_tag_vocab_size 32 \
    --re_tag_vocab_size 64 \
    --vocab_size 15000 \
    --dropout 0.5 \
    --grad_period 2


two-are-better-than-one's Issues

dataset

Hi, thanks for your work.
Is the code available now?
Or does the dataset need any special processing?

On the evaluation of ACE

Following Li and Ji (2014) and Miwa and Bansal (2016), we use head spans for entities in ACE, and we keep the full mention boundary for the other corpora.

In Li and Ji (2014),

An entity mention is considered correct if its entity type is correct and the offsets of its mention head are correct. A relation mention is considered correct if its relation type is correct, and the head offsets of two entity mention arguments are both correct.

It seems that the evaluation code in this repo considers the full mention boundary instead of head spans. How did you get the results on the ACE datasets?

ACE05data

Can I get the format of the sample data, please? I want to train an ACE05 model, but I encountered "java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLP text/*.split.txt" when I run run.zsh. For example, is the format of the sample data a JSON list whose keys are tokens, sentence, and so on?

ckpts

Hello, I met this problem while running the code:
"[Errno 2] No such file or directory: './ckpts/my_model.pt'".
I don't know how to solve it. Looking forward to hearing from you.

Error when pre-processing the dataset

Hello, and thanks for sharing.
When I follow the documentation and use the ACE pre-processing script ace2005/run.zsh to process the official ACE2005 dataset (ldc2006t06), I always get a missing-file error, as shown in the screenshot.
(screenshot: 截图录屏_deepin-terminal_20210222164526)

Strangely, the file AFP_ENG_20030304.0250.txt can be found under the ace2005/corpus/test directory. Is this caused by corrupted source data, or is there some other reason?

shared parameters

Hello, I would like to study how sharing (or not sharing) parameters between the two encoders affects the experimental results, but I don't know how to run this. Could you help me? Looking forward to your reply.

how to train it with cpu.

I got the error below. I believe it may be caused by GPU unavailability. Can you tell me how to train with the CPU?

site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/site-packages/pynvml.py", line 644, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 197, in <module>
    gpu_idx, gpu_mem = set_max_available_gpu()
  File "/home/projects/information_extraction/two-are-better-than-one/utils/cuda.py", line 32, in set_max_available_gpu
    gpu_idx, gpu_mem = get_max_available_gpu()
  File "/home/projects/information_extraction/two-are-better-than-one/utils/cuda.py", line 26, in get_max_available_gpu
    gpu_available_memory_list = get_available_gpu_memory_list()
  File "/home/projects/information_extraction/two-are-better-than-one/utils/cuda.py", line 19, in get_available_gpu_memory_list
    ret = gpustat.new_query()
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/site-packages/gpustat/core.py", line 510, in new_query
    return GPUStatCollection.new_query()
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/site-packages/gpustat/core.py", line 281, in new_query
    N.nvmlInit()
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/site-packages/pynvml.py", line 608, in nvmlInit
    _LoadNvmlLibrary()
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/site-packages/pynvml.py", line 646, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "/root/anaconda3/envs/env12_twoBetterOne/lib/python3.7/site-packages/pynvml.py", line 310, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found

Inference

Can you please tell me how I can perform inference and get (entity, relation, entity) triples as output?

obtain RE(gold) for other datasets

Hi,
I just wonder how to obtain results of RE(gold) in Table 3 of your paper for other datasets (like CoNLL04 and ADE). Could you help me with this? Thank you in advance.

Custom dataset - change from MD-RNN to CNN

Hello! I am trying your model on my tiny dataset. Would it be possible for you to publish or send me the model with CNN instead of MD-RNN? Since CNN is enough for my small dataset, I would like to reduce the complexity of the model.

Issue with running the code due to dimension mismatch

I have generated the contextual embeddings with the command below:

    python gens/gen_bert.py
        --model_name bert-base-multilingual-cased
        --dataset
        --save_path ./wv/

But when running the training script with the command below, I am facing the issue:

    python -u train.py
        --num_layers 3
        --batch_size 24
        --evaluate_interval 500
        --dataset
        --max_epoches 5000
        --max_steps 20000
        --model_class JointModel
        --model_write_ckpt ./ckpts/my_model
        --crf None
        --optimizer adam
        --lr 0.001
        --tag_form iob2
        --cased 0
        --token_emb_dim 100
        --char_emb_dim 30
        --char_encoder lstm
        --lm_emb_dim 4096
        --head_emb_dim 768
        --lm_emb_path ./wv/
        --hidden_dim 200
        --ner_tag_vocab_size <2*ner_tag+2>
        --re_tag_vocab_size <2*re_tag+2>
        --vocab_size 15000
        --dropout 0.5
        --grad_period 1

Can you tell me what the correct parameter values should be to run the code with custom data and contextualized embeddings?

allennlp version

The code uses allennlp, but it is not listed in the readme.md. Installing the latest version of allennlp forces pytorch 1.7 to be installed, while the readme.md requires pytorch 1.4. So which version of allennlp should be used?

Training on a custom dataset

Hello, I have recently been reading your paper and would like to apply its method to my own dataset. If I want to use my own dataset as input to this model, what is the required data format?
Thanks!

Question about ACE2004

Hello,

This work is very interesting!
However, I cannot find the code for how you do the k-fold cross-validation on ACE2004.

Siheng

custom input text

Hi, I wonder if there is any code for running a trained model on custom input text, e.g. a plain txt file?

epoch

Hello, when I study and run your code, the output only shows steps:
"g_step 2000, step 152, avg_time 0.874, loss:51.5502"
I don't know how epochs are used in this code, and I have difficulty understanding the relationship between epochs and steps and how each is defined here. Looking forward to your answer. Thank you very much.

ACE05 dataset not in unified?

Just trying to run with the default parameters provided by the readme.md, and it's crashing with this error:

File "gens/gen_bert.py", line 257, in <module>
    with open(f'./datasets/unified/train.{flag}.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: './datasets/unified/train.ACE05.json'

should I try a different dataset?

Baseline training question

I see that the training arguments include a crf option. To add a CRF on top of the existing model, is it enough to set the crf argument to the corresponding CRF model file?

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

I got a problem when I ran the following (using my own data):

python gens/gen_bert.py \
    --model albert-xxlarge-v1 \
    --dataset ACE05 \
    --save_attention 1 \
    --save_path ./wv/albert.ace05_with_heads.pkl

error:
98% 50/51 [00:40<00:00, 1.24it/s]
Traceback (most recent call last):
  File "gens/gen_bert.py", line 286, in <module>
    embedding.embed(s)
  File "/usr/local/lib/python3.7/dist-packages/flair/embeddings.py", line 96, in embed
    self._add_embeddings_internal(sentences)
  File "gens/gen_bert.py", line 138, in _add_embeddings_internal
    tmp = self.model(all_input_ids, attention_mask=all_input_masks)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_albert.py", line 558, in forward
    input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_bert.py", line 174, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

Any way to solve it? Thanks!

data

Hello, to help with studying the code, could you provide some sample data?

arguments

Hello, when I study and run your code, I have difficulty understanding the relationship between epochs and steps and how each is defined, and I don't know how epochs are used in this code. Would you please explain? Thank you very much.

Applying the model to Chinese

Hello! After reading your paper, I am wondering whether the input part of the model (S0 in the paper) can be replaced with a Chinese word embedding model, so that the model can directly handle Chinese relation extraction.

AllenNLP version

Hi, can you tell me which version of AllenNLP you are using in this project? When I install the newest version of AllenNLP, it keeps overriding the pytorch and transformers versions. Thanks.

runtime error

=== start training ===
warm up: learning rate was adjusted to 1e-06
Traceback (most recent call last):
  File "train.py", line 249, in <module>
    trainer.train_model(args=args)
  File "/mnt/lustre02/jiangsu/aispeech/home/yfy19/remote/knowledge_factory/two-are-better-than-one/data/base.py", line 73, in train_model
    loss = model.train_step(batch)['loss'].detach().cpu().numpy()
  File "/mnt/lustre02/jiangsu/aispeech/home/yfy19/remote/knowledge_factory/two-are-better-than-one/models/base.py", line 110, in hooked_train_step
    rets = self._train_step(inputs)
  File "/mnt/lustre02/jiangsu/aispeech/home/yfy19/remote/knowledge_factory/two-are-better-than-one/models/joint_models.py", line 516, in train_step
    loss.backward()
  File "/mnt/lustre02/jiangsu/aispeech/home/yfy19/.conda/envs/torch_env_clone/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/lustre02/jiangsu/aispeech/home/yfy19/.conda/envs/torch_env_clone/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 2)
