
Global-to-Local Neural Networks for Document-Level Relation Extraction

Contributions Welcome License language-python3 made-with-Pytorch

Relation extraction (RE) aims to identify the semantic relations between named entities in text. Recent years have seen the task raised to the document level, which requires complex reasoning over entities and mentions throughout an entire document. In this paper, we propose a novel model for document-level RE that encodes document information in terms of entity global representations, entity local representations, and context relation representations. Entity global representations model the semantic information of all entities in the document, entity local representations aggregate the contextual information of the multiple mentions of a specific entity, and context relation representations encode the topic information of other relations. Experimental results demonstrate that our model achieves superior performance on two public datasets for document-level RE. It is particularly effective at extracting relations between entities that are far apart in the document and entities that have multiple mentions.

Getting Started

Package Description

GLRE/
├── configs/
│   ├── cdr_basebert.yaml: config file for the CDR dataset under the "Train" setting
│   ├── cdr_basebert_train+dev.yaml: config file for the CDR dataset under the "Train+Dev" setting
│   └── docred_basebert.yaml: config file for the DocRED dataset under the "Train" setting
├── data/: raw and preprocessed data for the CDR and DocRED datasets
│   ├── CDR/
│   └── DocRED/
├── data_processing/: data preprocessing scripts
├── results/: pre-trained models and results
├── scripts/: running scripts
└── src/
    ├── data/: reads data and converts it to batches
    ├── models/: core modules implementing GLRE
    ├── nnet/: sub-layers implementing GLRE
    ├── utils/: utility functions
    └── main.py

Dependencies

  • python (>=3.6)
  • pytorch (>=1.5)
  • numpy (>=1.13.3)
  • recordtype (>=1.3)
  • yamlordereddictloader (>=0.4.0)
  • tabulate (>=0.8.7)
  • transformers (>=2.8.0)
  • scipy (>=1.4.1)
  • scikit-learn (>=0.22.1)
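
The list above can be mirrored in a requirements file (a hypothetical requirements.txt, not shipped with the repository; note that PyTorch's pip package is named torch, and the Python version itself cannot be pinned here):

```
torch>=1.5
numpy>=1.13.3
recordtype>=1.3
yamlordereddictloader>=0.4.0
tabulate>=0.8.7
transformers>=2.8.0
scipy>=1.4.1
scikit-learn>=0.22.1
```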

Usage

Datasets & Pre-processing

The datasets include CDR and DocRED, located in the data/CDR and data/DocRED directories, respectively. The pre-processing scripts are in the data_processing directory, and the pre-processing results go to the data/CDR/processed and data/DocRED/processed directories, respectively. The pre-trained models are in the results directory.

Specifically, we pre-processed the CDR dataset following the edge-oriented graph (EoG) preprocessing pipeline:

Download the GENIA Tagger and Sentence Splitter:
$ cd data_processing
$ mkdir common && cd common
$ wget http://www.nactem.ac.uk/y-matsu/geniass/geniass-1.00.tar.gz && tar xvzf geniass-1.00.tar.gz
$ cd geniass/ && make && cd ..
$ git clone https://github.com/bornabesic/genia-tagger-py.git
$ cd genia-tagger-py 

Here, you should modify the Makefile inside genia-tagger-py and replace line 3 with `wget http://www.nactem.ac.uk/GENIA/tagger/geniatagger-3.0.2.tar.gz`
$ make
$ cd ../../

To process the datasets, they must first be transformed into the PubTator format. Then run the processing script as follows:
$ sh process_cdr.sh
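
For reference, PubTator input is plain text: title and abstract lines keyed by PMID, followed by tab-separated entity annotations (character offsets, mention text, type, MeSH ID) and, for CDR, CID relation lines. A schematic example (the PMID, offsets, and IDs below are illustrative, not real data):

```
2234245|t|Sample document title
2234245|a|Sample abstract mentioning naloxone and hypertension.
2234245	27	35	naloxone	Chemical	D009270
2234245	40	52	hypertension	Disease	D006973
2234245	CID	D009270	D006973
```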

Then, please use the following code to preprocess the DocRED dataset:

python docRedProcess.py --input_file ../data/DocRED/train_annotated.json \
                        --output_file ../data/DocRED/processed/train_annotated.data
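
docRedProcess.py consumes DocRED's native JSON, where each document carries title, sents (tokenized sentences), vertexSet (one entry per entity, grouping all of its mentions), and labels (relation facts between entity indices). A minimal sketch of walking one document; the sample record here is fabricated for illustration:

```python
import json

# A minimal DocRED-style record (fabricated for illustration).
doc_json = json.dumps([{
    "title": "Sample",
    "sents": [["Alice", "works", "at", "Acme", "."]],
    "vertexSet": [
        [{"name": "Alice", "sent_id": 0, "pos": [0, 1], "type": "PER"}],
        [{"name": "Acme", "sent_id": 0, "pos": [3, 4], "type": "ORG"}],
    ],
    "labels": [{"h": 0, "t": 1, "r": "P108", "evidence": [0]}],
}])

docs = json.loads(doc_json)
for doc in docs:
    # Each vertexSet entry groups all mentions of one entity.
    entities = [[m["name"] for m in entity] for entity in doc["vertexSet"]]
    # Each label is a (head entity, relation, tail entity) triple.
    triples = [(entities[l["h"]][0], l["r"], entities[l["t"]][0])
               for l in doc["labels"]]

print(triples)  # [('Alice', 'P108', 'Acme')]
```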

Train & Test

First, you should download biobert_base and bert_base from figshare and place them in the GLRE directory.

The default hyper-parameters are in the configs directory, and the train & test scripts are in the scripts directory. The run_cdr_train+dev.py script corresponds to the CDR dataset under the "Train+Dev" setting.

python scripts/run_cdr.py
python scripts/run_cdr_train+dev.py
python scripts/run_docred.py
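
Each run script reads one of the YAML files in configs/. The keys below are purely illustrative (check the shipped files for the actual names); the point is that dataset paths, encoder choice, and training hyper-parameters all live in one place:

```
# Hypothetical sketch of a config; not the shipped cdr_basebert.yaml.
dataset: CDR
train_data: data/CDR/processed/train.data
test_data: data/CDR/processed/test.data
pretrained_lm: biobert_base
batch_size: 4
epochs: 50
learning_rate: 0.0005
```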

Evaluation

For CDR, you can evaluate the results using the evaluation script as follows:

python utils/evaluate_cdr.py --gold ../data/CDR/processed/test.gold --pred ../results/cdr-dev/cdr_basebert_full/test.preds --label 1:CID:2
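
evaluate_cdr.py compares predicted CID pairs against the gold file. The metric reported is standard micro precision/recall/F1 over (document, chemical, disease) triples; a self-contained sketch of that computation follows (this is not the repository's script, and the IDs below are toy data):

```python
# Micro P/R/F1 over predicted vs. gold relation triples of the form
# (doc_id, chemical_id, disease_id). Toy data for illustration.
gold = {("d1", "D009270", "D006973"),
        ("d1", "D008750", "D006973"),
        ("d2", "D014635", "D003920")}
pred = {("d1", "D009270", "D006973"),
        ("d2", "D014635", "D003920"),
        ("d2", "D014635", "D001943")}

tp = len(gold & pred)  # triples predicted correctly
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```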

For DocRED, you can submit result.json to CodaLab for official evaluation.

License

This project is licensed under the GPL License - see the LICENSE file for details.

Citation

If you use this work or code, please cite the following paper:

@inproceedings{GLRE,
 author = {Difeng Wang and Wei Hu and Ermei Cao and Weijian Sun},
 title = {Global-to-Local Neural Networks for Document-Level Relation Extraction},
 booktitle = {EMNLP},
 year = {2020},
}

Contacts

If you have any questions, please feel free to contact Difeng Wang; we will reply as soon as possible.

Contributors

slzbywdf and whu2015

Issues

A question about Eq. (7) in the paper

Hello, in Eq. (7), which computes the context relation representation, what does Or' denote? The paper does not seem to define it.

About data preprocessing

Hello, I hit an error during data preprocessing: a file is missing. Following the path given in data_processing/tools.py, the directory GLRE-master\data_processing\common\geniass is indeed missing a file, but what should that file be?
The error is as follows:
Traceback (most recent call last):
  File "process.py", line 118, in <module>
    main()
  File "process.py", line 57, in main
    split_sents = sentence_split_genia(orig_sentences)
  File "/mnt/d/code/GLRE-master/data_processing/tools.py", line 244, in sentence_split_genia
    with open('temp_file.split.txt', 'r') as ifile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'

Looking forward to your reply!

F-score computation

Hello, I read through your code but could not find where Inter-F and Intra-F are computed.
Could you share the code and method used to compute these two values?
Thanks!

About building the graph with a pre-trained model as the encoder

Hello, I would like to ask two questions.

1. When the model uses a pre-trained language model as the encoder, words are split into subwords, so the entity positions annotated in the original data shift. dataset.py does not seem to handle this case and keeps using the entity positions annotated in the dataset.

2. Also, I see that the code truncates inputs to 512 tokens when using BERT. Are the segments beyond 512 tokens simply discarded? The paper seems to say the document is encoded in several segments. If I have misunderstood, could you point out where this is handled?

About compute resources

Hello, thanks for your excellent work!
How many GPUs did your experiments use, and how long did training take?
Looking forward to your reply!

About python scripts/run_cdr.py

(screenshot omitted)
When running this script, the line denom = torch.sparse.sum(adj[batch, i], dim=1).to_dense() raises an error:
RuntimeError: sparse tensors do not have strides. What causes this, and how should it be fixed? Thanks!
