
Global-to-Local Neural Networks for Document-Level Relation Extraction

Contributions Welcome License language-python3 made-with-Pytorch

Relation extraction (RE) aims to identify the semantic relations between named entities in text. Recent years have seen the task raised to the document level, which requires complex reasoning over entities and mentions throughout an entire document. In this paper, we propose a novel model for document-level RE that encodes document information in terms of entity global representations, entity local representations, and context relation representations. Entity global representations model the semantic information of all entities in the document, entity local representations aggregate the contextual information of the multiple mentions of a specific entity, and context relation representations encode the topic information of other relations. Experimental results demonstrate that our model achieves superior performance on two public datasets for document-level RE. It is particularly effective at extracting relations between entities that are far apart in the document and entities that have multiple mentions.

Getting Started

Package Description

GLRE/
├── configs/
│   ├── cdr_basebert.yaml: config file for the CDR dataset under the "Train" setting
│   ├── cdr_basebert_train+dev.yaml: config file for the CDR dataset under the "Train+Dev" setting
│   └── docred_basebert.yaml: config file for the DocRED dataset under the "Train" setting
├── data/: raw and preprocessed data for the CDR and DocRED datasets
│   ├── CDR/
│   └── DocRED/
├── data_processing/: data preprocessing scripts
├── results/: pre-trained models and results
├── scripts/: running scripts
└── src/
    ├── data/: reads data and converts it to batches
    ├── models/: core modules implementing GLRE
    ├── nnet/: sub-layers implementing GLRE
    ├── utils/: utility functions
    └── main.py

Dependencies

  • python (>=3.6)
  • pytorch (>=1.5)
  • numpy (>=1.13.3)
  • recordtype (>=1.3)
  • yamlordereddictloader (>=0.4.0)
  • tabulate (>=0.8.7)
  • transformers (>=2.8.0)
  • scipy (>=1.4.1)
  • scikit-learn (>=0.22.1)
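
The list above can be mirrored in a requirements file (a hypothetical requirements.txt, not shipped with the repository; note that PyTorch's pip package is named torch, and the Python version itself cannot be pinned here):

```
torch>=1.5
numpy>=1.13.3
recordtype>=1.3
yamlordereddictloader>=0.4.0
tabulate>=0.8.7
transformers>=2.8.0
scipy>=1.4.1
scikit-learn>=0.22.1
```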

Usage

Datasets & Pre-processing

The datasets include CDR and DocRED, located in the data/CDR and data/DocRED directories, respectively. The pre-processing scripts are in the data_processing directory, and the pre-processing results go to the data/CDR/processed and data/DocRED/processed directories, respectively. The pre-trained models are in the results directory.

Specifically, we pre-processed the CDR dataset following the edge-oriented graph (EoG) preprocessing pipeline:

Download the GENIA Tagger and Sentence Splitter:
$ cd data_processing
$ mkdir common && cd common
$ wget http://www.nactem.ac.uk/y-matsu/geniass/geniass-1.00.tar.gz && tar xvzf geniass-1.00.tar.gz
$ cd geniass/ && make && cd ..
$ git clone https://github.com/bornabesic/genia-tagger-py.git
$ cd genia-tagger-py 

Here, you should modify the Makefile inside genia-tagger-py and replace line 3 with `wget http://www.nactem.ac.uk/GENIA/tagger/geniatagger-3.0.2.tar.gz`
$ make
$ cd ../../

To process the datasets, they must first be transformed into the PubTator format. Then run the processing script as follows:
$ sh process_cdr.sh
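
For reference, PubTator input is plain text: title and abstract lines keyed by PMID, followed by tab-separated entity annotations (character offsets, mention text, type, MeSH ID) and, for CDR, CID relation lines. A schematic example (the PMID, offsets, and IDs below are illustrative, not real data):

```
2234245|t|Sample document title
2234245|a|Sample abstract mentioning naloxone and hypertension.
2234245	27	35	naloxone	Chemical	D009270
2234245	40	52	hypertension	Disease	D006973
2234245	CID	D009270	D006973
```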

Then, please use the following code to preprocess the DocRED dataset:

python docRedProcess.py --input_file ../data/DocRED/train_annotated.json \
                        --output_file ../data/DocRED/processed/train_annotated.data
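
docRedProcess.py consumes DocRED's native JSON, where each document carries title, sents (tokenized sentences), vertexSet (one entry per entity, grouping all of its mentions), and labels (relation facts between entity indices). A minimal sketch of walking one document; the sample record here is fabricated for illustration:

```python
import json

# A minimal DocRED-style record (fabricated for illustration).
doc_json = json.dumps([{
    "title": "Sample",
    "sents": [["Alice", "works", "at", "Acme", "."]],
    "vertexSet": [
        [{"name": "Alice", "sent_id": 0, "pos": [0, 1], "type": "PER"}],
        [{"name": "Acme", "sent_id": 0, "pos": [3, 4], "type": "ORG"}],
    ],
    "labels": [{"h": 0, "t": 1, "r": "P108", "evidence": [0]}],
}])

docs = json.loads(doc_json)
for doc in docs:
    # Each vertexSet entry groups all mentions of one entity.
    entities = [[m["name"] for m in entity] for entity in doc["vertexSet"]]
    # Each label is a (head entity, relation, tail entity) triple.
    triples = [(entities[l["h"]][0], l["r"], entities[l["t"]][0])
               for l in doc["labels"]]

print(triples)  # [('Alice', 'P108', 'Acme')]
```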

Train & Test

First, you should download biobert_base and bert_base from figshare and place them in the GLRE directory.

The default hyper-parameters are in the configs directory, and the train & test scripts are in the scripts directory. The run_cdr_train+dev.py script corresponds to the CDR dataset under the "Train+Dev" setting.

python scripts/run_cdr.py
python scripts/run_cdr_train+dev.py
python scripts/run_docred.py
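
Each run script reads one of the YAML files in configs/. The keys below are purely illustrative (check the shipped files for the actual names); the point is that dataset paths, encoder choice, and training hyper-parameters all live in one place:

```
# Hypothetical sketch of a config; not the shipped cdr_basebert.yaml.
dataset: CDR
train_data: data/CDR/processed/train.data
test_data: data/CDR/processed/test.data
pretrained_lm: biobert_base
batch_size: 4
epochs: 50
learning_rate: 0.0005
```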

Evaluation

For CDR, you can evaluate the results using the evaluation script as follows:

python utils/evaluate_cdr.py --gold ../data/CDR/processed/test.gold --pred ../results/cdr-dev/cdr_basebert_full/test.preds --label 1:CID:2
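
evaluate_cdr.py compares predicted CID pairs against the gold file. The metric reported is standard micro precision/recall/F1 over (document, chemical, disease) triples; a self-contained sketch of that computation follows (this is not the repository's script, and the IDs below are toy data):

```python
# Micro P/R/F1 over predicted vs. gold relation triples of the form
# (doc_id, chemical_id, disease_id). Toy data for illustration.
gold = {("d1", "D009270", "D006973"),
        ("d1", "D008750", "D006973"),
        ("d2", "D014635", "D003920")}
pred = {("d1", "D009270", "D006973"),
        ("d2", "D014635", "D003920"),
        ("d2", "D014635", "D001943")}

tp = len(gold & pred)  # triples predicted correctly
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```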

For DocRED, you can submit result.json to CodaLab for official evaluation.

License

This project is licensed under the GPL License - see the LICENSE file for details.

Citation

If you use this work or code, please cite the following paper:

@inproceedings{GLRE,
 author = {Difeng Wang and Wei Hu and Ermei Cao and Weijian Sun},
 title = {Global-to-Local Neural Networks for Document-Level Relation Extraction},
 booktitle = {EMNLP},
 year = {2020},
}

Contacts

If you have any questions, please feel free to contact Difeng Wang; we will reply as soon as possible.

Contributors

slzbywdf and whu2015

Issues

A question about Eq. (7) in the paper

Hello, in Eq. (7), which computes the context relation representation, what does Or' denote? The paper does not seem to define it.

About data preprocessing

Hello, I hit an error during data preprocessing: a file is missing. Following the path given in data_processing/tools.py, the directory GLRE-master\data_processing\common\geniass is indeed missing a file, but what should that file be?
The error is as follows:
Traceback (most recent call last):
  File "process.py", line 118, in <module>
    main()
  File "process.py", line 57, in main
    split_sents = sentence_split_genia(orig_sentences)
  File "/mnt/d/code/GLRE-master/data_processing/tools.py", line 244, in sentence_split_genia
    with open('temp_file.split.txt', 'r') as ifile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_file.split.txt'

Looking forward to your reply!

F-score computation

Hello, I read through your code but could not find where Inter-F and Intra-F are computed.
Could you share the code and method used to compute these two values?
Thanks!

About building the graph with a pre-trained model as the encoder

Hello, I would like to ask two questions.

1. When the model uses a pre-trained language model as the encoder, words are split into subwords, so the entity positions annotated in the original data shift. dataset.py does not seem to handle this case and keeps using the entity positions annotated in the dataset.

2. Also, I see that the code truncates inputs to 512 tokens when using BERT. Are the segments beyond 512 tokens simply discarded? The paper seems to say the document is encoded in several segments. If I have misunderstood, could you point out where this is handled?

About compute resources

Hello, thanks for your excellent work!
How many GPUs did your experiments use, and how long did training take?
Looking forward to your reply!

About python scripts/run_cdr.py

(screenshot omitted)
When running this script, the line denom = torch.sparse.sum(adj[batch, i], dim=1).to_dense() raises an error:
RuntimeError: sparse tensors do not have strides. What causes this, and how should it be fixed? Thanks!
