USC Distantly-supervised Relation Extraction System

This repository puts together recent models and data sets for sentence-level relation extraction using knowledge bases (i.e., distant supervision). In particular, it contains the source code for WWW'17 paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.

Please also check out our new repository on handling shifted label distribution in distant supervision.

Task: Given a text corpus with entity mentions detected and heuristically labeled using distant supervision, the task aims to identify relation types/labels between a pair of entity mentions based on the sentence context where they co-occur.
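To make the distant-supervision setup concrete, here is a toy sketch of how knowledge-base facts heuristically label co-occurring entity mentions in a sentence. The facts, names, and relation label below are purely illustrative; the real pipeline aligns Freebase facts to mentions detected by Stanford NER.

```python
# Toy distant-supervision labeler: every co-occurring mention pair gets the
# KB relation if one exists, otherwise "None". Illustrative only.
kb_facts = {("Obama", "USA"): "per:countries_of_residence"}

def ds_label(sentence, mentions, kb=kb_facts):
    """Return (e1, e2, label, sentence) tuples for all mention pairs."""
    labeled = []
    for i, e1 in enumerate(mentions):
        for e2 in mentions[i + 1:]:
            labeled.append((e1, e2, kb.get((e1, e2), "None"), sentence))
    return labeled

pairs = ds_label("Obama lived in the USA.", ["Obama", "USA"])
```

Labels produced this way are noisy by design, which is exactly the challenge the models in this repository address.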

Quick Start

Blog Posts

Data

For evaluation on sentence-level extraction, we processed three public datasets into our JSON format using our data pipeline. We ran Stanford NER on the training sets to detect entity mentions, mapped entity names to Freebase entities using DBpedia Spotlight, aligned Freebase facts to sentences, and assigned the entity types of the Freebase entities to their mapped names in the sentences:

  • PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from MESH ontology). (Download)

  • NYT-manual: 1.18M sentences sampled from 294K New York Times news articles which were then aligned with Freebase facts by (Riedel et al., ECML'10) (link to Riedel's data). For test set, 395 sentences are manually annotated with 24 relation types and 47 entity types (Hoffmann et al., ACL'11) (link to Hoffmann's data). (Download)

  • Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780k Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from 2013 KBP corpus. Test data consists of 14k system-labeled sentences from 2013 KBP slot filling assessment results. It has 7 relation types and 126 entity types after filtering of numeric value relations. (Download)

Please put the data files in the corresponding subdirectories under data/source.
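The downloaded files are line-delimited JSON (one sentence object per line). The exact schema is not documented in this README, so the field names in the sketch below (e.g. `sentText`, `relationMentions`) are assumptions; check your downloaded train.json/test.json for the actual keys.

```python
import json

def read_sentences(path):
    """Hypothetical reader for data/source/<Data>/train.json, assuming one
    JSON object per line. Field names below are assumptions, not documented."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Example using an in-memory line instead of a real file:
line = '{"sentText": "Obama lived in the USA.", "relationMentions": []}'
sent = json.loads(line)
```

Streaming line by line keeps memory bounded even for the 1M+ sentence corpora.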

Benchmark

Performance comparison with several relation extraction systems over KBP 2013 dataset (sentence-level extraction).

| Method | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Mintz (our implementation; Mintz et al., 2009) | 0.296 | 0.387 | 0.335 |
| LINE + Dist Sup (Tang et al., 2015) | 0.360 | 0.257 | 0.299 |
| MultiR (Hoffmann et al., 2011) | 0.325 | 0.278 | 0.301 |
| FCM + Dist Sup (Gormley et al., 2015) | 0.151 | 0.498 | 0.300 |
| HypeNet (our implementation; Shwartz et al., 2016) | 0.210 | 0.315 | 0.252 |
| CNN (our implementation; Zeng et al., 2014) | 0.198 | 0.334 | 0.242 |
| PCNN (our implementation; Zeng et al., 2015) | 0.220 | 0.452 | 0.295 |
| LSTM (our implementation) | 0.274 | 0.500 | 0.350 |
| Bi-GRU (our implementation) | 0.301 | 0.465 | 0.362 |
| SDP-LSTM (our implementation; Xu et al., 2015) | 0.300 | 0.436 | 0.356 |
| Position-Aware LSTM (Zhang et al., 2017) | 0.265 | 0.598 | 0.367 |
| CoType-RM (Ren et al., 2017) | 0.303 | 0.407 | 0.347 |
| CoType (Ren et al., 2017) | 0.348 | 0.406 | 0.369 |

Note: for models trained on sentences annotated with a single label (HypeNet, CNN/PCNN, LSTM, SDP/PA-LSTMs, Bi-GRU), we form one training instance for each sentence-label pair based on the DS-annotated data.

Usage

Dependencies

We use Ubuntu as an example.

  • python 2.7
  • Python library dependencies
$ pip install pexpect ujson tqdm
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip

We have included compiled binaries. If you need to re-compile retype.cpp, run the following under your own g++ environment:

$ cd code/Model/retype; make

Default Run

As an example, we show how to run CoType on the Wiki-KBP dataset

Start the Stanford corenlp server for the python wrapper.

$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

Feature extraction, embedding learning on training data, and evaluation on test data.

$ ./run.sh  

For relation classification, the "none"-labeled instances need to be removed from the train/test JSON files first. The hyperparameters for embedding learning are included in the run.sh script.

Parameters

  • Dataset to run on:

Data="KBP"

  • Hyperparameters for relation extraction:
    - KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
    - NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
    - BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0

Hyperparameters for relation classification are included in the run.sh script.

Evaluation

Evaluate relation extraction performance (precision, recall, F1): produce predictions along with their confidence scores, then filter the predicted instances by tuning the thresholds.

$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
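Conceptually, threshold tuning keeps only predictions whose confidence exceeds a cutoff and scores the survivors against the gold labels. The sketch below illustrates that idea only; the mention IDs, data layout, and numbers are made up and do not match the actual I/O of tune_threshold.py.

```python
# Illustrative precision/recall/F1 at a confidence threshold.
def prf1(predictions, gold, threshold):
    kept = {m: label for m, (label, conf) in predictions.items() if conf >= threshold}
    correct = sum(1 for m, label in kept.items() if gold.get(m) == label)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = {"m1": ("per:parents", 0.9), "m2": ("per:children", 0.3), "m3": ("per:religion", 0.8)}
gold = {"m1": "per:parents", "m3": "per:spouse"}
p, r, f = prf1(preds, gold, threshold=0.5)
```

Raising the threshold generally trades recall for precision, which is why the script sweeps a range of cutoffs and reports the best F1.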

In-text Prediction

The last command in run.sh generates a JSON file of the predicted results, in the same format as test.json in data/source/$DATANAME, except that only the predicted relation mention labels are output. Replace the second parameter with the threshold you would like.

$ python code/Evaluation/convertPredictionToJson.py $Data 0.0

Customized Run

Code for producing the JSON files from a raw corpus for running CoType and baseline models is here.

Baselines

You can find our implementations of several recent relation extraction models under the code/Model/ directory.

References

Contributors

  • Ellen Wu
  • Meng Qu
  • Frank Xu
  • Wenqi He
  • Maosen Zhang
  • Qinyuan Ye
  • Xiang Ren


usc-ds-relationextraction's Issues

NYT corpus: Issue with number of entity types

Hello,

In your paper you state that you have 47 entity types for the NYT corpus.
However, in the uploaded data as well as in the data generation code I find there are only 3 (person, organisation, location).
Do you consider these types as supertypes and include other types inside? If so, can you provide the correct mapping from Freebase?

Thanks in advance!

Applying CoType to Freebase

I found your work very interesting and I'd like to apply it to extract Freebase relations.
Could you give some hints on the code I should modify so that CoType can extract Freebase relations?

Thanks!

libgsl.so.19: No such file or directory

When I run the ./run.sh

I get this error:

Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory

Although I have installed gsl, it did not work.
Are there any suggestions? Thanks a lot!

Output for data/intermediate/KBP/em/type.txt is strange

The content of intermediate/KBP/em/type.txt is strange for the KBP corpus. I haven't tested on other corpora.

type.txt:

/       0       466411
p       1       177541
e       2       270192
r       3       288178
s       4       182380
o       5       480385
n       6       380277
l       7       159181
d       8       28569
i       9       303219
,       10      170403
m       11      42962
a       12      236368
c       13      196298
h       14      38605
t       15      312620
u       16      59252
g       17      59662
z       18      16877
_       19      16360
w       20      5846
k       21      2778
f       22      3852
y       23      52196
b       24      4285
v       25      21442
x       26      1

For relation mentions, type.txt looks fine:

None    0       111343
per:country_of_death    1       10265
per:country_of_birth    2       12040
per:parents     3       6984
per:children    4       6984
per:religion    5       2584
per:countries_of_residence      6       1

I believe the problem is caused by ner_feature.py, lines 84 to 91. For relation mentions, mention.labels is a list, so the code works fine. However, for entity mentions, mention.labels is a string (e.g. /person/soldier,/person/actor,/person/politician,/person/author,/person), so the for loop iterates over the characters of the string, producing the weird type.txt.
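The bug described in this issue is easy to reproduce: iterating over a comma-joined label string yields single characters, not labels. A minimal fix (a sketch, not the repository's actual patch) is to normalize labels to a list before looping:

```python
# Reproduce the reported bug and a possible normalization fix.
def normalize_labels(labels):
    """Accept either a comma-joined string or a list of label strings."""
    if isinstance(labels, str):
        return labels.split(",")
    return list(labels)

entity_labels = "/person/soldier,/person/actor,/person"
buggy = [l for l in entity_labels]        # iterates characters: '/', 'p', 'e', ...
fixed = normalize_labels(entity_labels)   # three whole labels
```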

Missing descriptions for the output files of retype

Hi,

I am very interested in how CoType works. However, it seems I can't find any descriptions for the output files generated from retype. More specifically, when I did test run with KBP, it gives the following outputs:

data/results/KBP/em:
total 309M
-rw-rw-r-- 1 msk msk 240M Nov 30 12:34 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 70M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 62K Nov 30 12:34 emb_retype_type.txt

data/results/KBP/rm:
total 440M
-rw-rw-r-- 1 msk msk 368M Nov 30 12:35 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 73M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 3.4K Nov 30 12:35 emb_retype_type.txt
-rw-rw-r-- 1 msk msk 48K Nov 30 12:35 prediction_emb_retype_cosine.txt
-rw-rw-r-- 1 msk msk 2.8K Nov 30 12:35 tune_thresholds_emb_retype_cosine.txt

However, I don't understand what each file means and what rows/columns of each file represent. Can you please provide this information?

Thank you,
Minseung Kim

Why is the PubMed-BioInfer and Wiki-KBP data on Google Drive so small?

PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from MESH ontology)

But at the Google Drive link, there are only 1,580 sentences in train.json and 708 sentences in test.json.

And for the Wiki-KBP dataset, the sizes are 23,784 sentences for train and 289 for test.

So I wonder if there is anything wrong?

How do you create the mention_type_test.txt?

I tried to run ./run.sh, but it didn't create mention_type_test.txt. I get this:

Traceback (most recent call last):
File "code/Evaluation/convertPredictionToJson.py", line 13, in
with open(typeMapFile) as typeF:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/type.txt'
Traceback (most recent call last):
File "code/Evaluation/tune_threshold.py", line 66, in
ground_truth = load_labels(indir + '/mention_type_test.txt')
File "/workspace/png/CoType-master/code/Evaluation/evaluation.py", line 17, in load_labels
with open(file_name) as f:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/mention_type_test.txt'

Does CoType handle multiple relations between two entities in distant supervision?

I'm currently applying CoType for Freebase relation extraction. Between two entity mentions in a sentence, there might exist multiple relations. For example, there exist two relations, president_of and nationality, between Obama and USA.

In distantSupervision.py of StructMineDataPipeline, multiple relations are joined with commas (line 96). In my case, I selected 100 relations as my target relations, but after feature generation, intermediate/my_extraction/rm/type.txt contains 1,200 types, where each comma-joined combination of relations is treated as a single type (president_of,nationality is one of the types). I wonder whether CoType handles the case of multiple relations?
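One way to avoid the comma-joined composite types this issue describes is to split each joined label back into its component relations before counting types. This is an illustrative workaround, not the repository's own behavior:

```python
from collections import Counter

# Comma-joined labels as emitted by the pipeline (illustrative values).
joined = ["president_of,nationality", "president_of", "nationality"]

# Count component relations instead of joined combinations.
counts = Counter()
for entry in joined:
    counts.update(entry.split(","))
```

With this normalization, 100 target relations would yield at most 100 entries in type.txt rather than one entry per observed combination.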

The dataset linked to google drive seems to be wrong

I scanned through the test sets of both NYT and KBP as referred to in the link. The 'test.json' in KBP has a field 'articleID' beginning with 'NYT', which suggests it might be the NYT dataset.

I hope the dataset links on Google Drive can be updated :)

Benchmark result Reproduction

I used the default settings for the KBP dataset.

code/Model/retype/retype -data $Data -mode j -size 50 -negative 3 -threads 3 -alpha 0.0001 -samples 1 -iters 400 -lr 0.02 -transWeight 1.0

But my best F1 score only reaches 0.31, which is far from the benchmark performance (0.369).

My Result:
Best threshold: 0.54 .  Precision: 0.303225806442 .     Recall: 0.33215547702 . F1: 0.317032035472

Benchmark performance:
CoType (Ren et al., 2017) | 0.348 | 0.406 | 0.369

Could anyone help me reproduce the benchmark result? Thank you very much.

Trouble running Stanford CoreNLP and run.sh

Stanford CoreNLP's stanza wrapper requires Python >= 3.6, while this code is based on Python 2, so `import stanza.nlp` fails with a module import error. Is there legacy support for Python 2, or was I running the code the wrong way?

stanza issue

git clone git@github.com:stanfordnlp/stanza.git
Cloning into 'stanza'...
Warning: Permanently added the RSA host key for IP address '52.74.223.119' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
(py27) shuzhilian@gpuserver0301:/home/chency/wll/pythonworkspace/DS-RelationExtraction/code/DataProcessor$

CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with

gpuws@gpuws32g:/media/gpuws/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/DS-RelationExtraction$ bash run.sh
NYT
Generate Features...
Start nlp parsing
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a953e10>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
(the same ConnectionError and "Taking a 30s break" messages repeat)

IOError: [Errno 2] No such file or directory: 'data/results/KBP/rm/prediction_emb_retype_cosine.txt'

When running the script run.sh, I get the error below:

Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory
...
IOError: [Errno 2] No such file or directory: 'data/results/KBP/rm/prediction_emb_retype_cosine.txt'

Please tell me how to build libgsl. I already ran the make command in the retype folder.

liblinear.so.3

During "Evaluate on Relation Extraction", the code raises Exception('LIBLINEAR library not found.'), and when I bypass the try block it reports "OSError: ~/anaconda2/bin/../lib/libgomp.so.1: version `GOMP_4.0' not found (required by /data/guozhao/sourceCodes/CoType-master/code/Classifier/liblinear.so.3)".
