USC Distantly-supervised Relation Extraction System

This repository puts together recent models and data sets for sentence-level relation extraction using knowledge bases (i.e., distant supervision). In particular, it contains the source code for WWW'17 paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.

Please also check out our new repository on handling shifted label distribution in distant supervision.

Task: Given a text corpus with entity mentions detected and heuristically labeled using distant supervision, the task aims to identify relation types/labels between a pair of entity mentions based on the sentence context where they co-occur.
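To make the distant-supervision setup concrete, here is a toy sketch of how knowledge-base facts heuristically label co-occurring entity mentions in a sentence. The facts, names, and relation label below are purely illustrative; the real pipeline aligns Freebase facts to mentions detected by Stanford NER.

```python
# Toy distant-supervision labeler: every co-occurring mention pair gets the
# KB relation if one exists, otherwise "None". Illustrative only.
kb_facts = {("Obama", "USA"): "per:countries_of_residence"}

def ds_label(sentence, mentions, kb=kb_facts):
    """Return (e1, e2, label, sentence) tuples for all mention pairs."""
    labeled = []
    for i, e1 in enumerate(mentions):
        for e2 in mentions[i + 1:]:
            labeled.append((e1, e2, kb.get((e1, e2), "None"), sentence))
    return labeled

pairs = ds_label("Obama lived in the USA.", ["Obama", "USA"])
```

Labels produced this way are noisy by design, which is exactly the challenge the models in this repository address.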

Quick Start

Blog Posts

Data

For evaluation on sentence-level extraction, we processed three public datasets into our JSON format using our data pipeline. We ran Stanford NER on the training sets to detect entity mentions, mapped entity names to Freebase entities using DBpedia Spotlight, aligned Freebase facts to sentences, and assigned the entity types of the Freebase entities to their mapped names in the sentences:

  • PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from MESH ontology). (Download)

  • NYT-manual: 1.18M sentences sampled from 294K New York Times news articles which were then aligned with Freebase facts by (Riedel et al., ECML'10) (link to Riedel's data). For test set, 395 sentences are manually annotated with 24 relation types and 47 entity types (Hoffmann et al., ACL'11) (link to Hoffmann's data). (Download)

  • Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780k Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from 2013 KBP corpus. Test data consists of 14k system-labeled sentences from 2013 KBP slot filling assessment results. It has 7 relation types and 126 entity types after filtering of numeric value relations. (Download)

Please put the data files in the corresponding subdirectories under data/source.
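The downloaded files are line-delimited JSON (one sentence object per line). The exact schema is not documented in this README, so the field names in the sketch below (e.g. `sentText`, `relationMentions`) are assumptions; check your downloaded train.json/test.json for the actual keys.

```python
import json

def read_sentences(path):
    """Hypothetical reader for data/source/<Data>/train.json, assuming one
    JSON object per line. Field names below are assumptions, not documented."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Example using an in-memory line instead of a real file:
line = '{"sentText": "Obama lived in the USA.", "relationMentions": []}'
sent = json.loads(line)
```

Streaming line by line keeps memory bounded even for the 1M+ sentence corpora.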

Benchmark

Performance comparison with several relation extraction systems over KBP 2013 dataset (sentence-level extraction).

| Method | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Mintz (our implementation; Mintz et al., 2009) | 0.296 | 0.387 | 0.335 |
| LINE + Dist Sup (Tang et al., 2015) | 0.360 | 0.257 | 0.299 |
| MultiR (Hoffmann et al., 2011) | 0.325 | 0.278 | 0.301 |
| FCM + Dist Sup (Gormley et al., 2015) | 0.151 | 0.498 | 0.300 |
| HypeNet (our implementation; Shwartz et al., 2016) | 0.210 | 0.315 | 0.252 |
| CNN (our implementation; Zeng et al., 2014) | 0.198 | 0.334 | 0.242 |
| PCNN (our implementation; Zeng et al., 2015) | 0.220 | 0.452 | 0.295 |
| LSTM (our implementation) | 0.274 | 0.500 | 0.350 |
| Bi-GRU (our implementation) | 0.301 | 0.465 | 0.362 |
| SDP-LSTM (our implementation; Xu et al., 2015) | 0.300 | 0.436 | 0.356 |
| Position-Aware LSTM (Zhang et al., 2017) | 0.265 | 0.598 | 0.367 |
| CoType-RM (Ren et al., 2017) | 0.303 | 0.407 | 0.347 |
| CoType (Ren et al., 2017) | 0.348 | 0.406 | 0.369 |

Note: for models trained on sentences annotated with a single label (HypeNet, CNN/PCNN, LSTM, SDP/PA-LSTMs, Bi-GRU), we form one training instance for each sentence-label pair based on the DS-annotated data.

Usage

Dependencies

We use Ubuntu as an example.

  • python 2.7
  • Python library dependencies
$ pip install pexpect ujson tqdm
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip

We have included compiled binaries. If you need to re-compile retype.cpp, run the following under your own g++ environment:

$ cd code/Model/retype; make

Default Run

As an example, we show how to run CoType on the Wiki-KBP dataset

Start the Stanford corenlp server for the python wrapper.

$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

Feature extraction, embedding learning on training data, and evaluation on test data.

$ ./run.sh  

For relation classification, the "none"-labeled instances need to be removed from the train/test JSON files first. The hyperparameters for embedding learning are included in the run.sh script.

Parameters

  • Dataset to run on:

Data="KBP"

  • Hyperparameters for relation extraction:
    - KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
    - NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
    - BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0

Hyperparameters for relation classification are included in the run.sh script.

Evaluation

Evaluate relation extraction performance (precision, recall, F1): produce predictions along with their confidence scores, then filter the predicted instances by tuning the thresholds.

$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
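Conceptually, threshold tuning keeps only predictions whose confidence exceeds a cutoff and scores the survivors against the gold labels. The sketch below illustrates that idea only; the mention IDs, data layout, and numbers are made up and do not match the actual I/O of tune_threshold.py.

```python
# Illustrative precision/recall/F1 at a confidence threshold.
def prf1(predictions, gold, threshold):
    kept = {m: label for m, (label, conf) in predictions.items() if conf >= threshold}
    correct = sum(1 for m, label in kept.items() if gold.get(m) == label)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = {"m1": ("per:parents", 0.9), "m2": ("per:children", 0.3), "m3": ("per:religion", 0.8)}
gold = {"m1": "per:parents", "m3": "per:spouse"}
p, r, f = prf1(preds, gold, threshold=0.5)
```

Raising the threshold generally trades recall for precision, which is why the script sweeps a range of cutoffs and reports the best F1.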

In-text Prediction

The last command in run.sh generates a JSON file of the predicted results, in the same format as test.json in data/source/$DATANAME, except that only the predicted relation mention labels are output. Replace the second parameter with the threshold you would like.

$ python code/Evaluation/convertPredictionToJson.py $Data 0.0

Customized Run

Code for producing the JSON files from a raw corpus for running CoType and baseline models is here.

Baselines

You can find our implementations of several recent relation extraction models under the code/Model/ directory.

References

Contributors

  • Ellen Wu
  • Meng Qu
  • Frank Xu
  • Wenqi He
  • Maosen Zhang
  • Qinyuan Ye
  • Xiang Ren


usc-ds-relationextraction's Issues

NYT corpus: Issue with number of entity types

Hello,

In your paper you state that you have 47 entity types for the NYT corpus.
However, in the uploaded data as well as in the data generation code I find there are only 3 (person, organisation, location).
Do you consider these types as supertypes and include other types inside? If so, can you provide the correct mapping from Freebase?

Thanks in advance!

Applying CoType to Freebase

I found your work very interesting and I'd like to apply it to extract Freebase relations.
Could you give some hints on the code I should modify so that CoType can extract Freebase relations?

Thanks!

libgsl.so.19: No such file or directory

When I run the ./run.sh

I get this error:

Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory

Although I have installed gsl, it did not work.
Are there any suggestions? Thanks a lot!

Output for data/intermediate/KBP/em/type.txt is strange

The content of intermediate/KBP/em/type.txt is strange for the KBP corpus. I haven't tested on other corpora.

type.txt:

/       0       466411
p       1       177541
e       2       270192
r       3       288178
s       4       182380
o       5       480385
n       6       380277
l       7       159181
d       8       28569
i       9       303219
,       10      170403
m       11      42962
a       12      236368
c       13      196298
h       14      38605
t       15      312620
u       16      59252
g       17      59662
z       18      16877
_       19      16360
w       20      5846
k       21      2778
f       22      3852
y       23      52196
b       24      4285
v       25      21442
x       26      1

For relation mentions, type.txt looks fine:

None    0       111343
per:country_of_death    1       10265
per:country_of_birth    2       12040
per:parents     3       6984
per:children    4       6984
per:religion    5       2584
per:countries_of_residence      6       1

I believe the problem is caused by ner_feature.py, lines 84 to 91. For relation mentions, mention.labels is a list, so the code works fine. However, for entity mentions, mention.labels is a string (e.g. /person/soldier,/person/actor,/person/politician,/person/author,/person), so the for loop iterates over the characters of the string, producing the weird type.txt.
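The bug described in this issue is easy to reproduce: iterating over a comma-joined label string yields single characters, not labels. A minimal fix (a sketch, not the repository's actual patch) is to normalize labels to a list before looping:

```python
# Reproduce the reported bug and a possible normalization fix.
def normalize_labels(labels):
    """Accept either a comma-joined string or a list of label strings."""
    if isinstance(labels, str):
        return labels.split(",")
    return list(labels)

entity_labels = "/person/soldier,/person/actor,/person"
buggy = [l for l in entity_labels]        # iterates characters: '/', 'p', 'e', ...
fixed = normalize_labels(entity_labels)   # three whole labels
```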

Missing descriptions for the output files of retype

Hi,

I am very interested in how CoType works. However, it seems I can't find any descriptions for the output files generated from retype. More specifically, when I did test run with KBP, it gives the following outputs:

data/results/KBP/em:
total 309M
-rw-rw-r-- 1 msk msk 240M Nov 30 12:34 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 70M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 62K Nov 30 12:34 emb_retype_type.txt

data/results/KBP/rm:
total 440M
-rw-rw-r-- 1 msk msk 368M Nov 30 12:35 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 73M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 3.4K Nov 30 12:35 emb_retype_type.txt
-rw-rw-r-- 1 msk msk 48K Nov 30 12:35 prediction_emb_retype_cosine.txt
-rw-rw-r-- 1 msk msk 2.8K Nov 30 12:35 tune_thresholds_emb_retype_cosine.txt

However, I don't understand what each file means and what rows/columns of each file represent. Can you please provide this information?

Thank you,
Minseung Kim

Why is the PubMed-BioInfer and Wiki-KBP data on Google Drive so small?

PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from MESH ontology)

But at the Google Drive link, there are only 1,580 sentences in train.json and 708 sentences in test.json.

And for the Wiki-KBP dataset, the sizes are 23,784 sentences for train and 289 for test.

So I wonder if there is anything wrong?

How do you create the mention_type_test.txt?

I tried to run ./run.sh, but it didn't create mention_type_test.txt. I get this:

Traceback (most recent call last):
File "code/Evaluation/convertPredictionToJson.py", line 13, in
with open(typeMapFile) as typeF:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/type.txt'
Traceback (most recent call last):
File "code/Evaluation/tune_threshold.py", line 66, in
ground_truth = load_labels(indir + '/mention_type_test.txt')
File "/workspace/png/CoType-master/code/Evaluation/evaluation.py", line 17, in load_labels
with open(file_name) as f:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/mention_type_test.txt'

Does CoType handle multiple relations between two entities in distant supervision?

I'm currently applying CoType for Freebase relation extraction. Between two entity mentions in a sentence, there might exist multiple relations. For example, there exist two relations, president_of and nationality, between Obama and USA.

In distantSupervision.py of StructMineDataPipeline, multiple relations are joined with commas (line 96). In my case, I selected 100 relations as my target relations, but after feature generation, intermediate/my_extraction/rm/type.txt contains 1,200 types, where each comma-joined combination of relations is treated as a single type (president_of,nationality is one of the types). I wonder whether CoType handles the case of multiple relations?
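One way to avoid the comma-joined composite types this issue describes is to split each joined label back into its component relations before counting types. This is an illustrative workaround, not the repository's own behavior:

```python
from collections import Counter

# Comma-joined labels as emitted by the pipeline (illustrative values).
joined = ["president_of,nationality", "president_of", "nationality"]

# Count component relations instead of joined combinations.
counts = Counter()
for entry in joined:
    counts.update(entry.split(","))
```

With this normalization, 100 target relations would yield at most 100 entries in type.txt rather than one entry per observed combination.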

The dataset linked to google drive seems to be wrong

I scanned through the test sets of both NYT and KBP as referred to in the link. The 'test.json' in KBP has a field 'articleID' beginning with 'NYT', which suggests it might be the NYT dataset.

I hope the dataset links on Google Drive can be updated :)

Benchmark result Reproduction

I used the default settings for the KBP dataset.

code/Model/retype/retype -data $Data -mode j -size 50 -negative 3 -threads 3 -alpha 0.0001 -samples 1 -iters 400 -lr 0.02 -transWeight 1.0

But my best F1 score only reaches 0.31, which is far from the benchmark performance (0.369).

My Result:
Best threshold: 0.54 .  Precision: 0.303225806442 .     Recall: 0.33215547702 . F1: 0.317032035472

Benchmark performance:
CoType (Ren et al., 2017) | 0.348 | 0.406 | 0.369

Could anyone help me reproduce the benchmark result? Thank you very much.

Trouble running Stanford CoreNLP and run.sh

Stanford CoreNLP's stanza wrapper requires Python >= 3.6, while this code is based on Python 2, so `import stanza.nlp` fails with a module import error. Is there legacy support for Python 2, or was I running the code the wrong way?

stanza issue

git clone git@github.com:stanfordnlp/stanza.git
Cloning into 'stanza'...
Warning: Permanently added the RSA host key for IP address '52.74.223.119' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
(py27) shuzhilian@gpuserver0301:/home/chency/wll/pythonworkspace/DS-RelationExtraction/code/DataProcessor$

CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with

gpuws@gpuws32g:/media/gpuws/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/DS-RelationExtraction$ bash run.sh
NYT
Generate Features...
Start nlp parsing
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a953e10>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
(the same ConnectionError and "Taking a 30s break" messages repeat)

IOError: [Errno 2] No such file or directory: 'data/results/KBP/rm/prediction_emb_retype_cosine.txt'

When running the script run.sh, I get the error below:

Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory
...
IOError: [Errno 2] No such file or directory: 'data/results/KBP/rm/prediction_emb_retype_cosine.txt'

Please tell me how to build libgsl. I already ran the make command in the retype folder.

liblinear.so.3

During "Evaluate on Relation Extraction", the code raises Exception('LIBLINEAR library not found.'), and when I bypass the try block it reports "OSError: ~/anaconda2/bin/../lib/libgomp.so.1: version `GOMP_4.0' not found (required by /data/guozhao/sourceCodes/CoType-master/code/Classifier/liblinear.so.3)".
