
oie-benchmark


Introduction

This repository contains code for converting QA-SRL annotations to Open-IE extractions and comparing Open-IE parsers against a converted benchmark corpus. This is an implementation of the algorithms described in our EMNLP2016 paper.

Citing

If you use this software, please cite:

@InProceedings{Stanovsky2016EMNLP,
  author    = {Gabriel Stanovsky and Ido Dagan},
  title     = {Creating a Large Benchmark for Open Information Extraction},
  booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {November},
  year      = {2016},
  address   = {Austin, Texas},
  publisher = {Association for Computational Linguistics},
  pages     = {(to appear)},
}

Contact

Leave us a note at gabriel (dot) satanovsky (at) gmail (dot) com

Requirements

  • Python 3
  • See required python packages here.

Additional help can be found in the FAQ section.

Changelog

Since the publication of this resource, we have made several changes, outlined below. The original version of the corpus, with 10,359 extractions as reported in the paper, is available here.

Filtering Pronoun Arguments

We removed extractions in which one of the arguments was only a pronoun, whenever the same extraction also appeared with the entity the pronoun refers to.
For example, consider: "John went home, he was hungry". The original corpus would have contained both extractions:

  1. ("John", "was", "hungry")

  2. ("he", "was", "hungry")

    In the current version of the corpus, extraction (2) is omitted, following the observation that penalizing systems for not producing it is unfair, since some systems skip such pronominal extractions as a deliberate design choice.
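
For illustration, a minimal sketch of this kind of filtering is shown below (the Extraction tuple, the pronoun list, and the helper names are hypothetical and not the repository's actual code):

from collections import namedtuple

# Hypothetical, simplified representation of an extraction; the real corpus
# stores richer structures.
Extraction = namedtuple("Extraction", ["arg1", "pred", "arg2"])

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "i", "we", "you"}

def is_pronoun_only(arg):
    # True if the argument consists of a single pronoun token.
    tokens = arg.lower().split()
    return len(tokens) == 1 and tokens[0] in PRONOUNS

def filter_pronoun_duplicates(extractions):
    # Drop an extraction whose first argument is a bare pronoun if a
    # non-pronoun extraction with the same predicate and second argument exists.
    kept = []
    for ex in extractions:
        if is_pronoun_only(ex.arg1):
            has_resolved_twin = any(
                other.pred == ex.pred and other.arg2 == ex.arg2
                and not is_pronoun_only(other.arg1)
                for other in extractions)
            if has_resolved_twin:
                continue
        kept.append(ex)
    return kept

# Using the example above, only ("John", "was", "hungry") is kept:
examples = [Extraction("John", "was", "hungry"), Extraction("he", "was", "hungry")]
print(filter_pronoun_duplicates(examples))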

Changing the Matching Function

Similarly to the first point, we changed the evaluation scripts slightly to make the matching more lenient. Overall, while this changes the absolute performance numbers of the different systems, it does not change their relative ranking.
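
As a rough illustration of what a more lenient match can look like (a simplified sketch, not the benchmark's actual matching function), exact string equality can be relaxed to a token-overlap test:

def lexical_match(predicted, gold, threshold=0.5):
    # Simplified, illustrative matcher: the two extractions are considered
    # matched if, for each element (arg1, predicate, arg2), at least
    # `threshold` of the gold tokens also appear in the predicted element.
    for pred_part, gold_part in zip(predicted, gold):
        gold_tokens = set(gold_part.lower().split())
        pred_tokens = set(pred_part.lower().split())
        if not gold_tokens:
            continue
        overlap = len(gold_tokens & pred_tokens) / float(len(gold_tokens))
        if overlap < threshold:
            return False
    return True

# ("John", "was", "very hungry") still matches the gold ("John", "was", "hungry"):
print(lexical_match(("John", "was", "very hungry"), ("John", "was", "hungry")))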

Converting QA-SRL to Open IE

To run the code, first obtain the full QA-SRL corpus and place it under QASRL-full. Then run:

./create_oie_corpus.sh

If everything runs fine, this should create an Open IE corpus (split between wiki and newswire domain) under oie_corpus. A snapshot of the corpus is also available.

Expected Folder Structure

The script currently expects the following folder structure:

QASRL-full/:
newswire/  README.md  wiki/

QASRL-full/newswire:
propbank.dev.qa  propbank.qa  propbank.test.qa  propbank.train.qa

QASRL-full/wiki:
wiki1.dev.qa  wiki1.qa  wiki1.test.qa  wiki1.train.qa

Please make sure that your folders adhere to this structure and these naming conventions.

Otherwise, you can invoke the conversion separately for each QA-SRL file by running qa_to_oie.py --in=INPUT_FILE --out=OUTPUT_FILE, where INPUT_FILE is the QA-SRL file and OUTPUT_FILE is where the Open IE file will be created. The script above simply makes separate calls to qa_to_oie.py and then concatenates the outputs; an example is shown below.
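
For example, a per-file conversion might look like this (the output file names and the final concatenation step are only illustrative):

python qa_to_oie.py --in=./QASRL-full/newswire/propbank.dev.qa --out=./oie_corpus/propbank.dev.oie
python qa_to_oie.py --in=./QASRL-full/wiki/wiki1.dev.qa --out=./oie_corpus/wiki1.dev.oie
cat ./oie_corpus/*.oie > ./oie_corpus/all.oie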

Evaluating an Open IE Extractor

After converting QA-SRL to Open IE, you can now automatically evaluate your Open IE system against this corpus. Currently, we support the output formats of ClausIE, OLLIE, Open IE 4, PropS, ReVerb, and Stanford Open IE (see the options of benchmark.py below).

To compare your extractor:

  1. Run your extractor over the raw sentences and store the output into "your_output.txt"

  2. Depending on your output format, you can get a precision-recall curve by running benchmark.py:

Usage:
   benchmark --gold=GOLD_OIE --out=OUTPUT_FILE (--stanford=STANFORD_OIE | --ollie=OLLIE_OIE |--reverb=REVERB_OIE | --clausie=CLAUSIE_OIE | --openiefour=OPENIEFOUR_OIE | --props=PROPS_OIE)

Options:
  --gold=GOLD_OIE              The gold reference Open IE file (by default, it should be under ./oie_corpus/all.oie).
  --out=OUTPUT_FILE            The output file, into which the precision recall curve will be written.
  --clausie=CLAUSIE_OIE        Read ClausIE format from file CLAUSIE_OIE.
  --ollie=OLLIE_OIE            Read OLLIE format from file OLLIE_OIE.
  --openiefour=OPENIEFOUR_OIE  Read Open IE 4 format from file OPENIEFOUR_OIE.
  --props=PROPS_OIE            Read PropS format from file PROPS_OIE
  --reverb=REVERB_OIE          Read ReVerb format from file REVERB_OIE
  --stanford=STANFORD_OIE      Read Stanford format from file STANFORD_OIE

Evaluating Existing Systems

In the course of this work we tested the above-mentioned Open IE parsers against our benchmark. We provide the output files (i.e., Open IE extractions) of each of these systems in systems_output. You can pass each of these files to benchmark.py to get the corresponding precision-recall curve.

For example, to evaluate Stanford Open IE output, run:

python benchmark.py --gold=./oie_corpus/all.oie --out=./StanfordPR.dat --stanford=./systems_output/stanford_output.txt

Plotting

You can plot multiple outputs of benchmark.py together by using pr_plot.py:

Usage:
   pr_plot --in=DIR_NAME --out=OUTPUT_FILENAME 

Options:
  --in=DIR_NAME            Folder in which to search for *.dat files, all of which should be in a P/R column format (outputs from benchmark.py).
  --out=OUTPUT_FILENAME    Output filename, filetype will determine the format. Possible formats: pdf, pgf, png
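
For example, if the *.dat files produced by benchmark.py were collected in a folder named eval_outputs (an illustrative name), the plot could be produced with something like:

python pr_plot.py --in=./eval_outputs --out=./pr_curve.png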

Finally, try running:

./eval.sh

This will create the precision-recall figure using the outputs of the Open IE parsers in systems_output.


oie-benchmark's Issues

Add minimal required nltk version to README

I ran create_oie_corpus.sh on another machine and had problems with Python hanging when loading the QA-SRL files. With every line it got slower, and it never actually finished.

I figured out that the script spent most of the time loading the nltk PoS tagging model. I had nltk 3.1 installed. After updating to 3.2.1, it worked without problems. So, there seems to be an issue in the PoS tagger of older nltk versions.

It might be helpful to add a minimal required version of nltk to the README.
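
As an illustrative guard (not part of the repository's code), a script could check the installed nltk version before loading the tagger:

import nltk

# Illustrative check only: the report above says nltk 3.1 hangs while loading
# the PoS tagging model, and that upgrading to 3.2.1 fixes it.
major, minor = (int(part) for part in nltk.__version__.split(".")[:2])
if (major, minor) < (3, 2):
    raise RuntimeError(
        "nltk %s detected; please upgrade to nltk >= 3.2.1" % nltk.__version__)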

Possible inconsistencies with the numbers reported in the paper?

Hi Gabriel,

I have a couple of doubts. Please let me know if I am going wrong somewhere.

  1. Since I could not find the CoNLL 2009 English corpus anywhere, I was working with the oie_corpus snapshot you had shared. If I am not wrong, each line in the 'all.oie' file is supposed to count as one extraction, which makes a total of 8,481 extractions, while Table 2 of the paper reports the total number of extractions as 10,359. Please let me know if I am missing something here.

  2. I am also unable to reproduce the PR curve given in the paper. I was, however, able to regenerate the curve given in 'eval/eval.png', but that does not seem to match the one in the paper.

Diversity of the dataset predicates ?

Because the QA-SRL dataset is the source, most of the OIE benchmark predicates are single-word verbs. In the output of Open IE systems, relations might also take other forms, such as nouns or noun+preposition. Is there any future work considering that (for example, expanding the dataset with other available annotations such as AMR)?

Thanks for the benchmark, great work!

A possible mistake in benchmark.py

In lines 56-58,

for goldEx in goldExtractions:   
    unmatchedCount += len(goldExtractions)
    correctTotal += len(goldExtractions)

Maybe the for loop is not needed: with the loop, unmatchedCount and correctTotal each end up at len(goldExtractions) * len(goldExtractions), whereas len(goldExtractions) is enough for both.
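
In other words, the fix suggested here would be to drop the loop and increment each counter once:

# Suggested fix based on the observation above: count each gold extraction once.
unmatchedCount += len(goldExtractions)
correctTotal += len(goldExtractions)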

work environment problem

Hello @gabrielStanovsky,
I have a problem when I try to launch the create_oie_corpus.sh file. It returns:
IOError: [Errno 2] No such file or directory: './QASRL-full/newswire/propbank.dev.qa'
According to your documentation, QASRL-full should only contain the QA-SRL corpus, which, downloaded from https://dada.cs.washington.edu/qasrl/#dataset, looks like this:
newswire_nosent.dev.qa
newswire_nosent.test.qa
newswire_nosent.train.qa
README.md
wiki1.dev.qa
wiki1.test.qa
wiki1.train.qa

So did I miss something?

BTW Thanks for your paper and the benchmark

Should the verb "to be" be considered a predicate?

Hello, I have two questions on your data and on OpenIE more generally.

I am wondering if a sentence which only has the verb "to be" should be considered as having a predicate or not. It does not seem like you are including these as part of your training data.

For example:
He is a friendly person.

Should an OpenIE system predict that there is a relation "is" with arguments "he" and "a friendly person"?

Following up on this, how should the system behave when there is an auxiliary and a participle?

For example:
He is running to the store.

Should an OpenIE system predict that the relation is "is running" with arguments "he" and "to the store"?

Thanks a lot for your thoughts on this!

Problems in Eval

Hi,

I output my results in ClausIE format, but when I ran the evaluation I got only one pair of precision and recall values. Can you please tell me what the correct ClausIE format should be?

Thanks in advance!

Different Sentences in raw_sentences/all.txt and oie_corpus/all.oie

I am working on a research project on knowledge representation. I wanted to
use your benchmark for evaluation purposes.

I have cloned the repository and downloaded the dataset, but I am facing issues. Many sentences in all.txt in the raw_sentences folder are not present in the all.oie file in the oie_corpus folder.

Can you please help me figure this out? Has the dataset been updated? If so, how can I test against it now, since all.oie still points to the old dataset?

I have also run a difference checker; there is only a 2 percent overlap between the two files. Please help me! I need the dataset ASAP.
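
For reference, a quick overlap check along these lines might look roughly like the sketch below (it assumes, hypothetically, that the sentence is the first tab-separated field on each line of all.oie; adjust if your copy differs):

# Rough, illustrative sketch for comparing the sentence sets of the two files.
# Assumes the sentence is the first tab-separated field of each all.oie line.
with open("raw_sentences/all.txt") as raw_file:
    raw_sentences = {line.strip() for line in raw_file if line.strip()}
with open("oie_corpus/all.oie") as gold_file:
    gold_sentences = {line.split("\t")[0].strip() for line in gold_file if line.strip()}

overlap = raw_sentences & gold_sentences
print("%d of %d raw sentences appear in the gold file (%.1f%%)" %
      (len(overlap), len(raw_sentences), 100.0 * len(overlap) / len(raw_sentences)))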

Reproducing results in papers

Hi

Could you comment on which version of this repository, or of the associated supervised-oie (and the benchmark included in it), can be used to reproduce the results in your papers (including "Creating a Large Benchmark for Open Information Extraction")? There appear to be two sets of code which differ, there are various versions of the corpora, and it's also unclear which splits may have been used. So far I have found several different PR curve plots and have been able (with fixes to the code) to generate my own, but they are all different.

I'm also curious about the latest state of the performance of each system tested. I have not considered any updates to those systems, and the outputs included in this repo may not reflect their current state; this might also explain some of the differences.

Thanks

Tony
