2WikiMultihopQA: A Dataset for Comprehensive Evaluation of Reasoning Steps

This is the repository for the paper: Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (COLING 2020).

New Update (April 7, 2021)

We release the para_with_hyperlink.zip file that contains all articles/paragraphs with hyperlink information (except for some error paragraphs). Each paragraph has the following information:
- id: Wikipedia id of an article
- title
- sentences: a list of sentence
- mentions: a list of hyperlink information, each element has the following information: id, start, end, ref_url, ref_ids, sent_idx.
Here is the link of the dataset that has fixed the inconsistency of sentence segmentation in paragraphs.

Update on December 8, 2020

Due to the multiple names of an entity in Wikidata, we add evidences_id and answer_id to our dataset. Here are the details:
- For Inference and Compositional questions: we add to all questions.
- For Comparison and Bridge_comparison questions: we add to questions that have relations: country, country of origin, and country of citizenship.
We update the new evaluation script 2wikimultihop_evaluation_v1.1.py. We can use this evaluation script to evaluate the dataset with evidences_id and answer_id.
We also update the results of the baseline model by using the new evaluation script and the dataset with evidences_id and answer_id. The updated results of tables 5, 6, and 7 in the paper are in the folder update_results.
Here is the link of the dataset with evidences_id and answer_id. File id_aliases.json is used for evaluation.

Leaderboard

Date	Model	Ans EM	Ans F1	Sup EM	Sup F1	Evi EM	Evi F1	Joint EM	Joint F1
Oct 29, 2021	NA-Reviewer	76.73	81.91	89.61	94.31	53.66	70.83	52.75	65.23
Oct 26, 2021	CRERC	69.58	72.33	82.86	90.68	54.86	68.83	49.80	58.99
June, 2022	BigBird-base model	74.05	79.68	77.14	92.13	45.75	76.64	39.30	63.24
Jan 12, 2022	BigBird-base model - Weighted (Anonymous)	73.04	78.90	76.92	91.95	45.05	76.13	38.72	62.33
Jan 12, 2022	BigBird-base model - Unweighted (Anonymous)	72.38	77.98	75.68	91.56	35.07	71.09	29.86	57.74
June 14, 2021	BigBird-base model (Anonymous)	71.42	77.64	73.84	90.68	24.64	63.69	21.37	51.44
Dec 11, 2021	RoBERTa-base (Anonymous)	32.24	40.90	40.91	71.85	13.80	41.37	6.92	20.54
Oct 25, 2020	Baseline model	36.53	43.93	24.99	65.26	1.07	14.94	0.35	5.41
Aug 2, 2023	Beam Retrieval	88.47	90.87	95.87	98.15	x	x	x	x
July 30, 2021	HGN-revise model (Anonymous)	71.20	75.69	69.35	89.07	x	x	x	x

Submission guide

To evaluate your model on the test data, please contact us. Please prepare the following information:

Your prediction file (follow format in file: prediction_format.json)
The name of your model
Public repository of your model (optional)
Reference to your publication (optional)

Dataset Contents

The full dataset is in here.

Our dataset follows the format of HotpotQA. Each sample has the following keys:

_id: a unique id for each sample
question: a string
answer: an answer to the question. The test data does not have this information.
supporting_facts: a list, each element is a list that contains: [title, sent_id], title is the title of the paragraph, sent_id is the sentence index (start from 0) of the sentence that the model uses. The test data does not have this information.
context: a list, each element is a list that contains [title, setences], sentences is a list of sentences.
evidences: a list, each element is a triple that contains [subject entity, relation, object entity]. The test data does not have this information.
type: a string, there are four types of questions in our dataset: comparison, inference, compositional, and bridge-comparison.
entity_ids: a string that contains the two Wikidata ids (four for bridge_comparison question) of the gold paragraphs, e.g., 'Q7320430_Q51759'.

Baseline model

Our baseline model is based on the baseline model in HotpotQA. The process to train and test is quite similar to HotpotQA.

Process train data

python3 main.py --mode prepro --data_file wikimultihop/train.json --para_limit 2250 --data_split train

Process dev data

python3 main.py --mode prepro --data_file wikimultihop/dev.json --para_limit 2250 --data_split dev

Train a model

python3 -u main.py --mode train --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0

Evaluation on dev (Local Evaluation)

python3 main.py --mode test --data_split dev --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --save WikiMultiHop-20201024-023745 --prediction_file predictions/wikimultihop_dev_pred.json

python3 2wikimultihop_evaluate.py predictions/wikimultihop_dev_pred.json data/dev.json

Use new evaluation script

python3 2wikimultihop_evaluate_v1.1.py predictions/wikimultihop_dev_pred.json data_ids/dev.json id_aliases.json

Citation

If you plan to use the dataset, please cite our paper:

@inproceedings{xanh2020_2wikimultihop,
    title = "Constructing A Multi-hop {QA} Dataset for Comprehensive Evaluation of Reasoning Steps",
    author = "Ho, Xanh  and
      Duong Nguyen, Anh-Khoa  and
      Sugawara, Saku  and
      Aizawa, Akiko",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.580",
    pages = "6609--6625",
}

References

The baseline model and the evaluation script are adapted from https://github.com/hotpotqa/hotpot

alab-nii / 2wikimultihop Goto Github PK

2wikimultihop's Introduction

2WikiMultihopQA: A Dataset for Comprehensive Evaluation of Reasoning Steps

New Update (April 7, 2021)

Update on December 8, 2020

Leaderboard

Submission guide

Dataset Contents

Baseline model

Citation

References

2wikimultihop's People

Contributors

Stargazers

Watchers

Forkers

2wikimultihop's Issues

Recommend Projects

Recommend Topics

Recommend Org