MRQA 2019 Shared Task on Generalization

Overview

The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples, a significantly harder challenge.

The format of the task is extractive question answering. Given a question and context passage, systems must find the word or phrase in the document that best answers the question. While this format is somewhat restrictive, it allows us to leverage many existing datasets, and its simplicity helps us focus on out-of-domain generalization, instead of other important but orthogonal challenges.

We release an official training dataset containing examples from existing extractive QA datasets, and evaluate submitted models on ten hidden test datasets. Both train and test datasets have the same format described above, but may differ in some of the following ways:

  • Passage distribution: Test examples may involve passages from different sources (e.g., science, news, novels, medical abstracts, etc.) with pronounced syntactic and lexical differences.
  • Question distribution: Test examples may emphasize different styles of questions (e.g., entity-centric, relational, other tasks reformulated as QA, etc.), which may come from different sources (e.g., crowdworkers, domain experts, exam writers, etc.).
  • Joint distribution: Test examples may vary according to the relationship of the question to the passage (e.g., collected independently of vs. dependent on the evidence, multi-hop, etc.).

Each participant will submit a single QA system trained on the provided training data. We will then privately evaluate each system on the hidden test data.

This repository contains resources for accessing the official training and development data.

Datasets

We have adapted several existing datasets from their original formats and settings to conform to our unified extractive setting. Most notably:

  • We provide only a single, length-limited context.
  • There are no unanswerable questions and no questions with non-span answers.
  • All questions have at least one accepted answer that is found exactly in the context.

A span is judged to be an exact match if it matches the answer string after performing normalization consistent with the SQuAD dataset. Specifically:

  • The text is uncased.
  • All punctuation is stripped.
  • All articles {a, an, the} are removed.
  • All consecutive whitespace is collapsed to a single normal space ' '.
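
For reference, a minimal sketch of this normalization in Python, consistent with the official SQuAD evaluation script (the function name here is illustrative):

import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())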

Training Data

Dataset            Download   MD5SUM                             Examples
SQuAD              Link       67afd110c0ad9860c4e88f16a44cd44c   86,588
NewsQA             Link       d8288b5de5bd10fb42ce5291ef0f7fbe   74,160
TriviaQA           Link       1d198c0cd60e4d91130e2a2545eb9122   61,688
SearchQA           Link       fa9c8c6b2f24e4f410cba81ef63ea284   117,384
HotpotQA           Link       53e65212b46c74a6ee95e83817443db1   72,912
NaturalQuestions   Link       f12d2ce98ba0065a79226b9fa22d936a   104,071

Development Data

In-Domain

Dataset            Download   MD5SUM                             Examples
SQuAD              Link       be0d95e28b470254b3574aeada84a79d   10,507
NewsQA             Link       aa9878b7469ad5b5c0f0738636cdb5bd   4,212
TriviaQA           Link       fdfac306651dd74372f0edcff357ec80   7,785
SearchQA           Link       fa087f2cc134f9c316f1d93c40827615   16,980
HotpotQA           Link       d0adef52100cbbf93090ba6c06b83b2b   5,901
NaturalQuestions   Link       a017834fddfe9df888b7f6cd5bbfba2e   12,836

Note: This in-domain data may be used to help develop models. The final testing, however, will only contain out-of-domain data.

Out-of-Domain

Out-of-domain data will be released at a future date.

Download Scripts

We provide convenience scripts for downloading all of the released training and development data.

To download the training data, run:

./download_train.sh path/to/store/downloaded/directory

To download the in-domain development data for the training datasets, run:

./download_in_domain_dev.sh path/to/store/downloaded/directory
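
The MD5SUMs in the tables above can be used to check the downloaded files. A minimal sketch (the file name is illustrative):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Stream the file in chunks so large .jsonl.gz files fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: md5sum("SQuAD.jsonl.gz") should print
# 67afd110c0ad9860c4e88f16a44cd44c for the training split.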

MRQA Format

All of the datasets for this task have been adapted to follow a unified format. They are stored as compressed JSONL files (with file extension .jsonl.gz).

The general format is:

{
  "header": {
    "dataset": <dataset name>,
    "split": <train|dev|test>
  }
}
...
{
  "context": <context text>,
  "context_tokens": [[token_1, offset_1], ..., [token_l, offset_l]],
  "qas": [
    {
      "qid": <uuid>,
      "question": <question text>,
      "question_tokens": [[token_1, offset_1], ..., [token_q, offset_q]],
      "detected_answers": [
        {
          "text": <answer text>,
          "char_spans": [[start_1, end_1], ..., [start_n, end_n]],
          "token_spans": [[start_1, end_1], ..., [start_n, end_n]]
        },
        ...
      ],
      "answers": [<answer_text_1>, ..., <answer_text_m>]
    },
    ...
  ]
}

Note that it is permissible to download the original datasets and use them as you wish. However, this is the format that the test data will be presented in.
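
For concreteness, a minimal sketch of iterating over a file in this format (the local path is hypothetical):

import gzip
import json

def read_mrqa(path):
    """Yield context records from an MRQA-format .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        next(f)  # the first line is the header ({"header": {...}})
        for line in f:
            yield json.loads(line)

for example in read_mrqa("SQuAD.jsonl.gz"):
    for qa in example["qas"]:
        print(qa["qid"], qa["question"])
    break  # show only the first context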

Fields

  • context: This is the raw text of the supporting passage. Three special token types have been inserted: [TLE] precedes document titles, [DOC] denotes document breaks, and [PAR] denotes paragraph breaks. The maximum length of the context is 800 tokens.
  • context_tokens: A tokenized version of the supporting passage, using spaCy. Each token is a tuple of the token string and token character offset. The maximum number of tokens is 800.
  • qas: A list of questions for the given context.
  • qid: A unique identifier for the question. The qid is unique across all datasets.
  • question: The raw text of the question.
  • question_tokens: A tokenized version of the question. The tokenizer and token format is the same as for the context.
  • detected_answers: A list of answer spans for the given question that index into the context. For some datasets these spans have been automatically detected using search heuristics. The same answer may appear multiple times in the text; each of these occurrences is recorded. For example, if the answer is 42, the context "The answer is 42. 42 is the answer." has two occurrences marked.
    • text: The raw text of the detected answer.
    • char_spans: Inclusive [start, end] character spans (indexing into the raw context).
    • token_spans: Inclusive [start, end] token spans (indexing into the tokenized context).
  • answers: All accepted answers to the question, whether or not they have an exact match in the given context.
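
To illustrate the inclusive span convention, a minimal sketch (assuming, per the description above, that each detected answer's text matches its context span exactly):

def first_answer_span(example):
    # char_spans are inclusive [start, end], so slicing extends one past end.
    context = example["context"]
    answer = example["qas"][0]["detected_answers"][0]
    start, end = answer["char_spans"][0]
    assert context[start:end + 1] == answer["text"]
    return answer["text"]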

Visualization

To view examples in the terminal, first install the requirements (pip install -r requirements.txt) and then run:

python visualize.py path/or/url

The script argument may be either a URL or a local file path. For example:

python visualize.py https://s3.us-east-2.amazonaws.com/mrqa/release/train/SQuAD.jsonl.gz

Evaluation

Answers are evaluated using exact match and token-level F1 metrics. The mrqa_official_eval.py script is used to evaluate predictions on a given dataset:

python mrqa_official_eval.py <url_or_filename> <predictions_file>

The predictions file must be valid JSON, mapping each qid to its predicted answer string:

{
  "qid_1": "answer span text 1",
  ...
  "qid_n": "answer span text N"
}

The final score for the MRQA shared task will be the macro-average across all test datasets.
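
Token-level F1 treats the prediction and reference as bags of tokens. A minimal sketch (pass the normalize_answer sketch above, or any comparable normalizer, as normalize):

from collections import Counter

def token_f1(prediction, reference, normalize):
    # Bag-of-tokens overlap between the normalized prediction and reference.
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)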

Baseline Model

An implementation of a simple multi-task BERT-based baseline model is available in the baseline directory.

Submission

Submission will be handled through the Codalab platform. Instructions will be released soon. We will ask participants to submit two components:

  1. A command that makes predictions given a .jsonl.gz file in our standard format;
  2. A command that starts a local server that accepts POST requests of single JSON objects in our standard format, and returns a JSON prediction object.

The baseline directory includes example implementations of both components, in predict.py and serve.py, respectively. The server will be used to create interactive demos for all submitted models.
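
For illustration only, a minimal sketch of component (2) using the standard library; the repository's serve.py is the reference implementation, and the placeholder prediction below is not a real model:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read one JSON object in the MRQA format from the request body.
        length = int(self.headers["Content-Length"])
        example = json.loads(self.rfile.read(length))
        # A real submission would run its model here; we return dummy spans.
        predictions = {qa["qid"]: "placeholder answer" for qa in example["qas"]}
        body = json.dumps(predictions).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), PredictHandler).serve_forever()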
