
FairytaleQA: A Dataset for Question and Answer Generation

This repository contains the FairytaleQA dataset for our paper: Fantastic questions and where to find them: FairytaleQA -- An authentic dataset for narrative comprehension. [Accepted to ACL 2022]

The FairytaleQA dataset contains CSV files of 278 children's stories from Project Gutenberg and a set of questions and answers developed by educational experts based on an evidence-based theoretical framework. This dataset focuses on narrative comprehension of kindergarten to eighth-grade students to facilitate assessment and training of narrative comprehension skills for both machines and young children.

Dataset Statistics

[Tables: core statistics of the FairytaleQA dataset (left); breakdown statistics of QA pairs by the 7 narrative elements (right)]

The table on the left shows the core statistics of the FairytaleQA dataset. The table on the right shows the breakdown of QA pairs (question-answer pairs) in the FairytaleQA dataset according to the schema of 7 narrative elements.

[Table: core statistics of the FairytaleQA dataset by train/test/val splits]

The table above shows the statistics of the FairytaleQA dataset for each of the train/test/val splits.

Repository Structure

We provide two splits of the dataset: one groups the stories by their origin (./data-by-origin), and the other divides the stories into train/val/test sets for fine-tuning a model (./data-by-train-split). The train/test/val split is random, and the core statistics of each split are given above.

In either split, each story has two primary files:

  • The story content (../section-stories). Each line is a section, determined by human coders, that contains multiple paragraphs.
  • The QA pairs labeled by education experts (../questions). Each line is a QA pair linked to one or more sections (given by cor_section in the question files, which maps to section in the story section files).
    • In the question file of each story in test/val, answer 1 and answer 4 are guaranteed to be present; they are ground-truth answers provided by two different annotators. The remaining columns (answer 2 and answer 3 are alternative options from the first annotator; answer 5 and answer 6 are alternative options from the second annotator) are not used in the benchmark evaluation because not all questions have them and we want to maintain fairness across all QAs.
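
The mapping from cor_section to section can be sketched as follows. This is a minimal illustration rather than the repository's own code: the in-memory CSV snippets stand in for real files, and the column names text, question, answer1, and answer4, as well as the comma-separated form of a multi-section cor_section, are assumptions that should be checked against the actual CSV headers.

```python
import csv
import io

# Tiny in-memory stand-ins for one story's two CSV files (not real data).
# Column names other than `cor_section` and `section` are assumptions.
SECTIONS_CSV = """section,text
1,Once upon a time there lived two brothers.
2,One day the younger brother went into the forest.
"""

QUESTIONS_CSV = """question,cor_section,answer1,answer4
Who lived once upon a time?,1,two brothers,the brothers
Where did the younger brother go?,"1,2",into the forest,the forest
"""


def load_sections(handle):
    """Map each section id (as a string) to its text."""
    return {row["section"].strip(): row["text"] for row in csv.DictReader(handle)}


def link_questions(question_handle, sections):
    """Attach the referenced section text and reference answers to each QA pair."""
    linked = []
    for row in csv.DictReader(question_handle):
        # Assumption: several sections are written as a comma-separated list.
        ids = [s.strip() for s in row["cor_section"].split(",") if s.strip()]
        context = "\n".join(sections[i] for i in ids)
        # Keep only the two answers that are guaranteed to exist in test/val.
        references = [row[c] for c in ("answer1", "answer4") if row.get(c)]
        linked.append(
            {"question": row["question"], "context": context, "references": references}
        )
    return linked


sections = load_sections(io.StringIO(SECTIONS_CSV))
linked = link_questions(io.StringIO(QUESTIONS_CSV), sections)
for item in linked:
    print(item["question"], "->", item["references"])
```

With real files, the io.StringIO objects would be replaced by open(path, newline="", encoding="utf-8") handles for the story's section and question CSVs.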

We further provide a third file for each story (../sentence-stories) that breaks the story down into sentences to support finer-grained, sentence-level analysis. Each line is one sentence of the story. We use the spaCy English pipeline en_core_web_sm as the sentencizer.

For example, story ali-baba-and-forty-thieves has the following files in origin split: ./data-by-origin/questions/first-round/ali-baba-and-forty-thieves-questions.csv, ./data-by-origin/section-stories/first-round/ali-baba-and-forty-thieves-story.csv, and ./data-by-origin/sentence-stories/first-round/ali-baba-and-forty-thieves-story.csv. The stories are organized in the same way by train/test/val split.
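
As a small sketch of this layout, the three file paths for a story can be derived from its name. The helper story_files is hypothetical (it is not part of starter.py), and using train, val, or test as the subdirectory name in the train split is an assumption based on the statement that both splits are organized the same way.

```python
from pathlib import PurePosixPath


def story_files(split_root, group, story_name):
    """Return the three per-story CSV paths, following the layout described above."""
    root = PurePosixPath(split_root)
    return {
        "questions": root / "questions" / group / f"{story_name}-questions.csv",
        "sections": root / "section-stories" / group / f"{story_name}-story.csv",
        "sentences": root / "sentence-stories" / group / f"{story_name}-story.csv",
    }


paths = story_files("./data-by-origin", "first-round", "ali-baba-and-forty-thieves")
for kind, path in paths.items():
    print(kind, path)
```

PurePosixPath keeps the example independent of the operating system; pathlib.Path would be the natural choice when the files are actually opened.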

We also provide a metadata file, story_meta.csv, which lists the stories and their corresponding metadata. This can be useful for traversing or filtering the dataset.

Using this Dataset

From this repo

To start with this dataset, either clone the repo or download it as a zip file.

starter.py contains starter code that can retrieve and aggregate the QA pairs or the stories. Example output from get_question_df is shown below. These functions are useful for people who want to handle the data for general purposes. For NLP preprocessing, go to this Jupyter notebook.

Example output: [image: output of get_question_df]

From the Hugging Face dataset hub

Fast usage for NLP tasks with the datasets library

The dataset is uploaded to the Hugging Face Hub: https://huggingface.co/datasets/WorkInTheDark/FairytaleQA

from datasets import load_dataset
dataset = load_dataset("WorkInTheDark/FairytaleQA")

'''
DatasetDict({
    train: Dataset({
        features: ['story_name', 'story_section', 'question', 'answer1', 'answer2', 'local-or-sum', 'attribute', 'ex-or-im', 'ex-or-im2'],
        num_rows: 8548
    })
    validation: Dataset({
        features: ['story_name', 'story_section', 'question', 'answer1', 'answer2', 'local-or-sum', 'attribute', 'ex-or-im', 'ex-or-im2'],
        num_rows: 1025
    })
    test: Dataset({
        features: ['story_name', 'story_section', 'question', 'answer1', 'answer2', 'local-or-sum', 'attribute', 'ex-or-im', 'ex-or-im2'],
        num_rows: 1007
    })
})
'''

To load a single split (train, validation, or test):

from datasets import load_dataset
dataset = load_dataset("WorkInTheDark/FairytaleQA", split='train')
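
As a rough sketch of how the fields in the feature list above might be used for a QA task, the helper below turns one row into a model input and a list of reference answers. The function name to_qa_example and the "question: ... context: ..." input format are assumptions made for illustration, not something prescribed by the dataset, and the sample row is hand-written rather than taken from the data.

```python
def to_qa_example(record):
    """Turn one dataset row into a (model input, reference answers) pair."""
    source = f"question: {record['question']} context: {record['story_section']}"
    # answer2 may be missing or empty for some rows; keep only non-empty answers.
    references = [record[key] for key in ("answer1", "answer2") if record.get(key)]
    return source, references


# A hand-written row with the same field names as the feature list (not real data).
sample = {
    "story_name": "example-story",
    "story_section": "A fox lived at the edge of the forest.",
    "question": "Who lived at the edge of the forest?",
    "answer1": "a fox",
    "answer2": "",
}

source, references = to_qa_example(sample)
print(source)
print(references)
```

A row loaded from the hub behaves like a dictionary, so the same function could be applied to dataset[0] or passed through dataset.map with a small wrapper.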

Our Question Generation task is available on the Hugging Face dataset hub at https://huggingface.co/datasets/GEM/FairytaleQA and can be loaded with the following code. Note that this version of the data loader is designed specifically for the QG task, where the target column holds the ground-truth question.

from datasets import load_dataset
dataset = load_dataset("GEM/FairytaleQA")

  • We will soon host the original dataset on the Hugging Face dataset hub with another data loader that supports general use cases.

Related Work

In a concurrent work, we use this dataset to build a QA-pair Generation (QAG) System. You can also find some Jupyter Notebooks to train and run models here. This work, "It is AI's Turn to Ask Humans a Question: Question-Answer Pair Generation for Children's Story Books", was accepted to ACL 2022.

We also leveraged the QAG System to build an interactive storytelling system that allows parents to collaborate with an AI system in creating storytelling experiences with interactive question-answering for their children. This work, "StoryBuddy: A Human-AI Collaborative Chatbot for Parent-Child Interactive Storytelling with Flexible Parental Involvement", was accepted to CHI 2022.

A group of researchers aimed to generate educationally meaningful, high-cognitive-demand questions. They used this dataset to train a novel model architecture that takes into account a summary of the salient events and a learnable distribution of question types. Their work, "Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-Centric Summarization", was accepted to ACL 2022, and their code is available here.

Future Work

We are exploring how various social biases in the stories influence SOTA neural models, and we plan to further annotate the dataset for the social biases found in the stories.

Citation

Our dataset paper was accepted to ACL 2022. You may cite it as:

@inproceedings{xu2022fairytaleqa,
    author={Xu, Ying and Wang, Dakuo and Yu, Mo and Ritchie, Daniel and Yao, Bingsheng and Wu, Tongshuang and Zhang, Zheng and Li, Toby Jia-Jun and Bradford, Nora and Sun, Branda and Hoang, Tran Bao and Sang, Yisi and Hou, Yufang and Ma, Xiaojuan and Yang, Diyi and Peng, Nanyun and Yu, Zhou and Warschauer, Mark},
    title = {Fantastic Questions and Where to Find Them: Fairytale{QA} -- An Authentic Dataset for Narrative Comprehension},
    publisher = {Association for Computational Linguistics},
    year = {2022}
}
