DSTC8 Meta-Learning User Response Models Task

Competition Info

  • [2019-11-15] Full human evaluation data and script to reproduce results are available here. See this section for details.
  • [2019-11-05] Human evaluation results available, see the official DSTC 8 spreadsheet. Full rankings per testset, dialogue, and metric are available here.
  • [2019-10-07] Submission deadline has been extended to Sunday, October 13, 2019 at 11:59 pm Pacific Daylight Time (PDT).
  • [2019-09-24] Submission format details posted on the evaluation page.
  • [2019-09-23] Evaluation data is posted. See the evaluation and data pages for details.
  • [2019-07-15] Codalab competition is back online. Due to a major outage in the Codalab platform, participants who registered before July 12, 2019 must re-register.
  • [2019-06-17] Task description and data are released.
  • [2019-06-10] Registration is Open! Registrants will be approved starting June 17, and will have access to the data then.

Please sign up for the competition on our CodaLab page, which also contains further details on the timeline, organizers, and terms and conditions.

Task Description

In goal-oriented dialogue, data is scarce. This is a problem for dialogue system designers, who cannot rely on large pre-trained models. The aim of our challenge is to develop natural language generation (NLG) models which can be quickly adapted to a new domain given a few goal-oriented dialogues from that domain.

The suggested approach roughly follows the idea of meta-learning (e.g. MAML: Finn, Abbeel & Levine, 2017; Antoniou et al., 2018; Ravi & Larochelle, 2017): during the training phase, train a model that can be adapted quickly to a new domain.


During the evaluation phase, the model should predict the final user turn of an incomplete dialogue, given some (hundreds of) examples from the same domain.


Resources

  • A large reddit-based dialogue dataset is available. Due to licensing restrictions we cannot provide a download, but we provide code to generate the dataset. This dataset has 1000 subreddits as "domains".

  • A smaller goal-oriented dataset split into domains and tasks, named MetaLWoz. The dataset contains 37,884 crowd-sourced dialogues divided into 47 domains. Each domain is further divided into tasks, for a total of 227 tasks. No annotation is provided aside from the domain labels, task labels, and task descriptions.
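
For orientation, here is a minimal sketch of inspecting MetaLWoz dialogues directly from the zip file. It assumes the zip follows the dialogues/<DOMAIN>.txt layout referenced later in this README, with one JSON dialogue per line; the field names used below (domain, task_id, turns) are assumptions based on the dataset description and should be checked against your copy of the data.

import json
import zipfile

# Peek at the first dialogue of the first per-domain file in the MetaLWoz zip.
with zipfile.ZipFile("metalwoz-v1.zip") as zf:
    domain_files = sorted(n for n in zf.namelist()
                          if n.startswith("dialogues/") and n.endswith(".txt"))
    with zf.open(domain_files[0]) as f:
        for line in f:
            dlg = json.loads(line)
            # Assumed fields: domain label, task label, and the list of turns.
            print(dlg["domain"], dlg["task_id"], len(dlg["turns"]))
            break  # just peek at the first dialogue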


Evaluation

Evaluation for this task uses both automatic and human metrics.

During development, participants can track their progress using word-overlap metrics, e.g. using nlg-eval. Depending on the parameters of scripts/make_test_set, you can evaluate within-task or cross-task generalization within a MetaLWoz domain.
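
As a rough illustration, word-overlap metrics can also be computed programmatically with nlg-eval along the following lines. This is only a sketch following the nlg-eval README; verify the API against the version you install.

from nlgeval import NLGEval

# Load only the word-overlap metrics (BLEU, METEOR, ROUGE_L, CIDEr);
# skip the embedding-based ones to keep things light-weight.
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

references = [["i would like an alarm for 7 am please"]]  # one inner list per reference set
hypotheses = ["please set my alarm for 7 am"]             # one hypothesis per predicted turn
print(nlgeval.compute_metrics(references, hypotheses))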

Towards the end of the evaluation phase, we will provide a zip file with dialogues in a novel domain and a file specifying the dialogues and turns that participants should predict. The file format is the same as the one produced by scripts/make_test_set: each line is a valid JSON object with the following schema:

{"support_dlgs": ["SUPPORT_DLG_ID_1", "SUPPORT_DLG_ID_2", ...],
 "target_dlg": "TARGET_DLG_ID",
 "predict_turn": "ZERO-BASED-TURN-INDEX"
}

Dialogue IDs uniquely identify a dialogue in the provided MetaLWoz zip file.
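
Reading the specification is straightforward; here is a minimal sketch (the file name is the one produced in the Running section below):

import json

# Each line of the spec file describes one meta-batch.
with open("test-spec-cross-task.txt") as f:
    for line in f:
        spec = json.loads(line)
        support_ids = spec["support_dlgs"]   # dialogue IDs to adapt on
        target_id = spec["target_dlg"]       # dialogue whose turn must be predicted
        turn_idx = spec["predict_turn"]      # zero-based turn index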

To generate predictions, condition your (pre-trained) model on the support dialogues, and use the target dialogue history as context to predict the indicated user turn.

Make sure that (1) your model has never seen the test domain before predicting, and (2) you reset your model before adapting it to the support set and predicting each dialogue.
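
The sketch below illustrates this protocol. It assumes dialogues is a dict from dialogue ID to a dialogue with a turns list (as in the MetaLWoz sketch above); make_fresh_model, adapt, and predict are placeholders for your own code, not part of the baseline. The point is the reset-adapt-predict order for every line of the test specification.

import json

def generate_predictions(spec_path, dialogues, make_fresh_model):
    """dialogues: dict mapping dialogue ID to a dialogue dict with a 'turns' list.
    make_fresh_model: callable returning a model that has never seen the test domain."""
    predictions = []
    with open(spec_path) as f:
        for line in f:
            spec = json.loads(line)
            model = make_fresh_model()                         # (2) reset for every meta-batch
            support = [dialogues[i] for i in spec["support_dlgs"]]
            model.adapt(support)                               # adapt on the support set only
            target = dialogues[spec["target_dlg"]]
            context = target["turns"][:spec["predict_turn"]]   # history up to the predicted turn
            predictions.append({
                "dlg_id": spec["target_dlg"],
                "predict_turn": spec["predict_turn"],
                "response": model.predict(context),            # predicted user turn
            })
    return predictions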

On the responses submitted by the participants, we will

  1. Run a fixed NLU module to determine whether response intents and slots are in line with ground truth.
  2. Ask crowd workers to evaluate informativeness and appropriateness of the responses.

Submissions should have one response per line, in JSON format, with this schema:

{"dlg_id": "DIALOGUE ID FROM ZIP FILE",
 "predict_turn": "ZERO-BASED PREDICT TURN INDEX",
 "response": "PREDICTED RESPONSE"}

where dlg_id and predict_turn correspond to the target_dlg id and predict_turn of the test specification file above, respectively.
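
A minimal sketch of writing such a file, assuming a list of prediction dicts with these three keys (e.g. as produced by the protocol sketch above):

import json

def write_submission(predictions, path="submission.jsonl"):
    # One JSON object per line, matching the submission schema above.
    with open(path, "w") as f:
        for p in predictions:
            f.write(json.dumps({"dlg_id": p["dlg_id"],
                                "predict_turn": p["predict_turn"],
                                "response": p["response"]}) + "\n")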

Baseline Implementation

A simple retrieval baseline implementation is provided on GitHub. The retrieval baseline requires no training; it uses pre-trained embeddings (BERT, and a combination of SentencePiece and FastText). To complete the test dialogue, the retrieval model returns the response associated with the most similar dialogue context in the support set, using cosine distance between the embeddings.
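
Conceptually, the retrieval step amounts to a nearest-neighbour lookup over support-set contexts. A sketch with numpy follows; the context vectors stand in for the BERT or SentencePiece+FastText embeddings used by the baseline, and the function name is illustrative only.

import numpy as np

def retrieve_response(target_context_vec, support_context_vecs, support_responses):
    """Return the support response whose context embedding is closest
    (by cosine similarity) to the target dialogue's context embedding."""
    contexts = np.stack(support_context_vecs)                  # (n_support, dim)
    sims = contexts @ target_context_vec / (
        np.linalg.norm(contexts, axis=1) * np.linalg.norm(target_context_vec) + 1e-8)
    return support_responses[int(np.argmax(sims))]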

The baseline implementation also provides meta-batch iterators that can be used to train more complex models. Each meta-batch is split into domains; each domain contains a support set and a target set.

Setup

  1. install conda / anaconda, e.g. via miniconda
  2. conda create -n dstc8-baseline python=3.7 cython
  3. conda activate dstc8-baseline
  4. conda install -c pytorch pytorch
  5. pip install -e .

Running

# Create sentencepiece and fasttext models, and normalized dialogues
# for both reddit and MetaLWoz datasets
$ ./scripts/preprocess metalwoz metalwoz-v1.zip pp-metalwoz-dir
$ ./scripts/preprocess reddit dstc8-reddit-corpus.zip pp-reddit-dir

Notes:

  1. It's recommended to have 25 GB of space free for the large FastText models and the unzipped dataset dump used to train SentencePiece and FastText.
  2. Reddit takes the longest to preprocess; allow 8-24 hours for this script to run end-to-end on it.
  3. SentencePiece consumes a lot of memory, so the maximum number of lines used to train it is limited to 5 million by default. If this is still too much for your system, you can reduce maxlines in the script.
  4. The normalizers can be found in mldc/preprocessing/normalization.py.

Now, create a test set specification file (we will provide an official one later). This file references dialogues from the zip file by their ID. Each line specifies a single-domain meta batch, with a support set of size 128 and a target set of size 1. The (zero-based) index of the turn to predict is also indicated. See the evaluation section for more details.

./scripts/make_test_set ./pp-metalwoz-dir/metalwoz-v1-normed.zip test-spec-cross-task.txt --cross-task

Now train the model. The retrieval model does not actually do any training, so this step is fast. The infrastructure for training is present in the code, however, so you can easily add your own methods. Take care to exclude domains for evaluation (e.g. early stopping) and testing.

Use embedding models trained on reddit to avoid train/test overlap. Use --input-embed to change the embedding type (BERT is default).

./scripts/retrieval-baseline train ./pp-metalwoz-dir/metalwoz-v1-normed.zip \
  --preproc-dir ./pp-reddit-dir \
  --output-dir  ./metalwoz-retrieval-model \
  --eval-domain dialogues/ALARM_SET.txt --test-domain dialogues/EVENT_RESERVE.txt

Now we run the evaluation on the excluded test domain for the meta-batches specified in test-spec-cross-task.txt. This command

  1. Prints the dialogue context, ground truth and predictions to standard output, and
  2. Generates files for submitting results and automatic metric evaluation with nlg-eval.
./scripts/retrieval-baseline predict ./metalwoz-retrieval-model \
    ./pp-metalwoz-dir/metalwoz-v1-normed.zip \
    --test-spec test-spec-cross-task.txt --nlg-eval-out-dir ./out

# calculate metrics (you need to install https://github.com/Maluuba/nlg-eval for that)
./scripts/evaluate ./pp-metalwoz-dir/metalwoz-v1-normed.zip \
    test-spec-cross-task.txt ./out EVENT_RESERVE [other test domains...]

Human Evaluation Results

Full judgements collected through Amazon Mechanical Turk and a script to produce the rankings are available in competition/human_evaluation.

Data in the competition/human_evaluation/data folder is licensed separately under the Microsoft Open Use Of Data Agreement.

To reproduce the competition results, run the script as follows:

$ cd competition/human_evaluation
$ ./human-evaluation-results.py data/judgements_data.csv

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Issues

Questions for evaluation

  1. Can we use the prompt and the role sentences for our model? Are these sentences also included in the real test data?

  2. Are the support dialogs given in the actual evaluation data from the same task as the test dialog (within-task setting), or just from the same domain (cross-task setting)?

  3. Is the number of support dialogs per test case also 128 in the real test data?

  4. During the development phase, can we assume that testing our model on the MultiWOZ dataset, by measuring slot/intent accuracy of the predicted user response with the Task 1 baseline NLU, is similar to the actual evaluation environment?

  5. Could you provide a more detailed description of the automatic measures (e.g. F1 score or accuracy of slot/intent detection) that will be used in the actual test environment?

Thank you!

Why do we choose PyText

Hi,

I am new to PyTorch, and when I tried to implement a dialogue generation model using this PyText structure, I found it hard.

Facebook has both PyText and fairseq built on PyTorch, with fairseq focusing on machine translation and summarization tasks and PyText focusing on other types of tasks. So there is a lot of demo code for text generation models written in PyTorch or fairseq, but almost nothing in PyText. I tried to port the code from PyTorch to PyText and found it is not easy for the decoding part.

Do you have any suggestions for implementing a dialogue generation model? Should I only use your data_handler and write the model from scratch in PyTorch?

Thank you!

About the baseline result

I tried the scripts in the README, and this is my result:

Bleu_1     Bleu_2     Bleu_3     Bleu_4     CIDEr      METEOR     ROUGE_L
0.0477763  0.0160963  0.0063228  5.14E-07   0.0407825  0.0452248  0.0432814

Is that correct? Or did I make some mistake in the procedure?

A small bug in the starting code

Dear organizers,

I found a small bug, or in other words a typo, in the provided starter code. In mldc/task/retrieval.py, an attribute error (AttributeError: 'NoneType' object has no attribute 'close') is caused by the code in lines 110 and 111:
if self.model_needs_meta_training:
  train_iter = self.data_handler.get_train_iter(
    stack.enter_context(closing(train_iter)))

The right code should be:
if self.model_needs_meta_training:
  train_iter = self.data_handler.get_train_iter()
  stack.enter_context(closing(train_iter))
