
oos-eval's Introduction

Clinc

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

Repository that accompanies the EMNLP-IJCNLP 2019 paper "An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction".

FAQs

1. What are the relevant files?

See data/data_full.json for the "full" dataset. This is the dataset used in Table 1 (the "Full" columns). The file contains 150 "in-scope" intent classes, each with 100 training, 20 validation, and 30 test samples. There are also 100 out-of-scope training samples, 100 out-of-scope validation samples, and 1,000 out-of-scope test samples.
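For reference, here is a minimal sketch of how the file can be loaded and inspected in Python. It assumes the JSON keys are "train", "val", "test", "oos_train", "oos_val", and "oos_test", each holding a list of [query, intent] pairs; please verify these against your copy of the file.

    import json
    from collections import Counter

    # Load the "full" dataset; the keys and [query, intent] structure are
    # assumed as described above -- check them against data/data_full.json.
    with open("data/data_full.json") as f:
        data = json.load(f)

    for split, samples in data.items():
        intents = Counter(intent for _, intent in samples)
        print(f"{split}: {len(samples)} samples, {len(intents)} intent classes")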

2. What is the name of the dataset?

The dataset was not given a name in the original paper, but others have called it CLINC150.

3. What is this dataset for?

This dataset is for evaluating the performance of intent classification systems in the presence of "out-of-scope" queries. By "out-of-scope", we mean queries that do not fall into any of the system-supported intent classes. Most datasets include only data that is "in-scope". Our dataset includes both in-scope and out-of-scope data. "Out-of-scope" is also referred to by other names, including "out-of-domain" and "out-of-distribution".

4. What language is the dataset in?

All queries are in English.

5. How does your dataset/evaluation handle multi-intent queries?

All samples/queries in our dataset are single-intent samples. We consider the problem of multi-intent classification to be future work.

6. How did you gather the dataset?

We used crowdsourcing to generate the dataset. We asked crowd workers either to paraphrase "seed" phrases or to respond to scenarios (e.g. "pretend you need to book a flight; what would you say?"). Crowdsourcing was used to generate both the in-scope and the out-of-scope data.

Citation

If you find our dataset useful, please be sure to cite:

@inproceedings{larson-etal-2019-evaluation,
    title = "An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction",
    author = "Larson, Stefan  and
      Mahendran, Anish  and
      Peper, Joseph J.  and
      Clarke, Christopher  and
      Lee, Andrew  and
      Hill, Parker  and
      Kummerfeld, Jonathan K.  and
      Leach, Kevin  and
      Laurenzano, Michael A.  and
      Tang, Lingjia  and
      Mars, Jason",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    year = "2019",
    url = "https://www.aclweb.org/anthology/D19-1131"
}

oos-eval's People

Contributors

epeters3, gxlarson


oos-eval's Issues

Question about threshold in paper

Hi,
In the paper "An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction", you compare three approaches: oos-train, oos-threshold, and oos-binary.
However, the paper does not clearly state which threshold value was used (0.5, 0.6, ...?).

Could you please provide that information?
Regards,

Confusion about threshold setting

In our evaluation, the out-of-scope threshold was chosen to be the value which yielded the highest validation score across all intents, treating out-of-scope as its own intent.

I am a little confused by this sentence. Does it mean that the threshold is the highest score that out-of-scope samples attain on the known intents in the validation set? If so, wouldn't out-of-scope recall on the validation set be 1 in every epoch, and how would we then do early stopping and select hyperparameters?
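As an illustration, here is a minimal sketch of one way such a threshold sweep could be implemented. This is not the authors' code; it assumes the classifier outputs a softmax probability per in-scope intent and that any query whose top probability falls below the threshold is predicted as out-of-scope.

    import numpy as np

    def predict_with_threshold(probs, threshold, oos_label="oos"):
        # Predict the argmax in-scope intent, or oos when the top softmax
        # probability falls below the threshold.
        preds = probs.argmax(axis=1).astype(object)
        preds[probs.max(axis=1) < threshold] = oos_label
        return preds

    def best_threshold(val_probs, val_labels, candidates=np.linspace(0.05, 0.95, 19)):
        # Pick the threshold with the highest validation accuracy, treating
        # out-of-scope as its own class, as the quoted sentence describes.
        def accuracy(t):
            preds = predict_with_threshold(val_probs, t)
            return np.mean([p == y for p, y in zip(preds, val_labels)])
        return max(candidates, key=accuracy)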

Confusion about Table 3 in the paper

I'm confused about Table 3 in the paper.

What is the experimental process?

I guess the binary classifier (oos detector) is first trained on "binary_undersample.json" (or "binary_wiki_aug.json" in the wiki-aug experiment) to detect whether utterances are "in" or "oos", and then a downstream multi-class classifier (e.g. 150 classes for the in-scope data) is built to handle the "in" samples passed along by the upstream oos detector.

In-Scope Accuracy was evaluated on "test" in "data_oos_plus.json", and Out-of-Scope Recall was evaluated on "oos_test" in "data_oos_plus.json".
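For concreteness, here is a short schematic of the two-stage pipeline described above; the model objects and their .predict interface are placeholders for illustration, not the authors' code.

    def two_stage_predict(query, oos_detector, intent_classifier, oos_label="oos"):
        # Stage 1: binary in/oos detector (e.g. trained on
        # binary_undersample.json or binary_wiki_aug.json).
        if oos_detector.predict(query) == oos_label:
            return oos_label
        # Stage 2: 150-way in-scope intent classifier, applied only to
        # queries the detector accepted as in-scope.
        return intent_classifier.predict(query)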

Hyperparameter details for fine tuning bert-large in the oos-train setting

Hi,

I am using Hugging Face Transformers to fine-tune BERT-large on the CLINC dataset. I follow the hyperparameters in hyperparams.csv, but there is a ~3-point difference in in-scope accuracy for the oos-train setting (93.49 vs. 96.9 for the Full version of the dataset; similarly for the OOS+ setting). I am wondering if this is due to some HF defaults; for example, HF defaults to 1.0 for gradient clipping, and I am not sure what you used. Would it be possible to clarify your fine-tuning process a bit more? It would be very helpful.
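For context, gradient clipping in the Hugging Face Trainer is controlled by the max_grad_norm argument of TrainingArguments (which defaults to 1.0). A minimal sketch of setting it explicitly follows; the other values are placeholders and should be replaced by the settings in hyperparams.csv.

    from transformers import TrainingArguments

    # Placeholder values; substitute the settings from hyperparams.csv.
    args = TrainingArguments(
        output_dir="bert-large-oos-train",
        learning_rate=2e-5,                # placeholder
        num_train_epochs=5,                # placeholder
        per_device_train_batch_size=32,    # placeholder
        max_grad_norm=1.0,                 # gradient clipping (the HF default)
    )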

Thanks,
Gaurav.

Clarification re results in Table 2

Hi,

Do the numbers in Table 2 for oos-train correspond to the test set? I am trying to replicate the BERT results (using the transformers and datasets libraries) with the hyperparameters provided in this repo, but there is a nearly 10-point difference in test-set performance; my validation-set performance, however, is quite close to the numbers in the paper. It would be great if you could clarify this.

Thanks,
Gaurav.

Re-partitioning Data: Which worker wrote which query?

Hi! I want to re-partition the dataset to create 5 different train/validation/test splits for my analyses. In the paper, you mention that all queries from a given crowd worker were placed in a single split. Would it be possible to share information about which queries were generated by the same worker? I'd like to minimize any in-scope biases in my splits as well.
