If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.
We group the papers by text classification, translation, summarization, question-answering, sequence tagging, parsing, grammatical-error-correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, and adversarial examples.
This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:
@article{feng2021survey,
title={A Survey of Data Augmentation Approaches for NLP},
author={Feng, Steven Y and Gangal, Varun and Wei, Jason and Chandar, Sarath and Vosoughi, Soroush and Mitamura, Teruko and Hovy, Eduard},
journal={Findings of ACL},
year={2021}
}
Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy
Note: This repository is a work in progress. More papers from our survey paper will be added over the next month or so.
Direct inquiries to [email protected], or open an issue here.
Paper | Datasets |
---|---|
Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) | AG’s News, DBpedia, Yelp, Yahoo Answers, Amazon |
That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) | |
Robust Training under Linguistic Adversity (EACL '17) code | Movie review, customer review, SUBJ, SST |
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code | SST, SUBJ, MPQA, RT, TREC |
Variational Pretraining for Semi-supervised Text Classification (ACL '19) code | IMDB, AG News, Yahoo, hatespeech |
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code | SST, CR, SUBJ, TREC, PC |
Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) | TREC, SST, Subj, MR |
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code | AG News, DBpedia, Yahoo, IMDb |
Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code | Yelp, IMDb, Amazon, DBpedia |
Not Enough Data? Deep Learning to the Rescue! (AAAI '20) | ATIS, TREC, WVA |
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT'14 |
Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) | ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony |
Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) | SST2, TREC |
Text Augmentation in a Multi-Task View (EACL '21) | SST2, TREC, SUBJ |
Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code | HUFF, COV-Q, AMZN, FEWREL |
Paper | Datasets |
---|---|
GenAug: Data Augmentation for Finetuning Text Generators (DeeLIO @ EMNLP '20) code | TO-DO |
Paper | Datasets |
---|---|
Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) | WMT '15 en-de, IWSLT '15 en-tr |
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code | IWSLT '14 de/es/he-en, WMT '14 en-de |
Paper | Datasets |
---|---|
An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (EMNLP '19 Workshop) | MRQA |
Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering (arxiv '19) | SQuAD, Trivia-QA, CMRC, DRCD |
XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering (arxiv '19) | XNLI, SQuAD |
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering (arxiv '20) | MLQA, XQuAD, SQuAD-it, PIAF |
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (ACL '20) code | WIQA, QuaRel, HotpotQA |
Paper | Datasets |
---|---|
Transforming Wikipedia into Augmented Data for Query-Focused Summarization (arxiv '19) | DUC |
Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge, EMNLP '19) | Swisstext, commoncrawl |
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation (NAACL '21) | CNN-DailyMail |
Paper | Datasets |
---|---|
Data Augmentation via Dependency Tree Morphing for Low-Resource Languages (EMNLP '18) code | universal dependencies project |
TODO: https://www.aclweb.org/anthology/2020.emnlp-main.107/
Paper | Datasets |
---|---|
Using Wikipedia Edits in Low Resource Grammatical Error Correction. (WNUT @ EMNLP '18) | Falko-MERLIN GEC Corpus |
Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arxiv '19) | CoNLL-2014, JFLEG |
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data. (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task) |
A Neural Grammatical Error Correction System Built on Better Pre-training and Sequential Transfer Learning. (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task), Gutenberg, Tatoeba, WikiText-103 (Pretraining) |
Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation (COLING '20) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task) |
Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction. (NAACL '18) | Lang-8, CoNLL-2014, CoNLL-2013, JFLEG |
Corpora Generation for Grammatical Error Correction (NAACL '19) | CoNLL-2014, JFLEG, Lang-8 |
TO-DO
TO-DO
TO-DO
TO-DO
Paper | Datasets |
---|---|
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (NAACL '18) | SST, SICK |
TO-DO: add TextAttack | |
Paper | Datasets |
---|---|
Good-Enough Compositional Data Augmentation (ACL '20) code | SCAN |
Sequence-Level Mixed Sample Data Augmentation (EMNLP '20) code | SCAN |