A Visual Survey of Data Augmentation in NLP An extensive overview

utterances-bot commented on May 30, 2024

2020/05/data-augmentation-for-nlp/

from blog-comments.

Comments (19)

amitness commented on May 30, 2024

Thank you for your great work!
I have a question.
Can your findings be used in other languages, excluding 'Back translation'?

Some of them are applicable to other languages as well:

You can apply word-embedding-based word replacement if you can find embeddings for your language. For example, fastText has word vectors for 157 languages.
The MixUp method is language-independent, as it works on the representations directly.
Noising techniques such as character swap, random swap/insertion/deletion, and sentence shuffling should work as well.
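As a rough illustration of why the noising techniques carry over to other languages, here is a minimal sketch of random swap, random deletion, and character swap. The function names and the default probabilities are hypothetical choices for this example, not taken from the article, and the whitespace tokenization in the usage lines is an assumption: languages without whitespace word boundaries would need their own tokenizer first.

```python
import random

def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=random):
    """Drop each token with probability p, always keeping at least one token."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def char_swap(word, rng=random):
    """Swap one randomly chosen pair of adjacent characters in a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

# Usage on a whitespace-tokenized sentence (tokenization is an assumption).
rng = random.Random(0)
tokens = "data augmentation can work on tokenized text".split()
print(random_swap(tokens, n_swaps=2, rng=rng))
print(random_deletion(tokens, p=0.2, rng=rng))
print([char_swap(t, rng=rng) for t in tokens])
```

None of these functions look at what the tokens mean, which is why they need no language-specific resources beyond a tokenizer.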

kurianbenoy commented on May 30, 2024

Thanks for this amazing article bro! I was just thinking, how can I do augmentation with text

KrithikaJayaraman commented on May 30, 2024

Excellent article. Keep going!

amitaug1984 commented on May 30, 2024

Awesome work, summarized it.

1. Lexical Substitution:

Thesaurus-based substitution: words are replaced by synonyms
Word Embeddings Substitution: replace a word with a neighbouring word in the embedding space
Masked Language Model: a model predicts the masked word
TF-IDF-based word replacement: words with low TF-IDF scores can be replaced without affecting the ground-truth label

2. Back Translation: English to another language, then back to English

3. Text Surface Transformation: transforming text through contraction and expansion

4. Random Noise Injection:

Spelling error injection
QWERTY keyboard error injection
Unigram Noising: replace words based on the unigram frequency distribution
Blank Noising: replace a random word with a placeholder
Sentence Shuffling: shuffle the order of sentences

5. Instance Cross Augmentation: tweets with the same polarity have their halves swapped

6. Syntax-tree Manipulation: active voice to passive voice

7. MixUp for Text:

wordMixup: word embeddings are combined and passed through the classifier
sentMixup: word embeddings are passed through an encoder, then the sentence embeddings are combined and classified

8. Generative Methods: generate additional training data

Conditional Pre-trained Language Models: fine-tuning of a pre-trained language model
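To make item 7 more concrete, here is a minimal sketch of the interpolation step that both MixUp variants share, written in plain Python on toy vectors. The function names `interpolate` and `sent_mixup`, the toy vectors, and the `alpha=1.0` default are assumptions made for this illustration. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the usual MixUp formulation. A real setup would apply this to tensors inside the model rather than to Python lists.

```python
import random

def interpolate(a, b, lam):
    """Element-wise convex combination lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def sent_mixup(vec_a, label_a, vec_b, label_b, alpha=1.0, rng=random):
    """Mix two sentence vectors and their one-hot labels with the same weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed_vec = interpolate(vec_a, vec_b, lam)
    mixed_label = interpolate(label_a, label_b, lam)
    return mixed_vec, mixed_label, lam

# Toy sentence embeddings and one-hot labels (hypothetical values).
vec_a, label_a = [1.0, 0.0, 2.0], [1.0, 0.0]
vec_b, label_b = [0.0, 4.0, 2.0], [0.0, 1.0]

mixed_vec, mixed_label, lam = sent_mixup(
    vec_a, label_a, vec_b, label_b, alpha=1.0, rng=random.Random(0)
)
print(lam, mixed_vec, mixed_label)
```

For wordMixup, the same `interpolate` call would be applied position by position to two padded sequences of word embeddings before the encoder, instead of once to the encoded sentence vectors.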

sids07 commented on May 30, 2024

Wonderful article pretty informative

NLP-cr commented on May 30, 2024

Thanks for the wonderful review.
Please note that the Generative Methods technique you presented (8) was first proposed in the paper:
Not Enough Data? Deep Learning to the Rescue! (https://arxiv.org/abs/1911.03118)
I think I saw it at the AAAI-20 conference.

amitness commented on May 30, 2024

@NLP-cr Thank you for pointing that out. I've reviewed the link you shared and have corrected the relevant section.

puzzler10 commented on May 30, 2024

Nice list! One more to add. I've seen text adversarial examples being used as data augmentation with some success (e.g. https://www.aclweb.org/anthology/N18-1089/), although this works best for small datasets, and may reduce accuracy for larger ones (https://arxiv.org/abs/1805.12152)

bpw1621 commented on May 30, 2024

This was a fantastic read on a topic I have not seen a good literature review of before. Thanks a lot for taking the time to be as comprehensive as this seems to be!

ticiana commented on May 30, 2024

Very clear tutorial!! Thanks for your great job!

sbmaruf commented on May 30, 2024

Great review. A new paper for Generative Methods, https://arxiv.org/abs/2004.13240

wonyeongdeok commented on May 30, 2024

Thank you for your great work!
I have a question.
Can your findings be used in other languages, excluding 'Back translation'?

wonyeongdeok commented on May 30, 2024

@amitness
I am amazed by your rich knowledge. Your answer will be very helpful for my project. Thank you very much!

aswin-giridhar commented on May 30, 2024

Thanks a lot, the article was very informative

yananchen1989 commented on May 30, 2024

It seems that these DA methods are only effective in a low-data regime. I tried these methods on text classification where I sampled only 32 instances from each class, and they worked. However, when I enlarged the training set to, for example, 1000 samples per class, DA did not improve accuracy at all.
Is there any study or paper on this problem?

amitness commented on May 30, 2024

@yananchen1989 Yes, your observation is correct.

A similar result was also shown in the Easy Data Augmentation paper. See the section "4.2 Training Set Sizing". The paper also has other ablation studies.

lethaiq commented on May 30, 2024

@amitness,
In 2017 there was a paper that used a VAE to generate synthetic examples that significantly improved the performance of clickbait detectors. It was published before the recent efforts to use generative models such as GPT-2: https://ieeexplore.ieee.org/abstract/document/9073621

Eunhui-Kim commented on May 30, 2024

Thank you so much. This is a very helpful overview of the area.

FeiyanLiu commented on May 30, 2024

Thank you so much.
