Comments (19)
Thank you for your great works!
I have a question.
Can your findings be used in other languages, apart from back translation?
Some of them are applicable to other languages as well:
- You can apply word-embedding based word replacement if you can find embeddings for your language. For example, fastText has word vectors for 157 languages.
- The MixUp method is independent of any language as it works on the representations directly
- The noising techniques like character swap, random swap/insertion/deletion, sentence shuffling should work as well.
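As a rough illustration of why the noising techniques carry over to other languages, here is a minimal sketch that works on whitespace-separated tokens only. The function names and default probabilities are assumptions made for this example, not taken from the article, and languages without whitespace word boundaries would need their own tokenizer first.

```python
import random


def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p=0.1, rng=random):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def char_swap(word, rng=random):
    """Swap one randomly chosen pair of adjacent characters in a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "the quick brown fox".split()
    print(random_swap(sentence, rng=rng))
    print(random_deletion(sentence, p=0.3, rng=rng))
    print(char_swap("augmentation", rng=rng))
```

None of these functions look at vocabulary or grammar, which is the sense in which they are language independent.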
from blog-comments.
Thanks for this amazing article bro! I was just thinking, how can I do augmentation with text
from blog-comments.
Excellent article. Keep going!
from blog-comments.
Awesome work, summarized it.
1. Lexical Substitution:
- Thesaurus-based substitution: words are replaced by synonyms
- Word Embeddings Substitution: replace a word with a neighbouring word in embedding space
- Masked Language Model: use a model to predict the masked word
- TF-IDF based word replacement: words with low TF-IDF scores can be replaced without affecting the ground truth
2. Back Translation: English to another language, then back to English
3. Text Surface Transformation: transforming through contraction and expansion
4. Random Noise Injection:
- Spelling error injection
- QWERTY keyboard error injection
- Unigram Noising: replace words based on the unigram frequency distribution
- Blank Noising: replace a random word with a placeholder
- Sentence Shuffling: shuffling of sentences
5. Instance Cross Augmentation: tweets with the same polarity have their halves swapped
6. Syntax-tree Manipulation: active voice to passive voice
7. MixUp for Text:
- wordMixup: word embeddings are combined and passed through the classifier
- sentMixup: word embeddings are passed through an encoder, then combined, and classification is performed
8. Generative Methods: generate additional training data
- Conditional Pre-trained Language Models: fine-tuning of a pre-trained language model
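To make item 7 a little more concrete, here is a minimal sketch of the interpolation step that MixUp relies on. It assumes each example is already a fixed-length vector (for instance a sentence representation from an encoder, as in sentMixup) with a one-hot label. The function name `mixup` and the default `alpha=0.4` are assumptions for this example rather than values from the article.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Linearly interpolate two examples and their one-hot labels.

    The mixing weight lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Two toy 3-dimensional "sentence vectors" with opposite labels.
    x, y, lam = mixup([0.2, 0.9, 0.1], [1.0, 0.0],
                      [0.7, 0.1, 0.5], [0.0, 1.0], rng=rng)
    print(lam, x, y)
```

The mixed label is a soft target, so the classifier would be trained with a loss that accepts probability vectors rather than hard class indices.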
from blog-comments.
Wonderful article pretty informative
from blog-comments.
Thanks for the wonderful review.
Please note that the Generative Methods technique you presented (8) was first proposed in the paper:
Not Enough Data? Deep Learning to the Rescue! (https://arxiv.org/abs/1911.03118)
I think I saw it in the AAAI20 conference.
from blog-comments.
@NLP-cr Thank you for pointing that out. I've reviewed the link you shared and have corrected the relevant section.
from blog-comments.
Nice list! One more to add. I've seen text adversarial examples being used as data augmentation with some success (e.g. https://www.aclweb.org/anthology/N18-1089/), although this works best for small datasets, and may reduce accuracy for larger ones (https://arxiv.org/abs/1805.12152)
from blog-comments.
This was a fantastic read on a topic I have not seen great literature review on before. Thanks a lot for taking the time to be as comprehensive as this seems to be!
from blog-comments.
Very clear tutorial!! Thanks for your great job!
from blog-comments.
Great review. A new paper for Generative Methods, https://arxiv.org/abs/2004.13240
from blog-comments.
@amitness
I am amazed by your rich knowledge. Your help will be very helpful to my project. Thank you very much!
from blog-comments.
Thanks a lot, the article was very informative
from blog-comments.
It seems that these DA methods are only effective in a low-data regime. I tried these methods on text classification where I sampled only 32 instances from each class, and they worked. However, if I enlarge the training set, for example to 1000 samples per class, DA does not help at all in terms of accuracy.
Is there any study or paper on this problem?
from blog-comments.
@yananchen1989 Yes, your observation is correct.
A similar result was also shown in the Easy Data Augmentation paper. See the section "4.2 Training Set Sizing". The paper also has other ablation studies.
from blog-comments.
@amitness,
There is a 2017 paper that uses a VAE to generate synthetic examples that significantly improve the performance of clickbait detectors. It was published before recent efforts to use generative models such as GPT-2. https://ieeexplore.ieee.org/abstract/document/9073621.
from blog-comments.
Thank you so much. It's so helpful to overview this area.
from blog-comments.
Thank you so much.
from blog-comments.