hindi2vec

State-of-the-Art Language Modeling and Text Classification for Hindi

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared to 40.68 for English (lower is better).
- To the best of our knowledge, as of March 2, 2018
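
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to a language model's per-token log-probabilities. It assumes the usual convention that perplexity is the exponential of the average negative log-likelihood in nats; under that assumption, a perplexity of 46.81 corresponds to an average cross-entropy of roughly 3.85 nats per token. The function name `perplexity` is illustrative and is not taken from this repository.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability a model gave each token.

    Assumes natural logarithms; this is a sketch, not the project's own code.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood (cross-entropy in nats per token).
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy check: a model that gives every token probability 1/4
# is exactly as uncertain as a uniform choice among 4 tokens.
uniform_log_probs = [math.log(0.25)] * 10
print(round(perplexity(uniform_log_probs), 6))
```

Lower perplexity means the model spreads its probability over fewer plausible next tokens, which is why lower is better.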

Downloads

EXCLUSIVE: BBC Hindi data of 4335 documents for text classification and text summarization. Release Notes
Raw Data: Hindi Wikipedia with about 21k unique tokens for minfreq = 50
- Wikipedia Processed Data - please use this to train your model
Pretrained Language Models that you can use for transfer learning in your own classification tasks
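
The "minfreq = 50" note on the raw Wikipedia data refers to a minimum-frequency cutoff on the vocabulary. The sketch below shows one common way such a cutoff can be applied; it is an illustration under stated assumptions, not the preprocessing code used for this release. The helper names (`build_vocab`, `numericalize`) and the special tokens `<unk>` and `<pad>` are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_freq=50, specials=("<unk>", "<pad>")):
    """Keep only tokens that occur at least `min_freq` times.

    `specials` are hypothetical placeholder tokens, listed first.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    itos = list(specials) + kept                    # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}   # token -> index
    return itos, stoi


def numericalize(doc, stoi, unk="<unk>"):
    """Map tokens to ids; anything below the cutoff becomes the unknown id."""
    return [stoi.get(tok, stoi[unk]) for tok in doc]


# Tiny example with a cutoff of 2 instead of 50.
docs = [["यह", "एक", "वाक्य", "है"], ["यह", "दूसरा", "वाक्य", "है"]]
itos, stoi = build_vocab(docs, min_freq=2)
ids = numericalize(["यह", "नया", "वाक्य"], stoi)
print(len(itos), ids)
```

Raising the cutoff shrinks the vocabulary and maps more rare words to the unknown token, which trades coverage for a smaller embedding matrix.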

TODO

Language modeling based on the Wikipedia dump
Release Language Models: Hindi Language Model
Create Text classification Datasets
Benchmark text classification with FastText
Fine-tuning model for text classification
Add a leaderboard and allow submission, similar to SQuAD

Idea Dump

Change the custom head to perform transliteration instead of classification, from Hindi script (Devanagari) to English script (Roman)
Multi-task learning (MTL) for training and inference using custom heads
Text to Speech, using datasets from news recordings or Hindi subtitles of dubbed movies

Special thanks to Jeremy, Rachel, and the other contributors to fastai. This work reproduces for Hindi what they did for English. Thanks to @cstorm125 for thai2vec, which inspired this work.
