Pretrained Model | Description | Architecture | Pretrained Hours | Logs |
---|---|---|---|---|
hindi_pretrained_4kh | Trained on 4200 hours of Hindi data | Base | 4200 | |
kannada_pretrained_1400h | Trained on 1400 hours of Kannada data | XLSR | 1400 | |
CLSRIL-23 | Cross Lingual Representations for Indic Languages; trained on 10,000 hours of data from 23 Indic languages | Base | 10,000 | wandb |

Language | Pretrained Model | Architecture | Finetuned Model | Finetuned Hours | Dictionary |
---|---|---|---|---|---|
Hindi | hindi_pretrained_4kh | Base | hindi_finetuned_4kh | 4200 | dict |
Kannada | kannada_pretrained_1400h | XLSR | kannada_finetuned_570h | 570 | dict |
English | english_finetuned_181h | Base | english_finetuned_181h | 181 | dict |
Marathi | hindi_pretrained_4kh | Base | marathi_finetuned_100h | 100 | dict |
Odia | hindi_pretrained_4kh | Base | odia_finetuned_100h | 100 | dict |
Tamil | hindi_pretrained_4kh | Base | tamil_finetuned_40h | 40 | dict |
Telugu | hindi_pretrained_4kh | Base | telugu_finetuned_40h | 40 | dict |
Gujarati | hindi_pretrained_4kh | Base | gujarati_finetuned_40h | 40 | dict |

The text data is taken from the AI For Bharat Corpus; we post-process it by tokenizing the sentences and removing duplicates.

Language | Type | Data | Sentences | Lexicon | LM |
---|---|---|---|---|---|
Hindi | kenlm | data | 13M | lexicon | lm |
Kannada | kenlm | data | 21M | lexicon | lm |
English | kenlm | data | 10M | lexicon | lm |
Marathi | kenlm | data | 22M | lexicon | lm |
Odia | kenlm | data | 0.36M | lexicon | lm |
Tamil | kenlm | data | 21M | lexicon | lm |
Telugu | kenlm | data | 26M | lexicon | lm |
Gujarati | kenlm | data | 30M | lexicon | lm |
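
The post-processing applied to the text data (tokenizing and removing duplicates) could look roughly like the sketch below. This is only an illustration, not the actual pipeline: the function names `tokenize` and `dedupe_sentences` are hypothetical, and plain whitespace tokenization is an assumption, since the tokenizer actually used is not specified here.

```python
from typing import Iterable, Iterator, List


def tokenize(sentence: str) -> List[str]:
    # Assumption: whitespace tokenization. The real pipeline may use a
    # language-aware tokenizer instead.
    return sentence.split()


def dedupe_sentences(lines: Iterable[str]) -> Iterator[str]:
    # Yield each distinct sentence once, in first-seen order.
    seen = set()
    for line in lines:
        # Re-join tokens so that sentences differing only in spacing
        # are treated as duplicates.
        normalized = " ".join(tokenize(line))
        if not normalized or normalized in seen:
            continue  # skip empty lines and repeats
        seen.add(normalized)
        yield normalized


corpus = ["नमस्ते  दुनिया", "नमस्ते दुनिया", "", "hello world"]
cleaned = list(dedupe_sentences(corpus))
print(cleaned)
```

Keeping a set of seen sentences holds the whole vocabulary of sentences in memory; for corpora of tens of millions of lines, sorting on disk or hashing each sentence may be a better fit.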