Multilingual and code-switching ASR challenges for low resource Indian languages.

Shell 95.62% Python 0.54% Perl 3.84%

baseline_recipe_is21s_indic_asr_challenge's Introduction

Important notes about the code-switched ASR tasks (subtask 2)

Please visit this document.

baseline_recipe_is21s_indic_asr_challenge's People

Contributors

Stargazers

Watchers

Forkers

amirhussein96 bloodraven66 jagabandhumishra sumukharadhya nayanjha16

baseline_recipe_is21s_indic_asr_challenge's Issues

error while running local/check_audio_data_folder.sh command

/home/shivam23004/speech_corpus/mucs_2021/IS21_subtask_1_data/Gujarati - Correct format
Inconsistent #transcripts(3843) and #audio files(0)
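
An audio count of 0 suggests the check is not finding any audio files where it looks. A minimal sketch for comparing the two counts by hand is shown below; it is not the logic of local/check_audio_data_folder.sh. The function name `compare_counts`, the `.wav` extension, and the one-transcript-line-per-audio-file layout are all assumptions.

```shell
# Hypothetical helper: compare the number of .wav files under a directory
# with the number of lines in a transcript file.
# Assumes one transcript line per audio file and a .wav extension.
compare_counts() {
  audio_dir=$1
  transcript_file=$2
  # Count regular files ending in .wav anywhere under audio_dir.
  n_audio=$(find "$audio_dir" -type f -name '*.wav' | wc -l | tr -d ' ')
  # grep -c '' counts lines, including a final line without a newline.
  n_trans=$(grep -c '' "$transcript_file")
  echo "transcripts=$n_trans audio=$n_audio"
  # Succeed only when the two counts agree.
  [ "$n_audio" -eq "$n_trans" ]
}
```

Calling it with the audio folder and the transcript file of one language prints both counts and returns non-zero on a mismatch, which makes it easy to see whether the audio path or the extension is the problem.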

Problems with datasets

While processing the code-switching data for Hindi-English we found the following issues:

The segment boundaries are not very good, which unfairly penalizes the provided Kaldi baseline. When scoring by paragraph instead of by utterance, we get 27.6% WER with a setup nearly the same as the Kaldi baseline, compared with 36% WER.
The test set transcriptions include many Hindi words written in Latin script and English words written in Devanagari. There should be a GLM file to give participants credit for correctly produced English/Hindi words regardless of the script. We can provide this for the Hindi test set, but the organizers should score the blind test sets in Bengali and Hindi with such a GLM file.
The transcriptions are fairly noisy and include poorly tokenized text with many punctuation marks removed, such as 1004 (1 “point” 0 “point” 0 “point” 4). This is probably found data, and it will be hard to fix all such issues systematically; still, it’s not ideal.
Examples of noise in the transcripts:
* two words are combined and are written as a single word, for example, हैंcolumns, मेंoutlook, हैंsingle, लिखेंthis, आप:थंडरबर्ड, अबsubject, factorand
* the same word is written in both Hindi and English: dotडॉट
* spelling mistakes in the transcripts: shft instead of shift, ड्रॉप=डाउन instead of ड्रॉपडाउन
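
The two scoring ideas above (scoring by paragraph, and crediting a word regardless of script) can be sketched with awk, assuming Kaldi-style `segments` (utt-id reco-id start end) and `text` (utt-id words...) files. This is not the scoring setup used for the numbers quoted here. The function names and the two-column map file are hypothetical, and the map file is a simplified stand-in, not the NIST GLM format.

```shell
# Hypothetical helper: join utterance transcripts into one line per recording.
# $1 = segments file (utt-id reco-id start end), $2 = text file (utt-id words...).
# Assumes the text file lists utterances in time order within each recording.
merge_by_recording() {
  awk 'NR == FNR { reco[$1] = $2; next }      # first file: utt-id -> reco-id
       ($1 in reco) {
         r = reco[$1]
         if (!(r in seen)) { order[++n] = r; seen[r] = 1 }
         $1 = ""                               # drop utt-id, keeps a leading space
         merged[r] = merged[r] $0
       }
       END { for (i = 1; i <= n; i++) print order[i] merged[order[i]] }' "$1" "$2"
}

# Hypothetical helper: rewrite each word to a canonical form before scoring.
# $1 = map file with "<variant> <canonical>" per line, $2 = text file.
apply_word_map() {
  awk 'NR == FNR { map[$1] = $2; next }       # first file: variant -> canonical
       { for (i = 2; i <= NF; i++) if ($i in map) $i = map[$i]; print }' "$1" "$2"
}
```

Applying both functions to the reference and to the hypothesis in the same way, and then scoring the merged files, would give a paragraph-level WER that does not depend on the script a word was written in.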

Is there anybody else aware of these issues?

JHU-CLSP team

Error while running ./run.sh in baseline_recipe_is21s_indic_asr_challenge

warning: discount coeff 1 is out of range: 0
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt data/local/lm.arpa data/lang/G.fst
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.891~1-37628]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 179596 to 58945
steps/make_mfcc.sh --nj 48 --cmd run.pl data/train exp/make_mfcc/train mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh [info]: segments file exists: using that.
run.pl: 48 / 48 failed, log is in exp/make_mfcc/train/make_mfcc_train.*.log
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
steps/compute_cmvn_stats.sh: no such file data/train/feats.scp
steps/make_mfcc.sh --nj 48 --cmd run.pl data/test exp/make_mfcc/test mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/test
steps/make_mfcc.sh [info]: segments file exists: using that.
run.pl: 48 / 48 failed, log is in exp/make_mfcc/test/make_mfcc_test.*.log
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
steps/compute_cmvn_stats.sh: no such file data/test/feats.scp
mono training
steps/train_mono.sh --nj 48 --cmd run.pl data/train data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.
feat-to-dim 'ark,s,cs:apply-cmvn --utt2spk=ark:data/train/split48/1/utt2spk scp:data/train/split48/1/cmvn.scp scp:data/train/split48/1/feats.scp ark:- | add-deltas ark:- ark:- |' -
apply-cmvn --utt2spk=ark:data/train/split48/1/utt2spk scp:data/train/split48/1/cmvn.scp scp:data/train/split48/1/feats.scp ark:-
WARNING (apply-cmvn[5.5.891~1-37628]:Open():util/kaldi-table-inl.h:106) Failed to open script file data/train/split48/1/feats.scp
ERROR (apply-cmvn[5.5.891~1-37628]:SequentialTableReader():util/kaldi-table-inl.h:860) Error constructing TableReader: rspecifier is scp:data/train/split48/1/feats.scp

[ Stack-Trace: ]
/home/ubuntu/MASR/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x793) [0x7f7251d01183]
apply-cmvn(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x55830a72a1c1]
apply-cmvn(kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc4) [0x55830a7397fe]
apply-cmvn(main+0x7b7) [0x55830a728720]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f72518fb0b3]
apply-cmvn(_start+0x2e) [0x55830a727eae]

kaldi::KaldiFatalErroradd-deltas ark:- ark:-
ERROR (feat-to-dim[5.5.891~1-37628]:main():feat-to-dim.cc:58) Could not read any features (empty archive?)

[ Stack-Trace: ]
/home/ubuntu/MASR/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x793) [0x7fb2577f5183]
feat-to-dim(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x55eef38b75d9]
feat-to-dim(main+0x2f7) [0x55eef38b6f20]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fb2573ef0b3]
feat-to-dim(_start+0x2e) [0x55eef38b6b6e]

kaldi::KaldiFatalErrorerror getting feature dimension
mono training done
mono decoding on test
tri1 training
steps/align_si.sh --nj 48 --cmd run.pl data/train data/lang exp/mono exp/mono_ali
mkgraph.sh: expected exp/mono/final.mdl to exist
steps/decode.sh --nj 24 --cmd run.pl exp/mono/graph data/test exp/mono/decode_test
steps/align_si.sh: expected file exp/mono/tree to exist
steps/train_deltas.sh --boost-silence 1.25 --cmd run.pl 2000 20000 data/train data/lang exp/mono_ali exp/tri1
train_deltas.sh: no such file exp/mono_ali/final.mdl
tri1 training done
tri1 decoding on test
tri2b training
steps/align_si.sh --nj 48 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali
steps/decode.sh: Error: no such file data/test/split24/1/feats.scp
mono decoding on test done
mkgraph.sh: expected exp/tri1/final.mdl to exist
steps/decode.sh --nj 24 --cmd run.pl exp/tri1/graph data/test exp/tri1/decode_test
steps/align_si.sh: expected file exp/tri1/tree to exist
steps/train_lda_mllt.sh --cmd run.pl --splice-opts --left-context=3 --right-context=3 2500 15000 data/train data/lang exp/tri1_ali exp/tri2b
train_lda_mllt.sh: no such file exp/tri1_ali/final.mdl
tri2b training done
tri2b decoding on test
tri3b training
steps/decode.sh: Error: no such file data/test/split24/1/feats.scp
tri1 decoding on test done.
steps/align_si.sh --nj 48 --cmd run.pl data/train data/lang exp/tri2b exp/tri2b_ali
mkgraph.sh: expected exp/tri2b/final.mdl to exist
steps/decode.sh --nj 24 --cmd run.pl exp/tri2b/graph data/test exp/tri2b/decode_test
steps/align_si.sh: expected file exp/tri2b/tree to exist
steps/train_sat.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri2b_ali exp/tri3b
train_sat.sh: no such file data/train/feats.scp
tri3b training done
tri4b training
tri3b decoding on test
steps/align_fmllr.sh --nj 48 --cmd run.pl data/train data/lang exp/tri3b exp/tri3b_ali
steps/decode.sh: Error: no such file data/test/split24/1/feats.scp
tri2b decoding on test done.
mkgraph.sh: expected exp/tri3b/final.mdl to exist
steps/decode_fmllr.sh --nj 24 --cmd run.pl exp/tri3b/graph data/test exp/tri3b/decode_test
cp: cannot stat 'exp/tri3b/tree': No such file or directory
cp: cannot stat 'exp/tri3b/final.mdl': No such file or directory
steps/train_sat.sh --cmd run.pl 4200 40000 data/train data/lang exp/tri3b_ali exp/tri4b
train_sat.sh: no such file data/train/feats.scp
tri4b training done
tri4b decoding on test
local/chain/run_tdnn.sh --stage 0 --nj 48 --decode_nj 24
mkgraph.sh: expected exp/tri4b/final.mdl to exist
steps/decode_fmllr.sh --nj 24 --cmd run.pl exp/tri4b/graph data/test exp/tri4b/decode_test
local/nnet3/run_ivector_common.sh: expected file data/train/feats.scp to exist
filter_scps.pl: warning: some input lines did not get output
Could not open id-list file data/test/split24/1/tmp.reco at utils/filter_scps.pl line 99.
tri4b decoding on test done.
cat: exp/tri3b/graph/phones/silence.csl: No such file or directory
tri3b decoding on test done.
grep: exp/mono/decode_test/wer_*: No such file or directory
grep: exp/tri1/decode_test/wer_*: No such file or directory
grep: exp/tri2b/decode_test/wer_*: No such file or directory
grep: exp/tri3b/decode_test/wer_*: No such file or directory
grep: exp/tri4b/decode_test/wer_*: No such file or directory
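
In the log above, every stage after steps/make_mfcc.sh fails because data/train/feats.scp was never written, so the first failure is the one worth reading. A minimal sketch of a guard that stops the pipeline at that point is shown below. The function name `require_file` is hypothetical and is not part of the recipe or of Kaldi.

```shell
# Hypothetical guard: fail with a message if a required output file is
# missing or empty, so later stages do not run on top of a broken one.
require_file() {
  if [ ! -s "$1" ]; then
    echo "missing or empty: $1" >&2
    return 1
  fi
}

# Possible use inside run.sh, right after feature extraction (commented out
# because it depends on the recipe's directories):
#   steps/make_mfcc.sh --nj 48 --cmd run.pl data/train exp/make_mfcc/train mfcc
#   require_file data/train/feats.scp || exit 1
```

With such a guard in place the run would stop right after the MFCC step, and the per-job logs under exp/make_mfcc/train named in the run.pl message would show the actual cause.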

data directory is missing in is21-subtask2-kaldi baseline

The data directory is missing in both hindi_baseline and bengali_baseline. Please check; it does not appear in the corpus/ folder.

navana-tech / baseline_recipe_is21s_indic_asr_challenge