Please visit this document.
navana-tech / baseline_recipe_is21s_indic_asr_challenge
Multilingual and code-switching ASR challenges for low resource Indian languages.
/home/shivam23004/speech_corpus/mucs_2021/IS21_subtask_1_data/Gujarati - Correct format
Inconsistent #transcripts(3843) and #audio files(0)
While processing the code-switching data for Hindi-English we found the following issues:
The segment boundaries are not very good. This unfairly penalizes the Kaldi baseline provided. When scoring by paragraph instead of by utterance, we get 27.6% WER with a setup nearly identical to the Kaldi baseline, compared with 36% WER.
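To illustrate why paragraph-level scoring can lower WER when segment boundaries are off, here is a minimal sketch in Python. It is not the scoring script used for the numbers above. The utterance IDs, the toy transcripts, and the assumption that sorting utterance IDs gives time order within a recording are all hypothetical.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(pairs):
    """pairs: iterable of (ref_tokens, hyp_tokens). Returns WER in percent."""
    pairs = list(pairs)
    errors = sum(edit_distance(r, h) for r, h in pairs)
    words = sum(len(r) for r, _ in pairs)
    return 100.0 * errors / words

def group_by_recording(utts, utt2reco):
    """Concatenate the utterances of each recording.

    Assumption: sorting utterance IDs yields time order within a recording.
    """
    grouped = defaultdict(list)
    for utt_id in sorted(utts):
        grouped[utt2reco[utt_id]].extend(utts[utt_id])
    return grouped

# Toy data: the word "menu" falls on the wrong side of a segment boundary.
ref = {"rec1_0001": "open the file".split(),
       "rec1_0002": "menu and click save".split()}
hyp = {"rec1_0001": "open the file menu".split(),
       "rec1_0002": "and click save".split()}
utt2reco = {"rec1_0001": "rec1", "rec1_0002": "rec1"}

utt_wer = wer((ref[u], hyp[u]) for u in ref)
ref_par = group_by_recording(ref, utt2reco)
hyp_par = group_by_recording(hyp, utt2reco)
par_wer = wer((ref_par[r], hyp_par[r]) for r in ref_par)
```

In the toy data the misplaced word counts as one insertion and one deletion when scored per utterance, and as no error once the two utterances are scored together.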
The test set transcriptions include many Hindi words written in English and English words written in Hindi. There should be a GLM file to give participants credit for correctly produced English/Hindi words regardless of the script. We can provide this for the Hindi test set, but the organizers should score the blind test sets in Bengali and Hindi with such a GLM file.
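The idea of crediting a word regardless of script can be sketched as a token-level mapping applied to both reference and hypothesis before scoring. This is a simplified stand-in, not the GLM file format itself. The `EQUIVALENTS` table, its entries, and the position-by-position comparison are illustrative assumptions; a real list would have to be curated from the data and used with a proper alignment-based scorer.

```python
# Hypothetical cross-script equivalences (Devanagari spelling -> Latin spelling).
EQUIVALENTS = {
    "डॉट": "dot",
    "फाइल": "file",
}

def normalize(tokens, mapping=EQUIVALENTS):
    """Map every token that has a known equivalent to one canonical form."""
    return [mapping.get(t, t) for t in tokens]

# Toy reference and hypothesis that differ only in the script of one word.
ref = "फाइल को save करें".split()
hyp = "file को save करें".split()

# Position-wise comparison is enough here because both lists have equal length.
raw_mismatches = sum(r != h for r, h in zip(ref, hyp))
norm_mismatches = sum(r != h for r, h in zip(normalize(ref), normalize(hyp)))
```

Without the mapping the toy pair has one mismatch; with it, none.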
The transcriptions are fairly noisy and include poorly tokenized text with many punctuation marks removed, such as 1004 (1 “point” 0 “point” 0 “point” 4). This is probably found data, and it will be hard to systematically fix all such issues, but it is not ideal.
Examples of noise in the transcripts:
* two words are combined and are written as a single word, for example, हैंcolumns, मेंoutlook, हैंsingle, लिखेंthis, आप:थंडरबर्ड, अबsubject, factorand
* the same word is written in both Hindi and English: dotडॉट
* spelling mistakes in the transcripts: shft instead of shift, ड्रॉप=डाउन instead of ड्रॉपडाउन
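Some of the noise listed above can be flagged automatically. The following is a minimal sketch, assuming transcripts are whitespace-tokenized and that a token containing both Devanagari and Latin letters is suspicious. The function name and the sample line are made up for illustration.

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")  # Devanagari Unicode block
LATIN = re.compile(r"[A-Za-z]")

def mixed_script_tokens(text):
    """Return tokens that contain both Devanagari and Latin characters."""
    return [t for t in text.split()
            if DEVANAGARI.search(t) and LATIN.search(t)]

# Sample line built from examples reported in this issue.
sample = "हैंcolumns और मेंoutlook dotडॉट shift"
flagged = mixed_script_tokens(sample)
```

This only catches cross-script joins. Same-script joins such as factorand and spelling mistakes such as shft would need a lexicon or manual review.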
Is there anybody else aware of these issues?
JHU-CLSP team
warning: discount coeff 1 is out of range: 0
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt data/local/lm.arpa data/lang/G.fst
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.891~1-37628]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.891~1-37628]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 179596 to 58945
steps/make_mfcc.sh --nj 48 --cmd run.pl data/train exp/make_mfcc/train mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh [info]: segments file exists: using that.
run.pl: 48 / 48 failed, log is in exp/make_mfcc/train/make_mfcc_train.*.log
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
steps/compute_cmvn_stats.sh: no such file data/train/feats.scp
steps/make_mfcc.sh --nj 48 --cmd run.pl data/test exp/make_mfcc/test mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/test
steps/make_mfcc.sh [info]: segments file exists: using that.
run.pl: 48 / 48 failed, log is in exp/make_mfcc/test/make_mfcc_test.*.log
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
steps/compute_cmvn_stats.sh: no such file data/test/feats.scp
mono training
steps/train_mono.sh --nj 48 --cmd run.pl data/train data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.
feat-to-dim 'ark,s,cs:apply-cmvn --utt2spk=ark:data/train/split48/1/utt2spk scp:data/train/split48/1/cmvn.scp scp:data/train/split48/1/feats.scp ark:- | add-deltas ark:- ark:- |' -
apply-cmvn --utt2spk=ark:data/train/split48/1/utt2spk scp:data/train/split48/1/cmvn.scp scp:data/train/split48/1/feats.scp ark:-
WARNING (apply-cmvn[5.5.891~1-37628]:Open():util/kaldi-table-inl.h:106) Failed to open script file data/train/split48/1/feats.scp
ERROR (apply-cmvn[5.5.891~1-37628]:SequentialTableReader():util/kaldi-table-inl.h:860) Error constructing TableReader: rspecifier is scp:data/train/split48/1/feats.scp
[ Stack-Trace: ]
/home/ubuntu/MASR/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x793) [0x7f7251d01183]
apply-cmvn(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x55830a72a1c1]
apply-cmvn(kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc4) [0x55830a7397fe]
apply-cmvn(main+0x7b7) [0x55830a728720]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f72518fb0b3]
apply-cmvn(_start+0x2e) [0x55830a727eae]
kaldi::KaldiFatalError
add-deltas ark:- ark:-
ERROR (feat-to-dim[5.5.891~1-37628]:main():feat-to-dim.cc:58) Could not read any features (empty archive?)
[ Stack-Trace: ]
/home/ubuntu/MASR/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x793) [0x7fb2577f5183]
feat-to-dim(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x55eef38b75d9]
feat-to-dim(main+0x2f7) [0x55eef38b6f20]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fb2573ef0b3]
feat-to-dim(_start+0x2e) [0x55eef38b6b6e]
kaldi::KaldiFatalError
error getting feature dimension
mono training done
mono decoding on test
tri1 training
steps/align_si.sh --nj 48 --cmd run.pl data/train data/lang exp/mono exp/mono_ali
mkgraph.sh: expected exp/mono/final.mdl to exist
steps/decode.sh --nj 24 --cmd run.pl exp/mono/graph data/test exp/mono/decode_test
steps/align_si.sh: expected file exp/mono/tree to exist
steps/train_deltas.sh --boost-silence 1.25 --cmd run.pl 2000 20000 data/train data/lang exp/mono_ali exp/tri1
train_deltas.sh: no such file exp/mono_ali/final.mdl
tri1 training done
tri1 decoding on test
tri2b training
steps/align_si.sh --nj 48 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali
steps/decode.sh: Error: no such file data/test/split24/1/feats.scp
mono decoding on test done
mkgraph.sh: expected exp/tri1/final.mdl to exist
steps/decode.sh --nj 24 --cmd run.pl exp/tri1/graph data/test exp/tri1/decode_test
steps/align_si.sh: expected file exp/tri1/tree to exist
steps/train_lda_mllt.sh --cmd run.pl --splice-opts --left-context=3 --right-context=3 2500 15000 data/train data/lang exp/tri1_ali exp/tri2b
train_lda_mllt.sh: no such file exp/tri1_ali/final.mdl
tri2b training done
tri2b decoding on test
tri3b training
steps/decode.sh: Error: no such file data/test/split24/1/feats.scp
tri1 decoding on test done.
steps/align_si.sh --nj 48 --cmd run.pl data/train data/lang exp/tri2b exp/tri2b_ali
mkgraph.sh: expected exp/tri2b/final.mdl to exist
steps/decode.sh --nj 24 --cmd run.pl exp/tri2b/graph data/test exp/tri2b/decode_test
steps/align_si.sh: expected file exp/tri2b/tree to exist
steps/train_sat.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri2b_ali exp/tri3b
train_sat.sh: no such file data/train/feats.scp
tri3b training done
tri4b training
tri3b decoding on test
steps/align_fmllr.sh --nj 48 --cmd run.pl data/train data/lang exp/tri3b exp/tri3b_ali
steps/decode.sh: Error: no such file data/test/split24/1/feats.scp
tri2b decoding on test done.
mkgraph.sh: expected exp/tri3b/final.mdl to exist
steps/decode_fmllr.sh --nj 24 --cmd run.pl exp/tri3b/graph data/test exp/tri3b/decode_test
cp: cannot stat 'exp/tri3b/tree': No such file or directory
cp: cannot stat 'exp/tri3b/final.mdl': No such file or directory
steps/train_sat.sh --cmd run.pl 4200 40000 data/train data/lang exp/tri3b_ali exp/tri4b
train_sat.sh: no such file data/train/feats.scp
tri4b training done
tri4b decoding on test
local/chain/run_tdnn.sh --stage 0 --nj 48 --decode_nj 24
mkgraph.sh: expected exp/tri4b/final.mdl to exist
steps/decode_fmllr.sh --nj 24 --cmd run.pl exp/tri4b/graph data/test exp/tri4b/decode_test
local/nnet3/run_ivector_common.sh: expected file data/train/feats.scp to exist
filter_scps.pl: warning: some input lines did not get output
Could not open id-list file data/test/split24/1/tmp.reco at utils/filter_scps.pl line 99.
tri4b decoding on test done.
cat: exp/tri3b/graph/phones/silence.csl: No such file or directory
tri3b decoding on test done.
grep: exp/mono/decode_test/wer_*: No such file or directory
grep: exp/tri1/decode_test/wer_*: No such file or directory
grep: exp/tri2b/decode_test/wer_*: No such file or directory
grep: exp/tri3b/decode_test/wer_*: No such file or directory
grep: exp/tri4b/decode_test/wer_*: No such file or directory
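In the log above, the first failure is the feature extraction step (48 / 48 jobs failed), and every later stage fails because feats.scp was never written. A small pre-flight check between stages makes this visible early. The following is a minimal sketch in Python, not part of the baseline recipe. The function name and the list of required files are assumptions chosen to match the files mentioned in the log.

```python
import tempfile
from pathlib import Path

# Files a later stage expects in a Kaldi data directory (assumed list).
REQUIRED = ("wav.scp", "text", "utt2spk", "feats.scp", "cmvn.scp")

def missing_files(data_dir, required=REQUIRED):
    """Return the required files that are absent or empty in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for name in required:
        path = data_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing

# Demo on a throwaway directory that mimics a data dir after failed extraction.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("wav.scp", "text", "utt2spk"):
        (Path(tmp) / name).write_text("utt1 placeholder\n")
    demo_missing = missing_files(tmp)
```

Run against data/train and data/test after the MFCC stage, a non-empty result would be a reason to stop and read the make_mfcc job logs before starting training.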
The data directory is missing in both hindi_baseline and bengali_baseline. Please check; it does not appear in the corpus/ folder.