srvk / lm_build
Adapting your own Language Model for Kaldi
Home Page: http://speechkitchen.org/kaldi-language-model-building/
I'm looking to build a language model from a small amount of text, and for experimental purposes I'm also trying with a very small example_txt.
Earlier I was following the instructions under the heading "Adapting your own Language Model for EESEN-tedlium":
cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
./train_lms.sh example_txt local_lm
cd ..
lm_build/utils/decode_graph_newlm.sh data/lang_phn_test
When I tried an example text of merely 145 words, the language model built successfully, but the results were pretty bad: most of the words in the transcript were from outside the example_txt. So I tried modifying wordlist.txt to include only the ~90 words I expected in the transcript.
I got an error like:
compute_perplexity: no unigram-state weight for predicted word "BA"
(I think it was something other than "BA", "BH" or something... I can find out if it's important)
I played around and realized that wordlist.txt had to contain at least about 47k words for the error to go away. So I padded it with fake words full of symbols, and things worked (although not as well as I'd like).
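One way to see how the training text and the word list relate is to list the words that appear in example_txt but not in wordlist.txt. The following is only a minimal sketch: the function name list_oov_words is hypothetical, and it assumes whitespace-separated, case-sensitive words in the text and one word per line in the word list, which may not match the normalization that train_lms.sh applies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print words that occur in a text file but not in a word list.
# Assumes whitespace-separated words in the text and one word per line in the word list.
list_oov_words() {
  local text_file="$1" wordlist_file="$2"
  # Split the text into one word per line, drop empty lines, de-duplicate,
  # then keep only the lines that are absent from the sorted word list.
  LC_ALL=C comm -23 \
    <(tr -s '[:space:]' '\n' < "$text_file" | grep -v '^$' | LC_ALL=C sort -u) \
    <(LC_ALL=C sort -u "$wordlist_file")
}

# Example usage (file names taken from the post above):
# list_oov_words example_txt wordlist.txt
```

An empty result would mean every word in the text is covered by the word list. It would not by itself explain the compute_perplexity error.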
Since run_adapt.sh seemed to be a better recipe, as I wrote in another discussion, I tried that. Even with the original large dictionary, if I reduced example_txt to a small piece of 145 words, it repeatedly gave the message:
"compute_perplexity: for history-state "", no total-count % is seen
(perhaps you didn't put the training n-grams through interpolate_ngrams?)"
and eventually ended with:
"Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/vagrant/eesen/tools/kaldi_lm/optimize_alpha.pl line 23.
Expecting files adapt_lm/3gram-mincount//ngrams_disc and adapt_lm/3gram-mincount//../word_map to exist
E.g. see egs/wsj/s3/local/wsj_train_lm.sh for examples.
Finding OOV words in ARPA LM but not in our words.txt
gzip: adapt_lm/3gram-mincount/lm_pr6.0.gz: No such file or directory
Composing the decoding graph using our own ARPA LM
No such file adapt_lm/3gram-mincount/lm_pr6.0.gz"
Any thoughts on why these errors are occurring? (more interested in the run_adapt.sh recipe now)
In our target application we will have much more data than this, but the example_txt could still be much smaller than the original example_txt file (a specific and narrow domain). So it's worthwhile for me to understand the problem above beyond the scope of this toy experiment.
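The log above suggests the graph-building step still ran after the LM file had failed to appear. A guard such as the one below would stop early instead. This is a sketch under assumptions: require_files is a hypothetical helper, not part of run_adapt.sh, and the path in the usage comment is copied from the error output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every missing or empty file on stderr
# and return non-zero if at least one is missing or empty.
require_files() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage, with the path copied from the log output above:
# require_files adapt_lm/3gram-mincount/lm_pr6.0.gz || exit 1
```

Placed between the LM training step and the graph composition, a check like this would surface the first failure rather than the later "No such file" messages.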
When I try to open your website through this link,
I get the error "Error establishing a database connection". Is your website down? Could you please fix it?
Regards,
Ehsan
Hey,
The link (http://speech-kitchen.org/kaldi-language-model-building/) in your README.md, which is supposed to explain how to build your own Kaldi LM, seems to be broken.
I'd be happy to have a look at this guide.
Best!
Hi~
I've noticed that the main script run_adapt.sh calls ../utils/ctc_compile_dict_token.sh (line 66), but no script with that name exists under the utils folder.
Should I copy it from somewhere else?
Could you please check this?
Best regards,
Li