The lm_build from srvk

Building Language Model with low amount of data

I want to build a language model from a small amount of text, and for experimental purposes I'm also trying a very small example_txt.

Earlier, I was following the instructions under the heading "Adapting your own Language Model for EESEN-tedlium":

cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
./train_lms.sh example_txt local_lm
cd ..
lm_build/utils/decode_graph_newlm.sh data/lang_phn_test

When I tried an example text of merely 145 words, the language model built successfully, but the results were pretty bad: most of the words in the transcript came from outside example_txt. So I tried cutting wordlist.txt down to about 90 words, only the ones I expected in the transcript.

I got an error like:
compute_perplexity: no unigram-state weight for predicted word "BA"
(I think it was something other than "BA", "BH" or something... I can find out if it's important)

I played around and found that the error went away once wordlist.txt had at least about 47k words. So I padded it with fake words full of symbols, and things worked (although not as well as I'd like).
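
To see how the two files relate, it can help to list the words of example_txt that do not appear in wordlist.txt. The following is only a minimal sketch and not part of lm_build: the function name missing_words is hypothetical, and it assumes that wordlist.txt holds one word per line and that example_txt is plain whitespace-separated text with the same casing as the word list.

```shell
# Hypothetical helper, not part of lm_build.
# Prints each distinct word of a text file that is absent from a word list.
# Assumes one word per line in the word list and whitespace-separated
# tokens in the text file.
missing_words() {
    text_file=$1
    wordlist_file=$2
    # Split the text into one token per line, drop empty lines, deduplicate,
    # then keep only tokens that match no whole line of the word list.
    tr -s '[:space:]' '\n' < "$text_file" | grep -v '^$' | sort -u |
        grep -vxF -f "$wordlist_file" || true
}
```

For example, `missing_words example_txt wordlist.txt | wc -l` would count the distinct words in the text that the word list does not cover.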

Since run_adapt.sh seemed to be a better recipe, as I wrote in another discussion, I tried that. Even with the original large dictionary, just reducing example_txt to a small piece of 145 words made it repeatedly print the message:

"compute_perplexity: for history-state "", no total-count % is seen
(perhaps you didn't put the training n-grams through interpolate_ngrams?)"

and eventually ended with:

"Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/vagrant/eesen/tools/kaldi_lm/optimize_alpha.pl line 23.
Expecting files adapt_lm/3gram-mincount//ngrams_disc and adapt_lm/3gram-mincount//../word_map to exist
E.g. see egs/wsj/s3/local/wsj_train_lm.sh for examples.
Finding OOV words in ARPA LM but not in our words.txt
gzip: adapt_lm/3gram-mincount/lm_pr6.0.gz: No such file or directory
Composing the decoding graph using our own ARPA LM
No such file adapt_lm/3gram-mincount/lm_pr6.0.gz"

Any thoughts on why these errors are occurring? (more interested in the run_adapt.sh recipe now)

I guess in our target application we will have much more data than this, but the example_txt could still be much smaller than the original example_txt file (a specific, narrow domain). So it's worthwhile for me to understand the problem above beyond the scope of this toy experiment.

Error establishing a database connection

When I try to open your website at this link:

http://speechkitchen.org/

I get the error "Error establishing a database connection". Is your website down? Could you please fix it?

regards
Ehsan

Broken link in README.md

Hey,

The link (http://speech-kitchen.org/kaldi-language-model-building/) in your README.md, which is supposed to explain how to build your own Kaldi LM, seems to be broken.
I'd be happy to have a look at this guide.

Best!

Missing Script : ctc_compile_dict_token.sh in Utils folder

Hi~
I noticed that the main script run_adapt.sh calls ../utils/ctc_compile_dict_token.sh (line 66), but no script with that name exists under the utils folder.

Should I copy it from somewhere else?

Please help check this.
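
One way to check whether the script ships somewhere else in a checkout is to search for it by name. This is a minimal sketch under assumptions: find_script is a hypothetical helper, and the root directory you pass in (for example an EESEN checkout) depends on your own setup.

```shell
# Hypothetical helper: list every regular file with the given name
# under a root directory, ignoring unreadable subdirectories.
find_script() {
    root=$1
    name=$2
    find "$root" -type f -name "$name" 2>/dev/null
}
```

For example, `find_script ~/eesen ctc_compile_dict_token.sh` would print the path of each copy found under ~/eesen, or nothing if there is none.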

Best regards,
