
novitoll avatar novitoll commented on July 1, 2024

So, I'm assuming that I don't need to train any language model or fill a dictionary, as my current focus is to understand how it works with pre-built models. The only things I'd like to tune are probably the feature-extraction part and perhaps the acoustic model.

I had a chance to play around with the same audio file using the Google Speech API, Microsoft Cognitive Services (Bing Speech API), and IBM Watson via their online services.

Google's API offers sync and async modes with limits of 1 minute and 80 minutes of audio respectively, and Microsoft's limit is 10 seconds per file. So I split my full audio (3:12) into N parts of 10 seconds each using ffmpeg, and with the pocketsphinx_continuous utility I got worse results than when running the same utility with the same parameters on the single entire file. So the question is: how does pocketsphinx determine the utterance?

From this paper, I got that

Feature vectors are typically computed every 10 ms using an overlapping analysis window of around 25 ms
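As a sanity check on those numbers: with a 10 ms hop and a ~25 ms window, even a short clip produces thousands of feature vectors. A back-of-the-envelope calculation in plain Python (the constants are the typical values from the quote above, not anything Sphinx-specific):

```python
# Frame bookkeeping for a typical MFCC front end: a feature vector is
# computed every 10 ms (the "hop"), each one looking at an overlapping
# ~25 ms analysis window.
HOP_MS = 10
WINDOW_MS = 25

def num_frames(duration_ms: int) -> int:
    """Number of full analysis windows that fit into the audio."""
    if duration_ms < WINDOW_MS:
        return 0
    return 1 + (duration_ms - WINDOW_MS) // HOP_MS

# A 3:12 file (192 s) yields roughly 19,198 feature vectors.
print(num_frames(192_000))
```

So the decoder never sees "the utterance" as one object up front; it sees a steady stream of 10 ms frames, and utterance boundaries are a separate decision layered on top of that stream.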

I'm not too worried about accuracy, because eventually I may move to Kaldi :) But I'd like to understand Sphinx first, as it is a great toolkit to start with (appreciate you, guys).

This is probably more of a Quora question, but please tell me whether my understanding is correct and whether I'm heading in the right direction in understanding SR. Thanks again.

from pocketsphinx-python.

nshmyrev avatar nshmyrev commented on July 1, 2024

def recognize_audio(audio_file, args):

This code is not correct. It is fine for processing the audio all at once, but not for audio chunks arriving one by one, because you restart after the first word is recognized, so chunks are processed independently, not continuously. A continuous-processing example with VAD is here:

https://github.com/cmusphinx/pocketsphinx/blob/master/swig/python/test/continuous_test.py
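The shape of that continuous loop looks roughly like this. The `StubDecoder` below is a stand-in I've written so the control flow is self-contained and runnable; with the real pocketsphinx bindings you would construct a configured `Decoder` instead, and `start_utt`/`end_utt`/`process_raw`/`get_in_speech`/`hyp` are the actual calls the linked example uses:

```python
class StubDecoder:
    """Stand-in for pocketsphinx.Decoder, just enough to show the loop.
    The real decoder runs VAD and search; this stub simply calls a chunk
    'speech' when it contains any non-zero byte."""
    def __init__(self):
        self._in_speech = False
        self._words = []

    def start_utt(self):
        self._words = []

    def end_utt(self):
        pass

    def process_raw(self, buf, no_search, full_utt):
        self._in_speech = any(buf)
        if self._in_speech:
            self._words.append("word")  # pretend one word per speech chunk

    def get_in_speech(self):
        return self._in_speech

    def hyp(self):
        return " ".join(self._words) if self._words else None


def decode_stream(chunks, decoder):
    """Continuous decoding: one utterance per speech segment.
    The utterance is restarted on each speech -> silence transition,
    never between the chunks of a single segment."""
    hypotheses = []
    decoder.start_utt()
    in_speech = False
    for buf in chunks:
        decoder.process_raw(buf, False, False)
        if decoder.get_in_speech() != in_speech:
            in_speech = decoder.get_in_speech()
            if not in_speech and decoder.hyp() is not None:
                decoder.end_utt()            # silence: close this utterance
                hypotheses.append(decoder.hyp())
                decoder.start_utt()          # and open a fresh one
    return hypotheses


# Two speech segments separated by silence -> two hypotheses.
chunks = [b"\x01\x02", b"\x03", b"\x00\x00", b"\x05", b"\x00"]
print(decode_stream(chunks, StubDecoder()))
```

The point is that `end_utt()`/`start_utt()` mark segment boundaries found by VAD, not chunk boundaries; the chunk size is an I/O detail.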

So I split my full audio (3:12) into N parts of 10 seconds each using ffmpeg, and with the pocketsphinx_continuous utility I got worse results than when running the same utility with the same parameters on the single entire file.

The recognizer needs a few seconds to estimate channel parameters (CMN); for that reason it is better to process the audio continuously, without restarts. The alternative is batch processing, where the audio parameters are estimated from the whole utterance at once. There is a full_utt parameter in ps_process_raw to perform batch processing.
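In code, batch mode is the degenerate case: one start_utt/end_utt pair and a single process_raw call over the whole file with full_utt set to True. A sketch of that pattern, with a small fake decoder standing in for a configured pocketsphinx Decoder so the example is self-contained (with the real bindings, hyp() returns an object whose .hypstr holds the text):

```python
def recognize_batch(audio_bytes, decoder):
    """Batch decoding: the whole audio goes through process_raw once with
    full_utt=True, so channel parameters (CMN) are estimated from the
    entire utterance instead of adapting over the first few seconds."""
    decoder.start_utt()
    decoder.process_raw(audio_bytes, False, True)  # buf, no_search, full_utt
    decoder.end_utt()
    return decoder.hyp()


class FakeDecoder:
    """Records the calls made to it; stands in for pocketsphinx.Decoder."""
    def __init__(self):
        self.calls = []

    def start_utt(self):
        self.calls.append("start_utt")

    def end_utt(self):
        self.calls.append("end_utt")

    def process_raw(self, buf, no_search, full_utt):
        self.calls.append(("process_raw", len(buf), no_search, full_utt))

    def hyp(self):
        return "hello world"


d = FakeDecoder()
print(recognize_batch(b"\x00" * 32000, d))
```

This trades latency for accuracy: nothing is recognized until the whole utterance is available, but the channel estimate covers all of it.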


novitoll avatar novitoll commented on July 1, 2024

it is better to process audio continuously without restarts

Ok, thanks. So I tried the code below; note that I start the utterance only once here.

So does this 1024 mean that 1024 bytes of audio are read in each loop iteration, and the Decoder works only on that 1024-byte chunk? I mean, there is no information about whether the chunk contains a full utterance or is cut off abruptly. Is this the same way pocketsphinx_continuous works? Does it use 1024 bytes too?

def recognize_audio(audio_file, args):
    # `decoder` and `hypothesis` are defined at module level (not shown)
    try:
        decoder.start_utt()  # utterance is started only once
        stream = open(audio_file, 'rb')
        in_speech_bf = False
        while True:
            buf = stream.read(args.chunk_size)
            if buf:
                decoder.process_raw(buf, False, False)  # no_search, full_utt
                if decoder.get_in_speech() != in_speech_bf:
                    in_speech_bf = decoder.get_in_speech()
                    if decoder.hyp() is not None:
                        # decoder.end_utt()
                        hypothesis.append(decoder.hyp().hypstr)
                        # decoder.start_utt()
            else:
                break
    except Exception as ex:
        print('Error occurred with %s\n%s' % (audio_file, ex))


nshmyrev avatar nshmyrev commented on July 1, 2024

Ok, thanks. So I tried the code below; note that I start the utterance only once here.

It is a bad idea to modify an example without understanding it. The original code is correct; your modification is wrong.

So does this 1024 mean that 1024 bytes of audio are read in each loop iteration, and the Decoder works only on that 1024-byte chunk?

The decoder remembers all audio processed since start_utt.

I mean, there is no information about whether the chunk contains a full utterance or is cut off abruptly.

Chunks are not full utterances, since full_utt is false.

Is this the same way pocketsphinx_continuous works?

Yes

Does it use 1024 bytes too?

It uses 4096 bytes.

