
novitoll avatar novitoll commented on July 1, 2024

So, I'm assuming that I don't need to train any language model or fill a dictionary, as my current focus is to understand how it works with pre-built models. The only things I'd like to tune are probably the feature-extraction part and perhaps the acoustic model.

I had a chance to play around with the same audio file using the Google Speech API, Microsoft Cognitive Services (Bing Speech API), and IBM Watson via their online services.

Google's API offers sync and async modes with limits of 1 minute and 80 minutes of audio respectively, and Microsoft's limit is 10 seconds per file. So I split my full audio (3:12) into N parts of 10 seconds each using ffmpeg, and with the pocketsphinx_continuous utility I got worse results than when running the same utility with the same parameters on the single entire file. So the question is: how does pocketsphinx determine the utterance?

From this paper, I got that

Feature vectors are typically computed every 10 ms using an overlapping analysis window of around 25 ms
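As a sanity check on those numbers: with a 10 ms hop and a ~25 ms window, even a short clip produces thousands of feature vectors. A back-of-the-envelope calculation in plain Python (the constants are the typical values from the quote above, not anything Sphinx-specific):

```python
# Frame bookkeeping for a typical MFCC front end: a feature vector is
# computed every 10 ms (the "hop"), each one looking at an overlapping
# ~25 ms analysis window.
HOP_MS = 10
WINDOW_MS = 25

def num_frames(duration_ms: int) -> int:
    """Number of full analysis windows that fit into the audio."""
    if duration_ms < WINDOW_MS:
        return 0
    return 1 + (duration_ms - WINDOW_MS) // HOP_MS

# A 3:12 file (192 s) yields roughly 19,198 feature vectors.
print(num_frames(192_000))
```

So the decoder never sees "the utterance" as one object up front; it sees a steady stream of 10 ms frames, and utterance boundaries are a separate decision layered on top of that stream.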

I'm not too worried about accuracy, because eventually I may move to Kaldi :) But I'd like to understand Sphinx first, as it is a great toolkit to start with (appreciate you, guys).

This is probably more of a Quora question, but please tell me whether my understanding is correct and whether I'm heading in the right direction in understanding SR. Thanks again.

from pocketsphinx-python.

nshmyrev avatar nshmyrev commented on July 1, 2024

def recognize_audio(audio_file, args):

This code is not correct. It is fine for processing the audio all at once, but not for audio chunks arriving one by one, because you restart after the first word is recognized, so chunks are processed independently, not continuously. A continuous-processing example with VAD is here:

https://github.com/cmusphinx/pocketsphinx/blob/master/swig/python/test/continuous_test.py
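The shape of that continuous loop looks roughly like this. The `StubDecoder` below is a stand-in I've written so the control flow is self-contained and runnable; with the real pocketsphinx bindings you would construct a configured `Decoder` instead, and `start_utt`/`end_utt`/`process_raw`/`get_in_speech`/`hyp` are the actual calls the linked example uses:

```python
class StubDecoder:
    """Stand-in for pocketsphinx.Decoder, just enough to show the loop.
    The real decoder runs VAD and search; this stub simply calls a chunk
    'speech' when it contains any non-zero byte."""
    def __init__(self):
        self._in_speech = False
        self._words = []

    def start_utt(self):
        self._words = []

    def end_utt(self):
        pass

    def process_raw(self, buf, no_search, full_utt):
        self._in_speech = any(buf)
        if self._in_speech:
            self._words.append("word")  # pretend one word per speech chunk

    def get_in_speech(self):
        return self._in_speech

    def hyp(self):
        return " ".join(self._words) if self._words else None


def decode_stream(chunks, decoder):
    """Continuous decoding: one utterance per speech segment.
    The utterance is restarted on each speech -> silence transition,
    never between the chunks of a single segment."""
    hypotheses = []
    decoder.start_utt()
    in_speech = False
    for buf in chunks:
        decoder.process_raw(buf, False, False)
        if decoder.get_in_speech() != in_speech:
            in_speech = decoder.get_in_speech()
            if not in_speech and decoder.hyp() is not None:
                decoder.end_utt()            # silence: close this utterance
                hypotheses.append(decoder.hyp())
                decoder.start_utt()          # and open a fresh one
    return hypotheses


# Two speech segments separated by silence -> two hypotheses.
chunks = [b"\x01\x02", b"\x03", b"\x00\x00", b"\x05", b"\x00"]
print(decode_stream(chunks, StubDecoder()))
```

The point is that `end_utt()`/`start_utt()` mark segment boundaries found by VAD, not chunk boundaries; the chunk size is an I/O detail.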

So I split my full audio (3:12) into N parts of 10 seconds each using ffmpeg, and with the pocketsphinx_continuous utility I got worse results than when running the same utility with the same parameters on the single entire file.

The recognizer needs a few seconds to estimate channel parameters (CMN); for that reason it is better to process the audio continuously, without restarts. The alternative is batch processing, where the audio parameters are estimated from the whole utterance at once. There is a full_utt parameter in ps_process_raw to perform batch processing.
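In code, batch mode is the degenerate case: one start_utt/end_utt pair and a single process_raw call over the whole file with full_utt set to True. A sketch of that pattern, with a small fake decoder standing in for a configured pocketsphinx Decoder so the example is self-contained (with the real bindings, hyp() returns an object whose .hypstr holds the text):

```python
def recognize_batch(audio_bytes, decoder):
    """Batch decoding: the whole audio goes through process_raw once with
    full_utt=True, so channel parameters (CMN) are estimated from the
    entire utterance instead of adapting over the first few seconds."""
    decoder.start_utt()
    decoder.process_raw(audio_bytes, False, True)  # buf, no_search, full_utt
    decoder.end_utt()
    return decoder.hyp()


class FakeDecoder:
    """Records the calls made to it; stands in for pocketsphinx.Decoder."""
    def __init__(self):
        self.calls = []

    def start_utt(self):
        self.calls.append("start_utt")

    def end_utt(self):
        self.calls.append("end_utt")

    def process_raw(self, buf, no_search, full_utt):
        self.calls.append(("process_raw", len(buf), no_search, full_utt))

    def hyp(self):
        return "hello world"


d = FakeDecoder()
print(recognize_batch(b"\x00" * 32000, d))
```

This trades latency for accuracy: nothing is recognized until the whole utterance is available, but the channel estimate covers all of it.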


novitoll avatar novitoll commented on July 1, 2024

it is better to process audio continuously without restarts

Ok, thanks. So I tried the code below; note that I start the utterance only once here.

So does this 1024 mean that 1024 bytes of audio are read in each loop iteration, and the Decoder works only on that 1024-byte chunk? I mean, there is no information about whether the chunk contains a full utterance or is cut off abruptly. Is this the same way pocketsphinx_continuous works? Does it use 1024 bytes too?

def recognize_audio(audio_file, args):
    # `decoder` and `hypothesis` are defined at module level (not shown)
    try:
        decoder.start_utt()  # utterance is started only once
        stream = open(audio_file, 'rb')
        in_speech_bf = False
        while True:
            buf = stream.read(args.chunk_size)
            if buf:
                decoder.process_raw(buf, False, False)  # no_search, full_utt
                if decoder.get_in_speech() != in_speech_bf:
                    in_speech_bf = decoder.get_in_speech()
                    if decoder.hyp() is not None:
                        # decoder.end_utt()
                        hypothesis.append(decoder.hyp().hypstr)
                        # decoder.start_utt()
            else:
                break
    except Exception as ex:
        print('Error occurred with %s\n%s' % (audio_file, ex))


nshmyrev avatar nshmyrev commented on July 1, 2024

Ok, thanks. So I tried the code below; note that I start the utterance only once here.

It is a bad idea to modify an example without understanding it. The original code is correct; your modification is wrong.

So does this 1024 mean that 1024 bytes of audio are read in each loop iteration, and the Decoder works only on that 1024-byte chunk?

The decoder remembers all audio processed since start_utt.

I mean, there is no information about whether the chunk contains a full utterance or is cut off abruptly.

Chunks are not full utterances, since full_utt is false.

Is this the same way pocketsphinx_continuous works?

Yes

Does it use 1024 bytes too?

It uses 4096 bytes.

