oliverguhr / wav2vec2-live

Live speech recognition using Facebook's wav2vec 2.0 model.

License: MIT License

Python 100.00%

speech-recognition wav2vec2 pyaudio wav2vec speech-to-text asr speech

wav2vec2-live's Introduction

automatic speech recognition with wav2vec2

Use any wav2vec model with a microphone.

Setup

I recommend installing this project in a virtual environment.

python3 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt

Depending on your Linux distribution, you might encounter an error saying that portaudio was not found when installing pyaudio. On Ubuntu you can solve this by installing the "portaudio19-dev" package.

sudo apt install portaudio19-dev

Finally, you can test the speech recognition:

python live_asr.py

Possible Issues:

The code uses the system's default audio device. Please make sure that your system's default audio device is set correctly.
"attempt to connect to server failed": you can safely ignore this message from pyaudio. It just means that pyaudio can't connect to the JACK audio server.

Usage

You can use any wav2vec2 model from the Hugging Face model hub. Just set the model name; all files will be downloaded on first execution.

from live_asr import LiveWav2Vec2

english_model = "facebook/wav2vec2-large-960h-lv60-self"
german_model = "maxidl/wav2vec2-large-xlsr-german"
asr = LiveWav2Vec2(german_model, device_name="default")
asr.start()

try:
    while True:
        # Transcribed text, audio length and inference time (both in seconds).
        text, sample_length, inference_time = asr.get_last_text()
        print(f"{sample_length:.3f}s\t{inference_time:.3f}s\t{text}")
except KeyboardInterrupt:
    asr.stop()

wav2vec2-live's Issues

use it with local model

Thanks a lot for your efforts. Could you please tell me how I can use it with a local model on my computer (without uploading the model to Hugging Face)? It's giving me this error message:
ValueError: too many values to unpack (expected 3)

break in little pause

I am using the default code, but it seems to finalize the input very quickly: if I take a short pause, it breaks the line and prints the rest on the next line. For example, if I say "how are you", it prints "how" and then "are" on the next line.
How can I reduce the sensitivity to pauses?
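
One common way to make pause detection less eager is a "hangover": an utterance is only closed after several consecutive non-speech frames. The sketch below is not taken from live_asr.py; the function name and the `max_silent_frames` parameter are hypothetical, and it only assumes that the voice activity detector yields one speech/non-speech flag per audio frame.

```python
from typing import Iterable, List, Tuple


def segment_with_hangover(
    speech_flags: Iterable[bool], max_silent_frames: int
) -> List[Tuple[int, int]]:
    """Group per-frame speech flags into (start, end) utterances.

    An utterance only ends after more than `max_silent_frames`
    consecutive non-speech frames, so short pauses stay inside it.
    `end` is exclusive and points just past the last speech frame.
    """
    segments = []
    start = None      # first frame of the currently open utterance
    last_speech = -1  # index of the most recent speech frame
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_silent_frames:
            segments.append((start, last_speech + 1))
            start = None
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments


# Hypothetical VAD output: two words separated by a two-frame pause.
flags = [True, True, False, False, True, True, False, False, False, False, True]
print(segment_with_hangover(flags, max_silent_frames=0))  # splits at every pause
print(segment_with_hangover(flags, max_silent_frames=2))  # tolerates short pauses
```

With a tolerance of two frames, the first two words end up in the same segment. The trade-off is latency: the text is emitted later, because the detector has to wait out the tolerated silence before closing an utterance.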

Inference error: shape mismatch

Hi, I was testing live_asr.py on my macOS Monterey (Python=3.8.11) under the following environment:

halo==0.0.31
numpy==1.21.4
PyAudio==0.2.11
Rx==3.2.0
SoundFile==0.10.3.post1
torch==1.10.0
torchaudio==0.10.0
transformers==4.8.2
webrtcvad==2.0.10

When I run python live_asr.py, I get the errors below:

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

listening to your voice

/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/feature_extraction_utils.py:158: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  tensor = as_tensor(value)
/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:986: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  return (input_length - kernel_size) // stride + 1
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jkang/Desktop/test_asr/wav2vec2-live/live_asr.py", line 92, in asr_process
    text = wave2vec_asr.buffer_to_text(float64_buffer).lower()
  File "/Users/jkang/Desktop/test_asr/wav2vec2-live/wav2vec2_inference.py", line 22, in buffer_to_text
    logits = self.model(inputs.input_values, attention_mask=torch.ones(len(inputs.input_values[0]))).logits
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1528, in forward
    return_dict=return_dict,
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1171, in forward
    return_dict=return_dict,
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 791, in forward
    hidden_states[~attention_mask] = 0
IndexError: The shape of the mask [1920, 5] at index 0 does not match the shape of the indexed tensor [1, 5, 1024] at index 0

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jkang/Desktop/test_asr/wav2vec2-live/live_asr.py", line 68, in vad_process
    frame = stream.read(CHUNK)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/pyaudio.py", line 608, in read
    return pa.read_stream(self._stream, num_frames, exception_on_overflow)
OSError: [Errno -9981] Input overflowed

I think the critical issue is this:

IndexError: The shape of the mask [1920, 5] at index 0 does not match the shape of the indexed tensor [1, 5, 1024] at index 0

I installed transformers==4.8.2, but the error still occurs, so my guess is that it is not related to the transformers version.

Could you help me find out what caused this error?

Thank you
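
The two numbers in the mask shape can be reproduced by hand, which hints at where the mismatch may come from. The sketch below applies the length formula from the warning above, (input_length - kernel_size) // stride + 1, to each convolutional layer of the feature extractor. The kernel sizes and strides are the wav2vec 2.0 defaults and are an assumption here; a given checkpoint may be configured differently.

```python
# Default convolutional feature-extractor layout of wav2vec 2.0
# (assumed here; a checkpoint's config may differ).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_feature_frames(num_samples: int) -> int:
    """Return how many feature frames the encoder sees for a raw-audio buffer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


print(num_feature_frames(1920))  # 5
print(num_feature_frames(2400))  # 7
```

A buffer of 1920 samples gives 5 frames, which matches the reported mask shape [1920, 5], and 2400 samples give 7 frames, as in the similar report [2400, 7] further down. One possible reading is that the one-dimensional mask built with torch.ones(len(inputs.input_values[0])) has its sample count interpreted as the batch dimension, whereas the model expects a mask of shape (batch, num_samples). If that is the cause, a mask shaped like the input, for example torch.ones_like(inputs.input_values), or no mask at all for a single unpadded buffer, might avoid the error; this has not been verified against the repository.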

How to improve the performance?

The default model that you shared in this repo is for English. I checked the performance, and it's not giving good results. How can I improve the model and reduce the WER?

Whisper Version

Your current library is great. Can you please provide a whisper-live version of your codebase?

using custom transformer models throw exception

Hi,

I use the model "m3hrdadfi/wav2vec2-large-xlsr-turkish" with live_asr.py, but I get the exception below:

torch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  return (input_length - kernel_size) // stride + 1
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Python38\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File ".\live_asr.py", line 89, in asr_process
    text = wave2vec_asr.buffer_to_text(float64_buffer).lower()
  File "D:\Dev\pertev\utils\wav2vec2_inference.py", line 22, in buffer_to_text
    logits = self.model(inputs.input_values, attention_mask=torch.ones(len(inputs.input_values[0]))).logits
  File "D:\Dev\pertev\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Dev\pertev\venv\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 1494, in forward
    outputs = self.wav2vec2(
  File "D:\Dev\pertev\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Dev\pertev\venv\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 1076, in forward
    encoder_outputs = self.encoder(
  File "D:\Dev\pertev\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Dev\pertev\venv\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 695, in forward
    hidden_states[~attention_mask] = 0
IndexError: The shape of the mask [2400, 7] at index 0 does not match the shape of the indexed tensor [1, 7, 1024] at index 0

Problem when running with combination of wav2vec2 model and 4-grams LM

Hello. I'm using your code with a model saved in a local directory (the model was downloaded from here). The model can be used with or without a 4-gram LM.
When I use the model without the LM, everything works.
But when I use the model with the 4-gram LM (the code combining wav2vec2 and the 4-gram LM is here), an error occurs when running kenlm.model(n_gram_path):
OSError: [Errno -9981] Input overflowed
Could you please check your code for this error?
Sorry for my bad English.

Optimisation for GPU with CUDA

Hello @oliverguhr! Thank you for your code, which is working fine on my CPU.
For practical reasons (another device with a weak CPU but a good GPU), I would like to be able to run it on the GPU, using CUDA. The problem is that when I run the code using

device = "cuda:0" if torch.cuda.is_available() else "cpu"
self.model.to(device)

I get "RuntimeError: CUDA error: out of memory", even though I have 2 GB of memory available and your code already uses with torch.no_grad():.

Do you know if the code can be adapted to use less memory, or how much memory would be needed for an inference?
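
A rough lower bound for the memory needed can be estimated from the parameter count alone. The sketch below assumes roughly 317 million parameters for the large wav2vec 2.0 models (an approximate figure, not measured on this repository) and only counts the weights; the CUDA context and the activations for each audio buffer come on top of that.

```python
def weight_memory_gib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory taken by the model weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Approximate parameter count of a "large" wav2vec 2.0 model (assumption).
LARGE_PARAMS = 317_000_000

fp32_gib = weight_memory_gib(LARGE_PARAMS, 4)  # 32-bit floats
fp16_gib = weight_memory_gib(LARGE_PARAMS, 2)  # 16-bit floats

print(f"float32 weights: {fp32_gib:.2f} GiB")
print(f"float16 weights: {fp16_gib:.2f} GiB")
```

Under these assumptions the float32 weights already take more than half of a 2 GB card, so an out-of-memory error is plausible once the CUDA context and activations are added. Converting the model to half precision (for example with model.half() in PyTorch) or choosing a smaller checkpoint might help, but neither has been tested with this code.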

offline mode feasibility

Please let me know whether this model will work in offline mode.
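
Since the model files are downloaded on first execution and cached, later runs can usually work without a network connection. The sketch below uses the offline switches documented for the Hugging Face libraries; whether the versions pinned by this project honor both variables is an assumption, and the model must already be in the local cache.

```python
import os

# Set these before transformers / huggingface_hub are imported,
# so that no network requests are attempted.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Afterwards, load the model as usual, for example:
# from live_asr import LiveWav2Vec2
# asr = LiveWav2Vec2("facebook/wav2vec2-large-960h-lv60-self", device_name="default")
```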

Internal PortAudio error on MacOS

Thanks for sharing this project!

I am on a mac, and I seem to have problems with opening the pyAudio stream here: https://github.com/oliverguhr/wav2vec2-live/blob/main/live_asr.py#L59-L64

Looking at the documentation: https://people.csail.mit.edu/hubert/pyaudio/docs/ I see that to use a microphone, a callback function is required.

I made sure that I am picking the correct device microphone here: https://github.com/oliverguhr/wav2vec2-live/blob/main/live_asr.py#L56-L57

The error I am getting is:

Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/othrif/spectrum/voice/experiment/wav2vec/042121/wav2vec2-live/live_asr.py", line 62, in vad_process
    stream = audio.open(input_device_index=selected_input_device_id,
  File "/Users/othrif/spectrum/voice/experiment/wav2vec/042121/wav2vec2-live/.venv/lib/python3.9/site-packages/pyaudio.py", line 750, in open
    stream = Stream(self, *args, **kwargs)
  File "/Users/othrif/spectrum/voice/experiment/wav2vec/042121/wav2vec2-live/.venv/lib/python3.9/site-packages/pyaudio.py", line 441, in __init__
    self._stream = pa.open(**arguments)
OSError: [Errno -9986] Internal PortAudio error

Any thoughts on what might be causing this?

Transformers Version Pre-4.8.2

Just an FYI: since this commit, wav2vec2-live is only compatible with transformers 4.8.2 and earlier.

Live ASR Engine with whisper

I have implemented LiveASREngine (not from scratch) with Whisper, based on the following codebase written by you:

https://github.com/oliverguhr/wav2vec2-live

The only change I made was in wav2vec2_inference.py, where I initialize the Whisper model with the Hugging Face pipeline.
My code: https://github.com/Dimlight/LiveASREngine
The problem I am facing now:

If I do not say anything and the entire room is silent, the engine continuously prints "you" or "thank you". I tested the system in a quiet room and still get the same issue.

Can anyone please help me understand what could cause this kind of result?
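
One mitigation that is sometimes used for spurious output on silence is an energy gate in front of the model: buffers that are too quiet are never sent to inference. The sketch below is not part of either codebase; the function names are hypothetical, the threshold of 500 is an arbitrary starting value that would need tuning per microphone, and the audio is assumed to be 16-bit little-endian mono PCM.

```python
import math
import struct


def rms_int16(frame: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def should_transcribe(frame: bytes, threshold: float = 500.0) -> bool:
    """Return True only if the buffer is loud enough to be worth transcribing."""
    return rms_int16(frame) >= threshold


silence = struct.pack("<4h", 0, 0, 0, 0)
speech_like = struct.pack("<4h", 1000, -1000, 1000, -1000)
print(should_transcribe(silence))      # False
print(should_transcribe(speech_like))  # True
```

Such a gate does not explain why the model produces text for silent input; it only keeps near-silent buffers away from it.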

onnx-quantized model

Hello, thanks for your great work.
Does this repo support ONNX quantized models?

Performance is not good, only for me?

Hi, I'm using the code as given and get no errors, only one warning, but the performance is terrible. Am I the only one experiencing this issue?
By terrible I mean that it is mostly hard to see any connection between what was said and the transcription.
The warning I'm getting is:
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Does anyone have an idea?

mms model

Hi,
does this repo support the MMS model (which is like wav2vec 2.0 but has more transformer layers)?