wav2vec2-live's Introduction

automatic speech recognition with wav2vec2

Use any wav2vec model with a microphone.

Setup

I recommend installing this project in a virtual environment.

python3 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt

Depending on your Linux distribution, you might encounter an error that portaudio was not found when installing pyaudio. On Ubuntu, you can solve this by installing the "portaudio19-dev" package.

sudo apt install portaudio19-dev

Finally, you can test the speech recognition:

python live_asr.py

Possible Issues:

  • The code uses the system's default audio device. Please make sure that your system's default audio device is set correctly.

  • "attempt to connect to server failed": you can safely ignore this message from pyaudio. It just means that pyaudio can't connect to the JACK audio server.
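If the default device is not the microphone you want, it can help to look up the device by name first. The sketch below is only an illustration: `find_input_device` and the example list are hypothetical and not part of live_asr.py. It assumes device descriptions shaped like the dicts that PyAudio's `get_device_info_by_index(i)` returns for each `i` below `get_device_count()`, with the keys `index`, `name` and `maxInputChannels`.

```python
from typing import Dict, List, Optional


def find_input_device(devices: List[Dict], wanted_name: str) -> Optional[int]:
    """Return the index of the first input-capable device whose name contains wanted_name."""
    for info in devices:
        # Devices without input channels are playback-only and cannot record.
        if info.get("maxInputChannels", 0) < 1:
            continue
        if wanted_name.lower() in str(info.get("name", "")).lower():
            return int(info["index"])
    return None


# Hypothetical device list, shaped like PyAudio device info dicts.
example_devices = [
    {"index": 0, "name": "HDMI Output", "maxInputChannels": 0},
    {"index": 1, "name": "USB Microphone", "maxInputChannels": 1},
    {"index": 2, "name": "default", "maxInputChannels": 32},
]

print(find_input_device(example_devices, "default"))
```

A result of `None` means no recording device matched the name, which is a hint to check the system audio settings.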

Usage

You can use any wav2vec2 model from the Hugging Face model hub. Just set the model name; all files are downloaded on first execution.

from live_asr import LiveWav2Vec2

english_model = "facebook/wav2vec2-large-960h-lv60-self"
german_model = "maxidl/wav2vec2-large-xlsr-german"

# Pass any model name from the hub and the name of the audio input device.
asr = LiveWav2Vec2(german_model, device_name="default")
asr.start()

try:
    while True:
        # Unpack the transcript, the length of the audio sample and the inference time.
        text, sample_length, inference_time = asr.get_last_text()
        print(f"{sample_length:.3f}s\t{inference_time:.3f}s\t{text}")
except KeyboardInterrupt:
    # Ctrl+C stops the recognizer.
    asr.stop()

wav2vec2-live's People

Contributors

oliverguhr, programadorartificial, voidful

wav2vec2-live's Issues

use it with local model

Thanks a lot for your efforts. Could you please tell me how I can use it with a local model on my computer (without uploading the model to Hugging Face)? It gives me this error message:
ValueError: too many values to unpack (expected 3)

break in little pause

I am using the default code, but it seems to end an utterance very quickly. If I pause briefly, it breaks the line and prints the rest on the next line. For example, if I say "how are you", it outputs "how" and then "are" on the next line.
How can I reduce this sensitivity?
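One common way to make a recogniser more tolerant of short pauses is to close an utterance only after several consecutive silent frames. The sketch below illustrates the idea on a list of per-frame speech decisions (such as a voice activity detector like webrtcvad might produce). It is not the code from live_asr.py, and `split_utterances` and `max_silent_frames` are hypothetical names.

```python
from typing import List


def split_utterances(speech_flags: List[bool], max_silent_frames: int) -> List[List[int]]:
    """Group the indices of speech frames into utterances, tolerating short silent gaps."""
    utterances: List[List[int]] = []
    current: List[int] = []
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            current.append(i)
            silent_run = 0
        elif current:
            # Count silence only while an utterance is open.
            silent_run += 1
            if silent_run > max_silent_frames:
                utterances.append(current)
                current = []
                silent_run = 0
    if current:
        utterances.append(current)
    return utterances


flags = [True, True, False, True, True, False, False, False, True]
print(split_utterances(flags, max_silent_frames=0))
print(split_utterances(flags, max_silent_frames=2))
```

With `max_silent_frames=0` every pause splits the speech, while a larger value keeps words separated by a short pause in the same utterance.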

Inference error: shape mismatch

Hi, I was testing live_asr.py on macOS Monterey (Python=3.8.11) with the following environment:

halo==0.0.31
numpy==1.21.4
PyAudio==0.2.11
Rx==3.2.0
SoundFile==0.10.3.post1
torch==1.10.0
torchaudio==0.10.0
transformers==4.8.2
webrtcvad==2.0.10

When I run python live_asr.py, I get the errors below:

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

listening to your voice

/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/feature_extraction_utils.py:158: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  tensor = as_tensor(value)
/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:986: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  return (input_length - kernel_size) // stride + 1
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jkang/Desktop/test_asr/wav2vec2-live/live_asr.py", line 92, in asr_process
    text = wave2vec_asr.buffer_to_text(float64_buffer).lower()
  File "/Users/jkang/Desktop/test_asr/wav2vec2-live/wav2vec2_inference.py", line 22, in buffer_to_text
    logits = self.model(inputs.input_values, attention_mask=torch.ones(len(inputs.input_values[0]))).logits
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1528, in forward
    return_dict=return_dict,
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1171, in forward
    return_dict=return_dict,
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 791, in forward
    hidden_states[~attention_mask] = 0
IndexError: The shape of the mask [1920, 5] at index 0 does not match the shape of the indexed tensor [1, 5, 1024] at index 0

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jkang/Desktop/test_asr/wav2vec2-live/live_asr.py", line 68, in vad_process
    frame = stream.read(CHUNK)
  File "/Users/jkang/anaconda3/envs/jk/lib/python3.7/site-packages/pyaudio.py", line 608, in read
    return pa.read_stream(self._stream, num_frames, exception_on_overflow)
OSError: [Errno -9981] Input overflowed

I think the critical issue is this:

IndexError: The shape of the mask [1920, 5] at index 0 does not match the shape of the indexed tensor [1, 5, 1024] at index 0

I installed transformers==4.8.2, but the error still occurs, so my guess is that it is not related to the transformers version.

Could you help me find out what caused this error?

Thank you
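A note on the shapes in this error: the traceback shows the mask being built with `torch.ones(len(inputs.input_values[0]))`, which is one-dimensional with one entry per audio sample. The reported mask shape [1920, 5] suggests that this length was then treated as a batch size. The arithmetic below supports that reading. It applies the formula from the traceback, `(input_length - kernel_size) // stride + 1`, once per convolution layer, assuming the default wav2vec2 feature encoder layout (an assumption; a given checkpoint may differ).

```python
# Assumed default wav2vec2 feature encoder: kernel size and stride per conv layer.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def feature_frames(num_samples: int) -> int:
    """Number of feature frames the encoder produces for num_samples audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


print(feature_frames(1920))  # 5
print(feature_frames(2400))  # 7
```

So 1920 samples give 5 frames, matching [1920, 5], and 2400 samples give 7 frames, matching the [2400, 7] shape in a later report in this list. A possible fix, untested here, is to pass a mask with the same two-dimensional shape as the input, for example `torch.ones_like(inputs.input_values)`, or to leave out the `attention_mask` argument.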

How to improve the performance?

The default model shared in this repo is for English. I checked its performance, and it does not give good results. How can I improve the model and reduce the WER?
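Before comparing models, it helps to measure the WER on a few recordings with known transcripts. A minimal sketch of the usual definition follows: the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative and not part of this repo.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("how are you today", "how are you"))  # 0.25
```

Running this over the same test sentences for each candidate model gives a like-for-like comparison.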

Whisper Version

Your current library is great. Can you please provide a whisper-live version of your codebase?

using custom transformer models throw exception

Hi,

I use the model "m3hrdadfi/wav2vec2-large-xlsr-turkish" with live_asr.py, but I get the exception below:

torch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  return (input_length - kernel_size) // stride + 1
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Python38\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File ".\live_asr.py", line 89, in asr_process
    text = wave2vec_asr.buffer_to_text(float64_buffer).lower()
  File "D:\Dev\pertev\utils\wav2vec2_inference.py", line 22, in buffer_to_text
    logits = self.model(inputs.input_values, attention_mask=torch.ones(len(inputs.input_values[0]))).logits
  File "D:\Dev\pertev\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Dev\pertev\venv\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 1494, in forward
    outputs = self.wav2vec2(
  File "D:\Dev\pertev\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Dev\pertev\venv\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 1076, in forward
    encoder_outputs = self.encoder(
  File "D:\Dev\pertev\venv\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Dev\pertev\venv\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 695, in forward
    hidden_states[~attention_mask] = 0
IndexError: The shape of the mask [2400, 7] at index 0 does not match the shape of the indexed tensor [1, 7, 1024] at index 0

Problem when running with combination of wav2vec2 model and 4-grams LM

Hello. I'm using your code with a model saved in a local directory (the model is downloaded from here). This model can be used with or without a 4-gram LM.
When I use the model without the LM, everything works.
But when I use the model with the 4-gram LM (the code for combining wav2vec2 and the 4-gram LM is here), an error occurs while running kenlm.model(n_gram_path):
OSError: [Errno -9981] Input overflowed
Can you please check your code for this error?
Sorry for my bad English.

Optimisation for GPU with CUDA

Hello @oliverguhr ! Thank you for your code, which is working fine on my CPU.
For practical reasons (another device with poor CPU but good GPU), I would like to be able to run it on GPU, using CUDA. The problem is that when I run the code using

device = "cuda:0" if torch.cuda.is_available() else "cpu"
self.model.to(device)

I get a RuntimeError: CUDA error: out of memory error, even though I have 2 GB of memory available and your code already uses with torch.no_grad():.

Do you know if the code can be adapted to use less memory, or how much memory would be needed for an inference?
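A rough lower bound on the memory needed can be estimated from the parameter count alone. The sketch below assumes a large wav2vec2 checkpoint with roughly 317 million parameters (an approximation; the exact count depends on the checkpoint) and ignores activations, the CUDA context and any framework overhead, all of which add to the total.

```python
def weight_memory_gib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory taken by the model weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Assumed parameter count for a large wav2vec2 model (approximate).
LARGE_PARAMS = 317_000_000

print(f"float32 weights: {weight_memory_gib(LARGE_PARAMS, 4):.2f} GiB")
print(f"float16 weights: {weight_memory_gib(LARGE_PARAMS, 2):.2f} GiB")
```

Under these assumptions the float32 weights already take more than half of a 2 GB card, which leaves little room for everything else. Converting the model with `model.half()` or choosing a smaller checkpoint might help, but neither was tested with this code.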

Internal PortAudio error on MacOS

Thanks for sharing this project!

I am on a mac, and I seem to have problems with opening the pyAudio stream here: https://github.com/oliverguhr/wav2vec2-live/blob/main/live_asr.py#L59-L64

Looking at the documentation: https://people.csail.mit.edu/hubert/pyaudio/docs/ I see that to use a microphone, a callback function is required.

I made sure that I am picking the correct device microphone here: https://github.com/oliverguhr/wav2vec2-live/blob/main/live_asr.py#L56-L57

The error I am getting is:

Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python@3.9/3.9.2_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/othrif/spectrum/voice/experiment/wav2vec/042121/wav2vec2-live/live_asr.py", line 62, in vad_process
    stream = audio.open(input_device_index=selected_input_device_id,
  File "/Users/othrif/spectrum/voice/experiment/wav2vec/042121/wav2vec2-live/.venv/lib/python3.9/site-packages/pyaudio.py", line 750, in open
    stream = Stream(self, *args, **kwargs)
  File "/Users/othrif/spectrum/voice/experiment/wav2vec/042121/wav2vec2-live/.venv/lib/python3.9/site-packages/pyaudio.py", line 441, in __init__
    self._stream = pa.open(**arguments)
OSError: [Errno -9986] Internal PortAudio error

Any thoughts on what might be causing this?

Live ASR Engine with whisper

I have implemented (not from scratch) LiveASREngine with Whisper, based on the following codebase written by you:

https://github.com/oliverguhr/wav2vec2-live

The only change I made was in wav2vec2_inference.py: I initialized the Whisper model with the Hugging Face pipeline.
My code: https://github.com/Dimlight/LiveASREngine
The problem I am facing now:

If I do not say anything and the entire room is silent, the engine continuously prints "you" or "thank you". I tested the system in a quiet room and still get the same issue.

Can anyone please help me understand what might cause this kind of result?
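One possible explanation, not confirmed for this setup, is that near-silent buffers still reach the model, which then produces short filler phrases. A simple guard is to skip buffers whose energy is below a threshold. The sketch below assumes 16-bit little-endian mono PCM bytes; the function names and the threshold of 300 are hypothetical and would need tuning for a real microphone.

```python
import math
import struct


def rms_int16(pcm_bytes: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM data."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_probably_silence(pcm_bytes: bytes, threshold: float = 300.0) -> bool:
    """True if the buffer is quieter than the (assumed) threshold."""
    return rms_int16(pcm_bytes) < threshold


quiet = struct.pack("<4h", 10, -12, 8, -9)
loud = struct.pack("<4h", 4000, -3500, 3800, -4200)
print(is_probably_silence(quiet), is_probably_silence(loud))  # True False
```

Buffers flagged as silence would simply not be passed to the model, so no text is printed for them.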

onnx-quantized model

Hello, thanks for your great work.
Does this repo support ONNX quantized models?

Performance is not good, only for me?

Hi, I'm using the code as given. There are no errors and only one warning, but the performance is terrible. Am I the only one experiencing this issue?
By terrible I mean that it is mostly hard to see the connection between what was said and the transcription.
The warning I'm getting is:
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Does anyone have an idea?

mms model

Hi,
Does this repo support MMS models (which are like wav2vec 2.0 but have more transformer layers)?
