
realtimestt's Introduction

RealtimeSTT

Easy-to-use, low-latency speech-to-text library for realtime applications

New

Custom wake words are now supported via OpenWakeWord. Thanks to the developers of that project!

About the Project

RealtimeSTT listens to the microphone and transcribes voice into text.

Hint: Check out Linguflex, the original project from which RealtimeSTT is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

It's ideal for:

  • Voice Assistants
  • Applications requiring fast and precise speech-to-text conversion
(Demo video: RealtimeSTT.mp4)

Updates

Latest Version: v0.2.2

See release history.

Hint: Since we now use the multiprocessing module, make sure you include the if __name__ == '__main__': protection in your code to prevent unexpected behavior, especially on platforms like Windows. For a detailed explanation of why this is important, see the official Python documentation on multiprocessing.
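
For example, a minimal sketch of the guard in use (any recorder setup belongs below it):

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    # The guard prevents child processes from re-executing this module's
    # top-level code when they are spawned (the default on Windows).
    recorder = AudioToTextRecorder()
    print(recorder.text())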

Features

  • Voice Activity Detection: Automatically detects when you start and stop speaking.
  • Realtime Transcription: Transforms speech to text in real-time.
  • Wake Word Activation: Can activate upon detecting a designated wake word.

Hint: Check out RealtimeTTS, the output counterpart of this library, for text-to-voice capabilities. Together, they form a powerful realtime audio wrapper around large language models.

Tech Stack

This library uses:

  • Voice Activity Detection: WebRTC VAD for initial voice activity detection, refined by Silero VAD for more accurate verification.
  • Speech-To-Text: faster_whisper for instant (GPU-accelerated) transcription.
  • Wake Word Detection: Porcupine or OpenWakeWord for wake word activation.

These components represent the "industry standard" for cutting-edge applications, providing the most modern and effective foundation for building high-end solutions.

Installation

pip install RealtimeSTT

This will install all the necessary dependencies, including a CPU-only version of PyTorch.

Although it is possible to run RealtimeSTT with a CPU-only installation (use a small model like "tiny" or "base" in this case), you will get a much better experience using:

GPU Support with CUDA (recommended)

Updating PyTorch for CUDA Support

To upgrade your PyTorch installation to enable GPU support with CUDA, follow these instructions based on your specific CUDA version. This is useful if you wish to enhance the performance of RealtimeSTT with CUDA capabilities.

For CUDA 11.8:

To update PyTorch and Torchaudio to support CUDA 11.8, use the following commands:

pip install torch==2.3.1+cu118 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

For CUDA 12.X:

To update PyTorch and Torchaudio to support CUDA 12.X, execute the following:

pip install torch==2.3.1+cu121 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121

Replace 2.3.1 with the version of PyTorch that matches your system and requirements.
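
After reinstalling, a quick sanity check (a minimal sketch) confirms the CUDA-enabled PyTorch build is active:

import torch

print(torch.__version__)           # should show a +cu118 or +cu121 suffix
print(torch.cuda.is_available())   # True if PyTorch can see the GPU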

Steps That Might Be Necessary Beforehand

Note: To check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list.

If you didn't use CUDA models before, some additional one-time steps might be needed before installation. These steps prepare your system for CUDA support and for the GPU-optimized installation. This is recommended for those who require better performance and have a compatible NVIDIA GPU. To use RealtimeSTT with GPU support via CUDA, please also follow these steps:

  1. Install NVIDIA CUDA Toolkit: select the CUDA 11.8 or CUDA 12.X Toolkit matching the PyTorch build you chose above, then download and install it from NVIDIA.

  2. Install NVIDIA cuDNN:

    • select between the CUDA 11.8 or CUDA 12.X version:
      • for 12.X visit cuDNN Downloads.
        • Select your operating system and version.
        • Download and install the software.
      • for 11.8 visit the NVIDIA cuDNN Archive.
        • Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
        • Download and install the software.
  3. Install ffmpeg:

    Note: Installation of ffmpeg might not actually be needed to operate RealtimeSTT (thanks to jgilbert2017 for pointing this out).

    You can download an installer for your OS from the ffmpeg Website.

    Or use a package manager:

Quick Start

Basic usage:

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder()

Manual Recording

Recording start and stop are triggered manually.

recorder.start()
recorder.stop()
print(recorder.text())

Automatic Recording

Recording based on voice activity detection.

with AudioToTextRecorder() as recorder:
    print(recorder.text())

When calling recorder.text in a loop it is recommended to use a callback, allowing the transcription to run asynchronously:

def process_text(text):
    print(text)
    
while True:
    recorder.text(process_text)

Wakewords

Keyword activation before voice detection. Write a comma-separated list of your desired activation keywords into the wake_words parameter. You can choose wake words from this list: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.

recorder = AudioToTextRecorder(wake_words="jarvis")

print('Say "Jarvis" then speak.')
print(recorder.text())

Callbacks

You can set callback functions to be executed on different events (see Configuration):

def my_start_callback():
    print("Recording started!")

def my_stop_callback():
    print("Recording stopped!")

recorder = AudioToTextRecorder(on_recording_start=my_start_callback,
                               on_recording_stop=my_stop_callback)

Feed chunks

If you don't want to use the local microphone, set the use_microphone parameter to False and provide raw PCM audio chunks in 16-bit mono (sample rate 16,000 Hz) with this method:

recorder.feed_audio(audio_chunk)
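
A minimal sketch of feeding chunks from a file (assuming "example.wav" already contains 16-bit mono PCM at 16 kHz; the filename is a placeholder):

import wave

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorder = AudioToTextRecorder(use_microphone=False)
    with wave.open('example.wav', 'rb') as wav_file:
        data = wav_file.readframes(1024)
        while data:
            recorder.feed_audio(data)           # raw 16-bit mono PCM chunks
            data = wav_file.readframes(1024)
    print(recorder.text())
    recorder.shutdown()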

Shutdown

You can shut down the recorder safely by using the context manager protocol:

with AudioToTextRecorder() as recorder:
    [...]

Or you can call the shutdown method manually (if using "with" is not feasible):

recorder.shutdown()

Testing the Library

The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeSTT library.

Test scripts that depend on the RealtimeTTS library may require you to enter your Azure service region within the script. When using OpenAI-, Azure-, or Elevenlabs-related demo scripts, the API keys should be provided in the environment variables OPENAI_API_KEY, AZURE_SPEECH_KEY, and ELEVENLABS_API_KEY (see RealtimeTTS).

  • simple_test.py

    • Description: A "hello world" styled demonstration of the library's simplest usage.
  • realtimestt_test.py

    • Description: Showcasing live-transcription.
  • wakeword_test.py

    • Description: A demonstration of the wakeword activation.
  • translator.py

    • Dependencies: Run pip install openai realtimetts.
    • Description: Real-time translations into six different languages.
  • openai_voice_interface.py

    • Dependencies: Run pip install openai realtimetts.
    • Description: Wake word activated and voice based user interface to the OpenAI API.
  • advanced_talk.py

    • Dependencies: Run pip install openai keyboard realtimetts.
    • Description: Choose TTS engine and voice before starting AI conversation.
  • minimalistic_talkbot.py

    • Dependencies: Run pip install openai realtimetts.
    • Description: A basic talkbot in 20 lines of code.

The example_app subdirectory contains a polished user interface application for the OpenAI API based on PyQt5.

Configuration

Initialization Parameters for AudioToTextRecorder

When you initialize the AudioToTextRecorder class, you have various options to customize its behavior.

General Parameters

  • model (str, default="tiny"): Model size or path for transcription.

    • Options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.
    • Note: If a size is provided, the model will be downloaded from the Hugging Face Hub.
  • language (str, default=""): Language code for transcription. If left empty, the model will try to auto-detect the language. Supported language codes are listed in the Whisper tokenizer library.

  • compute_type (str, default="default"): Specifies the type of computation to be used for transcription. See Whisper Quantization

  • input_device_index (int, default=0): Audio Input Device Index to use.

  • gpu_device_index (int, default=0): GPU Device Index to use. The model can also be loaded on multiple GPUs by passing a list of IDs (e.g. [0, 1, 2, 3]).

  • device (str, default="cuda"): Device for model to use. Can either be "cuda" or "cpu".

  • on_recording_start: A callable function triggered when recording starts.

  • on_recording_stop: A callable function triggered when recording ends.

  • on_transcription_start: A callable function triggered when transcription starts.

  • ensure_sentence_starting_uppercase (bool, default=True): Ensures that every sentence detected by the algorithm starts with an uppercase letter.

  • ensure_sentence_ends_with_period (bool, default=True): Ensures that every sentence that doesn't end with punctuation such as "?" or "!" ends with a period.

  • use_microphone (bool, default=True): Usage of local microphone for transcription. Set to False if you want to provide chunks with feed_audio method.

  • spinner (bool, default=True): Provides a spinner animation text with information about the current recorder state.

  • level (int, default=logging.WARNING): Logging level.

  • handle_buffer_overflow (bool, default=True): If set, the system will log a warning when an input overflow occurs during recording and remove the data from the buffer.

  • beam_size (int, default=5): The beam size to use for beam search decoding.

  • initial_prompt (str or iterable of int, default=None): Initial prompt to be fed to the transcription models.

  • suppress_tokens (list of int, default=[-1]): Tokens to be suppressed from the transcription output.

  • on_recorded_chunk: A callback function that is triggered when a chunk of audio is recorded. Submits the chunk data as parameter.

  • debug_mode (bool, default=False): If set, the system prints additional debug information to the console.
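
A short configuration sketch combining several of the general parameters above (the values are illustrative, not recommendations):

from RealtimeSTT import AudioToTextRecorder

def on_chunk(chunk):
    # Receives the raw audio chunk that was just recorded.
    print(f"recorded {len(chunk)} bytes")

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        model="base.en",
        language="en",
        beam_size=5,
        on_recorded_chunk=on_chunk,
    )
    print(recorder.text())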

Real-time Transcription Parameters

Note: When enabling realtime transcription, a GPU installation is strongly advised, since realtime transcription may create high GPU load.

  • enable_realtime_transcription (bool, default=False): Enables or disables real-time transcription of audio. When set to True, the audio will be transcribed continuously as it is being recorded.

  • realtime_model_type (str, default="tiny"): Specifies the size or path of the machine learning model to be used for real-time transcription.

    • Valid options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.
  • realtime_processing_pause (float, default=0.2): Specifies the time interval in seconds after a chunk of audio gets transcribed. Lower values will result in more "real-time" (frequent) transcription updates but may increase computational load.

  • on_realtime_transcription_update: A callback function that is triggered whenever there's an update in the real-time transcription. The function is called with the newly transcribed text as its argument.

  • on_realtime_transcription_stabilized: A callback function that is triggered whenever there's an update in the real-time transcription; it is called with a higher-quality, stabilized text as its argument.

  • beam_size_realtime (int, default=3): The beam size to use for real-time transcription beam search decoding.
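
For example, a minimal sketch wiring up the real-time callbacks (the values are illustrative):

from RealtimeSTT import AudioToTextRecorder

def on_update(text):
    # Called with the newest (possibly still unstable) partial transcription.
    print("\r" + text, end="", flush=True)

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        enable_realtime_transcription=True,
        realtime_model_type="tiny",
        realtime_processing_pause=0.2,
        on_realtime_transcription_update=on_update,
    )
    while True:
        print("\n" + recorder.text())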

Voice Activation Parameters

  • silero_sensitivity (float, default=0.6): Sensitivity for Silero's voice activity detection ranging from 0 (least sensitive) to 1 (most sensitive). Default is 0.6.

  • silero_use_onnx (bool, default=False): Enables usage of the pre-trained model from Silero in the ONNX (Open Neural Network Exchange) format instead of the PyTorch format. Default is False. Recommended for faster performance.

  • silero_deactivity_detection (bool, default=False): Enables the Silero model for end-of-speech detection. More robust against background noise. Utilizes additional GPU resources but improves accuracy in noisy environments. When False, uses the default WebRTC VAD, which is more sensitive but may continue recording longer due to background sounds.

  • webrtc_sensitivity (int, default=3): Sensitivity for the WebRTC Voice Activity Detection engine ranging from 0 (least aggressive / most sensitive) to 3 (most aggressive, least sensitive). Default is 3.

  • post_speech_silence_duration (float, default=0.2): Duration in seconds of silence that must follow speech before the recording is considered to be completed. This ensures that any brief pauses during speech don't prematurely end the recording.

  • min_gap_between_recordings (float, default=1.0): Specifies the minimum time interval in seconds that should exist between the end of one recording session and the beginning of another to prevent rapid consecutive recordings.

  • min_length_of_recording (float, default=1.0): Specifies the minimum duration in seconds that a recording session should last to ensure meaningful audio capture, preventing excessively short or fragmented recordings.

  • pre_recording_buffer_duration (float, default=0.2): The time span, in seconds, during which audio is buffered prior to formal recording. This helps counterbalance the latency inherent in speech activity detection, ensuring no initial audio is missed.

  • on_vad_detect_start: A callable function triggered when the system starts to listen for voice activity.

  • on_vad_detect_stop: A callable function triggered when the system stops listening for voice activity.
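
A sketch of tuning voice activation for a noisier environment (starting values borrowed from configurations that appear elsewhere in this document, not official recommendations):

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        silero_sensitivity=0.4,             # lower = less sensitive to noise
        webrtc_sensitivity=2,               # 0-3; higher = less sensitive
        silero_deactivity_detection=True,   # more robust end-of-speech detection
        post_speech_silence_duration=0.7,   # tolerate longer pauses mid-speech
    )
    print(recorder.text())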

Wake Word Parameters

  • wakeword_backend (str, default="pvporcupine"): Specifies the backend library to use for wake word detection. Supported options include 'pvporcupine' for using the Porcupine wake word engine or 'oww' for using the OpenWakeWord engine.

  • openwakeword_model_paths (str, default=None): Comma-separated paths to model files for the openwakeword library. These paths point to custom models that can be used for wake word detection when the openwakeword library is selected as the wakeword_backend.

  • openwakeword_inference_framework (str, default="onnx"): Specifies the inference framework to use with the openwakeword library. Can be either 'onnx' for Open Neural Network Exchange format or 'tflite' for TensorFlow Lite.

  • wake_words (str, default=""): Wake word(s) that initiate recording when using the 'pvporcupine' wakeword backend. Multiple wake words can be provided as a comma-separated string. Supported wake words are: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator. For the 'openwakeword' backend, wake words are automatically extracted from the provided model files, so specifying them here is not necessary.

  • wake_words_sensitivity (float, default=0.6): Sensitivity level for wake word detection (0 for least sensitive, 1 for most sensitive).

  • wake_word_activation_delay (float, default=0): Duration in seconds after the start of monitoring before the system switches to wake word activation if no voice is initially detected. If set to zero, the system uses wake word activation immediately.

  • wake_word_timeout (float, default=5): Duration in seconds after a wake word is recognized. If no subsequent voice activity is detected within this window, the system transitions back to an inactive state, awaiting the next wake word or voice activation.

  • wake_word_buffer_duration (float, default=0.1): Duration in seconds to buffer audio data during wake word detection. This helps in cutting out the wake word from the recording buffer so it does not falsely get detected along with the following spoken text, ensuring cleaner and more accurate transcription start triggers. Increase this if parts of the wake word get detected as text.

  • on_wakeword_detected: A callable function triggered when a wake word is detected.

  • on_wakeword_timeout: A callable function triggered when the system goes back to an inactive state because no speech was detected after wake word activation.

  • on_wakeword_detection_start: A callable function triggered when the system starts listening for wake words.

  • on_wakeword_detection_end: A callable function triggered when the system stops listening for wake words (e.g. because of a timeout or because a wake word was detected).

OpenWakeWord

Training models

Look here for information about how to train your own OpenWakeWord models. You can use a simple Google Colab notebook for a start or use a more detailed notebook that enables more customization (can produce high quality models, but requires more development experience).

Convert model to ONNX format

You might need to use tf2onnx to convert TensorFlow Lite (tflite) models to the ONNX format:

pip install -U tf2onnx
python -m tf2onnx.convert --tflite my_model_filename.tflite --output my_model_filename.onnx

Configure RealtimeSTT

Suggested starting parameters for OpenWakeWord usage:

with AudioToTextRecorder(
    wakeword_backend="oww",
    wake_words_sensitivity=0.35,
    openwakeword_model_paths="word1.onnx,word2.onnx",
    wake_word_buffer_duration=1,
) as recorder:

Contribution

Contributions are always welcome!

Shoutout to Steven Linn for providing docker support.

License

MIT

Author

Kolja Beigel
Email: [email protected]
GitHub


realtimestt's Issues

Simple test not working

Hi, I'm trying to use this library, but it doesn't seem to be working for me. I suspect it's because I installed the GPU variant, but I'm not entirely sure, because for the last step (cuDNN) there is no installer on Windows. Here is the error I get when running it:

Say something...
Process Process-1:
Traceback (most recent call last):
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 578, in _transcription_worker
transcription = " ".join(seg.text for seg in segments)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 578, in
transcription = " ".join(seg.text for seg in segments)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\faster_whisper\transcribe.py", line 508, in generate_segments
encoder_output = self.encode(segment)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\faster_whisper\transcribe.py", line 767, in encode
return self.model.encode(features, to_cpu=to_cpu)
RuntimeError: Library cublas64_12.dll is not found or cannot be loaded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py", line 314, in _bootstrap
self.run()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 581, in _transcription_worker
except faster_whisper.WhisperError as e:
AttributeError: module 'faster_whisper' has no attribute 'WhisperError'
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\connection.py", line 312, in _recv_bytes
nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\Users\rober\RealtimeSTT\tests\realtimestt_test.py", line 6, in
while (True): print(recorder.text(), end=" ", flush=True)
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 825, in text
return self.transcribe()
File "C:\Users\rober\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 777, in transcribe
status, result = self.parent_transcription_pipe.recv()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\connection.py", line 250, in recv
buf = self._recv_bytes()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\multiprocessing\connection.py", line 321, in _recv_bytes
raise EOFError
EOFError

TTS

Impressive work - do you have any insight on applying the same methodology for TTS?

Shutdown issue

Hello, I have a problem with the shutdown method when using use_microphone=False; it always gets stuck on:

    logging.debug('Finishing recording thread')
    if self.recording_thread:
        self.recording_thread.join()

Example code:

if __name__ == '__main__':
    import pyaudio
    import threading
    from RealtimeSTT import AudioToTextRecorder
    import wave
    import time

    import logging


    recorder = None
    recorder_ready = threading.Event()

    recorder_config = {
        'spinner': False,
        'use_microphone': False,
        'model': "tiny.en",
        'language': 'en',
        'silero_sensitivity': 0.4,
        'webrtc_sensitivity': 2,
        'post_speech_silence_duration': 0.7,
        'min_length_of_recording': 0,
        'min_gap_between_recordings': 0
    }

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    CHUNK = 1024

    REALTIMESTT = True


    def recorder_thread():
        global recorder
        print("Initializing RealtimeSTT...")
        recorder = AudioToTextRecorder(**recorder_config,level=logging.DEBUG)
        print("RealtimeSTT initialized")
        recorder_ready.set()
        while True:
            full_sentence = recorder.text()
            if full_sentence:
                print(f"\rSentence: {full_sentence}")




    recorder_thread = threading.Thread(target=recorder_thread)
    recorder_thread.start()
    recorder_ready.wait()
    with wave.open('Iiterviewing.wav', 'rb') as wav_file:
        assert wav_file.getnchannels() == CHANNELS
        assert wav_file.getsampwidth() == pyaudio.get_sample_size(FORMAT)
        assert wav_file.getframerate() == RATE
        data = wav_file.readframes(CHUNK)
        while data:
            time.sleep(0.1)
            recorder.feed_audio(data)
            data = wav_file.readframes(CHUNK)
    print("before")
    recorder.shutdown()
    print("after")


Do I actually need NVIDIA CUDA 12 rather than 11.8?

I just tried the instructions for implementing GPU support, i.e. installing NVIDIA CUDA Toolkit 11.8 and NVIDIA cuDNN 8.7.0 for CUDA 11.x, specifically, as per the readme (which involves some manual file moving and PATH environment updating on Windows, per Nvidia's instructions at https://docs.nvidia.com/deeplearning/cudnn/installation/windows.html). I restarted since then to ensure PATH updated properly everywhere, confirmed it can be found in the proper place from my venv via which cublas64_11.dll (I use Git Bash for my terminal in VSCode, but where via cmd works too), and then ran the pytorch uninstall/reinstall pip commands specifying CUDA 11.8.

Then I tried to run a super simple test script:

from RealtimeSTT import AudioToTextRecorder

def process_text(text):
  print(text, end=" ", flush=True)

if __name__ == '__main__':
  with AudioToTextRecorder(
    spinner=False,
    model="tiny.en",
    language="en",
    # enable_realtime_transcription=True,
    realtime_model_type="tiny.en"
  ) as recorder:
    print("Say something...")
    while True:
      recorder.text(process_text)

But got this error:

Exception: Library cublas64_12.dll is not found or cannot be loaded

The stack trace stops in RealtimeSTT so I manually hunted around in the .venv files in vscode for a reference to cuda 12. The only reference matching a regex of cublas(.*)12 in ./.venv/* is the METADATA file for PyTorch i.e. .venv\Lib\site-packages\torch-2.2.2+cu118.dist-info\METADATA . Specifically:

Metadata-Version: 2.1
Name: torch
Version: 2.2.2+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Download-URL: https://github.com/pytorch/pytorch/tags
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Keywords: pytorch,machine learning
Classifier: ...
...
Requires-Dist: nvidia-cuda-nvrtc-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 ==8.9.2.26 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 ==12.1.3.1 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 ==11.0.2.54 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 ==10.3.2.106 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 ==11.4.5.107 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 ==12.1.0.106 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 ==2.19.3 ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 ==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64"
...

I'm fairly inexperienced poking around to resolve Python's specific dependency hell 😅 so I'm not sure if this represents the cause of the issue or not.

I tried the uninstall, then pip cache purge, and then the re-install in case a cached wheel was the issue, but I still have the problem.

Add a "on_recorded" function OR fix on_recorded_chunk

on_recorded_chunk kinda ignores the voice activation process and is called as often as the CPU allows per second, providing 1 KB of data without any voice in it whatsoever.

I don't really see a use case where this is helpful. On the other hand, I guess a function exporting the full data at the end of voice activation makes much more sense, since this:
a) has real user data instead of 1 KB chunks of white noise;
b) this real data can e.g. be sent to an external speech-to-text database, or just to a server for later usage / further training.

Or am I thinking about this wrong?

Multiprocessing issue on macOS

Hi,

Thank you for your great work! I am running 0.1.11 with Python 3.10 on latest macOS, and I am getting this error when running simple_test.py, which is similar to #7 and #29

Could you please share any workaround for this? Great thanks!

Say something... RealTimeSTT: root - ERROR - Unhandled exeption in _recording_worker:
Exception in thread Thread-1 (_recording_worker):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/RealtimeSTT/audio_recorder.py", line 994, in _recording_worker
    while (self.audio_queue.qsize() >
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()

unable to run script

Hi there,

I've been desperate to try your script after I saw it on reddit (we had a brief chat), but I can't for the life of me figure out what's going on.

I've tried:
Running from the GH repo with pip install realtimestt
Running from the GH repo without pip install realtimestt
running in a different env just using pip install realtimestt
running your test scripts
running the most 'basic' vanilla script

Environment:
MacBook Pro
macOS Ventura Version 13.5.1 (22G90)
Apple M2 Max
Conda Environment (fresh)
ffmpeg installed with Conda
Python 3.11.5
Pip freeze dump:
av==10.0.0
certifi==2023.7.22
charset-normalizer==3.2.0
colorama==0.4.6
coloredlogs==15.0.1
ctranslate2==3.19.0
enum34==1.1.10
faster-whisper==0.8.0
filelock==3.12.4
flatbuffers==23.5.26
fsspec==2023.9.1
halo==0.0.31
huggingface-hub==0.17.1
humanfriendly==10.0
idna==3.4
Jinja2==3.1.2
log-symbols==0.0.14
MarkupSafe==2.1.3
mpmath==1.3.0
networkx==3.1
numpy==1.25.2
onnxruntime==1.15.1
packaging==23.1
protobuf==4.24.3
pvporcupine==1.9.5
PyAudio==0.2.13
PyYAML==6.0.1
requests==2.31.0
six==1.16.0
spinners==0.0.24
sympy==1.12
termcolor==2.3.0
tokenizers==0.13.3
torch==2.0.1
torchaudio==2.0.2
tqdm==4.66.1
typing_extensions==4.7.1
urllib3==2.0.4
webrtcvad==2.0.10

Console dump:
[ctranslate2] [thread 2542417] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
File "/.../test whisper.py", line 4, in
recorder = AudioToTextRecorder(spinner=True, language="en", model="tiny.en", level=logging.WARNING)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 246, in init
self.silero_vad_model, _ = torch.hub.load(

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/torch/hub.py", line 555, in load
repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/torch/hub.py", line 199, in _get_cache_or_reload
repo_owner, repo_name, ref = _parse_repo_info(github)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/torch/hub.py", line 142, in _parse_repo_info
with urlopen(f"https://github.com/{repo_owner}/{repo_name}/tree/main/"):

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 519, in open
response = self._open(req, data)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 496, in _call_chain
result = func(*args)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/urllib/request.py", line 1352, in do_open
r = h.getresponse()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/http/client.py", line 1378, in getresponse
response.begin()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/http/client.py", line 318, in begin
version, status, reason = self._read_status()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/http/client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/socket.py", line 706, in readinto
return self._sock.recv_into(b)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/ssl.py", line 1278, in recv_into
return self.read(nbytes, buffer)

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/ssl.py", line 1134, in read
return self._sslobj.read(len, buffer)

KeyboardInterrupt
Exception ignored in: <function AudioToTextRecorder.__del__ at 0x1523b23e0>
Traceback (most recent call last):
File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 894, in del
self.shutdown()

File "/.../anaconda3/envs/open-interpreter/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 397, in shutdown
self.recording_thread.join()

AttributeError: 'AudioToTextRecorder' object has no attribute 'recording_thread'

Would love some help here!

Thanks,

The Captain

browser client example phrases repetition

I'm encountering slow real-time transcription and occasional repetition issues while using the browser client example (I didn't change anything in the script). It seems that the transcription process is significantly delayed, and certain phrases are repeated multiple times, as if the same chunk of text were being transcribed repeatedly.

How to choose the CUDA version?

Hello @KoljaB,
I updated to the latest RealtimeSTT, but something is wrong with the CUDA version for ctranslate2 in my Linux (Ubuntu) environment.

RealTimeSTT: root - ERROR - Error initializing main faster_whisper transcription model: CUDA failed with error the operation cannot be performed in the present state
Traceback (most recent call last):
File "/home/ubuntu/stt_translate/audio_recorder.py", line 629, in _transcription_worker
model = faster_whisper.WhisperModel(
File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 144, in init
self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error the operation cannot be performed in the present state

How do I choose the CUDA version with the latest RealtimeSTT? Should I download "cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x"?
Thanks!

Support quantized models to save memory

First, thanks for creating a fantastic project! I was looking for a way to run Whisper or some other speech-to-text model in realtime. I found several potential solutions but this one is clearly the best, especially for implementing custom applications on top.

I noticed that faster-whisper supports quantized models but RealtimeSTT currently doesn't expose that option. With int8 quantization, models take up much less VRAM (or RAM, if run on CPU only). The quality of model output may suffer a little bit, but I think it's still a worthwhile optimization when memory is tight.

I have a laptop with an integrated NVIDIA GeForce MX150 GPU that only has 2GB VRAM. I was able to run the small model without problems (with tiny as the realtime model), but the medium and larger models gave a CUDA out of memory error.

To enable quantization, I tweaked the initialization of WhisperModel here

self.realtime_model_type = faster_whisper.WhisperModel(
    model_size_or_path=self.realtime_model_type,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

and here

model = faster_whisper.WhisperModel(
    model_size_or_path=model_path,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

by adding the parameter compute_type='int8'. This resulted in quantized models and the medium model can now fit on my feeble GPU; sadly, the large-v2 model is still too big.

GPU VRAM requirements as reported by nvidia-smi with and without quantization of the main model (realtime model is always tiny with the same quantization applied as for the main model):

model     default         int8
tiny      542 MiB         246 MiB
base      914 MiB         278 MiB
small     1386 MiB        532 MiB
medium    out of memory   980 MiB
large-v2  out of memory   out of memory

This could be exposed as an additional parameter compute_type for AudioToTextRecorder; or possibly two separate parameters, one for the realtime model and another for the main model. This parameter would then simply be passed as compute_type to the WhisperModel(s).

Cuda Error

Followed the installation steps to run the repo. When I try running realtimestt_test.py, I get this runtime error:

root - ERROR - Unhandled exeption in _realtime_worker: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for 
execution on the device
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\uber_\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
    self.run()
  File "C:\Users\uber_\AppData\Local\Programs\Python\Python39\lib\threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 1302, in _realtime_worker
    self.realtime_transcription_text = " ".join(
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 1302, in <genexpr>       
    self.realtime_transcription_text = " ".join(
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\faster_whisper\transcribe.py", line 511, in generate_segments 
    encoder_output = self.encode(segment)
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\faster_whisper\transcribe.py", line 762, in encode
    return self.model.encode(features, to_cpu=to_cpu)
RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
Traceback (most recent call last):
    recorder.text(process_text)
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 882, in text
    self.wait_audio()
  File "C:\VisionBox\NLP\transcription_translation\.venv\lib\site-packages\RealtimeSTT\audio_recorder.py", line 802, in wait_audio
    if (self.stop_recording_event.wait(timeout=0.02)):

I confirmed that Cuda 11.8 is installed using: nvcc -V and have the cuDNN (8.7.0) files as well. Torch is also available.

Provide ID for data & transcription

Especially on slow devices (for example, CPU-only) there is the problem of speech going on while transcription is happening. I want to add a "listen again" function using on_recorded_chunk, saving the chunk and making this file available for listening on the frontend. Since transcription is async, I want to map transcription and data later on; to be able to achieve this, I have to have the queue order.

Can you please add a "transcription_id" argument to on_recorded_chunk and also to recorder.text?

Thank you.

Launches but does not display any text

Just a spinner with the text "speak now". Tried large-v3, small, and tiny models, no difference. The mic works well; I just tested it by recording audio using Python. It's probably trying, because the CPU is heavily loaded, but there's no result.

MacOS Sonoma 14.4

There are no errors in console, just warning:
[2024-04-24 23:53:30.031] [ctranslate2] [thread 8069444] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

main transcription model path

Hi! I've tried to find the path where models are downloaded, so I can maybe set another path.

code from example
model="small.en", language="en", wake_words="jarvis"

CUDA initialization error on current master

Hi! While I was trying to develop a PR for the shutdown lockup issue, I noticed that the recent commits on master broke model initialization:

RealTimeSTT: root - ERROR - Error initializing main faster_whisper transcription model: CUDA failed with error initialization error
Traceback (most recent call last):
  File "/nix/store/df8n8d04bdfdbwpgccjsaw8sc4hin2pk-python3-3.11.9-env/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 629, in _transcription_worker
    model = faster_whisper.WhisperModel(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/df8n8d04bdfdbwpgccjsaw8sc4hin2pk-python3-3.11.9-env/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 144, in __init__
    self.model = ctranslate2.models.Whisper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA failed with error initialization error

If I leave everything the same and downgrade to v1.1.16, everything works fine. My guess is that this happens because you were previously calling torch.cuda.is_available() in each thread, which initialized the CUDA runtime in that thread. This isn't happening anymore, possibly leading to the observed error.

No Internet Connection

I have installed the RealtimeSTT but when running a few of the test files, it gave this error:

RuntimeError: It looks like there is no internet connection and the repo could not be found in the cache (/Users/username/.cache/torch/hub)

I am a little bit confused, so can this be resolved?

No output shown or Logs.

I ran the websocket server at host 0.0.0.0 and connected to it from my machine. It shows some errors, then says "waiting for clients"; the client connects without problems, but no output is shown.

└─ ... ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.front
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround40
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround41
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround50
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround51
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround71
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM dmix
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
[2024-04-15 17:33:11.582] [ctranslate2] [thread 854988] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
[2024-04-15 17:33:12.901] [ctranslate2] [thread 855019] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
└─ OK
waiting for clients
└─ OK. 
waiting for sentence
└─ ... 

In the client I see vad started. and it is stuck there.

OSError: [Errno -9997] Invalid sample rate

OS: Linux Arch
Audio system: Pipewire (with alsa, pulse audio etc plugins).

It seems it is only possible for PyAudio to open an audio device with Pipewire at its default sample rate, not 16000Hz.

Would it be possible to run RealtimeSTT at a higher frequency than 16000Hz?

issue with microphone

Dear all,

My sound card works fine, but with your program, despite trying many changes to my Ubuntu PCI sound configuration, it's not working and gives me these errors:
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave

Please, any help?

Question: verify installation

I just installed RealtimeSTT on windows.

I downloaded and installed cudnn and ffmpeg but they are just archives with binaries.

I didn't add anything to my PATH but the realtimestt_test.py seems to work fine.

How do I verify that my installation is correct (what are cudnn/ffmpeg required by?)

Also, is there any direct way to verify that my GPU is being used?

Thanks

[Feature request] Abort execution

Hi! I found the recorder.abort() function, and think it would be great to have something like:

  1. Always listen for the wake word in the background, even while transcribing.
  2. If the wake word is detected, interrupt transcription and start a new session.
  3. It would also be great to have a callback to do something on interruption.

Cannot run example browser client

I get an error when using the example in the browser client folder:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/Users/hieunguyenminh/opt/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/hieunguyenminh/opt/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/hieunguyenminh/CODE ALL/TalkToListen/STT-VAD/venv/lib/python3.9/site-packages/RealtimeSTT/audio_recorder.py", line 985, in _recording_worker
    while (self.audio_queue.qsize() >
  File "/Users/hieunguyenminh/opt/anaconda3/lib/python3.9/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()
NotImplementedError
  • I tried Python 3.11 and 3.9, but they both have this issue. I wonder what version you use that has qsize() supported?
  • I want to make an app that connects a client and a server; what example do you recommend?

Thank you for creating this app, this helps me a lot!

does it support client-server mode realtime STT?

Hi, KoljaB

Thanks for your contribution. I learned a lot, although I'm still a newbie.
I have an application scenario:
a Linux machine with an Nvidia GPU that I want to use (as a server) to transcribe audio in real time from a MacBook (as the client).
Does this project support this application scenario?

Regards,
Snow

How to pass audio file and transcribe it

Hi team, I am working on implementing a voice chatbot using the RealtimeSTT library. For the speech-to-text part, I am using the RealtimeSTT library. Here, I am attempting to provide an audio file as input and transcribe it. You mentioned that if we don't want to use a microphone, we should set 'use_microphone' to False and provide the audio as 16-bit PCM chunks to obtain the transcribed text as output.
I have implemented the code as below.

import soundfile as sf
import numpy as np
import json
from scipy.signal import resample
from RealtimeSTT import AudioToTextRecorder

def format_audio(filepath):
    # Read the audio file
    data, samplerate = sf.read(filepath)
    print("data------------->", data)
    pcm_16 = np.maximum(-32768, np.minimum(32767, data*32768)).astype(np.int16)
    print("pcm data----------------->", pcm_16)
    # Create metadata
    metadata = {
        "sampleRate": samplerate,
    }
    metadata_json = json.dumps(metadata)
    metadata_bytes = metadata_json.encode('utf-8')

    # Create buffer for metadata length (4 bytes for 32-bit integer)
    metadata_length = len(metadata_bytes).to_bytes(4, byteorder='little')

    # Combine metadata length, metadata, and audio data into a single message
    combined_data = metadata_length + metadata_bytes + pcm_16.tobytes()
    # print("Combined data: ", combined_data)
    return combined_data


def decode_and_resample(
            audio_data,
            original_sample_rate,
            target_sample_rate):

        # Decode 16-bit PCM data to numpy array
        audio_np = np.frombuffer(audio_data, dtype=np.int16)

        # Calculate the number of samples after resampling
        num_original_samples = len(audio_np)
        num_target_samples = int(num_original_samples * target_sample_rate /
                                 original_sample_rate)

        # Resample the audio
        resampled_audio = resample(audio_np, num_target_samples)

        return resampled_audio.astype(np.int16).tobytes()

if __name__ == '__main__':
    combined_data = format_audio('chat1.wav')
    recorder_config = {
        'spinner': False,
        'use_microphone': False,
        'model': 'tiny.en',
        'language': 'en',
        'silero_sensitivity': 0.4,
        'webrtc_sensitivity': 2,
        'post_speech_silence_duration': 0.7,
        'min_length_of_recording': 0,
        'min_gap_between_recordings': 0,
        'enable_realtime_transcription': False,
        'realtime_processing_pause': 0,
        'realtime_model_type': 'tiny.en'
    }

    recorder = AudioToTextRecorder(**recorder_config)
    metadata_length = int.from_bytes(combined_data[:4], byteorder='little')
    metadata_json = combined_data[4:4+metadata_length].decode('utf-8')
    metadata = json.loads(metadata_json)
    sample_rate = metadata['sampleRate']
    chunk = combined_data[4+metadata_length:]
    resampled_chunk = decode_and_resample(chunk, sample_rate, 16000)
    recorder.feed_audio(resampled_chunk)

    # Get the transcribed text
    text = recorder.text()
    print(f"Transcribed text: {text}")

but I got this response:

[2024-04-28 11:36:10.671] [ctranslate2] [thread 21416] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
data-------------> [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 9.15527344e-05
1.22070312e-04 1.52587891e-04]
pcm data-----------------> [0 0 0 ... 3 4 5]
[2024-04-28 11:36:57.949] [ctranslate2] [thread 17900] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
RealTimeSTT: root - WARNING - Audio queue size exceeds latency limit. Current size: 104. Discarding old audio chunks.

But after this I should get the transcribed text as output; instead it runs forever and gives no output after that warning message. I don't have any clue what I'm doing wrong. From my findings, real-time transcription works fine (when we speak, it continuously transcribes), but when we give it an audio file, how do we transcribe it? It would be helpful if you could provide a solution for passing in an audio file and getting the text from it.

[Feature request] Update porcupine version for use with macOS arm

@KoljaB I wanted to know if updating Porcupine is part of the plan. This would update Porcupine to v3.0, which supports Mac M-series chips and allows you to create custom wake words, given you have an access key.
I also have a working forked repo which does this. If it appeals to you I can make a pull request.

Scipy missing from requirements.txt?

I installed RealtimeSTT via pip (in a new and otherwise empty virtual environment) and tried to run a simple test script with from RealtimeSTT import AudioToTextRecorder as the only import. I then got a module-missing error for scipy. I installed it via pip and then everything worked fine (via CPU, as noted in the readme).

Does scipy need adding to the requirements.txt files?

Example for using remote GPU server?

Since a GPU is necessary for mid-to-large models and most users (including me) don't have an NVIDIA GPU, we need to connect to a remote GPU server to run the Whisper models. But the current module and examples don't provide a way to connect a local mic to a remote GPU server. Could someone point me to, or give me, examples of how to accomplish that?

openai.ChatCompletion no longer supported

I ran tests\minimalistic_talkbot.py
RealTimeSTT 0.1.8
RealTimeTTS 0.3.4
openai 1.6.0

I got this error:

RealTimeSTT: root - WARNING - error in play() with engine azure:

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run openai migrate to automatically upgrade your codebase to use the 1.0.0 interface.

Alternatively, you can pin your installation to the old version, e.g. pip install openai==0.28

A detailed migration guide is available here: openai/openai-python#742

Traceback:

Traceback (most recent call last):
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\text_to_stream.py", line 308, in play
for sentence in chunk_generator:
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\text_to_stream.py", line 552, in _synthesis_chunk_generator
for chunk in generator:
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\stream2sentence\stream2sentence.py", line 193, in generate_sentences
for char in _generate_characters(generator, log_characters):
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\stream2sentence\stream2sentence.py", line 85, in _generate_characters
for chunk in generator:
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\threadsafe_generators.py", line 237, in next
token = next(self.generator)
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\RealtimeTTS\threadsafe_generators.py", line 147, in next
self._current_str = next(self._current_iterator)
File "R:\projects\RealtimeSTT\tests\minimalistic_talkbot.py", line 11, in generate
for chunk in openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, stream=True):
File "R:\projects\RealtimeSTT\test_env\lib\site-packages\openai\lib_old_api.py", line 39, in call
raise APIRemovedInV1(symbol=self._symbol)
openai.lib._old_api.APIRemovedInV1:

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run openai migrate to automatically upgrade your codebase to use the 1.0.0 interface.

Alternatively, you can pin your installation to the old version, e.g. pip install openai==0.28

A detailed migration guide is available here: openai/openai-python#742
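
For reference, a sketch of what the streaming call in generate() might look like on the openai>=1.0 interface (the function and variable names are taken from the traceback; treat this as an illustration, not an official patch):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(messages):
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, stream=True
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # content is None in some chunks (e.g. the final one)
            yield delta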

Input device: what you hear

Good afternoon, is there any possibility of using a "what you hear" (loopback) device as the input?
I am considering this program to transcribe, in real time, the entire conversation between several people I can hear. Is that possible?
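
A sketch of one way to try this, under two assumptions on my part (verify both): your system exposes a loopback device such as "Stereo Mix", and your installed RealtimeSTT version accepts an input_device_index parameter:

import pyaudio
from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    # List input-capable devices to find the loopback / "Stereo Mix" index
    pa = pyaudio.PyAudio()
    for i in range(pa.get_device_count()):
        info = pa.get_device_info_by_index(i)
        if info.get('maxInputChannels', 0) > 0:
            print(i, info['name'])
    pa.terminate()

    recorder = AudioToTextRecorder(input_device_index=2)  # hypothetical index
    while True:
        print(recorder.text(), end=" ", flush=True)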

Apple Neural Engine integration?

Thanks for this amazing work.

I have a Mac, and since it has a Neural Engine to leverage, I was wondering if there is any way of integrating that into this module. Would this be a possible feature addition?

macOS

Code:

import ssl

# Work around SSL certificate errors when torch.hub downloads the model
ssl._create_default_https_context = ssl._create_unverified_context
import torch

from RealtimeSTT import AudioToTextRecorder  # missing from the original snippet

model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad", verbose=True)

if __name__ == '__main__':
    recorder = AudioToTextRecorder(spinner=False)

    print("Say something...")
    while (True):
        print(recorder.text(), end=" ", flush=True)

Error:

RealTimeSTT: root - ERROR - Unhandled exeption in _recording_worker: 
Exception in thread Thread-1 (_recording_worker):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/hanxirui/workspace/python/DataScience/venv/lib/python3.11/site-packages/RealtimeSTT/audio_recorder.py", line 667, in _recording_worker
    while self.audio_queue.qsize() > self.allowed_latency_limit:
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/queues.py", line 126, in qsize
    return self._maxsize - self._sem._semlock._get_value()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError
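
The failure comes from multiprocessing.Queue.qsize(): macOS does not implement sem_getvalue(), so qsize() raises NotImplementedError there. A quick way to confirm this on your machine:

import multiprocessing

if __name__ == '__main__':
    q = multiprocessing.Queue()
    try:
        print(q.qsize())
    except NotImplementedError:
        print("qsize() is unsupported on this platform (macOS)")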

Pytorch version mismatch

Hello, I followed all the steps and would like to propose a tiny tweak to the README.

The regular pip install: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 creates a version mismatch between torch & torchaudio (on a brand new VM, configured from scratch).

RuntimeError: Detected that PyTorch and TorchAudio were compiled with different CUDA versions. PyTorch has CUDA version 11.8 whereas TorchAudio has CUDA version 11.7. Please install the TorchAudio version that matches your PyTorch version.

I recommend pinning the versions to fix the issue (if it appears): pip install torch==2.0.1+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
Source/inspiration.

Code works amazingly well, kudos 👍

I tried the German version and that works very well too (I had to mute the logs). You have done something great here, Kolja! GG.

Works on both 2060 & 4090, Ubuntu 22.04.

the on_realtime_transcription_update text issue

Hi @KoljaB,
I send the text from on_realtime_transcription_update (I also tried on_realtime_transcription_stabilized) to the web client, but it always shows the previous words again. Is there another method that avoids repeating the previous words? Thanks!
I'm trying to implement a live transcription function similar to Zoom Meeting in our product.
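
One workaround sketch, assuming (as the description suggests) that each callback delivers the full text recognized so far: keep the last text and send only the new suffix to the client. The websocket send is a placeholder:

last_text = ""

def on_update(text):
    global last_text
    if text.startswith(last_text):
        new_part = text[len(last_text):]  # only what was appended
    else:
        new_part = text                   # the model revised earlier words
    last_text = text
    # websocket.send(new_part)            # placeholder transport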

Interval error

It's mostly working fine, but I got an error several times when using AudioToTextRecorder.text(on_transcription_finished=on_transcription_finished).

The error:

Traceback (most recent call last):
  File "C:\ooba-voice\prod\main.py", line 32, in <module>
    main()
  File "C:\ooba-voice\prod\main.py", line 27, in main
    recorder.text(on_transcription_finished=on_transcription_finished)
  File "C:\Users\name\AppData\Roaming\Python\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 543, in text
    threading.Thread(target=on_transcription_finished, args=(self.transcribe(),)).start()
  File "C:\Users\name\AppData\Roaming\Python\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 506, in transcribe
    self._set_state("transcribing")
  File "C:\Users\name\AppData\Roaming\Python\Python310\site-packages\RealtimeSTT\audio_recorder.py", line 1004, in _set_state
    self.halo._interval = 50
AttributeError: 'NoneType' object has no attribute '_interval'

What is the problem?

Unrelated to the above, it also often stops recording while I'm still speaking. Could you tell me how to solve this too? (See the sketch after the summary.)
...and I cannot kill the process completely with Ctrl+C from the Windows 11 cmd.

summary

  • AttributeError: 'NoneType' object has no attribute '_interval' (AudioToTextRecorder.text(on_transcription_finished=on_transcription_finished))
  • Stop recording while still speaking (AudioToTextRecorder.text(on_transcription_finished=on_transcription_finished))
  • Cannot kill a process completely with Ctrl+C from Windows 11 cmd
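
For the mid-speech cutoffs, a sketch of VAD tuning that may help (the values are illustrative starting points, not maintainer recommendations):

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        post_speech_silence_duration=1.2,  # wait longer in silence before ending a segment
        silero_sensitivity=0.6,            # higher = audio more readily classified as speech
    )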

transcribing multiple audio streams simultaneously

@KoljaB
Hello, I'd like to ask whether a single AudioToTextRecorder object supports transcribing or recording multiple audio streams simultaneously.
How can I implement simultaneous transcription or recording of multiple audio streams?
Thank you!
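
A sketch of one possible approach (my assumption, not confirmed by the maintainer; note each recorder loads its own model, so this is memory-hungry): one AudioToTextRecorder per stream, each fed independently:

from RealtimeSTT import AudioToTextRecorder

if __name__ == '__main__':
    recorders = {
        'stream_a': AudioToTextRecorder(use_microphone=False, spinner=False),
        'stream_b': AudioToTextRecorder(use_microphone=False, spinner=False),
    }
    # Each audio source then calls: recorders[name].feed_audio(chunk)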

Record blocked while transcribing (no real async possible)

When .text(function) is called, the microphone is blocked and not listening. Speech happening at the same time is not captured and also does not show up in recorder.audio_queue.qsize().

So a basic customer-journey example:

  • "Speak now"
  • User Speaks - "Recording"
  • User stops speaking. "Transcribing"
  • Transcribing disappears. Faster-Whisper is trying to transcribe using large-v3
  • User speaks while the big transcription is happening → gets ignored
  • Result of first text appears. "Speak now"

I tried enable_realtime_transcription with both True and False. Both have the problem.
I am using recorder.text(process_text), which according to the docs runs asynchronously as soon as I provide a callback to .text(). But it does not appear to be truly asynchronous.

Can you please solve this? The queue appears to be buggy, and with a slow GPU/CPU there is guaranteed data loss due to a race condition.

Thank you

input pcm buffer_size issue

BUFFER_SIZE = 512
self.buffer_size = BUFFER_SIZE
feed_audio(self, chunk):
When I call feed_audio, the input chunk size from our realtime server is 640/768 bytes (16 kHz, mono). Should I change the buffer_size (512) in the audio_recorder?
Thanks!
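
As an alternative to changing BUFFER_SIZE, a sketch of re-chunking on the caller's side (an assumption on my part, not maintainer guidance): buffer the incoming 640/768-byte chunks and re-slice them into 512-sample frames before calling feed_audio:

pending = bytearray()
FRAME_BYTES = 512 * 2  # 512 samples x 2 bytes per 16-bit mono sample

def on_server_chunk(recorder, chunk):
    pending.extend(chunk)
    while len(pending) >= FRAME_BYTES:
        recorder.feed_audio(bytes(pending[:FRAME_BYTES]))
        del pending[:FRAME_BYTES]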
