
whisperhallu's Introduction

WhisperHallu

Experimental code: audio file preprocessing to optimize Whisper transcriptions and avoid hallucinated text

See this discussion: openai/whisper#679

Main algo

  • remove noise by voice extraction, using Facebook Demucs or Deezer Spleeter.
  • remove silences, and normalize loudness with ffmpeg.
  • remove non-speech parts using Silero VAD.
  • add voice markers.
  • apply a speech compressor (requires ffmpeg 4.4; Google Colab ships 4.2 by default, so it has to be upgraded, see below).
  • try to transcribe. If the markers are present in the output, the transcription is OK (a sketch of this retry strategy follows the list).
  • if not, try with inverted markers. If the markers are present in the output, the transcription is OK.
  • if not, try without markers.
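
Below is a minimal Python sketch of this pipeline, for illustration only: silenceremove and loudnorm are real ffmpeg filters but the thresholds are examples, the ffmpeg 4.4 requirement plausibly comes from a speech filter such as speechnorm (an assumption), and transcribe/marker are placeholders rather than the actual WhisperHallu code.

import subprocess
from typing import Callable

def preprocess(src: str, dst: str) -> None:
    # Remove silences and normalize loudness with ffmpeg.
    # Real filters, illustrative thresholds.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-af", "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB,loudnorm",
         dst],
        check=True)

def transcribe_with_markers(path: str,
                            transcribe: Callable[[str, str], str],
                            marker: str) -> str:
    # Accept the first variant whose output still contains the marker;
    # fall back to a plain transcription without markers.
    for variant in ("markers", "inverted-markers", "no-markers"):
        text = transcribe(path, variant)
        if variant == "no-markers" or marker in text:
            return text
    return ""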

Processing options and parameters

  • use Whisper V1, V2 or V3 (V2 by default, because V3 seems to perform poorly on music).
  • beam_size (2 by default), patience, temperature.
  • process only a subpart of the input file (requires post-processing of the timestamp values).
  • various time-stretching methods were tested (see the in-code comments; they require post-processing of the timestamp values. It was an interesting suggestion, but I obtained no real gain).
  • vocals remix (with or without speech normalization).
  • multiple final transcriptions (get multiple results, knowing Whisper is not stable from one run to another, without repeating the pre-processing). An illustrative call using these parameters follows the list.
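
The keyword names below match the transcribePrompt() signature that appears in a traceback quoted further down this page (lngInput, isMusic, addSRT, truncDuration, maxDuration); the values, and the exact meaning of truncDuration, are illustrative assumptions.

from transcribeHallu import loadModel, transcribePrompt

# Keyword names taken from the transcribePrompt() signature shown in a
# traceback below; values (and truncDuration semantics) are assumptions.
loadModel("0", modelSize="medium")
result = transcribePrompt(
    path="/path/to/your/sound/file",
    lng="en",            # output transcription language
    prompt="Whisper, Ok. ...",
    lngInput="en",       # audio input language
    isMusic=False,       # minimal processing for music files
    addSRT=True,         # also produce SRT output
    truncDuration=60)    # assumed: process only the first 60 seconds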

Complement

May be used to produce "accurate transcriptions" for WhisperTimeSync:
https://github.com/EtienneAb3d/WhisperTimeSync

May be tested using NeuroSpell Dictaphone:
https://neurospell.com/

WhisperHallu and WhisperTimeSync are used to extract vocals and lyrics in karaok-AI:
https://github.com/EtienneAb3d/karaok-AI

ChatMate is a complete and versatile ChatGPT automation tool; it includes, as an example, instructions to build an SRT file translator into Chinese:
https://github.com/EtienneAb3d/ChatMate

Google Colab

Standard Whisper:
https://colab.research.google.com/drive/1-GpXaNaGFXKX9VXl60JGVVrGO41t09KA?usp=sharing

Faster Whisper:
https://colab.research.google.com/drive/1RkvOtUTbUD5NVsRI4aKEqJO8BRo8BFIY?usp=sharing

Install

Upgrade ffmpeg to version 4.4 on Google Colab

! add-apt-repository -y ppa:savoury1/ffmpeg4
! apt-get -qq install -y ffmpeg

!ffmpeg -version

Output:
==========
ffmpeg version 4.4.3-0ubuntu1~20.04.sav2 Copyright (c) 2000-2022 the FFmpeg developers
[...]

Demucs (if used)

pip install -U demucs
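
If Demucs is driven from Python rather than the shell, a subprocess call like the one below works; --two-stems=vocals is a real Demucs option, while the output layout in the comment depends on the Demucs version and default model (an assumption):

import subprocess

# Separate vocals from accompaniment with the Demucs CLI.
# Output layout (assumption, depends on the default model):
#   separated/htdemucs/<track>/vocals.wav
subprocess.run(["demucs", "--two-stems=vocals", "input.wav"], check=True)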

Spleeter (if used)

pip install spleeter
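
The equivalent call for Spleeter, using its standard 2-stems pretrained configuration and the spleeter 2.x CLI syntax (not necessarily WhisperHallu's exact invocation):

import subprocess

# Two-stem separation (vocals / accompaniment) with the Spleeter CLI (2.x syntax).
subprocess.run(
    ["spleeter", "separate", "-p", "spleeter:2stems", "-o", "out", "input.wav"],
    check=True)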

Standard Whisper (if used)

sudo apt update && sudo apt install ffmpeg

sudo apt install python3
sudo apt install python3-pip
sudo apt install virtualenv

virtualenv -p python3 ../venvWhisper
. ../venvWhisper/bin/activate

pip install -U openai-whisper

pip3 install torchaudio
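
A quick smoke test that the install works, using the standard openai-whisper API (independent of WhisperHallu; replace sample.wav with any audio file):

import whisper

# "medium" matches the default model size used later in this README.
model = whisper.load_model("medium")
print(model.transcribe("sample.wav")["text"])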

Faster Whisper (if used in place of Whisper)

sudo apt update && sudo apt install ffmpeg

sudo apt install python3
sudo apt install python3-pip
sudo apt install virtualenv

virtualenv -p python3 ../venvFasterWhisper
. ../venvFasterWhisper/bin/activate

git clone https://github.com/guillaumekln/faster-whisper.git
cd faster-whisper/

pip install -e .[conversion]
pip install -e .

cd ..

ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --quantization float16
ct2-transformers-converter --model openai/whisper-large --output_dir whisper-large-ct2 --quantization float16

pip3 install torchaudio
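
To check that a converted model loads, the standard faster-whisper API can be used directly; the directory name matches the converter output above:

from faster_whisper import WhisperModel

# Load the CTranslate2 model produced by ct2-transformers-converter above.
model = WhisperModel("whisper-medium-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("sample.wav", beam_size=2)
print("Detected language:", info.language)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))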

SM4T (if used in place of Whisper)

sudo apt update && sudo apt install ffmpeg

sudo apt install python3
sudo apt install python3-pip
sudo apt install virtualenv

virtualenv -p python3 ../venvSM4T
. ../venvSM4T/bin/activate

git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication/

pip install --upgrade pip
pip install .

m4t_predict "On ne fait pas d'omelette sans casser des oeufs." t2tt eng --src_lang fra

pip3 install torchaudio

Code

from transcribeHallu import loadModel
from transcribeHallu import transcribePrompt

##### The audio language may be different from the one used for the output transcription.
path = "/path/to/your/en/sound/file"
lngInput = "en"

##### Activate this for music files to get minimal processing.
isMusic = False

##### Needs to be adapted for each language.
##### For prompt examples, see getPrompt(lng: str) in transcribeHallu.py.
lng = "en"
prompt = "Whisper, Ok. "\
	+"A pertinent sentence for your purpose in your language. "\
	+"Ok, Whisper. Whisper, Ok. "\
	+"Ok, Whisper. Whisper, Ok. "\
	+"Please find here, an unlikely ordinary sentence. "\
	+"This is to avoid a repetition to be deleted. "\
	+"Ok, Whisper. "

##### Model size to use.
modelSize = "medium"
loadModel("0", modelSize=modelSize)

result = transcribePrompt(path=path, lng=lng, prompt=prompt, lngInput=lngInput, isMusic=isMusic)
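
What result contains isn't specified here; assuming it is the transcription text as a plain string (an assumption, not confirmed by this README), a minimal follow-up might be:

# Assumption: 'result' is the transcription text as a plain string.
print(result)
with open(path + ".txt", "w", encoding="utf-8") as f:
    f.write(result)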

This tool is a demonstration of our know-how.
If you are interested in a commercial/industrial AI linguistic project, contact us:
https://cubaix.com


whisperhallu's Issues

Should a GPU help this algorithm go faster or no?

From what I've seen, when the script runs it attempts to use the GPU if one is present, which of course is great. In fact I think it's even the default. For whatever reason it doesn't run on the GPU on my NVIDIA A100. I have no issues running whisper ... --device cuda; it works great and reduces the runtime of my transcription by an order of magnitude. I wish I could get the same result with Hallu. What am I missing? Thanks! Let me know if you want any other information from me.

GPU out of memory

I have an 8 GB GPU, and sometimes it crashes with an out-of-memory error. How can I reduce the memory usage?

Can't load Whisper model: large

Good evening, I was trying to run WhisperHallu on my computer, but the program raises this exception:
Can't load Whisper model: large

this is my code:

from transcribeHallu import loadModel
from transcribeHallu import transcribePrompt

import argparse
import os

#### ARGUMENTS
parser = argparse.ArgumentParser()

parser.add_argument('-i', '--input', help="input audio file")
parser.add_argument('-o', '--output', help="output directory (default: the input dir)")
parser.add_argument('-lng', '--language', help="audio input language (default: it)")

args = vars(parser.parse_args())


##### The audio language may be different from the one for the output transcription.
path=args["input"]
if not path:
   raise Exception("Missing audio input parameter, use -h for more info")

if not args["output"]:
   out_path = os.path.split(args["input"])[0]
else:
   out_path = args["output"]

if not args["lenguage"]:
   lngInput="it"
else:
   lngInput=args["lenguage"]

##### Activate this for music file to get a minimal processing
isMusic=False

##### Need to be adapted for each language.
##### For prompt examples, see transcribeHallu.py getPrompt(lng:str)
lng=lngInput
prompt= "Whisper, Ok. "\
   +"A pertinent sentence for your purpose in your language. "\
   +"Ok, Whisper. Whisper, Ok. "\
   +"Ok, Whisper. Whisper, Ok. "\
   +"Please find here, an unlikely ordinary sentence. "\
   +"This is to avoid a repetition to be deleted. "\
   +"Ok, Whisper. "

##### Model size to use
if lngInput == "it":
   modelSize = "large"
else:
   modelSize="medium"
   
loadModel("0",modelSize=modelSize)

result = transcribePrompt(path=path, lng=lng, prompt=prompt, lngInput=lngInput, isMusic=isMusic)

I checked that the model was in the correct path, and it was.
I'm on Windows 10, and I am using faster-whisper.

SubtitleEdit

Does your program work with SubtitleEdit, and if so, how?

Can I remove other person's voice

Sometimes when I am using Whisper for real-time transcription, someone else speaks in my room, and Whisper transcribes their voice too. It is a headache, and I want to remove the other person's voice. Can anyone give some suggestions?

KeyError: 'word_timestamps'

Traceback (most recent call last):
  File "C:\Apps\WhisperHallu\transcribeHallu.py", line 404, in transcribeMARK
    if(transcribe_options["word_timestamps"]):

I can't get the application to work properly with addSRT=True set; without it, I get no output other than the console output.
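
A generic defensive rewrite of the failing line avoids the KeyError when the option is absent; this is plain Python dict usage, not necessarily how the author fixed it:

# Plain-Python fix sketch: default to False when the key is absent.
transcribe_options = {}  # example: 'word_timestamps' was never set
if transcribe_options.get("word_timestamps", False):
    print("word timestamps requested")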

New Google Colab Error

Google Colab "Standard Whisper"


TypeError Traceback (most recent call last)
in <cell line: 21>()
19
20 loadModel("0")
---> 21 result = transcribePrompt(path=path, lng=lng, prompt=prompt)

/content/WhisperHallu/transcribeHallu.py in transcribePrompt(path, lng, prompt, lngInput, isMusic, addSRT, truncDuration, maxDuration)
212 print("PROMPT="+prompt,flush=True)
213 opts = dict(language=lng,initial_prompt=prompt)
--> 214 return transcribeOpts(path, opts,lngInput,isMusic=isMusic,addSRT=addSRT,truncDuration=truncDuration,maxDuration=maxDuration)
215
216 def transcribeOpts(path: str,opts: dict

TypeError: transcribeOpts() got an unexpected keyword argument 'truncDuration'

Google Colab Error

Hey,
the Google Colab "Standard Whisper" notebook doesn't work anymore.
I'm getting the following error:
`/usr/local/lib/python3.10/dist-packages/torchaudio/_extension/utils.py in _check_cuda_version()
190 t_version = f"{t_version[0]}.{t_version[1]}"
191 if ta_version != t_version:
--> 192 raise RuntimeError(
193 "Detected that PyTorch and TorchAudio were compiled with different CUDA versions. "
194 f"PyTorch has CUDA version {t_version} whereas TorchAudio has CUDA version {ta_version}. "

RuntimeError: Detected that PyTorch and TorchAudio were compiled with different CUDA versions. PyTorch has CUDA version 12.1 whereas TorchAudio has CUDA version 11.8. Please install the TorchAudio version that matches your PyTorch version.`

Can’t load Whisper model

Hi,
Thank you for sharing this.
I’ve been trying to use it but kept encountering the problem:

100%|█████████████████████████████████████| 2.87G/2.87G [07:12<00:00, 7.13MiB/s]
Can't load Whisper model: large

I’ve tried using the medium model but the problem remained. Could you please help?

Use FFMPEG silenceremove and VAD

Wonderful pipeline, thanks a lot for your great work! It was a bit painful to set up in my case, but now it works like a charm. Still, I am curious about the combined usage of silenceremove and VAD: shouldn't silero-vad be capable of doing silenceremove's job too?
Surely I am mistaken and there is a good reason; I just want to know what it is.

Can't load Whisper Model

Hi there,

I don't have a lot of programming knowledge, and I'm struggling with your tool. I tried to use it in Jupyter (Anaconda). I installed all the required libraries, but when I attempted to run the final code, I encountered the same error message multiple times:

Python >= 3.10
Using cache found in C:\Users\renat/.cache\torch\hub\snakers4_silero-vad_master
Using Demucs
Using standard Whisper
LOADING: medium GPU:0 BS: 2
100%|█████████████████████████████████████| 1.42G/1.42G [00:25<00:00, 59.9MiB/s]
Can't load Whisper model: STD/medium

Below you can find more information:

RuntimeError Traceback (most recent call last)
File ~\Downloads\Jupyter\WhisperHallu\transcribeHallu.py:110, in loadModel(gpu, modelSize)
109 print("LOADING: "+modelSize+" GPU:"+gpu+" BS: "+str(beam_size))
--> 110 model = whisper.load_model(modelSize,device=torch.device("cuda:"+gpu)) #May be "cpu"
111 elif whisperFound == "SM4T":

File ~\AppData\Roaming\Python\Python311\site-packages\whisper\__init__.py:146, in load_model(name, device, download_root, in_memory)
143 with (
144 io.BytesIO(checkpoint_file) if in_memory else open(checkpoint_file, "rb")
145 ) as fp:
--> 146 checkpoint = torch.load(fp, map_location=device)
147 del checkpoint_file

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:1014, in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
1013 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
-> 1014 return _load(opened_zipfile,
1015 map_location,
1016 pickle_module,
1017 overall_storage=overall_storage,
1018 **pickle_load_args)
1019 if mmap:

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:1422, in _load(zip_file, map_location, pickle_module, pickle_file, overall_storage, **pickle_load_args)
1421 unpickler.persistent_load = persistent_load
-> 1422 result = unpickler.load()
1424 torch._utils._validate_loaded_sparse_tensors()

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:1392, in _load.<locals>.persistent_load(saved_id)
1391 nbytes = numel * torch._utils._element_size(dtype)
-> 1392 typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
1394 return typed_storage

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:1366, in _load.<locals>.load_tensor(dtype, numel, key, location)
1363 # TODO: Once we decide to break serialization FC, we can
1364 # stop wrapping with TypedStorage
1365 typed_storage = torch.storage.TypedStorage(
-> 1366 wrap_storage=restore_location(storage, location),
1367 dtype=dtype,
1368 _internal=True)
1370 if typed_storage._data_ptr() != 0:

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:1299, in _get_restore_location.<locals>.restore_location(storage, location)
1298 def restore_location(storage, location):
-> 1299 return default_restore_location(storage, str(map_location))

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:381, in default_restore_location(storage, location)
380 for _, _, fn in _package_registry:
--> 381 result = fn(storage, location)
382 if result is not None:

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:274, in _cuda_deserialize(obj, location)
273 if location.startswith('cuda'):
--> 274 device = validate_cuda_device(location)
275 if getattr(obj, "_torch_load_uninitialized", False):

File ~\AppData\Roaming\Python\Python311\site-packages\torch\serialization.py:258, in validate_cuda_device(location)
257 if not torch.cuda.is_available():
--> 258 raise RuntimeError('Attempting to deserialize object on a CUDA '
259 'device but torch.cuda.is_available() is False. '
260 'If you are running on a CPU-only machine, '
261 "please use torch.load with map_location=torch.device('cpu') "
262 'to map your storages to the CPU.')
263 device_count = torch.cuda.device_count()

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

During handling of the above exception, another exception occurred:

SystemExit Traceback (most recent call last)
[... skipping hidden 1 frame]

Cell In[3], line 26
15 #Example
16 #lng="uk"
17 #prompt= "Whisper, Ok. "
(...)
23 # +"Ok, Whisper. "
24 #path="/path/to/your/uk/sound/file"
---> 26 loadModel("0")
27 result = transcribePrompt(path=path, lng=lng, prompt=prompt)

File ~\Downloads\Jupyter\WhisperHallu\transcribeHallu.py:117, in loadModel(gpu, modelSize)
116 print("Can't load Whisper model: "+whisperFound+"/"+modelSize)
--> 117 sys.exit(-1)

SystemExit: -1

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
[... skipping hidden 1 frame]

File ~\anaconda3\Lib\site-packages\IPython\core\interactiveshell.py:2097, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
2094 if exception_only:
2095 stb = ['An exception has occurred, use %tb to see '
2096 'the full traceback.\n']
-> 2097 stb.extend(self.InteractiveTB.get_exception_only(etype,
2098 value))
2099 else:
2101 def contains_exceptiongroup(val):

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:710, in ListTB.get_exception_only(self, etype, value)
702 def get_exception_only(self, etype, value):
703 """Only print the exception type and message, without a traceback.
704
705 Parameters
(...)
708 value : exception value
709 """
--> 710 return ListTB.structured_traceback(self, etype, value)

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:568, in ListTB.structured_traceback(self, etype, evalue, etb, tb_offset, context)
565 chained_exc_ids.add(id(exception[1]))
566 chained_exceptions_tb_offset = 0
567 out_list = (
--> 568 self.structured_traceback(
569 etype,
570 evalue,
571 (etb, chained_exc_ids), # type: ignore
572 chained_exceptions_tb_offset,
573 context,
574 )
575 + chained_exception_message
576 + out_list)
578 return out_list

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:1435, in AutoFormattedTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
1433 else:
1434 self.tb = etb
-> 1435 return FormattedTB.structured_traceback(
1436 self, etype, evalue, etb, tb_offset, number_of_lines_of_context
1437 )

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:1326, in FormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
1323 mode = self.mode
1324 if mode in self.verbose_modes:
1325 # Verbose modes need a full traceback
-> 1326 return VerboseTB.structured_traceback(
1327 self, etype, value, tb, tb_offset, number_of_lines_of_context
1328 )
1329 elif mode == 'Minimal':
1330 return ListTB.get_exception_only(self, etype, value)

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:1173, in VerboseTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
1164 def structured_traceback(
1165 self,
1166 etype: type,
(...)
1170 number_of_lines_of_context: int = 5,
1171 ):
1172 """Return a nice text document describing the traceback."""
-> 1173 formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
1174 tb_offset)
1176 colors = self.Colors # just a shorthand + quicker name lookup
1177 colorsnormal = colors.Normal # used a lot

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:1063, in VerboseTB.format_exception_as_a_whole(self, etype, evalue, etb, number_of_lines_of_context, tb_offset)
1060 assert isinstance(tb_offset, int)
1061 head = self.prepare_header(str(etype), self.long_header)
1062 records = (
-> 1063 self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
1064 )
1066 frames = []
1067 skipped = 0

File ~\anaconda3\Lib\site-packages\IPython\core\ultratb.py:1131, in VerboseTB.get_records(self, etb, number_of_lines_of_context, tb_offset)
1129 while cf is not None:
1130 try:
-> 1131 mod = inspect.getmodule(cf.tb_frame)
1132 if mod is not None:
1133 mod_name = mod.__name__

AttributeError: 'tuple' object has no attribute 'tb_frame'

In my research, I discovered that the issue is related to the GPU. Here are my PC specifications:
AMD Ryzen 7 6800H with Radeon Graphics 3.20 GHz
16.0 GB RAM
NVIDIA GeForce RTX 3070 Ti
Thank you in advance.
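
The traceback bottoms out in validate_cuda_device: torch cannot see a CUDA device, so loading the checkpoint onto cuda:0 fails. A quick diagnostic using only standard PyTorch calls:

import torch

# If this prints False, the installed torch build has no working CUDA
# support (or the NVIDIA driver is missing) and loadModel("0") will fail.
print(torch.cuda.is_available())
print(torch.version.cuda)  # CUDA version torch was built against, or None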

RAM Exceedance on Google Colab Handling Long Audio

The standard Google Colab notebook crashes on a ~40-minute audio file, as it uses up the RAM during voice extraction with Demucs. Could we prevent this by not loading the Whisper model until Demucs finishes?
