mahmoudashraf97 / whisper-diarization



Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

License: BSD 2-Clause "Simplified" License

Python 42.20% Jupyter Notebook 57.80%

asr speaker-diarization speech speech-recognition speech-to-text whisper


whisper-diarization's Issues

whisper-diarization vs. whisperX

Has anyone compared the performance of these two models?

linalg.eigh Convergence Issue in Speaker Diarization

Hi all,

First, thanks to @MahmoudAshraf97 for this valuable diarization tool. It is much appreciated.

I encountered an error in some audio files with "linalg.eigh" failing to converge:

_LinAlgError: linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 842).

Has anyone faced this issue and found a solution?
Any suggestions would be appreciated.
Thanks!

Out of memory in Colab (free)

Thanks for the Colab.

I ran into the dreaded out-of-memory error while processing the embedding files saved to nemo_outputs/speaker_outputs/embeddings.

Just trying to understand: my audio file is only 1 h 18 min long, but its embeddings amount to: "Dataset loaded with 9209 items, total duration of 2.55 hours." Since segmentation and subsegmentation are done automatically, I wonder what the upper limit on file length would be for the script to work on Colab.
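One possible reading of these numbers, offered as a rough sketch and not as a description of NeMo's actual configuration: if the speech is cut into fixed-length windows that overlap, the summed segment duration exceeds the length of the audio. The 1.0 s window and 0.5 s shift below are assumptions chosen for illustration, and `segment_stats` is a hypothetical helper.

```python
def segment_stats(speech_seconds, window, shift):
    """Count sliding windows over the speech and sum their durations."""
    if speech_seconds < window:
        return 0, 0.0
    # One window at offset 0, then one more for every full shift that still fits.
    count = int((speech_seconds - window) // shift) + 1
    return count, count * window


# 1 h 18 min of audio, treated as all speech (assumption).
count, total_seconds = segment_stats(78 * 60, window=1.0, shift=0.5)
print(count, "segments,", round(total_seconds / 3600, 2), "hours in total")
```

With these assumed values, a 78-minute file produces roughly twice its own duration in segments, which is in the same range as the 9209 items and 2.55 hours reported. Silence dropped before segmentation would explain a somewhat lower figure.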

Full error:
RuntimeError: CUDA out of memory. Tried to allocate 7.61 GiB (GPU 0; 14.75 GiB total capacity; 7.81 GiB already allocated; 5.65 GiB free; 7.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
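The error message itself points at `max_split_size_mb` and `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of setting it from Python follows. The value 128 is an assumed starting point, not a setting recommended by this project.

```python
import os

# The allocator reads this variable when CUDA is first initialized,
# so it has to be set before torch is imported.
# 128 is an assumed value to experiment with.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only helps with fragmentation. In the error above a single 7.61 GiB allocation was requested with 5.65 GiB free, so shorter audio or a smaller batch may still be needed.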

pip install errors

This error related to nemo_toolkit dependencies is causing pip install -r requirements.txt to fail.

I have a similar setup (python 3.10.6, latest pip/setuptools, macOS 13.1 M1)

How can I improve the diarization sensitivity?

I tried the code on a 30-minute, two-speaker conversation and it detected only one speaker. It also put all the text under a single timestamp, e.g. 00:00:01,020 --> 00:29:13,065 ... Is this normal?

Can we somehow increase the diarization sensitivity? I looked into the NVIDIA documentation but couldn't find a way to do it.
I also tried whisperX; it worked much better but still misclassified some parts of the text.
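To check an SRT file for cues like the one quoted above, where one timestamp spans nearly the whole recording, a small sketch could look like this. The helper names and the 60-second threshold are hypothetical.

```python
import re

# SRT timestamps look like HH:MM:SS,mmm
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp):
    """Convert one SRT timestamp to seconds."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def cue_duration(timing_line):
    """Duration in seconds of a 'start --> end' timing line."""
    start, end = timing_line.split("-->")
    return to_seconds(end) - to_seconds(start)


def is_suspicious(timing_line, max_seconds=60.0):
    """Flag cues longer than an assumed threshold."""
    return cue_duration(timing_line) > max_seconds


print(is_suspicious("00:00:01,020 --> 00:29:13,065"))
```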

Faster With GPUs?

Progress on diarisation

I have made some (modest) progress on this, if anyone wishes to have a look:

https://github.com/mirix/approaches-to-diarisation/tree/main

I'm getting an error

Hello, first of all, thanks for the project.
I installed the requirements and ran the following command as you said:
python diarize.py -a AUDIO_FILE_NAME
I am getting the following error:
Can you help me? Thanks.

│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:10 │
│ 02 in _get_module                                                            │
│                                                                              │
│    999 │                                                                     │
│   1000 │   def _get_module(self, module_name: str):                          │
│   1001 │   │   try:                                                          │
│ ❱ 1002 │   │   │   return importlib.import_module("." + module_name, self.__ │
│   1003 │   │   except Exception as e:                                        │
│   1004 │   │   │   raise RuntimeError(                                       │
│   1005 │   │   │   │   f"Failed to import {self.__name__}.{module_name} beca │
│                                                                              │
│ /usr/lib/python3.8/importlib/__init__.py:127 in import_module                │
│                                                                              │
│   124 │   │   │   if character != '.':                                       │
│   125 │   │   │   │   break                                                  │
│   126 │   │   │   level += 1                                                 │
│ ❱ 127 │   return _bootstrap._gcd_import(name[level:], package, level)        │
│   128                                                                        │
│   129                                                                        │
│   130 _RELOADING = {}                                                        │
│ <frozen importlib._bootstrap>:1014 in _gcd_import                            │
│ <frozen importlib._bootstrap>:991 in _find_and_load                          │
│ <frozen importlib._bootstrap>:975 in _find_and_load_unlocked                 │
│ <frozen importlib._bootstrap>:671 in _load_unlocked                          │
│ <frozen importlib._bootstrap_external>:848 in exec_module                    │
│ <frozen importlib._bootstrap>:219 in _call_with_frames_removed               │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/xlm_roberta/model │
│ ing_tf_xlm_roberta.py:19 in <module>                                         │
│                                                                              │
│    16 """ TF 2.0 XLM-RoBERTa model."""                                       │
│    17                                                                        │
│    18 from ...utils import add_start_docstrings, logging                     │
│ ❱  19 from ..roberta.modeling_tf_roberta import (                            │
│    20 │   TFRobertaForCausalLM,                                              │
│    21 │   TFRobertaForMaskedLM,                                              │
│    22 │   TFRobertaForMultipleChoice,                                        │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/roberta/modeling_ │
│ tf_roberta.py:36 in <module>                                                 │
│                                                                              │
│     33 │   TFSequenceClassifierOutput,                                       │
│     34 │   TFTokenClassifierOutput,                                          │
│     35 )                                                                     │
│ ❱   36 from ...modeling_tf_utils import (                                    │
│     37 │   TFCausalLanguageModelingLoss,                                     │
│     38 │   TFMaskedLanguageModelingLoss,                                     │
│     39 │   TFModelInputType,                                                 │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/modeling_tf_utils.py:38  │
│ in <module>                                                                  │
│                                                                              │
│     35 from tensorflow.python.keras.saving import hdf5_format                │
│     36                                                                       │
│     37 from huggingface_hub import Repository, list_repo_files               │
│ ❱   38 from keras.saving.hdf5_format import save_attributes_to_hdf5_group    │
│     39 from requests import HTTPError                                        │
│     40 from transformers.utils.hub import convert_file_size_to_int, get_chec │
│     41                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'keras.saving.hdf5_format'

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/whisper-diarization/diarize.py:145 in <module>                      │
│                                                                              │
│   142                                                                        │
│   143 if whisper_results["language"] in punct_model_langs:                   │
│   144 │   # restoring punctuation in the transcript to help realign the sent │
│ ❱ 145 │   punct_model = PunctuationModel(model="kredor/punctuate-all")       │
│   146 │                                                                      │
│   147 │   words_list = list(map(lambda x: x["word"], wsm))                   │
│   148                                                                        │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/deepmultilingualpunctuation/punctuati │
│ onmodel.py:9 in __init__                                                     │
│                                                                              │
│    6 class PunctuationModel():                                               │
│    7 │   def __init__(self, model = "oliverguhr/fullstop-punctuation-multila │
│    8 │   │   if torch.cuda.is_available():                                   │
│ ❱  9 │   │   │   self.pipe = pipeline("ner",model, grouped_entities=False, d │
│   10 │   │   else:                                                           │
│   11 │   │   │   self.pipe = pipeline("ner",model, grouped_entities=False)   │
│   12                                                                         │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/pipelines/__init__.py:65 │
│ 0 in pipeline                                                                │
│                                                                              │
│   647 │   # Forced if framework already defined, inferred if it's None       │
│   648 │   # Will load the correct model if possible                          │
│   649 │   model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["p │
│ ❱ 650 │   framework, model = infer_framework_load_model(                     │
│   651 │   │   model,                                                         │
│   652 │   │   model_classes=model_classes,                                   │
│   653 │   │   config=config,                                                 │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py:233 in │
│ infer_framework_load_model                                                   │
│                                                                              │
│    230 │   │   │   │   │   if _class is not None:                            │
│    231 │   │   │   │   │   │   classes.append(_class)                        │
│    232 │   │   │   │   if look_tf:                                           │
│ ❱  233 │   │   │   │   │   _class = getattr(transformers_module, f"TF{archit │
│    234 │   │   │   │   │   if _class is not None:                            │
│    235 │   │   │   │   │   │   classes.append(_class)                        │
│    236 │   │   │   class_tuple = class_tuple + tuple(classes)                │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:99 │
│ 3 in __getattr__                                                             │
│                                                                              │
│    990 │   │   │   value = self._get_module(name)                            │
│    991 │   │   elif name in self._class_to_module.keys():                    │
│    992 │   │   │   module = self._get_module(self._class_to_module[name])    │
│ ❱  993 │   │   │   value = getattr(module, name)                             │
│    994 │   │   else:                                                         │
│    995 │   │   │   raise AttributeError(f"module {self.__name__} has no attr │
│    996                                                                       │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:99 │
│ 2 in __getattr__                                                             │
│                                                                              │
│    989 │   │   if name in self._modules:                                     │
│    990 │   │   │   value = self._get_module(name)                            │
│    991 │   │   elif name in self._class_to_module.keys():                    │
│ ❱  992 │   │   │   module = self._get_module(self._class_to_module[name])    │
│    993 │   │   │   value = getattr(module, name)                             │
│    994 │   │   else:                                                         │
│    995 │   │   │   raise AttributeError(f"module {self.__name__} has no attr │
│                                                                              │
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:10 │
│ 04 in _get_module                                                            │
│                                                                              │
│   1001 │   │   try:                                                          │
│   1002 │   │   │   return importlib.import_module("." + module_name, self.__ │
│   1003 │   │   except Exception as e:                                        │
│ ❱ 1004 │   │   │   raise RuntimeError(                                       │
│   1005 │   │   │   │   f"Failed to import {self.__name__}.{module_name} beca │
│   1006 │   │   │   │   f" traceback):\n{e}"                                  │
│   1007 │   │   │   ) from e                                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to import 
transformers.models.xlm_roberta.modeling_tf_xlm_roberta because of the following
error (look up to see its traceback):
No module named 'keras.saving.hdf5_format'
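The traceback shows transformers failing while importing its TensorFlow model classes, although the punctuation model is loaded through PyTorch. One workaround to try, as a sketch: to my understanding transformers reads the `USE_TF` and `USE_TORCH` environment variables at import time, so setting them before the import should keep it away from the TensorFlow code path. Whether this resolves the error in this exact environment is an assumption.

```python
import os

# Set these before transformers (or anything that imports it) is loaded.
os.environ["USE_TF"] = "0"      # ask transformers not to use TensorFlow
os.environ["USE_TORCH"] = "1"   # keep the PyTorch backend enabled

print(os.environ["USE_TF"], os.environ["USE_TORCH"])
```

In a notebook this belongs in the first cell, before diarize.py or deepmultilingualpunctuation is imported.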

Faster Whisper

Hi,

I was wondering if there's an easy way to utilise this: https://github.com/guillaumekln/faster-whisper

I am using the Notebook file.

Thanks

unable to locate SRT file

I ran diarize.py and everything seems to work: I can see all the outputs in temp_outputs, but I am unable to locate the SRT file.

Running diarize.py alone should create the SRT file, right? Or do I need to run Whisper separately first?
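A quick way to look for the file is a recursive search for `.srt` files under a directory. It makes no assumption about where the script writes its output; `find_srt_files` is a hypothetical helper. One place worth checking is the folder that holds the input audio, though the issue does not confirm that location.

```python
from pathlib import Path


def find_srt_files(root):
    """Return every .srt file below root, sorted by path."""
    return sorted(p for p in Path(root).rglob("*.srt") if p.is_file())


# Search from the current working directory.
for path in find_srt_files("."):
    print(path)
```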

Can you adapt it for the Google Colab platform?

!git clone https://github.com/MahmoudAshraf97/whisper-diarization
cd /content/whisper-diarization/
!pip install -r ./requirements.txt
!python diarize.py -a cs.m4a

I had a problem running on the Colab platform and didn't know how to fix it.

dependencies missing from requirements.txt

omegaconf is imported in helpers.py but is not listed in requirements.txt.
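A small sketch for spotting such gaps before running the script: check whether each top-level module can be found in the current environment. `missing_modules` is a hypothetical helper, and apart from omegaconf the module list below is only an example.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Use top-level names only; dotted names need their parent package installed.
print(missing_modules(["omegaconf", "json"]))
```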

Requirements.txt Error

A recent update to the NeMo version is causing a dependency error with the transformer version.
Previous command: git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[asr]
New command: nemo_toolkit[asr]==1.15.0
The new command is the one that gives the error.

Article comparing whisper-diarization to other diarization methods

Guys,

Do you have an article comparing whisper-diarization to other diarization methods?

Thanks a lot,
AlexG.

Issue with Colab

Tried to run "Whisper_Transcription_+_NeMo_Diarization.ipynb"
First installation step produced an error.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.20.1 which is incompatible.
tensorflow-metadata 1.13.1 requires protobuf<5,>=3.20.3, but you have protobuf 3.20.1 which is incompatible.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.20.1 which is incompatible.
googleapis-common-protos 1.59.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-translate 3.11.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-language 2.9.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-firestore 2.11.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-datastore 2.15.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-bigquery 3.9.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-bigquery-storage 2.19.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-api-core 2.11.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
Successfully installed protobuf-3.20.1 pytorch-lightning-1.6.5

OSError                                   Traceback (most recent call last)
[<ipython-input-6-99c1ac25fe61>](https://localhost:8080/#) in <cell line: 11>()
      9 import librosa
     10 import soundfile
---> 11 from nemo.collections.asr.models.msdd_models import NeuralDiarizer
     12 from deepmultilingualpunctuation import PunctuationModel
     13 import re

ModuleNotFoundError: No module named 'pytorch_lightning.core.module'

I am trying to install this in a miniconda3 environment. It runs, but it keeps saying "ModuleNotFoundError: No module named 'pytorch_lightning.core.module'".

I even tried creating a completely new environment, to no avail.

I installed from requirements.txt. Does this not work in a virtual environment?

I also tried installing lightning manually, again to no avail.
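When a submodule is missing like this, it can help to confirm which versions are actually installed in the active environment. A minimal sketch using only the standard library follows; `installed_version` is a hypothetical helper, and the distribution names are the ones that appear in the logs on this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in ("pytorch-lightning", "nemo_toolkit", "torch"):
    print(name, installed_version(name))
```

Running this inside the same conda environment that runs diarize.py shows whether the interpreter sees the versions that requirements.txt was meant to install.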

ValueError: No default align-model for language: tl

ValueError Traceback (most recent call last)
in <cell line: 10>()
8
9 device = "cuda"
---> 10 alignment_model, metadata = whisperx.load_align_model(
11 language_code=whisper_results["language"], device=device
12 )

/usr/local/lib/python3.10/dist-packages/whisperx/alignment.py in load_align_model(language_code, device, model_name)
51 print(f"There is no default alignment model set for this language ({language_code}).
52 Please find a wav2vec2.0 model finetuned on this language in https://huggingface.co/models, then pass the model name in --align_model [MODEL_NAME]")
---> 53 raise ValueError(f"No default align-model for language: {language_code}")
54
55 if model_name in torchaudio.pipelines.all:

ValueError: No default align-model for language: tl
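The message printed by whisperx suggests passing a wav2vec2 model name for languages without a default. A sketch of how calling code could decide what to do before calling `whisperx.load_align_model` follows. The function, the language set and the model name are all hypothetical placeholders, not real defaults or real checkpoints.

```python
# Illustrative subset only; the real list lives inside whisperx.
DEFAULT_ALIGN_LANGUAGES = {"en", "fr", "de"}

# Placeholder entry: replace with a wav2vec2 model fine-tuned on the language.
CUSTOM_ALIGN_MODELS = {"tl": "your-org/your-wav2vec2-tagalog-model"}


def choose_alignment(language_code):
    """Return ('default', None), ('custom', model_name) or ('skip', None)."""
    if language_code in DEFAULT_ALIGN_LANGUAGES:
        return "default", None
    model_name = CUSTOM_ALIGN_MODELS.get(language_code)
    if model_name is not None:
        return "custom", model_name
    # No alignment model known: fall back to Whisper's own timestamps.
    return "skip", None


print(choose_alignment("tl"))
```

With a "custom" result, the name would be passed as the `model_name` argument that appears in the `load_align_model` signature in the traceback.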

No module named 'faster_whisper'

Hi there,

I'm getting an error that the "faster_whisper" module doesn't exist. Would love any guidance.

Thanks!

Issue in NeMo MSDD diarization model

While trying to diarize using the MSDD model, I get the below error:

PicklingError: Can't pickle <class 'nemo.collections.common.parts.preprocessing.collections.SpeechLabelEntity'>: attribute lookup SpeechLabelEntity on nemo.collections.common.parts.preprocessing.collections failed

Please let me know if anyone has encountered this and solved it.

Issue on google colab

Hi there,

Thank you for your valuable work; it is very much appreciated. However, I encountered an error when trying to run the code on Google Colab.
It fails at this line: from nemo.collections.asr.models.msdd_models import NeuralDiarizer

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
[<ipython-input-4-a7e6fec2e8e2>](https://localhost:8080/#) in <cell line: 11>()
      9 import librosa
     10 import soundfile
---> 11 from nemo.collections.asr.models.msdd_models import NeuralDiarizer
     12 from deepmultilingualpunctuation import PunctuationModel
     13 import re

18 frames
[/usr/local/lib/python3.9/dist-packages/nemo/collections/asr/__init__.py](https://localhost:8080/#) in <module>
     13 # limitations under the License.
     14 
---> 15 from nemo.collections.asr import data, losses, models, modules
     16 from nemo.package_info import __version__
     17 

[/usr/local/lib/python3.9/dist-packages/nemo/collections/asr/losses/__init__.py](https://localhost:8080/#) in <module>
     13 # limitations under the License.
     14 
---> 15 from nemo.collections.asr.losses.angularloss import AngularSoftmaxLoss
     16 from nemo.collections.asr.losses.audio_losses import SDRLoss
     17 from nemo.collections.asr.losses.ctc import CTCLoss

[/usr/local/lib/python3.9/dist-packages/nemo/collections/asr/losses/angularloss.py](https://localhost:8080/#) in <module>
     16 import torch
     17 
---> 18 from nemo.core.classes import Loss, Typing, typecheck
     19 from nemo.core.neural_types import LabelsType, LogitsType, LossType, NeuralType
     20 

[/usr/local/lib/python3.9/dist-packages/nemo/core/__init__.py](https://localhost:8080/#) in <module>
     14 
     15 import nemo.core.neural_types
---> 16 from nemo.core.classes import *

[/usr/local/lib/python3.9/dist-packages/nemo/core/classes/__init__.py](https://localhost:8080/#) in <module>
     16 import hydra
     17 import omegaconf
---> 18 import pytorch_lightning
     19 
     20 from nemo.core.classes.common import (

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/__init__.py](https://localhost:8080/#) in <module>
     28     _logger.propagate = False
     29 
---> 30 from pytorch_lightning.callbacks import Callback  # noqa: E402
     31 from pytorch_lightning.core import LightningDataModule, LightningModule  # noqa: E402
     32 from pytorch_lightning.trainer import Trainer  # noqa: E402

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/callbacks/__init__.py](https://localhost:8080/#) in <module>
     12 # See the License for the specific language governing permissions and
     13 # limitations under the License.
---> 14 from pytorch_lightning.callbacks.base import Callback
     15 from pytorch_lightning.callbacks.device_stats_monitor import DeviceStatsMonitor
     16 from pytorch_lightning.callbacks.early_stopping import EarlyStopping

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/callbacks/base.py](https://localhost:8080/#) in <module>
     23 
     24 import pytorch_lightning as pl
---> 25 from pytorch_lightning.utilities.types import STEP_OUTPUT
     26 
     27 

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/__init__.py](https://localhost:8080/#) in <module>
     16 import numpy
     17 
---> 18 from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
     19 from pytorch_lightning.utilities.distributed import AllGatherGrad  # noqa: F401
     20 from pytorch_lightning.utilities.enums import (  # noqa: F401

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/apply_func.py](https://localhost:8080/#) in <module>
     27 
     28 from pytorch_lightning.utilities.exceptions import MisconfigurationException
---> 29 from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_LEGACY
     30 from pytorch_lightning.utilities.warnings import rank_zero_deprecation
     31 

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/imports.py](https://localhost:8080/#) in <module>
    116 _TORCH_QUANTIZE_AVAILABLE = bool([eg for eg in torch.backends.quantized.supported_engines if eg != "none"])
    117 _TORCHTEXT_AVAILABLE = _package_available("torchtext")
--> 118 _TORCHTEXT_LEGACY: bool = _TORCHTEXT_AVAILABLE and _compare_version("torchtext", operator.lt, "0.11.0")
    119 _TORCHVISION_AVAILABLE = _package_available("torchvision")
    120 _WANDB_AVAILABLE = _package_available("wandb")

[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/imports.py](https://localhost:8080/#) in _compare_version(package, op, version, use_base_version)
     69     """
     70     try:
---> 71         pkg = importlib.import_module(package)
     72     except (ImportError, DistributionNotFound):
     73         return False

[/usr/lib/python3.9/importlib/__init__.py](https://localhost:8080/#) in import_module(name, package)
    125                 break
    126             level += 1
--> 127     return _bootstrap._gcd_import(name[level:], package, level)
    128 
    129 

[/usr/local/lib/python3.9/dist-packages/torchtext/__init__.py](https://localhost:8080/#) in <module>
      4 
      5 # the following import has to happen first in order to load the torchtext C++ library
----> 6 from torchtext import _extension  # noqa: F401
      7 
      8 _TEXT_BUCKET = "https://download.pytorch.org/models/text/"

[/usr/local/lib/python3.9/dist-packages/torchtext/_extension.py](https://localhost:8080/#) in <module>
     62 
     63 
---> 64 _init_extension()

[/usr/local/lib/python3.9/dist-packages/torchtext/_extension.py](https://localhost:8080/#) in _init_extension()
     56         raise ImportError("torchtext C++ Extension is not found.")
     57 
---> 58     _load_lib("libtorchtext")
     59     # This import is for initializing the methods registered via PyBind11
     60     # This has to happen after the base library is loaded

[/usr/local/lib/python3.9/dist-packages/torchtext/_extension.py](https://localhost:8080/#) in _load_lib(lib)
     48     if not path.exists():
     49         return False
---> 50     torch.ops.load_library(path)
     51     return True
     52 

[/usr/local/lib/python3.9/dist-packages/torch/_ops.py](https://localhost:8080/#) in load_library(self, path)
    571             ) from e
    572 
--> 573         # let the script frontend know that op is identical to the builtin op
    574         # with qualified_op_name
    575         torch.jit._builtins._register_builtin(op, qualified_op_name)

[/usr/lib/python3.9/ctypes/__init__.py](https://localhost:8080/#) in __init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    372 
    373         if handle is None:
--> 374             self._handle = _dlopen(self._name, mode)
    375         else:
    376             self._handle = handle

OSError: /usr/local/lib/python3.9/dist-packages/torchtext/lib/libtorchtext.so: undefined symbol: _ZN2at4_ops10select_int4callERKNS_6TensorElN3c106SymIntE

It looks like torchtext causes this, so I tried uninstalling and reinstalling it, but nothing changed.

Error converting audio for NeMo compatibility?

Hello Mahmoud. First off, thank you for this awesome contribution to the community. I've been trying to get reliable diarization with whisper for months, so I'm excited to try your implementation.

However, I'm running into the following error when running your diarize.py script on a random audio file from YouTube:

(MA97_whisper_diarization) PS C:\-\-\-\-\repos\whisper-diarization> python diarize.py -a ..\..\audio\lotr_trailer.wav
[NeMo W 2023-02-07 16:30:09 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-02-07 16:30:10 nemo_logging:349] C:\-\-\Miniconda3\envs\MA97_whisper_diarization\lib\site-packages\torch\jit\annotations.py:309: UserWarning: TorchScript will treat type annotations of Tensor dtype-specific subtypes as if they are normal Tensors. dtype constraints are not enforced in compilation either.
      warnings.warn("TorchScript will treat type annotations of Tensor "

Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Source splitting failed, using original audio file. Use --no-stem argument to disable it.
100%|██████████████████████████████████████████████████████████████████████████████| 17569/17569 [00:50<00:00, 346.50frames/s]
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to C:\-\-/.cache\torch\hub\checkpoints\wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|███████████████████████████████████████████████████████████████████████| 360M/360M [00:05<00:00, 69.7MB/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\-\-\-\-\repos\whisper-diarization\diarize.py:115 in       │
│ <module>                                                                                         │
│                                                                                                  │
│   112                                                                                            │
│   113 # convert audio to mono for NeMo combatibility                                             │
│   114 signal, sample_rate = librosa.load(vocal_target, sr=None)                                  │
│ ❱ 115 os.chdir("temp_outputs")                                                                   │
│   116 soundfile.write("mono_file.wav", signal, sample_rate, "PCM_24")                            │
│   117                                                                                            │
│   118 # Initialize NeMo MSDD diarization model                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'temp_outputs'

Any recommendations for a fix?
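The traceback fails on `os.chdir("temp_outputs")`, and the log above it reports that source splitting failed. One plausible reading, which is an assumption, is that the directory would normally have been created by that earlier step. A defensive sketch is to create the directory before changing into it. `enter_output_dir` is a hypothetical helper, not part of diarize.py.

```python
import os


def enter_output_dir(path="temp_outputs"):
    """Create the output directory if needed, then change into it."""
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    os.chdir(path)
    return os.getcwd()
```

The "Python was not found" line in the log suggests the splitting step could not start a Python subprocess on this Windows machine, so passing --no-stem, as the log message itself proposes, may also be worth trying.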

ValueError: mutable default for field override_dirname is not allowed: use default_factory
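The message in this title is what Python's dataclasses module raises when a field is given a mutable object as its default. The sketch below reproduces the same kind of error with a plain list and shows the `default_factory` form the message asks for. The class names are hypothetical and unrelated to hydra's own code.

```python
from dataclasses import dataclass, field


def build_broken():
    """Defining this dataclass raises ValueError: a list default is mutable."""
    @dataclass
    class Broken:
        items: list = []
    return Broken


@dataclass
class Fixed:
    # default_factory builds a fresh list for every instance.
    items: list = field(default_factory=list)


try:
    build_broken()
except ValueError as err:
    print("ValueError:", err)

print(Fixed().items)
```

In the traceback below, the failing dataclass lives inside the installed hydra package under Python 3.11, so the fix would have to come from the library version and not from user code.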

Traceback (most recent call last):
  File "C:\Users\gamen\Downloads\whisper-diarization-main\diarize.py", line 9, in <module>
    from nemo.collections.asr.models.msdd_models import NeuralDiarizer
  File "C:\Python311\Lib\site-packages\nemo\collections\asr\__init__.py", line 15, in <module>
    from nemo.collections.asr import data, losses, models, modules
  File "C:\Python311\Lib\site-packages\nemo\collections\asr\losses\__init__.py", line 15, in <module>
    from nemo.collections.asr.losses.angularloss import AngularSoftmaxLoss
  File "C:\Python311\Lib\site-packages\nemo\collections\asr\losses\angularloss.py", line 18, in <module>
    from nemo.core.classes import Loss, Typing, typecheck
  File "C:\Python311\Lib\site-packages\nemo\core\__init__.py", line 16, in <module>
    from nemo.core.classes import *
  File "C:\Python311\Lib\site-packages\nemo\core\classes\__init__.py", line 16, in <module>
    import hydra
  File "C:\Python311\Lib\site-packages\hydra\__init__.py", line 5, in <module>
    from hydra import utils
  File "C:\Python311\Lib\site-packages\hydra\utils.py", line 8, in <module>
    import hydra._internal.instantiate._instantiate2
  File "C:\Python311\Lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 12, in <module>
    from hydra._internal.utils import _locate
  File "C:\Python311\Lib\site-packages\hydra\_internal\utils.py", line 18, in <module>
    from hydra.core.utils import get_valid_filename, validate_config_path
  File "C:\Python311\Lib\site-packages\hydra\core\utils.py", line 20, in <module>
    from hydra.core.hydra_config import HydraConfig
  File "C:\Python311\Lib\site-packages\hydra\core\hydra_config.py", line 6, in <module>
    from hydra.conf import HydraConf
  File "C:\Python311\Lib\site-packages\hydra\conf\__init__.py", line 46, in <module>
    class JobConf:
  File "C:\Python311\Lib\site-packages\hydra\conf\__init__.py", line 75, in JobConf
    @dataclass
     ^^^^^^^^^
  File "C:\Python311\Lib\dataclasses.py", line 1221, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "C:\Python311\Lib\dataclasses.py", line 1211, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\dataclasses.py", line 959, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\dataclasses.py", line 816, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'hydra.conf.JobConf.JobConfig.OverrideDirname'> for field override_dirname is not allowed: use default_factory
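The paths in the traceback show Python 3.11, where `dataclasses` rejects any unhashable default value, including an instance of another regular dataclass. The sketch below reproduces the pattern with stand-in classes and shows the `default_factory` form the error message asks for. The class and field names only mirror those in the traceback for illustration; this is not Hydra's actual source, and the fix for an installed package would normally come from a Hydra release that supports the Python version in use.

```python
from dataclasses import dataclass, field


@dataclass
class OverrideDirname:
    # Illustrative fields only; the real class may differ.
    kv_sep: str = "="
    item_sep: str = ","


@dataclass
class JobConfig:
    # Writing `override_dirname: OverrideDirname = OverrideDirname()` here
    # raises "ValueError: mutable default ..." on Python 3.11 and later.
    # default_factory builds a fresh instance for every JobConfig instead.
    override_dirname: OverrideDirname = field(default_factory=OverrideDirname)
```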

List index out of range error for diarize_parallel.py

/home/.........../token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
  warnings.warn(
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/....../whisper-diarization/venv/diarize_parallel.py:137 in         │
│ <module>                                                                     │
│                                                                              │
│   134 │                                                                      │
│   135 │   words_list = list(map(lambda x: x["word"], wsm))                   │
│   136 │                                                                      │
│ ❱ 137 │   labled_words = punct_model.predict(words_list)                     │
│   138 │                                                                      │
│   139 │   ending_puncts = ".?!"                                              │
│   140 │   model_puncts = ".,?-:"                                             │
│                                                                              │
│ /home/.....//whisper-diarization/venv/ython3.10/site-packages/deepmultilingualpunctuation/pun │
│ ctuationmodel.py:39 in predict                                               │
│                                                                              │
│   36 │   │                                                                   │
│   37 │   │   # if the last batch is smaller than the overlap,                │
│   38 │   │   # we can just remove it                                         │
│ ❱ 39 │   │   if len(batches[-1]) <= overlap:                                 │
│   40 │   │   │   batches.pop()                                               │
│   41 │   │                                                                   │
│   42 │   │   tagged_words = []                                               │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range

I tried wrapping the len(batches[-1]) <= overlap check in the punctuationmodel.py file in a try block, and added a len(batches[0]) <= overlap check as well (I'm not great at programming and still learning). With that change I was able to generate SRT files and transcriptions for a couple of audio files I was working with, but then the error came back.

Hope this is helpful!
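The traceback shows `batches[-1]` raising `IndexError`, which means the list of batches was empty when that line ran. A minimal sketch of a guard on the caller's side follows, under the assumption that this happens when the transcription produced no words at all. The wrapper name `predict_punctuation` is hypothetical; only `punct_model.predict(words_list)` comes from the traceback.

```python
def predict_punctuation(punct_model, words_list):
    """Return the model's labelled words, or an empty list if there are no words."""
    if not words_list:
        # Assumption: with no input words the model builds no batches,
        # so its batches[-1] lookup would fail. Skip the call entirely.
        return []
    return punct_model.predict(words_list)
```

Guarding the call site avoids editing the installed package, so the change survives a reinstall of the dependency.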

Turn off source splitting

Hi,

What is the simplest way to do this?

Thanks

Issues with silence

I have a lot of files that error out because of silences which aren't of significant lengths - a couple of seconds or so.

I know Whisper has issues with silence and there is a VAD built in, but is there any way to improve this other than having to pre-process it?

Thanks

Stereo Voice File Diarization

I have a stereo voice file. The two channels can be separated into left and right (one speaker per channel). What should be modified in the code so that it processes and diarizes the file accordingly? Thanks
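One possible approach, assuming each speaker is isolated on their own channel and the input is an uncompressed PCM WAV file, is to split the stereo file into two mono files before any transcription and label each transcript by channel. The sketch below uses only the standard-library `wave` module; the function names are hypothetical and this is not something the repository provides.

```python
import wave


def deinterleave_stereo(frames, sample_width):
    """Split interleaved stereo PCM bytes into (left, right) mono byte strings."""
    frame_size = 2 * sample_width  # one left sample followed by one right sample
    left = bytearray()
    right = bytearray()
    for start in range(0, len(frames) - frame_size + 1, frame_size):
        left += frames[start:start + sample_width]
        right += frames[start + sample_width:start + frame_size]
    return bytes(left), bytes(right)


def split_stereo_wav(src_path, left_path, right_path):
    """Write the left and right channels of a stereo WAV file as two mono WAV files."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    left, right = deinterleave_stereo(frames, sample_width)
    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(sample_width)
            dst.setframerate(frame_rate)
            dst.writeframes(data)
```

With one speaker per file, the speaker label can be taken from the channel, and the two transcripts can then be merged by timestamp.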

error while running this command "python diarize.py -a "sample.mp3" --no-stem"

  def backtrace(trace: np.ndarray):
Traceback (most recent call last):
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\Scripts\whisper.exe\__main__.py", line 7, in <module>
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\whisper\transcribe.py", line 433, in cli
    model = load_model(model_name, device=device, download_root=model_dir)
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\whisper\__init__.py", line 144, in load_model
    checkpoint = torch.load(fp, map_location=device)
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1172, in _load
    result = unpickler.load()
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1142, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1116, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1083, in restore_location
    return default_restore_location(storage, map_location)
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 217, in default_restore_location
    result = fn(storage, location)
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 166, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
PS C:\Users\T_Care\Desktop\whisper_dia\whisper-diarization> python diarize.py -a "sample.mp3" --no-stem
C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\pkg_resources\__init__.py:123: PkgResourcesDeprecationWarning: otobuf is an invalid version and will not be supported in a future release
  warnings.warn(
[NeMo W 2023-07-03 10:47:18 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-03 10:47:18 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2023-07-03 10:47:36 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-07-03 10:47:36 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:47:36 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo
[NeMo I 2023-07-03 10:47:36 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:47:36 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-07-03 10:47:36 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-07-03 10:47:36 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-07-03 10:47:36 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:37 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:38 save_restore_connector:247] Model EncDecDiarLabelModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:47:38 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:38 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-07-03 10:47:38 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:47:38 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo
[NeMo I 2023-07-03 10:47:38 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:47:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-07-03 10:47:38 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-07-03 10:47:38 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-07-03 10:47:38 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:38 save_restore_connector:247] Model EncDecClassificationModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:47:38 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-07-03 10:47:38 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false
    }
[NeMo W 2023-07-03 10:47:38 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-07-03 10:47:38 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-07-03 10:47:38 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 332.27it/s] 
[NeMo I 2023-07-03 10:47:38 vad_utils:101] The prepared manifest file exists. Overwriting!
[NeMo I 2023-07-03 10:47:38 classification_models:263] Perform streaming frame-level VAD
[NeMo I 2023-07-03 10:47:38 collections:298] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-07-03 10:47:38 collections:301] Dataset loaded with 1 items, total duration of  0.00 hours.
[NeMo I 2023-07-03 10:47:38 collections:303] # 1 files loaded accounting to # 1 labels
vad:   0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s] 
Traceback (most recent call last):
  File "diarize.py", line 112, in <module>
    msdd_model.diarize()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 1180, in diarize
    self.clustering_embedding.prepare_cluster_embs_infer()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 699, in prepare_cluster_embs_infer
    self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 866, in run_clustering_diarizer   
    scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 437, in diarize
    self._perform_speech_activity_detection()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 325, in _perform_speech_activity_detection
    self._run_vad(manifest_vad_input)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 218, in _run_vad
    for i, test_batch in enumerate(
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1042, in __init__
    w.start()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'nemo.collections.common.parts.preprocessing.collections.SpeechLabelEntity'>: attribute lookup SpeechLabelEntity on nemo.collections.common.parts.preprocessing.collections failed
PS C:\Users\T_Care\Desktop\whisper_dia\whisper-diarization> C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\pkg_resources\__init__.py:123: PkgResourcesDeprecationWarning: otobuf is an invalid version and will not be supported in a future release
  warnings.warn(
[NeMo W 2023-07-03 10:47:42 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-03 10:47:42 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2023-07-03 10:47:59 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-07-03 10:47:59 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:47:59 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo
[NeMo I 2023-07-03 10:47:59 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:48:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-07-03 10:48:00 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-07-03 10:48:00 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-07-03 10:48:00 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:00 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:02 save_restore_connector:247] Model EncDecDiarLabelModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:48:02 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:02 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-07-03 10:48:02 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:48:02 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo
[NeMo I 2023-07-03 10:48:02 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:48:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-07-03 10:48:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-07-03 10:48:02 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-07-03 10:48:02 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:02 save_restore_connector:247] Model EncDecClassificationModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:48:02 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-07-03 10:48:02 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false
    }
[NeMo W 2023-07-03 10:48:02 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-07-03 10:48:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-07-03 10:48:02 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 335.06it/s] 
[NeMo I 2023-07-03 10:48:02 vad_utils:101] The prepared manifest file exists. Overwriting!
[NeMo I 2023-07-03 10:48:02 classification_models:263] Perform streaming frame-level VAD
[NeMo I 2023-07-03 10:48:02 collections:298] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-07-03 10:48:02 collections:301] Dataset loaded with 1 items, total duration of  0.00 hours.
[NeMo I 2023-07-03 10:48:02 collections:303] # 1 files loaded accounting to # 1 labels
vad:   0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s] 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\T_Care\Desktop\whisper_dia\whisper-diarization\diarize.py", line 112, in <module>
    msdd_model.diarize()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 1180, in diarize
    self.clustering_embedding.prepare_cluster_embs_infer()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 699, in prepare_cluster_embs_infer
    self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 866, in run_clustering_diarizer   
    scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 437, in diarize
    self._perform_speech_activity_detection()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 325, in _perform_speech_activity_detection
    self._run_vad(manifest_vad_input)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 218, in _run_vad
    for i, test_batch in enumerate(
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1042, in __init__
    w.start()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

cannot install dependencies

Hi!

I tried to install the dependencies from your requirements file and received an error:

ERROR: Cannot install nemo-toolkit[asr]==1.15.0 and transformers==4.26.1 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested transformers==4.26.1
nemo-toolkit[asr] 1.15.0 depends on transformers<=4.21.2 and >=4.0.1; extra == "asr"

Could you please explain what I should do to install all the dependencies?
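
For readers hitting the same conflict, the pinned transformers release can be checked by hand against the range that pip reports for nemo-toolkit[asr] 1.15.0. The snippet below is only a minimal sketch: `parse_version` and `satisfies` are hypothetical helpers written for this note, they handle plain dotted numeric releases only, and pip's real resolver follows the full versioning rules rather than this shortcut.

```python
def parse_version(text):
    """Turn a dotted release string such as "4.26.1" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower, upper):
    """Return True if lower <= version <= upper, with both bounds inclusive."""
    return parse_version(lower) <= parse_version(version) <= parse_version(upper)


# Values copied from the pip error message above.
requested = "4.26.1"                       # transformers pin in the requirements
nemo_lower, nemo_upper = "4.0.1", "4.21.2"  # range required by nemo-toolkit[asr] 1.15.0

print(satisfies(requested, nemo_lower, nemo_upper))
```

The pinned release falls outside the range, so both pins cannot be installed together; one of the two has to change.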

Can't set up and not sure what's wrong.

Please help.

C:\Users\Administrator\Documents\GitHub\whisper-diarization>pip install -r requirements.txt
Collecting git+https://github.com/m-bain/whisperX.git@4cb167a225c0ebaea127fd6049abfaa3af9f8bb4 (from -r requirements.txt (line 5))
  Cloning https://github.com/m-bain/whisperX.git (to revision 4cb167a225c0ebaea127fd6049abfaa3af9f8bb4) to c:\users\administrator\appdata\local\temp\pip-req-build-g_0nd3n1
  Running command git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git 'C:\Users\Administrator\AppData\Local\Temp\pip-req-build-g_0nd3n1'
  fatal: unable to access 'https://github.com/m-bain/whisperX.git/': Recv failure: Connection was reset
  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git 'C:\Users\Administrator\AppData\Local\Temp\pip-req-build-g_0nd3n1' did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git 'C:\Users\Administrator\AppData\Local\Temp\pip-req-build-g_0nd3n1' did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size

@MahmoudAshraf97
I get the error below:

python3 diarize.py -a ../whisperX/output.wav
[NeMo W 2023-06-19 11:53:43 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-06-19 11:53:44 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /data/dProjects/whisper-diarization/temp_outputs/htdemucs
Separating track ../whisperX/output.wav
100%|██████████████████████████████████████████████████████████████████████| 602.55/602.55 [00:12<00:00, 48.97seconds/s]
Failed to align segment (" Google."): backtrack failed, resorting to original...
Failed to align segment: duration smaller than 0.02s time precision
Failed to align segment: duration smaller than 0.02s time precision
Traceback (most recent call last):
  File "diarize.py", line 89, in <module>
    result_aligned = whisperx.align(
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/whisperx/alignment.py", line 224, in align
    emissions, _ = model(waveform_segment.to(device))
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torchaudio/models/wav2vec2/model.py", line 116, in forward
    x, lengths = self.feature_extractor(waveforms, lengths)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torchaudio/models/wav2vec2/components.py", line 141, in forward
    x, length = layer(x, length)  # (batch, feature, frame)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torchaudio/models/wav2vec2/components.py", line 90, in forward
    x = self.conv(x)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size

Updates to Portuguese Language

Just an update to everyone that reads this repo: it worked perfectly in Portuguese for me.

cuDNN version incompatibility

I get this every time I try to run it. I've tried countless ways to install a newer cuDNN. I am using Miniconda, but nothing works. Any ideas?

RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 5, 0) but found runtime version (8, 2, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.

Language issues [No default align-model for language: gu]

Error:
Detected language: Gujarati
100%|██████████| 8533/8533 [00:16<00:00, 512.95frames/s]
There is no default alignment model set for this language (gu). Please find a wav2vec2.0 model finetuned on this language in https://huggingface.co/models, then pass the model name in --align_model [MODEL_NAME]

ValueError Traceback (most recent call last)
in <cell line: 8>()
6
7 device = "cuda"
----> 8 alignment_model, metadata = whisperx.load_align_model(
9 language_code=whisper_results["language"], device=device
10 )

/usr/local/lib/python3.9/dist-packages/whisperx/alignment.py in load_align_model(language_code, device, model_name)
51 print(f"There is no default alignment model set for this language ({language_code}).
52 Please find a wav2vec2.0 model finetuned on this language in https://huggingface.co/models, then pass the model name in --align_model [MODEL_NAME]")
---> 53 raise ValueError(f"No default align-model for language: {language_code}")
54
55 if model_name in torchaudio.pipelines.__all__:

ValueError: No default align-model for language: gu
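
One way to keep a script running when no default alignment model exists for the detected language is to catch the ValueError and skip alignment. This is a rough sketch, not the repository's actual handling: `load_align_model_or_skip` and `fake_loader` are hypothetical names, and the loader is passed in as a parameter so the example runs without whisperx installed. In a real script the loader would presumably be `whisperx.load_align_model`, whose keyword arguments are taken from the traceback above.

```python
def load_align_model_or_skip(loader, language_code, device, model_name=None):
    """Call the loader; return (None, None) if no alignment model is available."""
    kwargs = {"language_code": language_code, "device": device}
    if model_name is not None:
        # Only forward an explicit model name, e.g. one found on the Hugging Face hub.
        kwargs["model_name"] = model_name
    try:
        return loader(**kwargs)
    except ValueError as error:
        print(f"Alignment skipped: {error}")
        return None, None


def fake_loader(language_code, device, model_name=None):
    """Hypothetical stand-in that mimics the failure shown in the traceback."""
    if language_code == "gu" and model_name is None:
        raise ValueError(f"No default align-model for language: {language_code}")
    return "model", {"language": language_code}


model, metadata = load_align_model_or_skip(fake_loader, "gu", "cpu")
print(model, metadata)
```

When the result is `(None, None)`, the caller would have to fall back to the unaligned Whisper timestamps, which are likely to be less precise.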

Diarization does not work for Russian?

I run diarization on an M1 Mac like this:
python3 diarize.py -a trimmed_1min.wav --no-stem --whisper-model large-v2 --device cpu

I get this error:
File "/Users/alexanderblagochevsky/Documents/whisper-diarization/diarize.py", line 160, in <module>
f'Punctuation restoration is not available for {whisper_results["language"]} language.'
TypeError: list indices must be integers or slices, not str

I've tried commenting out the punctuation part of diarize.py. The script then completes without error, but in the output the whole conversation is attributed to a single speaker.

Is there a demo video introducing this repo? I have installed it with VS Code, but I can't manage to use it; in particular, I'm unsure about the expected format of the audio file name.

Speaker Timestamps

I tested this with audio that has multiple speakers. It worked fairly well. Is there a way to add timestamps for when each speaker was detected?

error with ffmpeg while trying to transcribe

Can you please check this error and help me fix it so that I can get the transcription and diarization?
Error pasted below:

warnings.warn("FP16 is not supported on CPU; using FP32 instead")
    
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/audio.py:46 in load_audio            │
│                                                                                                  │
│    43 │   │   # This launches a subprocess to decode audio while down-mixing and resampling as   │
│    44 │   │   # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.             │
│    45 │   │   out, _ = (                                                                         │
│ ❱  46 │   │   │   ffmpeg.input(file, threads=0)                                                  │
│    47 │   │   │   .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)                  │
│    48 │   │   │   .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)     │
│    49 │   │   )                                                                                  │
│                                                                                                  │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/ffmpeg/_run.py:325 in run                    │
│                                                                                                  │
│   322 │   out, err = process.communicate(input)                                                  │
│   323 │   retcode = process.poll()                                                               │
│   324 │   if retcode:                                                                            │
│ ❱ 325 │   │   raise Error('ffmpeg', out, err)                                                    │
│   326 │   return out, err                                                                        │
│   327                                                                                            │
│   328                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Error: ffmpeg error (see stderr output for detail)

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/gvnavin/ws/whisper-diarization/diarize.py:94 in <module>                                   │
│                                                                                                  │
│    91                                                                                            │
│    92 # Large models result in considerably better and more aligned (words, timestamps) mappin   │
│    93 whisper_model = load_model(args.model_name)                                                │
│ ❱  94 whisper_results = whisper_model.transcribe(vocal_target, beam_size=None, verbose=False)    │
│    95                                                                                            │
│    96 # clear gpu vram                                                                           │
│    97 del whisper_model                                                                          │
│                                                                                                  │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/transcribe.py:121 in transcribe      │
│                                                                                                  │
│   118 │   │   decode_options["fp16"] = False                                                     │
│   119 │                                                                                          │
│   120 │   # Pad 30-seconds of silence to the input audio, for slicing                            │
│ ❱ 121 │   mel = log_mel_spectrogram(audio, padding=N_SAMPLES)                                    │
│   122 │   content_frames = mel.shape[-1] - N_FRAMES                                              │
│   123 │                                                                                          │
│   124 │   if decode_options.get("language", None) is None:                                       │
│                                                                                                  │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/audio.py:130 in log_mel_spectrogram  │
│                                                                                                  │
│   127 │   """                                                                                    │
│   128 │   if not torch.is_tensor(audio):                                                         │
│   129 │   │   if isinstance(audio, str):                                                         │
│ ❱ 130 │   │   │   audio = load_audio(audio)                                                      │
│   131 │   │   audio = torch.from_numpy(audio)                                                    │
│   132 │                                                                                          │
│   133 │   if device is not None:                                                                 │
│                                                                                                  │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/audio.py:51 in load_audio            │
│                                                                                                  │
│    48 │   │   │   .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)     │
│    49 │   │   )                                                                                  │
│    50 │   except ffmpeg.Error as e:                                                              │
│ ❱  51 │   │   raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e            │
│    52 │                                                                                          │
│    53 │   return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0             │
│    54                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to load audio: ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 
--enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio 
--enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack 
--enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband 
--enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab 
--enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 
--enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm 
--enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100

Is Korean available in this model?

Hello.

To begin with, thank you for developing and supporting this model.

As a Korean, I am interested in Korean-based STT implementation.

Would this whisper-diarization model work in Korean?

Thank you for trying to help if you can :)

Are there hardware dependencies?

I followed the instructions in the README, and successfully installed all specified dependencies. I'm now trying to run the package from the command line. Using a Windows Surface 3 laptop running Windows 10 Pro, I run into this error:

❯ python diarize.py -a MY_FILE.mp3

[NeMo W 2023-05-05 12:07:02 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-05-05 12:07:05 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Source splitting failed, using original audio file. Use --no-stem argument to disable it.
Traceback (most recent call last):
  File "<MY LOCAL PATH>/diarize.py", line 56, in <module>
    whisper_model = WhisperModel(args.model_name, device="cuda", compute_type="float16")
  File "<MY VENVS FOLDER>\whisper-diarization\lib\site-packages\faster_whisper\transcribe.py", line 120, in __init__
    self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

Is this a hardware issue or am I missing something? Thank you!

Could we set up a replicate.com instance to run this easily?

How to handle files with no voice?

I have a dataset of files. If I upload files that are, let's say, 10 seconds long with no voice or only some background noise, it raises the error ValueError: All files present in manifest contains silence, aborting next steps

How should I handle this in the code?
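
One possible approach, sketched under assumptions, is to catch this specific ValueError around the diarization call and treat the file as having no speakers. `diarize_or_empty` and `fake_diarization` are hypothetical names. The diarization step is passed in as a callable so the example runs on its own, and the check relies on the error message quoted above staying the same.

```python
def diarize_or_empty(run_diarization, audio_path):
    """Run diarization; return an empty result for files reported as silent."""
    try:
        return run_diarization(audio_path)
    except ValueError as error:
        # Only swallow the "silence" case; any other ValueError is a real problem.
        if "contains silence" in str(error):
            return []
        raise


def fake_diarization(audio_path):
    """Hypothetical stand-in that fails the way the real call does on silent audio."""
    raise ValueError("All files present in manifest contains silence, aborting next steps")


print(diarize_or_empty(fake_diarization, "silent.wav"))
```

Matching on the message text is brittle, so the check may need adjusting if the wording of the error changes.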

Can I use it in Turkish?

First of all, thank you for maintaining this tool.

Will it work in Turkish?

Regards.

Kernel size error

Hi,

When I run the whisperx.align part, certain sound files give me this error:

RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size.

Any idea on what causes this error and how to fix it?
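
A plausible cause is that a segment passed to the alignment model is too short to survive the model's convolutional feature extractor. The sketch below traces the sequence length through a stack of unpadded 1-D convolutions. The (kernel, stride) pairs are an assumption: they are those of the standard wav2vec 2.0 base configuration, and the alignment model actually in use may differ. `frames_after_feature_extractor` and `is_long_enough` are hypothetical helper names.

```python
# Assumed (kernel, stride) pairs of the standard wav2vec 2.0 base feature extractor.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def frames_after_feature_extractor(num_samples, layers=CONV_LAYERS):
    """Return the number of output frames, or raise if a layer's input is too short."""
    length = num_samples
    for index, (kernel, stride) in enumerate(layers):
        if length < kernel:
            raise ValueError(
                f"layer {index}: input size {length} is smaller than kernel size {kernel}"
            )
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def is_long_enough(num_samples, layers=CONV_LAYERS):
    """True if a waveform of this many samples passes through every layer."""
    try:
        frames_after_feature_extractor(num_samples, layers)
    except ValueError:
        return False
    return True


# Smallest waveform length that yields at least one output frame.
MIN_SAMPLES = next(n for n in range(1, 10_000) if is_long_enough(n))
print(MIN_SAMPLES)
```

Under these assumptions, a segment shorter than `MIN_SAMPLES` reaches a kernel-size-2 layer with a single value left, which matches the message in the error. Skipping or padding such segments before alignment may avoid the crash.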

mutable default <class 'nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategyConfig'> for field adapter_strategy is not allowed: use default_factory

I receive the following error message:

[NeMo W 2023-06-14 11:08:57 nemo_logging:393] Apex was not found. Using the lamb or fused_adam optimizer will error out.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/emoman/Work/diarisation/whisper-diarization/diarize.py:9 in <module>                       │
│                                                                                                  │
│     6 import torch                                                                               │
│     7 import librosa                                                                             │
│     8 import soundfile                                                                           │
│ ❱   9 from nemo.collections.asr.models.msdd_models import NeuralDiarizer                         │
│    10 from deepmultilingualpunctuation import PunctuationModel                                   │
│    11 import re                                                                                  │
│    12 import logging                                                                             │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/__init__.py:15 in <module> │
│                                                                                                  │
│   12 # See the License for the specific language governing permissions and                       │
│   13 # limitations under the License.                                                            │
│   14                                                                                             │
│ ❱ 15 from nemo.collections.asr import data, losses, models, modules                              │
│   16 from nemo.package_info import __version__                                                   │
│   17                                                                                             │
│   18 # Set collection version equal to NeMo version.                                             │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/losses/__init__.py:16 in   │
│ <module>                                                                                         │
│                                                                                                  │
│   13 # limitations under the License.                                                            │
│   14                                                                                             │
│   15 from nemo.collections.asr.losses.angularloss import AngularSoftmaxLoss                      │
│ ❱ 16 from nemo.collections.asr.losses.audio_losses import SDRLoss                                │
│   17 from nemo.collections.asr.losses.ctc import CTCLoss                                         │
│   18 from nemo.collections.asr.losses.lattice_losses import LatticeLoss                          │
│   19 from nemo.collections.asr.losses.ssl_losses.contrastive import ContrastiveLoss              │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/losses/audio_losses.py:21  │
│ in <module>                                                                                      │
│                                                                                                  │
│    18 import numpy as np                                                                         │
│    19 import torch                                                                               │
│    20                                                                                            │
│ ❱  21 from nemo.collections.asr.parts.preprocessing.features import make_seq_mask_like           │
│    22 from nemo.collections.asr.parts.utils.audio_utils import toeplitz                          │
│    23 from nemo.core.classes import Loss, Typing, typecheck                                      │
│    24 from nemo.core.neural_types import AudioSignal, LengthsType, LossType, MaskType, NeuralT   │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/parts/preprocessing/__init │
│ __.py:16 in <module>                                                                             │
│                                                                                                  │
│   13 # limitations under the License.                                                            │
│   14                                                                                             │
│   15 from nemo.collections.asr.parts.preprocessing.feature_loader import ExternalFeatureLoade    │
│ ❱ 16 from nemo.collections.asr.parts.preprocessing.features import FeaturizerFactory, Filterb    │
│   17 from nemo.collections.asr.parts.preprocessing.perturb import (                              │
│   18 │   AudioAugmentor,                                                                         │
│   19 │   AugmentationDataset,                                                                    │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/parts/preprocessing/featur │
│ es.py:44 in <module>                                                                             │
│                                                                                                  │
│    41 import torch                                                                               │
│    42 import torch.nn as nn                                                                      │
│    43                                                                                            │
│ ❱  44 from nemo.collections.asr.parts.preprocessing.perturb import AudioAugmentor                │
│    45 from nemo.collections.asr.parts.preprocessing.segment import AudioSegment                  │
│    46 from nemo.utils import logging                                                             │
│    47                                                                                            │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/parts/preprocessing/pertur │
│ b.py:50 in <module>                                                                              │
│                                                                                                  │
│     47 from torch.utils.data import IterableDataset                                              │
│     48                                                                                           │
│     49 from nemo.collections.asr.parts.preprocessing.segment import AudioSegment                 │
│ ❱   50 from nemo.collections.common.parts.preprocessing import collections, parsers              │
│     51 from nemo.utils import logging                                                            │
│     52                                                                                           │
│     53 # TODO @blisc: Perhaps refactor instead of import guarding                                │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/common/__init__.py:16 in       │
│ <module>                                                                                         │
│                                                                                                  │
│   13 # limitations under the License.                                                            │
│   14                                                                                             │
│   15 import nemo.collections.common.callbacks                                                    │
│ ❱ 16 from nemo.collections.common import data, losses, parts, tokenizers                         │
│   17 from nemo.package_info import __version__                                                   │
│   18                                                                                             │
│   19 # Set collection version equal to NeMo version.                                             │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/common/parts/__init__.py:15 in │
│ <module>                                                                                         │
│                                                                                                  │
│   12 # See the License for the specific language governing permissions and                       │
│   13 # limitations under the License.                                                            │
│   14                                                                                             │
│ ❱ 15 from nemo.collections.common.parts.adapter_modules import LinearAdapter, LinearAdapterCo    │
│   16 from nemo.collections.common.parts.mlm_scorer import MLMScorer                              │
│   17 from nemo.collections.common.parts.multi_layer_perceptron import MultiLayerPerceptron       │
│   18 from nemo.collections.common.parts.transformer_utils import *                               │
│                                                                                                  │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/common/parts/adapter_modules.p │
│ y:147 in <module>                                                                                │
│                                                                                                  │
│   144 │   │   return x                                                                           │
│   145                                                                                            │
│   146                                                                                            │
│ ❱ 147 @dataclass                                                                                 │
│   148 class LinearAdapterConfig:                                                                 │
│   149 │   in_features: int                                                                       │
│   150 │   dim: int                                                                               │
│                                                                                                  │
│ /usr/lib/python3.11/dataclasses.py:1223 in dataclass                                             │
│                                                                                                  │
│   1220 │   │   return wrap                                                                       │
│   1221 │                                                                                         │
│   1222 │   # We're called as @dataclass without parens.                                          │
│ ❱ 1223 │   return wrap(cls)                                                                      │
│   1224                                                                                           │
│   1225                                                                                           │
│   1226 def fields(class_or_instance):                                                            │
│                                                                                                  │
│ /usr/lib/python3.11/dataclasses.py:1213 in wrap                                                  │
│                                                                                                  │
│   1210 │   """                                                                                   │
│   1211 │                                                                                         │
│   1212 │   def wrap(cls):                                                                        │
│ ❱ 1213 │   │   return _process_class(cls, init, repr, eq, order, unsafe_hash,                    │
│   1214 │   │   │   │   │   │   │     frozen, match_args, kw_only, slots,                         │
│   1215 │   │   │   │   │   │   │     weakref_slot)                                               │
│   1216                                                                                           │
│                                                                                                  │
│ /usr/lib/python3.11/dataclasses.py:958 in _process_class                                         │
│                                                                                                  │
│    955 │   │   │   kw_only = True                                                                │
│    956 │   │   else:                                                                             │
│    957 │   │   │   # Otherwise it's a field of some type.                                        │
│ ❱  958 │   │   │   cls_fields.append(_get_field(cls, name, type, kw_only))                       │
│    959 │                                                                                         │
│    960 │   for f in cls_fields:                                                                  │
│    961 │   │   fields[f.name] = f                                                                │
│                                                                                                  │
│ /usr/lib/python3.11/dataclasses.py:815 in _get_field                                             │
│                                                                                                  │
│    812 │   # indicator for mutability.  Read the __hash__ attribute from the class,              │
│    813 │   # not the instance.                                                                   │
│    814 │   if f._field_type is _FIELD and f.default.__class__.__hash__ is None:                  │
│ ❱  815 │   │   raise ValueError(f'mutable default {type(f.default)} for field '                  │
│    816 │   │   │   │   │   │    f'{f.name} is not allowed: use default_factory')                 │
│    817 │                                                                                         │
│    818 │   return f                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: mutable default <class 'nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategyConfig'> for field adapter_strategy is not allowed: use default_factory
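
For readers hitting the same ValueError: the traceback shows Python 3.11's dataclasses module rejecting an unhashable object used directly as a field default. Below is a minimal sketch of the `default_factory` pattern the message asks for. `StrategyConfig` and `AdapterConfig` are hypothetical stand-ins, not NeMo's actual classes, and this is not presented as the fix NeMo itself applied.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyConfig:
    # Hypothetical stand-in for ResidualAddAdapterStrategyConfig.
    stochastic_depth: float = 0.0


@dataclass
class AdapterConfig:
    # Hypothetical stand-in for LinearAdapterConfig.
    in_features: int
    dim: int
    # `adapter_strategy: StrategyConfig = StrategyConfig()` would trigger the
    # "mutable default ... use default_factory" error on Python 3.11, because
    # a dataclass instance with eq=True is unhashable. default_factory builds
    # a fresh object for every AdapterConfig instead.
    adapter_strategy: StrategyConfig = field(default_factory=StrategyConfig)


a = AdapterConfig(in_features=8, dim=4)
b = AdapterConfig(in_features=8, dim=4)
```

Each instance gets its own `adapter_strategy` object, so changing one config does not silently change another.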

Vocal separation

Hello guys, can you recommend a good tool or neural network just to separate the vocals from the rest of the noise?

issues with ffmpeg when source audio files are in a directory

diarize.py -a ./source/audio.wav

Failed to load audio: ffmpeg
temp_outputs/htdemucs_ft/source/audio/vocals.wav: No such file or directory

It's happening because the actual vocals.wav file is at temp_outputs/htdemucs_ft/audio/vocals.wav

I think the conflict arises because the source audio file was at ./source/audio.wav, so the source/ directory ended up in the path the script looks for.

I did get it working by going into the source directory and calling ../diarize.py -a audio.wav

Is this fixable? If not, a note in the README would help clear up some confusion.
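
One possible way to avoid the mismatch, sketched under the assumption (based only on the two paths reported above) that the separated vocals land in a folder named after the file stem, without the source directory. The helper name `expected_vocals_path` and its parameters are hypothetical and not part of diarize.py.

```python
from pathlib import Path


def expected_vocals_path(audio_path: str,
                         out_dir: str = "temp_outputs",
                         model: str = "htdemucs_ft") -> Path:
    """Build the vocals.wav path from the file stem only.

    Assumption: the output folder is named after the audio file's stem,
    so any directory part of `audio_path` must be dropped.
    """
    stem = Path(audio_path).stem  # "./source/audio.wav" -> "audio"
    return Path(out_dir) / model / stem / "vocals.wav"


print(expected_vocals_path("./source/audio.wav").as_posix())
# prints: temp_outputs/htdemucs_ft/audio/vocals.wav
```

With this, calling the script from the repository root or from inside the source directory would resolve to the same vocals.wav location.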

ASR succeeded, but there is no diarization in the output

I followed the steps in the README and got a test.txt after running the command
python .\diarize.py -a .\test.wav

but there is no diarization in the .txt file.

It looks like this:

Speaker 0: Personally, that's too many questions for me to answer. How you doing? My name's Josh. Hey, I'm Tone. How you doing? I didn't catch your name again? Huh? Tone. Tone? Tony. Tony, Tony, how you doing? ......

Everything is attributed to a single speaker.

WhisperX takedown

WhisperX was taken down by GitHub. How can we still get access to it?

Limit subtitle length and improve Whisper accuracy

Awesome project here, and the diarization works really well. It might be a me problem, but the subtitles aren't as accurate as Whisper normally is (using --whisper-model large). It's at its worst with names and places, e.g. Fin, bin, Ken, ben, rave, Dave, while Whisper gets them 100% correct. I find that weird because your repo does use Whisper. Also, when a speaker talks for a long time without interruption, the subtitle length seems to be unlimited.
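
As a rough illustration of the subtitle-length request, here is a sketch that splits one long speaker turn into chunks of at most a given number of characters using the standard-library `textwrap` module. The function name `split_subtitle` and the 42-character limit are assumptions for illustration; this does not reflect how the repo builds its subtitles, and it does not re-time the chunks.

```python
import textwrap


def split_subtitle(text: str, max_chars: int = 42) -> list[str]:
    """Split a long subtitle into chunks no longer than max_chars.

    Breaks happen at whitespace; timestamps would still need to be
    redistributed across the chunks separately.
    """
    return textwrap.wrap(text, width=max_chars)


turn = ("Personally, that's too many questions for me to answer. "
        "How you doing? My name's Josh.")
for chunk in split_subtitle(turn):
    print(chunk)
```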

No such file or directory error

Hi there,

Thanks a lot for your efforts.

I am really close to getting this working but I get the below error.

Standalone Whisper is working for me.

Is there any chance you could please have a look? Thanks.

C:\Users\Main>[NeMo W 2023-03-26 23:48:50 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-03-26 23:48:50 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Source splitting failed, using original audio file. Use --no-stem argument to disable it.
Traceback (most recent call last):
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\audio.py", line 46, in load_audio
    ffmpeg.input(file, threads=0)
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\ffmpeg\_run.py", line 325, in run
    raise Error('ffmpeg', out, err)
ffmpeg._run.Error: ffmpeg error (see stderr output for detail)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Main\diarize.py", line 94, in <module>
    whisper_results = whisper_model.transcribe(vocal_target, beam_size=None, verbose=False)
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\transcribe.py", line 121, in transcribe
    mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\audio.py", line 130, in log_mel_spectrogram
    audio = load_audio(audio)
  File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\audio.py", line 51, in load_audio
    raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
RuntimeError: Failed to load audio: ffmpeg version 2023-03-23-git-30cea1d39b-full_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.2.0 (Rev10, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-sdl2 --enable-libaribb24 --enable-libdav1d --enable-libdavs2 --enable-libuavs3d --enable-libzvbi --enable-librav1e --enable-libsvtav1 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libaom --enable-libjxl --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libvpl --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libilbc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
  libavutil      58.  5.100 / 58.  5.100
  libavcodec     60.  6.101 / 60.  6.101
  libavformat    60.  4.100 / 60.  4.100
  libavdevice    60.  2.100 / 60.  2.100
  libavfilter     9.  4.100 /  9.  4.100
  libswscale      7.  2.100 /  7.  2.100
  libswresample   4. 11.100 /  4. 11.100
  libpostproc    57.  2.100 / 57.  2.100
File3.mp3: No such file or directory
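
The last line of the log shows ffmpeg failing to find File3.mp3, which suggests the file was not in the directory the script was run from. A small sketch of checking the path up front, so the failure is reported before any model is loaded; `resolve_audio_path` is a hypothetical helper, not something diarize.py provides.

```python
from pathlib import Path


def resolve_audio_path(audio: str) -> Path:
    """Return an absolute path to the audio file, or raise if it is missing."""
    path = Path(audio).expanduser().resolve()
    if not path.is_file():
        # Include the working directory, since relative paths depend on it.
        raise FileNotFoundError(
            f"Audio file not found: {path} (working directory: {Path.cwd()})"
        )
    return path
```

Passing the resolved absolute path on to later steps would also make the run independent of the current working directory.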
