mahmoudashraf97 / whisper-diarization
Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
License: BSD 2-Clause "Simplified" License
Has anyone compared the performance of these two models?
Hi all,
First, thanks to @MahmoudAshraf97 for this valuable diarization tool. Much appreciated.
I encountered an error in some audio files with "linalg.eigh" failing to converge:
_LinAlgError: linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 842).
Has anyone faced this issue and found a solution?
Any suggestions would be appreciated.
Thanks!
Thanks for the Colab.
I ran into the dreaded out-of-memory error while processing the embedding files saved to nemo_outputs/speaker_outputs/embeddings.
Just trying to understand: My audio file is only 1h18 minutes long, but its embeddings amount to: "Dataset loaded with 9209 items, total duration of 2.55 hours." Since segmentations and subsegmentations are done automatically, I wonder what the upper limit filesize for the script to work on Colab would be.
Full error:
RuntimeError: CUDA out of memory. Tried to allocate 7.61 GiB (GPU 0; 14.75 GiB total capacity; 7.81 GiB already allocated; 5.65 GiB free; 7.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
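The error text itself points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb. Below is a minimal sketch of setting it from Python. The value 128 and the helper name build_alloc_conf are illustrative assumptions. Since the failed allocation (7.61 GiB) is larger than the free memory reported (5.65 GiB), this setting on its own may well not be enough.

```python
import os


def build_alloc_conf(max_split_size_mb: int) -> str:
    """Format the allocator option named in the error message."""
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    return f"max_split_size_mb:{max_split_size_mb}"


# Must run before torch makes its first CUDA allocation
# (safest: before importing torch at all).
# 128 is an illustrative value, not a recommendation from the repository.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf(128)
```

In a notebook this would go in the very first cell, followed by a runtime restart so the setting is picked up.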
This error, related to the nemo_toolkit dependencies, is causing pip install -r requirements.txt to fail.
I have a similar setup (python 3.10.6, latest pip/setuptools, macOS 13.1 M1)
I tried the code on a 30-minute, two-speaker conversation and it detected only one speaker. It also put all the text under a single timestamp, e.g. 00:00:01,020 --> 00:29:13,065 ... Is this normal?
Can we somehow increase the diarization sensitivity? I looked into the NVIDIA documentation but couldn't find a way to do it.
I also tried whisperx; it worked much better but still misclassified some parts of the text.
I have made some (modest) progress on this, if anyone wishes to have a look:
https://github.com/mirix/approaches-to-diarisation/tree/main
Hello, first of all, thanks for the project.
I installed the requirements and ran the following command as you said:
python diarize.py -a AUDIO_FILE_NAME
I am getting the following error:
Can you help me? Thanks.
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:10 │
│ 02 in _get_module │
│ │
│ 999 │ │
│ 1000 │ def _get_module(self, module_name: str): │
│ 1001 │ │ try: │
│ ❱ 1002 │ │ │ return importlib.import_module("." + module_name, self.__ │
│ 1003 │ │ except Exception as e: │
│ 1004 │ │ │ raise RuntimeError( │
│ 1005 │ │ │ │ f"Failed to import {self.__name__}.{module_name} beca │
│ │
│ /usr/lib/python3.8/importlib/__init__.py:127 in import_module │
│ │
│ 124 │ │ │ if character != '.': │
│ 125 │ │ │ │ break │
│ 126 │ │ │ level += 1 │
│ ❱ 127 │ return _bootstrap._gcd_import(name[level:], package, level) │
│ 128 │
│ 129 │
│ 130 _RELOADING = {} │
│ <frozen importlib._bootstrap>:1014 in _gcd_import │
│ <frozen importlib._bootstrap>:991 in _find_and_load │
│ <frozen importlib._bootstrap>:975 in _find_and_load_unlocked │
│ <frozen importlib._bootstrap>:671 in _load_unlocked │
│ <frozen importlib._bootstrap_external>:848 in exec_module │
│ <frozen importlib._bootstrap>:219 in _call_with_frames_removed │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/xlm_roberta/model │
│ ing_tf_xlm_roberta.py:19 in <module> │
│ │
│ 16 """ TF 2.0 XLM-RoBERTa model.""" │
│ 17 │
│ 18 from ...utils import add_start_docstrings, logging │
│ ❱ 19 from ..roberta.modeling_tf_roberta import ( │
│ 20 │ TFRobertaForCausalLM, │
│ 21 │ TFRobertaForMaskedLM, │
│ 22 │ TFRobertaForMultipleChoice, │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/roberta/modeling_ │
│ tf_roberta.py:36 in <module> │
│ │
│ 33 │ TFSequenceClassifierOutput, │
│ 34 │ TFTokenClassifierOutput, │
│ 35 ) │
│ ❱ 36 from ...modeling_tf_utils import ( │
│ 37 │ TFCausalLanguageModelingLoss, │
│ 38 │ TFMaskedLanguageModelingLoss, │
│ 39 │ TFModelInputType, │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/modeling_tf_utils.py:38 │
│ in <module> │
│ │
│ 35 from tensorflow.python.keras.saving import hdf5_format │
│ 36 │
│ 37 from huggingface_hub import Repository, list_repo_files │
│ ❱ 38 from keras.saving.hdf5_format import save_attributes_to_hdf5_group │
│ 39 from requests import HTTPError │
│ 40 from transformers.utils.hub import convert_file_size_to_int, get_chec │
│ 41 │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'keras.saving.hdf5_format'
The above exception was the direct cause of the following exception:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/whisper-diarization/diarize.py:145 in <module> │
│ │
│ 142 │
│ 143 if whisper_results["language"] in punct_model_langs: │
│ 144 │ # restoring punctuation in the transcript to help realign the sent │
│ ❱ 145 │ punct_model = PunctuationModel(model="kredor/punctuate-all") │
│ 146 │ │
│ 147 │ words_list = list(map(lambda x: x["word"], wsm)) │
│ 148 │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepmultilingualpunctuation/punctuati │
│ onmodel.py:9 in __init__ │
│ │
│ 6 class PunctuationModel(): │
│ 7 │ def __init__(self, model = "oliverguhr/fullstop-punctuation-multila │
│ 8 │ │ if torch.cuda.is_available(): │
│ ❱ 9 │ │ │ self.pipe = pipeline("ner",model, grouped_entities=False, d │
│ 10 │ │ else: │
│ 11 │ │ │ self.pipe = pipeline("ner",model, grouped_entities=False) │
│ 12 │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/pipelines/__init__.py:65 │
│ 0 in pipeline │
│ │
│ 647 │ # Forced if framework already defined, inferred if it's None │
│ 648 │ # Will load the correct model if possible │
│ 649 │ model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["p │
│ ❱ 650 │ framework, model = infer_framework_load_model( │
│ 651 │ │ model, │
│ 652 │ │ model_classes=model_classes, │
│ 653 │ │ config=config, │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py:233 in │
│ infer_framework_load_model │
│ │
│ 230 │ │ │ │ │ if _class is not None: │
│ 231 │ │ │ │ │ │ classes.append(_class) │
│ 232 │ │ │ │ if look_tf: │
│ ❱ 233 │ │ │ │ │ _class = getattr(transformers_module, f"TF{archit │
│ 234 │ │ │ │ │ if _class is not None: │
│ 235 │ │ │ │ │ │ classes.append(_class) │
│ 236 │ │ │ class_tuple = class_tuple + tuple(classes) │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:99 │
│ 3 in __getattr__ │
│ │
│ 990 │ │ │ value = self._get_module(name) │
│ 991 │ │ elif name in self._class_to_module.keys(): │
│ 992 │ │ │ module = self._get_module(self._class_to_module[name]) │
│ ❱ 993 │ │ │ value = getattr(module, name) │
│ 994 │ │ else: │
│ 995 │ │ │ raise AttributeError(f"module {self.__name__} has no attr │
│ 996 │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:99 │
│ 2 in __getattr__ │
│ │
│ 989 │ │ if name in self._modules: │
│ 990 │ │ │ value = self._get_module(name) │
│ 991 │ │ elif name in self._class_to_module.keys(): │
│ ❱ 992 │ │ │ module = self._get_module(self._class_to_module[name]) │
│ 993 │ │ │ value = getattr(module, name) │
│ 994 │ │ else: │
│ 995 │ │ │ raise AttributeError(f"module {self.__name__} has no attr │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:10 │
│ 04 in _get_module │
│ │
│ 1001 │ │ try: │
│ 1002 │ │ │ return importlib.import_module("." + module_name, self.__ │
│ 1003 │ │ except Exception as e: │
│ ❱ 1004 │ │ │ raise RuntimeError( │
│ 1005 │ │ │ │ f"Failed to import {self.__name__}.{module_name} beca │
│ 1006 │ │ │ │ f" traceback):\n{e}" │
│ 1007 │ │ │ ) from e │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to import
transformers.models.xlm_roberta.modeling_tf_xlm_roberta because of the following
error (look up to see its traceback):
No module named 'keras.saving.hdf5_format'
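The traceback shows transformers probing its TensorFlow classes (the look_tf branch) and failing inside modeling_tf_utils.py on keras.saving.hdf5_format. One possible workaround, sketched under the assumption that only the PyTorch backend is needed here, is to tell transformers not to use TensorFlow through the USE_TF environment variable before it is imported. module_available is a hypothetical helper used only for diagnosis.

```python
import importlib.util
import os


def module_available(name: str) -> bool:
    """Return True if `name` can be located.

    For dotted names the parent package is imported as part of the lookup.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Assumption: the punctuation pipeline only needs PyTorch, so the
# TensorFlow code path can be skipped entirely.
# This must be set before `import transformers` runs.
os.environ["USE_TF"] = "0"
```

Calling module_available("keras.saving.hdf5_format") in the failing environment would confirm whether the installed keras still ships that module.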
Hi,
I was wondering if there's an easy way to utilise this: https://github.com/guillaumekln/faster-whisper
I am using the Notebook file.
Thanks
I ran diarize.py and everything seems to work fine. I can see all the outputs in temp_outputs, but I am unable to locate the SRT file.
Just running diarize.py should create the SRT file, yes? Or do I need to run Whisper separately first?
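A quick way to check where the subtitle file ended up is to search for .srt files. This is a minimal sketch. It assumes (not confirmed in this thread) that the script writes the SRT somewhere under the directory being searched, for example next to the input audio rather than inside temp_outputs. find_srt_files is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_srt_files(root: str) -> List[Path]:
    """Recursively list .srt files under `root`, most recently modified first."""
    files = [p for p in Path(root).rglob("*.srt") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in find_srt_files("."):
        print(path)
```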
!git clone https://github.com/MahmoudAshraf97/whisper-diarization
cd /content/whisper-diarization/
!pip install -r ./requirements.txt
!python diarize.py -a cs.m4a
I had a problem running on the Colab platform and didn't know how to fix it.
omegaconf is imported in helpers.py
A recent update to the NeMo version is causing a dependency error with the transformers version.
git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[asr]
----> previous command
changed to
nemo_toolkit[asr]==1.15.0
----> new command
The second one is the updated command, and it is the one that gives the error.
Guys,
Do you have an article comparing whisper-diarization to other diarization methods?
Thanks a lot,
AlexG.
Tried to run "Whisper_Transcription_+_NeMo_Diarization.ipynb"
First installation step produced an error.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.20.1 which is incompatible.
tensorflow-metadata 1.13.1 requires protobuf<5,>=3.20.3, but you have protobuf 3.20.1 which is incompatible.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.20.1 which is incompatible.
googleapis-common-protos 1.59.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-translate 3.11.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-language 2.9.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-firestore 2.11.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-datastore 2.15.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-bigquery 3.9.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-cloud-bigquery-storage 2.19.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
google-api-core 2.11.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.20.1 which is incompatible.
Successfully installed protobuf-3.20.1 pytorch-lightning-1.6.5
OSError Traceback (most recent call last)
[<ipython-input-6-99c1ac25fe61>](https://localhost:8080/#) in <cell line: 11>()
9 import librosa
10 import soundfile
---> 11 from nemo.collections.asr.models.msdd_models import NeuralDiarizer
12 from deepmultilingualpunctuation import PunctuationModel
13 import re
I'm trying to install this in a miniconda3 environment. It runs, but it keeps saying "ModuleNotFoundError: No module named 'pytorch_lightning.core.module'".
I even tried creating a completely new environment, to no avail.
I installed from requirements.txt. Does this not work in a virtual environment?
I also tried installing lightning manually, again to no avail.
ValueError Traceback (most recent call last)
in <cell line: 10>()
8
9 device = "cuda"
---> 10 alignment_model, metadata = whisperx.load_align_model(
11 language_code=whisper_results["language"], device=device
12 )
/usr/local/lib/python3.10/dist-packages/whisperx/alignment.py in load_align_model(language_code, device, model_name)
51 print(f"There is no default alignment model set for this language ({language_code}).
52 Please find a wav2vec2.0 model finetuned on this language in https://huggingface.co/models, then pass the model name in --align_model [MODEL_NAME]")
---> 53 raise ValueError(f"No default align-model for language: {language_code}")
54
55 if model_name in torchaudio.pipelines.all:
ValueError: No default align-model for language: tl
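The ValueError is raised when whisperx has no default alignment model for the detected language (here tl). A minimal sketch of handling that case: wrap the loader so the caller can fall back, for instance to Whisper's own unaligned timestamps or to an explicit model name as the printed message suggests. The loader is passed in as a callable so the sketch does not depend on whisperx being installed. load_align_model_or_none is a hypothetical name.

```python
from typing import Any, Callable, Optional, Tuple


def load_align_model_or_none(
    loader: Callable[..., Tuple[Any, Any]],
    language_code: str,
    device: str,
) -> Optional[Tuple[Any, Any]]:
    """Call `loader`; return None instead of raising when no model exists."""
    try:
        return loader(language_code=language_code, device=device)
    except ValueError as err:
        print(f"Skipping forced alignment: {err}")
        return None
```

In the notebook this would be called as load_align_model_or_none(whisperx.load_align_model, whisper_results["language"], device), with the alignment step skipped when the result is None.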
Hi there,
I'm getting an error that the "faster_whisper" module doesn't exist. Would love any guidance.
Thanks!
While trying to diarize using the MSDD model, I get the below error:
PicklingError: Can't pickle <class 'nemo.collections.common.parts.preprocessing.collections.SpeechLabelEntity'>: attribute lookup SpeechLabelEntity on nemo.collections.common.parts.preprocessing.collections failed
Kindly acknowledge if anyone had encountered this and solved it.
Hi there,
Thank you for your valuable work; it is very much appreciated. However, I encountered an error when trying to run the code on Google Colab.
at line: from nemo.collections.asr.models.msdd_models import NeuralDiarizer
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
[<ipython-input-4-a7e6fec2e8e2>](https://localhost:8080/#) in <cell line: 11>()
9 import librosa
10 import soundfile
---> 11 from nemo.collections.asr.models.msdd_models import NeuralDiarizer
12 from deepmultilingualpunctuation import PunctuationModel
13 import re
18 frames
[/usr/local/lib/python3.9/dist-packages/nemo/collections/asr/__init__.py](https://localhost:8080/#) in <module>
13 # limitations under the License.
14
---> 15 from nemo.collections.asr import data, losses, models, modules
16 from nemo.package_info import __version__
17
[/usr/local/lib/python3.9/dist-packages/nemo/collections/asr/losses/__init__.py](https://localhost:8080/#) in <module>
13 # limitations under the License.
14
---> 15 from nemo.collections.asr.losses.angularloss import AngularSoftmaxLoss
16 from nemo.collections.asr.losses.audio_losses import SDRLoss
17 from nemo.collections.asr.losses.ctc import CTCLoss
[/usr/local/lib/python3.9/dist-packages/nemo/collections/asr/losses/angularloss.py](https://localhost:8080/#) in <module>
16 import torch
17
---> 18 from nemo.core.classes import Loss, Typing, typecheck
19 from nemo.core.neural_types import LabelsType, LogitsType, LossType, NeuralType
20
[/usr/local/lib/python3.9/dist-packages/nemo/core/__init__.py](https://localhost:8080/#) in <module>
14
15 import nemo.core.neural_types
---> 16 from nemo.core.classes import *
[/usr/local/lib/python3.9/dist-packages/nemo/core/classes/__init__.py](https://localhost:8080/#) in <module>
16 import hydra
17 import omegaconf
---> 18 import pytorch_lightning
19
20 from nemo.core.classes.common import (
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/__init__.py](https://localhost:8080/#) in <module>
28 _logger.propagate = False
29
---> 30 from pytorch_lightning.callbacks import Callback # noqa: E402
31 from pytorch_lightning.core import LightningDataModule, LightningModule # noqa: E402
32 from pytorch_lightning.trainer import Trainer # noqa: E402
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/callbacks/__init__.py](https://localhost:8080/#) in <module>
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
---> 14 from pytorch_lightning.callbacks.base import Callback
15 from pytorch_lightning.callbacks.device_stats_monitor import DeviceStatsMonitor
16 from pytorch_lightning.callbacks.early_stopping import EarlyStopping
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/callbacks/base.py](https://localhost:8080/#) in <module>
23
24 import pytorch_lightning as pl
---> 25 from pytorch_lightning.utilities.types import STEP_OUTPUT
26
27
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/__init__.py](https://localhost:8080/#) in <module>
16 import numpy
17
---> 18 from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
19 from pytorch_lightning.utilities.distributed import AllGatherGrad # noqa: F401
20 from pytorch_lightning.utilities.enums import ( # noqa: F401
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/apply_func.py](https://localhost:8080/#) in <module>
27
28 from pytorch_lightning.utilities.exceptions import MisconfigurationException
---> 29 from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_LEGACY
30 from pytorch_lightning.utilities.warnings import rank_zero_deprecation
31
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/imports.py](https://localhost:8080/#) in <module>
116 _TORCH_QUANTIZE_AVAILABLE = bool([eg for eg in torch.backends.quantized.supported_engines if eg != "none"])
117 _TORCHTEXT_AVAILABLE = _package_available("torchtext")
--> 118 _TORCHTEXT_LEGACY: bool = _TORCHTEXT_AVAILABLE and _compare_version("torchtext", operator.lt, "0.11.0")
119 _TORCHVISION_AVAILABLE = _package_available("torchvision")
120 _WANDB_AVAILABLE = _package_available("wandb")
[/usr/local/lib/python3.9/dist-packages/pytorch_lightning/utilities/imports.py](https://localhost:8080/#) in _compare_version(package, op, version, use_base_version)
69 """
70 try:
---> 71 pkg = importlib.import_module(package)
72 except (ImportError, DistributionNotFound):
73 return False
[/usr/lib/python3.9/importlib/__init__.py](https://localhost:8080/#) in import_module(name, package)
125 break
126 level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)
128
129
[/usr/local/lib/python3.9/dist-packages/torchtext/__init__.py](https://localhost:8080/#) in <module>
4
5 # the following import has to happen first in order to load the torchtext C++ library
----> 6 from torchtext import _extension # noqa: F401
7
8 _TEXT_BUCKET = "https://download.pytorch.org/models/text/"
[/usr/local/lib/python3.9/dist-packages/torchtext/_extension.py](https://localhost:8080/#) in <module>
62
63
---> 64 _init_extension()
[/usr/local/lib/python3.9/dist-packages/torchtext/_extension.py](https://localhost:8080/#) in _init_extension()
56 raise ImportError("torchtext C++ Extension is not found.")
57
---> 58 _load_lib("libtorchtext")
59 # This import is for initializing the methods registered via PyBind11
60 # This has to happen after the base library is loaded
[/usr/local/lib/python3.9/dist-packages/torchtext/_extension.py](https://localhost:8080/#) in _load_lib(lib)
48 if not path.exists():
49 return False
---> 50 torch.ops.load_library(path)
51 return True
52
[/usr/local/lib/python3.9/dist-packages/torch/_ops.py](https://localhost:8080/#) in load_library(self, path)
571 ) from e
572
--> 573 # let the script frontend know that op is identical to the builtin op
574 # with qualified_op_name
575 torch.jit._builtins._register_builtin(op, qualified_op_name)
[/usr/lib/python3.9/ctypes/__init__.py](https://localhost:8080/#) in __init__(self, name, mode, handle, use_errno, use_last_error, winmode)
372
373 if handle is None:
--> 374 self._handle = _dlopen(self._name, mode)
375 else:
376 self._handle = handle
OSError: /usr/local/lib/python3.9/dist-packages/torchtext/lib/libtorchtext.so: undefined symbol: _ZN2at4_ops10select_int4callERKNS_6TensorElN3c106SymIntE
It looks like this is caused by torchtext, so I tried uninstalling and reinstalling it, but nothing changed.
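An undefined symbol in libtorchtext.so typically means the installed torchtext binary was built against a different torch version than the one present. A small sketch for listing what is actually installed, using only the standard library (importlib.metadata needs Python 3.8+). The package list is an assumption based on the traceback.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    names = ["torch", "torchtext", "pytorch-lightning", "nemo_toolkit"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

If torch and torchtext turn out to come from mismatched releases, reinstalling the pair together (rather than torchtext alone) is the usual next step.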
Hello Mahmoud. First off, thank you for this awesome contribution to the community. I've been trying to get reliable diarization with whisper for months, so I'm excited to try your implementation.
However, I'm running into the following error when running your diarize.py script on a random audio file from youtube:
(MA97_whisper_diarization) PS C:\-\-\-\-\repos\whisper-diarization> python diarize.py -a ..\..\audio\lotr_trailer.wav
[NeMo W 2023-02-07 16:30:09 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-02-07 16:30:10 nemo_logging:349] C:\-\-\Miniconda3\envs\MA97_whisper_diarization\lib\site-packages\torch\jit\annotations.py:309: UserWarning: TorchScript will treat type annotations of Tensor dtype-specific subtypes as if they are normal Tensors. dtype constraints are not enforced in compilation either.
warnings.warn("TorchScript will treat type annotations of Tensor "
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Source splitting failed, using original audio file. Use --no-stem argument to disable it.
100%|██████████████████████████████████████████████████████████████████████████████| 17569/17569 [00:50<00:00, 346.50frames/s]
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to C:\-\-/.cache\torch\hub\checkpoints\wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|███████████████████████████████████████████████████████████████████████| 360M/360M [00:05<00:00, 69.7MB/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\-\-\-\-\repos\whisper-diarization\diarize.py:115 in │
│ <module> │
│ │
│ 112 │
│ 113 # convert audio to mono for NeMo combatibility │
│ 114 signal, sample_rate = librosa.load(vocal_target, sr=None) │
│ ❱ 115 os.chdir("temp_outputs") │
│ 116 soundfile.write("mono_file.wav", signal, sample_rate, "PCM_24") │
│ 117 │
│ 118 # Initialize NeMo MSDD diarization model │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'temp_outputs'
Any recommendations for a fix?
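The log shows source splitting failing ("Python was not found"), so temp_outputs was presumably never created before os.chdir was reached. A minimal sketch of a defensive fix, assuming the directory only needs to exist: create it first. ensure_dir is a hypothetical helper, not part of diarize.py.

```python
import os


def ensure_dir(path: str) -> str:
    """Create `path` if it is missing and return its absolute form."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

With this, os.chdir(ensure_dir("temp_outputs")) would replace the bare os.chdir("temp_outputs") call. Passing --no-stem, as the log message suggests, skips the splitting step but would still need the directory to exist.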
Traceback (most recent call last):
File "C:\Users\gamen\Downloads\whisper-diarization-main\diarize.py", line 9, in <module>
from nemo.collections.asr.models.msdd_models import NeuralDiarizer
File "C:\Python311\Lib\site-packages\nemo\collections\asr\__init__.py", line 15, in <module>
from nemo.collections.asr import data, losses, models, modules
File "C:\Python311\Lib\site-packages\nemo\collections\asr\losses\__init__.py", line 15, in <module>
from nemo.collections.asr.losses.angularloss import AngularSoftmaxLoss
File "C:\Python311\Lib\site-packages\nemo\collections\asr\losses\angularloss.py", line 18, in <module>
from nemo.core.classes import Loss, Typing, typecheck
File "C:\Python311\Lib\site-packages\nemo\core\__init__.py", line 16, in <module>
from nemo.core.classes import *
File "C:\Python311\Lib\site-packages\nemo\core\classes\__init__.py", line 16, in <module>
import hydra
File "C:\Python311\Lib\site-packages\hydra\__init__.py", line 5, in <module>
from hydra import utils
File "C:\Python311\Lib\site-packages\hydra\utils.py", line 8, in <module>
import hydra._internal.instantiate._instantiate2
File "C:\Python311\Lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 12, in <module>
from hydra._internal.utils import _locate
File "C:\Python311\Lib\site-packages\hydra\_internal\utils.py", line 18, in <module>
from hydra.core.utils import get_valid_filename, validate_config_path
File "C:\Python311\Lib\site-packages\hydra\core\utils.py", line 20, in <module>
from hydra.core.hydra_config import HydraConfig
File "C:\Python311\Lib\site-packages\hydra\core\hydra_config.py", line 6, in <module>
from hydra.conf import HydraConf
File "C:\Python311\Lib\site-packages\hydra\conf\__init__.py", line 46, in <module>
class JobConf:
File "C:\Python311\Lib\site-packages\hydra\conf\__init__.py", line 75, in JobConf
@dataclass
^^^^^^^^^
File "C:\Python311\Lib\dataclasses.py", line 1221, in dataclass
return wrap(cls)
^^^^^^^^^
File "C:\Python311\Lib\dataclasses.py", line 1211, in wrap
return _process_class(cls, init, repr, eq, order, unsafe_hash,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\dataclasses.py", line 959, in _process_class
cls_fields.append(_get_field(cls, name, type, kw_only))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\dataclasses.py", line 816, in _get_field
raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'hydra.conf.JobConf.JobConfig.OverrideDirname'> for field override_dirname is not allowed: use default_factory
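This traceback comes from Python 3.11, where dataclasses reject any unhashable object (such as another dataclass instance) as a field default. The sketch below reproduces the pattern with simplified stand-in classes and shows the default_factory form the message asks for. The class bodies are illustrative assumptions, not hydra's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class OverrideDirname:
    # Simplified stand-in for the hydra class named in the traceback.
    kv_sep: str = "="
    item_sep: str = ","


@dataclass
class JobConfig:
    # Writing `override_dirname: OverrideDirname = OverrideDirname()` here
    # raises the ValueError above on Python 3.11+.
    # default_factory builds a fresh instance per JobConfig instead.
    override_dirname: OverrideDirname = field(default_factory=OverrideDirname)
```

The real fix has to happen inside hydra itself, so in practice this means either a hydra release that supports Python 3.11 or an older interpreter such as the 3.10 setups reported elsewhere in these issues.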
/home/.........../token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
warnings.warn(
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/....../whisper-diarization/venv/diarize_parallel.py:137 in │
│ <module> │
│ │
│ 134 │ │
│ 135 │ words_list = list(map(lambda x: x["word"], wsm)) │
│ 136 │ │
│ ❱ 137 │ labled_words = punct_model.predict(words_list) │
│ 138 │ │
│ 139 │ ending_puncts = ".?!" │
│ 140 │ model_puncts = ".,?-:" │
│ │
│ /home/.....//whisper-diarization/venv/ython3.10/site-packages/deepmultilingualpunctuation/pun │
│ ctuationmodel.py:39 in predict │
│ │
│ 36 │ │ │
│ 37 │ │ # if the last batch is smaller than the overlap, │
│ 38 │ │ # we can just remove it │
│ ❱ 39 │ │ if len(batches[-1]) <= overlap: │
│ 40 │ │ │ batches.pop() │
│ 41 │ │ │
│ 42 │ │ tagged_words = [] │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range
I tried wrapping the len(batches[-1]) <= overlap check in a try block, and added a len(batches[0]) <= overlap check to boot, in the punctuationmodel.py file (I'm not great at this programming thing and am still learning). With that I was able to generate SRT files / transcriptions for a couple of audio files I was working with, but then the error came back.
Hope this is helpful!
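The IndexError suggests batches is empty when predict reaches batches[-1], which would happen if words_list is empty (an assumption; the thread does not confirm the input). A minimal sketch of a guard that tolerates that case, written as a standalone function rather than a patch to the library. drop_short_tail is a hypothetical name.

```python
from typing import List, Sequence


def drop_short_tail(batches: List[Sequence[str]], overlap: int) -> List[Sequence[str]]:
    """Remove the last batch if it is no longer than `overlap`.

    Mutates and returns `batches`; an empty list is returned unchanged.
    """
    if batches and len(batches[-1]) <= overlap:
        batches.pop()
    return batches
```

Checking that words_list is non-empty before calling punct_model.predict would avoid the problem at the call site without touching the installed package.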
Hi,
What is the simplest way to do this?
Thanks
I have a lot of files that error out because of silences which aren't of significant lengths - a couple of seconds or so.
I know Whisper has issues with silence and there is a VAD built in, but is there any way to improve this other than having to pre-process it?
Thanks
I have a stereo voice file. Its two channels can be separated into left and right, with one speaker on each. What should be modified in the code so that it processes and diarizes the file accordingly? Thanks
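One way to approach the stereo case is to split the file into two mono files before transcription, so each file holds one speaker. A minimal standard-library sketch, assuming 16-bit PCM WAV input on a little-endian machine. split_stereo_wav is a hypothetical helper and not part of the repository.

```python
import array
import wave


def split_stereo_wav(src: str, left_out: str, right_out: str) -> None:
    """Write the left and right channels of a 16-bit stereo WAV as mono files."""
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = wav.getframerate()
        # Samples are interleaved: L0, R0, L1, R1, ...
        samples = array.array("h", wav.readframes(wav.getnframes()))

    for path, channel in ((left_out, samples[0::2]), (right_out, samples[1::2])):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)
            out.writeframes(channel.tobytes())
```

Each mono file could then be transcribed on its own and the two transcripts merged by timestamp, with the speaker label taken from the channel instead of from the diarizer.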
def backtrace(trace: np.ndarray):
Traceback (most recent call last):
File "c:\users\t_care\appdata\local\programs\python\python38\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\t_care\appdata\local\programs\python\python38\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\Scripts\whisper.exe\__main__.py", line 7, in <module>
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\whisper\transcribe.py", line 433, in cli
model = load_model(model_name, device=device, download_root=model_dir)
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\whisper\__init__.py", line 144, in load_model
checkpoint = torch.load(fp, map_location=device)
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 809, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1172, in _load
result = unpickler.load()
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1142, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1116, in load_tensor
wrap_storage=restore_location(storage, location),
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 1083, in restore_location
return default_restore_location(storage, map_location)
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 217, in default_restore_location
result = fn(storage, location)
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 182, in _cuda_deserialize
device = validate_cuda_device(location)
File "c:\users\t_care\appdata\local\programs\python\python38\lib\site-packages\torch\serialization.py", line 166, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
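The RuntimeError says the checkpoint is being loaded onto CUDA while torch.cuda.is_available() is False. A minimal sketch of choosing the device at runtime instead of hard-coding it. pick_device is a hypothetical helper, and the import is guarded so the function also works where torch is missing.

```python
def pick_device() -> str:
    """Return "cuda" when a usable GPU is visible to torch, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Using device:", pick_device())
```

The returned string can be passed wherever a device is expected, for example as the device argument of load_model seen in the traceback. If a GPU is present but still not detected, the installed torch build may be CPU-only.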
PS C:\Users\T_Care\Desktop\whisper_dia\whisper-diarization> python diarize.py -a "sample.mp3" --no-stem
C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\pkg_resources\__init__.py:123: PkgResourcesDeprecationWarning: otobuf is an invalid version and will not be supported in a future release
warnings.warn(
[NeMo W 2023-07-03 10:47:18 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-03 10:47:18 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2023-07-03 10:47:36 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-07-03 10:47:36 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:47:36 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo
[NeMo I 2023-07-03 10:47:36 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:47:36 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true
[NeMo W 2023-07-03 10:47:36 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
[NeMo W 2023-07-03 10:47:36 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false
[NeMo I 2023-07-03 10:47:36 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:37 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:38 save_restore_connector:247] Model EncDecDiarLabelModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:47:38 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:38 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-07-03 10:47:38 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:47:38 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo
[NeMo I 2023-07-03 10:47:38 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:47:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
shift:
prob: 0.5
min_shift_ms: -10.0
max_shift_ms: 10.0
white_noise:
prob: 0.5
min_level: -90
max_level: -46
norm: true
noise:
prob: 0.5
manifest_path: /manifests/noise_0_1_musan_fs.json
min_snr_db: 0
max_snr_db: 30
max_gain_db: 300.0
norm: true
gain:
prob: 0.5
min_gain_dbfs: -10.0
max_gain_dbfs: 10.0
norm: true
num_workers: 16
pin_memory: true
[NeMo W 2023-07-03 10:47:38 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: false
val_loss_idx: 0
num_workers: 16
pin_memory: true
[NeMo W 2023-07-03 10:47:38 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: false
test_loss_idx: 0
[NeMo I 2023-07-03 10:47:38 features:287] PADDING: 16
[NeMo I 2023-07-03 10:47:38 save_restore_connector:247] Model EncDecClassificationModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:47:38 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-07-03 10:47:38 msdd_models:865] Clustering Parameters: {
"oracle_num_speakers": false,
"max_num_speakers": 8,
"enhanced_count_thres": 80,
"max_rp_threshold": 0.25,
"sparse_search_volume": 30,
"maj_vote_spk_count": false
}
[NeMo W 2023-07-03 10:47:38 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-07-03 10:47:38 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-07-03 10:47:38 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 332.27it/s]
[NeMo I 2023-07-03 10:47:38 vad_utils:101] The prepared manifest file exists. Overwriting!
[NeMo I 2023-07-03 10:47:38 classification_models:263] Perform streaming frame-level VAD
[NeMo I 2023-07-03 10:47:38 collections:298] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-07-03 10:47:38 collections:301] Dataset loaded with 1 items, total duration of 0.00 hours.
[NeMo I 2023-07-03 10:47:38 collections:303] # 1 files loaded accounting to # 1 labels
vad: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "diarize.py", line 112, in <module>
msdd_model.diarize()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 1180, in diarize
self.clustering_embedding.prepare_cluster_embs_infer()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 699, in prepare_cluster_embs_infer
self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 866, in run_clustering_diarizer
scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 437, in diarize
self._perform_speech_activity_detection()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 325, in _perform_speech_activity_detection
self._run_vad(manifest_vad_input)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 218, in _run_vad
for i, test_batch in enumerate(
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\tqdm\std.py", line 1178, in __iter__
for obj in iterable:
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 441, in __iter__
return self._get_iterator()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1042, in __init__
w.start()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'nemo.collections.common.parts.preprocessing.collections.SpeechLabelEntity'>: attribute lookup SpeechLabelEntity on nemo.collections.common.parts.preprocessing.collections failed
PS C:\Users\T_Care\Desktop\whisper_dia\whisper-diarization> C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\pkg_resources\__init__.py:123: PkgResourcesDeprecationWarning: otobuf is an invalid version and will not be supported in a future release
warnings.warn(
[NeMo W 2023-07-03 10:47:42 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-03 10:47:42 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2023-07-03 10:47:59 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-07-03 10:47:59 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:47:59 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo
[NeMo I 2023-07-03 10:47:59 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:48:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true
[NeMo W 2023-07-03 10:48:00 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
[NeMo W 2023-07-03 10:48:00 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false
[NeMo I 2023-07-03 10:48:00 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:00 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:02 save_restore_connector:247] Model EncDecDiarLabelModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-03 10:48:02 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:02 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-07-03 10:48:02 cloud:58] Found existing object C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:48:02 cloud:64] Re-using file from: C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo
[NeMo I 2023-07-03 10:48:02 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-03 10:48:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
shift:
prob: 0.5
min_shift_ms: -10.0
max_shift_ms: 10.0
white_noise:
prob: 0.5
min_level: -90
max_level: -46
norm: true
noise:
prob: 0.5
manifest_path: /manifests/noise_0_1_musan_fs.json
min_snr_db: 0
max_snr_db: 30
max_gain_db: 300.0
norm: true
gain:
prob: 0.5
min_gain_dbfs: -10.0
max_gain_dbfs: 10.0
norm: true
num_workers: 16
pin_memory: true
[NeMo W 2023-07-03 10:48:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: false
val_loss_idx: 0
num_workers: 16
pin_memory: true
[NeMo W 2023-07-03 10:48:02 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: false
test_loss_idx: 0
[NeMo I 2023-07-03 10:48:02 features:287] PADDING: 16
[NeMo I 2023-07-03 10:48:02 save_restore_connector:247] Model EncDecClassificationModel was successfully restored from C:\Users\T_Care\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-03 10:48:02 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-07-03 10:48:02 msdd_models:865] Clustering Parameters: {
"oracle_num_speakers": false,
"max_num_speakers": 8,
"enhanced_count_thres": 80,
"max_rp_threshold": 0.25,
"sparse_search_volume": 30,
"maj_vote_spk_count": false
}
[NeMo W 2023-07-03 10:48:02 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-07-03 10:48:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-07-03 10:48:02 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 335.06it/s]
[NeMo I 2023-07-03 10:48:02 vad_utils:101] The prepared manifest file exists. Overwriting!
[NeMo I 2023-07-03 10:48:02 classification_models:263] Perform streaming frame-level VAD
[NeMo I 2023-07-03 10:48:02 collections:298] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-07-03 10:48:02 collections:301] Dataset loaded with 1 items, total duration of 0.00 hours.
[NeMo I 2023-07-03 10:48:02 collections:303] # 1 files loaded accounting to # 1 labels
vad: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\T_Care\Desktop\whisper_dia\whisper-diarization\diarize.py", line 112, in <module>
msdd_model.diarize()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 1180, in diarize
self.clustering_embedding.prepare_cluster_embs_infer()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 699, in prepare_cluster_embs_infer
self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 866, in run_clustering_diarizer
scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 437, in diarize
self._perform_speech_activity_detection()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 325, in _perform_speech_activity_detection
self._run_vad(manifest_vad_input)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 218, in _run_vad
for i, test_batch in enumerate(
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\tqdm\std.py", line 1178, in __iter__
for obj in iterable:
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 441, in __iter__
return self._get_iterator()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1042, in __init__
w.start()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "C:\Users\T_Care\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
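The idiom this error message asks for can be sketched as follows. This is a minimal, self-contained illustration and not the actual diarize.py: `square`, `run` and the `WORKERS` environment variable are hypothetical names. On Windows, child processes are started with spawn, which re-imports the main module, so any code that starts workers has to sit behind the guard.

```python
import multiprocessing as mp
import os


def square(x):
    # Worker functions must live at module level so spawn can pickle them.
    return x * x


def run(values, workers=0):
    # workers=0 keeps everything in the current process, similar in spirit
    # to a data loader configured with zero worker processes.
    if workers == 0:
        return [square(v) for v in values]
    # "spawn" is the only start method available on Windows.
    with mp.get_context("spawn").Pool(workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Only the parent process enters this block; spawned children re-import
    # the module but skip it, so they do not try to start workers again.
    mp.freeze_support()  # only matters for frozen executables
    workers = int(os.environ.get("WORKERS", "0"))  # hypothetical knob
    print(run([1, 2, 3], workers))
```

For a script like diarize.py this would mean moving the top-level pipeline code into a function and calling it from the guarded block; whether that alone also resolves the PicklingError above is not confirmed in this thread.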
Hi!
I tried to install the dependencies from your requirements.txt and received this error:
ERROR: Cannot install nemo-toolkit[asr]==1.15.0 and transformers==4.26.1 because these package versions have conflicting dependencies.
The conflict is caused by:
The user requested transformers==4.26.1
nemo-toolkit[asr] 1.15.0 depends on transformers<=4.21.2 and >=4.0.1; extra == "asr"
Could you please explain what I should do to install all the dependencies?
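To make the conflict concrete, here is a small sketch that checks a transformers version against the range pip reports for nemo-toolkit[asr] 1.15.0. The helper names `parse` and `allowed_by_nemo` are hypothetical, and the parser assumes plain dotted integer versions only.

```python
def parse(version):
    # "4.21.2" -> (4, 21, 2); assumes plain dotted integers only.
    return tuple(int(part) for part in version.split("."))


def allowed_by_nemo(version, low="4.0.1", high="4.21.2"):
    # Range taken from the pip message: transformers<=4.21.2 and >=4.0.1
    return parse(low) <= parse(version) <= parse(high)


for candidate in ("4.26.1", "4.21.2"):
    print(candidate, allowed_by_nemo(candidate))
```

The pinned transformers==4.26.1 falls outside that range, so pip cannot satisfy both pins at once. One of the two pins has to change; which one is right for this repository is not stated in the thread.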
Please help.
C:\Users\Administrator\Documents\GitHub\whisper-diarization>pip install -r requirements.txt
Collecting git+https://github.com/m-bain/whisperX.git@4cb167a225c0ebaea127fd6049abfaa3af9f8bb4 (from -r requirements.txt (line 5))
Cloning https://github.com/m-bain/whisperX.git (to revision 4cb167a225c0ebaea127fd6049abfaa3af9f8bb4) to c:\users\administrator\appdata\local\temp\pip-req-build-g_0nd3n1
Running command git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git 'C:\Users\Administrator\AppData\Local\Temp\pip-req-build-g_0nd3n1'
fatal: unable to access 'https://github.com/m-bain/whisperX.git/': Recv failure: Connection was reset
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git 'C:\Users\Administrator\AppData\Local\Temp\pip-req-build-g_0nd3n1' did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git 'C:\Users\Administrator\AppData\Local\Temp\pip-req-build-g_0nd3n1' did not run successfully.
│ exit code: 128
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
@MahmoudAshraf97
I get the error below:
python3 diarize.py -a ../whisperX/output.wav
[NeMo W 2023-06-19 11:53:43 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-06-19 11:53:44 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /data/dProjects/whisper-diarization/temp_outputs/htdemucs
Separating track ../whisperX/output.wav
100%|██████████████████████████████████████████████████████████████████████| 602.55/602.55 [00:12<00:00, 48.97seconds/s]
Failed to align segment (" Google."): backtrack failed, resorting to original...
Failed to align segment: duration smaller than 0.02s time precision
Failed to align segment: duration smaller than 0.02s time precision
Traceback (most recent call last):
File "diarize.py", line 89, in <module>
result_aligned = whisperx.align(
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/whisperx/alignment.py", line 224, in align
emissions, _ = model(waveform_segment.to(device))
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torchaudio/models/wav2vec2/model.py", line 116, in forward
x, lengths = self.feature_extractor(waveforms, lengths)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torchaudio/models/wav2vec2/components.py", line 141, in forward
x, length = layer(x, length) # (batch, feature, frame)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torchaudio/models/wav2vec2/components.py", line 90, in forward
x = self.conv(x)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/data/dProjects/faster-whisper/venFasterWhsiper/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size
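One possible reading of this traceback is that a segment is so short that, once it reaches the wav2vec2 convolution layers, fewer samples remain than the kernel needs. A minimal sketch of a guard that separates such segments before alignment, assuming segments are dicts with `start` and `end` in seconds. `MIN_DURATION_S` and `split_by_duration` are hypothetical, and the threshold is a guess rather than a value taken from whisperX.

```python
MIN_DURATION_S = 0.05  # assumed threshold, not a whisperX constant


def split_by_duration(segments, min_duration=MIN_DURATION_S):
    # Separate segments long enough to align from those that are too short.
    alignable, too_short = [], []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        (alignable if duration >= min_duration else too_short).append(seg)
    return alignable, too_short


# Made-up example data for illustration only.
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 2.51, "text": " Google."},
]
alignable, too_short = split_by_duration(segments)
print(len(alignable), len(too_short))
```

Only the `alignable` list would be passed on for alignment; the short segments could keep their original Whisper timestamps so that no text is lost.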
Just an update to everyone that reads this repo: it worked perfectly in Portuguese for me.
I get this error every time I try to run it. I've tried many different ways to install the newer cuDNN, and I am using Miniconda, but nothing works. Any ideas?
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 5, 0) but found runtime version (8, 2, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.
Error:
Detected language: Gujarati
100%|██████████| 8533/8533 [00:16<00:00, 512.95frames/s]There is no default alignment model set for this language (gu). Please find a wav2vec2.0 model finetuned on this language in https://huggingface.co/models, then pass the model name in --align_model [MODEL_NAME]
ValueError Traceback (most recent call last)
in <cell line: 8>()
6
7 device = "cuda"
----> 8 alignment_model, metadata = whisperx.load_align_model(
9 language_code=whisper_results["language"], device=device
10 )
/usr/local/lib/python3.9/dist-packages/whisperx/alignment.py in load_align_model(language_code, device, model_name)
51 print(f"There is no default alignment model set for this language ({language_code}).
52 Please find a wav2vec2.0 model finetuned on this language in https://huggingface.co/models, then pass the model name in --align_model [MODEL_NAME]")
---> 53 raise ValueError(f"No default align-model for language: {language_code}")
54
55 if model_name in torchaudio.pipelines.all:
ValueError: No default align-model for language: gu
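The traceback shows that `whisperx.load_align_model` accepts a `model_name` argument, so a language without a default could be handled by supplying one explicitly. A minimal sketch, assuming a user-maintained mapping: `ALIGN_MODEL_OVERRIDES`, `align_kwargs` and the placeholder model name are hypothetical, and a real wav2vec2 model fine-tuned on the language still has to be found on Hugging Face.

```python
# Hypothetical mapping from language code to a wav2vec2 model name.
# The value below is a placeholder, not a real model.
ALIGN_MODEL_OVERRIDES = {"gu": "<wav2vec2-model-finetuned-on-gujarati>"}


def align_kwargs(language_code, device, overrides=ALIGN_MODEL_OVERRIDES):
    # Build keyword arguments, adding model_name only when an override exists.
    kwargs = {"language_code": language_code, "device": device}
    if language_code in overrides:
        kwargs["model_name"] = overrides[language_code]
    return kwargs


kwargs = align_kwargs("gu", "cuda")
print(kwargs)
# Intended use, matching the signature shown in the traceback:
# alignment_model, metadata = whisperx.load_align_model(**kwargs)
```

Languages that already have a default alignment model are passed through unchanged, so the same call works for both cases.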
I run diarization on an M1 Mac like this:
python3 diarize.py -a trimmed_1min.wav --no-stem --whisper-model large-v2 --device cpu
I get this error:
File "/Users/alexanderblagochevsky/Documents/whisper-diarization/diarize.py", line 160, in <module>
f'Punctuation restoration is not available for {whisper_results["language"]} language.'
TypeError: list indices must be integers or slices, not str
I've tried commenting out the punctuation part of diarize.py. The script then completes without error, but the output attributes the whole conversation to a single speaker.
I tested this with audio that has multiple speakers. It worked fairly well. Is there a way to add timestamps for when each speaker was detected?
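As an illustration of what timestamped speaker output could look like, here is a minimal sketch that formats speaker turns with SRT-style timestamps. The turn structure (`speaker`, `start_ms`, `end_ms`, `text`) and both function names are assumptions for this example, not the repository's actual data model.

```python
def format_timestamp(ms):
    # Convert milliseconds to the SRT form HH:MM:SS,mmm.
    hours, rest = divmod(int(ms), 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def format_turn(turn):
    # Render one speaker turn as "[start --> end] speaker: text".
    start = format_timestamp(turn["start_ms"])
    end = format_timestamp(turn["end_ms"])
    return f"[{start} --> {end}] {turn['speaker']}: {turn['text']}"


# Made-up example turn for illustration only.
turn = {"speaker": "Speaker 0", "start_ms": 1020, "end_ms": 4500, "text": "Hello."}
print(format_turn(turn))
```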
Can you please check this error and help me fix it so that I can get the transcription and diarization?
Error pasted below:
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/audio.py:46 in load_audio │
│ │
│ 43 │ │ # This launches a subprocess to decode audio while down-mixing and resampling as │
│ 44 │ │ # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed. │
│ 45 │ │ out, _ = ( │
│ ❱ 46 │ │ │ ffmpeg.input(file, threads=0) │
│ 47 │ │ │ .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr) │
│ 48 │ │ │ .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True) │
│ 49 │ │ ) │
│ │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/ffmpeg/_run.py:325 in run │
│ │
│ 322 │ out, err = process.communicate(input) │
│ 323 │ retcode = process.poll() │
│ 324 │ if retcode: │
│ ❱ 325 │ │ raise Error('ffmpeg', out, err) │
│ 326 │ return out, err │
│ 327 │
│ 328 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Error: ffmpeg error (see stderr output for detail)
The above exception was the direct cause of the following exception:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/gvnavin/ws/whisper-diarization/diarize.py:94 in <module> │
│ │
│ 91 │
│ 92 # Large models result in considerably better and more aligned (words, timestamps) mappin │
│ 93 whisper_model = load_model(args.model_name) │
│ ❱ 94 whisper_results = whisper_model.transcribe(vocal_target, beam_size=None, verbose=False) │
│ 95 │
│ 96 # clear gpu vram │
│ 97 del whisper_model │
│ │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/transcribe.py:121 in transcribe │
│ │
│ 118 │ │ decode_options["fp16"] = False │
│ 119 │ │
│ 120 │ # Pad 30-seconds of silence to the input audio, for slicing │
│ ❱ 121 │ mel = log_mel_spectrogram(audio, padding=N_SAMPLES) │
│ 122 │ content_frames = mel.shape[-1] - N_FRAMES │
│ 123 │ │
│ 124 │ if decode_options.get("language", None) is None: │
│ │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/audio.py:130 in log_mel_spectrogram │
│ │
│ 127 │ """ │
│ 128 │ if not torch.is_tensor(audio): │
│ 129 │ │ if isinstance(audio, str): │
│ ❱ 130 │ │ │ audio = load_audio(audio) │
│ 131 │ │ audio = torch.from_numpy(audio) │
│ 132 │ │
│ 133 │ if device is not None: │
│ │
│ /home/gvnavin/anaconda3/lib/python3.9/site-packages/whisper/audio.py:51 in load_audio │
│ │
│ 48 │ │ │ .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True) │
│ 49 │ │ ) │
│ 50 │ except ffmpeg.Error as e: │
│ ❱ 51 │ │ raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e │
│ 52 │ │
│ 53 │ return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0 │
│ 54 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to load audio: ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64
--enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio
--enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack
--enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband
--enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab
--enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2
--enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm
--enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 70.100 / 56. 70.100
Hello.
First of all, thank you for developing and supporting this model.
As a Korean, I am interested in Korean STT implementations.
Would this whisper-diarization model work in Korean?
Thank you for trying to help if you can :)
I followed the instructions in the README, and successfully installed all specified dependencies. I'm now trying to run the package from the command line. Using a Windows Surface 3 laptop running Windows 10 Pro, I run into this error:
❯ python diarize.py -a MY_FILE.mp3
[NeMo W 2023-05-05 12:07:02 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-05-05 12:07:05 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Source splitting failed, using original audio file. Use --no-stem argument to disable it.
Traceback (most recent call last):
File "<MY LOCAL PATH>/diarize.py", line 56, in <module>
whisper_model = WhisperModel(args.model_name, device="cuda", compute_type="float16")
File "<MY VENVS FOLDER>\whisper-diarization\lib\site-packages\faster_whisper\transcribe.py", line 120, in __init__
self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version
Is this a hardware issue or am I missing something? Thank you!
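The message "CUDA driver version is insufficient for CUDA runtime version" generally means the installed NVIDIA driver is older than the CUDA runtime the package was built against (or that no usable NVIDIA GPU is present). Updating the driver is one route; another is to fall back to the CPU. Below is a minimal sketch of that fallback, not the project's actual code. The helper names choose_whisper_settings and cuda_is_usable are hypothetical, and the "int8" compute type on CPU is an assumption that may need adjusting for your machine.

```python
def choose_whisper_settings(cuda_available: bool) -> dict:
    """Return keyword arguments for faster_whisper.WhisperModel (hypothetical helper)."""
    if cuda_available:
        # Same settings as the failing call in the traceback above.
        return {"device": "cuda", "compute_type": "float16"}
    # Assumption: int8 is a reasonable CPU-friendly compute type.
    return {"device": "cpu", "compute_type": "int8"}


def cuda_is_usable() -> bool:
    """Best-effort check; returns False if torch is missing or CUDA cannot be initialised."""
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        return False


settings = choose_whisper_settings(cuda_is_usable())
print(settings)
# In diarize.py the model could then be built as:
# whisper_model = WhisperModel(args.model_name, **settings)
```

Transcription on CPU will be much slower, so this is mainly useful for checking that the rest of the pipeline works.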
I have a dataset of files. If I upload files of, say, 10 seconds that contain no voice or only background noise, I get the error ValueError: All files present in manifest contains silence, aborting next steps
How should I handle this in the code?
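One way to handle this when batch-processing files is to catch that specific ValueError and skip the file. The sketch below assumes the diarization step can be wrapped in a callable; diarize_or_skip and the diarize argument are hypothetical names, and matching on the text "contains silence" is based only on the message quoted above.

```python
from typing import Callable, Optional

# Substring taken from the error message quoted in the question.
SILENCE_MARKER = "contains silence"


def diarize_or_skip(diarize: Callable[[str], object], audio_path: str) -> Optional[object]:
    """Run diarize(audio_path); return None if the file held no speech."""
    try:
        return diarize(audio_path)
    except ValueError as err:
        if SILENCE_MARKER in str(err):
            print(f"Skipping {audio_path}: no speech detected")
            return None
        # Any other ValueError is a real problem, so re-raise it.
        raise
```

Files that return None can then be logged or written out with an empty transcript instead of aborting the whole batch.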
First of all, thank you for maintaining this tool.
Will it work in Turkish?
Regards.
Hi,
When I run the whisperx.align part, certain sound files give me this error:
RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size.
Any idea on what causes this error and how to fix it?
I receive the following error message:
[NeMo W 2023-06-14 11:08:57 nemo_logging:393] Apex was not found. Using the lamb or fused_adam optimizer will error out.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/emoman/Work/diarisation/whisper-diarization/diarize.py:9 in <module> │
│ │
│ 6 import torch │
│ 7 import librosa │
│ 8 import soundfile │
│ ❱ 9 from nemo.collections.asr.models.msdd_models import NeuralDiarizer │
│ 10 from deepmultilingualpunctuation import PunctuationModel │
│ 11 import re │
│ 12 import logging │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/__init__.py:15 in <module> │
│ │
│ 12 # See the License for the specific language governing permissions and │
│ 13 # limitations under the License. │
│ 14 │
│ ❱ 15 from nemo.collections.asr import data, losses, models, modules │
│ 16 from nemo.package_info import __version__ │
│ 17 │
│ 18 # Set collection version equal to NeMo version. │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/losses/__init__.py:16 in │
│ <module> │
│ │
│ 13 # limitations under the License. │
│ 14 │
│ 15 from nemo.collections.asr.losses.angularloss import AngularSoftmaxLoss │
│ ❱ 16 from nemo.collections.asr.losses.audio_losses import SDRLoss │
│ 17 from nemo.collections.asr.losses.ctc import CTCLoss │
│ 18 from nemo.collections.asr.losses.lattice_losses import LatticeLoss │
│ 19 from nemo.collections.asr.losses.ssl_losses.contrastive import ContrastiveLoss │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/losses/audio_losses.py:21 │
│ in <module> │
│ │
│ 18 import numpy as np │
│ 19 import torch │
│ 20 │
│ ❱ 21 from nemo.collections.asr.parts.preprocessing.features import make_seq_mask_like │
│ 22 from nemo.collections.asr.parts.utils.audio_utils import toeplitz │
│ 23 from nemo.core.classes import Loss, Typing, typecheck │
│ 24 from nemo.core.neural_types import AudioSignal, LengthsType, LossType, MaskType, NeuralT │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/parts/preprocessing/__init │
│ __.py:16 in <module> │
│ │
│ 13 # limitations under the License. │
│ 14 │
│ 15 from nemo.collections.asr.parts.preprocessing.feature_loader import ExternalFeatureLoade │
│ ❱ 16 from nemo.collections.asr.parts.preprocessing.features import FeaturizerFactory, Filterb │
│ 17 from nemo.collections.asr.parts.preprocessing.perturb import ( │
│ 18 │ AudioAugmentor, │
│ 19 │ AugmentationDataset, │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/parts/preprocessing/featur │
│ es.py:44 in <module> │
│ │
│ 41 import torch │
│ 42 import torch.nn as nn │
│ 43 │
│ ❱ 44 from nemo.collections.asr.parts.preprocessing.perturb import AudioAugmentor │
│ 45 from nemo.collections.asr.parts.preprocessing.segment import AudioSegment │
│ 46 from nemo.utils import logging │
│ 47 │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/asr/parts/preprocessing/pertur │
│ b.py:50 in <module> │
│ │
│ 47 from torch.utils.data import IterableDataset │
│ 48 │
│ 49 from nemo.collections.asr.parts.preprocessing.segment import AudioSegment │
│ ❱ 50 from nemo.collections.common.parts.preprocessing import collections, parsers │
│ 51 from nemo.utils import logging │
│ 52 │
│ 53 # TODO @blisc: Perhaps refactor instead of import guarding │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/common/__init__.py:16 in │
│ <module> │
│ │
│ 13 # limitations under the License. │
│ 14 │
│ 15 import nemo.collections.common.callbacks │
│ ❱ 16 from nemo.collections.common import data, losses, parts, tokenizers │
│ 17 from nemo.package_info import __version__ │
│ 18 │
│ 19 # Set collection version equal to NeMo version. │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/common/parts/__init__.py:15 in │
│ <module> │
│ │
│ 12 # See the License for the specific language governing permissions and │
│ 13 # limitations under the License. │
│ 14 │
│ ❱ 15 from nemo.collections.common.parts.adapter_modules import LinearAdapter, LinearAdapterCo │
│ 16 from nemo.collections.common.parts.mlm_scorer import MLMScorer │
│ 17 from nemo.collections.common.parts.multi_layer_perceptron import MultiLayerPerceptron │
│ 18 from nemo.collections.common.parts.transformer_utils import * │
│ │
│ /home/emoman/.local/lib/python3.11/site-packages/nemo/collections/common/parts/adapter_modules.p │
│ y:147 in <module> │
│ │
│ 144 │ │ return x │
│ 145 │
│ 146 │
│ ❱ 147 @dataclass │
│ 148 class LinearAdapterConfig: │
│ 149 │ in_features: int │
│ 150 │ dim: int │
│ │
│ /usr/lib/python3.11/dataclasses.py:1223 in dataclass │
│ │
│ 1220 │ │ return wrap │
│ 1221 │ │
│ 1222 │ # We're called as @dataclass without parens. │
│ ❱ 1223 │ return wrap(cls) │
│ 1224 │
│ 1225 │
│ 1226 def fields(class_or_instance): │
│ │
│ /usr/lib/python3.11/dataclasses.py:1213 in wrap │
│ │
│ 1210 │ """ │
│ 1211 │ │
│ 1212 │ def wrap(cls): │
│ ❱ 1213 │ │ return _process_class(cls, init, repr, eq, order, unsafe_hash, │
│ 1214 │ │ │ │ │ │ │ frozen, match_args, kw_only, slots, │
│ 1215 │ │ │ │ │ │ │ weakref_slot) │
│ 1216 │
│ │
│ /usr/lib/python3.11/dataclasses.py:958 in _process_class │
│ │
│ 955 │ │ │ kw_only = True │
│ 956 │ │ else: │
│ 957 │ │ │ # Otherwise it's a field of some type. │
│ ❱ 958 │ │ │ cls_fields.append(_get_field(cls, name, type, kw_only)) │
│ 959 │ │
│ 960 │ for f in cls_fields: │
│ 961 │ │ fields[f.name] = f │
│ │
│ /usr/lib/python3.11/dataclasses.py:815 in _get_field │
│ │
│ 812 │ # indicator for mutability. Read the __hash__ attribute from the class, │
│ 813 │ # not the instance. │
│ 814 │ if f._field_type is _FIELD and f.default.__class__.__hash__ is None: │
│ ❱ 815 │ │ raise ValueError(f'mutable default {type(f.default)} for field ' │
│ 816 │ │ │ │ │ │ f'{f.name} is not allowed: use default_factory') │
│ 817 │ │
│ 818 │ return f │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: mutable default <class 'nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategyConfig'> for field adapter_strategy is not
allowed: use default_factory
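The traceback ends in Python 3.11's dataclasses module. Starting with 3.11, a dataclass field whose default value is an instance of an unhashable class is rejected, whereas 3.10 and earlier only rejected list, dict, and set defaults. Running under Python 3.10 or earlier avoids the check; a newer NeMo release may also avoid it, though that is not confirmed here. The sketch below reproduces the pattern and shows the default_factory form the error message asks for. The class names are illustrative stand-ins, not NeMo's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyConfig:
    """Stand-in for a nested config; eq=True without frozen makes it unhashable."""
    scale: float = 1.0


# On Python 3.11+ the following definition raises
# "ValueError: mutable default <class '...StrategyConfig'> for field
# adapter_strategy is not allowed: use default_factory":
#
# @dataclass
# class BrokenAdapterConfig:
#     adapter_strategy: StrategyConfig = StrategyConfig()


@dataclass
class AdapterConfig:
    in_features: int
    dim: int
    # default_factory builds a fresh StrategyConfig for every instance.
    adapter_strategy: StrategyConfig = field(default_factory=StrategyConfig)


a = AdapterConfig(in_features=8, dim=4)
b = AdapterConfig(in_features=8, dim=4)
print(a.adapter_strategy is b.adapter_strategy)
```

Because each instance gets its own nested config, changing one no longer affects the other.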
Hello guys, can you recommend a good tool or neural network just to separate the vocals from the rest of the noise?
diarize.py -a ./source/audio.wav
Failed to load audio: ffmpeg
temp_outputs/htdemucs_ft/source/audio/vocals.wav: No such file or directory
It's happening because the actual vocals.wav file is in temp_outputs/htdemucs_ft/audio/vocals.wav
I think this conflict occurs because the source audio file was at ./source/audio.wav, so the input's directory name ended up in the expected output path.
I got it working by going into the source directory and calling ../diarize.py -a audio.wav
Is this fixable? If not, a note in the README would help clear up some confusion.
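Going by the two paths quoted above, the separated file lands in a folder named after the input's base name only, so the lookup path should drop any leading directories. A minimal sketch of that derivation follows; expected_vocals_path is a hypothetical helper, and the default directory and model names are simply the ones that appear in this report.

```python
import os


def expected_vocals_path(audio_path: str,
                         out_dir: str = "temp_outputs",
                         model: str = "htdemucs_ft") -> str:
    """Build the vocals.wav path from the input's base name, ignoring its directory."""
    # "./source/audio.wav" -> "audio"
    track_name = os.path.splitext(os.path.basename(audio_path))[0]
    return os.path.join(out_dir, model, track_name, "vocals.wav")


print(expected_vocals_path("./source/audio.wav"))
```

With this, calling the script from the repository root with -a ./source/audio.wav would look in the same folder as calling it from inside source/ with -a audio.wav.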
I followed the steps in the README and got a test.txt after running the command
python .\diarize.py -a .\test.wav
but there is no diarization in the .txt
It looks like this:
Speaker 0: Personally, that's too many questions for me to answer. How you doing? My name's Josh. Hey, I'm Tone. How you doing? I didn't catch your name again? Huh? Tone. Tone? Tony. Tony, Tony, how you doing? ......
There is only one speaker in it.
WhisperX was taken down by GitHub. How can we still get access to it?
Awesome project, and the diarization works really well. It might be a me problem, but the subtitles aren't as accurate as with plain Whisper (using --whisper-model large). It is worst with names and places, e.g. Fin, bin, Ken, ben, rave, Dave, while Whisper gets them 100% correct. I find that odd because this repo uses Whisper. Also, when a speaker talks for a long time without interruption, the subtitle length seems to be unbounded.
Hi there,
Thanks a lot for your efforts.
I am really close to getting this working but I get the below error.
Standalone Whisper is working for me.
Is there any chance you could please have a look? Thanks.
C:\Users\Main>[NeMo W 2023-03-26 23:48:50 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-03-26 23:48:50 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Source splitting failed, using original audio file. Use --no-stem argument to disable it.
Traceback (most recent call last):
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\audio.py", line 46, in load_audio
ffmpeg.input(file, threads=0)
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\ffmpeg\_run.py", line 325, in run
raise Error('ffmpeg', out, err)
ffmpeg._run.Error: ffmpeg error (see stderr output for detail)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Main\diarize.py", line 94, in <module>
whisper_results = whisper_model.transcribe(vocal_target, beam_size=None, verbose=False)
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\transcribe.py", line 121, in transcribe
mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\audio.py", line 130, in log_mel_spectrogram
audio = load_audio(audio)
File "C:\Users\Main\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper\audio.py", line 51, in load_audio
raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
RuntimeError: Failed to load audio: ffmpeg version 2023-03-23-git-30cea1d39b-full_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12.2.0 (Rev10, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-sdl2 --enable-libaribb24 --enable-libdav1d --enable-libdavs2 --enable-libuavs3d --enable-libzvbi --enable-librav1e --enable-libsvtav1 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libaom --enable-libjxl --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libvpl --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libilbc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
libavutil 58. 5.100 / 58. 5.100
libavcodec 60. 6.101 / 60. 6.101
libavformat 60. 4.100 / 60. 4.100
libavdevice 60. 2.100 / 60. 2.100
libavfilter 9. 4.100 / 9. 4.100
libswscale 7. 2.100 / 7. 2.100
libswresample 4. 11.100 / 4. 11.100
libpostproc 57. 2.100 / 57. 2.100
File3.mp3: No such file or directory
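The last line of the ffmpeg output shows that the real failure is that File3.mp3 could not be found from the directory the process was running in. Checking the path up front gives a clearer error than the long ffmpeg dump. The sketch below is one way to do that; resolve_audio_path is a hypothetical helper, not part of diarize.py.

```python
from pathlib import Path


def resolve_audio_path(audio_arg: str) -> Path:
    """Turn the -a argument into an absolute path and fail early if it is missing."""
    path = Path(audio_arg).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    return path
```

Passing the resolved absolute path on to the transcription step also removes any dependence on the current working directory.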