dvector's Introduction

D-vector

This is a PyTorch implementation of a speaker embedding model (d-vector) trained with GE2E loss. The original paper on GE2E loss can be found here: Generalized End-to-End Loss for Speaker Verification

Usage

import torch
import torchaudio

wav2mel = torch.jit.load("wav2mel.pt")
dvector = torch.jit.load("dvector.pt").eval()

wav_tensor, sample_rate = torchaudio.load("example.wav")
mel_tensor = wav2mel(wav_tensor, sample_rate)  # shape: (frames, mel_dim)
emb_tensor = dvector.embed_utterance(mel_tensor)  # shape: (emb_dim)

You can also embed multiple utterances of a speaker at once:

emb_tensor = dvector.embed_utterances([mel_tensor_1, mel_tensor_2])  # shape: (emb_dim)
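A common way to use the embeddings for speaker verification is to score a pair of utterances with cosine similarity; the following is a minimal sketch (not necessarily the exact scoring used by the released checkpoint), reusing mel_tensor_1 and mel_tensor_2 from above:

import torch.nn.functional as F

emb_1 = dvector.embed_utterance(mel_tensor_1)
emb_2 = dvector.embed_utterance(mel_tensor_2)
similarity = F.cosine_similarity(emb_1.unsqueeze(0), emb_2.unsqueeze(0)).item()
same_speaker = similarity > 0.222  # threshold is a choice; 0.222 is the EER threshold reported below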

There are 2 modules in this example:

  • wav2mel.pt is the preprocessing module which is composed of 2 modules:
    • sox_effects.pt is used to normalize volume, remove silence, resample audio to 16 kHz / 16-bit, and remix all channels to a single channel
    • log_melspectrogram.pt is used to transform waveforms to log mel spectrograms
  • dvector.pt is the speaker encoder

Since all the modules are compiled with TorchScript, you can simply load and use them anywhere without depending on this repository's source code.

Pretrained models & preprocessing modules

You can download them from the Releases page.

Evaluate model performance

You can evaluate the performance of the model with the equal error rate (EER). For example, download the official test splits (veri_test.txt and veri_test2.txt) from The VoxCeleb1 Dataset and run the following command:

python equal_error_rate.py VoxCeleb1/test VoxCeleb1/test/veri_test.txt -w wav2mel.pt -c dvector.pt
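For reference, the equal error rate is the operating point where the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how it can be computed from per-pair scores and same/different-speaker labels (a generic recipe, not a copy of equal_error_rate.py):

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for same-speaker pairs, 0 otherwise; scores: higher means more similar
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FAR and FRR are closest
    return (fpr[idx] + fnr[idx]) / 2, thresholds[idx]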

So far, the released checkpoint was only trained on VoxCeleb1 without any data augmentation. Its performance on the official test splits of VoxCeleb1 is as follows:

Test Split        Equal Error Rate    Threshold
veri_test.txt     12.0%               0.222
veri_test2.txt    11.9%               0.223

Train from scratch

Preprocess training data

To use the script provided here, you have to organize your raw data in this way:

  • all utterances from a speaker should be put under a directory (speaker directory)
  • all speaker directories should be put under a directory (root directory)
  • a speaker directory can have subdirectories, and utterances can be placed under those subdirectories (see the example layout below)

You can also extract utterances from multiple root directories, e.g.

python preprocess.py VoxCeleb1/dev LibriSpeech/train-clean-360 -o preprocessed
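For reference, a layout like the following (speaker and file names are made up) satisfies the rules above:

root_dir/
  speaker_A/
    session_1/
      utt_001.wav
      utt_002.wav
    session_2/
      utt_001.wav
  speaker_B/
    utt_001.wav
    utt_002.wav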

If you need to modify some audio preprocessing hyperparameters, directly modify data/wav2mel.py. After preprocessing, 3 preprocessing modules will be saved in the output directory:

  1. wav2mel.pt
  2. sox_effects.pt
  3. log_melspectrogram.pt

The first module wav2mel.pt is composed of the second and the third modules. These modules were compiled with TorchScript and can be used anywhere to preprocess audio data.
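For example, the two sub-modules can also be loaded and chained manually; the following minimal sketch should be equivalent to what wav2mel.pt does for a single waveform:

import torch
import torchaudio

sox_effects = torch.jit.load("sox_effects.pt")
log_melspectrogram = torch.jit.load("log_melspectrogram.pt")

wav_tensor, sample_rate = torchaudio.load("example.wav")
wav_tensor = sox_effects(wav_tensor, sample_rate)  # normalize, trim silence, resample, downmix
mel_tensor = log_melspectrogram(wav_tensor)  # shape: (frames, mel_dim)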

Train a model

You have to specify where to store checkpoints and logs, e.g.

python train.py preprocessed <model_dir>

During training, logs will be put under <model_dir>/logs and checkpoints will be placed under <model_dir>/checkpoints. For more details, check the usage with python train.py -h.
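If the logs are written as TensorBoard event files (an assumption; check train.py to confirm), training can be monitored with:

tensorboard --logdir <model_dir>/logs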

Use different speaker encoders

By default I'm using a 3-layer LSTM with attentive pooling as the speaker encoder, but you can use speaker encoders with different architectures. For more information, please take a look at modules/dvector.py.
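For a rough idea of what such an encoder looks like, here is a minimal sketch of an LSTM speaker encoder with attentive pooling; layer sizes and names are illustrative and may differ from modules/dvector.py:

import torch
import torch.nn as nn

class AttentivePooledLSTMEncoder(nn.Module):
    def __init__(self, n_mels=40, dim_cell=256, dim_emb=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, dim_cell, num_layers=num_layers, batch_first=True)
        self.embedding = nn.Linear(dim_cell, dim_emb)
        self.attention = nn.Linear(dim_emb, 1)

    def forward(self, mels):  # mels: (batch, seg_len, n_mels)
        lstm_outs, _ = self.lstm(mels)                            # (batch, seg_len, dim_cell)
        embeds = torch.tanh(self.embedding(lstm_outs))            # (batch, seg_len, dim_emb)
        attn_weights = torch.softmax(self.attention(embeds), dim=1)  # (batch, seg_len, 1)
        embed = torch.sum(embeds * attn_weights, dim=1)           # attention-weighted sum over time
        return embed / embed.norm(p=2, dim=-1, keepdim=True)      # unit-length d-vector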

Visualize speaker embeddings

You can visualize speaker embeddings using a trained d-vector. Note that you have to structure the speakers' directories in the same way as for preprocessing, e.g.

python visualize.py LibriSpeech/dev-clean -w wav2mel.pt -c dvector.pt -o tsne.jpg

The following plot is the dimension reduction result (using t-SNE) of some utterances from LibriSpeech.

(Figure: t-SNE projection of speaker embeddings from LibriSpeech)
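Roughly, the plotting step boils down to something like the following sketch (visualize.py may differ in details); embeddings is assumed to be an (n_utterances, emb_dim) array collected with dvector.embed_utterance, and speakers a matching list of speaker ids:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, speakers, out_path="tsne.jpg"):
    points = TSNE(n_components=2, init="pca").fit_transform(embeddings)  # project to 2-D
    for spk in sorted(set(speakers)):
        idx = [i for i, s in enumerate(speakers) if s == spk]
        plt.scatter(points[idx, 0], points[idx, 1], s=5, label=spk)      # one color per speaker
    plt.savefig(out_path)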

dvector's People

Contributors

abreuwallace, cyhuang-tw, yistlin

dvector's Issues

Make preprocessing fully differentiable with torch API

I appreciate your efforts, nice work.
But your audio_toolkit was implemented with librosa and numpy, which are not differentiable.
That might limit the applications. E.g., if I have a TTS model that generates mel spectrograms, and if your d-vector pipeline were fully differentiable, we could use it like a discriminator to force the TTS model to sound exactly like the expected speaker.
From waveform to mel spectrogram, you can make the preprocessing fully differentiable with torchaudio, and it seems it can stay consistent with librosa.
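For illustration, torchaudio's transforms are indeed differentiable end to end; a minimal sketch (parameter values are illustrative, not the repo's exact settings):

import torch
import torchaudio

melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)
wav = torch.randn(1, 16000, requires_grad=True)   # dummy 1-second waveform
log_mel = torch.log(melspec(wav).clamp(min=1e-9)) # log mel spectrogram
log_mel.sum().backward()                          # gradients flow back to the waveform
print(wav.grad.shape)                             # torch.Size([1, 16000])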

Cannot reshape tensor during visualization

I am running visualization and get this error:

[INFO] model loaded.
Preprocess: 1% 12/889 [00:00<00:27, 31.83it/s]
Traceback (most recent call last):
File "visualize.py", line 89, in
visualize(**vars(PARSER.parse_args()))
File "visualize.py", line 43, in visualize
mel_tensor = wav2mel(wav_tensor, sample_rate)
File "/home/vvs/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/data/wav2mel.py", line 20, in forward
sample_rate: int) -> Tensor:
wav_tensor0 = (self.sox_effects).forward(wav_tensor, sample_rate, )
mel_tensor = (self.log_melspectrogram).forward(wav_tensor0, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return mel_tensor

.....

Traceback of TorchScript, original code (most recent call last):
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/transforms.py", line 96, in forward
Fourier bins, and time is the number of window hops (n_frame).
"""
return F.spectrogram(
~~~~~~~~~~~~~ <--- HERE
waveform,
self.pad,
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/functional/functional.py", line 88, in spectrogram
# pack batch
shape = waveform.size()
waveform = waveform.reshape(-1, shape[-1])
~~~~~~~~~~~~~~~~ <--- HERE

# default values are consistent with librosa.core.spectrum._spectrogram

RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

cannot reshape tensor of 0 elements into shape [-1, 0]

When the input tensor shape is [1, 800] or [1, 320], and I use the following code

mel_tensor = wav2mel(wav_tensor, 16000) # 16000 is the sample rate

I met with the following error:

Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/data/wav2mel.py", line 20, in forward
sample_rate: int) -> Tensor:
wav_tensor0 = (self.sox_effects).forward(wav_tensor, sample_rate, )
mel_tensor = (self.log_melspectrogram).forward(wav_tensor0, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return mel_tensor
class SoxEffects(Module):
File "code/torch/data/wav2mel.py", line 43, in forward
def forward(self: torch.data.wav2mel.LogMelspectrogram,
wav_tensor: Tensor) -> Tensor:
_3 = (self.melspectrogram).forward(wav_tensor, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
mel_tensor = torch.numpy_T(torch.squeeze(_3, 0))
_4 = torch.clamp(mel_tensor, 1.0000000000000001e-09, None)
File "code/torch/torchaudio/transforms.py", line 20, in forward
def forward(self: torch.torchaudio.transforms.MelSpectrogram,
waveform: Tensor) -> Tensor:
specgram = (self.spectrogram).forward(waveform, )
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
mel_specgram = (self.mel_scale).forward(specgram, )
return mel_specgram
File "code/torch/torchaudio/transforms.py", line 41, in forward
waveform: Tensor) -> Tensor:
_0 = torch.torchaudio.functional.functional.spectrogram
_1 = _0(waveform, 0, self.window, 400, 160, 400, 2., False, self.center, self.pad_mode, self.onesided, )
~~ <--- HERE
return _1
class MelScale(Module):
File "code/torch/torchaudio/functional/functional.py", line 18, in spectrogram
waveform0 = waveform
shape = torch.size(waveform0)
waveform2 = torch.reshape(waveform0, [-1, shape[-1]])
~~~~~~~~~~~~~ <--- HERE
spec_f = torch.torch.functional.stft(waveform2, n_fft, hop_length, win_length, window, center, pad_mode, False, onesided, True, )
_0 = torch.slice(shape, 0, -1, 1)

Traceback of TorchScript, original code (most recent call last):
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/transforms.py", line 96, in forward
Fourier bins, and time is the number of window hops (n_frame).
"""
return F.spectrogram(
~~~~~~~~~~~~~ <--- HERE
waveform,
self.pad,
File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/torchaudio/functional/functional.py", line 88, in spectrogram
# pack batch
shape = waveform.size()
waveform = waveform.reshape(-1, shape[-1])
~~~~~~~~~~~~~~~~ <--- HERE

# default values are consistent with librosa.core.spectrum._spectrogram

RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

How can I solve this problem?

window size -> seg_len

How can I make the window size be drawn from a uniform distribution within [240 ms, 1600 ms] during training?

In your source code dvector.py, I have two questions. One is the conditional check if utterance.size(1) <= self.seg_len:, which should compare against the 0th dimension instead, because the 1st dimension is 40 and is therefore always smaller than seg_len=160, so the sliding-window unfold part below can never be reached. Second, the output shape of unfold is [batch_size, 40, seg_len], while the input shape expected by AttentivePooledLSTMDvector should be [batch_size, seg_len, 40], that is, size(-1) must be 40.

As for the uniformly distributed seg_len, can I simply sample a uniformly distributed seg_len while traversing each utterance?

I hope you can give me an answer, thank you!

model loading issue

I am using the pretrained model to get the embedding but am getting the error below:

RuntimeError                              Traceback (most recent call last)
<ipython-input-5-4bcf886e7e7f> in <module>
      2 import torchaudio
      3 
----> 4 wav2mel = torch.jit.load("wav2mel.pt")
      5 dvector = torch.jit.load("dvector.pt").eval()
      6 

~/anaconda3/lib/python3.8/site-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
    159     cu = torch._C.CompilationUnit()
    160     if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 161         cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
    162     else:
    163         cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: 
Class Namespace cannot be used as a value:
Serialized   File "code/__torch__/torchaudio/sox_effects/sox_effects.py", line 5
    effects: List[List[str]],
    channels_first: bool=True) -> Tuple[Tensor, int]:
  in_signal = __torch__.torch.classes.torchaudio.TensorSignal.__new__(__torch__.torch.classes.torchaudio.TensorSignal)
                                                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  _0 = (in_signal).__init__(tensor, sample_rate, channels_first, )
  out_signal = ops.torchaudio.sox_effects_apply_effects_tensor(in_signal, effects)
'apply_effects_tensor' is being compiled since it was called from 'SoxEffects.forward'
Serialized   File "code/__torch__/data/wav2mel.py", line 29
    wav_tensor: Tensor,
    sample_rate: int) -> Tensor:
    _0 = __torch__.torchaudio.sox_effects.sox_effects.apply_effects_tensor
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = _0(wav_tensor, sample_rate, self.effects, True, )
    wav_tensor1, _2, = _1

CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm

I want to get a 601-dim d-vector, so I follow the "Train from scratch" steps:

  1. python preprocess.py ../LibriSpeech/train-clean-360 -o preprocessed
  2. python train.py preprocessed train601

Then there is an error; the traceback points to modules/dvector.py:
"""Forward a batch through network."""
lstm_outs, _ = self.lstm(inputs) # (batch, seg_len, dim_cell)
embeds = torch.tanh(self.embedding(lstm_outs)) # (batch, seg_len, dim_emb)
~~~~~~~~~~~~~~ <--- HERE

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

How long should the utterances be?

Hi, awesome repo! I am wondering, how should I cut my data? Is this setup good for sentence-level utterances?
In the code, it seems that the model is trained on short chunks of voiced audio, but in the visualization script, the embeddings are extracted from whole utterances (so a single embedding vector for the whole sentence). Did you experiment with different setups? For example, did you try to extract embeddings for segments of the given utterance and average them? My data is in an ASR setup, so it is pretty much sentence based. Any advice on how to further segment such data? Regards, Jan
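For illustration, the segment-and-average idea mentioned above could look roughly like this sketch (the segment length in frames and the hop are illustrative choices):

import torch

def embed_by_segments(dvector, mel_tensor, seg_len=160, hop=80):
    # slide a fixed-length window over the utterance, embed each segment, then average
    starts = range(0, max(mel_tensor.size(0) - seg_len, 0) + 1, hop)
    segments = [mel_tensor[s:s + seg_len] for s in starts]
    embeds = torch.stack([dvector.embed_utterance(seg) for seg in segments])
    embed = embeds.mean(dim=0)
    return embed / embed.norm(p=2)  # re-normalize the averaged d-vector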

runtime error in preprocess.py: does not have a __getstate__ method defined

I run preprocess.py and it raises an error:
Traceback (most recent call last):
  File "preprocess.py", line 92, in <module>
    preprocess(**vars(PARSER.parse_args()))
  File "preprocess.py", line 71, in preprocess
    for speaker_name, mel_tensor in tqdm(dataloader, ncols=0, desc="Preprocess"):
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
    return Popen(process_obj)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/xinyuewang/opt/anaconda3/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
RuntimeError: Tried to serialize object __torch__.data.wav2mel.Wav2Mel which does not have a __getstate__ method defined!
Does anyone know how to solve this?
My versions:
python==3.8, torch==1.8.0, torchaudio==0.8.0

RuntimeError: Unknown builtin op: torchaudio::sox_effects_apply_effects_tensor.

When I run the usage example, I encounter a problem:
C:\Users\86151\Desktop\Voice-Recognize-system-master\Scripts\python.exe C:\Users\86151\Desktop\Voice-Recognize-system-master\dvector-master\demo.py
Traceback (most recent call last):
File "C:\Users\86151\Desktop\Voice-Recognize-system-master\dvector-master\demo.py", line 5, in
wav2mel = torch.jit.load("wav2mel.pt")
File "C:\Users\86151\Desktop\Voice-Recognize-system-master\lib\site-packages\torch\jit_serialization.py", line 162, in load
cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files, _restore_shapes) # type: ignore[call-arg]
RuntimeError:
Unknown builtin op: torchaudio::sox_effects_apply_effects_tensor.
Could not find any similar ops to torchaudio::sox_effects_apply_effects_tensor. This op may not exist or may not be currently supported in TorchScript.
:
File "code/torch/torchaudio/sox_effects/sox_effects.py", line 5
effects: List[List[str]],
channels_first: bool=True) -> Tuple[Tensor, int]:
_0, _1 = ops.torchaudio.sox_effects_apply_effects_tensor(tensor, sample_rate, effects, channels_first)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return (_0, _1)
'apply_effects_tensor' is being compiled since it was called from 'SoxEffects.forward'
Serialized File "code/torch/data/wav2mel.py", line 31
wav_tensor: Tensor,
sample_rate: int) -> Tensor:
_0 = torch.torchaudio.sox_effects.sox_effects.apply_effects_tensor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_1 = _0(wav_tensor, sample_rate, self.effects, True, )
wav_tensor1, _2, = _1

System: Windows 11
torch: 1.11.0+cu113
torchaudio: 0.11.0+cu113
I saw that Windows may have problems with sox, so is there any solution to this problem? Like changing a package or something?

Issue in visualise.py -> Unknown builtin op: torchaudio_sox::apply_effects_tensor.

Running visualize("preprocessed", "preprocessed/wav2mel.pt", "dvector-step5000.pt", ".")

Getting the following error while torch.jit.load

RuntimeError: 
Unknown builtin op: torchaudio_sox::apply_effects_tensor.
Could not find any similar ops to torchaudio_sox::apply_effects_tensor. This op may not exist or may not be currently supported in TorchScript.
:
  File "code/__torch__/torchaudio/sox_effects/sox_effects.py", line 5
    effects: List[List[str]],
    channels_first: bool=True) -> Tuple[Tensor, int]:
  _0, _1 = ops.torchaudio_sox.apply_effects_tensor(tensor, sample_rate, effects, channels_first)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  return (_0, _1)
'apply_effects_tensor' is being compiled since it was called from 'SoxEffects.forward'
Serialized   File "code/__torch__/data/wav2mel.py", line 33
    wav_tensor: Tensor,
    sample_rate: int) -> Tensor:
    _0 = __torch__.torchaudio.sox_effects.sox_effects.apply_effects_tensor
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    effects = self.effects
    _1 = _0(wav_tensor, sample_rate, effects, True, )
