notebook Google Colab

Can you create a training model notebook on Google Colab?
I tried to do it, but something is not working.


Hello, the demo you provided is great. I have a question: were the released pretrain_models trained only on VCTK data, or was other training data added as well?


Training using LibriTTS

Hi, I am amazed at the potential of this model.
If I train on LibriTTS, do I need to retrain the HiFi-GAN vocoder that FreeVC currently uses, which was pretrained on VCTK?

changes

Since you changed data_utils the other day, I have noticed that the model seems to be training on much shorter segments.

The eval audio clips are also all under 1 second long, most of them just 40 frames. I am also noticing obvious gaps and blurs in the generated and GT mels. You can see "blurry spots" in these generated mels where the formants simply disappear.

[Image: genvGT david]
[Image: genvGT david 2]

About convert

Thank you for your research results! Does FreeVC support text-to-speech?

I have a few questions about the paper.

Hi, thank you for such creative work on Voice Conversion.

I have 3 questions about your paper, FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion.
It would help a lot if you would answer them.

  1. Did you freeze the WavLM module during training? (Also, did you use pre-trained weights for this module?)
  2. Can I get some architectural information about the bottleneck extractor? Is it just two fully connected layers that map the x_ssl features into d and then 2d dimension?
  3. During training, using the SR augmentation horizontally would un-align the source mel and the target wav.
    So I assume that you only used the SR augmentation vertically during training. Is this true?

++ Edit
I have one further question.
Could you possibly share the config file for the HiFi-GAN that vocodes the augmented x_mel into the WavLM input y'?

About the WER and CER

Sorry to bother you. In your paper you mention that the word error rate (WER) and character error rate (CER) between source and converted speech are obtained with an ASR model. Would you be able to open source the code for the WER and CER calculation?

Changing the dimension of speaker embeddings

I tried to swap out the speaker encoder with another one. Each utterance will be encoded into a vector of shape (512,). I tried to change "gin_channels": 256 into "gin_channels": 512 in the config of freevc.json. Training the model from scratch results in nan values in the log: [nan, nan, nan, nan, nan, 200, 0.0002]. Do I have to change anything else in the code?

Improving Target speaker conversion

Hi @OlaWod, this is a really good project, thanks for sharing. I have two questions about it:

  • Is it possible to improve the voice conversion by providing multiple audio samples of the unknown target speaker at inference time?
  • Is it possible to create a feedback loop after a voice conversion, letting users vote on whether the output sounds realistic and like the target speaker, and use those votes to improve the model?
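
One way the first idea could be tried, as a rough sketch only: average the speaker embeddings of several reference utterances and pass the mean as the target embedding. The function name `mean_speaker_embedding` is hypothetical, the toy 3-dimensional vectors stand in for real embeddings, and nothing in this thread confirms that FreeVC supports this or that it improves similarity.

```python
import math

def mean_speaker_embedding(embeddings):
    """Average several per-utterance embeddings and L2-normalize the result."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero.
    return [v / norm for v in mean] if norm > 0 else mean

# Toy 3-dimensional embeddings standing in for real speaker embeddings.
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = mean_speaker_embedding(refs)
print(target)
```

Whether the averaged vector should be re-normalized depends on how the speaker encoder was trained, so treat that step as an assumption too.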

Preprocess question

Your creates 2 folders, 16khz and 22khz. Isnt the model only training on the 16khz data?

The uses the 22k folder as its default, but why when wavlm is limited to 16khz?

Is there a purpose for this, or is it just a default that you forgot to change?
If I specify the 16 kHz folder instead, will this affect preprocessing or training at all?

About Speaker Embeddings

Hello again, I notice that for training, if using the pretrained speaker encoder and SR, the same speaker embedding is used regardless of the SR augmentation.
(edit: also the audio file and spectrogram are the same, regardless of SR)
I just wanted to double check if this is by design?

Thanks so much for helping everyone :)

Model Resolution

If I wanted to double the "resolution" of the model, would I just double all of the model parameters? Or do certain parts need to stay a certain size to maintain the information bottleneck? What would be the ideal parameters for a "higher res" model?

Under train in the config file, what are c_mel and c_kl?
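
For context, in VITS-style training code weights with these names scale the mel reconstruction loss and the KL loss before they are summed into the generator loss. The sketch below assumes FreeVC follows that convention, which this thread does not confirm. The function name `generator_loss` is hypothetical, and the defaults 45 and 1.0 are the `c_mel` and `c_kl` values from the config dump quoted elsewhere in this thread.

```python
def generator_loss(loss_adv, loss_fm, loss_mel, loss_kl, c_mel=45.0, c_kl=1.0):
    """Combine generator loss terms, weighting the mel and KL terms (assumed VITS-style)."""
    return loss_adv + loss_fm + c_mel * loss_mel + c_kl * loss_kl

# Example with made-up loss values.
total = generator_loss(loss_adv=1.0, loss_fm=2.0, loss_mel=0.5, loss_kl=3.0)
print(total)  # 28.5
```

Under this reading, raising `c_mel` pushes the model harder toward matching the target mel-spectrogram, and raising `c_kl` tightens the match between the prior and posterior distributions.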

Question about

In, what is the purpose of the following code?

    spec_seglen = spec_lengths[-1] if spec_lengths[-1] < self.hps.train.max_speclen + 1 else self.hps.train.max_speclen + 1
    wav_seglen = spec_seglen * 

    spec_padded, ids_slice = commons.rand_spec_segments(spec_padded, spec_lengths, spec_seglen)
    wav_padded = commons.slice_segments(wav_padded, ids_slice *, wav_seglen)
    c_padded = commons.slice_segments(c_padded, ids_slice, spec_seglen)[:,:,:-1]

    spec_padded = spec_padded[:,:,:-1]
    wav_padded = wav_padded[:,:,]
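
As a reading aid, here is a minimal pure-Python sketch of what the snippet appears to do: pick one segment length for the whole batch (the shortest item, capped at max_speclen + 1 frames), then cut the same random window out of the spectrogram and the waveform. It assumes the factor missing from the quoted code is the hop length (a later post in this thread quotes the same call with `ids_slice * 480`) and that `spec_lengths[-1]` is the shortest item in a length-sorted batch. The names `segment_length` and `slice_aligned` are hypothetical.

```python
import random

def segment_length(spec_lengths, max_speclen):
    """Frames to keep: the shortest item in the batch, capped at max_speclen + 1."""
    return min(min(spec_lengths), max_speclen + 1)

def slice_aligned(spec, wav, seg_frames, hop_length, rng=random):
    """Cut the same random window out of a frame sequence and its waveform."""
    start = rng.randint(0, max(len(spec) - seg_frames, 0))
    spec_seg = spec[start:start + seg_frames]
    # One spectrogram frame covers hop_length samples, so frame indices are
    # scaled by hop_length to address the waveform.
    wav_seg = wav[start * hop_length:(start + seg_frames) * hop_length]
    return start, spec_seg, wav_seg

hop = 320                    # hop_length from the 16 kHz config
spec = list(range(200))      # 200 dummy frames
wav = [0.0] * (200 * hop)    # matching dummy waveform
seg = segment_length([200, 150, 140], max_speclen=128)
start, spec_seg, wav_seg = slice_aligned(spec, wav, seg, hop)
print(seg, len(spec_seg), len(wav_seg))  # 129 129 41280
```

The quoted code then drops the last frame with `[:,:,:-1]`, which leaves max_speclen frames when the batch is long enough.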

Bug with num_workers=8

Executing command CUDA_VISIBLE_DEVICES="0" python -c configs/freevc.json -m freevc

But the training process continues

INFO:freevc:Train Epoch: 1 [0%]
INFO:freevc:[6.033028602600098, 4.603592395782471, 0.2524043619632721, 88.27883911132812, 37.568702697753906, 0, 0.0002]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f73359e7430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/", line 1466, in __del__
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/", line 1430, in _shutdown_workers
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/", line 931, in wait
    ready =
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/", line 66, in handler
RuntimeError: DataLoader worker (pid 14261) is killed by signal: Aborted.

SR questions

How did you arrive at plus 12 and minus 12 mels? Was this just a random choice, or did you test 3, 6, 20, etc. to find the optimal range?

Why did you choose to convert to mels, and then resize, and vocode back to wav? Why not just process the audio file and change the pitch?

You mentioned in another post that you had gotten better results with a custom-trained vocoder, but retrained with the standard VCTK for compatibility with other models. Could you share your custom-trained vocoder model?


48k voice conversion

Which is the best way to generate 48k speech?
(1) Train for 90k steps on the 16k dataset, then train at 48k using a hop_length of 960.
(2) Generate 24k speech, then use a super-resolution model to convert it to 48k.

I would really appreciate your reply. Thank you very much.

Some observations after 290k steps


After training for 290k (and still going) on 2 GPUs, I have decided to test conversion.

Results so far:

-- Conversion to M seen (but not seen in HifiGan):

  1. F seen to M seen -- very similar to the src F seen
  2. M seen to M seen -- very similar to the src M seen
  3. F unseen to M seen -- very similar to the src F seen
  4. M unseen to M seen -- pretty similar to the tgt M seen

So, 4 is pretty good, whilst 1-3 are similar to the source, not the target.

My setup is the same as yours; I used FreeVC with SR pre-training.

Dataset has 110 VCTK speakers in total (including 40 custom speakers).

Is this a question of more training, or of HiFi-GAN not having been trained on the custom speakers?

-- Conversion to M seen (also seen in HifiGan):

  1. F unseen to M seen -- pretty similar to the tgt M seen
  2. M unseen to M seen -- pretty similar to the tgt M seen

So it seems the issue is with the vocoder: it must be trained on the whole dataset beforehand.

WHAMR! artificial reverberation

I think a better way of extracting speech content might be to use WHAMR! to add artificial reverberation and background noise to the training dataset before WavLM extracts the content. This would be useful because the model could then extract content from audio that is not studio grade. I believe Whisper uses a similar technique for ASR.

Training Log for 24khz

Could you post the TensorBoard logs for the new 24 kHz training run, if you still have them?

s-o-p pronunciation, high-low tone distortion

Hi, thank you for your quick replies and kindness.

While fine-tuning on various datasets, I found that distortion commonly occurs on the o, s, and p sounds, and also when generating sounds pitched higher than the target speaker data. Is there a way to solve these two problems?

Alternatively, it would be very helpful if you could tell me a metric that can detect the presence or absence of distortion.

Thanks very much.


Hello, I was wondering if I could try fine-tuning the pretrained model?
I do not have reasonable hardware to train from scratch, unfortunately.
It looks like I would need the pretrained discriminator, if you are able to release it.
Thank you very much, and feel free to let me know if you've already experimented with this.

SR Process

Hi again,

I am finding that the wavs created by SR preprocess slightly differ in length from the original. Seems to be different randomly up to about 0.01 seconds. Is this a problem?
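
One possible explanation, offered as an assumption rather than a confirmed cause: a mel round trip quantizes the length to whole frames, so the vocoded wav can differ from the original by up to about one hop. The sketch assumes a 22050 Hz vocoder with a hop length of 256 (a common HiFi-GAN setting, not stated in this thread) and a frame count of floor(samples / hop). The function name `roundtrip_length` is hypothetical.

```python
def roundtrip_length(num_samples, hop_length):
    """Samples produced when a vocoder emits hop_length samples per mel frame."""
    num_frames = num_samples // hop_length  # assumed framing: no padding
    return num_frames * hop_length

sample_rate, hop = 22050, 256    # assumed vocoder settings
n = 3 * sample_rate + 100        # a clip of about 3 seconds
lost = n - roundtrip_length(n, hop)
print(lost, lost / sample_rate)  # samples and seconds lost to framing
```

Under these assumptions the mismatch is always below 256 samples, roughly 0.012 seconds, which is in the range described above. A framing scheme that pads the signal could make the output longer instead of shorter.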

And, did you find it is not useful to apply horizontal SR?

Thank you a lot


Hello, in the paper you said: "Our models are trained up to 900k steps on a single NVIDIA 3090 GPU. The batch size is set to 64 with a maximum segment length of 128 frames."

1- How many hours or days did it take you to train these 900k steps, if you can be specific?

2- Is this method data hungry? In the paper you said "Only VCTK corpus is used for training", and VCTK has almost 44 hours of speech from 107 speakers. Could this algorithm be used with, for example, 2 speakers with 4-5 hours of speech, or 5 speakers with 5-6 hours of speech (seen-to-seen of course, or unseen-to-seen), and still give the same quality and similarity as in the paper?

3- Would feeding the algorithm 48 kHz, or at least 22050 Hz, audio instead of 16 kHz make a large difference?

4- You said: "And a HiFi-GAN v1 vocoder is used to transform the modified mel-spectrogram into waveform". Would using a better vocoder than HiFi-GAN v1 (such as the newer GAN-based or DDPM-based ones released in 2022) make a large difference?

SR based augmentation execution


There's a list of command lines. Can they be run in parallel on separate GPUs, or must they be run in this order?

run these if you want to train with SR-based augmentation

CUDA_VISIBLE_DEVICES=1 python --min 68 --max 72
CUDA_VISIBLE_DEVICES=1 python --min 73 --max 76
CUDA_VISIBLE_DEVICES=2 python --min 77 --max 80
CUDA_VISIBLE_DEVICES=2 python --min 81 --max 84
CUDA_VISIBLE_DEVICES=3 python --min 85 --max 88
CUDA_VISIBLE_DEVICES=3 python --min 89 --max 92


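1. About the speaker encoder
The experimental results in the paper conclude that, for timbre similarity, the pretrained d-vector is better than the simply trained speaker encoder (LSTM + linear + mean).
The speaker encoder structure used in the Interspeech 2022 best paper on zero-shot TTS, and the mel-input speaker encoder structure used in transferTTS, are like the -s variant in this paper in that they do not use pretraining. I think a comparison would be worthwhile, since neither of them ran an A/B experiment against a pretrained d-vector.
From the code, the features appear to be compressed to 192 dimensions (please correct me if I am wrong). However, 192 is not much smaller than the 256 of HuBERT base; it is only a large compression relative to 1024. I think this is still not small enough: self-supervised features used for VC do not really need as many as 1024 dimensions. You could try compressing far more aggressively, for example to only 4 dimensions, combined with quantization and so on.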

Training Logs

Hi, wondering if you could post your logs from the model trained to 900k steps? Would love to take a look at the progression of the losses during training etc.

Necessary preprocessing for inference wav data

Hi, thanks for the great work! I'm trying to test inference with my own wav file, but the output quality is lower than I expected, and I suspect the input file is the cause.
Could you give me some instructions on how to preprocess the input source/target wav?

High CPU load on single gpu

I'm wondering if there is a way to optimize the code for single-GPU training.
I'm running a 2080 Ti and the GPU is only at 67% load, but the CPU is at 60%-80% load (128 threads), so my CPU is drawing more power than my GPU when training.
NCCL isn't supported on Windows, so I had to switch the backend to GLOO, but this seems to cause some sort of CPU loop that creates an unusually high CPU load and bottlenecks the GPU.
Is it possible to remove the distributed training?

Training dataset

Hi. If we want to extend the training dataset, are there any publicly available datasets other than VCTK that we can use? Is having two or more speakers speaking the same line of sentence a requirement, as in the VCTK dataset? Thanks.



Question about the scale of gaussian noise in the mel spectrogram

In the proposed SR-augmentation,
when the mel is squeezed, you pad it with the highest frequency bin value and add Gaussian noise.

Can you share the scale of the added gaussian noise?

(I guess it can differ depending on how you preprocess the audio to get the mel.
I am currently using the 'mel_spectrogram_torch' function of

If you used a different code for mel processing, it would be great to know.)
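
To make the operation being discussed concrete, here is a pure-Python sketch of vertical SR as described above: resize the mel along the frequency axis, cut the extra top bins when it is stretched, and pad with the highest-frequency bin plus Gaussian noise when it is squeezed. The linear interpolation and the `noise_scale=0.01` default are placeholders chosen for illustration; the actual noise scale is exactly what this question asks about and is not known here. The function name `vertical_sr` is hypothetical.

```python
import random

def vertical_sr(mel, new_height, noise_scale=0.01, rng=random):
    """mel: list of frequency bins (low to high), each a list of frame values."""
    n_bins, n_frames = len(mel), len(mel[0])
    resized = []
    for j in range(new_height):
        # Position of output bin j on the original frequency axis.
        pos = j * (n_bins - 1) / (new_height - 1) if new_height > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n_bins - 1)
        frac = pos - lo
        resized.append([(1 - frac) * mel[lo][t] + frac * mel[hi][t]
                        for t in range(n_frames)])
    if new_height >= n_bins:
        # Stretched: keep the lowest n_bins bins and cut the rest off the top.
        return resized[:n_bins]
    # Squeezed: pad the top with the highest bin plus Gaussian noise.
    top = resized[-1]
    pad = [[v + rng.gauss(0.0, noise_scale) for v in top]
           for _ in range(n_bins - new_height)]
    return resized + pad

mel = [[float(b)] * 3 for b in range(80)]   # dummy 80-bin mel with 3 frames
squeezed = vertical_sr(mel, 68)             # 80 - 12
stretched = vertical_sr(mel, 92)            # 80 + 12
print(len(squeezed), len(stretched))        # 80 80
```

Both outputs keep the original 80 bins and the original frame count, so the time alignment with the source is unchanged.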

Question for commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen)

Hi, thanks for the great work and prompt feedback.

I am trying to fine-tune the 24k model with a custom dataset.
But when I build 'wav_padded' with commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen), an error occurs.
In the code, if 'ids_slice * 480 + wav_seglen' is longer than the length of 'wav_padded', an error is raised. How can I solve this case?
The error message is as follows.

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/hudson-4way/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/hudson-4way/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/", line 61, in fetch
return self.collate_fn(data)
File "/home/hudson-4way/Voice/FreeVC/", line 198, in call
wav_padded = commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen)
File "/home/hudson-4way/Voice/FreeVC/", line 54, in slice_segments
ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (37440) must match the existing size (36983) at non-singleton dimension 1. Target sizes: [1, 37440]. Tensor sizes: [36983]
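
One possible workaround, sketched in pure Python and not taken from the repository: right-pad the slice whenever the waveform is shorter than the requested window. The function name `slice_with_padding` is hypothetical, and zero padding is an assumption. It may be better to find out why the waveform is shorter than the spectrogram length times 480 in the first place.

```python
def slice_with_padding(wav, start, seg_len, pad_value=0.0):
    """Return wav[start:start + seg_len], right-padded when the source is too short."""
    segment = list(wav[start:start + seg_len])
    return segment + [pad_value] * (seg_len - len(segment))

# Sizes from the traceback: 37440 samples requested, 36983 available.
wav = [0.1] * 36983
segment = slice_with_padding(wav, 0, 37440)
print(len(segment))  # 37440
```

Here 37440 is 78 frames times a hop of 480, so the waveform is 457 samples short of what the spectrogram length implies.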



Using pretrained models shows epoch 1372 -- is this expected?

I've downloaded D-freevc.pth and freevc.pth and renamed them respectively to D_0.pth, and G_0.pth.

As you can see it shows step 253800. Running this time on 1 gpu with 64 batch size.

CUDA_VISIBLE_DEVICES="0" python -c configs/freevc.json -m freevc
INFO:freevc:{'train': {'log_interval': 200, 'eval_interval': 5000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 64, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8960, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 128, 'port': '8001'}, 'data': {'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1280, 'hop_length': 320, 'win_length': 1280, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 1024, 'use_spk': True}, 'model_dir': './logs/freevc'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO:freevc:Loaded checkpoint './logs/freevc/G_0.pth' (iteration 1372)
INFO:freevc:Loaded checkpoint './logs/freevc/D_0.pth' (iteration 1372)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 1372 [89%]
INFO:freevc:[2.6407272815704346, 2.6749014854431152, 10.811471939086914, 19.81775665283203, 2.324024200439453, 253800, 0.00016843613063817603]
INFO:freevc:====> Epoch: 1372
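
The numbers in this log are consistent with a step counter that is recomputed from the checkpoint's epoch on resume, as VITS-style training scripts do. The sketch assumes that rule and about 185 batches per epoch, a value inferred here to make the logged figures agree and not stated in the post. The function name `resumed_global_step` is hypothetical.

```python
def resumed_global_step(epoch, batches_per_epoch, batch_idx=0):
    """Step counter rebuilt from the stored epoch (assumed VITS-style resume)."""
    return (epoch - 1) * batches_per_epoch + batch_idx

step = resumed_global_step(epoch=1372, batches_per_epoch=185, batch_idx=165)
progress = round(100 * 165 / 185)
print(step, progress)  # 253800 89
```

If that assumption holds, the epoch number 1372 comes from the downloaded checkpoint, and the step shown depends on the size of the local dataset rather than on how long the released model was trained.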


Some observations after 690k steps

For unseen F to seen M conversion, the resulting pitch is very close to that of the source speaker, especially if the source pitch is much higher than the seen M pitch.

I used the SR-based data augmentation step.

  1. Unseen F from LibriTTS test-clean 5142_36600_000006_000000.wav
  2. Seen speaker (p326) audio used as a reference audio
  3. Conversion result unseen F to seen M


Question about target speaker reproduction/transfer

About how long into training did the voice conversion effect start to be noticeable for you?

I ask because so far in my experiments training from scratch, audio reproduction for the source speaker seems to be pretty good, but passing a different speaker as the conversion target doesn't change the voice much. It did work better when I fine-tuned your pretrained model on my dataset, so I'm wondering how long I should let new experiments run before I deem them failures.

