olawod / freevc

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

License: MIT License

Python 100.00%

pytorch speech voice-conversion

freevc's Issues

Question about data_utils.py

In data_utils.py, what is the purpose of the following code?

    spec_seglen = spec_lengths[-1] if spec_lengths[-1] < self.hps.train.max_speclen + 1 else self.hps.train.max_speclen + 1
    wav_seglen = spec_seglen * self.hps.data.hop_length 

    spec_padded, ids_slice = commons.rand_spec_segments(spec_padded, spec_lengths, spec_seglen)
    wav_padded = commons.slice_segments(wav_padded, ids_slice * self.hps.data.hop_length, wav_seglen)
    
    c_padded = commons.slice_segments(c_padded, ids_slice, spec_seglen)[:,:,:-1]

    spec_padded = spec_padded[:,:,:-1]
    wav_padded = wav_padded[:,:,:-self.hps.data.hop_length]
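One way to read the snippet, assuming the batch is sorted by decreasing length so that spec_lengths[-1] is the shortest item: it picks a random window of at most max_speclen + 1 spectrogram frames, cuts the waveform and content features at the matching positions (frame index times hop_length), and then drops the final frame and the final hop of samples. The single-utterance sketch below uses plain lists; slice_aligned and its arguments are illustrative names, not functions from the repository.

```python
import random


def slice_aligned(spec, wav, max_speclen, hop_length, rng):
    """Cut matching random windows from a frame list and a sample list.

    spec: list of spectrogram frames.
    wav:  list of samples, with len(wav) >= len(spec) * hop_length.
    """
    # Window length in frames: one more than max_speclen, unless the clip is shorter.
    seglen = min(len(spec), max_speclen + 1)
    # Random start frame such that the window fits inside the clip.
    start = rng.randint(0, len(spec) - seglen)
    spec_seg = spec[start:start + seglen]
    # The waveform window starts at the same point in time: frame index * hop_length.
    wav_seg = wav[start * hop_length:(start + seglen) * hop_length]
    # Drop the last frame and the last hop of samples, as the snippet does.
    return spec_seg[:-1], wav_seg[:-hop_length]


spec = list(range(200))           # frame i is represented by its index
wav = list(range(200 * 320))      # sample j is represented by its index
spec_seg, wav_seg = slice_aligned(spec, wav, 128, 320, random.Random(0))
print(len(spec_seg), len(wav_seg))
```

With max_speclen = 128 and hop_length = 320 (the values in the released config), the result is 128 frames and 128 * 320 samples that cover the same stretch of audio.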

Question about the scale of gaussian noise in the mel spectrogram

In the proposed SR augmentation, when the mel is squeezed, you pad it with the highest frequency bin value and add Gaussian noise.

Can you share the scale of the added Gaussian noise?

(I guess it can differ depending on how you preprocess the audio to get the mel.
I am currently using the 'mel_spectrogram_torch' function of https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/mel_processing.py

If you used different code for mel processing, it would be great to know.)
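For reference, the vertical resize described in the question can be sketched as follows. This is not the authors' code: the nearest-neighbour resampling, the function name, and the noise_scale parameter are all assumptions, and the actual noise scale is exactly what the question asks for.

```python
import random


def resize_mel_vertically(mel, new_height, noise_scale=0.0, seed=0):
    """Resize a mel along the frequency axis and restore the original bin count.

    mel: list of n_bins rows (low to high frequency), each a list of frame values.
    noise_scale: hypothetical standard deviation of the Gaussian noise on padded bins.
    """
    rng = random.Random(seed)
    n_bins = len(mel)
    # Nearest-neighbour resampling along frequency (a simplification).
    resized = [list(mel[min(n_bins - 1, int(i * n_bins / new_height))])
               for i in range(new_height)]
    if new_height >= n_bins:
        # Stretched: cut the redundant high-frequency bins.
        return resized[:n_bins]
    # Squeezed: pad back to n_bins with the highest bin plus Gaussian noise.
    top = resized[-1]
    padding = [[v + rng.gauss(0.0, noise_scale) for v in top]
               for _ in range(n_bins - new_height)]
    return resized + padding


mel = [[float(b)] * 3 for b in range(80)]   # toy mel: 80 bins, 3 frames
squeezed = resize_mel_vertically(mel, 68, noise_scale=0.1)
stretched = resize_mel_vertically(mel, 92)
print(len(squeezed), len(stretched))
```

Both calls return 80 bins again, so the augmented mel can be passed to a vocoder that expects the original mel shape.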

I have a few questions about the paper.

Hi, thank you for such creative work on Voice Conversion.

I have 3 questions about your paper, FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion.
It would help a lot if you would answer them.

1. Did you freeze the WavLM module during training? (Also, did you use pre-trained weights for this module?)
2. Can I get some architectural information about the bottleneck extractor? Is it just two fully connected layers that map the x_ssl features into d and then 2d dimensions?
3. During training, applying the SR augmentation horizontally would misalign the source mel and the target wav. So I assume that you only used the SR augmentation vertically during training. Is this true?

++ Edit
I have one further question.
Could you possibly share the config file for the HiFi-GAN that vocodes the augmented x_mel into the WavLM input y'?

Bug with num_workers=8

Running the command CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m freevc prints the exception below.

The training process nevertheless continues.

INFO:freevc:Train Epoch: 1 [0%]
INFO:freevc:[6.033028602600098, 4.603592395782471, 0.2524043619632721, 88.27883911132812, 37.568702697753906, 0, 0.0002]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f73359e7430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 14261) is killed by signal: Aborted.

SR based augmentation execution

Hello,

There's a list of command lines -- can they be run in parallel on separate GPUs, or must they be run in this order?

run these if you want to train with SR-based augmentation

CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 68 --max 72
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 73 --max 76
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 77 --max 80
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 81 --max 84
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 85 --max 88
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 89 --max 92

Changing the dimension of speaker embeddings

I tried to swap out the speaker encoder with another one. Each utterance will be encoded into a vector of shape (512,). I tried to change "gin_channels": 256 into "gin_channels": 512 in the config of freevc.json. Training the model from scratch results in nan values in the log: [nan, nan, nan, nan, nan, 200, 0.0002]. Do I have to change anything else in the code?

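Question about the training data

Hello, the demo you provided is very good. I have a question: were the released pretrained models trained on VCTK data only, or was other training data added?
Following your code and its original data processing, I preprocessed VCTK and trained on a V100 machine with two GPUs for a week, but the results are not as good as your public pretrained models, so I am puzzled.

1. Was VCTK the only training dataset?
2. How long does training take, and on how many GPUs?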

Using pretrained models shows epoch 1372 -- is this expected?

I've downloaded D-freevc.pth and freevc.pth and renamed them to D_0.pth and G_0.pth, respectively.

As you can see, the log shows step 253800. This time I am running on 1 GPU with a batch size of 64.

CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m freevc
INFO:freevc:{'train': {'log_interval': 200, 'eval_interval': 5000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 64, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8960, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 128, 'port': '8001'}, 'data': {'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1280, 'hop_length': 320, 'win_length': 1280, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 1024, 'use_spk': True}, 'model_dir': './logs/freevc'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/freevc/G_0.pth
INFO:freevc:Loaded checkpoint './logs/freevc/G_0.pth' (iteration 1372)
./logs/freevc/D_0.pth
INFO:freevc:Loaded checkpoint './logs/freevc/D_0.pth' (iteration 1372)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 1372 [89%]
INFO:freevc:[2.6407272815704346, 2.6749014854431152, 10.811471939086914, 19.81775665283203, 2.324024200439453, 253800, 0.00016843613063817603]
INFO:freevc:====> Epoch: 1372
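The log is consistent with a VITS-style training loop in which the checkpoint stores the epoch number (reported as "iteration") and the step counter is rebuilt from it. That reading is an assumption about the training script, not something the log states. Under it, and assuming one epoch has 185 batches (a value inferred from the numbers in the log), step 253800 falls 89% of the way through epoch 1372:

```python
def resume_position(global_step, steps_per_epoch):
    """Map a global step to (epoch, batch index, percent of epoch done).

    Assumes epochs are numbered from 1 and the step counter advances once per batch.
    """
    epoch = global_step // steps_per_epoch + 1
    batch_idx = global_step % steps_per_epoch
    percent = 100.0 * batch_idx / steps_per_epoch
    return epoch, batch_idx, percent


# 185 batches per epoch is an inferred value, not one printed in the log.
epoch, batch_idx, percent = resume_position(253800, 185)
print(epoch, batch_idx, round(percent))
```

Under these assumptions the epoch number in the log reflects where the released checkpoint stopped, not how long the current run has been training.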

High CPU load on a single GPU

I am wondering if there is a way to optimize the code for single-GPU training.
I'm running a 2080 Ti and the GPU is only at 67% load, but the CPU is at 60%-80% load (128 threads), so my CPU is drawing more power than my GPU when training.
NCCL isn't supported on Windows, so I had to switch the backend to Gloo, but this seems to cause some sort of CPU loop that creates an unusually high CPU load and bottlenecks the GPU.
Is it possible to remove the distributed training?

How to make this model work for 24 kHz/48 kHz sampling rates?

Is retraining WavLM at the respective sampling rate the only thing required?
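As background for this question: WavLM emits one feature frame every 20 ms, which is 320 samples at 16 kHz and matches hop_length 320 in the released 16 kHz config. If the content frames stay at that rate, the output hop length has to scale with the output sampling rate, and in a HiFi-GAN-style decoder the product of upsample_rates has to equal that hop length. The sketch below only does this arithmetic; the helper names are illustrative, and it assumes WavLM still receives 16 kHz input.

```python
from math import prod


def hop_length_for(sampling_rate, frame_rate=50):
    """Samples per content frame when features arrive at frame_rate frames per second."""
    if sampling_rate % frame_rate != 0:
        raise ValueError("sampling rate must be a multiple of the frame rate")
    return sampling_rate // frame_rate


def upsampling_matches(upsample_rates, hop_length):
    """A HiFi-GAN-style decoder emits prod(upsample_rates) samples per input frame."""
    return prod(upsample_rates) == hop_length


for sr in (16000, 24000, 48000):
    print(sr, hop_length_for(sr))

# The 16 kHz config uses upsample_rates [10, 8, 2, 2], whose product is 320.
print(upsampling_matches([10, 8, 2, 2], hop_length_for(16000)))
```

The resulting hop lengths of 480 and 960 agree with the values other issues on this page mention for 24 kHz and 48 kHz.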

Any update on the code release?

Model Resolution

If I wanted to double the "resolution" of the model, would I just double all of the model parameters? Or do certain parts need to stay a certain size to maintain the information bottleneck? What would be the ideal parameters for a "higher res" model?

Under train in the config file, what are c_mel and c_kl?
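In VITS-style training code, which this repository appears to build on, c_mel and c_kl are typically the weights applied to the mel-spectrogram L1 loss and the KL-divergence loss before they are added to the generator objective. A minimal sketch of that weighting, using the values from the released config (c_mel = 45, c_kl = 1.0); the function and argument names are illustrative, not taken from train.py:

```python
def generator_loss(loss_adv, loss_fm, mel_l1, kl, c_mel=45, c_kl=1.0):
    """Combine generator loss terms with VITS-style weights.

    loss_adv: adversarial loss, loss_fm: feature-matching loss,
    mel_l1:   unweighted L1 distance between real and generated mels,
    kl:       unweighted KL divergence between posterior and prior.
    """
    return loss_adv + loss_fm + c_mel * mel_l1 + c_kl * kl


# Toy numbers: the mel term dominates because of its large weight.
print(generator_loss(loss_adv=2.0, loss_fm=10.0, mel_l1=0.5, kl=2.0))
```

Raising c_mel pushes the generator toward spectral accuracy, while c_kl controls how strongly the posterior is pulled toward the prior.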

Preprocess question

Your downsample.py creates 2 folders, 16 kHz and 22 kHz. Isn't the model only trained on the 16 kHz data?

preprocess_sr.py uses the 22 kHz folder as its default, but why, when WavLM is limited to 16 kHz?

Is there a purpose for this, or is it just a default that you forgot to change?
If I specify the 16 kHz folder instead, will this affect preprocessing or training at all?

Training using LibriTTS

Hi, I am amazed at the potential of this model.
If I train using LibriTTS, should I retrain the HiFi-GAN vocoder that FreeVC currently uses, which was pretrained on VCTK?

I have a question about your WER/CER results in the paper.

In your paper, you report WER and CER results of about 4.23% and 1.46%.
Also, you mentioned that you used https://huggingface.co/facebook/hubert-large-ls960-ft as the ASR model.

But, when using the same ASR model on ground truth VCTK utterances, I get WER/CER of about 6.43% and 1.95%.
So I assume our code for measuring WER/CER is different.

Could you share the code for evaluating WER/CER? Or at least a fragment of it?
Thank you.
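For comparison, a common way to compute WER and CER is the edit distance between reference and hypothesis, divided by the reference length. The sketch below is not the authors' evaluation code; the normalization step (lowercasing and stripping punctuation) is an assumption, and differences in exactly this kind of step are one plausible source of mismatched numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def normalize(text):
    """Lowercase, drop punctuation except apostrophes, collapse whitespace."""
    kept = "".join(c for c in text.lower()
                   if c.isalnum() or c.isspace() or c == "'")
    return " ".join(kept.split())


def wer(reference, hypothesis):
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


print(wer("Please call Stella.", "please call stella"))
print(wer("a b c d", "a x c"))
```

Whether rates are averaged per utterance or pooled over the whole test set also changes the final figure, so it is worth stating which one is used when comparing results.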

Question for commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen)

Hi, thanks for the great work and prompt feedback.

I am trying to fine-tune the 24k model with a custom dataset.
But when I try to build 'wav_padded' using commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen), an error occurs.
In the code, if 'ids_slice * 480 + wav_seglen' is longer than the length of 'wav_padded', an error occurs, so how can I solve this case?
The error message follows.

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/hudson-4way/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/hudson-4way/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/home/hudson-4way/Voice/FreeVC/data_utils_24.py", line 198, in __call__
    wav_padded = commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen)
  File "/home/hudson-4way/Voice/FreeVC/commons.py", line 54, in slice_segments
    ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (37440) must match the existing size (36983) at non-singleton dimension 1. Target sizes: [1, 37440]. Tensor sizes: [36983]
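The numbers in the message suggest the requested window is 78 frames * 480 samples = 37440 samples, while only 36983 samples were available from the start index onward. One defensive option, sketched here on plain lists, is to pad (or trim) each waveform to exactly n_frames * hop_length samples before slicing. The helper name is hypothetical, and whether zero-padding is appropriate for this dataset is an assumption; the underlying cause may be a mismatch between how the frame count and the waveform length were computed.

```python
def pad_wav_to_frames(wav, n_frames, hop_length, pad_value=0.0):
    """Return wav padded or trimmed to exactly n_frames * hop_length samples."""
    target = n_frames * hop_length
    if len(wav) >= target:
        return wav[:target]
    return wav + [pad_value] * (target - len(wav))


wav = [0.1] * 36983                      # length reported in the error message
fixed = pad_wav_to_frames(wav, 78, 480)  # 78 frames at hop 480
print(len(fixed))
```

With the waveform length tied to the frame count this way, any slice of whole frames stays inside the padded signal.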

In TensorBoard, validation does not show audio of the source and target speakers

How can I change this to hear the voice conversion?
I only see "gt" and "generated", but where is the source speaker, and who is the target?

Thank you

Improving Target speaker conversion

Hi @OlaWod, this is a really good project, thanks for sharing. I have 2 questions about the project:

1. Is it possible to improve the voice conversion by providing multiple audio samples of the unknown target speaker at inference time?
2. Is it possible to create a feedback loop after a voice conversion to improve the model, by allowing users to vote on whether the output sounds realistic and like the target speaker?

Fine-tuning

Hello, I was wondering if I could try fine-tuning the pretrained model?
I do not have reasonable hardware to train from scratch, unfortunately.
It looks like I would need the pretrained discriminator, if you are able to release it.
Thank you very much, and feel free to let me know if you've already experimented with this.

About Speaker Embeddings

Hello again, I notice that for training, if using the pretrained speaker encoder and SR, the same speaker embedding is used regardless of the SR augmentation.
(edit: also the audio file and spectrogram are the same, regardless of SR)
I just wanted to double check if this is by design?

Thanks so much for helping everyone :)

SR Process

Hi again,

I am finding that the wavs created by the SR preprocessing differ slightly in length from the originals. The difference seems to be random, up to about 0.01 seconds. Is this a problem?

And, did you find it is not useful to apply horizontal SR?

Thank you a lot

About convert

Hi!
Thank you for your research results! Does FreeVC support text-to-speech?

Question about target speaker reproduction/transfer

About how long into training did the voice conversion effect start to be noticeable for you?

I ask because so far in my experiments training from scratch, audio reproduction for the source speaker seems to be pretty good, but passing a different speaker as the conversion target doesn't change the voice much. It did work better when I fine-tuned your pretrained model on my dataset, so I'm wondering how long I should let new experiments run before I deem them failures.

What is the difference between using the speaker encoder and not using it (freevc vs. freevc-s)?

SR questions

How did you arrive at plus 12 and minus 12 mels? Was this just a random choice, or did you test 3, 6, 20, etc. to find the optimal range?

Why did you choose to convert to mels, then resize, and vocode back to wav? Why not just process the audio file and change the pitch?

You mentioned in another post that you got better results with a custom-trained vocoder, but retrained with the standard VCTK for compatibility with other models. Could you share your custom-trained vocoder model?

Thanks,
Steven

How many steps should I wait until it starts speaking?

48k voice conversion

Which is the better way to generate 48 kHz speech?
(1) Train 90k steps using the 16k dataset, then train at 48k (using a hop_length of 960).
(2) After generating 24k speech, use a super-resolution model to convert it to 48k speech.

I would really appreciate your reply. Thank you very much.

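A question out of curiosity

Hello, I have read the paper and would like to know roughly how many parameters the model has and how long it takes to run. In other words, how much compute time does converting one second of speech take on a 3090 GPU?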

s-o-p pronunciation, high-low tone distortion

Hi, thank you for your quick replies and kindness.

While testing fine-tuning on various data, I found that distortion commonly occurs in the pronunciation of o, s, and p, and that distortion also occurs when generating sounds pitched higher than those in the target speaker's data. Is there a way to solve these two problems?

Alternatively, it would be very helpful if you could tell me a metric that can check for the presence or absence of distortion.

Thank you very much.

data_utils.py changes

Since you changed data_utils.py the other day, I have noticed that it seems to be training on much shorter segments.

The eval audios are also all below 1 second long, most of them just 40 frames. I am also noticing some obvious gaps/blurs in the generated/GT mels. You can see the "blurry spots" in these generated mels, where the formants just disappear.

About the WER and CER

Sorry to bother you. In your paper you mention that the word error rate (WER) and character error rate (CER) between source and converted speech are obtained by an ASR model. Would you be able to open-source the code for the WER and CER calculation?

Some observations after 690k steps

For unseen F to seen M conversion, the resulting pitch is very close to the source speaker's, especially if the source pitch is much higher than the seen M pitch.

I've used the SR-based data augmentation step.

1. Unseen F from LibriTTS test-clean: 5142_36600_000006_000000.wav
2. Seen speaker (p326) audio used as the reference audio:

https://user-images.githubusercontent.com/53978091/211267486-b8551ae6-5f91-450e-8a4d-758b973b2c17.mp4
3. Conversion result, unseen F to seen M:

p326_converted.mp4

Questions

Hello, in the paper you said "Our models are trained up to 900k steps on a single NVIDIA 3090 GPU. The batch size is set to 64 with a maximum segment length of 128 frames".

1- How many hours/days did it take you to train these 900k steps, if you can be specific?

2- Is this method data hungry? In the paper you said "Only VCTK corpus is used for training", and VCTK has almost 44 hours of speech from 107 speakers. Can this algorithm be used with, for example, 2 speakers with 4-5 hours of speech or 5 speakers with 5-6 hours of speech (seen to seen, of course, or unseen to seen) and give the same quality and similarity as in the paper?

3- Is feeding the algorithm 48 kHz or at least 22.05 kHz audio instead of 16 kHz going to make a big difference?

4- You said "And a HiFi-GAN v1 vocoder is used to transform the modified mel-spectrogram into waveform".
Is using a better vocoder than HiFi-GAN v1 (such as the newer GAN-based or DDPM-based ones released in 2022) going to make a big difference?

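I read your paper and have a few questions

1. About the speaker encoder
The experimental results in the paper show that, for timbre similarity, the pretrained d-vector beats the simply trained speaker encoder (LSTM + linear + mean).
The speaker encoder used in the Interspeech 2022 best paper on zero-shot TTS, and the mel-input speaker encoder used in transferTTS (https://arxiv.org/abs/2106.03153), are, like the -s variant in this paper, not pretrained. I think a comparison would be worthwhile, since neither of them ran an A/B experiment against a pretrained d-vector.
2. About timbre leakage
(1) About spectrogram augmentation
A. Using a vocoder to augment timbre, pitch, and duration does change the timbre, but I worry that it may degrade WavLM's ability to recognize the semantic content. Have you considered an A/B experiment with traditional pitch-shifting and timbre-changing algorithms?
B. Is it possible that the model itself also learns to reconstruct the pitch- and timbre-shifted augmentation? (That is, the augmented training source and the target are in essence still treated as the same timbre, and the model automatically learns to reconstruct the best form of the source, so the timbre still leaks.)
(2) About the bottleneck dimension reduction
From the code, it seems the features are compressed to 192 dimensions (please correct me if I am wrong). However, the 256 of HuBERT base is not much larger than 192; 192 is only a large reduction relative to 1024. I think this compression is still not small enough: the self-supervised features used for VC do not actually need as many as 1024 dimensions. You could try a much more aggressive reduction, for example to only 4 dimensions, combined with quantization and so on.
3. About the experimental results

MOS
Unseen-to-unseen beat seen-to-seen, which I find a bit puzzling.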

notebook Google Colab

Can you create a training model notebook on Google Colab?
I tried to do it, but something is not working.

About the code

Hello,
Will the code be released soon? I am very interested in your work and would like to study the code.

Does single-node multi-GPU accelerate training?

I've tried a single-GPU and a two-GPU run. I have observed training acceleration for the two-GPU setup. Should there be, though?

WHAMR! artificial reverberation

I think a better way of extracting speech content might be to use WHAMR! to add artificial reverberation along with background noise to the training dataset before WavLM extracts the content. This would be useful because the model could then extract content from audio that is not studio grade. This is the same technique whisper AI uses for ASR

https://wham.whisper.ai/

Some observations after 290k steps

Hello,

After training for 290k (and still going) on 2 GPUs, I have decided to test conversion.

Results so far:

-- Conversion to M seen (but not seen in HifiGan):

1. F seen to M seen -- very similar to the src F seen
2. M seen to M seen -- very similar to the src M seen
3. F unseen to M seen -- very similar to the src F seen
4. M unseen to M seen -- pretty similar to the tgt M seen

So, 4 is pretty good, whilst 1-3 are similar to the source, not the target.

My setup is the same as yours: FreeVC with SR preprocessing.

The dataset has 110 VCTK speakers in total (including 40 custom speakers).

Is this a matter of more training? Or of HiFi-GAN not being trained on the custom speakers?

UPDATE:
-- Conversion to M seen (also seen in HifiGan):

F unseen to M seen -- pretty similar to the tgt M seen
M unseen to M seen -- pretty similar to the tgt M seen

So, it seems like the issue is with the vocoder -- it must be trained on the whole dataset beforehand.

How does this approach compare with the HuBERT + VITS approach, and what are the pros and cons? Has anyone tried it? (how is FreeVC compared to HubertFeat + vits)

As in the title.

Vocoder isn't used in inference?

Looking at convert.py, I don't see a vocoder being used to get the waveform. Is that correct?

Necessary preprocessing for inference wav data

Hi, thanks for the great work! I'm trying to test the inference part with my own wav file, but the output quality is lower than I expected, and I suspect it's due to the input file.
Could you give me some instructions on how to preprocess the input source/target wav?