olawod / freevc
FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
License: MIT License
In data_utils.py, what is the purpose of the following code?
spec_seglen = spec_lengths[-1] if spec_lengths[-1] < self.hps.train.max_speclen + 1 else self.hps.train.max_speclen + 1
wav_seglen = spec_seglen * self.hps.data.hop_length
spec_padded, ids_slice = commons.rand_spec_segments(spec_padded, spec_lengths, spec_seglen)
wav_padded = commons.slice_segments(wav_padded, ids_slice * self.hps.data.hop_length, wav_seglen)
c_padded = commons.slice_segments(c_padded, ids_slice, spec_seglen)[:,:,:-1]
spec_padded = spec_padded[:,:,:-1]
wav_padded = wav_padded[:,:,:-self.hps.data.hop_length]
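One reading of the snippet above, offered as a sketch rather than the author's explanation: it picks a common segment length for the batch (capped at max_speclen + 1 frames), cuts a random window of that many spectrogram frames, cuts the matching waveform window by multiplying frame indices by hop_length, and finally drops one frame (one hop of audio) from every tensor. The helper below only reproduces the length arithmetic. segment_lengths is a hypothetical name, and the defaults 128 and 320 are the max_speclen and hop_length values from the config dump that appears later in this thread.

```python
def segment_lengths(last_spec_len, max_speclen=128, hop_length=320):
    """Return (spec_frames, wav_samples) kept after the collate-time trimming."""
    # Cap the segment at max_speclen + 1 frames, or use the last length in the
    # batch if that is smaller (same effect as the conditional in the snippet).
    spec_seglen = min(last_spec_len, max_speclen + 1)
    wav_seglen = spec_seglen * hop_length
    # The final slices drop one frame from the spectrogram and content features
    # and one hop of samples from the waveform.
    return spec_seglen - 1, wav_seglen - hop_length


print(segment_lengths(400))  # (128, 40960)
print(segment_lengths(60))   # (59, 18880)
```

Under this reading, every tensor in the batch ends up with at most 128 frames, and the waveform always holds exactly hop_length samples per remaining frame.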
In the proposed SR-augmentation, when the mel is squeezed, you pad it with the highest frequency bin value and add Gaussian noise.
Can you share the scale of the added Gaussian noise?
(I guess it can differ depending on how you preprocess the audio to get the mel.
I am currently using the 'mel_spectrogram_torch' function of https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/mel_processing.py
If you used different code for mel processing, it would be great to know.)
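To make the question concrete, here is a minimal pure-Python sketch of the squeeze-and-pad step as described in the post: resize the mel along the frequency axis, then fill the missing top bins with the highest remaining bin plus Gaussian noise. It is not the repository's implementation. squeeze_mel, the nearest-neighbour resize, and especially noise_std=0.1 are assumptions; the real noise scale is exactly what is being asked.

```python
import random


def squeeze_mel(mel, ratio, noise_std=0.1, seed=0):
    """Vertically squeeze a mel spectrogram and pad it back to its original height.

    mel is a list of n_mels rows ordered from low to high frequency; each row is
    a list of frame values. noise_std is a placeholder, NOT a value from the paper.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("this sketch only handles squeezing (0 < ratio <= 1)")
    rng = random.Random(seed)
    n_mels = len(mel)
    new_bins = max(1, round(n_mels * ratio))
    # Nearest-neighbour resize along the frequency axis.
    resized = [list(mel[min(n_mels - 1, int(i / ratio))]) for i in range(new_bins)]
    # Fill the missing top rows with the highest remaining bin plus Gaussian noise.
    top = resized[-1]
    padding = [[v + rng.gauss(0.0, noise_std) for v in top]
               for _ in range(n_mels - new_bins)]
    return resized + padding


mel = [[float(b)] * 3 for b in range(8)]  # 8 mel bins, 3 frames
out = squeeze_mel(mel, ratio=0.5)
print(len(out), len(out[0]))              # 8 3
```

The appropriate noise_std depends on the mel scaling (log-mel versus linear, normalization), which is why the poster asks which mel code was used.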
Hi, thank you for such creative work on Voice Conversion.
I have 3 questions about your paper, FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion.
It would help a lot if you would answer them.
++ Edit
I have one further question.
Could you possibly share the config file for the HiFi-GAN that vocodes the augmented x_mel into the WavLM input y'?
Running the command CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m freevc
prints the errors below, but the training process continues:
INFO:freevc:Train Epoch: 1 [0%]
INFO:freevc:[6.033028602600098, 4.603592395782471, 0.2524043619632721, 88.27883911132812, 37.568702697753906, 0, 0.0002]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f73359e7430>
Traceback (most recent call last):
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
self._shutdown_workers()
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
if not wait([self.sentinel], timeout):
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 14261) is killed by signal: Aborted.
Hello,
There's a list of command lines. Can they be run in parallel on separate GPUs, or must they be run in this order?
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 68 --max 72
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 73 --max 76
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 77 --max 80
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 81 --max 84
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 85 --max 88
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 89 --max 92
I tried to swap out the speaker encoder for another one. Each utterance is encoded into a vector of shape (512,). I changed "gin_channels": 256 to "gin_channels": 512 in the config freevc.json. Training the model from scratch results in nan values in the log: [nan, nan, nan, nan, nan, 200, 0.0002]. Do I have to change anything else in the code?
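Hello, the demo you provided is very good. I have a question: were the released pretrained models trained only on VCTK data, or was other training data added?
Following your code and your original data processing, I trained on VCTK on a V100 machine with two GPUs for one week, and the results are not as good as your released pretrained models, so I am puzzled.
1. Is VCTK the only training dataset?
2. How long does training take, and on how many GPUs?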
I've downloaded D-freevc.pth and freevc.pth and renamed them to D_0.pth and G_0.pth, respectively.
As you can see, the log shows step 253800. This time I am running on 1 GPU with a batch size of 64.
CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m freevc
INFO:freevc:{'train': {'log_interval': 200, 'eval_interval': 5000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 64, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8960, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'use_sr': True, 'max_speclen': 128, 'port': '8001'}, 'data': {'training_files': 'filelists/train.txt', 'validation_files': 'filelists/val.txt', 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1280, 'hop_length': 320, 'win_length': 1280, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256, 'ssl_dim': 1024, 'use_spk': True}, 'model_dir': './logs/freevc'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/freevc/G_0.pth
INFO:freevc:Loaded checkpoint './logs/freevc/G_0.pth' (iteration 1372)
./logs/freevc/D_0.pth
INFO:freevc:Loaded checkpoint './logs/freevc/D_0.pth' (iteration 1372)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 1372 [89%]
INFO:freevc:[2.6407272815704346, 2.6749014854431152, 10.811471939086914, 19.81775665283203, 2.324024200439453, 253800, 0.00016843613063817603]
INFO:freevc:====> Epoch: 1372
Wondering if there is a way to optimize the code for single-GPU training?
I'm running a 2080 Ti and the GPU is only at 67% load, but the CPU is at 60%-80% load (128 threads), so my CPU is drawing more power than my GPU when training.
NCCL isn't supported on Windows, so I had to switch the backend to Gloo, but this seems to cause some sort of CPU loop, which creates an unusually high CPU load and bottlenecks the GPU.
Is it possible to remove the distributed training?
Is the only thing required retraining WavLM at the respective sampling rates?
If I wanted to double the "resolution" of the model, would I just double all of the model parameters? Or do certain parts need to stay a certain size to maintain the information bottleneck? What would be the ideal parameters for a "higher res" model?
Under train in the config file, what are c_mel and c_kl?
Your downsample.py creates 2 folders, 16 kHz and 22 kHz. Isn't the model only training on the 16 kHz data?
preprocess_sr.py uses the 22k folder as its default, but why, when WavLM is limited to 16 kHz?
Is there a purpose for this, or is it just a default that you forgot to change?
If I specify the 16 kHz folder instead, will this affect preprocessing or training at all?
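On the c_mel and c_kl question: in VITS-style training scripts, which FreeVC's code appears to follow, these are scalar weights applied to the mel-reconstruction loss and the KL loss in the generator objective. The sketch below shows that weighting under this assumption. generator_loss is a hypothetical helper, not a function from the repository, and the defaults 45 and 1.0 are the c_mel and c_kl values from the config dump earlier in this thread.

```python
def generator_loss(loss_gen, loss_fm, loss_mel, loss_kl, c_mel=45.0, c_kl=1.0):
    """Weighted sum of generator loss terms (assumed VITS-style weighting)."""
    # Adversarial and feature-matching terms enter unweighted; the mel and KL
    # terms are scaled by the config values c_mel and c_kl.
    return loss_gen + loss_fm + c_mel * loss_mel + c_kl * loss_kl


print(generator_loss(1.0, 2.0, 0.5, 3.0))  # 28.5
```

With c_mel much larger than c_kl, the reconstruction term dominates the generator objective in this formulation.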
Hi, I am amazed at the potential of this model.
If I train using LibriTTS, should I retrain the HiFi-GAN that FreeVC currently uses, which was pretrained on VCTK?
In your paper, you report WER and CER results of about 4.23% and 1.46%.
Also, you mentioned that you used https://huggingface.co/facebook/hubert-large-ls960-ft as the ASR model.
But, when using the same ASR model on ground truth VCTK utterances, I get WER/CER of about 6.43% and 1.95%.
So I assume our code for measuring WER/CER differs.
Could you share the code for evaluating WER/CER? Or at least a code fragment of it?
Thank you.
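For reference while comparing numbers, here is a minimal, dependency-free sketch of corpus-level WER and CER based on edit distance. It is not the authors' evaluation code. The normalization step (lowercasing, stripping punctuation) and the choice to ignore spaces in CER are assumptions, and differences in exactly these choices are a common reason two evaluations of the same ASR model disagree.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (all edit costs are 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def normalize(text):
    # Assumed normalization: lowercase, keep letters, digits and apostrophes.
    kept = [ch.lower() if ch.isalnum() or ch == "'" else " " for ch in text]
    return " ".join("".join(kept).split())


def error_rate(refs, hyps, unit="word"):
    """Corpus-level WER (unit='word') or CER (unit='char', spaces ignored)."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref), normalize(hyp)
        r_units = r.split() if unit == "word" else list(r.replace(" ", ""))
        h_units = h.split() if unit == "word" else list(h.replace(" ", ""))
        errors += edit_distance(r_units, h_units)
        total += len(r_units)
    return errors / total


refs = ["Please call Stella."]
hyps = ["PLEASE CALL STELA"]
print(error_rate(refs, hyps, "word"))  # 1 word error out of 3 reference words
print(error_rate(refs, hyps, "char"))  # 0.0625
```

Summing errors over the whole corpus before dividing, as done here, generally gives a different figure from averaging per-utterance rates, which is another place where implementations can diverge.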
Hi, thanks for the great work and prompt feedback.
I am trying to fine-tune the 24k model with a custom dataset.
But when I try to build 'wav_padded' using commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen), an error occurs.
In the code, if 'ids_slice * 480 + wav_seglen' is longer than the length of 'wav_padded', an error occurs, so how can I solve this case?
The following is the error message.
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/hudson-4way/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/hudson-4way/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/home/hudson-4way/Voice/FreeVC/data_utils_24.py", line 198, in __call__
wav_padded = commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen)
File "/home/hudson-4way/Voice/FreeVC/commons.py", line 54, in slice_segments
ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (37440) must match the existing size (36983) at non-singleton dimension 1. Target sizes: [1, 37440]. Tensor sizes: [36983]
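The numbers in the traceback fit a waveform that is slightly shorter than frames times hop: the requested segment is 37440 = 78 * 480 samples, while only 36983 samples are available, a shortfall of 457 samples, which is less than one hop. One possible workaround, not confirmed by the author, is to zero-pad each waveform to n_frames * hop_length before slicing (dropping the last spectrogram frame would be an alternative). pad_to_frames is a hypothetical helper that works on plain Python lists; real code would do the same with tensors.

```python
def pad_to_frames(wav, n_frames, hop_length=480):
    """Zero-pad wav so it holds at least n_frames * hop_length samples."""
    needed = n_frames * hop_length
    missing = needed - len(wav)
    if missing > 0:
        wav = list(wav) + [0.0] * missing
    return wav


wav = [0.0] * 36983                      # length reported in the traceback
padded = pad_to_frames(wav, n_frames=78)
print(len(padded))                       # 37440
```

After padding, any start index up to the last valid frame leaves enough samples for a full segment, so the slice sizes match.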
How can I change this to see the voice conversion?
I only see "gt" and "generated", but which is the source speaker and which is the target?
Thank you
Hi @OlaWod, this is a really good project, thanks for sharing. I have two questions about the project.
Hello, I was wondering if I could try fine-tuning the pretrained model?
I do not have reasonable hardware to train from scratch, unfortunately.
It looks like I would need the pretrained discriminator, if you are able to release it.
Thank you very much, and feel free to let me know if you've already experimented with this.
Hello again, I notice that for training, if using the pretrained speaker encoder and SR, the same speaker embedding is used regardless of the SR augmentation.
(edit: also the audio file and spectrogram are the same, regardless of SR)
I just wanted to double check if this is by design?
Thanks so much for helping everyone :)
Hi again,
I am finding that the wavs created by SR preprocess slightly differ in length from the original. Seems to be different randomly up to about 0.01 seconds. Is this a problem?
And, did you find it is not useful to apply horizontal SR?
Thank you a lot
Hi!
Thank you for your research results! Does FreeVC support text-to-speech?
About how long into training did the voice conversion effect start to be noticeable for you?
I ask because so far in my experiments training from scratch, audio reproduction for the source speaker seems to be pretty good, but passing a different speaker as the conversion target doesn't change the voice much. It did work better when I fine-tuned your pretrained model on my dataset, so I'm wondering how long I should let new experiments run before I deem them failures.
How did you arrive at plus 12 and minus 12 mels? Was this just a random choice, or did you test 3, 6, 20, etc. to find the optimal range?
Why did you choose to convert to mels, then resize, and vocode back to wav? Why not just process the audio file and change the pitch?
You mentioned in another post that you had gotten better results with a custom-trained vocoder, but retrained with the standard VCTK for compatibility with other models. Could you share your custom-trained vocoder model?
Thanks,
Steven
Which is the best way to generate 48 kHz speech?
(1) Train for 90k steps using the 16k dataset, then train at 48k using a hop_length of 960
(2) After generating 24k speech, use a super-resolution model to convert it to 48k speech.
I would really appreciate your kind reply. Thanks very much.
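A quick check of the hop lengths mentioned in this thread: 320 at 16 kHz (from the config), 480 at 24 kHz, and the proposed 960 at 48 kHz all give the same 50 frames per second, which matches the 20 ms frame spacing of WavLM features computed on 16 kHz audio. This is only arithmetic and says nothing about which of the two options sounds better; frames_per_second is a hypothetical helper.

```python
def frames_per_second(sampling_rate, hop_length):
    """Number of spectrogram frames produced per second of audio."""
    return sampling_rate / hop_length


settings = {"16 kHz": (16000, 320), "24 kHz": (24000, 480), "48 kHz": (48000, 960)}
for name, (sampling_rate, hop_length) in settings.items():
    print(name, frames_per_second(sampling_rate, hop_length))  # 50.0 each time
```

Keeping the frame rate fixed means one content frame still lines up with one output frame; only the number of waveform samples the decoder must generate per frame changes.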
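Hello, I have read the paper and would like to know roughly how many parameters the model has and how long it takes to run. In other words, how much compute time does converting one second of speech take on a 3090 GPU?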
Hi, thank you for the quick replies and kindness.
While testing fine-tuning on various data, I found that distortion commonly occurs on the o, s, and p sounds, and also when generating sounds pitched higher than those in the target speaker's data. Is there a way to solve these two problems?
Or, it would be very helpful if you could tell me a metric that can detect the presence or absence of distortion.
Thanks very much.
Since you made the changes to data_utils the other day, I have noticed that it seems to be training on much shorter segments.
The eval audios are also all below 1 second long, most of them just 40 frames. I am also noticing some obvious gaps and blurs in the generated and GT mels. You can see the "blurry spots" in these generated mels, where the formants just disappear.
Sorry to bother you. In your paper you mention that the word error rate (WER) and character error rate (CER) between source and converted speech are obtained by an ASR model. Would you be able to open-source the code for the WER and CER calculation?
For unseen F to seen M conversion, the resulting pitch is very close to that of the source speaker, especially if the source pitch is much higher than the seen M pitch.
I've used SR-based data augmentation step.
https://user-images.githubusercontent.com/53978091/211267486-b8551ae6-5f91-450e-8a4d-758b973b2c17.mp4
3. Conversion result unseen F to seen M
Hello, in the paper you said "Our models are trained up to 900k steps on a single NVIDIA 3090 GPU. The batch size is set to 64 with a maximum segment length of 128 frames".
1- How many hours or days did it take you to train these 900k steps, if you can be specific?
2- Is this method data hungry? In the paper you said "Only VCTK corpus is used for training", and VCTK has almost 44 hours of speech from 107 speakers. Can this algorithm be used with, for example, 2 speakers with 4-5 hours of speech, or 5 speakers with 5-6 hours of speech (seen-to-seen, of course, or unseen-to-seen), and give the same quality and similarity as in the paper?
3- Is feeding the algorithm 48 kHz, or at least 22.05 kHz, audio instead of 16 kHz going to make a big difference?
4- You said "And a HiFi-GAN v1 vocoder is used to transform the modified mel-spectrogram into waveform". Is using a better vocoder than HiFi-GAN v1 (such as the GAN-based or DDPM-based ones released in 2022) going to make a big difference?
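1. About the speaker encoder
The experimental results in the paper show that, for timbre similarity, the pretrained d-vector is better than the simple trained speaker encoder (LSTM + linear + mean).
The speaker encoder structure used in the Interspeech 2022 best paper on zero-shot TTS, and the mel-input speaker encoder structure used in TransferTTS (https://arxiv.org/abs/2106.03153), are, like the -s variant in this paper, not pretrained. I think a comparison would be valuable, since neither of them ran an A/B experiment against a pretrained d-vector.
2. About timbre leakage
(1) About spectrogram augmentation
A. Using a vocoder to augment timbre, pitch, and duration can change the timbre, but I worry that it may degrade WavLM's semantic recognition performance. Have you considered an A/B experiment with traditional pitch-shifting and timbre-shifting algorithms?
B. Is it possible that the model itself learns to reconstruct through the pitch and timbre augmentation? That is, it essentially still treats the pitch-shifted training source and the target as the same timbre and automatically learns to rebuild the best form of the source, so timbre still leaks.
(2) About bottleneck dimension compression
From the code, it seems the features are compressed to 192 dimensions (please correct me if I am wrong). But 192 is not much smaller than the 256 of HuBERT base; it is only a large compression relative to 1024. I think this is not compressed enough, and the self-supervised features used for VC do not really need as many as 1024 dimensions. You could try a much more aggressive dimension, for example only 4, together with quantization and so on.
3. About the experimental results
MOS
Unseen-to-unseen beat seen-to-seen, which I find hard to understand.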
Can you create a training model notebook on Google Colab?
I tried to do it, but something is not working.
Hello author,
Will the code be released soon? I am very interested in your work and would like to study the code.
I've tried a single and two gpu run. I have observed training acceleration for two gpu setup. Should there be, though?
I think a better way of extracting speech content might be to use WHAMR! to add artificial reverberation and background noise to the training dataset before WavLM extracts the content. This would be useful because the model could then extract content from audio that is not studio grade. This is the same technique Whisper uses for ASR.
Hello,
After training for 290k (and still going) on 2 GPUs, I have decided to test conversion.
Results so far:
-- Conversion to M seen (but not seen in HifiGan):
So, 4 is pretty good, whilst 1-3 are similar to the source, not the target.
My setup is the same as yours, used Freevc with SR pre training.
Dataset has 110 VCTK speakers in total (including 40 custom speakers).
Is this a question of more training? Or of HiFi-GAN not being trained on the custom speakers?
UPDATE:
-- Conversion to M seen (also seen in HifiGan):
So, it seems like the issue is with the vocoder: it must be trained on the whole dataset beforehand.
As the title says.
Looking at convert.py, I don't see a vocoder being used to get the waveform. Is that correct?
Hi, Thanks for the great work! I'm trying to test the inference part with my own wav file but the output quality is less than I expected and I'm suspecting it's due to the input file.
Could you give me some instruction for how to preprocess the input source/target wav?
There are many options:
LJ_V1
LJ_V2
LJ_V3
And from here: https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y
Which one?
Thank you
I don't have files like these. How do I generate them?
I get a file-not-found exception.
Thanks for amazing work!
I'm fine-tuning your model on a custom dataset.
Could you also share the discriminator of freevc-24 model?
Thanks!
Could you post the TensorBoard logs for the new 24 kHz training run if you still have them?
The model doesn't seem to work well for foreign languages. Any tips on how to make it work equally well for foreign-to-foreign U2S voice conversion?
Hi, wondering if you could post your logs from the model trained to 900k steps? Would love to take a look at the progression of the losses during training etc.
Does it work with any language?
Hi. If we want to extend the training dataset, are there any publicly available datasets other than VCTK that we can use? Is having two or more speakers speaking the same line of sentence a requirement, as in the VCTK dataset? Thanks.