
Comments (2)

OlaWod commented on July 23, 2024
  1. A better speaker encoder structure can bring better results. In our paper, we only wanted to prove that, as long as the extracted content representation is clean enough, the speaker encoder will learn to model the missing speaker information, even with such an extremely simple structure.
  2. 2(1)A. I think as long as the vocoder is good enough, the quality degradation won't be significant. I've never seen anyone do an ablation study on data augmentation methods; they just propose them. So currently I don't have a plan to do this ablation, sorry.
  3. 2(1)B. That's why we compress the bottleneck. With a naive autoencoder we can do waveform reconstruction; if we compress the latent dimension of this autoencoder to a proper size, we can do the VC task.
  4. 2(2). Yes, it's 192. A bottleneck that is too narrow will lose some content information, while one that is too wide will contain some speaker information. If we used a bottleneck dimension of 4, it would lose a lot of content information. Searching for the best bottleneck dimension is troublesome, so we use the SR-based augmentation to help the model learn to discard residual speaker information in the 192-dim bottleneck. As for quantization: at the very beginning of our experiments, we used residual vector quantization after the 192-dim bottleneck and found that it didn't bring any significant improvement, so we removed it.
  5. I think this may be because of the quality of the source speech. Seen sources, which are from VCTK, generally have less clear pronunciation (like p259_464), while unseen sources, which are from LibriTTS, have more background noise (like 5105_28233_000016_000001). From the demo page we can hear that our model can ignore the noise, but the pronunciation, which is also part of the content, remains the same. Also, some unseen sources are much longer (5105_28233_000016_000001 is 21 seconds long); I don't know whether the wav length affects the quality judgement.

from freevc.
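The bottleneck trade-off in point 4 can be illustrated with a linear autoencoder, whose optimal solution is PCA: the narrower the latent dimension, the more information the reconstruction discards. This is a toy sketch on synthetic features, not FreeVC's actual model; all shapes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake "content" features: 200 frames of 192-dim vectors whose
# true underlying rank is 64, standing in for speech representations.
X = rng.standard_normal((200, 64)) @ rng.standard_normal((64, 192))

def linear_bottleneck_error(X, dim):
    """Reconstruction error of the optimal *linear* autoencoder
    with a `dim`-dimensional bottleneck (equivalent to PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:dim].T          # encode: project into the bottleneck
    X_hat = Z @ Vt[:dim]         # decode: project back out
    return float(np.mean((Xc - X_hat) ** 2))

# A 4-dim bottleneck discards far more information than a 64-dim one,
# which here is wide enough to reconstruct the data almost exactly.
err_narrow = linear_bottleneck_error(X, 4)
err_wide = linear_bottleneck_error(X, 64)
```

In the VC setting the same trade-off appears in reverse for speaker identity: a bottleneck wide enough to keep all content tends to also keep speaker information, which is why FreeVC pairs the 192-dim bottleneck with augmentation rather than searching for an exact dimension.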

splinter21 commented on July 23, 2024

An explanation regarding reconstruction:
I once played with a voice changer for a galgame character. The protagonist was a young-girl character, and while doing data augmentation in Audition I found that after lowering the pitch by 2 keys, the result sounded surprisingly like the voice actor's real recording, while the galgame character's timbre instead sounded like the pitch-shifted version. By judging which voice sounded more naturally human, I reverse-engineered the author's process: record first, then raise the pitch by 2 keys to produce the galgame character's voice. That is a reconstruction.
A 0-key recording, after ±key augmentation, is certainly less natural and realistic than the original 0-key recording. If my human ear can reverse the transform, I believe the model can also learn to reconstruct it back to 0 key, so the source timbre we worried about at inference time is restored and leaks again.

from freevc.
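The ±key transform described above can be sketched as a naive resample-based pitch shift: playing audio back at a different rate scales all frequencies (and changes duration). This is only an illustration of the transform being discussed, not the augmentation code used by FreeVC or by the commenter's tool; the sample rate and test tone are arbitrary.

```python
import numpy as np

SR = 16000  # sample rate in Hz (illustrative choice)

def shift_semitones(wav, n):
    """Shift pitch by `n` semitones via naive resampling.

    Reading samples at a rate scaled by 2**(n/12) compresses or
    stretches the waveform in time, scaling every frequency by the
    same factor (12 semitones = 1 octave). Duration changes too,
    which is one source of the unnaturalness the comment mentions.
    """
    factor = 2.0 ** (n / 12.0)
    old_idx = np.arange(len(wav))
    new_idx = np.arange(0, len(wav), factor)
    return np.interp(new_idx, old_idx, wav)

def dominant_hz(wav):
    """Frequency of the strongest FFT bin, for checking the shift."""
    spec = np.abs(np.fft.rfft(wav))
    return float(np.fft.rfftfreq(len(wav), 1 / SR)[np.argmax(spec)])

t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 220.0 * t)  # 1 s of a 220 Hz (A3) test tone
up2 = shift_semitones(tone, 2)        # "+2 key": about 246.9 Hz
```

A model trained to undo such shifts would, as the comment argues, be reconstructing the original 0-key signal, and with it the source speaker's timbre.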
