
Comments (2)

OlaWod commented on July 23, 2024
  1. A better speaker encoder structure can bring better results. In our paper, we only wanted to prove that, as long as the extracted content representation is clean enough, the speaker encoder will learn to model the missing speaker information, even with such an extremely simple structure.
  2. 2(1)A. I think as long as the vocoder is good enough, the quality degradation won't be significant. I've never seen anyone do an ablation study on data augmentation methods; they just propose them. So currently I don't have a plan to do this ablation, sorry.
  3. 2(1)B. That's why we compress the bottleneck. With a naive autoencoder we can do waveform reconstruction; if we compress the latent dimension of this autoencoder to a proper size, we can do the VC task.
  4. 2(2). Yes, it's 192. A bottleneck that is too narrow will lose some content information, while one that is too wide will contain some speaker information. If we used a bottleneck dimension of 4, it would lose a lot of content information. Searching for the best bottleneck dimension is troublesome, so we use the SR-based augmentation to help the model learn to discard residual speaker information in the 192-dim bottleneck. As for quantization: at the very beginning of our experiments, we used residual vector quantization after the 192-dim bottleneck and found that it didn't bring any significant improvement, so we removed it.
  5. I think this may be because of the quality of the source speech. Seen sources, which are from VCTK, generally have less clear pronunciation (like p259_464), while unseen sources, which are from LibriTTS, have more background noise (like 5105_28233_000016_000001). From the demo page we can hear that our model can ignore the noise, but the pronunciation, which is also part of the content, remains the same. Also, some unseen sources are much longer (5105_28233_000016_000001 is 21 seconds long); I don't know whether the wav length affects the quality judgement.

from freevc.
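The bottleneck trade-off in point 4 can be illustrated with a linear autoencoder, whose optimal solution is PCA: the narrower the latent dimension, the more information the reconstruction discards. This is a toy sketch on synthetic features, not FreeVC's actual model; all shapes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake "content" features: 200 frames of 192-dim vectors whose
# true underlying rank is 64, standing in for speech representations.
X = rng.standard_normal((200, 64)) @ rng.standard_normal((64, 192))

def linear_bottleneck_error(X, dim):
    """Reconstruction error of the optimal *linear* autoencoder
    with a `dim`-dimensional bottleneck (equivalent to PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:dim].T          # encode: project into the bottleneck
    X_hat = Z @ Vt[:dim]         # decode: project back out
    return float(np.mean((Xc - X_hat) ** 2))

# A 4-dim bottleneck discards far more information than a 64-dim one,
# which here is wide enough to reconstruct the data almost exactly.
err_narrow = linear_bottleneck_error(X, 4)
err_wide = linear_bottleneck_error(X, 64)
```

In the VC setting the same trade-off appears in reverse for speaker identity: a bottleneck wide enough to keep all content tends to also keep speaker information, which is why FreeVC pairs the 192-dim bottleneck with augmentation rather than searching for an exact dimension.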

splinter21 commented on July 23, 2024

An explanation regarding reconstruction:
I once played with a voice changer for a galgame character. The protagonist was a young-girl character, and while doing data augmentation in Audition I found that after lowering the pitch by 2 keys, the result sounded surprisingly like the voice actor's real recording, while the galgame character's timbre instead sounded like the pitch-shifted version. By judging which voice sounded more naturally human, I reverse-engineered the author's process: record first, then raise the pitch by 2 keys to produce the galgame character's voice. That is a reconstruction.
A 0-key recording, after ±key augmentation, is certainly less natural and realistic than the original 0-key recording. If my human ear can reverse the transform, I believe the model can also learn to reconstruct it back to 0 key, so the source timbre we worried about at inference time is restored and leaks again.

from freevc.
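The ±key transform described above can be sketched as a naive resample-based pitch shift: playing audio back at a different rate scales all frequencies (and changes duration). This is only an illustration of the transform being discussed, not the augmentation code used by FreeVC or by the commenter's tool; the sample rate and test tone are arbitrary.

```python
import numpy as np

SR = 16000  # sample rate in Hz (illustrative choice)

def shift_semitones(wav, n):
    """Shift pitch by `n` semitones via naive resampling.

    Reading samples at a rate scaled by 2**(n/12) compresses or
    stretches the waveform in time, scaling every frequency by the
    same factor (12 semitones = 1 octave). Duration changes too,
    which is one source of the unnaturalness the comment mentions.
    """
    factor = 2.0 ** (n / 12.0)
    old_idx = np.arange(len(wav))
    new_idx = np.arange(0, len(wav), factor)
    return np.interp(new_idx, old_idx, wav)

def dominant_hz(wav):
    """Frequency of the strongest FFT bin, for checking the shift."""
    spec = np.abs(np.fft.rfft(wav))
    return float(np.fft.rfftfreq(len(wav), 1 / SR)[np.argmax(spec)])

t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 220.0 * t)  # 1 s of a 220 Hz (A3) test tone
up2 = shift_semitones(tone, 2)        # "+2 key": about 246.9 Hz
```

A model trained to undo such shifts would, as the comment argues, be reconstructing the original 0-key signal, and with it the source speaker's timbre.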
