Comments (32)
The VC replicates the voice, i.e., the shape of the vocal tract, to create the same formants as the target.
Accent has nothing to do with the physical properties of the voice or the speaker; accent or dialect is simply pronunciation, which makes it content.
So having separate content models for each accent would work; the downsides are the many models needed, and the system then only working with a single language.
The way it works currently lets the system work with any language; it can also handle "vocal grunts" and some singing. None of this would be possible with something that attempts to replicate a target's accent. You would also need a much longer sample file from the target to get something like that to work: a 5-second clip wouldn't cut it, you would need several minutes if not hours of the target speaker to get the correct pronunciations.
You also start to lose control of the output the more data you try to pull from the target. If the system relies on the target for pronunciation and accent, a side effect is that you can no longer control the output emotion, because that now comes from the target and not the source. So you can't make the output sad or happy unless you also have sad or happy recordings of the target speaker.
Any VC will still require you to do some acting, even if you implemented the system I mentioned and were OK with all of those limits: you still need to mimic the target's speech patterns and cadence. You might as well use a system that is more flexible and mimic the accent as well.
This is why you see other (commercial) VC systems also implement a text system, requiring a script to go with the recordings. Those systems are fine-tuned to a single target; they can't do many-to-many. They have a separate model you load for each target, and that model was heavily trained on hours' worth of data from that single target speaker.
What you're wanting will never work with a many-to-many model, or with just a short sample from the target.
from freevc.
fact that target conversions retain the accent of the source speaker, when ideally it'd work the other way around.
This is to be expected as accents are not part of voice, and no VC will ever be able to change that.
accent is part of content.
What you are talking about is content transfer in this case. So if you have an American source speaker and a British target, you would need several content models, one for each accent. Then when you say "tomato" in American English, the system would recognize that, pull the word "tomato" from the British English content model, and use that content embedding before performing the VC.
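A minimal sketch of that pipeline, purely hypothetical (asr_model, content_models, vc_model and their methods are made-up names; nothing like this exists in FreeVC):

def accent_transfer_vc(source_wav, target_wav, vc_model, asr_model,
                       content_models, target_accent="british"):
    # 1. Recognize the spoken units (words or phones) in the source audio.
    words = asr_model.transcribe(source_wav)        # e.g. ["tomato", ...]
    # 2. Instead of extracting content from the source audio, look each
    #    unit up in the content model of the *target* accent.
    content = [content_models[target_accent].embed(w) for w in words]
    # 3. Run the usual voice conversion with the swapped-in content.
    return vc_model.convert(content=content, reference=target_wav)

Note that this is exactly where the flexibility described above is lost: the lookup is per-accent and per-language.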
from freevc.
For large pitch differences, it's best to add F0 analysis to the VC. This also helps maintain pitch variance with natural speech vs. "read text".
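A minimal sketch of the analysis side of that, assuming librosa's pYIN for F0 extraction (how the resulting track is fed to the model is a separate design choice):

import librosa
import numpy as np

def extract_f0(wav_path, sr=16000):
    # Estimate F0 with pYIN; unvoiced frames come back as NaN.
    y, sr = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    # Log-scale and mean-normalize per utterance so the model sees relative
    # pitch movement rather than the speaker's absolute pitch.
    log_f0 = np.log(f0)            # NaN passes through for unvoiced frames
    return np.nan_to_num(log_f0 - np.nanmean(log_f0))

The normalized track could then be concatenated with the content features as extra conditioning.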
from freevc.
Hmmm, weird. #21 also reports worse results than mine. Tomorrow I'll check the results of my checkpoint trained with data_utils.py to see if this is the problem. I thought it would provide better performance than data_utils_old.py, though, because it loads more data in a batch. Other than that I can't think of any other reason right now.
from freevc.
Tested the checkpoint trained with data_utils.py, and it does have this conversion failure problem. Sorry for the bug; I'll update the code later.
Though I cannot figure out why this happens; it just loads larger amounts of data. Maybe I also messed something up when I cleaned up my code? Please ping me if the failure still exists, and I'll check my old dirty code to see if I missed something.
from freevc.
Tested again, and it does not have the conversion failure. In my first test I just passed freevc-s.json to load freevc.
But the resulting speech has some 'splitting up' voice. data_utils.py does have a problem: the concatenation deteriorates model performance.
data_utils_old.py is from the code before my cleanup, and has many old-version variable names. I'll update it later, after I make sure nothing is wrong. (Hope I do not mess up this time.)
from freevc.
Yes. Btw code updated.
from freevc.
I noticed that data_utils_old includes a speaker ID along with the audio data. I don't see any usage of the speaker ID in the version of train that got committed; was it used in a previous version? Seems like that could be beneficial for informing conversion.
Nope, it's just an unused variable. As can be seen, the model has nowhere to consume a speaker ID (it does not have an embedding table), so it's pointless to pass one.
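For anyone wondering what "consuming a speaker ID" would look like, it is essentially a lookup table the model simply does not have. A hypothetical sketch (dimensions made up):

import torch
import torch.nn as nn

# Hypothetical: a generator that conditioned on speaker IDs would hold a
# lookup table like this; FreeVC instead derives a speaker embedding from
# reference audio, so there is no place for an integer ID to go.
emb_g = nn.Embedding(num_embeddings=109, embedding_dim=256)  # e.g. 109 VCTK speakers
speaker_id = torch.tensor([7])     # made-up integer ID
g = emb_g(speaker_id)              # (1,) -> (1, 256) conditioning vector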
The whole thing goes like this: before I uploaded the code I needed to clean it up, because there were many confusing variable names, badly formatted filenames, useless code, etc. When cleaning up data_utils.py I thought it could be improved and wrote these changes. As it has different logic from the old code I used, I also uploaded data_utils_old.py, just to declare that "if you get better results than my pretrained checkpoints, it's just because I improved the code". But, unfortunately, this 'improvement' does the opposite. 😓
from freevc.
As can be seen, the model has nowhere to consume a speaker ID (it does not have an embedding table), so it's pointless to pass one.
Yeah, that makes sense. I can see that there's no embedding table in committed versions of the model; just wanted to make sure there wasn't one in a previous version (though I suppose if there were, my finetuning of the pretrained model you provided shouldn't have worked as well as it did).
Thanks for your responsiveness in all these issues; it's nice to see engagement with the code after release.
from freevc.
Also thanks to all of you for your interest ^_^
from freevc.
It could be. The speaker encoder does not have a special design for the case where there is a lot of silence, so the speaker embedding of an utterance can be "averaged out" by silence. I mean, suppose a reference utterance is 001111000000001100 (0 denotes silence, 1 denotes speech): if you only pass 111111 to the speaker encoder, the resulting speaker embedding properly reflects the speaker's properties; if you only pass 0000000, the resulting speaker embedding reflects the properties of silence; if you pass 001111000000001100, the resulting speaker embedding will be in between.
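A simple guard against that is to drop the silent stretches before computing the speaker embedding; a sketch using librosa's energy-based splitter (the top_db threshold is an assumption to tune):

import librosa
import numpy as np

def remove_silence(wav_path, sr=16000, top_db=30):
    # Keep only the segments librosa flags as non-silent (the "1" runs in
    # the 001111000000001100 example) and concatenate them.
    y, sr = librosa.load(wav_path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals) == 0:        # fully silent clip: nothing to trim to
        return y
    return np.concatenate([y[start:end] for start, end in intervals])

The cleaned audio would then go to the speaker encoder instead of the raw reference.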
from freevc.
If your dataset is sufficiently diverse, I think that kind of pitch issue is inevitable with FreeVC—the pretrained version might seem better because VCTK is less diverse than your data. WavLM simply encodes too much speaker identity for the bottleneck here to remove, and some of it likely ends up leaking through.
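For intuition, the bottleneck in question is just a narrow projection of the WavLM frames; a toy illustration (dimensions are assumptions, not necessarily FreeVC's exact ones):

import torch
import torch.nn as nn

# WavLM frames carry content AND speaker identity; squeezing them through a
# narrow projection forces the model to drop information, but nothing
# guarantees that what gets dropped is speaker identity rather than content.
wavlm_frames = torch.randn(1, 1024, 200)          # (batch, features, time)
bottleneck = nn.Conv1d(1024, 192, kernel_size=1)  # narrow per-frame projection
content = bottleneck(wavlm_frames)                # (1, 192, 200)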
from freevc.
I tried some large-pitch-gap conversions, including the mentioned 5142-to-p326 conversion, and some other conversions like a high-pitch kid's voice to another low-pitch man's voice. I put the results here.
So, it seems that your issue is not caused by the large pitch gap 🤔.
from freevc.
I've followed your training instructions to the T, with one slight difference: I haven't created test.txt in prepare_flist, though this shouldn't have any effect on training results.
from freevc.
Also, could the batch size of 64 be the issue? I understand that's to utilise as much memory of the RTX 3090 as possible.
from freevc.
I see, so it's not as simple as renaming data_utils_old.py to data_utils.py?
from freevc.
Oh yes, "splitting voice" is something I'm hearing, too, in the generated audio.
from freevc.
Can this model be fine-tuned on a custom dataset by continuing from a pre-trained checkpoint?
from freevc.
Ok, I'll first try finetuning
from freevc.
I have continued from the pretrained model, but the logs are saying that the current step is only 130400, not 900k. I've downloaded the generator and discriminator from here: https://onedrive.live.com/?authkey=%21AOOs5nZpsLC4ECE&id=537643E55991EE7B%219178&cid=537643E55991EE7B
But the generator freevc.pth was last updated on 14/09/2022, whilst the discriminator was updated on 30/11/2022. Maybe there's an old version of the pre-trained generator uploaded to OneDrive?
INFO:freevc:====> Epoch: 1372
INFO:freevc:Train Epoch: 1373 [63%]
INFO:freevc:[2.479205369949341, 2.2820231914520264, 11.992838859558105, 19.632217407226562, 2.451157331466675, 130400, 0.00016841507612184626]
from freevc.
- That means you did not use batch size 64 on 1 GPU, as I do. The displayed step can be influenced by batch size, number of GPUs, etc. (see the arithmetic sketch after this list).
- The displayed time is the time when I dragged the files from the server to my personal computer; both generator and discriminator were created at the same time.
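As a back-of-the-envelope illustration of the first point (all numbers below are made up): the displayed step counts optimizer updates, so it scales inversely with the effective batch size.

# rough relation: steps_per_epoch ≈ utterances / (batch_size * n_gpus)
n_utts = 12000                                     # hypothetical training-set size
batch_size = 64
n_gpus = 2
epochs_run = 1372
steps_per_epoch = n_utts // (batch_size * n_gpus)  # 93 updates per epoch
print(steps_per_epoch * epochs_run)                # 127596, i.e. on the order of 130k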
from freevc.
I've started fine-tuning with 2 GPUs, but the same batch size. Good to know. I'll observe the tuning results.
from freevc.
I second what @space-pope said. Cheers @OlaWod!
from freevc.
I've fine-tuned using the most recent commit and I can still hear the original voice (in the female-to-male conversion), ESPECIALLY at the beginning of words after silence. I.e., if there's silence before the first phoneme, then most likely the pitch of the resulting phonemes will be closer to the source rather than the target.
Could it be that the custom dataset has long pauses in the wavs? I'll try re-training on 5-second chunks split by silence:
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_wav("input.wav")  # illustrative input path
dBFS = sound.dBFS
chunks = split_on_silence(sound,
                          # split on silences longer than 100 ms
                          min_silence_len=100,
                          # anything under -16 dBFS is considered silence
                          silence_thresh=dBFS - 16,
                          # keep 200 ms of leading/trailing silence
                          keep_silence=200)
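Continuing the pydub snippet above, the chunks can then be written back out as individual training files, e.g.:

# write each chunk out as its own wav (output paths are illustrative)
for i, chunk in enumerate(chunks):
    chunk.export(f"dataset/chunk_{i:05d}.wav", format="wav")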
from freevc.
@space-pope how are your test results, especially with unseen-female-to-seen-male conversion?
from freevc.
The first thing I tried was finetuning the model for a couple thousand steps with some different data. I got reasonable results, but nothing groundbreaking. Challenging cases like the one you mention were still challenging/not great.
Past that, it's not really fair to compare my results, as I've been porting the code to another framework and using it as a starting point to attempt addressing some of the corner cases that seem tough for all current VC models—sources and targets with large pitch differences like you mention, and the fact that target conversions retain the accent of the source speaker, when ideally it'd work the other way around.
I regret to report that I have not yet surpassed the state of the art :).
from freevc.
accent is part of content.
I can tell that's the case here, but it strikes me as a strange definition of the word "content". Content should be what is spoken, not how. In other words, the same content spoken by different speakers will sound different due to the shape of their vocal tracts, emotion, idiolect, etc.; but lumping all those attributes into the definition of content feels over-broad.
Speech attributes often get entangled across a model, but, for example, a good multispeaker TTS system will have accent be more dependent on speaker ID than on the actual synthesis input. That's admittedly a bit of a stretch as a comparison since TTS systems receive a more explicit representation of content as input, but disentangling these things is something to shoot for.
from freevc.
@space-pope, all, I'm still having the output pitch issues when training a new model or trying to fine-tune the pre-trained one using the current code. Following some of your comments, it is not totally clear to me whether the code was already fixed or whether an older version may work better. Would someone mind clarifying this? I'm still getting a higher pitch range when converting female to male voices. Thanks in advance :-)
from freevc.
Thanks, Josh. I'm using VCTK; I just cannot replicate the pretrained model's performance. I'm using VCTK data and the training configuration as denoted in the repository, but I still get this pitch issue that I don't get using the pretrained one. Did you try this?
from freevc.
Hm, no; I never tried training with just VCTK since I was evaluating the repository for something else. I do remember there being some back-and-forth about the training scripts, though; I got the idea that the code used for the paper/pretrained models wasn't quite in sync with what got uploaded here. Either way, I haven't worked with this code in quite a while, so I won't be much help now; sorry. I will say that in general, I've found that models can improve long after it looks like the training loss has started to level out (especially with audio models using just L1 or MSE loss), so it could be a training time issue, but I can't be sure due to the codebase confusion.
from freevc.
I see. Thank you for the feedback, @space-pope! I tried with different data, but I got the pitch issue, and then tried to replicate the pretrained model results without success. I checked different error conditions, but things did not change much. It is a bit frustrating to not be able to replicate the results 😔. If anyone has actually done it successfully, I would really appreciate them sharing some information here. Thanks!
from freevc.