Comments (32)
The VC replicates the voice, i.e., the shape of the vocal tract, to create the same formants as the target.
Accent has nothing to do with the physical properties of the voice or the speaker; accent or dialect is simply pronunciation, which makes it content.
So having separate content models for each accent would work; the downsides are the many models needed, and the system then only working with a single language.
The way it works currently lets the system work with any language; it can also handle "vocal grunts" and some singing. None of this would be possible with something that attempts to replicate a target's accent. You would also need a much longer sample file from the target to get something like that to work: a 5-second clip wouldn't cut it, you would need several minutes if not hours of the target speaker to get the correct pronunciations.
You also start to lose control of the output the more data you try to pull from the target. If the system relies on the target for pronunciation and accent, a side effect is that you can no longer control the output emotion, because that now comes from the target and not the source. So you can't make the output sad or happy unless you also have sad or happy recordings of the target speaker.
Any VC will still require you to do some acting, even if you implemented the system I mentioned and were OK with all of those limits: you still need to mimic the target's speech patterns and cadence. You might as well use a system that is more flexible and mimic the accent as well.
This is why you see other (commercial) VC systems also implement a text system, requiring a script to go with the recordings. Those systems are fine-tuned to a single target; they can't do many-to-many. They have a separate model you load for each target, and that model was heavily trained on hours' worth of data from that single target speaker.
What you're wanting will never work with a many-to-many model, or with just a short sample from the target.
from freevc.
fact that target conversions retain the accent of the source speaker, when ideally it'd work the other way around.
This is to be expected as accents are not part of voice, and no VC will ever be able to change that.
accent is part of content.
What you are talking about is content transfer in this case. So if you have an American source speaker and a British target, you would need several content models, one for each accent. Then when you say "tomato" in American English, the system would recognize that, pull the word "tomato" from the British English content model, and use that content embedding before performing the VC.
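A minimal sketch of that pipeline, purely hypothetical (asr_model, content_models, vc_model and their methods are made-up names; nothing like this exists in FreeVC):

def accent_transfer_vc(source_wav, target_wav, vc_model, asr_model,
                       content_models, target_accent="british"):
    # 1. Recognize the spoken units (words or phones) in the source audio.
    words = asr_model.transcribe(source_wav)        # e.g. ["tomato", ...]
    # 2. Instead of extracting content from the source audio, look each
    #    unit up in the content model of the *target* accent.
    content = [content_models[target_accent].embed(w) for w in words]
    # 3. Run the usual voice conversion with the swapped-in content.
    return vc_model.convert(content=content, reference=target_wav)

Note that this is exactly where the flexibility described above is lost: the lookup is per-accent and per-language.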
from freevc.
For large pitch differences, it's best to add F0 analysis to the VC. This also helps maintain pitch variance with natural speech vs. "read text".
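A minimal sketch of the analysis side of that, assuming librosa's pYIN for F0 extraction (how the resulting track is fed to the model is a separate design choice):

import librosa
import numpy as np

def extract_f0(wav_path, sr=16000):
    # Estimate F0 with pYIN; unvoiced frames come back as NaN.
    y, sr = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    # Log-scale and mean-normalize per utterance so the model sees relative
    # pitch movement rather than the speaker's absolute pitch.
    log_f0 = np.log(f0)            # NaN passes through for unvoiced frames
    return np.nan_to_num(log_f0 - np.nanmean(log_f0))

The normalized track could then be concatenated with the content features as extra conditioning.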
from freevc.
Hmmm, weird. #21 also reports worse results than mine. Tomorrow I'll check the results of my checkpoint trained with data_utils.py to see if this is the problem. I thought it would provide better performance than data_utils_old.py, though, because it loads more data in a batch. Other than that I can't think of any other reason right now.
from freevc.
Tested the checkpoint trained with data_utils.py, and it does have this conversion failure problem. Sorry for the bug; I'll update the code later.
Though I cannot figure out why this happens; it just loads larger amounts of data. Maybe I also messed something up when I cleaned up my code? Please ping me if the failure still exists, and I'll check my old dirty code to see if I missed something.
from freevc.
Tested again, and it does not have the conversion failure. In my first test I just passed freevc-s.json to load freevc.
But the resulting speech has some 'splitting up' voice. data_utils.py does have a problem: the concatenation deteriorates model performance.
data_utils_old.py is from the code before my cleanup, and has many old-version variable names. I'll update it later, after I make sure nothing is wrong. (Hope I do not mess up this time.)
from freevc.
Yes. Btw code updated.
from freevc.
I noticed that data_utils_old includes a speaker ID along with the audio data. I don't see any usage of the speaker ID in the version of train that got committed; was it used in a previous version? Seems like that could be beneficial for informing conversion.
Nope, it's just an unused variable. As can be seen, the model has nowhere to consume a speaker ID (it does not have an embedding table), so it's pointless to pass one.
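For anyone wondering what "consuming a speaker ID" would look like, it is essentially a lookup table the model simply does not have. A hypothetical sketch (dimensions made up):

import torch
import torch.nn as nn

# Hypothetical: a generator that conditioned on speaker IDs would hold a
# lookup table like this; FreeVC instead derives a speaker embedding from
# reference audio, so there is no place for an integer ID to go.
emb_g = nn.Embedding(num_embeddings=109, embedding_dim=256)  # e.g. 109 VCTK speakers
speaker_id = torch.tensor([7])     # made-up integer ID
g = emb_g(speaker_id)              # (1,) -> (1, 256) conditioning vector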
The whole thing goes like this: before I uploaded the code I needed to clean it up, because there were many confusing variable names, badly formatted filenames, useless code, etc. When cleaning up data_utils.py I thought it could be improved and wrote these changes. As it has different logic from the old code I used, I also uploaded data_utils_old.py, just to declare that "if you get better results than my pretrained checkpoints, it's just because I improved the code". But, unfortunately, this 'improvement' does the opposite. 😓
from freevc.
As can be seen, the model has nowhere to consume a speaker ID (it does not have an embedding table), so it's pointless to pass one.
Yeah, that makes sense. I can see that there's no embedding table in committed versions of the model; just wanted to make sure there wasn't one in a previous version (though I suppose if there were, my finetuning of the pretrained model you provided shouldn't have worked as well as it did).
Thanks for your responsiveness in all these issues; it's nice to see engagement with the code after release.
from freevc.
Also thanks to all of you for your interest ^_^
from freevc.
It could be. The speaker encoder does not have a special design for the case where there is a lot of silence, so the speaker embedding of an utterance can be "averaged out" by silence. I mean, suppose a reference utterance is 001111000000001100 (0 denotes silence, 1 denotes speech): if you only pass 111111 to the speaker encoder, the resulting speaker embedding properly reflects the speaker's properties; if you only pass 0000000, the resulting speaker embedding reflects the properties of silence; if you pass 001111000000001100, the resulting speaker embedding will be in between.
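A simple guard against that is to drop the silent stretches before computing the speaker embedding; a sketch using librosa's energy-based splitter (the top_db threshold is an assumption to tune):

import librosa
import numpy as np

def remove_silence(wav_path, sr=16000, top_db=30):
    # Keep only the segments librosa flags as non-silent (the "1" runs in
    # the 001111000000001100 example) and concatenate them.
    y, sr = librosa.load(wav_path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals) == 0:        # fully silent clip: nothing to trim to
        return y
    return np.concatenate([y[start:end] for start, end in intervals])

The cleaned audio would then go to the speaker encoder instead of the raw reference.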
from freevc.
If your dataset is sufficiently diverse, I think that kind of pitch issue is inevitable with FreeVC—the pretrained version might seem better because VCTK is less diverse than your data. WavLM simply encodes too much speaker identity for the bottleneck here to remove, and some of it likely ends up leaking through.
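For intuition, the bottleneck in question is just a narrow projection of the WavLM frames; a toy illustration (dimensions are assumptions, not necessarily FreeVC's exact ones):

import torch
import torch.nn as nn

# WavLM frames carry content AND speaker identity; squeezing them through a
# narrow projection forces the model to drop information, but nothing
# guarantees that what gets dropped is speaker identity rather than content.
wavlm_frames = torch.randn(1, 1024, 200)          # (batch, features, time)
bottleneck = nn.Conv1d(1024, 192, kernel_size=1)  # narrow per-frame projection
content = bottleneck(wavlm_frames)                # (1, 192, 200)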
from freevc.
I tried some large-pitch-gap conversions, including the mentioned 5142-to-p326 conversion, and some other conversions like a high-pitch kid's voice to another low-pitch man's voice. I put the results here.
So, it seems that your issue is not caused by the large pitch gap 🤔.
from freevc.
I've followed your training instructions to the T, with one slight difference: I haven't created test.txt in prepare_flist, though this shouldn't have any effect on training results.
from freevc.
Also, could the batch size of 64 be the issue? I understand that's to utilise as much memory of the RTX 3090 as possible.
from freevc.
I see, so it's not as simple as renaming data_utils_old.py to data_utils.py?
from freevc.
Oh yes, "splitting voice" is something I'm hearing, too, in the generated audio.
from freevc.
Can this model be fine-tuned on a custom dataset by continuing from a pre-trained checkpoint?
from freevc.
Ok, I'll first try finetuning
from freevc.
I have continued from the pretrained model, but the logs are saying that the current step is only 130400, not 900k. I've downloaded the generator and discriminator from here: https://onedrive.live.com/?authkey=%21AOOs5nZpsLC4ECE&id=537643E55991EE7B%219178&cid=537643E55991EE7B
But the generator freevc.pth was last updated on 14/09/2022, whilst the discriminator was updated on 30/11/2022. Maybe there's an old version of the pre-trained generator uploaded to OneDrive?
INFO:freevc:====> Epoch: 1372
INFO:freevc:Train Epoch: 1373 [63%]
INFO:freevc:[2.479205369949341, 2.2820231914520264, 11.992838859558105, 19.632217407226562, 2.451157331466675, 130400, 0.00016841507612184626]
from freevc.
- That means you did not use batch size 64 on 1 GPU, as I do. The displayed step can be influenced by batch size, number of GPUs, etc. (see the arithmetic sketch after this list).
- The displayed time is the time when I dragged the files from the server to my personal computer; both generator and discriminator were created at the same time.
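As a back-of-the-envelope illustration of the first point (all numbers below are made up): the displayed step counts optimizer updates, so it scales inversely with the effective batch size.

# rough relation: steps_per_epoch ≈ utterances / (batch_size * n_gpus)
n_utts = 12000                                     # hypothetical training-set size
batch_size = 64
n_gpus = 2
epochs_run = 1372
steps_per_epoch = n_utts // (batch_size * n_gpus)  # 93 updates per epoch
print(steps_per_epoch * epochs_run)                # 127596, i.e. on the order of 130k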
from freevc.
I've started fine-tuning with 2 GPUs, but the same batch size. Good to know. I'll observe the tuning results.
from freevc.
I second what @space-pope said. Cheers @OlaWod!
from freevc.
I've fine-tuned using the most recent commit and I can still hear the original voice (in the female-to-male conversion), ESPECIALLY at the beginning of words after silence. I.e., if there's silence before the first phoneme, then most likely the pitch of the resulting phonemes will be closer to the source rather than the target.
Could it be that the custom dataset has long pauses in the wavs? I'll try re-training on 5-second chunks split by silence:
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_wav("input.wav")  # illustrative input path
dBFS = sound.dBFS
chunks = split_on_silence(sound,
                          # split on silences longer than 100 ms
                          min_silence_len=100,
                          # anything under -16 dBFS is considered silence
                          silence_thresh=dBFS - 16,
                          # keep 200 ms of leading/trailing silence
                          keep_silence=200)
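Continuing the pydub snippet above, the chunks can then be written back out as individual training files, e.g.:

# write each chunk out as its own wav (output paths are illustrative)
for i, chunk in enumerate(chunks):
    chunk.export(f"dataset/chunk_{i:05d}.wav", format="wav")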
from freevc.
@space-pope how are your test results, especially with unseen-female-to-seen-male conversion?
from freevc.
The first thing I tried was finetuning the model for a couple thousand steps with some different data. I got reasonable results, but nothing groundbreaking. Challenging cases like the one you mention were still challenging/not great.
Past that, it's not really fair to compare my results, as I've been porting the code to another framework and using it as a starting point to attempt addressing some of the corner cases that seem tough for all current VC models—sources and targets with large pitch differences like you mention, and the fact that target conversions retain the accent of the source speaker, when ideally it'd work the other way around.
I regret to report that I have not yet surpassed the state of the art :).
from freevc.
accent is part of content.
I can tell that's the case here, but it strikes me as a strange definition of the word "content". Content should be what is spoken, not how. In other words, the same content spoken by different speakers will sound different due to the shape of their vocal tracts, emotion, idiolect, etc.; but lumping all those attributes into the definition of content feels over-broad.
Speech attributes often get entangled across a model, but, for example, a good multispeaker TTS system will have accent be more dependent on speaker ID than on the actual synthesis input. That's admittedly a bit of a stretch as a comparison since TTS systems receive a more explicit representation of content as input, but disentangling these things is something to shoot for.
from freevc.
@space-pope, all, I'm still having the output pitch issues when training a new model or trying to fine-tune the pre-trained one using the current code. Following some of your comments, it is not totally clear to me whether the code was already fixed or whether an older version may work better. Would someone mind clarifying this? I'm still getting a higher pitch range when converting female to male voices. Thanks in advance :-)
from freevc.
Thanks, Josh. I'm using VCTK; I just cannot replicate the pretrained model's performance. I'm using VCTK data and the training configuration as denoted in the repository, but I still get this pitch issue that I don't get using the pretrained one. Did you try this?
from freevc.
Hm, no; I never tried training with just VCTK since I was evaluating the repository for something else. I do remember there being some back-and-forth about the training scripts, though; I got the idea that the code used for the paper/pretrained models wasn't quite in sync with what got uploaded here. Either way, I haven't worked with this code in quite a while, so I won't be much help now; sorry. I will say that in general, I've found that models can improve long after it looks like the training loss has started to level out (especially with audio models using just L1 or MSE loss), so it could be a training time issue, but I can't be sure due to the codebase confusion.
from freevc.
I see. Thank you for the feedback, @space-pope! I tried with different data, but I got the pitch issue, and then tried to replicate the pretrained model results without success. I checked different error conditions, but things did not change much. It is a bit frustrating to not be able to replicate the results 😔. If anyone has actually done it successfully, I would really appreciate them sharing some information here. Thanks!
from freevc.