Comments (8)
The vocoder has nothing to do with the timbre. Do the different timbres from the same singer also sound the same on TensorBoard? How different do the timbres sound from each other?
Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit that; do not edit any pre-existing files, and do not derive from base.yaml directly, as explained in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
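As a rough illustration of this advice, the edits might look like the fragment below in a copy of the acoustic template config. The key names are recalled from the openvpi DiffSinger templates and may differ between versions; always start from the shipped template (e.g. `configs/templates/config_acoustic.yaml`) and verify each key there rather than copying this sketch.

```yaml
# Hypothetical sketch of a copied template config, edited in place.
# Verify every key name against the template shipped with your version.
augmentation_args:
  random_pitch_shifting:
    enabled: true          # pitch-shift augmentation
  random_time_stretching:
    enabled: true          # time-stretch augmentation
use_key_shift_embed: true  # needed when pitch-shift augmentation is on
use_speed_embed: true      # needed when time-stretch augmentation is on
pl_trainer_precision: 16-mixed  # AMP (mixed precision)
max_batch_size: 48              # raise as far as GPU memory allows
```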
from diffsinger.
> The vocoder has nothing to do with the timbre. Do the different timbres from the same singer also sound the same on TensorBoard? How different do the timbres sound from each other?
I only mentioned the vocoder to cover all of the differences. No, the different timbres sound very similar to the ground truth sample in TensorBoard (so long as training is far enough along).
> Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit that; do not edit any pre-existing files, and do not derive from base.yaml directly, as explained in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
Okay, I'll readjust the configuration using your recommendations and see if I get better results! In what cases would you recommend using fine-tuning?
> No, the different timbres sound very similar to the ground truth sample in TensorBoard (so long as training is far enough along).
So you mean the timbres are distinct from each other on TensorBoard but are very similar in OpenUTAU? The only possibility I can imagine is that you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, so the timbres got mixed up. But in your configuration I saw that you did not enable these two parameters.
> In what cases would you recommend using fine-tuning?
Currently the only recommended use case is training the aux decoder and the diffusion decoder separately when enabling shallow diffusion. Fine-tuning is not that helpful in regular cases. If you fine-tune a model, it will not save many training steps if you want to totally wash out the timbres in the pre-trained model; if you train for enough steps, it will cause catastrophic forgetting. If you discard some layers or embeddings before fine-tuning, it may perform even worse than starting from scratch. Meanwhile, fine-tuning requires careful adjustment of the training-related hyperparameters to get the best results. In short, do not use fine-tuning unless guided by the documentation, or unless you are an expert and clearly aware of what you are doing. Especially for people who own enough high-quality, well-labeled data: please train from scratch.
> So you mean the timbres are distinct from each other on TensorBoard but are very similar in OpenUTAU? The only possibility I can imagine is that you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, so the timbres got mixed up. But in your configuration I saw that you did not enable these two parameters.
Yes, they sound distinct in TensorBoard but almost identical in OpenUTAU. There are slight differences in the waveforms, but generally all of the unique timbre gets removed and they all sound like they've been trained together as opposed to as separate speakers. I generally wasn't happy with the results I got with energy and breathiness prior, so I decided to not train using those parameters. Do you think that might have something to do with this issue?
Thank you for better explaining the use of fine-tuning! I'll be sure to stick to training from scratch going forward.
The only other thing I can think of that might be causing this issue is the amount of data I'm using and the number of speakers. I never had this issue when I was training on smaller amounts of data (~2 hrs, 2 different vocalists, 6 different "voice modes"/speakers in DiffSinger), and now my dataset is ~6 hrs, 6 different vocalists, and up to 23 different speakers in DiffSinger. That's about all my GPU can handle (I train locally).
I ran another test last night training only about 2 hrs of data across 3 vocalists and 10 speakers in the config, and still got the same issue after 200 epochs / 12k steps of acoustic training. I can confirm the data is high quality and tagged well (all done by hand by me). Thanks so much for all of your help!
I mean, if the timbres are distinct from each other on TensorBoard, they are expected to be distinct from each other in OpenUTAU as well, because the conditions are the same. If you believe the TensorBoard samples really sound as you expected, then there must be something else wrong; otherwise, the problem would already have been audible on TensorBoard.
A possible way to debug is to export the DS files from OpenUTAU and run

```
python scripts/infer.py acoustic your_project.ds --spk your_spk
```

to verify whether the model is really trained correctly.
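One quick way to act on this suggestion is to render the same DS file once per embed and compare the results by ear. A minimal sketch that just builds the command line for each speaker (the speaker names here are placeholders; the script path and `--spk` flag are taken from the comment above):

```python
# Build one infer.py invocation per speaker embed, so each render
# can be compared against the others. Names are placeholders.
speakers = ["spk_a", "spk_b", "spk_c"]
commands = [
    f"python scripts/infer.py acoustic your_project.ds --spk {name}"
    for name in speakers
]
for cmd in commands:
    print(cmd)
```

Running each printed command and listening to the outputs side by side makes it obvious whether the checkpoint itself distinguishes the timbres.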
I personally train with 5 or 6 vocalists and ~9 timbres in total for every experiment, and I have never encountered any issue with the differences between timbres. Some people in the community train larger datasets than mine, with multi-timbre singers in them, and they have no problem either.
> I generally wasn't happy with the results I got with energy and breathiness prior, so I decided to not train using those parameters.
In my experience and that of other people in our community in China, the variance parameters do not degrade quality; they improve stability and controllability. But if you do not train them well, they can cause some problems, and there are some interesting findings in our recent research about the mutual influence between variance modules. These have been added to the documentation, and a minor release will also be published to notify users about it.
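For reference, enabling those two parameters is a matter of two flags in the variance config. The key names below are recalled from the openvpi variance template and should be verified against `configs/templates/config_variance.yaml` in your checkout; this is only a hedged sketch, not a confirmed snippet from the thread.

```yaml
# Hypothetical variance-config fragment; verify key names
# against the template shipped with your DiffSinger version.
predict_energy: true       # train the energy predictor
predict_breathiness: true  # train the breathiness predictor
```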
Thanks for the tip on debugging by inferring directly from the checkpoint. It turns out it's either an OpenUTAU issue or a deployment issue, because direct inference from the checkpoint via the command line actually gave me the proper output with separate timbres. I'll have to keep messing around with the OpenUTAU library file structure to figure out why it's getting the embeds confused, which is my guess. Do they have to be listed in the OpenUTAU configs in a certain order that you're aware of?
When exporting to ONNX you should use `--export_spk spk1 --export_spk spk2 ...` to export all of your desired embeds; if this option is unset, the exporter exports all of the embeds. Then you should write them down in the OpenUTAU config as its wiki says, and yes, they should be in an ordered list, but in any order you would like.
So you might have mixed up your embeds somehow, or it's probably just a mistake in your usage of OpenUTAU. You should first check your embeds to see if they are really different, then your configs, and then OpenUTAU itself (for example, use a clean install or reset all of the preferences in case there is some misconfiguration in the expression settings).
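The ordered list mentioned above might look like the fragment below in the OpenUTAU voicebank config. The `speakers` key and the convention of listing embed names are recalled from the OpenUTAU DiffSinger wiki and should be checked there; the speaker names are placeholders.

```yaml
# Hypothetical dsconfig.yaml fragment for an OpenUTAU DiffSinger voicebank.
# Entries should match the exported .emb files; any order works,
# but it must be an ordered list.
speakers:
  - spk1
  - spk2
```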
First of all, thank you so much for all of your dedication in helping me solve this issue; I've learned a ton!
Second of all, I discovered that the issue IS OpenUTAU. Apparently, if the embed files are not in the same directory as the character.yaml file, you have to specify where they are. I thought it pulled from the list of speakers, so it was totally my misunderstanding.
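Based on that finding, the fix would be to include the relative path in each speakers entry when the embed files sit in a subfolder. This is only a sketch inferred from the comment above (folder and speaker names are placeholders; confirm the exact path convention in the OpenUTAU DiffSinger wiki):

```yaml
# If the .emb files are not beside character.yaml, include the
# relative path in each entry (placeholder names).
speakers:
  - embeds/spk1
  - embeds/spk2
```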
Related Issues (20)
- Support tension and voicing
- TypeError running variance inference (previously working) HOT 1
- ONNX inference 'depth' parameter HOT 6
- onnx exports to incorrect folder HOT 1
- Strange humming sound during `SP` & `AP` HOT 3
- Inference from OpenUTAU USTx -> DiffSinger DS not Carrying Over Parameters HOT 1
- AttributeError on ReFlow HOT 1
- Tracking: development around Rectified Flow HOT 3
- Export Acoustic Model Error:"size mismatch for fs2.txt_embed.weight" HOT 1
- Custom Trained DiffSinger Render Failed HOT 4
- Can the model architecture be changed, or other methods be used, to improve synthesis quality? HOT 6
- Is removing background noise from audio beneficial to the quality of DiffSinger? HOT 2
- Question regarding pitch models (Reflow vs DDPM) HOT 3
- About the dataset for the singing-style model HOT 1
- Effects of transitioning mel_base from '10' to 'e' HOT 2
- In automatic optimization, `training_step` must return a Tensor, a dict, or None (where the step will be skipped). HOT 7
- ONNX Inference Scripts Documentation HOT 5
- Error training variance model HOT 3
- Making a chorus with DiffSinger HOT 2
- Inference DiffSinger HOT 6