Comments (8)
The vocoder has nothing to do with the timbre. Do the different timbres from the same singer also sound the same on TensorBoard? How different do the timbres sound from each other?
Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit that; do not edit any pre-existing files, and do not derive from base.yaml directly, as explained in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
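As a rough illustration of this advice, the edits might look like the fragment below in a copy of the acoustic template config. The key names are recalled from the openvpi DiffSinger templates and may differ between versions; always start from the shipped template (e.g. `configs/templates/config_acoustic.yaml`) and verify each key there rather than copying this sketch.

```yaml
# Hypothetical sketch of a copied template config, edited in place.
# Verify every key name against the template shipped with your version.
augmentation_args:
  random_pitch_shifting:
    enabled: true          # pitch-shift augmentation
  random_time_stretching:
    enabled: true          # time-stretch augmentation
use_key_shift_embed: true  # needed when pitch-shift augmentation is on
use_speed_embed: true      # needed when time-stretch augmentation is on
pl_trainer_precision: 16-mixed  # AMP (mixed precision)
max_batch_size: 48              # raise as far as GPU memory allows
```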
from diffsinger.
> The vocoder has nothing to do with the timbre. Do the different timbres from the same singer also sound the same on TensorBoard? How different do the timbres sound from each other?
I only mentioned the vocoder to cover all of the differences. No, the different timbres sound very similar to the ground truth sample in TensorBoard (so long as training is far enough along).
> Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit that; do not edit any pre-existing files, and do not derive from base.yaml directly, as explained in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
Okay, I'll readjust the configuration using your recommendations and see if I get better results! In what cases would you recommend using fine-tuning?
> No, the different timbres sound very similar to the ground truth sample in TensorBoard (so long as training is far enough along).
So you mean the timbres are distinct from each other on TensorBoard but are very similar in OpenUTAU? The only possibility I can imagine is that you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, so the timbres got mixed up. But in your configuration I saw that you did not enable these two parameters.
> In what cases would you recommend using fine-tuning?
Currently the only recommended use case is training the aux decoder and the diffusion decoder separately when enabling shallow diffusion. Fine-tuning is not that helpful in regular cases. If you fine-tune a model, it will not save many training steps if you want to totally wash out the timbres in the pre-trained model; if you train for enough steps, it will cause catastrophic forgetting. If you discard some layers or embeddings before fine-tuning, it may perform even worse than starting from scratch. Meanwhile, fine-tuning requires careful adjustment of the training-related hyperparameters to get the best results. In short, do not use fine-tuning unless guided by the documentation, or unless you are an expert and clearly aware of what you are doing. Especially for people who own enough high-quality, well-labeled data: please train from scratch.
> So you mean the timbres are distinct from each other on TensorBoard but are very similar in OpenUTAU? The only possibility I can imagine is that you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, so the timbres got mixed up. But in your configuration I saw that you did not enable these two parameters.
Yes, they sound distinct in TensorBoard but almost identical in OpenUTAU. There are slight differences in the waveforms, but generally all of the unique timbre gets removed and they all sound like they've been trained together as opposed to as separate speakers. I generally wasn't happy with the results I got with energy and breathiness prior, so I decided to not train using those parameters. Do you think that might have something to do with this issue?
Thank you for better explaining the use of fine-tuning! I'll be sure to stick to training from scratch going forward.
The only other thing I can think of that might be causing this issue is the amount of data I'm using and the number of speakers. I never had this issue when I was training on smaller amounts of data (~2 hrs, 2 different vocalists, 6 different "voice modes"/speakers in DiffSinger), and now my dataset is ~6 hrs, 6 different vocalists, and up to 23 different speakers in DiffSinger. That's about all my GPU can handle (I train locally).
I ran another test last night training only about 2 hrs of data across 3 vocalists and 10 speakers in the config, and still got the same issue after 200 epochs / 12k steps of acoustic training. I can confirm the data is high quality and tagged well (all done by hand by me). Thanks so much for all of your help!
I mean, if the timbres are distinct from each other on TensorBoard, they are expected to be distinct from each other in OpenUTAU as well, because the conditions are the same. If you believe the TensorBoard samples really sound as you expected, then there must be something else wrong; otherwise, the problem would already have been audible on TensorBoard.
A possible way to debug is to export the DS files from OpenUTAU and run

```
python scripts/infer.py acoustic your_project.ds --spk your_spk
```

to verify whether the model is really trained correctly.
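One quick way to act on this suggestion is to render the same DS file once per embed and compare the results by ear. A minimal sketch that just builds the command line for each speaker (the speaker names here are placeholders; the script path and `--spk` flag are taken from the comment above):

```python
# Build one infer.py invocation per speaker embed, so each render
# can be compared against the others. Names are placeholders.
speakers = ["spk_a", "spk_b", "spk_c"]
commands = [
    f"python scripts/infer.py acoustic your_project.ds --spk {name}"
    for name in speakers
]
for cmd in commands:
    print(cmd)
```

Running each printed command and listening to the outputs side by side makes it obvious whether the checkpoint itself distinguishes the timbres.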
I personally train with 5 or 6 vocalists and ~9 timbres in total for every experiment, and I have never encountered any issue with the differences between timbres. Some people in the community train larger datasets than mine, with multi-timbre singers in them, and they have no problem either.
> I generally wasn't happy with the results I got with energy and breathiness prior, so I decided to not train using those parameters.
In my experience and that of other people in our community in China, the variance parameters do not degrade quality; they improve stability and controllability. But if you do not train them well, they can cause some problems, and there are some interesting findings in our recent research about the mutual influence between variance modules. These have been added to the documentation, and a minor release will also be published to notify users about it.
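For reference, enabling those two parameters is a matter of two flags in the variance config. The key names below are recalled from the openvpi variance template and should be verified against `configs/templates/config_variance.yaml` in your checkout; this is only a hedged sketch, not a confirmed snippet from the thread.

```yaml
# Hypothetical variance-config fragment; verify key names
# against the template shipped with your DiffSinger version.
predict_energy: true       # train the energy predictor
predict_breathiness: true  # train the breathiness predictor
```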
Thanks for the tip on debugging by inferring directly from the checkpoint. It turns out it's either an OpenUTAU issue or a deployment issue, because direct inference from the checkpoint via the command line actually gave me the proper output with separate timbres. I'll have to keep messing around with the OpenUTAU library file structure to figure out why it's getting the embeds confused, which is my guess. Do they have to be listed in the OpenUTAU configs in a certain order that you're aware of?
When exporting to ONNX you should use `--export_spk spk1 --export_spk spk2 ...` to export all of your desired embeds; if this option is unset, the exporter exports all of the embeds. Then you should write them down in the OpenUTAU config as its wiki says, and yes, they should be in an ordered list, but in any order you would like.
So you might have mixed up your embeds somehow, or it's probably just a mistake in your usage of OpenUTAU. You should first check your embeds to see if they are really different, then your configs, and then OpenUTAU itself (for example, use a clean install or reset all of the preferences in case there is some misconfiguration in the expression settings).
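The ordered list mentioned above might look like the fragment below in the OpenUTAU voicebank config. The `speakers` key and the convention of listing embed names are recalled from the OpenUTAU DiffSinger wiki and should be checked there; the speaker names are placeholders.

```yaml
# Hypothetical dsconfig.yaml fragment for an OpenUTAU DiffSinger voicebank.
# Entries should match the exported .emb files; any order works,
# but it must be an ordered list.
speakers:
  - spk1
  - spk2
```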
First of all, thank you so much for all of your dedication in helping me solve this issue; I've learned a ton!
Second of all, I discovered that the issue IS OpenUTAU. Apparently, if the embed files are not in the same directory as the character.yaml file, you have to specify where they are. I thought it pulled from the list of speakers, so it was totally my misunderstanding.
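Based on that finding, the fix would be to include the relative path in each speakers entry when the embed files sit in a subfolder. This is only a sketch inferred from the comment above (folder and speaker names are placeholders; confirm the exact path convention in the OpenUTAU DiffSinger wiki):

```yaml
# If the .emb files are not beside character.yaml, include the
# relative path in each entry (placeholder names).
speakers:
  - embeds/spk1
  - embeds/spk2
```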
Related Issues (20)
- Support tension and voicing
- TypeError running variance inference (previously working) HOT 1
- ONNX inference 'depth' parameter HOT 6
- onnx exports to incorrect folder HOT 1
- Strange humming sound during `SP` & `AP` HOT 3
- Inference from OpenUTAU USTx -> DiffSinger DS not Carrying Over Parameters HOT 1
- AttributeError on ReFlow HOT 1
- Tracking: development around Rectified Flow HOT 3
- Export Acoustic Model Error:"size mismatch for fs2.txt_embed.weight" HOT 1
- Custom Trained DiffSinger Render Failed HOT 4
- Can the model architecture be changed, or other methods be used, to improve synthesis quality? HOT 6
- Is removing background noise from audio beneficial to the quality of DiffSinger? HOT 2
- Question regarding pitch models (Reflow vs DDPM) HOT 3
- About the dataset for the singing-style model HOT 1
- Effects of transitioning mel_base from '10' to 'e' HOT 2
- In automatic optimization, `training_step` must return a Tensor, a dict, or None (where the step will be skipped). HOT 7
- ONNX Inference Scripts Documentation HOT 5
- Error training variance model HOT 3
- Making a chorus with DiffSinger HOT 2
- Inference DiffSinger HOT 6