Comments (18)
Hi, so I trained a new model on the 8k dataset. My impression is that the prosody is more consistent, leading to a quicker build-up of attention for the Tacotron model (which I need for extracting durations for the FastPitch model). Here are samples for comparison (vocoded with universal HiFi-GAN):
fastpitch old dataset
fastpitch new dataset
coqui
Here is the model in case you want to play around with it: link
I also linked it in the repo: https://github.com/as-ideas/ForwardTacotron#pretrained-models (follow the README if you want to use the model)
from thorsten-voice.
Hi, glad you like it. I trained on the stable dataset. I just read that the 8k recordings are of higher quality? I'll try those out. I will play around a bit with the settings to see if I can increase the quality and share the best model.
Absolutely, maybe the tts models need to get better :-)
One more thing: I also found that correct phonemization plays a huge role in TTS prosody. I might retrain the model with a different phonemizer (https://github.com/as-ideas/DeepPhonemizer). The espeak phonemizer is quite poor for German.
Thanks again for your recordings; it's a nice dataset for doing some TTS research. I may get into emotional TTS with it soon.
Hi @cschaefer26
I'm really impressed; I think it sounds great.
Do you have a website, post, or article I can link to? Adding a "Thorsten" phrase/sample here or here would be really great.
Did you train on the "stable" dataset or on the "recording-in-progress" dataset with 8k recordings?
Would you consider making your model/config public? That would be really nice.
Thanks for your effort and sharing.
Thanks again @cschaefer26 for playing around with my recording-in-progress dataset and for adding it to the README.
As I am a little bit "betriebstaub" (too accustomed to my own voice to judge it), what is your opinion on the naturalness of the speech flow in these two versions?
I'd say the new dataset has a more natural flow, but the synthesized audio has the same length as the version with the old dataset.
My first recordings were recorded too fast (23 chars/sec), and I now try to speak at an average of around 16 characters per second.
See my video here: https://youtu.be/mlsYnDw71vc?t=523
Hi, I didn't manually check the whole dataset, so it is more of a guess. Here is a ground-truth example of a speed fluctuation (the word 'Solidarität' is quicker than the rest):
https://drive.google.com/file/d/1eEAyT91LLQLBc-5OtSJ4IHA8soBNQVrz/view?usp=sharing
In my experience, it's best for the ML training if the speed is fairly constant; it's probably a double-edged sword to be consistent but not robotic...
I know what you mean. Speaking at a very consistent pace can sound bored, while speaking too enthusiastically might be unusable for TTS training. Finding the right balance is not easy.
Hi @cschaefer26
I've trained a "forward_tacotron" model for the full 300k steps with your repo, and using the universal HiFi-GAN vocoder the quality is really good. But the speech is way too fast, or the breaks between words and between sentences are too short. Do you have any ideas on how to improve this?
Just in case, I'll remove the fastest recordings from the dataset and start a new training.
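Dropping the fastest recordings can be scripted; a sketch assuming an LJSpeech-style metadata layout of (file id, transcript) pairs plus a precomputed mapping of clip durations (both the data layout and the 18 chars/sec cutoff are assumptions for illustration):

```python
def filter_fast_clips(rows, durations, max_rate=18.0):
    """Keep clips whose speaking rate (chars/sec) is at or below max_rate.

    rows: list of (file_id, transcript) pairs, e.g. parsed from metadata.csv.
    durations: mapping of file_id -> clip length in seconds.
    """
    kept = []
    for file_id, text in rows:
        rate = len(text) / durations[file_id]
        if rate <= max_rate:
            kept.append((file_id, text))
    return kept
```

The surviving rows can then be written back out as the metadata file for the new training run.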
Hi, yeah, I have encountered that problem before with ForwardTacotron on some datasets. It seems to me that the duration predictor is somehow overfitting. You could either take an earlier model (e.g. 50k steps, which should be fine) or train a model with model_type: 'fast_pitch' (in the config); the FastPitch models haven't shown this behaviour yet.
An earlier checkpoint could be worth a try. A FastPitch training is running already.
Here's a sample from a FastPitch model (300k) and HifiGAN. It's too slow and unnatural.
A speed between FastPitch and ForwardTacotron would be nice :-).
Hi, yeah, that's too slow, and the other one is way too fast. Have you tried earlier models? E.g. after 50k steps the quality is usually on par with 300k steps. Also, did you produce the audio from a single input? The quality is better when providing shorter sentences (and later concatenating the wavs). You can actually adjust the speed with the alpha param (larger alpha is faster):
python gen_forward.py --alpha 1.2 --input_text 'this is whatever you want it to be' hifigan
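Conceptually, alpha rescales the predicted per-phoneme durations before the mel frames are generated. A sketch of how such a speed control could work (dividing by alpha is my assumption, chosen to match "larger alpha is faster"; it is not the repo's actual code):

```python
import numpy as np

def scale_durations(durations: np.ndarray, alpha: float) -> np.ndarray:
    """Rescale per-phoneme frame durations.

    Larger alpha -> shorter durations -> faster speech.
    """
    scaled = np.round(durations / alpha)
    # Never drop a phoneme entirely, even at high speed-ups.
    return np.maximum(scaled, 1).astype(int)
```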
I tried some model variations and found the following setup quite useful:
- Forward Tacotron configuration
- Model checkpoint 300k
- alpha 0.8
Here's a sample:
https://sndup.net/6py7
I've added the audio samples based on your repo on my comparison page (including link to your repo obviously):
https://twitter.com/ThorstenVoice/status/1454537933558620174?s=20
Cool, it sounds quite good with alpha=0.8. It seems to me, though, that you synthesized the full text at once? I really recommend splitting longer texts into single sentences, as the model has been trained on sentences (the FastPitch models in particular have trouble with longer sentences, as they have to distribute their attention more).
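Splitting a long text into sentences and joining the per-sentence wavs can be sketched like this (the regex splitter and the silence padding are my own illustrative choices; a real pipeline would call the synthesizer once per sentence and concatenate the resulting arrays):

```python
import re
import numpy as np

def split_sentences(text: str) -> list:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def concat_wavs(wavs, sample_rate=22050, pause_sec=0.3):
    """Join per-sentence audio arrays with a short silence between them."""
    pause = np.zeros(int(sample_rate * pause_sec), dtype=np.float32)
    pieces = []
    for i, wav in enumerate(wavs):
        if i > 0:
            pieces.append(pause)
        pieces.append(wav.astype(np.float32))
    return np.concatenate(pieces)
```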
You're right. I'll split longer texts, but in general I like the quality of the synthesized voice.
@cschaefer26 I've uploaded my model files and will share the link via Twitter. Do you have a Twitter account I should link to, or just your repo?
Hi, thanks for sharing the models, and glad you like the synth. I don't have a Twitter account; if you like you could link the repo :-)
PS: you could use the repo to gain some insight into your dataset. You can load the attention score dictionary, which provides a score for each file id measuring how sharp the Tacotron attention was. Low scores correspond to a mismatch between text and audio file. Copy-paste this into the main ForwardTacotron directory and execute:
from utils.files import unpickle_binary

if __name__ == '__main__':
    # Scores were pickled during preprocessing; one entry per file id.
    att_dict = unpickle_binary('data/att_score_dict.pkl')
    # Build (file_id, score) pairs and sort from best to worst score.
    id_score = [(k, v[1]) for k, v in att_dict.items()]
    id_score.sort(key=lambda x: -x[1])
    for file_id, score in id_score:
        print(file_id, score)