Hi Thorsten, first of all thanks for the nice dataset. Out of curiosi

Forward Tacotron Model about thorsten-voice HOT 18 CLOSED

thorstenmueller commented on May 28, 2024 2

Forward Tacotron Model

from thorsten-voice.

Comments (18)

cschaefer26 commented on May 28, 2024 2

Hi, so I trained a new model on the 8k dataset. My impression is that the prosody is more consistent, leading to a quicker build-up of attention for the Tacotron model (which I need for extracting durations for the FastPitch model). Here are samples for comparison (vocoded with universal HiFi-GAN):

fastpitch old dataset
fastpitch new dataset
coqui

Here is the model in case you want to play around with it: link

I also linked it in the repo: https://github.com/as-ideas/ForwardTacotron#pretrained-models (follow the README if you want to use the model)

cschaefer26 commented on May 28, 2024 1

Hi, glad you like it. I trained on the stable dataset. I just read that the 8k recordings are of higher quality? I'll try those out. I will play around a bit with the settings to see if I can increase the quality and share the best model.

cschaefer26 commented on May 28, 2024 1

Absolutely, maybe the tts models need to get better :-)

One more thing. I also found that correct phonemization plays a huge role in TTS prosody. I might retrain the model with a different phonemizer (https://github.com/as-ideas/DeepPhonemizer). The espeak phonemizer sucks in German.

Again, thanks for your recordings, it's a nice dataset to do some TTS research. I may get into emotional TTS on it soon.

thorstenMueller commented on May 28, 2024

Hi @cschaefer26

I'm really impressed, I think it sounds great. 👏 👍

Do you have a website, post or article that I can link to? Adding a "Thorsten" phrase/sample here or here would really be great.

Did you train on the "stable" dataset or on the "recording-in-progress" dataset with 8k recordings?

Making your model/config public would be really nice.

Thanks for your effort and sharing. 😄

thorstenMueller commented on May 28, 2024

Thanks again @cschaefer26 for playing around with my recording-in-progress dataset and adding it to the README.
As I am a little bit "betriebstaub" (too used to my own recordings to judge them objectively), what is your opinion on the naturalness of the speech flow in these two versions?

I'd say the new dataset has a more natural flow, but the synthesized audio has the same length as the version with the old dataset.

My first recordings were spoken too fast (23 characters per second), and I now try to average around 16 characters per second.
See my video here: https://youtu.be/mlsYnDw71vc?t=523
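
The speaking rate mentioned above can be checked per recording. A minimal sketch, assuming each clip's transcript and duration in seconds are already known; the function names, the example clips and the 20 chars/sec cutoff are illustrative and not taken from the dataset tooling:

```python
def chars_per_second(text: str, duration_s: float) -> float:
    """Speaking rate as transcript characters per second of audio."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(text) / duration_s


def split_by_rate(clips, max_rate):
    """Partition (file_id, text, duration_s) tuples by speaking rate."""
    kept, too_fast = [], []
    for file_id, text, duration_s in clips:
        rate = chars_per_second(text, duration_s)
        # Clips above the cutoff are collected separately for review.
        (too_fast if rate > max_rate else kept).append((file_id, rate))
    return kept, too_fast


# Hypothetical clips: (file id, transcript, duration in seconds).
clips = [
    ("clip_a", "Guten Morgen zusammen.", 1.0),
    ("clip_b", "Das ist ein Test.", 1.0),
]
kept, too_fast = split_by_rate(clips, max_rate=20.0)
```

Counting characters is only a rough proxy for pace, since pauses and numerals in the transcript skew it, but it is enough to rank clips against each other.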

cschaefer26 commented on May 28, 2024

Hi, I didn't manually check the whole dataset, so it is more of a guess. Here is a ground truth example of a speed fluctuation (the word 'Solidarität' is quicker than the rest):

https://drive.google.com/file/d/1eEAyT91LLQLBc-5OtSJ4IHA8soBNQVrz/view?usp=sharing

In my experience, it's best for the ML training if the speed is pretty constant. It's probably some kind of double-edged sword to be consistent but not robotic...

thorstenMueller commented on May 28, 2024

I know what you mean. Speaking at a very consistent pace can sound bored. Speaking too enthusiastically might be unusable for TTS training. Finding the right balance is not easy.

thorstenMueller commented on May 28, 2024

Hi @cschaefer26 👋
I've trained a "forward_tacotron" model to the end of 300k steps with your repo, and with the universal HiFi-GAN vocoder the quality is really good. But the speech is way too fast, or the breaks between words and between sentences are too short. Do you have any ideas on how to improve this?

Just in case, I'll remove the fastest recordings from the dataset and start a new training.

https://sndup.net/39xm

cschaefer26 commented on May 28, 2024

Hi, yeah I have encountered that problem before with ForwardTacotron for some datasets. Seems to me that the duration predictor is somehow overfitting. You could either take an earlier model (e.g. 50k steps, which should be fine) or train a model with model_type: 'fast_pitch' (in the config) - the FastPitch models didn't show this behaviour yet.

thorstenMueller commented on May 28, 2024

An earlier checkpoint could be worth a try. A FastPitch training is running already.

thorstenMueller commented on May 28, 2024

Here's a sample from a FastPitch model (300k) and HifiGAN. It's too slow and unnatural.
A speed between FastPitch and ForwardTacotron would be nice :-).

https://sndup.net/2q9j

cschaefer26 commented on May 28, 2024

Hi, yeah, that's too slow and the other one is way too fast. Have you tried earlier models? E.g. after 50k steps the quality is usually on par with 300k steps. Also, did you produce the audio with a single input? The quality is better when providing shorter sentences (and later concatenating the wavs). You can actually adjust the speed with the alpha param (larger alpha is faster):

python gen_forward.py --alpha 1.2 --input_text 'this is whatever you want it to be' hifigan
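
How alpha changes the pace can be sketched as follows. This illustrates a length regulator under the assumption that the predicted per-phoneme durations (in frames) are divided by alpha before rounding; it is not code from the ForwardTacotron repo, and the function name and example values are made up:

```python
def regulate_lengths(encodings, durations, alpha=1.0):
    """Repeat each phoneme encoding for its duration scaled by 1/alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    frames = []
    for enc, dur in zip(encodings, durations):
        # Larger alpha means fewer frames per phoneme, i.e. faster speech.
        n_frames = max(int(round(dur / alpha)), 0)
        frames.extend([enc] * n_frames)
    return frames


phonemes = ["h", "a", "l", "o"]
durations = [4.0, 8.0, 4.0, 8.0]  # hypothetical predicted frames per phoneme

normal = regulate_lengths(phonemes, durations, alpha=1.0)
faster = regulate_lengths(phonemes, durations, alpha=2.0)
slower = regulate_lengths(phonemes, durations, alpha=0.8)
```

Under this assumption, alpha below 1 stretches every phoneme by the same factor, so it slows the speech down without changing the relative timing between phonemes.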

thorstenMueller commented on May 28, 2024

I tried some model variations and found the following setup quite useful:

Forward Tacotron configuration
Model checkpoint 300k
alpha 0.8

Here's a sample:
https://sndup.net/6py7

thorstenMueller commented on May 28, 2024

I've added the audio samples based on your repo on my comparison page (including link to your repo obviously):
https://twitter.com/ThorstenVoice/status/1454537933558620174?s=20

cschaefer26 commented on May 28, 2024

Cool, sounds quite good with alpha=0.8. Seems to me though that you synthed the full thing at once? I really recommend splitting longer texts into single sentences as the model has been trained on sentences (especially the fastpitch models have trouble with longer sentences as they have to distribute their attention more).
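
Splitting and concatenating could look roughly like this. A sketch only: `synthesize` is a hypothetical stand-in for whatever function returns the audio samples for one sentence, the regex is a naive splitter that will trip over abbreviations, and the sample rate and pause length are arbitrary choices:

```python
import re


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_long_text(text, synthesize, sample_rate=22050, pause_s=0.25):
    """Synthesize sentence by sentence and join with short silences."""
    silence = [0.0] * int(sample_rate * pause_s)
    audio = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            audio.extend(silence)  # pause between sentences only
        audio.extend(synthesize(sentence))
    return audio


def fake_synthesize(sentence):
    # Placeholder for a real model call: one sample per character.
    return [0.1] * len(sentence)


sentences = split_sentences("Hallo Welt. Wie geht es dir? Gut!")
audio = synthesize_long_text(
    "Hallo Welt. Gut!", fake_synthesize, sample_rate=100, pause_s=0.5
)
```

Inserting an explicit silence between sentences also gives direct control over the pause length, which was part of the pacing complaint earlier in the thread.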

thorstenMueller commented on May 28, 2024

You're right. I'll split longer texts, but in general I like the quality of the synthesized voice.

thorstenMueller commented on May 28, 2024

@cschaefer26 I've uploaded my model files and will share the link via Twitter. Do you have a Twitter account I should link to, or just your repo?

cschaefer26 commented on May 28, 2024

Hi, thanks for sharing the models and glad you like the synth. I don't have a Twitter account, if you like you could link the repo :-)

PS: you could use the repo to gain some insight into your dataset. You can load the attention score dictionary, which provides for each file id a score that measures how sharp the Tacotron attention was. Low scores correspond to a mismatch between text and audio file. Copy and paste this into the main ForwardTacotron directory and execute:

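from utils.files import unpickle_binary

if __name__ == '__main__':
    # Maps each file id to its attention scores; index 1 is used as the score here.
    att_dict = unpickle_binary('data/att_score_dict.pkl')
    id_score = [(k, v[1]) for k, v in att_dict.items()]
    # Sort by score, highest first, so the likely mismatches end up at the bottom.
    id_score.sort(key=lambda x: -x[1])
    for file_id, score in id_score:  # file_id avoids shadowing the builtin id()
        print(file_id, score)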
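
To surface the likely mismatches first, the same data can be sorted ascending and cut at a threshold. A sketch, assuming the dictionary has the shape the snippet above implies (file id mapped to a sequence whose second entry is the score); the function name, threshold and example values are made up:

```python
def low_score_ids(att_dict, threshold):
    """Return (file_id, score) pairs below threshold, lowest score first."""
    scored = [(file_id, values[1]) for file_id, values in att_dict.items()]
    flagged = [(file_id, score) for file_id, score in scored if score < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


# Made-up data in the assumed shape: file id -> (other score, sharpness score).
example = {
    "rec_001": (0.91, 0.82),
    "rec_002": (0.40, 0.31),
    "rec_003": (0.75, 0.55),
}
suspects = low_score_ids(example, threshold=0.6)
```

A sensible threshold depends on the dataset, so it is probably easiest to listen to a handful of the lowest-scoring files before deciding where to cut.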
