Comments (18)

cschaefer26 commented on May 28, 2024

Hi, so I trained a new model on the 8k dataset. My impression is that the prosody is more consistent, leading to a quicker build-up of attention for the Tacotron model (which I need for extracting durations for the FastPitch model). Here are samples for comparison (vocoded with universal HiFi-GAN):

fastpitch old dataset
fastpitch new dataset
coqui

Here is the model in case you want to play around with it: link

I also linked it in the repo: https://github.com/as-ideas/ForwardTacotron#pretrained-models (follow the README if you want to use the model)

from thorsten-voice.

cschaefer26 commented on May 28, 2024

Hi, glad you like it. I trained on the stable dataset. I just read that the 8k recordings are of higher quality? I'll try those out. I will play around a bit with the settings to see if I can increase the quality a bit and share the best model.

cschaefer26 commented on May 28, 2024

Absolutely, maybe the tts models need to get better :-)

One more thing. I also found that correct phonemization plays a huge role in TTS prosody. I might retrain the model with a different phonemizer (https://github.com/as-ideas/DeepPhonemizer). The espeak phonemizer sucks in German.

Again, thanks for your recordings, it's a nice dataset to do some TTS research on. I may get into emotional TTS on it soon.

thorstenMueller commented on May 28, 2024

Hi @cschaefer26

I'm really impressed, as I think it sounds great. 👏 👍

Do you have a website, post or article I can link to? Adding a "Thorsten" phrase/sample here or here would really be great.

Did you train on the "stable" dataset or on the "recording-in-progress" dataset with 8k recordings?

Making your model/config public would be really nice.

Thanks for your effort and sharing. 😄

thorstenMueller commented on May 28, 2024

Thanks again @cschaefer26 for playing around with my recording-in-progress dataset and adding it to the README.
As I am a little bit "betriebstaub" (too close to my own recordings to judge them), what is your opinion on the naturalness of the speech flow in these two versions?

I'd say the new dataset has a more natural flow, but the synthesized audio has the same length as the version with the old dataset.
[Screenshot, 2021-10-01 20:53:29]

My first recordings were spoken too fast (23 chars/sec); I now try to speak at an average of around 16 characters per second.
See my video here: https://youtu.be/mlsYnDw71vc?t=523
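
For anyone who wants to check the speaking rate of their own recordings, here is a minimal sketch of the chars/sec figure mentioned above. It assumes the rate is simply transcript length divided by audio duration; the function name and the example pairs are made up for illustration and are not taken from the dataset.

```python
def chars_per_second(text: str, duration_seconds: float) -> float:
    """Return the speaking rate of one recording in characters per second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(text) / duration_seconds


# Hypothetical (transcript, duration in seconds) pairs, not real dataset entries.
recordings = [
    ("Guten Morgen, wie geht es dir?", 1.5),
    ("Das Wetter ist heute schön.", 1.8),
]

for text, duration in recordings:
    print(f"{chars_per_second(text, duration):.1f} chars/sec: {text}")
```

Whether spaces and punctuation should count as characters is a choice; this sketch counts everything in the transcript.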

cschaefer26 commented on May 28, 2024

Hi, I didn't manually check the whole dataset, so it is more of a guess. Here is a ground truth example of a speed fluctuation (the word 'Solidarität' is quicker than the rest):

https://drive.google.com/file/d/1eEAyT91LLQLBc-5OtSJ4IHA8soBNQVrz/view?usp=sharing

In my experience, it's best for ML training if the speed is pretty constant. It's probably some kind of double-edged sword to be consistent but not robotic...

thorstenMueller commented on May 28, 2024

I know what you mean. Speaking at a very consistent pace can sound bored. Speaking too enthusiastically might be unusable for TTS training. Finding the right balance is not easy.

thorstenMueller commented on May 28, 2024

Hi @cschaefer26 👋
I've trained a "forward_tacotron" model to the end of 300k steps with your repo, and with the universal HiFi-GAN vocoder the quality is really good. But the speech is way too fast, or the breaks between words and separate sentences are too short. Do you have any ideas on how to improve this?

Just in case, I'll remove the fastest recordings from the dataset and start a new training.

https://sndup.net/39xm

cschaefer26 commented on May 28, 2024

Hi, yeah I have encountered that problem before with ForwardTacotron for some datasets. Seems to me that the duration predictor is somehow overfitting. You could either take an earlier model (e.g. 50k steps, which should be fine) or train a model with model_type: 'fast_pitch' (in the config) - the FastPitch models didn't show this behaviour yet.

thorstenMueller commented on May 28, 2024

An earlier checkpoint could be worth a try. A FastPitch training is running already.

thorstenMueller commented on May 28, 2024

Here's a sample from a FastPitch model (300k) and HifiGAN. It's too slow and unnatural.
A speed between FastPitch and ForwardTacotron would be nice :-).

https://sndup.net/2q9j

cschaefer26 commented on May 28, 2024

Hi, yeah that's too slow and the other one is way too fast. Have you tried earlier models? E.g. after 50k steps the quality is usually on par with 300k steps. Also, did you produce the audio with a single input? The quality is better when providing shorter sentences (and later concatenating the wavs). You can actually adjust the speed with the alpha param (larger alpha is faster):

python gen_forward.py --alpha 1.2 --input_text 'this is whatever you want it to be' hifigan
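
To illustrate why a larger alpha gives faster speech, here is a rough sketch. It assumes alpha acts as a divisor on the predicted per-phoneme durations, which is consistent with "larger alpha is faster" but is not confirmed in this thread; `scale_durations` and the numbers are hypothetical and not part of the repo.

```python
def scale_durations(durations, alpha=1.0):
    """Divide each predicted duration (in frames) by alpha, rounded to whole frames."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Keep at least one frame per phoneme so nothing is dropped entirely.
    return [max(1, round(d / alpha)) for d in durations]


predicted = [4, 8, 12]  # hypothetical per-phoneme durations in mel frames
faster = scale_durations(predicted, alpha=1.2)
slower = scale_durations(predicted, alpha=0.8)

# Total frame count shrinks for alpha > 1 and grows for alpha < 1.
print(sum(predicted), sum(faster), sum(slower))
```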

thorstenMueller commented on May 28, 2024

I tried some model variations and found the following setup quite useful:

  • Forward Tacotron configuration
  • Model checkpoint 300k
  • alpha 0.8

Here's a sample:
https://sndup.net/6py7

thorstenMueller commented on May 28, 2024

I've added the audio samples based on your repo on my comparison page (including link to your repo obviously):
https://twitter.com/ThorstenVoice/status/1454537933558620174?s=20

cschaefer26 commented on May 28, 2024

Cool, sounds quite good with alpha=0.8. Seems to me though that you synthesized the full thing at once? I really recommend splitting longer texts into single sentences, as the model has been trained on sentences (the FastPitch models especially have trouble with longer sentences, as they have to distribute their attention more).
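
A minimal sketch of the split-and-concatenate approach, assuming a naive split on sentence-ending punctuation. `synthesize` is a hypothetical callable that turns one sentence into a list of audio samples (it stands in for whatever the repo's generation step returns), and `fake_synthesize` only exists to make the example runnable.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_long_text(text, synthesize, pause_samples=0):
    """Synthesize each sentence separately and join the resulting samples.

    `synthesize` is a hypothetical callable: sentence -> list of samples.
    `pause_samples` zeros are inserted between consecutive sentences.
    """
    output = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            output.extend([0.0] * pause_samples)
        output.extend(synthesize(sentence))
    return output


def fake_synthesize(sentence):
    # Stand-in for a real model call: one dummy sample per character.
    return [0.1] * len(sentence)


text = "Hallo Welt. Wie geht es dir? Gut!"
print(split_sentences(text))
print(len(synthesize_long_text(text, fake_synthesize, pause_samples=5)))
```

The regex will also split after abbreviations such as "z. B.", so real text may need a proper sentence tokenizer.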

thorstenMueller commented on May 28, 2024

You're right. I'll split longer texts, but in general I like the quality of the synthesized voice.

thorstenMueller commented on May 28, 2024

@cschaefer26 I've uploaded my model files and will share the link via Twitter. Do you have a Twitter account I should link to, or just your repo?

cschaefer26 commented on May 28, 2024

Hi, thanks for sharing the models and glad you like the synth. I don't have a Twitter account, if you like you could link the repo :-)

PS: you could use the repo to gain some insight into your dataset. You can load the attention score dictionary, which provides for each file id a score that measures how sharp the Tacotron attention was. Low scores correspond to a mismatch between text and audio file. Copy and paste this into the main ForwardTacotron directory and execute it:

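from utils.files import unpickle_binary

if __name__ == '__main__':
    # Load the attention score dictionary (file id -> scores).
    att_dict = unpickle_binary('data/att_score_dict.pkl')
    # Keep the score stored at index 1 for each file id.
    id_score = [(file_id, scores[1]) for file_id, scores in att_dict.items()]
    # Sort from highest to lowest score, so likely mismatches end up at the bottom.
    id_score.sort(key=lambda pair: pair[1], reverse=True)
    for file_id, score in id_score:
        print(file_id, score)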
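
If only the most suspicious files are of interest, the same idea can be turned around to list the lowest scores first. This is a sketch under the assumption that each dictionary value is a sequence with the relevant score at index 1, as in the snippet above; `lowest_scoring` and the example entries are hypothetical, and real data would come from `unpickle_binary('data/att_score_dict.pkl')`.

```python
def lowest_scoring(att_dict, n=10):
    """Return the n (file_id, score) pairs with the lowest attention score."""
    id_score = [(file_id, value[1]) for file_id, value in att_dict.items()]
    id_score.sort(key=lambda pair: pair[1])  # ascending: worst alignment first
    return id_score[:n]


# Hypothetical entries for illustration only.
example = {
    "file_a": (0.9, 0.82),
    "file_b": (0.4, 0.31),
    "file_c": (0.7, 0.65),
}

for file_id, score in lowest_scoring(example, n=2):
    print(file_id, score)
```

The files returned here are candidates for a manual listen, not automatic removal.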
