Rayhane-mamah commented:

Hello @toannhu, glad to see you got this working with such a small dataset! That's actually impressive!

So to answer your questions:

  1. Generation speed is mostly caused by not adding zoneout noise during inference, as reported by @Ondal90 here; he was also kind enough to provide a fix (a sketch follows this list). :)
    Considering you trained your model on a previous version of the repo, I would recommend you adopt @Ondal90's fix. Do not update your code to the current version if you want to keep using your currently trained model! (Changes were made to the graph construction, so the new version of the repo will not work with your pretrained model.)

  2. For mel spectrograms, the x-axis is time, the y-axis is frequency, and color encodes "normalized" dB amplitude. Typically, your model is doing well if the predicted and real spectrograms match. For speech, it is mostly the low-frequency information that must be recreated to achieve audible speech; the model also tends to learn how to recreate some of the high-frequency values as well. I believe this presentation is pretty straightforward: it explains well how spectrograms are computed and read, and even gives a few in-depth analyses.

  3. This repo's WaveNet can be trained, but results are not yet confirmed to be good. In particular, considering you are still using the older version of the repo, I would recommend you follow r9y9's live demo steps and combine your current model with @r9y9's WaveNet for a setup that is confirmed to work.

  4. (Yes, those are padding frames.) Those alignments are not weird, considering that the paddings of both inputs and outputs are not masked. The highlighted parts of yours are where the model is typically associating output paddings with the last input padding. Since at evaluation and synthesis time there is no padding in either the inputs or the outputs, these "strange" parts will not be present, so I believe they don't matter too much.
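
To illustrate point 1: a minimal sketch of a zoneout state update and where the fix applies, in NumPy pseudocode (names are illustrative, not this repo's API):

    import numpy as np

    def zoneout_update(h_prev, h_new, p=0.1, training=True, stochastic_inference=True):
        # Zoneout randomly "freezes" each unit at its previous value with
        # probability p. The reported fix keeps this stochastic behavior at
        # inference as well, instead of the deterministic expectation
        # (1 - p) * h_new + p * h_prev, which reportedly caused the overly
        # fast generation.
        if training or stochastic_inference:
            mask = (np.random.rand(*h_prev.shape) < p).astype(h_prev.dtype)
            return mask * h_prev + (1.0 - mask) * h_new
        return (1.0 - p) * h_new + p * h_prev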

Glad our work was able to help you; feel free to share samples if you like! :)

toannhu commented:

@Rayhane-mamah Here are a few synthesis samples:
https://soundcloud.com/dinh-toan-nhu/sets/tacotron-2
https://soundcloud.com/dinh-toan-nhu/speech-wav-00006-mel
Once again, thanks for your help :)
P.S.: About the mel spectrogram, I don't understand why the y-axis goes from 50 down to 0 mels as in the image below (sorry for my bad English, I don't know how to explain the problem well). Also, could you explain the oversmoothing in the predicted spectrogram?

[image: mel spectrogram with the y-axis running from 50 down to 0]

Rayhane-mamah commented:

@toannhu thanks for the samples!

For mel spectrograms, as mentioned in the paper, we model our mel spectrograms with 80 dimensions, i.e. for each mel frame we have a vector of shape [1 (one time step), 80], so the y-axis values run from 0 to 80.
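
In case it helps to reproduce the plots: a minimal way to display such an array with time on x and the 80 mel channels on y, using matplotlib only (the file name is hypothetical). Note that imshow defaults to origin='upper', which puts channel 0 at the top and makes the y-axis appear to run downward; that is likely the 50-to-0 effect in your image:

    import numpy as np
    import matplotlib.pyplot as plt

    mel = np.load('mel-prediction.npy')   # hypothetical file of shape [T, 80]
    plt.imshow(mel.T, aspect='auto', origin='lower', interpolation='none')
    plt.xlabel('Frames (time)')
    plt.ylabel('Mel channel (0-79)')
    plt.colorbar(label='Normalized amplitude (dB)')
    plt.show()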

As for the smoothness of the spectrogram, I assume you are referring to the blur? It is mainly caused by the fact that the model is minimizing the MSE, as mentioned in the T2 paper:
"This is due to the tendency of the predicted spectrograms to be oversmoothed and less detailed than the ground truth – a consequence of the squared error loss optimized by the feature prediction network."

I am not 100% sure about what makes the MSE the cause of these oversmoothed features, but it should be no problem for the WaveNet to be conditioned on mels generated using r=2, since we already know for a fact that r=3 works properly.
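
One intuition, for what it's worth: when several different frames are all plausible for the same input, the prediction minimizing squared error is their average, which is blurrier than any single plausible frame. A toy illustration (my own example, not from the paper):

    import numpy as np

    # Two equally plausible "sharp" target frames with energy in different bins.
    a = np.array([0.0, 1.0, 0.0, 0.0])
    b = np.array([0.0, 0.0, 1.0, 0.0])

    p = (a + b) / 2   # MSE-optimal constant prediction: [0, .5, .5, 0], smeared
    mse_avg = 0.5 * np.mean((p - a) ** 2) + 0.5 * np.mean((p - b) ** 2)    # 0.125
    mse_sharp = 0.5 * np.mean((a - a) ** 2) + 0.5 * np.mean((a - b) ** 2)  # 0.25
    print(mse_avg, mse_sharp)  # the blurry average beats either sharp guess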

gloriouskilka commented:

I noticed the same results with

outputs_per_step = 1
tacotron_batch_size = 64

while training the model on CPU:
[images: step-120 alignment, predicted and real mel spectrograms]

Another model training on GPU on the same data looks OK. The GPU hparams are the same except for:

outputs_per_step = 5
tacotron_batch_size = 32
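
For context, outputs_per_step is the reduction factor r: the decoder emits r mel frames per step, so going from r=5 to r=1 multiplies the number of decoder steps (and the attention and state tensors kept for backprop) by 5, which is why the r=1 run is so much heavier at a given batch size. A back-of-the-envelope sketch (my own arithmetic, not repo code):

    import math

    def decoder_steps(num_mel_frames, r):
        # The decoder runs once per r output frames.
        return math.ceil(num_mel_frames / r)

    frames = 800                      # ~10 s of audio at a 12.5 ms frame shift
    print(decoder_steps(frames, 1))   # 800 steps (the CPU run above, r=1)
    print(decoder_steps(frames, 5))   # 160 steps (the GPU run above, r=5)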

gloriouskilka commented:

> Also, could you tell me approximately how much memory the r=1, batch_size=64 run consumed? Thanks

@Rayhane-mamah It consumes ~10 GB of RAM. It OOMed a few hours ago. I have 24 GB of RAM total in the machine; I don't know how much it tried to allocate before the OOM. I'll make the batch size 32 now.

PS: I'm not the topic starter, I just noticed the same picture with alignment and noticed the difference in my hparams; maybe it'll help with the initial problem.

toannhu commented:

@Rayhane-mamah

  1. About my dataset: I formatted it exactly like LJSpeech (a wavs folder and metadata.csv), but I use basic_cleaners and changed _characters to 186 symbols (e.g. aáàạảã, etc.).
  2. The audio length varies from 1 s to 9 s, but is mostly 5 s to 7 s. I also trimmed all leading and trailing silence, and all silence below 50 dB inside the audio (see the sketch after this list). I'm sorry, but I couldn't understand what you meant by "there is too much padding, could you try optimizing that?" (I'm just a newbie in TTS o.0). Could you enlighten me on how to do this?
    Btw, I use the default hparams.py except for
    tacotron_batch_size = 8
    When I started training I got an OOM problem in the first few steps, just like @gloriouskilka said, but then it went away and the loss decreased very fast.
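
On the trimming mentioned in point 2, a minimal sketch of how that kind of silence trimming can be done with librosa (the paths are hypothetical; the 50 dB threshold mirrors the comment, and this is not the repo's preprocessing code). Tighter trimming also narrows the spread of utterance lengths, which directly reduces the padding added per batch:

    import librosa
    import soundfile as sf

    # sr=None keeps the file's native sample rate.
    y, sr = librosa.load('wavs/utt0001.wav', sr=None)
    # Drop leading/trailing audio more than 50 dB below the signal's peak.
    trimmed, _ = librosa.effects.trim(y, top_db=50)
    sf.write('wavs_trimmed/utt0001.wav', trimmed, sr)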

toannhu commented:

Btw, if I change all text inputs to lowercase and remove all !?,.'" characters, I think it will decrease the symbol set size. Will this help the model align?
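
For what it's worth, a minimal sketch of that kind of cleaning step (plain Python; the names are mine). The main benefit is a smaller symbol table, though dropping punctuation can also drop pause and intonation cues:

    import re

    PUNCT = re.compile("[!?,.'\"]")

    def clean(text):
        # Lowercase and strip punctuation so the symbol table stays small.
        return PUNCT.sub('', text.lower())

    print(clean("Xin chào, Thế giới!"))  # -> 'xin chào thế giới'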

toannhu commented:

Thanks. I'll try it and report back to you later 😄

al3chen commented:

On my GTX 1080 (8 GB), for LJSpeech:
outputs_per_step=2 and batch_size=32: good after 30k steps.
outputs_per_step=1 and batch_size=32: OOM.
outputs_per_step=1 and batch_size=16: fails after 30k steps.

begeekmyfriend commented:

r==2 is acceptable for me. I do not think there is any difference in synthesized audio quality between r==1 and r==2.

toannhu commented:

First of all, thank you everyone for helping me out. I succeeded in training on my own Vietnamese corpus (GTX 1060 6 GB with r=2 and batch_size=16; it OOMed for a few steps, then alignments showed up around step 16000).
[images: alignments at steps 16000 and 16500]
I have some questions I want to ask you guys:

  1. The synthesized sound is good and natural, but it seems a little bit faster than the original data. Is that because of the attention module? Also, what is the difference between content-based attention and location-based attention, and which attention is used in this Tacotron 2 implementation? (See the sketch after this list.)
  2. What information can I read from the mel spectrograms here, and how do I compare the two spectrograms?
    [images: real and predicted mel spectrograms at step 258000]
  3. Is the WaveNet vocoder in this repo ready for training? I'm eager to hear the results synthesized from Tacotron 2 + WaveNet vocoder.
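
On the attention question in point 1: this implementation uses location-sensitive (hybrid) attention, as in the Tacotron 2 paper. Content-based attention scores the decoder state only against the encoder outputs; location-sensitive attention additionally feeds in features convolved from the previous step's alignments, which encourages the attention to move forward monotonically. A NumPy sketch of just the scoring (shapes and names are illustrative, not the repo's API):

    import numpy as np

    def location_sensitive_alignments(query, keys, prev_alignments,
                                      conv_filters, W_q, W_k, W_f, v):
        # query:           [d_q]        current decoder (attention RNN) state
        # keys:            [T_in, d_k]  encoder outputs
        # prev_alignments: [T_in]       attention weights from the previous step
        # Location features: convolve the previous alignments. Pure
        # content-based attention would skip this term and score the query
        # against the keys alone.
        f = np.stack([np.convolve(prev_alignments, k, mode='same')
                      for k in conv_filters], axis=-1)               # [T_in, n_filters]
        energies = np.tanh(query @ W_q + keys @ W_k + f @ W_f) @ v   # [T_in]
        e = np.exp(energies - energies.max())
        return e / e.sum()                                           # new alignments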

toannhu commented:

BTW, I get some strange alignments at some steps, shown here. Are these padding frames?
[images: two alignment plots]

butterl commented:

outputs_per_step = 3 with batch=32 gives fast alignment for me, and a bigger outputs_per_step led to faster alignment but slower loss reduction.

toannhu commented:

@Rayhane-mamah I tried to integrate Tacotron 2 with r9y9's WaveNet vocoder using his Google Colab code, but it failed: the pitch has been lost. Btw, I used batch_size = 16 and r=2 in Tacotron 2 and batch_size = 1 in WaveNet; everything else is default. The WaveNet repo was trained on the original audio, not on GTA features. Here are the results.
sound.zip

Rayhane-mamah commented:

@toannhu we fixed our WaveNet implementation, in case you want to give it a go. You can also train WaveNet on GTA features for the best audio quality; quality problems coming from WaveNet mostly come down to that. You may also want to train the WaveNet for 1 million steps if using "raw" mode.
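
For anyone following along: training WaveNet on GTA (ground-truth-aligned) mels means first running the trained Tacotron in teacher-forced mode to dump its predicted mels, then training WaveNet on those instead of ground-truth features. A sketch of the sequence; the flag names are my recollection of this repo's README and should be verified against your checkout:

    # Hedged sketch; verify flags against synthesize.py/train.py in your version.
    python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
    python train.py --model='WaveNet'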

Feel free to reopen the issue if needed.
