Rayhane-mamah commented:

Hello @toannhu, glad to see you got this working with such a small dataset! That's actually impressive!

So to answer your questions:

  1. Generation speed is mostly caused by not adding zoneout noise during inference, as reported by @Ondal90 here; he was also kind enough to provide a fix (a sketch follows this list). :)
    Considering you trained your model on a previous version of the repo, I would recommend you adopt @Ondal90's fix. Do not update your code to the current version if you want to keep using your currently trained model! (Changes were made to the graph construction, so the new version of the repo will not work with your pretrained model.)

  2. For mel spectrograms, the x-axis is time, the y-axis is frequency, and color encodes "normalized" dB amplitude. Typically, your model is doing well if the predicted and real spectrograms match. For speech, it is mostly the low-frequency information that must be recreated to achieve audible speech; the model also tends to learn how to recreate some of the high-frequency values as well. I believe this presentation is pretty straightforward: it explains well how spectrograms are computed and read, and even gives a few in-depth analyses.

  3. This repo's WaveNet can be trained, but results are not yet confirmed to be good. In particular, considering you are still using the older version of the repo, I would recommend you follow r9y9's live demo steps and combine your current model with @r9y9's WaveNet for a setup that is confirmed to work.

  4. (Yes, those are padding frames.) Those alignments are not weird, considering that the paddings of both inputs and outputs are not masked. The highlighted parts of yours are where the model is typically associating output paddings with the last input padding. Since at evaluation and synthesis time there is no padding in either the inputs or the outputs, these "strange" parts will not be present, so I believe they don't matter too much.
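
To illustrate point 1: a minimal sketch of a zoneout state update and where the fix applies, in NumPy pseudocode (names are illustrative, not this repo's API):

    import numpy as np

    def zoneout_update(h_prev, h_new, p=0.1, training=True, stochastic_inference=True):
        # Zoneout randomly "freezes" each unit at its previous value with
        # probability p. The reported fix keeps this stochastic behavior at
        # inference as well, instead of the deterministic expectation
        # (1 - p) * h_new + p * h_prev, which reportedly caused the overly
        # fast generation.
        if training or stochastic_inference:
            mask = (np.random.rand(*h_prev.shape) < p).astype(h_prev.dtype)
            return mask * h_prev + (1.0 - mask) * h_new
        return (1.0 - p) * h_new + p * h_prev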

Glad our work was able to help you; feel free to share samples if you like! :)

toannhu commented:

@Rayhane-mamah Here are a few synthesis samples:
https://soundcloud.com/dinh-toan-nhu/sets/tacotron-2
https://soundcloud.com/dinh-toan-nhu/speech-wav-00006-mel
Once again, thanks for your help :)
P.S.: About the mel spectrogram, I don't understand why the y-axis goes from 50 down to 0 mels as in the image below (sorry for my bad English, I don't know how to explain the problem well). Also, could you explain the oversmoothing in the predicted spectrogram?

[image: mel spectrogram with the y-axis running from 50 down to 0]

Rayhane-mamah commented:

@toannhu thanks for the samples!

For mel spectrograms, as mentioned in the paper, we model our mel spectrograms with 80 dimensions, i.e. for each mel frame we have a vector of shape [1 (one time step), 80], so the y-axis values run from 0 to 80.
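
In case it helps to reproduce the plots: a minimal way to display such an array with time on x and the 80 mel channels on y, using matplotlib only (the file name is hypothetical). Note that imshow defaults to origin='upper', which puts channel 0 at the top and makes the y-axis appear to run downward; that is likely the 50-to-0 effect in your image:

    import numpy as np
    import matplotlib.pyplot as plt

    mel = np.load('mel-prediction.npy')   # hypothetical file of shape [T, 80]
    plt.imshow(mel.T, aspect='auto', origin='lower', interpolation='none')
    plt.xlabel('Frames (time)')
    plt.ylabel('Mel channel (0-79)')
    plt.colorbar(label='Normalized amplitude (dB)')
    plt.show()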

As for the smoothness of the spectrogram, I assume you are referring to the blur? It is mainly caused by the fact that the model is minimizing the MSE, as mentioned in the T2 paper:
"This is due to the tendency of the predicted spectrograms to be oversmoothed and less detailed than the ground truth – a consequence of the squared error loss optimized by the feature prediction network."

I am not 100% sure about what makes the MSE the cause of these oversmoothed features, but it should be no problem for the WaveNet to be conditioned on mels generated using r=2, since we already know for a fact that r=3 works properly.
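
One intuition, for what it's worth: when several different frames are all plausible for the same input, the prediction minimizing squared error is their average, which is blurrier than any single plausible frame. A toy illustration (my own example, not from the paper):

    import numpy as np

    # Two equally plausible "sharp" target frames with energy in different bins.
    a = np.array([0.0, 1.0, 0.0, 0.0])
    b = np.array([0.0, 0.0, 1.0, 0.0])

    p = (a + b) / 2   # MSE-optimal constant prediction: [0, .5, .5, 0], smeared
    mse_avg = 0.5 * np.mean((p - a) ** 2) + 0.5 * np.mean((p - b) ** 2)    # 0.125
    mse_sharp = 0.5 * np.mean((a - a) ** 2) + 0.5 * np.mean((a - b) ** 2)  # 0.25
    print(mse_avg, mse_sharp)  # the blurry average beats either sharp guess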

gloriouskilka commented:

I noticed the same results with

outputs_per_step = 1
tacotron_batch_size = 64

while training the model on CPU:
[images: step-120 alignment, predicted and real mel spectrograms]

Another model training on GPU on the same data looks OK. The GPU hparams are the same except for:

outputs_per_step = 5
tacotron_batch_size = 32
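
For context, outputs_per_step is the reduction factor r: the decoder emits r mel frames per step, so going from r=5 to r=1 multiplies the number of decoder steps (and the attention and state tensors kept for backprop) by 5, which is why the r=1 run is so much heavier at a given batch size. A back-of-the-envelope sketch (my own arithmetic, not repo code):

    import math

    def decoder_steps(num_mel_frames, r):
        # The decoder runs once per r output frames.
        return math.ceil(num_mel_frames / r)

    frames = 800                      # ~10 s of audio at a 12.5 ms frame shift
    print(decoder_steps(frames, 1))   # 800 steps (the CPU run above, r=1)
    print(decoder_steps(frames, 5))   # 160 steps (the GPU run above, r=5)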

gloriouskilka commented:

> Also, could you tell me approximately how much memory the r=1, batch_size=64 run consumed? Thanks

@Rayhane-mamah It consumes ~10 GB of RAM. It OOMed a few hours ago. I have 24 GB of RAM total in the machine; I don't know how much it tried to allocate before the OOM. I'll make the batch size 32 now.

PS: I'm not the topic starter, I just noticed the same picture with alignment and noticed the difference in my hparams; maybe it'll help with the initial problem.

toannhu commented:

@Rayhane-mamah

  1. About my dataset: I formatted it exactly like LJSpeech (a wavs folder and metadata.csv), but I use basic_cleaners and changed _characters to 186 symbols (e.g. aáàạảã, etc.).
  2. The audio length varies from 1 s to 9 s, but is mostly 5 s to 7 s. I also trimmed all leading and trailing silence, and all silence below 50 dB inside the audio (see the sketch after this list). I'm sorry, but I couldn't understand what you meant by "there is too much padding, could you try optimizing that?" (I'm just a newbie in TTS o.0). Could you enlighten me on how to do this?
    Btw, I use the default hparams.py except for
    tacotron_batch_size = 8
    When I started training I got an OOM problem in the first few steps, just like @gloriouskilka said, but then it went away and the loss decreased very fast.
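
On the trimming mentioned in point 2, a minimal sketch of how that kind of silence trimming can be done with librosa (the paths are hypothetical; the 50 dB threshold mirrors the comment, and this is not the repo's preprocessing code). Tighter trimming also narrows the spread of utterance lengths, which directly reduces the padding added per batch:

    import librosa
    import soundfile as sf

    # sr=None keeps the file's native sample rate.
    y, sr = librosa.load('wavs/utt0001.wav', sr=None)
    # Drop leading/trailing audio more than 50 dB below the signal's peak.
    trimmed, _ = librosa.effects.trim(y, top_db=50)
    sf.write('wavs_trimmed/utt0001.wav', trimmed, sr)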

toannhu commented:

Btw, if I change all text inputs to lowercase and remove all !?,.'" characters, I think it will decrease the symbol set size. Will this help the model align?
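
For what it's worth, a minimal sketch of that kind of cleaning step (plain Python; the names are mine). The main benefit is a smaller symbol table, though dropping punctuation can also drop pause and intonation cues:

    import re

    PUNCT = re.compile("[!?,.'\"]")

    def clean(text):
        # Lowercase and strip punctuation so the symbol table stays small.
        return PUNCT.sub('', text.lower())

    print(clean("Xin chào, Thế giới!"))  # -> 'xin chào thế giới'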

toannhu commented:

Thanks. I'll try it and report back to you later 😄

al3chen commented:

On my GTX 1080 (8 GB), for LJSpeech:
outputs_per_step=2 and batch_size=32: good after 30k steps.
outputs_per_step=1 and batch_size=32: OOM.
outputs_per_step=1 and batch_size=16: fails after 30k steps.

begeekmyfriend commented:

r==2 is acceptable for me. I do not think there is any difference in synthesized audio quality between r==1 and r==2.

toannhu commented:

First of all, thank you everyone for helping me out. I succeeded in training on my own Vietnamese corpus (GTX 1060 6 GB with r=2 and batch_size=16; it OOMed for a few steps, then alignments showed up around step 16000).
[images: alignments at steps 16000 and 16500]
I have some questions I want to ask you guys:

  1. The synthesized sound is good and natural, but it seems a little bit faster than the original data. Is that because of the attention module? Also, what is the difference between content-based attention and location-based attention, and which attention is used in this Tacotron 2 implementation? (See the sketch after this list.)
  2. What information can I read from the mel spectrograms here, and how do I compare the two spectrograms?
    [images: real and predicted mel spectrograms at step 258000]
  3. Is the WaveNet vocoder in this repo ready for training? I'm eager to hear the results synthesized from Tacotron 2 + WaveNet vocoder.
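
On the attention question in point 1: this implementation uses location-sensitive (hybrid) attention, as in the Tacotron 2 paper. Content-based attention scores the decoder state only against the encoder outputs; location-sensitive attention additionally feeds in features convolved from the previous step's alignments, which encourages the attention to move forward monotonically. A NumPy sketch of just the scoring (shapes and names are illustrative, not the repo's API):

    import numpy as np

    def location_sensitive_alignments(query, keys, prev_alignments,
                                      conv_filters, W_q, W_k, W_f, v):
        # query:           [d_q]        current decoder (attention RNN) state
        # keys:            [T_in, d_k]  encoder outputs
        # prev_alignments: [T_in]       attention weights from the previous step
        # Location features: convolve the previous alignments. Pure
        # content-based attention would skip this term and score the query
        # against the keys alone.
        f = np.stack([np.convolve(prev_alignments, k, mode='same')
                      for k in conv_filters], axis=-1)               # [T_in, n_filters]
        energies = np.tanh(query @ W_q + keys @ W_k + f @ W_f) @ v   # [T_in]
        e = np.exp(energies - energies.max())
        return e / e.sum()                                           # new alignments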

toannhu commented:

BTW, I get some strange alignments at some steps, shown here. Are these padding frames?
[images: two alignment plots]

butterl commented:

outputs_per_step = 3 with batch=32 gives fast alignment for me, and a bigger outputs_per_step led to faster alignment but slower loss reduction.

toannhu commented:

@Rayhane-mamah I tried to integrate Tacotron 2 with r9y9's WaveNet vocoder using his Google Colab code, but it failed: the pitch has been lost. Btw, I used batch_size = 16 and r=2 in Tacotron 2 and batch_size = 1 in WaveNet; everything else is default. The WaveNet repo was trained on the original audio, not on GTA features. Here are the results.
sound.zip

Rayhane-mamah commented:

@toannhu we fixed our WaveNet implementation, in case you want to give it a go. You can also train WaveNet on GTA features for the best audio quality; quality problems coming from WaveNet mostly come down to that. You may also want to train the WaveNet for 1 million steps if using "raw" mode.
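
For anyone following along: training WaveNet on GTA (ground-truth-aligned) mels means first running the trained Tacotron in teacher-forced mode to dump its predicted mels, then training WaveNet on those instead of ground-truth features. A sketch of the sequence; the flag names are my recollection of this repo's README and should be verified against your checkout:

    # Hedged sketch; verify flags against synthesize.py/train.py in your version.
    python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
    python train.py --model='WaveNet'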

Feel free to reopen the issue if needed.
