
Comments (59)

a3626a avatar a3626a commented on May 20, 2024 3

@QLQL
Why don't you lower your learning rate? I am using 1e-4 instead of the default 1e-3, and the loss is around 0.2 now (it was 3~4 with 1e-3).


azraelkuan avatar azraelkuan commented on May 20, 2024 3

@Rayhane-mamah yeah, I have finished my exams, so I will fix the bugs in my code in the next few days. :)


Rayhane-mamah avatar Rayhane-mamah commented on May 20, 2024 2

Hello everyone, sorry for this long delay :)
@azraelkuan and @a3626a are on the right track; I didn't pay enough attention to such an embarrassing mistake. TensorFlow doesn't follow Python semantics, especially not in its while loop: all the code we write builds the graph, so it only executes once (including the tf.while_loop body). My mistake was expecting TensorFlow to update a class attribute while executing the loop, which obviously doesn't happen, so I suspect a plain dictionary update can hit the same issue @a3626a. Anyway, I will naturally push an update of the repo this weekend to fix this issue and several others, but since this one is critical and can be corrected right away (without retraining the model), here are the wavenet.py and modules.py files you need for a correct fast WaveNet synthesis:

models.tar.gz

It should be possible to correct just the incremental steps without retraining the model; it's mostly a synthesis-time bug. It is also not mandatory to provide convolution queues to all of the network's convolutions: only the dilated convolutions of the residual blocks need them (kernel_size=3). Queues are initialized to zeros before the while loop and are simply shifted at each iteration; naturally, they are provided as inputs to and picked up as outputs from these convolutions.
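To make the queue mechanics concrete, here is a minimal sketch of the idea (the names and shapes are illustrative, not the repo's exact code):

    import tensorflow as tf  # TF 1.x graph mode

    batch_size, kernel_size, dilation, channels = 1, 3, 2, 128

    # One zero-initialized queue per dilated convolution, created before the
    # while loop. Its length covers the receptive field minus the current sample.
    queue = tf.zeros([batch_size, (kernel_size - 1) * dilation, channels])

    def shift_queue(queue, new_frame):
        # Drop the oldest time step and append the newest frame
        # (new_frame: [batch_size, 1, channels]); the updated queue is then
        # returned through the loop variables so the next iteration sees it.
        return tf.concat([queue[:, 1:, :], new_frame], axis=1)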

If you are using already trained models, do NOT replace your existing files with these ones: I may have changed some scopes, so they would raise errors. Instead, just pick up the "argmax" rectification, the while_loop changes, and the new conv1d incremental_step updates (the init has changed too).

One last detail: during training, Ryuichi's WaveNet padding combined with our conv1d call function causes problems with the reshape used for dilated convolutions, almost every time except when hop_size is a power of 2 (that was the problem, if I'm not wrong). Fast WaveNet's padding, on the other hand, only works for the special case of kernel_size=2. In the files above I updated the fast WaveNet padding technique to support any kernel_size (2 and higher) while allowing any arbitrary hop size with WaveNet (in my case I use n_fft=2048, hop_size=300, win_size=1200 and sample_rate=24kHz without problems). Thanks for all your efforts!
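For reference, the textbook causal padding that works for any kernel_size >= 2 and any dilation looks like the sketch below; this only illustrates the idea and is not necessarily the exact padding logic in the updated files:

    import tensorflow as tf

    def causal_pad(inputs, kernel_size, dilation):
        # Left-pad the time axis so the dilated convolution stays causal,
        # independently of hop_size and for any kernel_size >= 2.
        # inputs: [batch, time, channels]
        pad_amount = (kernel_size - 1) * dilation
        return tf.pad(inputs, [[0, 0], [pad_amount, 0], [0, 0]])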


azraelkuan avatar azraelkuan commented on May 20, 2024 2

I tested the incremental step using test_inputs and it works well; maybe I just need to train longer for the case without test inputs.
This is the incremental result using test inputs:
[waveplot screenshot]


azraelkuan avatar azraelkuan commented on May 20, 2024 1

Indeed, the answer has been given in https://github.com/Rayhane-mamah/Tacotron-2/files/2145382/models.tar.gz. Training a MoL model needs a lot of time (about two weeks), so I haven't tested that, but in my tests mu-law works well.
Below is the plot for mu-law (eval step):
[waveplot screenshot]


azraelkuan avatar azraelkuan commented on May 20, 2024 1

@Yeongtae at about 40k steps I can generate good wavs; the plot above is at about 300k.
Attached is 44k.
44000.zip


Rayhane-mamah avatar Rayhane-mamah commented on May 20, 2024

I think it's normal that the WaveNet loss is shaky. Log wavs sound better than eval wavs simply because during training the model makes predictions conditioned on ground truth, while eval wavs are synthesized sequentially, which means the model is conditioned on its own previous outputs. Since WaveNet is still at an early stage of training, conditioning on previously sampled outputs causes errors to accumulate and thus produces trash wavs.
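A toy illustration of the difference, in plain Python with a hypothetical one-step predictor `predict`:

    def generate(predict, ground_truth, teacher_forcing):
        # predict(prev_sample) -> estimate of the next sample.
        outputs, prev = [], ground_truth[0]
        for t in range(1, len(ground_truth)):
            outputs.append(predict(prev))
            # Training-time (log) wavs condition on ground truth; eval wavs
            # condition on the model's own previous output, so early-training
            # errors compound over the sequence.
            prev = ground_truth[t] if teacher_forcing else outputs[-1]
        return outputs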


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

Here are the log wav plot and the eval one. I will report again when it reaches 200K steps.
[waveplots at step 95000 (log and eval)]


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

Should the hyperparameter train_with_GTA be set to True for WaveNet training? The wave plot still does not look good at 190K steps, while in r9y9/wavenet_vocoder#79 the plot at 250K steps looks better than this one.
[waveplot at step 190000]


butterl avatar butterl commented on May 20, 2024

@begeekmyfriend that plot in r9y9/wavenet_vocoder#79 is trained with real wavs, not GTA, and for me the GTA training waveplot is all a mess at 50K steps.


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

It seems the WaveNet model fails to converge within 360K steps on a 10h dataset. @Rayhane-mamah you have found ways to reduce the amount of training data needed for Tacotron; I still need to find similar ways to reach convergence on a small dataset with WaveNet...
[waveplot at step 360000]


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

I have used the 34h THCHS-30 dataset, which should be long enough to train on, but it still fails to converge within 110K steps, while in r9y9/wavenet_vocoder#79 it seems to converge with r9y9's version. I suspect there is something wrong with the porting...
[waveplot at step 110000]


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

This is from r9y9's original repo, on a 10h dataset. We can see the wave begins to converge.
[waveplot at step 99972]


butterl avatar butterl commented on May 20, 2024

@begeekmyfriend what are the step and loss in your latest post?


azraelkuan avatar azraelkuan commented on May 20, 2024

@begeekmyfriend which mode do you use, raw or mu-law-quantize?
I am now testing the WaveNet code, and I have found that there may be errors in the incremental step.
Here is the training step (I am using mu-law-quantize):
[waveplot at step 26000]
But when I use the incremental step:
[waveplot at step 25000]


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

[screenshots]


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

I am just using the raw mode and I have not looked into the code closely.


QLQL avatar QLQL commented on May 20, 2024

The same happened with my WaveNet training (tested up to step 250000). Note that I used the ground-truth mels to train the vocoder instead of the force-aligned mels synthesized by Tacotron.

Under logs-WaveNet/plots the target waveform is almost identical to the prediction waveform, which is also confirmed by listening to the audio clips under logs-WaveNet/wavs. Yet the results under logs-WaveNet/eval-dir/plots have not converged: although the envelope of the predicted signal does look similar to that of the target, the predicted audio clips under logs-WaveNet/eval-dir/wavs sound like a complete mess, total noise.

An example under logs-WaveNet/plots
[waveplot at step 250000]

An example under logs-WaveNet/eval-dir/plot
[waveplot at step 250000]


azraelkuan avatar azraelkuan commented on May 20, 2024

@QLQL Do you use the raw mode?


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

@QLQL Please use r9y9's original repo.


a3626a avatar a3626a commented on May 20, 2024

In my case, I am training WaveNet on mel spectrograms from SpectrogramNet with GTA always on.

I found that the mel spectrograms from SpectrogramNet are really bad when GTA is turned off for synthesis. This affects WaveNet's quality too.
Which mel spectrograms are you using during synthesis? STFT-generated? GTA on? GTA off?


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

@a3626a STFT-generated ground truth, both for this repo and r9y9's, on a 10h dataset, and the evaluation results are quite different, as you can see in #57 (comment).


QLQL avatar QLQL commented on May 20, 2024

@azraelkuan yes, I was using raw mode, but I will also test with r9y9's repo as suggested by @begeekmyfriend. BTW, @begeekmyfriend did you manage to train the WaveNet vocoder part with Rayhane-mamah's repo? If so, how many steps did you use, and what was your batch size? I can only manage a batch size of 2 instead of the default 4 due to OOM problems.


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

@QLQL It failed to converge with Rayhane's WaveNet vocoder even at 360K steps. I trained it with a batch size of 3 on an 11GB GTX 1080 Ti.


azraelkuan avatar azraelkuan commented on May 20, 2024

Indeed, there is a serious problem in the incremental step:

    if self.convolution_queue is None:
        self.convolution_queue = tf.zeros((batch_size, (kw - 1) + (kw - 1) * (dilation - 1), tf.shape(inputs)[2]))
    else:
        # shift queue
        self.convolution_queue = self.convolution_queue[:, 1:, :]

In the while_loop, when we call incremental_step, the queue is defined as None (at graph-construction time), so we never get the correct convolution queue. Any way to solve the problem?

  1. Like the queues in ibab's implementation, we can create separate queue variables to save the state. Not great...
  2. We can use TensorFlow eager mode to run the WaveNet model; it keeps the attribute updated while running. I have written a small test and it works well.
  3. Use tf.get_variable to save the convolution queue.

Hope somebody can suggest a better solution!
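To see why the class attribute never advances, here is a minimal, hypothetical repro (not the repo's code): the loop body is traced only once at graph-construction time, so the Python-side attribute assignment runs a single time, and only tensors carried through loop_vars persist between iterations.

    import tensorflow as tf  # TF 1.x graph mode

    class Conv:
        def __init__(self):
            self.queue = None  # plain Python attribute

        def incremental_step(self, x):
            if self.queue is None:
                # Runs once, while the loop body is being traced.
                self.queue = tf.zeros_like(x)
            out = self.queue + x
            # Rebinds a Python attribute to a tensor inside the traced body;
            # it is not a loop variable, so nothing is carried to the next
            # iteration -- every iteration effectively restarts from zeros.
            self.queue = x
            return out

    conv = Conv()
    _, y = tf.while_loop(lambda t, x: t < 3,
                         lambda t, x: (t + 1, conv.incremental_step(x)),
                         [tf.constant(0), tf.ones([1, 1])])

    with tf.Session() as sess:
        print(sess.run(y))  # stays [[1.]]: the "queue" never accumulates state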


QLQL avatar QLQL commented on May 20, 2024

@begeekmyfriend I waited three days for another ~200k steps, and there is no improvement in the loss, which is still between 6 and 7, as shown in the following example under eval-dir/plots. I assume it may have something to do with the issue mentioned by @azraelkuan?

[waveplot at step 455000]


azraelkuan avatar azraelkuan commented on May 20, 2024

@QLQL I think you can predict a wav using just the train mode; you will find that the predicted wav is very good.


QLQL avatar QLQL commented on May 20, 2024

@azraelkuan, yes. As quoted from @Rayhane-mamah's reply earlier, the train mode is

conditioned on ground truth during training, while eval wavs are synthesized sequentially which means the model is conditioned on its previous outputs.

In real synthesis applications (eval mode), we don't have ground-truth samples; we have to rely on previously predicted samples.


a3626a avatar a3626a commented on May 20, 2024

@azraelkuan

In the while_loop, when we call incremental_step, the queue is defined as None, so we never get the correct convolution queue. Any way to solve the problem?

I have checked the value of self.convolution_queue during synthesis using tf.Print: it is always 0. You are right.
I think using variables is the better choice. ibab's implementation requires multiple sess.run calls, so the synthesizer would have to be modified a lot, and it is inefficient too (multiple sess.run calls slow the operation down).

    def __init__(self, ...):
        (...)
        if kernel_size > 1:
            self.convolution_queue = tf.get_variable("conv_queue_{}".format(name),
                                                     (1, kernel_size + (kernel_size - 1) * (dilation - 1), in_channels),
                                                     tf.float32,
                                                     initializer=tf.zeros_initializer(),
                                                     trainable=False)
        (...)

    def incremental_step(self, inputs):
        (...)
        # append next input
        op_assign = tf.assign(self.convolution_queue,
                              tf.concat([self.convolution_queue[:, 1:, :], tf.expand_dims(inputs[:, -1, :], axis=1)], axis=1))

        with tf.control_dependencies([op_assign]):
            inputs = self.convolution_queue
            if dilation > 1:
                inputs = inputs[:, 0::dilation, :]
        (...)

    def clear_queue(self):
        pass


QLQL avatar QLQL commented on May 20, 2024

@a3626a , Thank you very much for the nice suggestion! Didn't think about that earlier!


azraelkuan avatar azraelkuan commented on May 20, 2024

@a3626a I found a much better way to handle this problem: we can pass an input_buffer list into the while_loop and return the input_buffer from the incremental step. I tested this and it works well.
Creating variables means we have to assign them back to zero, and the assign function returns an op, so I don't think it is good from a coding perspective.
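A rough sketch of that approach (hypothetical shapes and layer count), carrying the buffers as while-loop variables and returning the updated buffer from each incremental step:

    import tensorflow as tf  # TF 1.x

    def incremental_step(x, buf):
        # x: [batch, 1, channels], buf: [batch, width, channels].
        buf = tf.concat([buf[:, 1:, :], x], axis=1)  # shift and append newest frame
        # ... run this layer's dilated convolution on `buf` here ...
        return x, buf  # hand the updated buffer back to the loop

    def body(time, x, buffers):
        new_buffers = []
        for buf in buffers:
            x, buf = incremental_step(x, buf)
            new_buffers.append(buf)
        return [time + 1, x, new_buffers]

    # The buffers live in loop_vars, so TensorFlow threads the updated tensors
    # from one iteration to the next (unlike a plain Python attribute).
    time0 = tf.constant(0)
    x0 = tf.zeros([1, 1, 128])
    buffers0 = [tf.zeros([1, 4, 128]) for _ in range(3)]
    result = tf.while_loop(lambda t, x, b: t < 100, body, [time0, x0, buffers0])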


JK1532 avatar JK1532 commented on May 20, 2024

@azraelkuan Have you gotten acceptable results after using the input_buffer? And does the input_buffer list contain the convolution_queue for every dilation layer? Thanks.


a3626a avatar a3626a commented on May 20, 2024

@JK1532
I also fixed incremental() by providing more variables, like input_buffer, to tf.while_loop. However, the synthesized audio is not very good yet.

    def incremental(self, initial_input, c=None, g=None,
        time_length=100, test_inputs=None,
        softmax=True, quantize=True, log_scale_min=-7.0):

        (...)

        initial_queue = self.clear_queue()
        # this returns python dictionary of empty queues for each layers.
        # {"conv0":tf.zeros(...), "conv1":tf.zeros(...)}

        (....)

        def condition(time, unused_outputs_ta, unused_current_input, unused_loss_outputs_ta, queue):
            return tf.less(time, time_length)

        def body(time, outputs_ta, current_input, loss_outputs_ta, queue):
            #conditioning features for single time step
            ct = None if self.c is None else tf.expand_dims(self.c[:, time, :], axis=1)
            gt = None if self.g_btc is None else tf.expand_dims(self.g_btc[:, time, :], axis=1)

            x = self.first_conv.incremental_step(current_input, queue)
            skips = None
            for conv in self.conv_layers:
                x, h = conv.incremental_step(x, ct, gt, queue)
                skips = h if skips is None else (skips + h)
            x = skips
            for conv in self.last_conv_layers:
                try:
                    x = conv.incremental_step(x, queue)
                except AttributeError: #When calling Relu activation
                    x = conv(x)

            #Save x for eval loss computation
            loss_outputs_ta = loss_outputs_ta.write(time, tf.squeeze(x, [1])) #squeeze time_length dimension (=1)

            #Generate next input by sampling
            if self.scalar_input:
                x = sample_from_discretized_mix_logistic(
                    tf.reshape(x, [batch_size, -1, 1]), log_scale_min=log_scale_min)
            else:
                x = tf.nn.softmax(tf.reshape(x, [batch_size, -1]), axis=1) if softmax \
                    else tf.reshape(x, [batch_size, -1])
                if quantize:
                    x = tf.reshape(x, [batch_size, -1])
                    sample = tf.multinomial(tf.reshape(x, [batch_size, -1]), 1)[0] #Pick a sample using x as probability
                    x = tf.one_hot(sample, depth=self.quantize_channels)

            outputs_ta = outputs_ta.write(time, x)
            time = time + 1
            #output = x (maybe next input)
            if test_inputs is not None:
                next_input = tf.expand_dims(test_inputs[:, time, :], axis=1)
            else:
                if is_mulaw_quantize(self.input_type):
                    next_input = tf.expand_dims(x, axis=1) #Expand on the time dimension
                else:
                    next_input = tf.expand_dims(x, axis=-1) #Expand on the channels dimension

            return (time, outputs_ta, next_input, loss_outputs_ta, queue)

        res = tf.while_loop(
            condition,
            body,
            loop_vars=[
                initial_time,
                initial_outputs_ta,
                initial_input,
                initial_loss_outputs_ta,
                initial_queue
            ],
            parallel_iterations=32,
            swap_memory=self.wavenet_swap_with_cpu)

        outputs_ta = res[1]
        #[time_length, batch_size, channels]
        outputs = outputs_ta.stack()

        #Save eval prediction for eval loss computation
        eval_outputs = res[3].stack()

        if is_mulaw_quantize(self.input_type):
            self.y_hat_eval = tf.transpose(eval_outputs, [1, 0, 2])
        else:
            self.y_hat_eval = tf.transpose(eval_outputs, [1, 2, 0])

        return tf.transpose(outputs, [1, 2, 0])


azraelkuan avatar azraelkuan commented on May 20, 2024

@JK1532 sorry, I have been busy with my exams recently.
@a3626a I think there is a problem with your code. Did you check input_queue with tf.Print() while it runs? I think you should return the queue of each conv.
[screenshots]


a3626a avatar a3626a commented on May 20, 2024

No, I don't have to.

            x = self.first_conv.incremental_step(current_input, queue)
            skips = None
            for conv in self.conv_layers:
                x, h = conv.incremental_step(x, ct, gt, queue)
                skips = h if skips is None else (skips + h)
            x = skips
            for conv in self.last_conv_layers:
                try:
                    x = conv.incremental_step(x, queue)
                except AttributeError: #When calling Relu activation
                    x = conv(x)

queue is a dictionary and is modified inside incremental_step, like queue["conv1"] = tf.concat( ... queue["conv1"] ... )
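For clarity, a small sketch of that dictionary update (the layer names and shapes are illustrative only):

    import tensorflow as tf  # TF 1.x

    def incremental_step(name, x, queues):
        # queues: dict of per-layer buffers passed through the while-loop's
        # loop_vars. Overwriting the entry with a new tensor and returning the
        # dict from the loop body is what carries the state to the next iteration.
        queues[name] = tf.concat([queues[name][:, 1:, :], x], axis=1)
        # ... apply this layer's dilated convolution to queues[name] ...
        return x, queues

    # Initial dictionary of empty (zero) queues, one entry per layer.
    queues = {"conv0": tf.zeros([1, 4, 128]), "conv1": tf.zeros([1, 4, 128])}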


a3626a avatar a3626a commented on May 20, 2024

@Rayhane-mamah
I have already synthesized intelligible audio using the dictionary-based implementation.


Rayhane-mamah avatar Rayhane-mamah commented on May 20, 2024

@a3626a, ah perfect, then ignore me :) I only tested the list approach, so I can only confirm that the list approach works; the dictionary one is cleaner, I think. I'll have a look into it, thanks @a3626a ;)


azraelkuan avatar azraelkuan commented on May 20, 2024

@Rayhane-mamah why not just use tf.layers.Conv1D's dilation_rate?


azraelkuan avatar azraelkuan commented on May 20, 2024

[code screenshots]
this is my code


Rayhane-mamah avatar Rayhane-mamah commented on May 20, 2024

@azraelkuan,
It should give exactly the same results during training, actually, and I would have preferred to use it. Unfortunately, I think I found some issues with picking up the kernels and biases of such a layer at inference time (to use the fast WaveNet synthesis approach). If I remember well, the kernel and bias variables were only being initialized at the layer's build call, and that didn't play well with how my WaveNet code is structured, or something like that. In other words, I just found it easier to create and use my own kernel and bias variables.
Plus, I was somewhat worried that tf.layers.Conv1D might, one way or another, not behave exactly as I expect, so it was also a safety measure to make sure the network is exactly as I'm thinking of it. I went to a lower level and used tf.nn.conv1d with our own kernel and bias variables to "rewrite" tf.layers.Conv1D, since both give the same results.
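A rough sketch of that lower-level approach (names and scopes here are illustrative, not the repo's exact ones); owning the kernel and bias makes them trivial to reuse in the incremental step. The sketch uses tf.nn.convolution to handle the dilation and assumes the caller does the causal padding:

    import tensorflow as tf  # TF 1.x

    def my_conv1d(inputs, filters, kernel_size, dilation, scope):
        # inputs: [batch, time, in_channels]
        in_channels = inputs.shape[-1].value
        with tf.variable_scope(scope):
            kernel = tf.get_variable('kernel', [kernel_size, in_channels, filters])
            bias = tf.get_variable('bias', [filters], initializer=tf.zeros_initializer())
        # Dilated 1D convolution with our own variables; causal left-padding
        # is assumed to have been applied by the caller.
        outputs = tf.nn.convolution(inputs, kernel, padding='VALID', dilation_rate=[dilation])
        return tf.nn.bias_add(outputs, bias)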

When using tf.layers.Conv1D, are you able to get its kernel and bias without problem during synthesis mode?


azraelkuan avatar azraelkuan commented on May 20, 2024

Yes, I also had the same issue, but I can get the kernel and bias through the collections: TensorFlow saves all variables in tf.GraphKeys.GLOBAL_VARIABLES, so we can fetch them by scope.
This is a better way to implement it.
[code screenshot]
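Something along those lines (the scope string below is hypothetical): tf.layers.Conv1D registers its variables as kernel and bias under its variable scope, so they can be pulled back out of the GLOBAL_VARIABLES collection:

    import tensorflow as tf  # TF 1.x

    # Hypothetical scope name of one residual-block convolution.
    scope = 'WaveNet_model/residual_block_0/conv'
    conv_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope)
    kernel = [v for v in conv_vars if 'kernel' in v.name][0]
    bias = [v for v in conv_vars if 'bias' in v.name][0]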


Rayhane-mamah avatar Rayhane-mamah commented on May 20, 2024

Ah yes, this is very clever @azraelkuan! Really well thought out! I will try it out for train+eval and synthesis time; if everything goes well and the results are also correct, I will most probably switch to this approach. If you get samples in the meantime using your code, feel free to share and suggest a PR ;)


HyperGD1994 avatar HyperGD1994 commented on May 20, 2024

@Rayhane-mamah hi, I'm training the WaveNet with your revised code, feeding raw audio input, and the eval result does not look good at 20k steps. I wonder whether it will improve with more training steps?

Meanwhile, I have trained ibab's WaveNet with local conditioning, mu-law-quantized audio and mel-spec input, and it can generate correct audio after a day's training. I'm not satisfied with the acoustic quality, so I'm trying your code with raw input and a bigger net.

I'm confused about the eval result: is the revised code still not right, or do I just need to train for more steps? Thanks ~


azraelkuan avatar azraelkuan commented on May 20, 2024

@HyperGD1994 yes, there is a problem in the eval code. I am trying to fix the bugs but, like @a3626a, I still cannot get good results.


HyperGD1994 avatar HyperGD1994 commented on May 20, 2024

@azraelkuan have you tried ibab's generate method? Would that be simpler?


butterl avatar butterl commented on May 20, 2024

@azraelkuan

I found it gives good result wavs in the training samples:

[waveplot screenshot]

but it gets stuck when running synthesis with a pretrained model; not sure whether that is related to the problem you found.

In synthesizer.py in wavenet_vocoder:

	def synthesize(self, mel_spectrogram, speaker_id, index, out_dir, log_dir):
		hparams = self._hparams
		local_cond, global_cond = self._check_conditions()

		c = mel_spectrogram
		g = speaker_id
		feed_dict = {}
		print("Hi I'm here")
		if local_cond:
			feed_dict[self.local_conditions] = [np.array(c, dtype=np.float32)]
		else:
			feed_dict[self.synthesis_length] = 100
		print("Hi I'm here 2")
		if global_cond:
			feed_dict[self.global_conditions] = [np.array(g, dtype=np.int32)]

		generated_wav = self.session.run(self.model.y_hat, feed_dict=feed_dict)  # <<< stuck here
		print("Hi I'm here 3")


begeekmyfriend avatar begeekmyfriend commented on May 20, 2024

@azraelkuan @butterl Could you open your forks to everyone interested in this?


HyperGD1994 avatar HyperGD1994 commented on May 20, 2024

@azraelkuan wow, that's wonderful. May I ask how many steps you trained to get this? Do you input only the mel spec, or the audio file too?


azraelkuan avatar azraelkuan commented on May 20, 2024

@HyperGD1994 this result uses test inputs, and only 1500 steps; I am testing the real evaluation step now.


butterl avatar butterl commented on May 20, 2024

@begeekmyfriend @HyperGD1994 I used the HEAD of the repo, adapted only to the THCHS-30 dataset, and the waveplot is from the eval during WaveNet training; Tacotron was trained for 100k steps and WaveNet for 160K.

@azraelkuan any suggestions for modifying the real evaluation step? I only see it stuck at tqdm 0%.


v-yunbin avatar v-yunbin commented on May 20, 2024

@butterl I have the same problem as you; have you solved it?


WendongGan avatar WendongGan commented on May 20, 2024

@azraelkuan @begeekmyfriend @HyperGD1994 I also encountered the same problem. Do you have a final solution? Looking forward to your help.


Yeongtae avatar Yeongtae commented on May 20, 2024

Has anyone found a solution?
Even if you do not release your code, please share your results and ideas on how to fix it.


HyperGD1994 avatar HyperGD1994 commented on May 20, 2024

@UESTCgan what's your problem, tqdm stuck at 0%? I do not use the whole code, just part of it, so I did not hit this problem, but I don't think this bug will be difficult to fix; you can try to debug it.

As for the true evaluation problem, has @azraelkuan fixed it? The raw input is very hard to train, and the mu-law input seems to have a lot of bugs. However, I modified ibab's code with local conditioning and a bigger net, and I finally got a wonderful result with mu-law input. I suggest you guys try that.


Yeongtae avatar Yeongtae commented on May 20, 2024

@azraelkuan Thank you for your answer.


WendongGan avatar WendongGan commented on May 20, 2024

@HyperGD1994 Thanks for your help! On my current 20h Chinese dataset I have trained for 240K steps. The predictions produced during training sound good, but I get only noise when synthesizing. I will try the latest code next.


Yeongtae avatar Yeongtae commented on May 20, 2024

@azraelkuan did you test on the LJSpeech dataset?
If so, how many iterations did you train to get the above result?

In my case, using 'mulaw', it produces bad wave files with noise.
[waveplot screenshot]


Yeongtae avatar Yeongtae commented on May 20, 2024

I made a mistake: I didn't change the value of quantize_channels in hparams.py.
[hparams screenshot]
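For context, mu-law quantization maps samples onto 256 classes, so with a mu-law input_type the quantize_channels hparam has to match. A minimal sketch of the standard companding (not necessarily the repo's exact implementation):

    import numpy as np

    def mulaw_quantize(x, quantize_channels=256):
        # x: float samples in [-1, 1]; returns integer class ids in [0, mu].
        mu = quantize_channels - 1
        y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
        return ((y + 1) / 2 * mu + 0.5).astype(np.int32)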

I'm testing with the parameters in the above images.
I can see that the model is reducing the loss.
[screenshot]

@azraelkuan Thank you for your advice.


Rayhane-mamah avatar Rayhane-mamah commented on May 20, 2024

Thank you all for your valuable contributions; this issue has been fixed with the latest commit. If any further problems appear, feel free to open new issues :)

