
Comments (17)

SeanNaren commented on June 18, 2024

Open a new issue (maybe something like "better convergence with a custom dataset"); I think people will find this useful!


SeanNaren commented on June 18, 2024

calculateInputSizes calculates the real size of each sample in the audio tensor so we can ignore the padding in the gradient cost calculation (found in the CTCCriterion).

If each sample of your output has the same length, a way around this would be to do something like:

sizes:resize(outputs:size(1)):fill(outputs:size(2))

This fills the sizes tensor with the full output length for every sample. Hopefully this helps!
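For clarity, here is a minimal sketch of how that fits together, assuming a CTCCriterion that takes (outputs, targets, sizes) as discussed in this thread; model, inputs, targets, and ctcCriterion are illustrative names:

-- Sketch: every sample truly spans the full output length, so no padding
-- needs to be masked out of the CTC cost.
local outputs = model:forward(inputs)       -- batch x seqLength x nClasses
local sizes = torch.Tensor(outputs:size(1)) -- one length entry per sample
sizes:fill(outputs:size(2))                 -- all samples use the full seqLength
local loss = ctcCriterion:forward(outputs, targets, sizes)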


NightFury13 commented on June 18, 2024

I have variable-length image samples (same height, varying widths), so the alternate trick won't work. What should be passed as the sizes parameter to the CTCCriterion for loss calculation (here)? From what you suggest, it is the sequence length of the input samples. Can you please confirm?

So, in my case of images, since I pass a column-strip of the image at each timestep, would sizes be the width of each image in the batch after it has been passed through the SpatialConv layer?


SeanNaren commented on June 18, 2024

Sorry for the late response!

From what I can tell, you will not need to touch calculateInputSizes. It calculates the sizes with respect to the convolution layers, not the raw input, so as long as the input is given in a similar format to how the audio data is currently given, it should automatically calculate the sizes to pass to the gradient calculation.

And just to confirm: it is the true length of the input samples AFTER going through the convolutional layers (which reduce the number of timesteps; that's why this is necessary).
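To make that concrete, here is a rough sketch of the kind of arithmetic calculateInputSizes performs; the kernel/stride values below are illustrative assumptions, not the model's actual configuration:

-- Map each raw input width to its length after the conv stack.
local convLayers = { {kW = 11, dW = 2}, {kW = 11, dW = 1} } -- illustrative

local function postConvSize(inputWidth)
    local size = inputWidth
    for _, conv in ipairs(convLayers) do
        -- standard output-length formula for a conv layer (no padding)
        size = math.floor((size - conv.kW) / conv.dW) + 1
    end
    return size
end

-- sizes[i] is then the true (unpadded) number of timesteps for sample i.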


NightFury13 commented on June 18, 2024

Thanks @SeanNaren :)

calculateInputSizes is really a pretty neat hack! It turns out my problem was the noisy samples in my dataset, which had an image width less than the width of the convolution kernels I was using. Simply removing these corrupted samples from the dataset did the job for me. Thanks again!
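For anyone hitting the same thing, a rough sketch of that kind of filtering (minWidth, dataset, and the sample layout are illustrative assumptions):

-- Drop corrupted samples narrower than the widest conv kernel; such
-- samples produce zero timesteps and break the CTC cost.
local minWidth = 11 -- illustrative: width of the widest conv kernel
local cleaned = {}
for _, sample in ipairs(dataset) do
    if sample.img:size(3) >= minWidth then -- assumes C x H x W image tensors
        table.insert(cleaned, sample)
    end
end
dataset = cleaned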


SeanNaren commented on June 18, 2024

Ah, that is a good point! I think it would be nice to add this to the documentation somewhere appropriate; I ran into the same issue a lot when training these models!


NightFury13 commented on June 18, 2024

@SeanNaren I can send you a PR once I get the code working fine myself. The model currently trains in a weird fashion for me: the training loss keeps fluctuating between really small values and inf :/ (take a look at the train logs below). Any tips on what might be going wrong? I am checking whether this is indeed exploding gradients, though I'm not hopeful it is, since the loss shouldn't have come back to non-inf values once it exploded, right?

Training Epoch: 3 Average Loss: 3.046032 Average Validation WER: inf Average Validation CER: inf
[======= 5892/5892 ==================================>] Tot: 1h13m | Step: 729ms
Training Epoch: 4 Average Loss: 2.324838 Average Validation WER: inf Average Validation CER: inf
[======= 5892/5892 ==================================>] Tot: 1h16m | Step: 698ms
Training Epoch: 5 Average Loss: 1.797586 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h9m | Step: 693ms
Training Epoch: 6 Average Loss: -inf Average Validation WER: inf Average Validation CER: inf
[======= 5892/5892 ==================================>] Tot: 1h9m | Step: 760ms
Training Epoch: 7 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h9m | Step: 719ms
Training Epoch: 8 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 749ms
Training Epoch: 9 Average Loss: 0.579901 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h11m | Step: 705ms
Training Epoch: 10 Average Loss: 0.420499 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 766ms
Training Epoch: 11 Average Loss: 0.287849 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 706ms
Training Epoch: 12 Average Loss: 0.192960 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h13m | Step: 834ms
Training Epoch: 13 Average Loss: 0.122787 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h12m | Step: 710ms
Training Epoch: 14 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 1h7m | Step: 506ms
Training Epoch: 15 Average Loss: 0.042043 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 43m38s | Step: 481ms
Training Epoch: 16 Average Loss: 0.023819 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 43m41s | Step: 464ms
Training Epoch: 17 Average Loss: 0.010227 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 43m51s | Step: 418ms
Training Epoch: 18 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 44m14s | Step: 484ms
Training Epoch: 19 Average Loss: 0.005311 Average Validation WER: nan Average Validation CER: nan
[======= 5892/5892 ==================================>] Tot: 46m42s | Step: 493ms
Training Epoch: 20 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan

..
..
..

Training Epoch: 33 Average Loss: 0.000093 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 53m33s | Step: 530ms
Training Epoch: 34 Average Loss: 0.000966 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 53m15s | Step: 570ms
Training Epoch: 35 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[=============== 5892/5892 ==================================>] Tot: 53m49s | Step: 521ms
Training Epoch: 36 Average Loss: 0.000915 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 54m17s | Step: 530ms
Training Epoch: 37 Average Loss: -0.000312 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 54m24s | Step: 552ms
Training Epoch: 38 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 49m14s | Step: 509ms
Training Epoch: 39 Average Loss: -0.000470 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m59s | Step: 599ms
Training Epoch: 40 Average Loss: 0.000786 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 57m7s | Step: 504ms
Training Epoch: 41 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m26s | Step: 457ms
Training Epoch: 42 Average Loss: -0.000240 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 50m47s | Step: 539ms
Training Epoch: 43 Average Loss: 0.000231 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 51m42s | Step: 558ms
Training Epoch: 44 Average Loss: 0.000756 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 51m53s | Step: 599ms
Training Epoch: 45 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 1h39s | Step: 852ms
Training Epoch: 46 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 1h22m | Step: 1s105ms
Training Epoch: 47 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m24s | Step: 533ms
Training Epoch: 48 Average Loss: -0.000156 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 1h2m | Step: 486ms
Training Epoch: 49 Average Loss: 0.000695 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 51m34s | Step: 469ms
Training Epoch: 50 Average Loss: 0.000689 Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m41s | Step: 493ms
Training Epoch: 51 Average Loss: -inf Average Validation WER: nan Average Validation CER: nan
[===============5892/5892 ==================================>] Tot: 52m34s | Step: 613ms
Training Epoch: 52 Average Loss: 0.000671 Average Validation WER: nan Average Validation CER: nan
[=============== 5892/5892 ==================================>] Tot: 53m42s | Step: 512ms
Training Epoch: 53 Average Loss: 0.000359 Average Validation WER: nan Average Validation CER: nan


SeanNaren commented on June 18, 2024

Those are some fun losses! Have you tried changing the cutoff to a lower value like 100?


NightFury13 commented on June 18, 2024

@SeanNaren I haven't tried that yet; on it. By the way, by cutoff you mean the maxNorm, right? For normalizing gradients?


SeanNaren commented on June 18, 2024

Sorry, exactly! That is what I meant :)

From tests I've done, lowering the maxNorm helps prevent gradients from exploding!
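For context, max-norm clipping rescales the whole gradient vector whenever its L2 norm exceeds the threshold. A minimal sketch of the usual Torch pattern (gradParameters is the usual optim-convention name; an assumption here):

-- Inside the training closure, after backward() has filled gradParameters:
local maxNorm = 100 -- the "cutoff" being discussed
local norm = gradParameters:norm()
if norm > maxNorm then
    gradParameters:mul(maxNorm / norm) -- rescale so the global norm equals maxNorm
end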


NightFury13 commented on June 18, 2024

@SeanNaren I've tested running the code while linearly bringing the maxNorm down to a value as low as 10, but I still face the nan losses and inf WER/CER issue. In your experience, should I keep going lower, or is this probably not the bug/parameter I should be tuning? Please help.


NightFury13 commented on June 18, 2024

Also, I have tried reducing the number of RNN hidden layers to something like 3 instead of the original 7. Still no positive signs, though.


SeanNaren commented on June 18, 2024

This goes against the grain of DS2, but could you try using cudnn.LSTMs instead of RNNs? Try to keep the number of parameters around 80 million. LSTMs might help out, since they include a lot of improvements over the standard recurrent net!


NightFury13 commented on June 18, 2024

@SeanNaren will simply changing this line do the trick, replacing it with self.rnn = cudnn.LSTM(outputDim, outputDim, 1)?

I see that BLSTM implementations are also available, so just confirming.


SeanNaren commented on June 18, 2024

Ah, my apologies, that would be a bit strange. I'd suggest doing this in the DeepSpeechModel.lua class:

Change:

local function RNNModule(inputDim, hiddenDim, opt)
    if opt.nGPU > 0 then
        require 'BatchBRNNReLU'
        return cudnn.BatchBRNNReLU(inputDim, hiddenDim)
    else
        require 'rnn'
        return nn.SeqBRNN(inputDim, hiddenDim)
    end
end

to something like:

local function RNNModule(inputDim, hiddenDim, opt)
    require 'cudnn'
    local rnn = nn.Sequential()
    rnn:add(cudnn.BLSTM(inputDim, hiddenDim, 1))
    -- the BLSTM concatenates the forward/backward activations, so split the
    -- last dimension into the two directions and sum them
    rnn:add(nn.View(-1, 2, hiddenDim):setNumInputDims(2))
    rnn:add(nn.Sum(3))
    return rnn
end

I would suggest changing the hidden dimension to around 700, as the default size would make the LSTM pretty large!
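If it helps, here is a quick shape check of the module above (a sketch; assumes a GPU with cudnn, and the dimensions are illustrative):

-- The BLSTM emits seqLength x batch x (2 * hiddenDim); the View splits the
-- two directions apart and the Sum adds them, giving seqLength x batch x hiddenDim.
require 'cudnn'
local rnn = RNNModule(32, 700):cuda()
local input = torch.CudaTensor(50, 8, 32) -- seqLength x batch x inputDim
print(rnn:forward(input):size())          -- expect 50 x 8 x 700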


NightFury13 commented on June 18, 2024

Thanks a lot for the clarification! Will update with results 😄


NightFury13 commented on June 18, 2024

@SeanNaren Can you tell me what role hiddenDim plays in rnn:add(nn.View(-1, 2, hiddenDim):setNumInputDims(2))? What value does it signify?

