RNN Transducer training problem

awni commented on August 18, 2024

RNN Transducer training problem

from speech.

Comments (5)

HawkAaron commented on August 18, 2024

Hi,
I recently wrote a greedy decoder for the transducer:

    def greedy_decode(self, batch):
        # Greedy decoding; supports a single sequence per batch.
        x, y, x_lens, y_lens = self.collate(*batch)
        x = self.encode(x)[0]  # encoder outputs for the first (only) sequence
        # Reusable holder for the previous label, fed to the embedding
        vy = autograd.Variable(torch.LongTensor([0]), volatile=True).view(1,1)
        # Run the prediction network on a zero input to get its initial output
        y, h = self.dec_rnn(autograd.Variable(torch.zeros((1, 1, 256)), volatile=True))
        y_seq = []; logp = 0
        for i in x:  # one joint-network step per encoder frame
            out = self.fc1(i) + self.fc1(y[0][0])
            out = nn.functional.relu(out)
            out = self.fc2(out)
            out = nn.functional.log_softmax(out, dim=0)
            p, pred = torch.max(out, dim=0)
            pred = int(pred); logp += float(p)
            if pred != self.blank:
                y_seq.append(pred)
                vy.data[0][0] = pred  # feed the emitted label back to the prediction network
                y = self.embedding(vy)
                y, h = self.dec_rnn(y, h)
        return [y_seq]

This code can be placed in your transducer_model.py file. It only supports one sequence per batch.
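The loop above moves to the next encoder frame after every step, so it emits at most one label per frame. A variant that stays on the same frame until blank wins can be sketched without any framework. Everything here is hypothetical and not taken from the repository: `pred_step` and `joint` stand in for the prediction and joint networks, `max_symbols_per_frame` is an assumed safety cap, and the toy functions exist only to make the example runnable.

```python
import math

BLANK = 2  # toy vocabulary: labels 0 and 1, plus blank (assumption for this demo)


def greedy_transducer_decode(enc_frames, pred_step, joint, blank,
                             max_symbols_per_frame=5):
    """Greedy walk through the (frame, label) lattice.

    pred_step(label, state) -> (pred_out, new_state) stands in for the
    prediction network; joint(frame, pred_out) -> list of log-probs stands
    in for the joint network. Both are hypothetical callables.
    """
    pred_out, state = pred_step(None, None)  # initial prediction-network output
    labels, logp = [], 0.0
    for frame in enc_frames:
        # Stay on this frame until blank wins or the cap is reached.
        for _ in range(max_symbols_per_frame):
            scores = joint(frame, pred_out)
            best = max(range(len(scores)), key=scores.__getitem__)
            logp += scores[best]
            if best == blank:
                break  # blank advances to the next frame
            labels.append(best)
            pred_out, state = pred_step(best, state)  # non-blank advances the label axis
    return labels, logp


def toy_pred_step(label, state):
    # Toy prediction network: its output is simply the last emitted label.
    return label, None


def toy_joint(frame_label, last_label):
    # Toy joint network: favour the frame's label unless it was just emitted.
    probs = [0.05, 0.05, 0.05]
    winner = BLANK if frame_label == last_label else frame_label
    probs[winner] = 0.9
    return [math.log(p) for p in probs]


labels, logp = greedy_transducer_decode([0, 1, 1, 0], toy_pred_step, toy_joint, BLANK)
print(labels)  # [0, 1, 0]
```

In this toy run each new label is followed by a blank on the same frame, which is the step that moves decoding forward in time.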

With this decoding, the PER is 24.4% on the model trained with your code (which may not have converged).

It seems that the training PER calculated in the CTC way doesn't make sense.

Beam search would be much better.

smolendawid commented on August 18, 2024

> It seems that the training PER calculated in the CTC way doesn't make sense.

@HawkAaron what do you mean?

HawkAaron commented on August 18, 2024

@smolendawid what you said is exactly what I mean.
Since the CTC transition topology is almost the same as an HMM's, greedy search can be treated as Viterbi decoding. But the transitions of the RNN Transducer are two-dimensional (over input frames and output labels), so decoding along only one of those dimensions doesn't make sense.
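For contrast, CTC-style greedy decoding only walks the time axis: pick the best label for each frame, collapse consecutive repeats, then drop blanks. A minimal sketch of that best-path procedure, with a hypothetical function name and made-up scores that are not taken from the repository:

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, blank):
    """Best-path CTC decoding over per-frame score lists."""
    # 1. Best label for every frame, independently of the other frames.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Collapse runs of the same label into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks.
    return [label for label in collapsed if label != blank]


# Made-up per-frame scores over labels 0, 1 and blank (index 2).
scores = [
    [0.10, 0.80, 0.10],  # best: 1
    [0.10, 0.70, 0.20],  # best: 1 (repeat, collapsed)
    [0.10, 0.10, 0.80],  # best: blank
    [0.20, 0.70, 0.10],  # best: 1 (new label, separated by the blank)
    [0.90, 0.05, 0.05],  # best: 0
]
decoded = ctc_greedy_decode(scores, blank=2)
print(decoded)  # [1, 1, 0]
```

Nothing in this procedure conditions on previously emitted labels, which is why applying it to transducer outputs ignores the label dimension.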
