pytorch-simple-transformer's People

Contributors: chenrongl, hasanm08, ipsumdominum

pytorch-simple-transformer's Issues

Fix loss function

The loss calculation is flawed: it lets the model fit very quickly and appear to produce very good translations. Unfortunately this is not real, and the model cannot spell out a German sentence by itself. It can only complete an i-length German sentence when given the first (i-1) tokens, which means it cannot generate a whole sentence starting from just a start-of-sentence tag.

This loss seems reasonable enough for training the network, but unreasonable for assessing its translation ability, though I have yet to train the network to its full capability.

###################
# Original leave-last-token-out decoder.
# I'm not sure of the exact error in this calculation, but perhaps
# it is because the model sees the ground-truth tokens directly:
# at step i it is fed the first i correct German tokens, so it only
# ever predicts the next token from a correct prefix and never has
# to build on its own predictions.
#
# Output German, one token at a time
all_outs = torch.tensor([], requires_grad=True).to(device)
for i in range(item["german"].shape[1] - 1):
    out = model(item["german"][:, :i+1])
    all_outs = torch.cat((all_outs, out), dim=1)


# ###################
# My variation of the leave-last-token-out decoder, used at training.
# The known prefix is revealed one ground-truth token per step, after
# the model has already produced its prediction for that step.
# output_vocab_size = german_vocab_len

g = item["german"].shape
x = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
all_outs = torch.tensor([], requires_grad=True).to(device)
for i in range(item["german"].shape[1] - 1):
    out = model(x)
    x[:, i:i+1] = item["german"][:, i:i+1]   # reveal the next ground-truth token
    all_outs = torch.cat((all_outs, out), dim=1)
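For comparison, here is a minimal sketch of the standard single-pass training setup (the shapes are made up, and `logits` stands in for one causally-masked decoder pass): the decoder runs once over the target sequence minus its last token, and position i is scored against token i+1, so no Python loop over prefixes is needed.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes; `logits` stands in for a single causally-masked
# decoder pass, i.e. model(targets[:, :-1]).
batch, seq_len, vocab = 2, 5, 11
targets = torch.randint(0, vocab, (batch, seq_len))   # full German sentence
logits = torch.randn(batch, seq_len - 1, vocab)       # model(targets[:, :-1])

# Position i predicts token i+1 ("shift right"), in one forward pass.
loss = F.cross_entropy(
    logits.reshape(-1, vocab),    # (batch * (seq_len-1), vocab)
    targets[:, 1:].reshape(-1),   # shifted targets
)
```

The causal mask inside the decoder is what prevents position i from peeking at positions > i, which is the property the prefix loops above try to enforce by hand.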

# ###################
# My variation of a search decoder (strictly speaking this is greedy
# decoding -- argmax at every step; a true beam search would keep the
# top-k partial hypotheses instead)
model.encode(item["english"][:, 1:-1])
g = item["german"].shape
x = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
all_outs = torch.tensor([], requires_grad=True).to(device)
for i in range(item["german"].shape[1] - 1):
    out = model(x)
    x[:, i:i+1] = out.argmax(axis=-1)   # feed back the model's own prediction
    all_outs = torch.cat((all_outs, out), dim=1)

I found this glitch while fiddling with the attention layer at its core: zeroing out the attention weights did no harm to the performance of a last-token-only model in

sub_layers.py

attention_weights = F.softmax(attention_weights, dim=2)
attention_weights = attention_weights * 0.   ## Try this!
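A possible explanation for why zeroing the weights goes unnoticed (a sketch, assuming a standard residual sublayer of the form x + Attention(x)): with the weights zeroed, the attention output is all zeros, and the residual connection passes each token's own representation through unchanged, which is all a model evaluated token-by-token against a known prefix needs.

```python
import torch

# Toy shapes (hypothetical): batch=1, seq=4, embed=8.
x = torch.randn(1, 4, 8)                   # sublayer input
values = torch.randn(1, 4, 8)              # attention "values"
attention_weights = torch.zeros(1, 4, 4)   # zeroed, as in the experiment

sublayer_out = x + attention_weights @ values   # residual connection
assert torch.equal(sublayer_out, x)             # x passes through untouched
```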

Thanks for sharing, but questions remain

Dear devs,

I find this repo simple and smooth to run. However, I am confused about why you use the same embedding for both the input and the output.

Specifically, in def forward and in def encode, both use the same reference to self.embedding. This looks odd, doesn't it? Shouldn't the source language use a different embedding than the destination language?

class TransformerTranslator(nn.Module):
    def __init__(self, embed_dim, num_blocks, num_heads, vocab_size, CUDA=False):
        super(TransformerTranslator, self).__init__()
        self.embedding = Embeddings(vocab_size, embed_dim, CUDA=CUDA)   # one shared table
        self.encoder = Encoder(embed_dim, num_heads, num_blocks, CUDA=CUDA)
        self.decoder = Decoder(embed_dim, num_heads, num_blocks, vocab_size, CUDA=CUDA)
        self.encoded = False
        self.device = torch.device('cuda:0' if CUDA else 'cpu')

    def encode(self, input_sequence):
        embedding = self.embedding(input_sequence).to(self.device)   # same table as forward()
        self.encode_out = self.encoder(embedding)
        self.encoded = True

    def forward(self, output_sequence):
        if not self.encoded:
            print("ERROR::TransformerTranslator:: MUST ENCODE FIRST.")
            return output_sequence
        embedding = self.embedding(output_sequence)   # same table as encode()
        return self.decoder(self.encode_out, embedding)
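For what it's worth, sharing one embedding table is a legitimate design when source and target use a joint vocabulary (the original Transformer paper even ties the embedding with the output projection), but with two separate vocabularies one would normally use two tables. A minimal sketch of the split (plain nn.Embedding stands in for the repo's Embeddings class, and all names here are my own):

```python
import torch
import torch.nn as nn

class SplitEmbeddings(nn.Module):
    """Separate lookup tables for source (English) and target (German)."""
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim):
        super().__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)

    def embed_src(self, tokens):   # would be used by encode()
        return self.src_embedding(tokens)

    def embed_tgt(self, tokens):   # would be used by forward()
        return self.tgt_embedding(tokens)

emb = SplitEmbeddings(src_vocab_size=100, tgt_vocab_size=120, embed_dim=16)
src = emb.embed_src(torch.tensor([[1, 2, 3]]))   # (1, 3, 16)
tgt = emb.embed_tgt(torch.tensor([[4, 5]]))      # (1, 2, 16)
```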

add beam search

Without a decoding method one cannot actually use the trained network to translate... Greedy decoding only works well when the network is already very good.
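A minimal beam search sketch along those lines (the `step_fn` callback, which should return next-token log-probabilities for a given prefix, is a hypothetical stand-in for running the decoder):

```python
import torch

def beam_search(step_fn, sos, eos, beam_width=3, max_len=20):
    """Keep the `beam_width` highest-scoring partial sequences at
    each step instead of only the argmax (greedy) choice."""
    beams = [([sos], 0.0)]   # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:             # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)       # log-probs over the vocabulary
            topv, topi = log_probs.topk(beam_width)
            for lp, tok in zip(topv.tolist(), topi.tolist()):
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]   # best-scoring sequence

# Toy model: a 4-token vocabulary where token 3 (EOS) is always most
# likely, so the search should terminate immediately after one step.
toy_step = lambda seq: torch.log(torch.tensor([0.1, 0.1, 0.1, 0.7]))
print(beam_search(toy_step, sos=2, eos=3))   # [2, 3]
```

Because scores are summed log-probabilities, longer hypotheses tend to score lower; production implementations usually add a length normalization term on top of this.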
