pandeykartikey / hierarchical-attention-network
Implementation of Hierarchical Attention Networks in PyTorch
Home Page: https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf
As far as I've read, the attention implementation at both the word and sentence level is WRONG:
```python
import torch.nn as nn
import torch.nn.functional as F

## The word RNN model for generating a sentence vector
class WordRNN(nn.Module):
    def __init__(self, vocab_size, embedsize, batch_size, hid_size):
        super(WordRNN, self).__init__()
        self.batch_size = batch_size
        self.embedsize = embedsize
        self.hid_size = hid_size
        ## Word Encoder
        self.embed = nn.Embedding(vocab_size, embedsize)
        self.wordRNN = nn.GRU(embedsize, hid_size, bidirectional=True)
        ## Word Attention
        self.wordattn = nn.Linear(2*hid_size, 2*hid_size)
        self.attn_combine = nn.Linear(2*hid_size, 2*hid_size, bias=False)

    def forward(self, inp, hid_state):
        emb_out = self.embed(inp)
        out_state, hid_state = self.wordRNN(emb_out, hid_state)
        word_annotation = self.wordattn(out_state)
        attn = F.softmax(self.attn_combine(word_annotation), dim=1)
        # attention_mul is a helper defined elsewhere in the repo
        # (a weighted sum of out_state using the attention weights)
        sent = attention_mul(out_state, attn)
        return sent, hid_state
```
The problematic line is: `attn = F.softmax(self.attn_combine(word_annotation), dim=1)`.
By PyTorch convention, if you don't pass batch_first=True to the GRU, out_state has shape (n_steps, batch_size, out_dims).
As the paper states, the softmax should be applied across time steps (so that the attention weights over all time steps sum to 1), whereas THIS IMPLEMENTATION applies F.softmax ACROSS THE BATCH (dim=1), which is incorrect!!! It should be dim=0.
The sentence-level attention has the same problem.
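To illustrate the difference, here is a minimal, self-contained sketch with toy shapes (not the repo's actual code) showing which axis each `dim` normalizes over:

```python
import torch
import torch.nn.functional as F

# Toy attention scores: 5 time steps, 3 batch elements, 4 features,
# mimicking out_state's (n_steps, batch_size, out_dims) layout.
scores = torch.randn(5, 3, 4)

wrong = F.softmax(scores, dim=1)  # normalizes across the batch (the bug)
fixed = F.softmax(scores, dim=0)  # normalizes across time steps (the fix)

print(fixed.sum(dim=0))  # all ones: the weights for each word sum to 1 over time
print(wrong.sum(dim=1))  # all ones too, but summed across batch elements,
                         # which mixes unrelated examples
```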
Maybe this is one reason for the fluctuating, non-convergent test accuracy.
I am reading through the code and trying to put together a corrected version of this implementation; I will report back later.
This is definitely a good implementation of the paper. As shown in the paper, it provides a visualization of the importance (attention) of each word in a document. I'm wondering: can we get the attention score for each word from the code? Thank you so much!
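Not the author, but based on the WordRNN class quoted in the issue above, the per-word attention weights are already computed as attn inside forward; they're just not returned. A hedged sketch of one way to expose them (same class, with the dim=0 fix discussed above applied; attention_mul is the repo's weighted-sum helper):

```python
def forward(self, inp, hid_state):
    emb_out = self.embed(inp)
    out_state, hid_state = self.wordRNN(emb_out, hid_state)
    word_annotation = self.wordattn(out_state)
    # Softmax over time steps (dim=0), per the fix discussed above.
    attn = F.softmax(self.attn_combine(word_annotation), dim=0)
    sent = attention_mul(out_state, attn)
    # attn has shape (n_steps, batch, 2*hid_size); averaging over the
    # hidden dimension gives one importance score per word per example.
    word_scores = attn.mean(dim=2)
    return sent, hid_state, word_scores
```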
Hi Pandey,
Can you tell me the reasoning behind the above statement in your README?
Thanks.
I am running the Jupyter notebook with pretrained embeddings of size 300. I changed embedsize to 300, but I get a "NoneType is not iterable" error. The error goes away if I load pretrained word2vec vectors of size 200 with everything else unchanged. Is the size hardcoded in the model?
Changing the batch size from 64 to 80 causes the same error as well.
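Not sure where exactly it breaks, but mismatches like this often come from the pretrained matrix not agreeing with embedsize when it is copied into the embedding layer. A hedged debugging sketch (the file path and variable names are illustrative, not from the repo):

```python
import numpy as np
import torch
import torch.nn as nn

embedsize = 300
# Hypothetical path: wherever your pretrained word2vec matrix lives.
weights = np.load("word2vec_300d.npy")

# If the second dimension of the pretrained matrix doesn't match embedsize,
# copying it into nn.Embedding will fail (or misbehave further upstream).
assert weights.shape[1] == embedsize, (
    f"pretrained dim {weights.shape[1]} != embedsize {embedsize}"
)

embed = nn.Embedding(weights.shape[0], embedsize)
embed.weight.data.copy_(torch.from_numpy(weights))
```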