
hierarchical-attention-networks's People

Contributors

ematvey, mtdersvan, nthain

hierarchical-attention-networks's Issues

Performance on the paper's dataset

The performance reported in the README has not been computed on the same dataset used in the original paper (Hierarchical Attention Networks for Document Classification, Yang et al., 2016).

To understand the real performance of the implementation, it would be more useful to report accuracy on that dataset, where the training, dev, and test splits are predefined.
The dataset can be downloaded from Duyu Tang's homepage.
Download link: http://ir.hit.edu.cn/~dytang/paper/emnlp2015/emnlp-2015-data.7z

Error While Running yelp_prepare.py

Hello. While running yelp_prepare.py, I got the following error log.
The code was run with the Yelp dataset (round 10), TensorFlow 1.1.0, and Python 3.5.2 on Linux.

0it [00:00, ?it/s]
Traceback (most recent call last):
File "yelp_prepare.py", line 98, in <module>
make_data()
File "yelp_prepare.py", line 78, in make_data
for sent in en(review['text']).sents:
File "/home/wangtao/py35env/lib/python3.5/site-packages/spacy/language.py", line 330, in __call__
for name, proc in self.pipeline:
TypeError: 'Tagger' object is not iterable
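
This traceback typically points to a mismatch between the installed spaCy version and its language model (the pipeline loop expects (name, proc) tuples but finds bare components). For reference, a minimal sketch of the sentence splitting yelp_prepare.py relies on, assuming spaCy 2.x with en_core_web_sm installed; the review record is illustrative:

```python
# Minimal sketch, assuming spaCy 2.x and the en_core_web_sm model.
import spacy

nlp = spacy.load('en_core_web_sm')  # tagger + parser; the parser powers doc.sents

review = {'text': 'Great food. Terrible service.'}  # illustrative record
doc = nlp(review['text'])
sentences = [[token.text for token in sent] for sent in doc.sents]
print(sentences)  # [['Great', 'food', '.'], ['Terrible', 'service', '.']]
```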

Performance on Yelp 15

I used the same dataset (download link: http://ir.hit.edu.cn/~dytang/paper/emnlp2015/emnlp-2015-data.7z), but I can only get 68.5% on Yelp 2015 (the paper reports 71%). Is there anything wrong with my parameters? Here they are:
vocab_size: 49000 (byte-pair encoding with 50000 byte pairs; all tokens that appear at least 5 times)
learning_rate: 0.001
max tokens in a sentence: 48 (over 95% of sentences are shorter than 48 tokens)
max sentences in a document: 32 (over 95% of documents are shorter than 32 sentences)
word_embedding_size: 300 (pre-trained with word2vec)
word_output_size: 128
sentence_output_size: 128
LSTM hidden_dim: 64
LSTM layer_num: 5
dropout_keep_prob: 0.8 (using tf.nn.dropout; dropout is added after word_output and sentence_output)

some error in yelp_prepare.py

When I run yelp_prepare.py, it tells me: ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: http://spacy.io/docs/usage
Could you please give me some advice?
Thanks a lot!
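
This error means the loaded spaCy pipeline has no dependency parser, which doc.sents requires. A hedged sketch of downloading and checking a full English model, assuming spaCy 2.x (older versions use different model names):

```python
# Hedged sketch, assuming spaCy 2.x; older versions use different model names.
import spacy

spacy.cli.download('en_core_web_sm')  # fetch a model that includes a parser
nlp = spacy.load('en_core_web_sm')
assert 'parser' in nlp.pipe_names     # doc.sents needs the dependency parse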

What's the accuracy?

Could you tell me the accuracy you get on Yelp 2013/2014/2015 by running your code? I ran the code, but I could not reach the accuracy reported in the paper.

Thanks!

GRU VS LSTM

Hi @ematvey,
First of all, thanks a lot for your implementation!
This is actually more of a question than an issue: if I'm not mistaken, your code suggests that you initially used GRU cells and then switched to LSTM cells. May I ask why?

Error While Training with Yelp Dataset

While training the model on the Yelp dataset prepared by yelp_prepare.py, I got the following error log.
The code was run with TensorFlow 1.0.1 and Python 2.7.12 on Linux.

....
step 251, loss=1.55983, accuracy=0.2, t=19.89, inputs=(30, 30, 30)
Traceback (most recent call last):
  File "worker.py", line 220, in <module>
    main()
  File "worker.py", line 215, in main
    train()
  File "worker.py", line 195, in train
    ], fd)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[28,0,6] = 50000 is not in [0, 50000)
	 [[Node: tcm/tcm/embedding/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@tcm/embedding/embedding_matrix"], validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](tcm/embedding/embedding_matrix/read, _recv_tcm/inputs_0)]]

Caused by op u'tcm/tcm/embedding/embedding_lookup', defined at:
  File "worker.py", line 220, in <module>
    main()
  File "worker.py", line 215, in main
    train()
  File "worker.py", line 165, in train
    model, saver = model_fn(s)
  File "worker.py", line 97, in HAN_model_1
    is_training=is_training,
  File "/data/zhiheng/project/deep-text-classifier/HAN_model.py", line 63, in __init__
    self._init_embedding(scope)
  File "/data/zhiheng/project/deep-text-classifier/HAN_model.py", line 102, in _init_embedding
    self.embedding_matrix, self.inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 111, in embedding_lookup
    validate_indices=validate_indices)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1359, in gather
    validate_indices=validate_indices, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): indices[28,0,6] = 50000 is not in [0, 50000)
	 [[Node: tcm/tcm/embedding/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@tcm/embedding/embedding_matrix"], validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](tcm/embedding/embedding_matrix/read, _recv_tcm/inputs_0)]]
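
The failing index equals the vocabulary size (50000), so at least one token id produced by preprocessing lies one past the end of the embedding matrix. A hedged workaround sketch, not taken from the repo, that clips ids before the lookup; the attribute names mirror _init_embedding but are assumptions:

```python
# Hedged sketch: clip token ids into [0, vocab_size) before the embedding lookup.
# A cleaner fix would be to set vocab_size = max_token_id + 1 when building the model.
import tensorflow as tf

clipped_inputs = tf.clip_by_value(self.inputs, 0, self.vocab_size - 1)
self.inputs_embedded = tf.nn.embedding_lookup(self.embedding_matrix, clipped_inputs)
```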

Are uw and us global weights? Just to confirm.

Thank you, ematvey, for this implementation.

I wonder whether uw and us are two vectors used as global weights, or whether there is a different uw for each sentence and a different us for each document.

From the code I think they are global vectors; am I right? Please help me confirm this.

As model_components.py puts it:

Performs task-specific attention reduction, using learned
attention context vector (constant within task of interest).

uw and us are defined in the function task_specific_attention(). Although both are referred to as attention_context_vector, are they different vectors in the computational graph? It would be helpful if you could explain this part a little.

attention_context_vector = tf.get_variable(
    name='attention_context_vector',
    shape=[output_size],
    initializer=initializer,
    dtype=tf.float32)

Thank you.
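
A small sketch of what tf.get_variable implies here, assuming (as the code appears to do) that task_specific_attention is called once under a word-level scope and once under a sentence-level scope: each scope then owns exactly one context vector, shared across all sentences and all documents respectively. Shapes are illustrative.

```python
# Illustrative only: one u_w per word-level scope, one u_s per sentence-level scope.
import tensorflow as tf

with tf.variable_scope('word'):
    u_w = tf.get_variable('attention_context_vector', shape=[128])
with tf.variable_scope('sentence'):
    u_s = tf.get_variable('attention_context_vector', shape=[128])

print(u_w.name)  # word/attention_context_vector:0
print(u_s.name)  # sentence/attention_context_vector:0
```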

The Attention code

Hi ematvey. I have read the paper, but I don't understand your code for the attention mechanism. How can I get the weight of each word in a sentence from your code?
Thanks.
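
A minimal sketch of the attention reduction, rewritten (not copied from the repo) so that it also returns the per-word weights; variable names are illustrative:

```python
# Hedged sketch of task-specific attention that also exposes the softmax weights.
# `inputs` is [batch, time, hidden]; names are illustrative, not the repo's.
import tensorflow as tf
from tensorflow.contrib import layers

def attention_with_weights(inputs, output_size, scope=None):
    with tf.variable_scope(scope or 'attention'):
        context = tf.get_variable('attention_context_vector',
                                  shape=[output_size], dtype=tf.float32)
        projection = layers.fully_connected(inputs, output_size,
                                            activation_fn=tf.tanh)
        # similarity of every timestep to the learned context vector
        scores = tf.reduce_sum(projection * context, axis=2, keep_dims=True)
        weights = tf.nn.softmax(scores, dim=1)             # [batch, time, 1]
        outputs = tf.reduce_sum(inputs * weights, axis=1)  # weighted sum of hidden states
        return outputs, tf.squeeze(weights, axis=2)        # weights: one per word
```

Running a sentence batch through this returns both the sentence vector and the per-word weights, which can then be inspected or visualized.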

Embeddings for special tokens/padding?

I was wondering where in the code you initialize the embeddings for the special tokens in the vocabulary (like the unknown and padding words). Shouldn't these be set to zero embeddings and excluded from training? Or how are you dealing with them?
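
The repo does not appear to special-case these rows; one common workaround (an assumption, not the repo's code) is to multiply the embedding matrix by a mask so the PAD row stays at zero and never receives gradients:

```python
# Hedged sketch, assuming PAD has id 0; vocab_size and embedding_size as in the model.
import tensorflow as tf

raw_matrix = tf.get_variable('embedding_matrix', [vocab_size, embedding_size])
pad_mask = tf.concat([tf.zeros([1, embedding_size]),
                      tf.ones([vocab_size - 1, embedding_size])], axis=0)
embedding_matrix = raw_matrix * pad_mask  # row 0 is always zero and gradient-free
```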

ValueError in running worker.py

Sorry to bother you again.
I used TensorFlow 1.2.1 and Python 3.6 and ran worker.py as per your instructions, but it hit an error:
ValueError: Trying to share variable tcm/word/fw/multi_rnn_cell/cell_0/bn_lstm/w_xh, but specified shape (100, 320) and found shape (200, 320).

en-core-web-sm needs to be installed beforehand

When trying to install using requirements.txt, I got errors like "Could not find a version that satisfies the requirement en-core-web-sm".

This can be avoided by first installing spaCy, then installing the English model as in step 2, and only then installing with requirements.txt.

Is the embedding initialized with a pre-trained one?

From the code, it seems the embedding is not initialized with a pre-trained embedding (e.g. word2vec), although the paper says it is. Am I right, or did I miss something? Many thanks!

Relevant code in _init_embedding:

def _init_embedding(self, scope):  # seemingly no pre-trained word embedding is used
    with tf.variable_scope(scope):
        with tf.variable_scope("embedding") as scope:
            self.embedding_matrix = tf.get_variable(
                name="embedding_matrix",
                shape=[self.vocab_size, self.embedding_size],
                initializer=layers.xavier_initializer(),
                dtype=tf.float32)
            self.inputs_embedded = tf.nn.embedding_lookup(
                self.embedding_matrix, self.inputs)
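
The snippet above uses a Xavier initializer, so it does not load word2vec on its own. A hedged sketch (not from the repo) of the usual feed-and-assign pattern for injecting pre-trained vectors; `pretrained` is a hypothetical [vocab_size, embedding_size] numpy array aligned with the vocabulary:

```python
# Hedged sketch: feed pre-trained vectors once after the session is created.
import tensorflow as tf

embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_size])
embedding_init = embedding_matrix.assign(embedding_placeholder)

# ... build the rest of the graph, create the session ...
session.run(embedding_init, {embedding_placeholder: pretrained})
```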

Mask for attention weight

Hi ematvey,

Thanks for sharing the code!

I notice the attention weights for sentences and words are not masked according to their actual lengths, which means the model will "pay attention" to the useless padded input. Is there a reason you didn't use a mask in this project?

Please correct me if I am wrong. Thanks!
Xianlonb
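
For reference, a minimal sketch (not from the repo) of masking attention scores by the true lengths before the softmax, so padded positions get effectively zero weight; `scores` is [batch, time, 1] and `lengths` is [batch], both illustrative names:

```python
# Hedged sketch: push padded positions toward -inf before the softmax.
import tensorflow as tf

mask = tf.sequence_mask(lengths, maxlen=tf.shape(scores)[1])  # [batch, time] bool
mask = tf.expand_dims(tf.cast(mask, tf.float32), axis=2)      # [batch, time, 1]
masked_scores = scores + (1.0 - mask) * -1e9
weights = tf.nn.softmax(masked_scores, dim=1)                 # ~zero weight on pads
```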

Won't the code lead to different input shapes for different batches?

In the file data_util.py, the code is as follows:
def batch(inputs):
    batch_size = len(inputs)

    document_sizes = np.array([len(doc) for doc in inputs], dtype=np.int32)  # differs between batches
    document_size = document_sizes.max()  # maximum number of sentences in a document

    sentence_sizes_ = [[len(sent) for sent in doc] for doc in inputs]  # length of every sentence in each document
    sentence_size = max(map(max, sentence_sizes_))  # maximum sentence length

    b = np.zeros(shape=[batch_size, document_size, sentence_size], dtype=np.int32)  # zeros == PAD

    sentence_sizes = np.zeros(shape=[batch_size, document_size], dtype=np.int32)
    for i, document in enumerate(inputs):
        for j, sentence in enumerate(document):
            sentence_sizes[i, j] = sentence_sizes_[i][j]
            for k, word in enumerate(sentence):
                b[i, j, k] = word

    return b, document_sizes, sentence_sizes
The output batch depends on the inputs. Won't this lead to different shapes of b, since the input is not padded beforehand? Each document may have a different number of sentences and each sentence may have a different number of words.
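
For what it's worth, the shape of b does differ from batch to batch, but it is consistent within a batch, since padding goes up to that batch's own maxima; the varying sizes are fine as long as the model's input placeholders are defined with dynamic dimensions. A toy usage sketch (inputs are illustrative):

```python
# Toy example: batch() pads to this batch's own maximum sizes.
import numpy as np

docs = [
    [[1, 2, 3], [4, 5]],  # 2 sentences, longest has 3 tokens
    [[6]],                # 1 sentence, 1 token
]
b, document_sizes, sentence_sizes = batch(docs)
print(b.shape)            # (2, 2, 3) for this batch; another batch may differ
print(document_sizes)     # [2 1]
print(sentence_sizes)     # [[3 2]
                          #  [1 0]]
```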

How to make `TensorBoard Projector` work.

I'd like to uncomment the Embedding Projector part to visualize the result, but I don't know how. For instance, embedding.metadata_path = vocab_tsv in the original code does not work, since vocab_tsv does not exist. What variable should I assign in this statement?

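A hedged sketch of one way to make it work: write the vocabulary metadata TSV yourself and point metadata_path at it. log_dir, vocab_words, and embedding_matrix are assumptions standing in for the repo's actual objects:

```python
# Hedged sketch: write a metadata file and register the embedding with the projector.
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

vocab_tsv = os.path.join(log_dir, 'vocab.tsv')
with open(vocab_tsv, 'w') as f:
    for word in vocab_words:  # one line per embedding row, in id order
        f.write(word + '\n')

config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = embedding_matrix.name  # the model's embedding variable
embedding.metadata_path = vocab_tsv
projector.visualize_embeddings(tf.summary.FileWriter(log_dir), config)
```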

dev accuracy: nan???

I ran the program on a small amount of data, just a few records from review.json. I get dev accuracy = nan during training. I checked the dev.dataset file and it is not empty. Can someone explain this to me?

Attention layer output

The method task_specific_attention applies attention to the projected vectors instead of the hidden vectors (the outputs of the RNN cell).

Was this done on purpose, or was the paper's formulation missed, where the final sentence vector is the weighted sum of the hidden states and NOT of the inner projected vectors?
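
Reusing the illustrative names from the attention sketch earlier on this page, the two formulations differ only in which tensor is weighted and summed:

```python
# Illustrative fragment; `inputs`, `projection`, and `weights` come from the
# attention sketch above and are not the repo's exact variables.
import tensorflow as tf

# Paper: weighted sum of the RNN hidden states h_it.
sentence_vector_paper = tf.reduce_sum(inputs * weights, axis=1)
# Repo, as described in this issue: weighted sum of the tanh-projected vectors.
sentence_vector_repo = tf.reduce_sum(projection * weights, axis=1)
```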

Implementation using tf.contrib.seq2seq.

Hi, your implementation of Hierarchical Attention Networks for Document Classification (Yang et al., 2016) looks very good, especially after the bug fixes. Nice work!

I wonder what it would take to reimplement your HANs with the new tf.contrib.seq2seq API (> r1.2). There are some examples of how to do encoder-decoder attention with seq2seq, but no examples of pure encoding ('reducing') attention.

Thank you very much!
