Similar to structured CNNs, we also have structured RNNs that encode structural inductive biases, especially in NLP (trees, DAGs, etc.).
Here are some models:
Recursive Neural Networks: Parsing Natural Scenes and Natural Language with Recursive Neural Networks (ICML11)
Tree-LSTM: Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks (ACL15). A minimal cell sketch follows.
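As a concrete illustration, here is a minimal sketch of a Child-Sum Tree-LSTM cell in PyTorch. The class and variable names are my own choices, and a real implementation would batch nodes and process the tree bottom-up; this is only meant to show the gating structure.

```python
# A minimal sketch of a Child-Sum Tree-LSTM cell (Tai et al., ACL15),
# written for illustration; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # input, output, and candidate gates use the sum of child hidden states
        self.W_iou = nn.Linear(input_dim, 3 * hidden_dim)
        self.U_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        # a separate forget gate is computed for each child
        self.W_f = nn.Linear(input_dim, hidden_dim)
        self.U_f = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (input_dim,) embedding of the current node
        # child_h, child_c: (num_children, hidden_dim); empty tensors for leaves
        h_sum = child_h.sum(dim=0)                           # sum of child hidden states
        i, o, u = torch.chunk(self.W_iou(x) + self.U_iou(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))  # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                 # combine children's cell states
        h = o * torch.tanh(c)
        return h, c
```

To compute a sentence representation, the cell is applied to the parse tree bottom-up, feeding each node its children's (h, c) pairs.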
There are a couple of variants of RNNs that should be studied:
Speed: how to decouple the recurrent dependencies by introducing different strategies (some borrowed from CNNs); see the sketch after this list.
SRU: Simple Recurrent Units for Highly Parallelizable Recurrence (EMNLP18). They claimed that SRU achieves a 5–9x speed-up over the cuDNN-optimized LSTM.
Quasi-RNN: Quasi-Recurrent Neural Networks (ICLR17). They claimed that QRNN is 16 times faster than LSTMs at train and test time.
SRNN: Sliced Recurrent Neural Networks (COLING2018). They claimed that SRNNs are 136 times as fast as standard RNNs and could be even faster when training on longer sequences.
Architecture:
NASCell: Neural Architecture Search with Reinforcement Learning (ICLR17). Uses NAS (an RNN controller trained with RL) to learn the recurrent cell architecture.
RCRN: Recurrently Controlled Recurrent Networks (NIPS18). Learns the recurrent gating functions using another recurrent network (a controller cell gates a listener cell).
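To make the speed point above more concrete, here is a minimal PyTorch sketch of the "light recurrence" idea shared by SRU and QRNN: every matrix multiplication depends only on x_t, so it can be computed for the whole sequence in parallel, and the sequential loop contains only element-wise operations. This is a simplified illustration, not the exact formulation of either paper; the dimensions, the tanh, and the highway term are my assumptions.

```python
# A simplified sketch of SRU/QRNN-style light recurrence: matmuls are batched over
# time, and only element-wise updates remain in the sequential loop.
import torch
import torch.nn as nn

class LightRecurrentUnit(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, 3 * dim)   # candidate, forget gate, reset gate

    def forward(self, x):
        # x: (seq_len, batch, dim)
        z, f, r = torch.chunk(self.W(x), 3, dim=-1)       # computed for all steps at once
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = torch.zeros_like(x[0])
        outputs = []
        for t in range(x.size(0)):                        # only element-wise ops here
            c = f[t] * c + (1 - f[t]) * z[t]              # light recurrence on the cell
            h = r[t] * torch.tanh(c) + (1 - r[t]) * x[t]  # highway-style output
            outputs.append(h)
        return torch.stack(outputs), c
```

SRNN takes a different route (slicing the sequence into subsequences that are processed in parallel), which is not shown here.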
In the section Variants of GNN and Propagation Steps, some concepts just pop up from nowhere. What is Relational-GCN? Could you give some explanation?
This problem also applies to the image attached here. It is a nice picture, but some key concepts need to be clarified. For example, what is an aggregator? What is an updater? What is a spectral method? What is a spatial method? For the models mentioned in the picture, you should at least mention where they come from (conference + year), what components are inside each model, and which tasks they try to deal with (roughly).
Another topic I would like to investigate (as always) is convolution over structured data, since language has structure.
Related works are tree-structured convolution and GCNs (a minimal GCN layer sketch is given below). Since we will compare CNNs with RNNs, Tree-LSTM, Recursive NNs, and GGNNs (GRNs) can be discussed at the same time.
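For reference, here is a minimal sketch of one GCN layer following the normalized-adjacency propagation rule of Kipf & Welling (ICLR17). The dense adjacency matrix and the names are my own choices; real implementations use sparse operations.

```python
# A minimal sketch of one GCN layer: add self-loops, symmetrically normalize the
# adjacency matrix, aggregate neighbor features, then apply a linear transform.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, H, A):
        # H: (num_nodes, in_dim) node features; A: (num_nodes, num_nodes) adjacency
        A_hat = A + torch.eye(A.size(0))            # add self-loops
        deg = A_hat.sum(dim=1)                      # node degrees
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
        return torch.relu(A_norm @ self.linear(H))  # aggregate, then transform
```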
In this section, the note mentions that GNNs include GCN, GCNN, GAT, Gated GNN, and Graph LSTM. None of these models has been explained or compared. What is the point of just listing their names?
Also, I highly doubt that Gated GNN and Graph LSTM belong to the GCN family. I think these models should be classified under the broader GNN family.
There are two EMNLP18 papers that empirically compare these networks:
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures.
They use subject-verb agreement to test long-range dependency (LRD) modeling, and word sense disambiguation to test semantic feature extraction.
Results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.
The Importance of Being Recurrent for Modeling Hierarchical Structure.
They choose two tasks: subject-verb agreement, to test the ability to capture syntactic dependencies, and logical inference, to compare tree-based NNs against sequence-based NNs with respect to their ability to exploit hierarchical structure. Both tasks require the models to exploit hierarchical structural features.
Conclusion: recurrence is indeed important for modeling hierarchical structure.
They introduced the classic CNN architectures, from LeNet, AlexNet, VGG, GoogLeNet, NiN, and ResNet to DenseNet, together with their corresponding implementations. From my perspective, understanding these important CNN works is beneficial if we want to design our own architectures. A minimal residual-block sketch follows as one example of these key ideas.
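As one example of the ideas worth internalizing, here is a minimal sketch of a ResNet basic block in PyTorch: the block learns a residual F(x) and adds it to an identity shortcut. The channel sizes and layer ordering follow common practice rather than any particular reference implementation.

```python
# A minimal sketch of a ResNet basic block: output = relu(F(x) + x),
# where F is two 3x3 convolutions with batch norm.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)    # identity shortcut keeps gradients flowing
```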