Text Classification

Data Preprocessing and EDA

  • It was observed that the target distribution was not uniform.
  • It was observed that num_sentence and reflection_period did not correlate with the targets, so they were excluded from training.
  • The text in hm_train.csv and hm_test.csv was tokenized and re-joined with single spaces to ensure that all tokens (including punctuation) are space-separated. The cleaned data was saved in hm_train_cleaned.csv and hm_test_cleaned.csv respectively.
  • The training data was split into train and validation sets and saved in train.csv and test.csv.

The code for preprocessing and EDA is present in preprocessing.ipynb.
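A minimal sketch of the cleaning and splitting steps, assuming NLTK's word_tokenize and a hypothetical text column name (the actual notebook may differ):

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

def clean(df, text_col="cleaned_hm"):  # column name is an assumption
    # Tokenize each happy moment and re-join with single spaces so that
    # all tokens, including punctuation, are space-separated.
    df[text_col] = df[text_col].apply(lambda s: " ".join(word_tokenize(s)))
    return df

for split in ("hm_train", "hm_test"):
    clean(pd.read_csv(split + ".csv")).to_csv(split + "_cleaned.csv", index=False)

# Train/validation split of the cleaned training data (split ratio assumed)
train_df, val_df = train_test_split(pd.read_csv("hm_train_cleaned.csv"),
                                    test_size=0.1, random_state=42)
train_df.to_csv("train.csv", index=False)
val_df.to_csv("test.csv", index=False)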

Approaches Explored

  • TextCNN
  • Bi-attentive Classification Network

TextCNN

The TextCNN consisted of a 3-layer 1D ConvNet followed by two Dense layers. 300-dimensional GloVe word embeddings were used for word vectorization, and dropout was used to avoid overfitting. Even with dropout, validation accuracy saturated while training accuracy kept increasing. Weighted cross-entropy loss was used to tackle the class imbalance. The code for the experiment is present in the notebook TextCNN/TextCNN.ipynb.
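A minimal Keras sketch of the architecture described above; the vocabulary size, sequence length, number of classes, filter settings, and dropout rate are assumptions, and embedding_matrix would be built from the GloVe vectors:

import numpy as np
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, N_CLASSES = 20000, 100, 7   # assumed values
embedding_matrix = np.zeros((VOCAB, 300))   # placeholder: load GloVe vectors here

model = models.Sequential([
    layers.Embedding(VOCAB, 300, input_length=SEQ_LEN,
                     weights=[embedding_matrix], trainable=False),
    layers.Conv1D(128, 5, activation="relu"),   # 3-layer 1D ConvNet
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                        # dropout against overfitting
    layers.Dense(128, activation="relu"),       # two Dense layers
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# One way to obtain a weighted cross-entropy: pass per-class weights to fit(),
# e.g. model.fit(X, y, class_weight=class_weights, ...)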

Score on submission: 0.8645

Bi-attentive Classification Network

The approach and the model architecture were introduced in the paper Learned in Translation: Contextualized Word Vectors (McCann et al., 2017). A brief overview follows:

  1. The input text sequence S = [w_1, w_2, ..., w_n] is vectorized using an embedding of choice. Let the vectors be denoted by

    E = [e_1, e_2, ..., e_n]

    where n is the number of words in the sequence.

  2. The vectorized sequence is duplicated to give two copies, E_1 and E_2.

  3. A feedforward network with ReLU activation is applied to each word vector. This pre-encoding step reduces the dimensionality of the embeddings.

    E_1' = FC_Layer(E_1)

    E_2' = FC_Layer(E_2)

  4. The outputs are then passed through a bidirectional LSTM (Bi-LSTM) to obtain context-aware representations of the words in the input sequence.

    X = Bi-LSTM(E_1')

    Y = Bi-LSTM(E_2')

  5. Compute A = XY^T, an n×n matrix. It contains the dot product of each x_i with every y_j; intuitively, it captures how each word in the input sequence relates to every other word.

  6. Compute:

    A_x = softmax(A)

    A_y = softmax(A^T)

    where softmax() denotes column-wise softmax. A_x and A_y are n×n matrices in which each column sums to 1, giving normalized weights that can be used for attention.

  7. The attention weights obtained are now applied to the corresponding matrices X and Y (see the sketch after this list):

    C_x = A_x^T · X

    C_y = A_y^T · Y

  8. The conditioned matrices C_x and C_y, along with X and Y, are passed through another Bi-LSTM to integrate the obtained information.

    X|Y = Bi-LSTM(X, C_y)

    Y|X = Bi-LSTM(Y, C_x)

    These outputs are then min-pooled along the time (word) dimension.

  9. The pooled representations X_pool and Y_pool are then concatenated and passed through a final Dense layer with a softmax at the end to obtain the probability distribution over the classes.
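A minimal PyTorch sketch of the biattention core (steps 5-7), assuming X and Y are batched (batch, n, d) outputs of the first Bi-LSTM; this illustrates the mechanism and is not the repository's actual implementation (AllenNLP provides its own):

import torch
import torch.nn.functional as F

def biattention(X, Y):
    # X, Y: (batch, n, d) context-aware word representations (step 4)
    A = torch.bmm(X, Y.transpose(1, 2))     # step 5: A = X Y^T, shape (batch, n, n)
    Ax = F.softmax(A, dim=1)                # step 6: column-wise softmax
    Ay = F.softmax(A.transpose(1, 2), dim=1)
    Cx = torch.bmm(Ax.transpose(1, 2), X)   # step 7: C_x = A_x^T · X
    Cy = torch.bmm(Ay.transpose(1, 2), Y)   # step 7: C_y = A_y^T · Y
    return Cx, Cy                           # conditioned matrices for step 8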

(Figure: Bi-attentive Classification Network architecture)

Implementation Details

TextCNN

Libraries used:

  • Keras: Used for training
  • Scikit-Learn: Used for preprocessing and metrics

For more details, refer to TextCNN/TextCNN.ipynb. The notebook contains code for loading the data, building and training the model, and predicting on the test set.

Score: 0.8645

Bi-attentive Classification Network

AllenNLP, a high-level NLP research library, was used for this purpose. The library had to be extended to add support for reading from and predicting on differently formatted data; this extension lives in the BCN/mylibrary folder (a hedged sketch of such a reader follows the experiment list below). A customizable implementation of the Bi-attentive Classification Network is available in the library itself. Several experiments were performed with different choices of word embeddings, model size, learning rate, and learning-rate scheduler:

  • 300-dimensional GloVe embeddings + medium-sized model + lr=0.001 + 20 epochs

    (See BCN/bcn.jsonnet)

    Score: 0.8766

  • 300-dimensional GloVe embeddings + medium-sized model + lr=0.0005 + LR scheduler (lr halved on a 3-epoch plateau) + 15 epochs

    (See BCN/bcn_lrscheduler.jsonnet)

    The model seemed to overfit in the later epochs; the best validation accuracy came in the 12th epoch. Perhaps increasing dropout could help.

    Score: 0.8827

  • 300-dimensional GloVe + 512-dimensional ELMo embeddings + large model + lr=0.0005

    (See BCN/bcn_glove_elmo.jsonnet)

    This configuration caused a memory error even after the batch size was reduced from 100 to 32.

  • 300-dimensional GloVe + 128-dimensional ELMo embeddings + small model + lr=0.001

    (See BCN/bcn_elmo_small.jsonnet)

    This configuration also caused a memory error, even with a small batch size.

  • 50-dimensional GloVe + 128-dimensional ELMo embeddings + small model + lr=0.001

    (See BCN/bcn_glove_small_elmo_small.jsonnet)

    Training was very slow (~1.5 hours per epoch), and the validation metrics for the first two epochs were not impressive, so training was cancelled and no submission was made.

  • 768-dimensional BERT embeddings + medium-sized model + lr=0.0005 + 10 epochs

    (See BCN/bcn_bert.jsonnet)

    The accuracy increased rapidly in the early epochs but saturated after the 8th epoch. Perhaps learning-rate scheduling could help.

    Score: 0.8738
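
As mentioned above, the library extension in BCN/mylibrary adds a dataset reader for this data. A hedged sketch of what such a reader might look like, written against the AllenNLP 0.x API; the registered name, column names, and tokenizer choice are assumptions, not the repository's exact code:

import csv

from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import LabelField, TextField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WordTokenizer

@DatasetReader.register("smiledb")  # name assumed from mylibrary/.../smiledb.py
class SmileDBReader(DatasetReader):
    # Reads a CSV of happy moments and yields labelled Instances.
    def __init__(self, lazy=False):
        super().__init__(lazy)
        self._tokenizer = WordTokenizer()
        self._indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path):
        with open(file_path) as f:
            for row in csv.DictReader(f):
                # "text" and "label" column names are assumptions
                yield self.text_to_instance(row["text"], row.get("label"))

    def text_to_instance(self, text, label=None):
        fields = {"tokens": TextField(self._tokenizer.tokenize(text),
                                      self._indexers)}
        if label is not None:  # absent at prediction time
            fields["label"] = LabelField(label)
        return Instance(fields)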

Running the code

The BCN/*.jsonnet files are configuration files that AllenNLP uses to build the model architecture and train models. The paths to training and validation data are present in these config files.

The training and prediction scripts are provided in the BCN/ folder.

For training:

python3 train.py /path/to/config.jsonnet /path/to/model/folder

For predictions:

python3 predict.py /path/to/model.tar.gz /path/to/test.csv /path/to/submission.csv
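
train.py and predict.py are presumably thin wrappers around AllenNLP's entry points that also import mylibrary so the custom components get registered. A plausible sketch of the training wrapper, assuming the AllenNLP 0.x API (the actual scripts may differ):

import sys

from allennlp.commands.train import train_model_from_file

# Importing the extension package registers the custom dataset reader
# and predictor with AllenNLP's registry.
import mylibrary  # noqa: F401

if __name__ == "__main__":
    config_path, serialization_dir = sys.argv[1], sys.argv[2]
    train_model_from_file(config_path, serialization_dir)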

Directory Structure

.
├── BCN
│   ├── mylibrary
│   │   ├── data
│   │   │   ├── dataset_readers
│   │   │   │   ├── __init__.py
│   │   │   │   └── smiledb.py
│   │   │   └── __init__.py
│   │   ├── predictors
│   │   │   ├── __init__.py
│   │   │   └── smiledb.py
│   │   └── __init__.py
│   ├── bcn_bert.jsonnet
│   ├── bcn_elmo_small.jsonnet
│   ├── bcn_glove_elmo.jsonnet
│   ├── bcn_glove_small_elmo_small.jsonnet
│   ├── bcn.jsonnet
│   ├── bcn_lrscheduler.jsonnet
│   ├── predict.py
│   └── train.py
├── TextCNN
│   └── TextCNN.ipynb
├── hm_test_cleaned.csv
├── hm_test.csv
├── hm_train_cleaned.csv
├── hm_train.csv
├── install_requirements.sh
├── preprocessing.ipynb
├── README.md
├── requirements.txt
├── test.csv
└── train.csv
