
Sentiment analysis on movie reviews

About

This project is an exploration of methods involved in natural language processing, and in this case, sentiment analysis with text classification.

The data is obtained from a Kaggle tutorial competition, Bag of Words Meets Bags of Popcorn, and consists of movie reviews from IMDB. The objective is to give a binary classification, indicating a positive or negative sentiment.

Methodology

Data preprocessing

Typical data cleaning steps for text include removing stop words and normalizing the text. In this project, we also strip HTML tags from the reviews.
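A minimal sketch of these cleaning steps, using only the standard library. The stop-word list here is a small illustrative subset; a real pipeline would typically use a standard list such as NLTK's English stop words.

```python
import re

# Illustrative stop-word subset (assumption; the full pipeline would use a
# standard stop-word list).
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "on", "of"}

def clean_review(raw_review: str) -> list[str]:
    """Strip HTML tags, keep letters only, lowercase, drop stop words."""
    no_html = re.sub(r"<[^>]+>", " ", raw_review)      # remove HTML tags
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_html)  # normalize to letters
    words = letters_only.lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(clean_review("<br />This movie is <b>GREAT</b>!"))
# -> ['movie', 'great']
```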

Feature extraction

Bag-of-words word counts

In a bag-of-words model, the sequence of words within a sentence does not matter. Taking an example from Bag of Words Meets Bags of Popcorn, we have two sentences:

Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"

Given that the vocabulary from the two sentences is {the, cat, sat, on, hat, dog, ate, and}, we simply construct a vector based on the number of occurrences of each word in a sentence.

| Sentence | the | cat | sat | on | hat | dog | ate | and |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| The cat sat on the hat | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| The dog ate the cat and the hat | 3 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |

Thus, the vectors produced for the sentences are:

Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0]
Sentence 2: [3, 1, 0, 0, 1, 1, 1, 1]

Each sentence is thereby transformed into a vector whose length equals the size of the vocabulary. During classification, the model can then learn that higher counts of certain words make a particular prediction more likely.

word2vec

word2vec learns dense word embeddings from one-hot encoded input vectors, so that each word is represented as an n-dimensional vector. These word vectors are able to capture relations between words in vector space.

Continuous bag-of-words (CBOW)

In the CBOW model, we attempt to predict a word given its context. This means that we attempt to predict the center word from the sum of surrounding word vectors.

Skip-gram

For the skip-gram model, we predict the context given a word: instead of predicting the center word from its surrounding words, we predict each surrounding word from the center word.

The skip-gram model is found to perform better on words that appear less frequently. However, it is much slower to train than the CBOW model.

Thus, negative sampling is introduced when training the skip-gram model. In addition to the surrounding words, we also draw k negative samples (words that are not within the context window). During training, we maximize the probability that words in the context window appear, while minimizing the probability that the negative samples appear.
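The skip-gram-with-negative-sampling objective can be sketched in plain NumPy. This is a toy implementation for illustration only (tiny corpus, illustrative hyperparameters); in practice one would use a library such as gensim. For each (center, context) pair we push the score of the true context word toward 1 and the scores of k randomly drawn words toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; real training would use the tokenized reviews.
corpus = "the cat sat on the hat the dog ate the cat and the hat".split() * 20
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16          # vocabulary size, embedding dimension
window, k, lr = 2, 3, 0.05     # context window, negative samples, learning rate

W_in = rng.normal(scale=0.1, size=(V, D))   # center-word ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, D))  # context-word ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(5):  # epochs
    for pos, word in enumerate(corpus):
        c = word_to_id[word]
        for ctx in range(max(0, pos - window), min(len(corpus), pos + window + 1)):
            if ctx == pos:
                continue
            # one positive (true context) word plus k random negative samples
            targets = [word_to_id[corpus[ctx]]] + list(rng.integers(0, V, size=k))
            labels = [1.0] + [0.0] * k
            for t, y in zip(targets, labels):
                score = sigmoid(W_in[c] @ W_out[t])
                g = score - y                 # gradient of the logistic loss
                grad_in = g * W_out[t]        # cache before updating W_out
                W_out[t] -= lr * g * W_in[c]
                W_in[c] -= lr * grad_in

# each row of W_in is now the learned embedding for one vocabulary word
print(W_in[word_to_id["cat"]].shape)  # (16,)
```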

Classification

Machine learning

We use a random forest classifier from scikit-learn as the machine learning classifier for the extracted features.
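A minimal sketch of this step on bag-of-words counts. The tiny dataset below is made up for illustration; the project trains on the Kaggle IMDB reviews.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative reviews (assumption); 1 = positive, 0 = negative sentiment.
reviews = [
    "great movie loved it",
    "wonderful acting great fun",
    "terrible plot boring",
    "awful waste of time boring",
]
labels = [1, 1, 0, 0]

# bag-of-words features, then a random forest on top of them
features = CountVectorizer().fit_transform(reviews)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)

print(clf.predict(features))
```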

RNN

The RNN model architecture consists of the following components:

  1. Word embedding as a randomly initialized trainable variable
  2. RNN for feature extraction
  3. Fully-connected layer for softmax classification
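The three components above can be sketched with Keras. The project's actual hyperparameters are not given, so the sizes below (vocabulary, embedding dimension, RNN units, sequence length) are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, RNN_UNITS, N_CLASSES = 5000, 64, 128, 2  # assumed sizes

model = tf.keras.Sequential([
    # 1. randomly initialized, trainable word embedding
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # 2. RNN over the embedded sequence for feature extraction
    tf.keras.layers.SimpleRNN(RNN_UNITS),
    # 3. fully-connected layer with softmax over the two sentiment classes
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# forward pass on a batch of 4 padded token-id sequences of length 200
tokens = np.random.randint(0, VOCAB_SIZE, size=(4, 200))
probs = model(tokens)
print(probs.shape)  # (4, 2)
```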

CNN

Model architecture for CNN is obtained from the paper Convolutional Neural Networks for Sentence Classification.

Similar to the RNN model architecture, the CNN trains new word embeddings from scratch. The embedded sequence is then passed independently through convolutional layers with kernels of sizes 3, 4, and 5. Max-pooling then takes the maximum of each filter, resulting in a tensor of shape [batch, 1, 1, n_filters] per kernel size.

The results of the convolutional layers are then concatenated and flattened, before being fed into a fully-connected layer for softmax classification.
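A Keras sketch of this architecture, using `Conv1D` with global max-pooling as an equivalent formulation of the paper's conv2d-plus-max-pool view (the pooled features are the same). All sizes are illustrative assumptions, not the project's actual values.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, SEQ_LEN, N_FILTERS = 5000, 64, 200, 100  # assumed sizes

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
# word embeddings trained from scratch, as in the RNN model
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# one convolutional branch per kernel size; global max-pooling keeps the
# maximum activation of each of the N_FILTERS filters
pooled = []
for size in (3, 4, 5):
    conv = tf.keras.layers.Conv1D(N_FILTERS, size, activation="relu")(embedded)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

# concatenate the pooled features from all three branches, then classify
merged = tf.keras.layers.Concatenate()(pooled)  # shape (batch, 3 * N_FILTERS)
outputs = tf.keras.layers.Dense(2, activation="softmax")(merged)
model = tf.keras.Model(inputs, outputs)

tokens = np.random.randint(0, VOCAB_SIZE, size=(4, SEQ_LEN))
probs = model(tokens)
print(probs.shape)  # (4, 2)
```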

Resources

This tutorial guides the user through constructing a bag-of-words classification model. It begins with data cleaning methods, followed by feature extraction with word counts and word2vec; classification is done with a random forest.

Guide to using RNNs for text classification.

Additional readings

word2vec

