Sentiment analysis using Naive Bayes and Logistic regression

  • Idea

    • Analyze film review data and classify the reviews as positive or negative
  • Environment

    • Google colaboratory server
    • Python 3
  • Dataset

    • Film reviews dataset
      • rt-polarity.pos contains 5331 positive snippets
      • rt-polarity.neg contains 5331 negative snippets
  • Models

    • Naive Bayes
    • Logistic regression
  • Results

    • Naive Bayes:
      • Mean test accuracy: 77.60 %
      • F1-score for positive class: 0.79
      • F1-score for negative class: 0.79
      • Execution time with GridSearch: 0.277 s
      • Execution time without GridSearch: 0.013 s
      • Execution time of the from-scratch implementation: 0.008 s
    • Logistic regression:
      • Mean test accuracy: 74.53 %
      • F1-score for positive class: 0.78
      • F1-score for negative class: 0.78
      • Execution time with GridSearch: 2.51 s
      • Execution time without GridSearch: 0.073 s
  • Questions

1. Describe the text processing pipeline you have selected.

  1. Because the models do not work with plain text, I needed to convert it to a numerical representation, which first requires splitting the text into units (tokens) that can then be encoded. In a first attempt I used the Keras tokenizer, but concluded that with its one-hot encoding I was losing the frequency of each word within a sentence (only unique words are marked) and had no information about how common or rare each word is across the whole dataset.
  2. I then switched to sklearn's CountVectorizer, which solves the frequency problem but not the uniqueness one: it works like one-hot encoding, except that the integer value of each token is its frequency within the sentence.
  3. Finally I moved to sklearn's TfidfVectorizer, which covers both problems and therefore preserves more information. With TfidfVectorizer I encoded the tokenized data into a sparse matrix of tf-idf scores (a sparse representation is efficient and avoids storage problems, since most entries in each example are zeros). Each score combines how frequent a token is within an example with how rare it is in the whole dataset: a higher tf-idf score means a word that is frequent in that example but uncommon across the dataset.
    Using TfidfVectorizer improved the f1-scores of both models compared with the previous two attempts; a minimal sketch of this step is shown below.
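
A minimal sketch of this vectorization step (the snippets below are made up for illustration; the project itself loads rt-polarity.pos and rt-polarity.neg):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example snippets standing in for the real review files.
reviews = ["a thoughtful and moving film",
           "a dull, lifeless and predictable film"]

# TfidfVectorizer tokenizes each snippet, counts term frequencies and
# reweights them by inverse document frequency, returning a sparse matrix
# of tf-idf scores (efficient because most entries are zeros).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

print(X.shape)                              # (number of snippets, vocabulary size)
print(vectorizer.get_feature_names_out())   # the tokens behind each column
```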

2. Why have you selected these two classification methods?

They are easy to implement, fast, and give reasonably good performance; as the sketch below shows, both can be trained in a few lines with scikit-learn.
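
A minimal sketch on toy data (the texts and labels are illustrative, not the real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the rt-polarity snippets (1 = positive, 0 = negative).
texts = ["a moving and memorable film", "funny and smart",
         "a dull, lifeless mess", "painfully predictable and boring"]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(texts)

# Both classifiers expose the same fit/predict interface,
# so swapping one for the other is a one-line change.
for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```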

3. Compare the selected classification methods. Which one is better? Why?

As the results show, both methods give similar f1-scores, but Naive Bayes requires fewer hyperparameters to tune. Overall the two algorithms run in roughly the same time; looking closer, however, the execution time is slightly lower for Naive Bayes, and lower still for the from-scratch calculation (no hyperparameter tuning). The timing sketch below illustrates the comparison.
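
A hedged sketch of how such timings could be measured; the parameter grids and the random stand-in data are illustrative assumptions, not necessarily what the notebook uses:

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Random non-negative features as a stand-in for the tf-idf matrix.
rng = np.random.default_rng(0)
X_train = rng.poisson(1.0, size=(500, 100)).astype(float)
y_train = rng.integers(0, 2, size=500)

def timed_fit(estimator):
    # Fit the estimator and return the wall-clock training time in seconds.
    start = time.time()
    estimator.fit(X_train, y_train)
    return time.time() - start

# Naive Bayes has essentially one hyperparameter (the smoothing term alpha),
# while logistic regression has several (e.g. the regularization strength C),
# so its grid search explores a larger space and takes longer.
nb_grid = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0]}, cv=5)
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                       {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)

print("NB with GridSearch:    %.3f s" % timed_fit(nb_grid))
print("LR with GridSearch:    %.3f s" % timed_fit(lr_grid))
print("NB without GridSearch: %.3f s" % timed_fit(MultinomialNB()))
print("LR without GridSearch: %.3f s" % timed_fit(LogisticRegression(max_iter=1000)))
```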

4. How would you compare the selected classification methods if the dataset were imbalanced?

By examining the confusion matrix and the classification report. The f1-score is a good enough metric for measuring model performance on imbalanced data because, unlike plain accuracy, it combines per-class precision and recall and is therefore not dominated by the ratio of positive to negative examples; a short evaluation sketch is shown below.
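
A minimal sketch of that kind of per-class evaluation, using toy imbalanced labels:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Toy imbalanced labels: far more negatives (0) than positives (1).
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 17 + [1, 0, 1]

# Per-class precision, recall and f1 show how the minority class is handled,
# which a single accuracy number on imbalanced data would hide.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))
print("f1, positive class:", f1_score(y_true, y_pred, pos_label=1))
```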
