word2vec-inversion's Introduction

Document classification by inversion of word vectors

I recently read Matt Taddy's paper "Document classification by inversion of distributed language representations" (available here). The paper presents a neat method of using distributional information for document classification. However, my intuition is that the method would not work well for small labelled data sets, so I decided to check this experimentally.
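
To make the "inversion" idea concrete, here is a minimal sketch of its final step, assuming you already have a log likelihood for each sentence of a document under one language model per class. It applies Bayes' rule to each sentence and averages the resulting class probabilities over the document, which is my understanding of how the gensim example does it. The scores, the uniform prior and the function names are made up for illustration and are not taken from this repository.

```python
import math


def sentence_posterior(logliks, log_priors):
    """Bayes' rule in log space: p(class | sentence) is proportional to
    p(sentence | class) * p(class)."""
    joint = [ll + lp for ll, lp in zip(logliks, log_priors)]
    top = max(joint)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(j - top) for j in joint]
    total = sum(weights)
    return [w / total for w in weights]


def document_posterior(sentence_logliks, priors=None):
    """Average the per-sentence class probabilities over one document.

    sentence_logliks[s][c] is the log likelihood of sentence s under the
    model trained on class c (hypothetical input, one row per sentence).
    """
    n_classes = len(sentence_logliks[0])
    if priors is None:
        priors = [1.0 / n_classes] * n_classes  # assumption: uniform class prior
    log_priors = [math.log(p) for p in priors]
    per_sentence = [sentence_posterior(ll, log_priors) for ll in sentence_logliks]
    return [
        sum(p[c] for p in per_sentence) / len(per_sentence)
        for c in range(n_classes)
    ]


# Made-up scores for a three-sentence document and two classes.
scores = [[-12.0, -15.0], [-9.5, -9.0], [-20.0, -26.0]]
probs = document_posterior(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

With little labelled data, each class-specific model is trained on few documents, so the per-sentence likelihoods are noisy. That is the concern this experiment is meant to test.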

Requirements

The implementation of Taddy's method is based on the example provided by gensim. You will need a bleeding-edge copy of gensim. You will also need the standard Python 3 scientific stack, i.e. numpy and scikit-learn. Legacy Python (<3.0) is not supported.

Current features

  • Taddy, Naive Bayes and SVM classifiers, with grid search over parameter settings
  • Yelp reviews and 20 Newsgroups data sets
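
As a rough illustration of grid search for the two baseline classifiers, here is a small scikit-learn sketch. The toy corpus, the TF-IDF features and the parameter grids are assumptions chosen to keep the example short. They are not the settings used by the scripts in this repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus: 1 = positive review, 0 = negative review.
docs = [
    "great food and friendly staff",
    "lovely place, great service",
    "excellent pizza, will come back",
    "friendly staff and tasty food",
    "terrible service and cold food",
    "awful place, rude staff",
    "cold pizza and slow service",
    "rude waiter, terrible experience",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One (estimator, parameter grid) pair per baseline; the grids are illustrative.
candidates = {
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
    "svm": (LinearSVC(), {"clf__C": [0.1, 1.0, 10.0]}),
}

best = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    # cv=2 only because the toy corpus is tiny; use more folds on real data.
    search = GridSearchCV(pipe, grid, cv=2)
    search.fit(docs, labels)
    best[name] = (search.best_params_, search.best_score_)

for name, (params, score) in best.items():
    print(name, params, round(score, 3))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary from the held-out fold.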

Usage

  • get a pre-trained word2vec model from Google, or train one yourself
  • get Yelp reviews data from here (you will need to log into Kaggle)
  • run the following commands
python write_conf_files.py # generate one configuration file per experiment
python prepare_models.py # train word2vec models for each labelled corpus
python train_classifiers.py # train and evaluate classifiers
  • inspect results in ./results/exp*

Todo

  • More classifiers and labelled data sets
  • Grid search for word2vec parameters (currently using half of cwiki with default settings)
  • Clean up and publish the script I used to pre-train word2vec
  • Compare to Word Mover's Distance (another method that reports state-of-the-art results in document classification)

Disclaimer: This is a weekend hack and very much a work in progress. I haven't had a chance to run an extensive evaluation. Preliminary results suggest Naive Bayes < Taddy < SVM (with grid search) on the Yelp data.

word2vec-inversion's People

Contributors

mbatchkarov