wang2vec

Extension of the original word2vec (https://code.google.com/p/word2vec/) using different architectures

To build the code, simply run:

make

The command to build word embeddings is exactly the same as in the original version, except that we removed the argument -cbow and replaced it with the argument -type:

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

The -type argument is an integer that selects the architecture to use. The possible values are:
0 - cbow
1 - skipngram
2 - cwindow (see below)
3 - structured skipngram (see below)
4 - Collobert's SENNA context window model (still experimental)
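
As a rough illustration of how the -type values above map onto a command line, here is a minimal Python sketch that looks up an architecture by name and assembles the same arguments as the example command. The helper build_train_command and the dictionary keys are hypothetical names chosen for this example; only the flags and the numeric ids come from this README.

```python
import shlex

# Architecture ids as listed above for the -type argument.
# The dictionary keys are illustrative names, not identifiers from the toolkit.
ARCHITECTURES = {
    "cbow": 0,
    "skipngram": 1,
    "cwindow": 2,
    "structured_skipngram": 3,
    "senna": 4,  # still experimental
}


def build_train_command(train_file, output_file, architecture):
    """Return the word2vec argument list for the named architecture."""
    if architecture not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {architecture!r}")
    return [
        "./word2vec",
        "-train", train_file,
        "-output", output_file,
        "-type", str(ARCHITECTURES[architecture]),
        "-size", "50", "-window", "5",
        "-negative", "10", "-nce", "0", "-hs", "0",
        "-sample", "1e-4", "-threads", "1",
        "-binary", "1", "-iter", "5", "-cap", "0",
    ]


command = build_train_command("input_file", "embedding_file", "structured_skipngram")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once the binary has been built with make.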

If you use functionalities we added to the original code for research, please support us by citing our paper (thanks!):

@InProceedings{Ling:2015:naacl,
author = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
title="Two/Too Simple Adaptations of word2vec for Syntax Problems",
booktitle="Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
year="2015",
publisher="Association for Computational Linguistics",
location="Denver, Colorado",
}

The main changes we made to the code are:

****** Structured Skipngram and CWINDOW ******

We added two neural network architectures, cwindow and structured skipngram, which are aimed at solving syntax problems.

These are described in our paper:

-Two/Too Simple Adaptations of word2vec for Syntax Problems

****** Noise Contrastive Estimation objective ******

Noise contrastive estimation (NCE) is another approximation of the word softmax objective function, in addition to hierarchical softmax and negative sampling, which are implemented in the default word2vec toolkit. It is turned on by setting the -nce argument: for example, set -nce 10 to use 10 negative samples. Also remember to set -negative and -hs to 0.
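
Since the three objectives are selected through separate flags, it can help to generate them together so that only one is active at a time. The following is a small sketch under that assumption. The function objective_flags is a hypothetical helper, not part of the toolkit, and -hs 1 follows the convention of the original word2vec.

```python
def objective_flags(objective, samples=10):
    """Return the -negative/-nce/-hs arguments with exactly one objective enabled."""
    flags = {"-negative": 0, "-nce": 0, "-hs": 0}
    if objective == "nce":
        flags["-nce"] = samples        # NCE with `samples` negative samples
    elif objective == "negative":
        flags["-negative"] = samples   # plain negative sampling
    elif objective == "hs":
        flags["-hs"] = 1               # hierarchical softmax
    else:
        raise ValueError(f"unknown objective: {objective!r}")
    args = []
    for name, value in flags.items():
        args += [name, str(value)]
    return args


print(" ".join(objective_flags("nce", 10)))
```

These arguments replace the -negative, -nce and -hs values in the training command shown earlier.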

****** Parameter Capping ******

By default, parameters are updated freely and are not checked for numerical overflow, in order to maximize efficiency. However, on some datasets the CWINDOW architecture overflows, which leads to segfaults. If this happens, even with other architectures, try setting the parameter -cap 1 to avoid the problem at the cost of a small reduction in computational speed.
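
To make the idea of capping concrete, here is a minimal sketch of an update that clamps a parameter to a fixed range after each step. It only illustrates the general technique. The function name capped_update and the bound of 50.0 are assumptions made for this example and are not taken from the toolkit's source.

```python
def capped_update(params, index, delta, limit=50.0):
    """Add delta to params[index], then clamp the result to [-limit, limit].

    `limit` is an illustrative bound, not a value documented in this README.
    """
    params[index] += delta
    if params[index] > limit:
        params[index] = limit
    elif params[index] < -limit:
        params[index] = -limit


weights = [49.5, -49.5, 0.0]
capped_update(weights, 0, 2.0)    # would exceed the upper bound, so it is clamped
capped_update(weights, 1, -3.0)   # would exceed the lower bound, so it is clamped
capped_update(weights, 2, 0.25)   # stays inside the range, so it is unchanged
print(weights)
```

The extra comparison on every update is where the small speed cost mentioned above would come from.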

****** Class-based Negative Sampling ******

A new argument -negative-classes can be added to specify groups of classes. It receives a file in the format:

N dog
N cat
N worm
V doing
V finding
V dodging
A charming
A satirical

where each line defines a class and a word belonging to that class. For words belonging to a class, negative sampling is only performed over the words in that class. For instance, if the desired output is dog, we would only sample from cat and worm. For words not in the list, sampling is performed over all word types.
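
The sampling rule described above can be sketched in a few lines of Python. This is an illustration only: load_classes and sample_negatives are hypothetical names, and negatives are drawn uniformly here for simplicity, which is an assumption and may differ from the distribution the toolkit samples from.

```python
import random


def load_classes(lines):
    """Parse 'CLASS word' lines into word -> class and class -> words maps."""
    word_class, members = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, word = parts
        word_class[word] = label
        members.setdefault(label, []).append(word)
    return word_class, members


def sample_negatives(target, vocab, word_class, members, k, rng):
    """Draw k negatives: from the target's class if it has one, else from vocab."""
    if target in word_class:
        pool = [w for w in members[word_class[target]] if w != target]
    else:
        pool = [w for w in vocab if w != target]
    if not pool:
        return []  # nothing to sample from, e.g. a class with a single word
    return [rng.choice(pool) for _ in range(k)]


class_lines = ["N dog", "N cat", "N worm", "V doing", "V finding",
               "V dodging", "A charming", "A satirical"]
word_class, members = load_classes(class_lines)
vocab = [line.split()[1] for line in class_lines] + ["the", "of"]

rng = random.Random(0)
print(sample_negatives("dog", vocab, word_class, members, 5, rng))
print(sample_negatives("the", vocab, word_class, members, 5, rng))
```

For dog, every negative comes from cat and worm. For the, which is not in the class file, negatives come from the whole vocabulary.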

Warning: the file must be ordered so that all words in the same class are grouped together, so the following would not work correctly:

N dog
A charming
N cat
N worm
V doing
V finding
V dodging
A satirical
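
Because a misordered file is easy to produce by accident, a quick check before training can be useful. The sketch below reports every line where a class reappears after a different class has already started. The function find_ungrouped_classes is a hypothetical helper written for this example and is not part of the toolkit.

```python
def find_ungrouped_classes(lines):
    """Return (line_number, class_label) for each line that reopens an earlier class."""
    closed = set()    # classes whose block has already ended
    previous = None   # class label of the previous non-empty line
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue
        label = parts[0]
        if label != previous:
            if label in closed:
                problems.append((number, label))
            if previous is not None:
                closed.add(previous)
            previous = label
    return problems


grouped = ["N dog", "N cat", "N worm", "V doing", "V finding",
           "V dodging", "A charming", "A satirical"]
ungrouped = ["N dog", "A charming", "N cat", "N worm", "V doing",
             "V finding", "V dodging", "A satirical"]

print(find_ungrouped_classes(grouped))
print(find_ungrouped_classes(ungrouped))
```

An empty result means each class occupies a single contiguous block, as the -negative-classes file requires.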

****** Minor Changes ******

The distance_txt and kmeans_txt programs are adaptations of the original distance and kmeans code that take textual (-binary 0) embeddings as input.
