char2vec

This code implements the skip-gram algorithm to learn vector representations for the letters of the alphabet, rather than for words as in word2vec. It takes a body of text (stored in /data) and trains a shallow neural network to predict the neighboring characters c_(n-1) and c_(n+1) given c_n. In this implementation, c_n is one-hot encoded, mapped to a hidden layer, and then mapped to two output layers (one each for c_(n-1) and c_(n+1)), each trained with a categorical cross-entropy loss.
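The architecture described above could be sketched in Keras roughly as follows. This is a minimal sketch, not the code in main.py: the function name `build_char2vec`, the layer names, and the Adam optimizer are assumptions made for illustration, while the tanh hidden activation, the 2-D encoding, and the two softmax outputs with categorical cross-entropy follow the description in this README.

```python
from tensorflow import keras


def build_char2vec(n_chars, encoding_dim=2):
    """Skip-gram style model: one-hot c_n in, c_(n-1) and c_(n+1) out."""
    # One-hot encoded center character c_n.
    center = keras.Input(shape=(n_chars,), name="center")
    # Hidden layer; its activations serve as the character encodings.
    encoding = keras.layers.Dense(
        encoding_dim, activation="tanh", name="encoding"
    )(center)
    # One softmax output layer per neighboring character.
    prev_char = keras.layers.Dense(
        n_chars, activation="softmax", name="prev_char"
    )(encoding)
    next_char = keras.layers.Dense(
        n_chars, activation="softmax", name="next_char"
    )(encoding)

    model = keras.Model(inputs=center, outputs=[prev_char, next_char])
    model.compile(
        optimizer="adam",  # assumption: the README does not name an optimizer
        loss=["categorical_crossentropy", "categorical_crossentropy"],
    )
    return model
```

Both output layers share the same hidden layer, so the encoding has to carry whatever information helps predict either neighbor.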

As a result, characters that appear in similar contexts end up with similar encodings. For example, vowels often appear in similar contexts, so we would expect their encodings to be close together. Unlike the word2vec case, where it is easy to conceive of what king - man + woman = queen means, I find it harder to interpret m - z + t = w.
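To make the vector arithmetic concrete, here is a small sketch of how an expression such as m - z + t could be evaluated against a table of 2-D encodings. The coordinates below are made up for illustration (chosen so that the toy example lands on w) and are not learned values; the helper name `analogy` is likewise hypothetical.

```python
import math

# Hypothetical 2-D encodings, made up for illustration only.
embeddings = {
    "m": (0.6, 0.2),
    "z": (0.9, -0.5),
    "t": (0.1, -0.4),
    "w": (-0.2, 0.3),
    "a": (-0.8, -0.7),
}


def analogy(embeddings, a, b, c):
    """Return the character closest to a - b + c, excluding a, b and c."""
    target = tuple(
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = (ch for ch in embeddings if ch not in (a, b, c))
    # Nearest neighbor by Euclidean distance in the encoding space.
    return min(candidates, key=lambda ch: math.dist(embeddings[ch], target))


print(analogy(embeddings, "m", "z", "t"))
```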

[Figure: example embeddings]

Requirements

This code is written in Python and requires Keras.

Usage

$ python main.py

When the code is run, it converts the entire text file to training data (watch out for RAM usage) and then trains the model. Since the number of classes is small, the network should converge quickly. The encodings for the characters are then generated and plotted.
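The conversion step can be pictured with the sketch below, which turns a string into one-hot (c_n, c_(n-1), c_(n+1)) examples. The function names are hypothetical and this is not the code in main.py. It yields examples lazily, which is one possible way around the RAM concern: materializing everything at once stores roughly 3 x (text length) x (number of distinct characters) values.

```python
def build_vocab(text):
    """Map each distinct character to an integer index (sorted for stability)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot(index, size):
    """Return a list of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def iter_examples(text, vocab):
    """Yield (c_n, c_(n-1), c_(n+1)) as one-hot vectors, one position at a time."""
    size = len(vocab)
    # Skip the first and last characters, which lack one neighbor.
    for n in range(1, len(text) - 1):
        prev_c, center, next_c = (vocab[ch] for ch in text[n - 1:n + 2])
        yield one_hot(center, size), one_hot(prev_c, size), one_hot(next_c, size)
```

For example, `list(iter_examples("abca", build_vocab("abca")))` produces two examples, centered on "b" and "c".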

Additional Notes

The hidden layer/encoding is currently 2-D. This makes it easier to visualize without having to use techniques such as PCA or t-SNE.
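Because the input is one-hot, multiplying it by the hidden layer's weight matrix simply selects one row, so each character's 2-D encoding is that row plus the bias, passed through the activation. The sketch below illustrates this with made-up weights; it assumes a tanh activation and a bias term (the Keras `Dense` default), and the name `encode` is hypothetical.

```python
import math


def encode(char_index, weights, bias):
    """Encoding of one character: tanh of its weight-matrix row plus the bias."""
    return [math.tanh(w + b) for w, b in zip(weights[char_index], bias)]


# Hypothetical weights for a 3-character alphabet and a 2-D hidden layer.
weights = [[0.0, 1.0], [0.5, -0.5], [-2.0, 0.3]]
bias = [0.0, 0.0]

# One (x, y) point per character, ready to be scatter-plotted directly.
points = [encode(i, weights, bias) for i in range(len(weights))]
```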

The code currently uses a window of width 3 (c_(n-1) through c_(n+1)). Several commented-out lines allow this width to be increased.
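As a rough illustration of what a wider window means, the sketch below lists the context offsets for an odd window width and extracts the matching context characters. The helper names are hypothetical and do not correspond to the commented-out lines in main.py.

```python
def context_offsets(width):
    """Offsets of the context positions for an odd window width, center excluded."""
    if width < 3 or width % 2 == 0:
        raise ValueError("width must be an odd integer >= 3")
    half = width // 2
    return [k for k in range(-half, half + 1) if k != 0]


def iter_windows(text, width=3):
    """Yield (center, [context characters]) for every full window in `text`."""
    offsets = context_offsets(width)
    half = width // 2
    for n in range(half, len(text) - half):
        yield text[n], [text[n + k] for k in offsets]
```

With width 3 the offsets are -1 and +1; with width 5 they are -2, -1, +1 and +2. Following the one-output-layer-per-neighbor design described earlier, a wider window would presumably need one output layer per offset.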

I have found that different text sources can produce slightly different embeddings. For the same body of text, however, the embeddings learned across trials are very similar, up to rotation and flipping.
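One way to check the "up to rotation and flipping" observation is to align the 2-D embeddings from two trials before comparing them. The sketch below is an assumption about how this could be done, not part of the repository: it finds the least-squares rotation about the origin in closed form, tries it with and without a mirror flip, and keeps whichever leaves the smaller squared error. Both inputs are lists of (x, y) points in the same character order.

```python
import math


def _best_rotation(a, b):
    """Angle of the rotation about the origin that best maps `a` onto `b`, and its squared error."""
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    # Maximizing sum(b . R(theta) a) gives theta = atan2(cross, dot).
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    error = sum(
        (ax * cos_t - ay * sin_t - bx) ** 2 + (ax * sin_t + ay * cos_t - by) ** 2
        for (ax, ay), (bx, by) in zip(a, b)
    )
    return theta, error


def align(a, b):
    """Return (theta, flipped, error) for the best rotation of `a` onto `b`, optionally after a flip."""
    theta, error = _best_rotation(a, b)
    mirrored = [(ax, -ay) for ax, ay in a]  # flip across the x-axis
    theta_m, error_m = _best_rotation(mirrored, b)
    if error_m < error:
        return theta_m, True, error_m
    return theta, False, error
```

A small remaining error after alignment would support the claim that two trials learned essentially the same embedding.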

Something fun to try: instead of using a tanh activation in the hidden layer, use softmax with an encoding dimension much smaller than the number of characters. This should give approximate classifications of the letters of the alphabet. The same could be achieved by clustering the tanh encodings, but this alternative approach seems more fun.
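To show how a softmax encoding could be read as a classification, the sketch below applies softmax to some made-up hidden-layer pre-activations and assigns each character to its most probable unit. The numbers and the name `assign_class` are hypothetical; in the model itself this would presumably amount to changing the hidden layer's activation to softmax.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def assign_class(values):
    """Index of the most probable unit, used as the approximate class."""
    probs = softmax(values)
    return probs.index(max(probs))


# Hypothetical pre-activations for a 3-unit softmax encoding.
pre_activations = {
    "a": [2.0, 0.1, -1.0],
    "e": [1.8, 0.3, -0.7],
    "t": [-0.5, 1.9, 0.2],
    "z": [-1.2, 0.0, 2.1],
}

groups = {ch: assign_class(z) for ch, z in pre_activations.items()}
```

In this toy data "a" and "e" fall into the same class, which is the kind of grouping (for example, vowels together) one might hope to see.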

Contributors

tannerbohn
