
Brown Clustering

Brown clustering in Python

The code will be released after Nov 25 for pedagogical reasons.

Design

This project follows both the original paper and the fixed window size optimization from another thesis.

While reading the paper from MIT, I noticed that it contains some mistakes; specifically, equation 4.5 is wrong. The original paper, of course, does not use the fixed window size optimization. So I derived the correct equations for merging two clusters and for appending a cluster, reusing the terms from the original paper as much as possible.

Complexity

Assume the corpus contains n words and v unique words, and let m be the fixed window size (see "What is n" below for the precise definition of n). This implementation has O(v*m^2 + n) complexity.

Data Processing

This implementation only requires you to provide a corpus, as a list of sentences. Laplace smoothing and 2-grams are applied as part of the algorithm. If you don't need any smoothing, you can pass 0 as the smoothing parameter alpha.
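
As a rough sketch of that preprocessing (the `<s>`/`</s>` boundary symbols and the exact add-alpha scheme here are assumptions for illustration, not necessarily this project's internals):

```python
from collections import Counter
from nltk.util import ngrams

def bigram_counts(corpus, alpha=1.0):
    """Count 2-grams over padded sentences, with add-alpha (Laplace) smoothing.

    `corpus` is a list of sentences, each a list of word tokens.
    """
    counts = Counter()
    vocab = set()
    for sentence in corpus:
        vocab.update(sentence)
        counts.update(ngrams(["<s>"] + sentence + ["</s>"], 2))
    # Add-alpha smoothing: every possible 2-gram over the symbol set
    # receives `alpha` pseudo-counts; alpha=0 disables smoothing.
    symbols = vocab | {"<s>", "</s>"}
    for a in symbols:
        for b in symbols:
            counts[(a, b)] += alpha
    return counts

counts = bigram_counts([["the", "cat", "sat"], ["the", "dog", "sat"]])
```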

Evaluation

This project contains multiple implementations, one of which has no optimizations at all. Unit tests ensure that the different implementations produce exactly the same results.

Dependencies

  • numpy is used for ndarray
  • nltk is used for ngram

I did not hesitate to use them, considering that both are extremely popular. If you don't like them, replacing ndarray and ngram with some simple Python code would be a very easy task.

The Same Page

What is n

n is not the number of words in the corpus; it is the number of 2-grams, so n increases after Laplace smoothing.
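
Continuing the preprocessing sketch above:

```python
# n is the total 2-gram mass, including smoothing pseudo-counts,
# not the number of word tokens in the corpus.
n = sum(counts.values())
```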

pl(i), pr(j) and p(i), p(j) and p(i, j)

In the original paper, pl(i) = SUM(p(i, j) for each j). If you have start and end symbols in the clusters, it is not difficult to see that pl(i) = pr(i) = c(i)/n. So instead of summing, we can keep the 1-gram information directly and use it to calculate p(i) = pl(i) = pr(i).
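
As an illustration, continuing the sketch above, the marginals can be read off the 2-gram counts, and for ordinary words both reduce to the 1-gram count:

```python
from collections import Counter

def marginals(bigram_counts):
    """Left/right marginal counts: cl(i) = sum_j c(i, j), cr(j) = sum_i c(i, j)."""
    left, right = Counter(), Counter()
    for (a, b), c in bigram_counts.items():
        left[a] += c
        right[b] += c
    return left, right

left, right = marginals(counts)
n = sum(counts.values())
# For ordinary words (boundary symbols aside), left[w] == right[w] == c(w),
# so pl(w) == pr(w) == c(w)/n and the sums never need to be recomputed.
```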

Laplace Smoothing

To keep the above true even after smoothing, we should add double the lambda to the 1-gram counts. Wrong smoothing will end in wrong clustering results.
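
A minimal sanity check, building on the marginals above and assuming the per-2-gram add-alpha scheme from the sketch:

```python
# Smoothing adds the same pseudo-counts to a word's left and right
# marginals, so pl(w) == pr(w) must survive smoothing for every word.
for w in ["the", "cat", "dog", "sat"]:
    assert left[w] == right[w]
```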

Likelihood, Quality and Average Mutual Information

Likelihood(C) = Quality(C) = AverageMutualInformation(C) + some constant; both papers contain the idea of calculating AMI instead of the likelihood or quality. In this project, the benefit is the delta in AMI you gain when you merge two clusters.
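
The AMI itself is the standard sum over cluster pairs; a direct transcription (the dictionary layout is an assumption of this sketch):

```python
import math

def average_mutual_information(p_joint, p_left, p_right):
    """AMI(C) = sum over pairs (i, j) of p(i, j) * log(p(i, j) / (pl(i) * pr(j)))."""
    return sum(
        p * math.log(p / (p_left[i] * p_right[j]))
        for (i, j), p in p_joint.items()
        if p > 0.0
    )
```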

