Giter Club home page Giter Club logo

generalised-brown's Introduction

generalised-brown

version 1.0 © 2015 Sean Chester and Leon Derczynski


Table of Contents


Introduction

The generalised-brown software suite clusters word types by distributional similarity in two phases. It first generates a list of merges based on the well-known Brown clustering algorithm and then recalls historical states to vary the granularity of the clusters. For example, given the following corpus:

Alice likes dogs and Bob likes cats while Alice hates snakes and Bob hates spiders

Greedily clustering word types based on average mutual information (i.e., running the C++ merge generator) produces the following merge list (assuming a = |V| = 10):

snakes spiders 8

dogs cats 7

Alice Bob 6

and while 5

likes hates 4

dogs snakes 3

dogs and 2

dogs Alice 1

dogs likes 0

One can then recall any historical state of the computation in order to produce a set of clusters (i.e., run the python cluster generator). For example, with c = 5, we recall the state c - 1 = 4 to produce the following clusters:

{snakes, spiders} {dogs, cats} {Alice, Bob} {likes, hates} {and, while}

This approach (setting separate values of a and c) we refer to as Roll-up feature generation. By contrast, traditional Brown clustering would produce the following five clusters (equivalent to running the C++ merge generator with a = 5 and the python cluster generator with c = 5):

{likes, hates} {snakes, spiders, cats, dogs} {and, while} {Alice} {Bob}

For details about the concepts implemented in this software, please read our recent AAAI paper:

L. Derczynski and S. Chester. 2016. "Generalised Brown Clustering and Roll-up Feature Generation." In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). 7 pages. To appear.

For details about traditional Brown clustering, consult the article in which it was introduced:

PF Brown et al. 1992. "Class-based n-gram models of natural language." Computational Linguistics 18(4): 467--479.

or the implementation that our C++ merge generator forked:

wcluster.


Requirements

generalised-brown relies on the following applications:

  • For compiling the C++ merge generator: A C++ compiler that is compatible with C++ 11 and OpenMP (e.g., the newest GNU compiler) and the make program

  • For running the python cluster generator: A python interpreter


Installation

The python cluster generator does not need to be compiled. To compile the C++ merge generator, navigate to the merge_generator/ subdirectory of the project and type:

make


Usage

To produce a set of features for a corpus, you will first want to use Generalised Brown (i.e., the C++ merge generator) to create a merge list. Then, you can create c clusters by running the python cluster generator on the merge list. This second step can be done for as many values of c as you like, but we recommend that each value of c is not larger than the value of a used to generate the merge list.

To run the C++ merge generator, type:

./merge_generator/wcluster --text [input_file] --a [active_set_size]

The resultant merges will be recorded in:

./[input_file]-c[active_set_size]-p1.out/merges

To run the python cluster generator, type:

python ./cluster_generator/cluster.py -in ./[input_file]-c[active_set_size]-p1.out/merges -c 3

Each word type will be printed to stdout with its cluster id.

The C++ merge generator runs in O(|V| a^2) time, where |V| is the number of distinct word types in the corpus (i.e., the size of the vocabulary) and a is a bound on the algorithm's search space. The python cluster generator runs in O(|V|) time.


License

This software consists of two sub-modules, each released under a different license:

  • The python cluster generator is subject to the terms of The MIT License

  • The C++ merge generator follows the original licensing terms
    of wcluster.

See the relevant sub-directories of this repository for the specific details of each license.


Datasets

You can find aclusters induced from the RCV1 data, including a full merge file, with a=2560, here: http://derczynski.com/sheffield/resources/rcv.a2560.tar.bz2


Contact

This software suite will undergo a major revision; so, you are encouraged to ensure that this is still the latest version. Please do not hesitate to contact the authors if you have comments, questions, or bugs to report.

generalised-brown on GitHub


generalised-brown's People

Contributors

leondz avatar sean-chester avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.