brown-cluster's Introduction

Implementation of the Brown hierarchical word clustering algorithm.
Percy Liang
Release 1.3
2012.07.24

Input: a sequence of words separated by whitespace (see input.txt for an example).
Output: for each word type, its cluster (see output.txt for an example).
        Each line has the form:
          <cluster represented as a bit string> <word> <number of times word occurs in input>
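A minimal sketch of reading lines in the format above and grouping words by cluster. The names `PathEntry`, `parse_paths_line`, and `group_by_cluster` are hypothetical helpers, and the sample lines are made up for illustration. They are not taken from output.txt.

```python
from collections import defaultdict
from typing import NamedTuple


class PathEntry(NamedTuple):
    bits: str    # cluster represented as a bit string
    word: str    # the word type
    count: int   # number of times the word occurs in the input


def parse_paths_line(line: str) -> PathEntry:
    """Split one output line into its three whitespace-separated fields."""
    bits, word, count = line.split()
    return PathEntry(bits, word, int(count))


def group_by_cluster(lines):
    """Map each cluster bit string to the list of words assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            entry = parse_paths_line(line)
            clusters[entry.bits].append(entry.word)
    return dict(clusters)


# Made-up sample lines, for illustration only.
sample = ["0010 the 120", "0010 a 87", "0111 cat 5"]
clusters = group_by_cluster(sample)
```

Because words in the input are separated by whitespace, a plain `split()` is enough here and works whether the fields are separated by tabs or spaces.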

Runs in $O(N C^2)$, where $N$ is the number of word types and $C$
is the number of clusters.

References:

  Brown, et al.: Class-Based n-gram Models of Natural Language
    http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf

  Liang: Semi-supervised learning for natural language processing
    http://cs.stanford.edu/~pliang/papers/meng-thesis.pdf

Compile:

  make

Run:

  # Clusters input.txt into 50 clusters:
  ./wcluster --text input.txt --c 50
  # Output in input-c50-p1.out/paths
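The same invocation can be scripted. The sketch below builds the command shown above and guesses where the paths file will be written. The output location is an assumption that generalizes the single `input-c50-p1.out/paths` example, and `wcluster_command`, `assumed_paths_file`, and `run_wcluster` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def wcluster_command(text_file, num_clusters, binary="./wcluster"):
    """Build the argument list for the command shown in the README."""
    return [binary, "--text", text_file, "--c", str(num_clusters)]


def assumed_paths_file(text_file, num_clusters):
    """Guess the output location.

    ASSUMPTION: the directory name follows the pattern of the README
    example (input.txt, 50 clusters -> input-c50-p1.out/paths).
    """
    stem = Path(text_file).stem
    return Path(f"{stem}-c{num_clusters}-p1.out") / "paths"


def run_wcluster(text_file, num_clusters):
    """Run the clustering and return the assumed paths-file location."""
    subprocess.run(wcluster_command(text_file, num_clusters), check=True)
    return assumed_paths_file(text_file, num_clusters)
```

Calling `run_wcluster("input.txt", 50)` requires a compiled `wcluster` binary in the current directory. The two helper functions can be used without it.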

============================================================
Change Log

1.3: Compatibility updates for newer versions of g++ (courtesy of Chris Dyer).
1.2: Made compatible with MacOS (replaced timespec with timeval and changed order of linking).
1.1: Removed deprecated operators so it works with GCC 4.3.

============================================================
(C) Copyright 2007-2012, Percy Liang

http://cs.stanford.edu/~pliang

Permission is granted for anyone to copy, use, or modify these programs and
accompanying documents for purposes of research or education, provided this
copyright notice is retained, and note is made of any changes that have been
made.

These programs and documents are distributed without any warranty, express or
implied.  As the programs were written for research purposes only, they have
not been tested to the degree that would be advisable in any important
application.  All use of these programs is entirely at the user's own risk.

brown-cluster's People

Contributors

ajaech, amamidzu, andrewyates, aryamccarthy, egrefen, karlstratos, percyliang

brown-cluster's Issues

what are these results?

I'm not sure whether this is an issue or just a matter of understanding. I ran the clustering on Persian text, and after a couple of hours I got these results in the map output:
بینبریج 00111111-L 5.54361 00111111-R 2.82232 00111111-freq 1
گروهان 00111111-L 5.20714 00111111-R 2.7586 00111111-freq 1
می‌دهده 00111111-L 4.15732 00111111-R 6.05444 00111111-freq 1
...
I'm not sure what each column means, or which one exactly is the cluster group.
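Judging only from the sample lines above, each line seems to hold the word followed by key/value pairs, with every key prefixed by the same bit string (`00111111` here), which looks like the cluster. A minimal sketch of splitting such a line under that assumption follows. `parse_map_line` is a hypothetical helper, and what the `L` and `R` values measure is not stated in this thread.

```python
def parse_map_line(line):
    """Split one map-file line into the word, the shared bit-string prefix,
    and the labelled values that follow it.

    ASSUMPTION: the layout is inferred from the sample lines in this issue:
    <word> <bits>-L <value> <bits>-R <value> <bits>-freq <count>
    """
    fields = line.split()
    word, rest = fields[0], fields[1:]
    record = {"word": word}
    # The remaining fields alternate between "<bits>-<label>" and a value.
    for key, value in zip(rest[0::2], rest[1::2]):
        bits, _, label = key.rpartition("-")
        record["cluster"] = bits
        record[label] = int(value) if label == "freq" else float(value)
    return record


# Second sample line quoted in this issue.
sample = "گروهان 00111111-L 5.20714 00111111-R 2.7586 00111111-freq 1"
record = parse_map_line(sample)
```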

Clustering perplexity measure

Does the package return (or write in the log file) the perplexity or any other goodness of fit measure? If yes, would it be a good idea to run a BayesOpt optimizer to find the best clustering this way? Or is it ill-posed?

Thanks

Problem compiling on Windows 7

I'm trying to compile on Windows 7 using g++ 4.7.2 and GNU Make 3.8.1. When I do, I get the following errors:

C:\Users\ameasure\brown-cluster-master>make
g++ -Wall -g -o wcluster.o -c wcluster.cc
wcluster.cc: In function 'void repcheck()':
wcluster.cc:431:3: error: '__STRING' was not declared in this scope
wcluster.cc:432:3: error: '__STRING' was not declared in this scope
wcluster.cc: In function 'int main(int, char**)':
wcluster.cc:1072:3: error: '__STRING' was not declared in this scope
make: *** [wcluster.o] Error 1

Any idea what's going on?

Speed up with compiler optimization

In case anyone is clustering large datasets:

In my experiments (40M corpus and NofClusters=1000), turning on compiler optimization with "-O3" yields a speed-up of around 3x.

I changed the following lines in my Makefile:

wcluster: $(files)
    g++ -Wall -g -O3 -o wcluster $(files)

%.o: %.cc
    g++ -Wall -g -O3 -o $@ -c $<

Is there any limit for the vocab size (#types)?

The code fails with a segmentation fault (core dumped) when I run it on a huge text file (about 20M types, 14GB file size). I have already used wcluster on other files with far fewer types, and it worked well.

Is there any limit for the vocabulary size (#types)?

Running The code

Hi,

Could you please explain how I can pass multiple text files and generate output files for them?
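One possible workaround, given that the input is described as a single sequence of whitespace-separated words, is to merge the files into one before running wcluster. This is only a sketch: `merge_text_files` is a hypothetical helper, the file names are made up, and merging produces one clustering for all files rather than one output per file.

```python
import os
import tempfile


def merge_text_files(input_paths, merged_path):
    """Concatenate several text files into one, line by line.

    A newline is added after any file that does not end with one, so the
    last word of one file is not glued to the first word of the next.
    """
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            last = "\n"
            with open(path, encoding="utf-8") as src:
                for line in src:
                    out.write(line)
                    last = line
            if not last.endswith("\n"):
                out.write("\n")
    return merged_path


# Small demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("the cat sat")        # no trailing newline
    with open(second, "w", encoding="utf-8") as f:
        f.write("on the mat\n")
    merged = merge_text_files([first, second], os.path.join(tmp, "merged.txt"))
    with open(merged, encoding="utf-8") as f:
        merged_text = f.read()
```

The merged file can then be passed with `--text` like any other input. Note that words at file boundaries become neighbours in the merged sequence.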

Is it possible to cluster new documents without relearning everything?

I'm looking for some way to run the clustering algorithm while using previously learned collocs, map, and paths. I tried pointing to the paths file with the --paths flag, but this just overwrote it with a newly learned one.

I don't have time to relearn everything from scratch: it takes days!
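If the goal is only to label the words of new documents with clusters that were already learned, rather than to update the clusters themselves, one option is to look the words up in the existing paths file. The sketch below assumes the three-column paths format from the README. `load_word_to_cluster` and `tag_document` are hypothetical helpers, and the sample lines are made up. Words that were not in the original input get no cluster.

```python
def load_word_to_cluster(paths_lines):
    """Build a word -> bit-string lookup from paths-file lines
    (<bit string> <word> <count>)."""
    mapping = {}
    for line in paths_lines:
        parts = line.split()
        if len(parts) == 3:  # ignore blank or malformed lines
            bits, word, _count = parts
            mapping[word] = bits
    return mapping


def tag_document(text, word_to_cluster, unknown=None):
    """Pair every whitespace-separated word with its learned cluster,
    or with `unknown` if the word was never seen during clustering."""
    return [(word, word_to_cluster.get(word, unknown)) for word in text.split()]


# Made-up paths lines, for illustration only.
paths_lines = ["0010 the 120", "0111 cat 5", "0110 dog 4"]
lookup = load_word_to_cluster(paths_lines)
tagged = tag_document("the dog saw the cat", lookup)
```

With a real paths file, the lines can be read with `open(path, encoding="utf-8")` and passed to `load_word_to_cluster` directly.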

basic/prob-utils.cc:8:37: error: ‘M_PI’ was not declared in this scope

I am using Cygwin on Windows and trying to run this code. At the first step, running the "make" command, I get the following error:

basic/prob-utils.cc: In function ‘double rand_gaussian(double, double)’:
basic/prob-utils.cc:8:37: error: ‘M_PI’ was not declared in this scope
  double z = sqrt(-2*log(x1))*cos(2*M_PI*x2);
                                    ^~~~
make: *** [Makefile:13: basic/prob-utils.o] Error 1

Can you advise on how to fix this?

A library for brown clustering?

I was wondering if it's possible to make a library out of this code in order to be able to include it into other projects?

Question

Hello, I would like to use your algorithm to categorize job titles. Do you still update and maintain the library?

Best regards,
Evangelia

how Paths2map is used

Hello! I was browsing the code and saw this option:
opt_define_bool(paths2map, "paths2map", false, "Take the paths file and generate a map file.");
Can it be used? What is the output? Is it something like the tree presented in the Brown algorithm paper?
Thank you very much.
