Scalable inference for Correlated Topic Models (ScaCTM)

Introduction

The Correlated Topic Model (CTM) is a topic model for text and other discrete data that captures correlations between the occurrences of different topics in a document (Figure 1).

ScaCTM is software that learns Correlated Topic Models with thousands of topics from millions of documents, using tens of machines.

Figure 1. Correlated Topic Model. (Blue monospace: output file names, Black monospace: input parameter names)

The generative model is described as follows.
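
The figure itself is not reproduced here. For reference, the standard CTM generative process (Blei & Lafferty, 2007), written with the same symbols as the output files described below (phi, mu, the covariance Sigma, and the per-document eta), is:

    \eta_d \sim \mathcal{N}(\mu, \Sigma)
    z_{d,n} \mid \eta_d \sim \mathrm{Mult}(\mathrm{softmax}(\eta_d)), \qquad \mathrm{softmax}(\eta)_k = \exp(\eta_k) / \textstyle\sum_j \exp(\eta_j)
    w_{d,n} \mid z_{d,n}, \phi \sim \mathrm{Mult}(\phi_{\cdot,\, z_{d,n}})

That is, each document draws correlated (logistic-normal) topic proportions eta_d, each word position draws a topic z from those proportions, and the word itself is drawn from that topic's distribution over the vocabulary (column z of phi).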

Below is the speed of ScaCTM on several corpora. For more details, please refer to our paper.

    Corpus     Number of documents    Number of topics    Time
    NIPS       1.2K                   100                 8.9 min
    20NG       11K                    200                 9 min
    NYTimes    285K                   400                 0.5 hrs
    Wiki       6M                     1000                17 hrs

Install

Assuming you already have BLAS and LAPACK installed, building ScaCTM is simple:

    git clone --recursive http://github.com/cjf00000/ScaCTM.git
    cd ScaCTM
    cd third_party
    make                    # Install third-party dependencies; takes about 6 minutes on our 12-core server
    cd ..
    cd src
    make -j                 # Build ScaCTM in parallel
    cd ..

See the BLAS section below if you don't have BLAS and LAPACK.

Chinese users: if you have trouble installing, see this.

For offline installation, download everything from https://github.com/cjf00000/ScaCTM/releases, put the files in third_party/third_party/src, and then follow the normal installation sequence above.

Usage

First you need to set some environment variables:

    source scripts/setenv_bash      # for bash
    source scripts/setenv_csh       # for csh

Then copy ScaCTM to every node of the cluster. We assume you have two nodes, juncluster1 and juncluster2, and that ScaCTM is stored at juncluster1:ScaCTM:

    scp -r ScaCTM juncluster2:

Note: you need to execute source scripts/setenv_<your shell> in every shell session.

Input data

Download the example data from Yahoo-LDA:

    mkdir work && cd work   # We'll do everything in work/
    wget https://raw.github.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt

Now take a look at the data. Every document is a single line in the file, and every line has the form

    docid site word_1 word_2 ...

You need to make sure that docid is unique. site is currently not used by ScaCTM; it exists only for compatibility with Yahoo-LDA. These two fields are followed by all the words in the document. You may wish to do some pre-processing, e.g. removing stop words, as in the sketch below.
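
As a minimal illustration of this pre-processing (the stop-word list, tokenization, and file names here are assumptions for the example, not part of ScaCTM):

    # prepare_input.py -- hypothetical helper that writes Yahoo-LDA style lines
    # "docid site word_1 word_2 ...", enforcing unique docids and dropping stop words.
    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy list; use a real one

    def write_corpus(docs, path, site="unused"):
        """docs: iterable of (docid, text) pairs; docids must be unique."""
        seen = set()
        with open(path, "w") as f:
            for docid, text in docs:
                if docid in seen:
                    raise ValueError("duplicate docid: %s" % docid)
                seen.add(docid)
                words = [w for w in text.lower().split() if w not in STOPWORDS]
                f.write("%s %s %s\n" % (docid, site, " ".join(words)))

    write_corpus([("doc1", "the quick brown fox jumps"),
                  ("doc2", "correlated topic models of text")],
                 "my_corpus.txt")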

Train

First, create a file machinefile listing the hostnames of your cluster's nodes:

    juncluster1
    juncluster2


Look at the help for ctm-train:

    ctm-train --help

Now train a model with the default parameters:

    ctm-train -f machinefile ydir_1k.txt

If you only have one machine, use

    ctm-train ydir_1k.txt

Training now starts, and you will see its progress on your console:

    Training on juncluster1, juncluster2 (2 nodes).
    Distributing executables...
    Done (7.418916 seconds).
    Starting server...
    Formatting corpus ydir_1k.txt...
    Done (1.675312 seconds).
    Training
    100.000000%
    Done (68.470395 seconds).
    Gathering results...
    Done (13.113103 seconds). Model stored in /home/jianfei/tmp/ScaCTM/work/ydir_1k.txt.model.

When training completes, the results are stored in ydir_1k.txt.model/info:

  • model.topToWor.txt contains the learned topics
  • model.dict.json is the dictionary / vocabulary list
  • model.phi.json is a V * K matrix phi, where V is the vocabulary size and K is the number of topics. phi[v, k] = p(w=v | z=k, phi) is the probability of word v under topic k. Word v is the v-th line in model.dict.json (see the loading sketch after this list).
  • model.mu.json and model.cov.json are the prior mean and covariance; model.corr.json is the correlation matrix.
  • model.top.json contains the topic distribution eta for each document, keyed by the docid provided in the input data.
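
For example, to print the top words of each topic (a minimal sketch, assuming model.dict.json deserializes to a plain list of words and model.phi.json to a V * K nested list, consistent with the description above but not verified here):

    import json

    with open("ydir_1k.txt.model/info/model.dict.json") as f:
        vocab = json.load(f)                 # word v is the v-th entry
    with open("ydir_1k.txt.model/info/model.phi.json") as f:
        phi = json.load(f)                   # phi[v][k] = p(w=v | z=k)

    K = len(phi[0])
    for k in range(K):
        top = sorted(range(len(vocab)), key=lambda v: -phi[v][k])[:10]
        print("topic %d: %s" % (k, " ".join(vocab[v] for v in top)))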

Train and test

We need a training set and a testing set:

    make_test_data.sh ydir_1k.txt ydir_1k.txt.train ydir_1k.txt.test 0.8

You'll get a training set and a testing set with documents split 8:2.

Train

    ctm-train -f machinefile ydir_1k.txt.train

Test

    ctm-test -m ydir_1k.txt.train.model/ ydir_1k.txt.test 

Test result is stored in ydir_1k.txt.test.result/info.

Perplexity evaluation

Sometimes you want to measure the perplexity on a test set. You need to further divide every document in the test set into an observed part and a held-out part. We'll infer eta on the observed part and measure perplexity on the held-out part. See the paper for details.
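
Concretely, with eta_d inferred from the observed part of each test document, held-out perplexity is the usual quantity (this is the standard definition; see the paper for the exact estimator ScaCTM uses):

    \mathrm{perplexity} = \exp\left( - \frac{\sum_d \log p(\mathbf{w}^{\mathrm{heldout}}_d \mid \eta_d, \phi)}{\sum_d N^{\mathrm{heldout}}_d} \right)

where N_d^heldout is the number of held-out words in document d; lower is better. To create such a split, use make_test_data.sh: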

    make_test_data.sh ydir_1k.txt ydir_1k.txt.train ydir_1k.txt.test_observed 0.8 ydir_1k.txt.test_heldout

You'll get a training set and a testing set split 8:2, and the test set is further divided into an observed part and a held-out part.
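
A rough Python sketch of the files this produces (the random 8:2 document split and the half/half word split per test document are assumptions for illustration; the exact policy of make_test_data.sh may differ):

    import random
    random.seed(0)

    lines = open("ydir_1k.txt").read().splitlines()
    random.shuffle(lines)
    cut = int(0.8 * len(lines))
    train, test = lines[:cut], lines[cut:]

    with open("ydir_1k.txt.train", "w") as f:
        f.write("\n".join(train) + "\n")
    # Each test document is split into an observed and a held-out half.
    with open("ydir_1k.txt.test_observed", "w") as fo, \
         open("ydir_1k.txt.test_heldout", "w") as fh:
        for line in test:
            docid, site, *words = line.split()
            half = len(words) // 2
            fo.write(" ".join([docid, site] + words[:half]) + "\n")
            fh.write(" ".join([docid, site] + words[half:]) + "\n")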

Train

    ctm-train -f machinefile ydir_1k.txt.train

Test

    ctm-test -m ydir_1k.txt.train.model/ --heldout ydir_1k.txt.test_heldout ydir_1k.txt.test_observed

Perplexity is shown at the bottom of the output:

    Testing on juncluster1, juncluster2 (2 nodes).
    Distributing executables...
    Done (3.956380 seconds).
    Broadcasting model...
    Done (2.808829 seconds).
    Starting server...
    Formatting corpus ydir_1k.txt.test_observed...
    Done (0.642709 seconds).
    Testing
    100.000000%
    Done (4.367594 seconds).
    Gathering results...
    Done (4.367329 seconds). Output stored in ydir_1k.txt.test_observed.result.
    Computing held-out perplexity
    Perplexity = 1112.114460

BLAS

If you don't have BLAS and LAPACK, you can simply install them with your package manager. For example,

    sudo apt-get install libblas-dev
    sudo apt-get install liblapack-dev

will install the Netlib reference implementation on Ubuntu systems.

If you are working with many short documents and many topics (>= 500), you may wish to use a faster BLAS. There are many options; for example, to install ATLAS,

    sudo apt-get install libatlas-base-dev

and rebuild the code:

    cd src
    make clean && make -j

Make sure ATLAS is actually used by ScaCTM:

    # See which BLAS is ScaCTM using
    $ ldd /bin/learntopics
    <skipped>
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007f7c770e0000)
    <skipped>

    # Find the path of /usr/lib/libblas.so.3gf
    $ ls -l /usr/lib/libblas.so.3gf
    /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf 

    $ ls -l /etc/alternatives/libblas.so.3gf
    /etc/alternatives/libblas.so.3gf -> /usr/lib/atlas-base/atlas/libblas.so.3gf

If the old BLAS is still being used, you can manually link against ATLAS by modifying LDFLAGS in src/Makefile. You can also use another BLAS such as MKL or ACML by modifying INCFLAGS and LDFLAGS in src/Makefile.

NOTE

Be cautious with OpenBLAS: we find that ScaCTM runs extremely slowly with OpenBLAS.

Bugs / New Features

Please create an issue if you find any bugs or want new features. We look forward to your suggestions!

Dependencies

ScaCTM is built upon Yahoo-LDA, a distributed framework for topic modeling.

License

Apache License 2.0

Reference

Please cite our paper if you find ScaCTM useful!

Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, and Bo Zhang. Scalable Inference for Logistic-Normal Topic Models. In Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, USA, 2013.

A demonstration can be found at http://ml.cs.tsinghua.edu.cn/~scalable-ctm/.

Issues

Error during make

What could be the cause of this? When running make -j:

    /home/zhuyi/code/ScaCTM/third_party/third_party/include/gflags/gflags_declare.h:55:9: error: ‘int64_t’ does not name a type
    typedef int64_t int64;
    /home/zhuyi/code/ScaCTM/third_party/third_party/include/gflags/gflags_declare.h:56:9: error: reference to ‘uint64_t’ is ambiguous
    typedef uint64_t uint64;
    /home/zhuyi/code/ScaCTM/third_party/third_party/include/gflags/gflags_declare.h:56:9: error: ‘uint64_t’ does not name a type
    typedef uint64_t uint64;
    /home/zhuyi/code/ScaCTM/third_party/third_party/include/gflags/gflags.h:128:41: error: ‘int64’ does not name a type
    extern bool RegisterFlagValidator(const int64* flag,
    /home/zhuyi/code/ScaCTM/third_party/third_party/include/gflags/gflags.h:130:41: error: ‘uint64’ does not name a type
    extern bool RegisterFlagValidator(const uint64* flag,

Running ScaCTM on one machine

Hi,

I am able to compile the code, but I don't have a distributed cluster and only want to run ScaCTM on one machine. Following the manual, I ran:

    ctm-train xxx.txt

A bunch of "No such file or directory" errors occurred. Looking at the ctm-train script, I guess the program tried to write files to a remote directory, but I do not have such an environment. Would you be able to provide some suggestions?

Thanks very much
