Learning M-Way Tree - Web Scale Clustering - EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures
TF-IDF
BM25 - probably quite useful with reflexive random indexing because it preserves the inner product space where BM25 works well
Log Likelihood from TopSig paper
Concurrent accumulation can be done simply by switching the ACCUMULATOR vector type to an atomic version. Note that whole-vector operations do NOT need to be atomic; only the update of a single dimension in the accumulator vector does.
std::string is around 32 bytes and we use it for a vector ID, but the vectors themselves are usually at most 512 bytes. Make the ID a template parameter and switch from std::string to char* for string IDs.
Use memory pools, custom allocators, and allocation of nearby vectors in contiguous memory to reduce allocation overhead and improve locality of reference.
Create a set of standard test datasets with their expected quality in terms of internal and external measures. This can be used to catch any regressions that modifications may introduce.
Unrolling the loops in the Hamming distance calculation may give further performance improvements by keeping all integer execution units busy inside newer processors. The unrolled body repeats the per-chunk sequence: load uint64_t chunk1 and chunk2, compute result = chunk1 ^ chunk2, then POPCNT result.
It would be ideal for the indexer to output integer-valued document vectors containing term frequencies. These can optionally be written to disk in a compressed format using https://github.com/lemire/FastPFOR to allow easy experimentation with different representation approaches. The indexer would also output term collection statistics.
This would allow quick post-processing to convert the raw vectors to weighted vectors with TF-IDF, BM25, etc., or to convert them to signatures.