Learning M-Way Tree - Web Scale Clustering - EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures
TF-IDF
BM25 - probably quite useful with reflexive random indexing because it preserves the inner product space where BM25 works well
Log Likelihood from TopSig paper
Concurrent accumulation can be done simply by switching the ACCUMULATOR vector type to an atomic version. Note that whole-vector operations do NOT need to be atomic; only the update of a single dimension in the accumulator vector does.
std::string is around 32 bytes and we use it for a vector ID, but the vectors themselves are usually at most 512 bytes. Make the ID a template parameter and switch from std::string to char* for string IDs.
Use memory pools, custom allocators, and allocation of nearby vectors in contiguous memory to reduce allocation overhead and improve locality of reference.
Create a set of standard test datasets with their expected quality in terms of internal and external measures. This can be used to catch any regressions that modifications may introduce.
Unrolling the loops in the Hamming distance calculation may give further performance improvements by keeping all integer execution units busy inside newer processors. The unrolled body repeats the per-chunk sequence: load uint64_t chunk1 and chunk2, compute result = chunk1 ^ chunk2, then POPCNT result.
It would be ideal for the indexer to output integer-valued document vectors containing term frequencies. These can optionally be written to disk in a compressed format using https://github.com/lemire/FastPFOR to allow easy experimentation with different representation approaches. The indexer would also output term collection statistics.
This would allow quick post-processing to convert the raw vectors to weighted vectors with TF-IDF, BM25, etc., or to convert them to signatures.