oertl / probminhash Goto Github PK
View Code? Open in Web Editor NEWProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity
ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity
Hi Otmar,
I ran the probminhash test for order minhash performance and equivalence, output is like a histogram (sketch) but no actual distance computed for 2 sets to approximate edit distance. I checked the omh2019 repo (https://github.com/Kingsford-Group/omhismb2019), in omh.cpp:
double compare_sketch_pair(const char* p1, const char* p2, unsigned m, unsigned k, unsigned l, bool circular) {
const unsigned block = std::max(l, (unsigned)1) * k;
unsigned count = 0;
if(!circular || l < 2) {
for(unsigned i = 0; i < m; ++i, p1 += block, p2 += block)
count += memcmp(p1, p2, block) == 0;
} else {
for(unsigned i = 0; i < m; ++i, p1 += block, p2 += block) {
for(unsigned j = 0; j < l; ++j) {
if(memcmp(p1, p2 + j * k, block - j * k) == 0 && memcmp(p1 + block - j * k, p2, j * k) == 0) {
++count;
break;
}
}
}
}
return (double)count / m;
}
It seems just count how many bits matched divided by block size will then be the collision probability, which is the order Jaccard for the approixmate edit distance?
Thank you,
Jianshu
Hello Otmar,
In the main paper:
“OPH has by far the best performance for input sizes n which are in the order of m, but is quite slow for small n” (Figure 5. The last sub figure, when w_d is 1, probminhash will then be simple minhash and can be compare with OPH with optimal densification), this is exactly what the most recent bidirectional densification idea was solving (https://dl.acm.org/doi/abs/10.1145/3448016.3452833), e.g., when n is small and there will be many empty bins after hashing n element to the sketch with size m, optimal densification has an O(m^2) complexity for densification, which means it is very slow. bidensification improved it to O(m), so very fast then, at all size of n relative to m. In practice, n ls always much larger than m, so even for the original densification, no problem in the real-world application, but the bidirectional densification idea solves the problem with small n in the end. With the bidirectional densification, OPH with bidirectional densification is now strictly O(n+m) no matter the size of m and n. Do you think this is something interesting to implement or mention.
Thanks,
Jianshu
Hello Otmar,
I have a question on J_N and J_P, since both are scale-invariant, J_N is the normalized weighted jaccard similarity, is J_p (assume P-minhash is used) closer to J (unweighted, no-normalization) or J_N for a give set, where there are weight for each element in the set. I imagine J and J_N is not the same for the set. If my goal was to use J_N to compare 2 sets, will J_p give similar answer.
Thanks,
Jianshu
Dear oertl:
I am Yi Wang the author of wyhash :-D I polished the hyperloglog algorithm by discovering the underlying mathematical formules and constant. I improved reduced 1% log scale error compared to old versions. It is available at https://github.com/wangyi-fudan/wyhll
Currently, I am interested in probabilistic skech methods and will learn advanced methods from your papers.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.