Giter Club home page Giter Club logo

probminhash's People

Contributors

oertl avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

probminhash's Issues

Question on compute the order MinHash based distance

Hi Otmar,

I ran the probminhash test for order minhash performance and equivalence, output is like a histogram (sketch) but no actual distance computed for 2 sets to approximate edit distance. I checked the omh2019 repo (https://github.com/Kingsford-Group/omhismb2019), in omh.cpp:

double compare_sketch_pair(const char* p1, const char* p2, unsigned m, unsigned k, unsigned l, bool circular) {
const unsigned block = std::max(l, (unsigned)1) * k;
unsigned count = 0;
if(!circular || l < 2) {
for(unsigned i = 0; i < m; ++i, p1 += block, p2 += block)
count += memcmp(p1, p2, block) == 0;
} else {
for(unsigned i = 0; i < m; ++i, p1 += block, p2 += block) {
for(unsigned j = 0; j < l; ++j) {
if(memcmp(p1, p2 + j * k, block - j * k) == 0 && memcmp(p1 + block - j * k, p2, j * k) == 0) {
++count;
break;
}
}
}
}
return (double)count / m;
}

It seems just count how many bits matched divided by block size will then be the collision probability, which is the order Jaccard for the approixmate edit distance?

Thank you,

Jianshu

Question on OPH with optimal densification versus ProbMinHash

Hello Otmar,

In the main paper:

“OPH has by far the best performance for input sizes n which are in the order of m, but is quite slow for small n” (Figure 5. The last sub figure, when w_d is 1, probminhash will then be simple minhash and can be compare with OPH with optimal densification), this is exactly what the most recent bidirectional densification idea was solving (https://dl.acm.org/doi/abs/10.1145/3448016.3452833), e.g., when n is small and there will be many empty bins after hashing n element to the sketch with size m, optimal densification has an O(m^2) complexity for densification, which means it is very slow. bidensification improved it to O(m), so very fast then, at all size of n relative to m. In practice, n ls always much larger than m, so even for the original densification, no problem in the real-world application, but the bidirectional densification idea solves the problem with small n in the end. With the bidirectional densification, OPH with bidirectional densification is now strictly O(n+m) no matter the size of m and n. Do you think this is something interesting to implement or mention.

Thanks,

Jianshu

question on J_N and J_p

Hello Otmar,

I have a question on J_N and J_P, since both are scale-invariant, J_N is the normalized weighted jaccard similarity, is J_p (assume P-minhash is used) closer to J (unweighted, no-normalization) or J_N for a give set, where there are weight for each element in the set. I imagine J and J_N is not the same for the set. If my goal was to use J_N to compare 2 sets, will J_p give similar answer.

Thanks,

Jianshu

The elegant wyHyperLogLog

Dear oertl:
I am Yi Wang the author of wyhash :-D I polished the hyperloglog algorithm by discovering the underlying mathematical formules and constant. I improved reduced 1% log scale error compared to old versions. It is available at https://github.com/wangyi-fudan/wyhll
Currently, I am interested in probabilistic skech methods and will learn advanced methods from your papers.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.