codelibs / minhash
This provides tools for the b-bit MinHash algorithm.
License: Apache License 2.0
In README.md you give this example:
// Compare a different text.
String text2 = "Solr is the popular, blazing fast open source enterprise search platform";
byte[] minhash2 = MinHash.calculate(analyzer, text2);
assertEquals(0.453125f, MinHash.compare(minhash, minhash2));
Please help me understand this algorithm. Why does comparing two very different strings give 0.453125f? Why isn't the value closer to 0?
Thanks
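One possible reading of the score, offered as a sketch rather than as the library's documented behaviour: assume `MinHash.compare` returns the fraction of matching bits between the two signatures, and that each hash function contributes b = 1 bit. Neither assumption is confirmed in this thread. Under those assumptions, two unrelated texts still agree on each bit about half the time by chance, so "not similar" sits near 0.5 rather than 0. The usual b-bit MinHash approximation puts the expected match fraction at roughly J + (1 - J) / 2^b, where J is the Jaccard similarity. The hypothetical helper `estimateJaccard` below (not part of the library) inverts that relation.

```java
// Hypothetical helper, not part of codelibs/minhash.
final class BBitScoreSketch {

    // Rescale a raw match fraction into a rough Jaccard estimate,
    // assuming matchFraction ~= J + (1 - J) / 2^b.
    static double estimateJaccard(double matchFraction, int b) {
        double chance = 1.0 / (1 << b);              // agreement expected by chance alone
        double j = (matchFraction - chance) / (1.0 - chance);
        return Math.max(0.0, Math.min(1.0, j));      // clamp noise into [0, 1]
    }

    public static void main(String[] args) {
        // Scores quoted in this thread, read as 1-bit match fractions.
        System.out.println(estimateJaccard(0.453125, 1));
        System.out.println(estimateJaccard(0.546875, 1));
    }
}
```

If the signature were 128 bits long (again an assumption), pure chance would give a standard deviation of about 0.044 around 0.5, so both 0.453125 and 0.546875 would be consistent with texts that share little or nothing.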
Hi, I've been learning to use this code recently. Can I think of it as the following three steps? First, Lucene is used to generate the set of tokens from the text. Then a family of hash functions is used to obtain the MinHash values. Finally, each MinHash value is reduced to b bits. That is why we end up with a MinHash of num * hashbit bits. Am I right? I would appreciate a reply to this question.
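The three steps described above can be sketched in plain Java. This is an illustration under stated assumptions, not the library's implementation: whitespace splitting stands in for the Lucene analyzer, the seeded `hash` function is an arbitrary mixer chosen for brevity, and all names (`BBitMinHashSketch`, `tokens`, `signature`, `compare`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only; not the codelibs/minhash implementation.
final class BBitMinHashSketch {

    // Step 1: turn the text into a set of tokens
    // (whitespace split stands in for a Lucene analyzer).
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // One member of a family of hash functions, selected by seed.
    static long hash(String token, int seed) {
        long h = token.hashCode() * 0x9E3779B97F4A7C15L + seed * 0xC2B2AE3D27D4EB4FL;
        h ^= h >>> 33;
        h *= 0xFF51AFD7ED558CCDL;
        h ^= h >>> 33;
        h *= 0xC4CEB9FE1A85EC53L;
        h ^= h >>> 33;
        return h;
    }

    // Steps 2 and 3: for each hash function keep the minimum hash over the
    // token set, then keep only the lowest b bits of that minimum.
    static int[] signature(String text, int numHashes, int b) {
        Set<String> set = tokens(text);
        int mask = (1 << b) - 1;
        int[] sig = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long min = Long.MAX_VALUE;
            for (String t : set) {
                min = Math.min(min, hash(t, i));
            }
            sig[i] = (int) (min & mask);
        }
        return sig;
    }

    // Fraction of positions whose b-bit values agree.
    static float compare(int[] x, int[] y) {
        int same = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == y[i]) {
                same++;
            }
        }
        return (float) same / x.length;
    }
}
```

With `numHashes` functions and `b` bits kept per function, the signature carries `numHashes * b` bits in total, which matches the "num * hashbit" size mentioned in the question.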
The code
// "The COVID-19 vaccine works quite well"
String text = "新冠疫苗效果不错";
byte[] minhash = calculateMinHash(text);
// "Eating every day, hahaha"
String text1 = "每天吃饭呀哈哈哈";
byte[] minhash1 = calculateMinHash(text1);
float score1 = MinHash.compare(minhash, minhash1);
The result is "0.546875".
According to your README, a result below 0.5 means the texts are not similar.
But these two texts are not similar, and the result is still greater than 0.5.
If I am using it incorrectly, could you help me? Thanks.
What goes in "..." for Tokenizer in the example you wrote in README.md?
I would like to replicate the example but I am confused there.
Thanks.