Comments (12)
from mingo.
I had doubts about this find, so I decided to write my own benchmark. It follows closely what you have in JSperf, but can be run locally.
All the algorithms perform comparably; however, hashCode gives the best results for both time and collisions. It is also well behaved, with less variation between runs on different inputs.
The benchmark excludes the first run for each algorithm, which is not shown in the output. See the script here.
I suspect the data you have is either for a different algorithm or for a specific kind of input (e.g. serial numbers, names, zip codes, etc.).
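For context, a minimal harness along the lines described above could look like this (a sketch with hypothetical names; the actual script linked above may differ):

```javascript
// Sketch of a time + collision benchmark (assumed structure).
// Inputs are assumed to be unique strings.
function benchmark(hashFn, inputs) {
  hashFn(inputs[0]); // warm-up: the first run is excluded, as described above
  const start = Date.now();
  const seen = new Set();
  let collisions = 0;
  for (const s of inputs) {
    const h = hashFn(s);
    if (seen.has(h)) collisions += 1;
    else seen.add(h);
  }
  return { time: Date.now() - start, collisions };
}
```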
first
data size: 1000000
====================
hashcode:
{"time":[4887,5275,5381,5973],"collisions":109,"avg":5379}
djb2:
{"time":[4662,5390,5107,5286],"collisions":129,"avg":5111.25}
murmurhash3:
{"time":[5663,6046,5917,5966],"collisions":129,"avg":5898}
sdbm:
{"time":[6254,6391,6799,6457],"collisions":122,"avg":6475.25}
second
data size: 1000000
====================
hashcode:
{"time":[5294,5582,5611,5666],"collisions":115,"avg":5538.25}
djb2:
{"time":[5615,5625,5414,5574],"collisions":129,"avg":5557}
murmurhash3:
{"time":[8046,6030,5985,5916],"collisions":104,"avg":6494.25}
sdbm:
{"time":[6368,6521,6480,8156],"collisions":118,"avg":6881.25}
third
data size: 1000000
====================
hashcode:
{"time":[5399,5329,5419,5408],"collisions":122,"avg":5388.75}
djb2:
{"time":[5266,5348,5367,7305],"collisions":122,"avg":5821.5}
murmurhash3:
{"time":[5833,5890,5712,7132],"collisions":106,"avg":6141.75}
sdbm:
{"time":[6343,6336,6293,6416],"collisions":108,"avg":6347}
from mingo.
After some tweaks and a few more runs, it turns out the XOR version of the djb2
algorithm provides better performance per collision.
[UPDATE] hashCode still performs better.
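For reference, the two djb2 variants mentioned differ only in how the next character is combined with the running hash; a sketch (my own naming, 32-bit unsigned results):

```javascript
// Classic djb2: hash = hash * 33 + c
function djb2(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) + hash) + str.charCodeAt(i); // hash * 33 + c
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// XOR variant: hash = hash * 33 ^ c
function djb2xor(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) + hash) ^ str.charCodeAt(i); // hash * 33 XOR c
  }
  return hash >>> 0;
}
```

The `<< 5` plus the addition is the usual multiply-by-33 trick; only the final combining step (add vs XOR) differs between the two.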
from mingo.
I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.
Keeping the code size small is very important. I think if you want to use some features from mingo with a more optimal implementation, you can roll your own custom operator. If your dataset is very large and you aren't in the browser, then you can just use MongoDB directly by way of the mongoimport command.
from mingo.
I am quite curious to know how you came to these numbers. hashCode is generally considered good enough (in both speed and collision resistance) for simple cases.
murmurhash3 (pure JavaScript) is about 140% as fast as hashCode.
cityhash (C++ Node.js module) is about 280% as fast as hashCode.
from mingo.
@kofrasa said:
I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.
Ok, I can understand that. We differ in perspective a bit and that's fine. Is making the hash function extensible an interesting alternative? Everything would work as it does now, except you could optionally override the function like this:
mingo.setup({
  key: '_id',
  hashFn: function (val) {
    return murmurHash3.x86.hash32(val, 22);
  }
});
mingo could be used as is, or with a JS library for browser implementations, or with a C++ library for Node.js implementations. Everyone wins.
I am quite curious to know how you came to these numbers.
Honestly, I remember from previous research when I wanted to optimize hashing for object comparisons within large arrays of objects. I think I can find something when I google my name.
-update-
Here is a jsperf comparison I did in 2015:
- The comparison was based on the work of others and I'm sure there are revisions after me.
- Personally I am only interested in Chrome results (because the same V8 js engine powers node.js) but speed in other browsers may vary.
string-hash (removed from screenshot for brevity) is the clear winner, but seemed very unreliable because of hash collisions.
410 / 310 * 100 = 132%. I swear it was 140% when I tested in 2015. 😝
This was for javascript solutions, but later on I went to investigate native solutions.
I found this article with the following results:
Speed depends on input size, but for 2 to 32 characters, cityHash is 200 to 500% faster.
Note that cityHash is from Google, and MurmurHash3 got someone hired at Google (or the other way around).
-update-
This article on Reddit might be relevant to your interests.
https://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/
-update-
What is also interesting is that native functions like cityHash might make their way into JavaScript when WebAssembly makes its way into the mainstream in a year or two.
from mingo.
Thanks for the benchmarks.
While we have numbers, it makes sense to take the fastest, relatively small algorithm.
string-hash in particular was up for consideration when I chose the hash function, but I ran no benchmark. From the data, its speed is uncontested; however, we don't know how collision resistant it is. This may have influenced my choice at the time, but I do not remember.
Exposing an interface to override the hash function would merely create a contract that must be supported in the future. I would argue that is not necessary. It is an implementation detail, not a hard requirement. I might as well compare the stringified results of the inputs directly, which would work fine and guarantee better correctness. The only purpose of the hash was to reduce the memory footprint of the data required during certain operations.
In the future, those details could change in a manner transparent to users.
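As an illustration of that alternative, comparing stringified inputs instead of hashes trades memory for guaranteed correctness. A sketch (`stringify` here is a stand-in for whatever canonical serialization the library actually uses, not mingo's real internals):

```javascript
// Deduplicate values by comparing their serialized forms directly.
// No hash collisions are possible, but the full strings stay in memory.
const stringify = (val) => JSON.stringify(val); // stand-in serializer

function uniq(values) {
  const seen = new Set();
  const result = [];
  for (const v of values) {
    const key = stringify(v);
    if (!seen.has(key)) {
      seen.add(key);
      result.push(v);
    }
  }
  return result;
}
```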
from mingo.
Ok, gotcha.
mingo.setup({
  key: '_id',
  hashFn: val => val
});
😉
Am I correct to assume that your preferences for mingo make it unlikely that you'll accept PRs in the future related to small* performance optimizations or extensibility of internals?
*) 'small' as in trivial for smaller datasets (e.g. todo-list), but relevant for larger datasets (e.g. data analysis)
from mingo.
PRs are welcome (fixes, new operators, internal improvements, etc.), but like any dev work, they are subject to project constraints.
This project is biased towards the browser environment, which means it should have a small footprint, be reasonably fast, and provide the most useful query features.
There are numerous alternatives on the server side, including using an actual MongoDB database with access to full indexing support if needed. You mentioned LokiJS for example, which says on its site:
LokiJS is an in-memory database which prioritises performance over everything
They also have MongoDB API compatibility on their roadmap, so I presume it is just a matter of time.
Back to your benchmark, I am curious to see how the hashCode implementation performs if written differently.
current
for (i = 0; i < testCount; i++) {
  // from http://goo.gl/lY5ww
  var s = 0;
  for (var ii = 0, l = testStrings[i].length; ii < l; ii++) {
    s = ((s << 5) - s) + testStrings[i].charCodeAt(ii);
    s = s & s;
  }
}
This is the slightly different version mingo uses:
for (i = 0; i < testCount; i++) {
  var h = 0, s = testStrings[i];
  var x = s.length;
  while (x) {
    h = (h << 5) - h + s.charCodeAt(--x); // note: walks the string backwards
    h |= 0;
  }
}
Would it be possible to add this to your benchmark suite?
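One detail worth noting when comparing the two: the first loop walks the string forwards, while the mingo version walks it backwards (`charCodeAt(--x)`), so they produce different hash values for the same input even though both compute a 31x Java-style hash. A standalone forward version, useful for checking against Java's String.hashCode (my own wrapper around the loop above):

```javascript
// Forward 31x hash, equivalent to Java's String.hashCode.
function hashCode(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h << 5) - h + s.charCodeAt(i); // h * 31 + charCode
    h |= 0; // clamp to a signed 32-bit integer
  }
  return h;
}

hashCode("hello"); // 99162322, same as "hello".hashCode() in Java
```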
from mingo.
Would it be possible to add this to your benchmark suite?
Yes, although jsperf has been returning 502 all day. I'll try when it's up again.
from mingo.
jsperf is finally up again. I've added the test. I also found that I made some mistakes in the string-hash test in 2015; it's so fast because it breaks early.
Here are the new tests: https://jsperf.com/murmurhash3-vs-xxhash
And you're right. Your implementation of hashCode is faster.
Meanwhile, I did read that hashCode is likely to collide at least once when hashing more than 77,000 values.
That's more than I expected. It's not so much about the algorithm as about the 32-bit output size.
With this information, I think it is now more important (perhaps even crucial) to have the option to disable hashing of match fields in, for example, $look, or at least to use 64-bit hashes (which are slower across the board).
MurmurHash can do 64-bit and 128-bit hex.
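That ~77,000 figure matches the birthday bound: for a 32-bit hash, a collision becomes more likely than not after roughly 1.177 · sqrt(2^32) ≈ 77,000 values. A quick sketch of the standard approximation (my own helper, not part of mingo):

```javascript
// Birthday approximation: probability of at least one collision among n
// uniformly distributed k-bit hashes is roughly 1 - exp(-n*(n-1) / 2^(k+1)).
function collisionProbability(n, bits) {
  return 1 - Math.exp((-n * (n - 1)) / (2 * Math.pow(2, bits)));
}

collisionProbability(77000, 32); // ≈ 0.5   (32-bit: a coin flip at ~77k hashes)
collisionProbability(77000, 64); // ≈ 1.6e-10 (64-bit: effectively negligible)
```

This is why moving to 64-bit hashes buys so much headroom: the same workload drops from a coin flip to a one-in-billions chance.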
from mingo.
Here is a fork that avoids hash collisions, at the cost of some performance: https://github.com/BenjD90/mingo
Here is an example of a processing error that it avoids:
import mingo from 'mingo'
const collection = [
{
"codes": ["KNE_OC42-midas"]
},
{
"codes": ["KNE_OCS3-midas"]
}
];
const criteria = {
"codes": { "$in": ["KNE_OCS3-midas"] }
};
console.log(`---- START ----`);
let query = new mingo.Query(criteria)
const cursor = query.find(collection)
// prints all matches, though only the 2nd should match
while (cursor.hasNext()) {
console.log(cursor.next())
}
console.log(`---- END ----`)
/* Output :
---- START ----
{ codes: [ 'KNE_OC42-midas' ] }
{ codes: [ 'KNE_OCS3-midas' ] }
---- END ----
*/
I'll try to make a PR soon to allow users to provide their own hash function, and document it.
from mingo.