Your hasing function looks like the hashCode algorhyt

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Ok, gotcha. <div class="snippet-clipboard-content notranslate position-relative ov

Would it be possible to add this to your benchmark suite? </blockquot

Thoughts on hashing optimizations about mingo HOT 12 CLOSED

kofrasa commented on July 18, 2024

Thoughts on hashing optimizations

from mingo.

Comments (12)

kofrasa commented on July 18, 2024 2

Hi Ben, I think your request sounds reasonable. Different use cases may need different hashing guarantees. I will take a look at the PR soon. Regards, Francis

…

On Mon, Nov 30, 2020 at 6:25 PM Benjamin DANIEL ***@***.***> wrote: Here a fork to avoid hash collision, but loosing performances : https://github.com/BenjD90/mingo Here an exemple of processing error that it avoid : import mingo from 'mingo' const collection = [ { "codes": ["KNE_OC42-midas"] }, { "codes": ["KNE_OCS3-midas"] }]; const criteria = { "codes": { "$in": ["KNE_OCS3-midas"] }}; console.log(`---- START ----`); let query = new mingo.Query(criteria)const cursor = query.find(collection) // print all match, should match only the 2ndwhile (cursor.hasNext()) { console.log(cursor.next())} console.log(`---- END ----`) /* Output :---- START ----{ codes: [ 'KNE_OC42-midas' ] }{ codes: [ 'KNE_OCS3-midas' ] }---- END ----*/ I'll try do make de PR soon, to allow user to give it's own hash function, and tell it in the documentation. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#62 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAHF7EPQHIKYLBYGFUDC6CLSSPBPHANCNFSM4DZ2RXRA> .

from mingo.

kofrasa commented on July 18, 2024 1

I had doubts about this find, so I decided to write my own benchmark. It follows closely what you have in JSperf but can be run be locally.

All the algorithms compare well, however hashCode gives the most preferred results for time and collision. It is also well behaved and has less variation between runs for different inputs.

The benchmark excludes the actual first run for each algorithm, which is not shown in the output. See the script here

I suspect the data you have is either for a different algorithm or for a specific kind of input (e.g. serial numbers, names, zipcodes etc).

first

data size: 1000000
====================
hashcode:
{"time":[4887,5275,5381,5973],"collisions":109,"avg":5379}

djb2:
{"time":[4662,5390,5107,5286],"collisions":129,"avg":5111.25}

murmurhash3:
{"time":[5663,6046,5917,5966],"collisions":129,"avg":5898}

sdbm:
{"time":[6254,6391,6799,6457],"collisions":122,"avg":6475.25}

second

data size: 1000000
====================
hashcode:
{"time":[5294,5582,5611,5666],"collisions":115,"avg":5538.25}

djb2:
{"time":[5615,5625,5414,5574],"collisions":129,"avg":5557}

murmurhash3:
{"time":[8046,6030,5985,5916],"collisions":104,"avg":6494.25}

sdbm:
{"time":[6368,6521,6480,8156],"collisions":118,"avg":6881.25}

third

data size: 1000000
====================
hashcode:
{"time":[5399,5329,5419,5408],"collisions":122,"avg":5388.75}

djb2:
{"time":[5266,5348,5367,7305],"collisions":122,"avg":5821.5}

murmurhash3:
{"time":[5833,5890,5712,7132],"collisions":106,"avg":6141.75}

sdbm:
{"time":[6343,6336,6293,6416],"collisions":108,"avg":6347}

from mingo.

kofrasa commented on July 18, 2024 1

After some tweaks and a few more runs, it turns out the XOR version of the djb2 algorithm provides better performance per collision.
[UPDATE]
hashCode still performs better

from mingo.

kofrasa commented on July 18, 2024

I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.

Keeping the code size small is very important. I think if you want to use some features from mingo with an very optimal implementation, you can roll out your own custom operator. If your dataset is very large and you aren't in the browser, then you can just use MongoDB directly by way of the mongoimport command.

from mingo.

kofrasa commented on July 18, 2024

I am quite curious to know how you came to these numbers. hashCode is generally considered a good enough (in both speed and collision resistance) for simple cases.

murmurhash3 (pure javascript) is about 140% as fast as hashCode.
cityhash (c++ node.js module) is about 280% as fast as hashCode

from mingo.

Redsandro commented on July 18, 2024

@kofrasa said:

I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.

Ok, I can understand that. We differ on persective a bit and that's fine. Is making the hash function extensible an interesting alternative? Everything would work as it works now, except you could optionally overrule the function like this:

mingo.setup({
    key    : '_id',
    hashFn : function(val) {
        return murmurHash3.x86.hash32(val, 22)
    }
});

mingo could be used as is, or with a js library for browser implementations, or with a c++ library for node.js implementations. Everyone wins.

I am quite curious to know how you came to these numbers.

Honestly, I remember from previous research when I wanted to optimize hashing for object comparisons within large arrays of objects. I think I can find something when I google my name.

-update-

Here is a jsperf comparison I did in 2015:

The comparison was based on the work of others and I'm sure there are revisions after me.
Personally I am only interested in Chrome results (because the same V8 js engine powers node.js) but speed in other browsers may vary.
string-hash (removed from screenshot for brevity) is the clear winner but seemed very unreliable because of hash collisions.
410 / 310 * 100 = 132%. I swear it was 140% when I tested in 2015. 😝

This was for javascript solutions, but later on I went to investigate native solutions.

I found this article with the following results:

Speed depends on input size, but for 2 to 32 characters, cityHash is 200 to 500% faster.

Note that cityHash is from Google and MurmurHash3 got someone hired at Google (or the other way around).

-update-

This article on Reddit might be relevant to your interests.
https://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/

-update-

What is also interesting is that native functions like cityHash might make it's way into javascript when WebAssembly makes it way into mainstream in a year or two.

from mingo.

kofrasa commented on July 18, 2024

Thanks for the benchmarks.

While we have numbers, it makes sense to take the fastest and relatively smaller algorithm.
string-hash in particular was up for consideration when I choose the hash function, but I run no benchmark. From the data, the speed is uncontested however, we don't know about how collision resistant it is. This may have influenced my choice at the time but I do not remember.

Exposing an interface to override the hash function will merely create a contract that must be supported in the future. I would argue that is not necessary. It is an implementation detail which is not a hard requirement. I might as well compare the stringified results of the inputs directly which will work fine and guarantee better correctness. The only purpose of the hash was to reduce the footprint of the amount of data required to be in memory, during certain operations.

In the future, those details could change in a manner transparent to users.

from mingo.

Redsandro commented on July 18, 2024

Ok, gotcha.

mingo.setup({
    key    : '_id',
    hashFn : val => val
});

😉

Am I correct to assume that your preferences for mingo make it unlikely that you'll accept PRs in the future related to small* performance optimizations or extensibility of internals?

*) 'small' as in trivial for smaller datasets (e.g. todo-list), but relevant for larger datasets (e.g. data analysis)

from mingo.

kofrasa commented on July 18, 2024

PR's are welcome (fixes, new operators, internal improvements etc), but like any dev work, they are subject to project constraints.

This project environment is bias towards the browser, that means it should have a small footprint, be reasonably fast, provide the most useful query features.

There are numerous alternatives on the server side of things including using an actual MongoDB database with access to full indexing support if needed. You mentioned LokiJS for example, which says on the site

LokiJS is an in-memory database which prioritises performance over everything
They also have MongoDB API compatibility on their road map so I presume it is just a matter of time.

Back to your benchmark, I am curious to see how the hashCode implementation performs if written differently.

current

for (i = 0; i < testCount; i++) {
  //from http://goo.gl/lY5ww
  var s = 0;
  for (var ii = 0, l = testStrings[i].length; ii < l; ii++) {
    s = ((s << 5) - s) + testStrings[i][ii].charCodeAt(0);
    s = s & s;
  }
}

this is the slightly different version mingo uses.

for (i = 0; i < testCount; i++) {
  var h = 0, s = testStrings[i];
  var x = s.length;
  while (x) {
    (h << 5) - h + s.charCodeAt(--x)
    h |= 0
  }
}

Would it be possible to add this to your benchmark suite?

from mingo.

Redsandro commented on July 18, 2024

Would it be possible to add this to your benchmark suite?

Yes, although jsperf has been 502 all day. I'll try when it's up again.

from mingo.

Redsandro commented on July 18, 2024

@kofrasa

jsperf is finally up again. I've added the test. I also found that I made some mistakes in the string-hash test in 2015, it's so fast because it breaks early.

Here are the new tests: https://jsperf.com/murmurhash3-vs-xxhash

And you're right. Your implementation of hashCode is faster.

Meanwhile, I did read that hashCode is likely to collide at least once when using more than 77.000 hashes.

That's more than I expected. It's not as much about the algorhythm as it's about the 32 bitness.

With this information, I think it's more serious now (perhaps even crucial) (have the option to) disable hashing of match fields in for example $look, or at least use 64 bit hashes (which are slower across the board).

MurmurHash can do 64 bit and 128 bit hex.

from mingo.

BenjD90 commented on July 18, 2024

Here a fork to avoid hash collision, but loosing performances : https://github.com/BenjD90/mingo

Here an exemple of processing error that it avoid :

import mingo from 'mingo'

const collection = [
  {
    "codes": ["KNE_OC42-midas"]
  },
  {
    "codes": ["KNE_OCS3-midas"]
  }
];

const criteria = {
  "codes": { "$in": ["KNE_OCS3-midas"] }
};

console.log(`---- START ----`);

let query = new mingo.Query(criteria)
const cursor = query.find(collection)

// print all match, should match only the 2nd
while (cursor.hasNext()) {
  console.log(cursor.next())
}

console.log(`---- END ----`)

/* Output :
---- START ----
{ codes: [ 'KNE_OC42-midas' ] }
{ codes: [ 'KNE_OCS3-midas' ] }
---- END ----
*/

I'll try do make de PR soon, to allow user to give it's own hash function, and tell it in the documentation.

from mingo.

Thoughts on hashing optimizations about mingo HOT 12 CLOSED

Comments (12)

first

second

third

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent