Comments (9)
Thank you for the idea.
How do you calculate the SHIFT from the bucket_count? Wouldn't you end up just taking the x most significant bits of the hash?
I know there is the "fast range" from Lemire that uses a multiplication + shift to map a value x into a range [0, N).
Instead of:

    uint32_t reduce(uint32_t x, uint32_t N) {
        return x % N;
    }

He proposes to use:

    uint32_t reduce(uint32_t x, uint32_t N) {
        return ((uint64_t) x * (uint64_t) N) >> 32;
    }
But the problem is that hashes that are close together will end up in the same bucket. Not too much of a problem with a good hash function, but libstdc++ uses an identity hash function for std::hash<int>. So if there are a lot of small or close integers (which is often the case) you will end up with a lot of collisions. At first glance, your method will suffer from the same problems.
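A minimal sketch of the clustering issue (assuming an identity hash, as in libstdc++): with Lemire's reduction, the bucket is determined by the most significant bits of the product, so all small keys land in bucket 0.

```cpp
#include <cassert>
#include <cstdint>

// Lemire's fast range: maps x into [0, N) with one multiply and one shift.
uint32_t lemire_reduce(uint32_t x, uint32_t N) {
    return (uint32_t)(((uint64_t)x * (uint64_t)N) >> 32);
}
```

With N = 1024 buckets and identity-hashed keys 0..999, every key satisfies x * 1024 < 2^32, so lemire_reduce returns 0 for all of them: 1000 keys, one bucket.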
from robin-map.
Yeah, there are many variations of this idea. Lemire's method saves one shift, but requires a "wide multiplication" that may be a tiny bit more expensive than a "narrow multiplication". Although on a 64-bit CPU, it may be implemented in 64-bit registers with the same MUL+SHIFT sequence as my code.
Now about distributions. Indeed, Lemire's method is equivalent to computing x / N, which suffers with an identity hash function. But my code:
- allows the use of an arbitrary multiplication constant
- extracts bits 32-N .. 31 from the result rather than bits 32 .. 31+N.
It's easy to see that with a full-width 32-bit multiplication constant, every bit of the output will depend on every bit of the input. Well, almost. Actually, the problem is exactly the opposite of Lemire's hash - with my formula, the lower bits of the output don't depend on the higher bits of the input. This may be "good enough" for most, but not all, practical purposes.
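A sketch of the multiply-then-shift mapping described above. The constant here is a placeholder (the 31-bit prime 1234567891 mentioned later in the thread); any odd, well-mixed 32-bit constant works.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder odd multiplier; see the discussion of constants below.
const uint32_t MUL_CONST = 1234567891u;

// Map a 32-bit hash into 2^K buckets (K in 1..31) by keeping
// bits 32-K .. 31 of the 32-bit product x * MUL_CONST.
uint32_t bucket_for(uint32_t x, unsigned K) {
    return (x * MUL_CONST) >> (32 - K);
}
```

Unlike the Lemire reduction, the low bits of x still influence the result here, since they propagate upward through the carries of the multiplication.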
If we need a fairer hash, we can use a wide multiplication and then add the higher and lower words of the result:

    res = u64(x) * u64(N);
    return u32(res >> 32) + u32(res);
For an even fairer hash, we can repeat it again and/or add some more rol/shift/add operations. Just to let you know - for an arbitrary constant N, the cost of computing the exact x % N value is 3 wide multiplications plus a few simpler operations, so even my "fair hash" is still several times cheaper.
Alternatively, we can go with CRC hashing. It's a very interesting mathematical construction: CRC32(n) = (n*x^32) % p(x), where p(x) is a primitive binary polynomial of degree 32 and we compute the remainder of polynomial division with GF(2) arithmetic. Yes, CRC is just another remainder, and it shares some desirable properties with the arithmetic remainder, namely each bit of output depends on each bit of input, and the distribution of results is fair (moreover, AFAIR, CRC32(u32) is a 1:1 mapping, so it's ideally fair, guaranteed). Moreover, each CRC(n) is very different from CRC(n+1), which is also great for Robin Hood hashing.
Now, how to compute it: SSE 4.2 CPUs (i.e. 90% of current end-user computers) have a u32 CRC32(u32, prev_crc=0) instruction which is as cheap as a single narrow MUL. In 64-bit mode, there is also a u32 CRC32_64(u64, prev_crc=0) instruction that can process 64 input bits in a single step, and is equivalent to CRC32(u32(x), CRC32(x>>32)).
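For reference, the SSE 4.2 instruction computes CRC-32C (the Castagnoli polynomial, processed bit-reflected). A portable software sketch of the single-u32 step, assuming the usual reflected polynomial 0x82F63B78, looks like this; on SSE 4.2 hardware the whole loop collapses into one `_mm_crc32_u32` intrinsic call.

```cpp
#include <cassert>
#include <cstdint>

// Software equivalent of the SSE 4.2 CRC32 instruction step for one u32:
// CRC-32C, polynomial 0x1EDC6F41, in its reflected form 0x82F63B78.
uint32_t crc32c_u32(uint32_t crc, uint32_t v) {
    crc ^= v;
    for (int i = 0; i < 32; ++i)
        // shift one bit out; xor the polynomial in when that bit was 1
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}
```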
For more portable code, CRC can be computed using tables. I.e., since CRC(N) is just the remainder of some binary polynomial corresponding to the binary digits of N, we can split N into shorter polynomials, get the remainder for each sub-polynomial from a table, and then sum them up. Usually, four 256-entry (4 * 8 bits) tables are used, although we can use three tables instead, of 1024 + 2048 + 2048 entries (10+11+11 bits):

    b0 = x>>24
    b1 = (x>>16) & 0xFF
    b2 = (x>>8) & 0xFF
    b3 = x & 0xFF
    return table0[b0] ^ table1[b1] ^ table2[b2] ^ table3[b3]
This requires 4 load+MOVZX operations, 3 LOAD+XOR and 1 LOAD. It still looks faster than the exact x % N computation while providing even slightly better bit mixing. But it requires 4 KB of memory to store the tables. Plus it allows you to directly support 64-bit (and longer) input hashes with a guaranteed fair distribution.
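A sketch of the four-table variant above. By linearity of CRC over GF(2), the CRC of x equals the XOR of the CRCs of its four byte-slices, so the tables can be filled from a plain bitwise routine (here with the classic CRC-32 polynomial 0x04C11DB7 as an illustrative choice):

```cpp
#include <cassert>
#include <cstdint>

const uint32_t POLY = 0x04C11DB7u;  // illustrative CRC-32 polynomial p(x)

// Bitwise reference: remainder of n(x) * x^32 modulo p(x) in GF(2).
uint32_t crc_bitwise(uint32_t n) {
    uint32_t r = n;
    for (int i = 0; i < 32; ++i)  // multiply by x and reduce, 32 times
        r = (r << 1) ^ (POLY & (0u - (r >> 31)));
    return r;
}

// Same value via four 256-entry tables, one per input byte.
uint32_t crc_by_table(uint32_t x) {
    static uint32_t table[4][256];
    static bool ready = false;
    if (!ready) {  // fill tables once: table[j][b] = CRC of byte b in position j
        for (uint32_t b = 0; b < 256; ++b)
            for (int j = 0; j < 4; ++j)
                table[j][b] = crc_bitwise(b << (24 - 8 * j));
        ready = true;
    }
    return table[0][x >> 24] ^ table[1][(x >> 16) & 0xFF]
         ^ table[2][(x >> 8) & 0xFF] ^ table[3][x & 0xFF];
}
```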
from robin-map.
The idea seems interesting.
I will see if I can add a new growth policy based on it and do some tests when I have more time (a bit busy with the end of the year). If you have time don't hesitate to make a pull request, the interface to implement a growth policy is quite simple (see https://github.com/Tessil/robin-map#growth-policy).
from robin-map.
What constant should be used for CONST (in 32 and 64 bits), do you have any good values in mind? You said in your first message "CONST should be even and preferably prime number", but I suppose you meant odd, as 2 is the only even prime.
from robin-map.
It's a typo, of course it should be odd. It shouldn't be even, because in that case x*CONST will always have a zero in at least the lowest bit. Primality is not important per se, but it's a simple way to ensure that CONST doesn't have [small] factors that can interfere with the following calculations.
I usually use 1234567891 as a 31-bit prime number. Even better, you can borrow some constants from xxHash32 or xxHash64 - I know that Cyan is super-careful about details, so his numbers should be better than mine (i.e. have a ~50/50 split between 1 and 0 bits). Use prime.cpp to generate more primes - e.g. you can start with some randomly generated number having a fair split between 0 and 1 bits and then find the closest prime.
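I haven't seen the actual prime.cpp, but the "find the closest prime" step can be sketched with plain trial division, which is plenty fast for a one-off constant search:

```cpp
#include <cassert>
#include <cstdint>

// Trial-division primality test - slow in general, fine for one-off use.
bool is_prime(uint64_t n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (uint64_t d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime >= n: start from an odd candidate and walk upward.
uint64_t next_prime(uint64_t n) {
    if (n <= 2) return 2;
    if (n % 2 == 0) ++n;
    while (!is_prime(n)) n += 2;
    return n;
}
```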
from robin-map.
If you have time don't hesitate to make a pull request, the interface to implement a growth policy is quite simple
You are absolutely right that it's easy to implement other growth policies; it can be done in a few minutes. The real work is to benchmark the outcome, and to try various modifications of the basic idea. So, if you prefer, I can write several policies implementing variations of these ideas, but benchmarking them will be on you.
BTW, do you have any test(s) measuring the fairness of a policy (rather than raw speed)? SMHasher tests should be (almost) directly applicable for this task.
from robin-map.
BTW, do you have any test(s) measuring the fairness of a policy (rather than raw speed)?
No, the benchmarks I did were mainly geared toward comparing the speed of collision resolution in hash tables (I didn't check probe lengths and cache misses formally). I didn't bother too much with the hash reductions and their distribution, as the ones I use are fairly common.
from robin-map.
Overall, I think that the library should provide the following policies:
- No hashing and growth factor = 2^K. This policy just extracts the N lower bits from the provided hash value - ideal for situations when the client code already implements a sophisticated hashing algorithm and the library just needs to use the hash value as is.
- No hashing and arbitrary growth factor. In this case, the policy should just multiply the hash value by the current hash table size using Lemire's formula. Again, we rely on the client code to supply us with an already well-mixed hash value.
- Fair hashing and growth factor = 2^K. Fair hashing may use CRC or multiplication with my formula. It should be the default policy, since it guarantees a fair distribution even when user-provided hash values are unfair.
- Fair hashing and arbitrary growth factor. Combines the computations of the two previous policies.
- Hardware-assisted CRC fair hashing. Very fast and 100% fair, but requires SSE 4.2.
- Combination of policies 5+2.
Since all policies come in pairs, it would be great to combine them, i.e. a template CrcPolicy<float Factor=2> provides the default Factor=2 for users that don't care about the exact growth factor, and its code at compile time chooses between employing the lower bits of the hash when Factor == 2^K and using Lemire's formula otherwise.
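A sketch of that compile-time dispatch, simplified to a bool flag instead of a float Factor; this is a hypothetical illustration, not robin-map's actual growth-policy interface:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: picks the bucket mapping at compile time.
// The branch on the template parameter is a constant and folds away.
template <bool PowerOfTwo>
struct growth_policy_sketch {
    uint32_t bucket_count;  // must be a power of two when PowerOfTwo is true

    uint32_t bucket_for_hash(uint32_t hash) const {
        return PowerOfTwo
            ? hash & (bucket_count - 1)                           // keep low bits
            : (uint32_t)(((uint64_t)hash * bucket_count) >> 32);  // Lemire's formula
    }
};
```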
I believe that the default hashing policy should be the most reliable (fair) one, rather than the fastest one. This follows the "least surprise" principle.
from robin-map.
I'll close the issue and keep the power-of-two policy as the default policy for now. In the end I didn't really have the time to test the different alternatives extensively. If someone makes a comparison of different growth policies, feel free to post the link here, I'd be curious to read it.
from robin-map.