rizkg / bbhash Goto Github PK
View Code? Open in Web Editor NEWBloom-filter based minimal perfect hash function library
License: MIT License
Bloom-filter based minimal perfect hash function library
License: MIT License
hello @rizkg I just wanted to let you know that my Python bindings for your code, https://github.com/dib-lab/pybbhash, can now be installed via standard Python pip
as well as via conda, conda install -c conda-forge bbhash
.
Thanks again for bbhash!
Looking at your popcount code in
Lines 170 to 189 in 6bb97c4
it's perhaps worthwhile noting that you could speed-up your popcounts by using the GCC __builtin_pocount()
and __builtin_popcountl()
intrinsics when available, instead. Which would be compiled into a single POPCNT instruction on some CPUs - e.g. on not too old Intel ones.
Using the super-fast wyhash hash function (https://github.com/wangyi-fudan/wyhash), I'm seeing a collision between two integers, one of which is not in the list.
Specifically:
I've attached a minimum working example to this post. My output is as follows, which I hope helps illustrate the problem further.
$ g++ main.cpp -o out -lpthread && ./out
area = 9853546745770798894
Ninja21_V = 10499620994759670644
HelloWorld = 990340755478284348
wyand hash of area = 1514463996213185106
wyand hash of Ninja21_V = 135295144726059243
Looking up area which is not in the map: 0
Looking up Ninja21_V which is in the map: 0
Looking up HelloWorld which is not in the map: 18446744073709551615
I tried to run the test benchmark, and ran into a segfault:
-> % ./Bootest 100000000 4 -check -bench
key file generated
Construct a BooPHF with 100000000 elements
for info, total work write each : 1.000 total work inram from level -1991486788 : -1991486787.000 total work raw : 25.000
[Building BooPHF] 100 % elapsed: 0 min 4 sec remaining: 0 min 0 sec[1] 2339 segmentation fault (core dumped) ./Bootest 100000000 4 -check -bench
I've traced the segfault here:
#0 0x00005555555656f5 in void boomphf::mphf<unsigned long, boomphf::SingleHashFunctor<unsigned long> >::pthread_processLevel<bfile_iterator>(std::vector<unsigned long, std::allocator<unsigned long> >&, std::shared_ptr<bfile_iterator>, std::shared_ptr<bfile_iterator>, int) ()
#1 0x0000555555565a9d in void* boomphf::thread_processLevel<unsigned long, boomphf::SingleHashFunctor<unsigned long>, file_binary, bfile_iterator>(void*) ()
#2 0x00007ffff7bbd6db in start_thread (arg=0x7ffff50bd700) at pthread_create.c:463
#3 0x00007ffff6fa788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
This is my gcc version:
-> % g++ --version
g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Probably it's unrelated, but I also see a lot of warnings of this type:
BooPHF.h:1026:10: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘uint64_t {aka long unsigned int}’ [-Wformat=]
``
I'm not sure if boophf allows to hash regular strings. Because at one point in the code, a buffer object does a malloc of sizeof(template hashed type). If the template type is a string, then sizeof(string) won't be enough to store the object.
So, someday, let's test Boophf with strings and not uint64_t.
I'm curious what drawbacks there are when not removing duplicates from BBhash's input.
I could save quite a bit of compute effort in my case if I can provide BBhash with duplicate keys, but I'd prefer not to do this if there are serious issues that might result.
I would like to encourage false positives to occur for inputs that are near (in some comparative sense) members of the set that was used to build the PMHF.
Would the construction algorithm already encourage this to happen for some cases? I haven't tested but perhaps you have thought about this.
If not, what adjustment should I consider making to encourage locality sensitive false positives?
Hi @rizkg and others,
First of all, thank you for your awesome MPHF implementation. It's fast, and easy to use.
There is a need for generating some temporary files on disk during the construction. But the address used for temp files is relative to where one is running the code from. This happened to make our run fail a couple of times, while we were running the code from a disk that was full but giving the output address on another disk that had enough space (also, the binary was not in the same disk as the one we were running our program from). That took us a while to wonder what is causing this failure.
If not wrong, I think the part that one can change root address for temp files to where the final output MPHF is stored should be here: https://github.com/rizkg/BBHash/blob/master/BooPHF.h#L1377
We fixed this by allowing a prefix for the temporary directory to be passed to the constructor. But you as authors might have a better way to do it.
Either way, it seems it would be useful to allow, at least optionally, to override the current behavior.
Thanks,
Fatemeh
Looking at the example code you might think that bhpf->lookup()
returns an index that can be used to access the element in the original data. If that were so, code like
for (u_int64_t i = 0; i < nelem; i++){
uint64_t idx = bphf->lookup(data[i]);
assert(data[idx] == data[i]);
}
would never fail. But it does – on every element.
So, what exactly does the index returned by mphf::lookup()
mean?
Hi ,
In Branch " alltypes" running below gives core dump
./example_custom_hash_strings 1000000000 32
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Any Hint on this
First thanks for this great stuff!
Lookup can be 3x faster on strings with the following changes:
uint64_t lookup(const elem_t &elem)
uint64_t getLevel(hash_pair_t & bbhash,const elem_t &val,int * res_level, int maxlevel = 100, int minlevel =0)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.