Giter Club home page Giter Club logo

bbhash's People

Contributors

ekg avatar malfoy avatar pierrepeterlongo avatar rchikhi avatar rizkg avatar rob-p avatar songqing avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bbhash's Issues

Faster popcount

Looking at your popcount code in

BBHash/BooPHF.h

Lines 170 to 189 in 6bb97c4

inline unsigned int popcount_32(unsigned int x)
{
unsigned int m1 = 0x55555555;
unsigned int m2 = 0x33333333;
unsigned int m4 = 0x0f0f0f0f;
unsigned int h01 = 0x01010101;
x -= (x >> 1) & m1; /* put count of each 2 bits into those 2 bits */
x = (x & m2) + ((x >> 2) & m2); /* put count of each 4 bits in */
x = (x + (x >> 4)) & m4; /* put count of each 8 bits in partie droite 4bit piece*/
return (x * h01) >> 24; /* returns left 8 bits of x + (x<<8) + ... */
}
inline unsigned int popcount_64(uint64_t x)
{
unsigned int low = x & 0xffffffff ;
unsigned int high = ( x >> 32LL) & 0xffffffff ;
return (popcount_32(low) + popcount_32(high));
}

it's perhaps worthwhile noting that you could speed-up your popcounts by using the GCC __builtin_pocount() and __builtin_popcountl() intrinsics when available, instead. Which would be compiled into a single POPCNT instruction on some CPUs - e.g. on not too old Intel ones.

Collision with item not in list

Using the super-fast wyhash hash function (https://github.com/wangyi-fudan/wyhash), I'm seeing a collision between two integers, one of which is not in the list.

Specifically:

  • I compile-time hash strings to uint64_h using xxh64.
  • I then use these hashes to build up bbhash which has been created to use the wyhash function
  • If I query a specific hash that isn't in the table, it doesn't properly return UINT max

I've attached a minimum working example to this post. My output is as follows, which I hope helps illustrate the problem further.

$ g++ main.cpp -o out -lpthread && ./out

area = 9853546745770798894
Ninja21_V = 10499620994759670644
HelloWorld = 990340755478284348

wyand hash of area = 1514463996213185106
wyand hash of Ninja21_V = 135295144726059243

Looking up area which is not in the map: 0
Looking up Ninja21_V which is in the map: 0
Looking up HelloWorld which is not in the map: 18446744073709551615

bbhash-collision.zip

segfault with gcc 7.3

I tried to run the test benchmark, and ran into a segfault:

-> % ./Bootest 100000000 4 -check -bench            
key file generated 
Construct a BooPHF with  100000000 elements  
for info, total work write each  : 1.000    total work inram from level -1991486788 : -1991486787.000  total work raw : 25.000 
[Building BooPHF]  100  %   elapsed:   0 min 4  sec   remaining:   0 min 0  sec[1]    2339 segmentation fault (core dumped)  ./Bootest 100000000 4 -check -bench

I've traced the segfault here:

#0  0x00005555555656f5 in void boomphf::mphf<unsigned long, boomphf::SingleHashFunctor<unsigned long> >::pthread_processLevel<bfile_iterator>(std::vector<unsigned long, std::allocator<unsigned long> >&, std::shared_ptr<bfile_iterator>, std::shared_ptr<bfile_iterator>, int) ()
#1  0x0000555555565a9d in void* boomphf::thread_processLevel<unsigned long, boomphf::SingleHashFunctor<unsigned long>, file_binary, bfile_iterator>(void*) ()
#2  0x00007ffff7bbd6db in start_thread (arg=0x7ffff50bd700) at pthread_create.c:463
#3  0x00007ffff6fa788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This is my gcc version:

-> % g++ --version
g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0

Probably it's unrelated, but I also see a lot of warnings of this type:

BooPHF.h:1026:10: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘uint64_t {aka long unsigned int}’ [-Wformat=]
``

possible issue when the hashed type is e.g. std::string

I'm not sure if boophf allows to hash regular strings. Because at one point in the code, a buffer object does a malloc of sizeof(template hashed type). If the template type is a string, then sizeof(string) won't be enough to store the object.

So, someday, let's test Boophf with strings and not uint64_t.

non-de-duplicated input

I'm curious what drawbacks there are when not removing duplicates from BBhash's input.

I could save quite a bit of compute effort in my case if I can provide BBhash with duplicate keys, but I'd prefer not to do this if there are serious issues that might result.

Locality sensitive false positives

I would like to encourage false positives to occur for inputs that are near (in some comparative sense) members of the set that was used to build the PMHF.

Would the construction algorithm already encourage this to happen for some cases? I haven't tested but perhaps you have thought about this.

If not, what adjustment should I consider making to encourage locality sensitive false positives?

Relative tmp address

Hi @rizkg and others,

First of all, thank you for your awesome MPHF implementation. It's fast, and easy to use.

There is a need for generating some temporary files on disk during the construction. But the address used for temp files is relative to where one is running the code from. This happened to make our run fail a couple of times, while we were running the code from a disk that was full but giving the output address on another disk that had enough space (also, the binary was not in the same disk as the one we were running our program from). That took us a while to wonder what is causing this failure.

If not wrong, I think the part that one can change root address for temp files to where the final output MPHF is stored should be here: https://github.com/rizkg/BBHash/blob/master/BooPHF.h#L1377
We fixed this by allowing a prefix for the temporary directory to be passed to the constructor. But you as authors might have a better way to do it.

Either way, it seems it would be useful to allow, at least optionally, to override the current behavior.

Thanks,
Fatemeh

What does the index returned by mphf::lookup() mean?

Looking at the example code you might think that bhpf->lookup() returns an index that can be used to access the element in the original data. If that were so, code like

for (u_int64_t i = 0; i < nelem; i++){
  uint64_t idx = bphf->lookup(data[i]);
  assert(data[idx] == data[i]);
}

would never fail. But it does – on every element.

So, what exactly does the index returned by mphf::lookup() mean?

Core Dumped

Hi ,
In Branch " alltypes" running below gives core dump
./example_custom_hash_strings 1000000000 32
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Any Hint on this

Avoid copy constructor

First thanks for this great stuff!
Lookup can be 3x faster on strings with the following changes:
uint64_t lookup(const elem_t &elem)
uint64_t getLevel(hash_pair_t & bbhash,const elem_t &val,int * res_level, int maxlevel = 100, int minlevel =0)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.