Giter Club home page Giter Club logo

Comments (53)

ktprime avatar ktprime commented on August 24, 2024 2

I can reproduce it now in my test(gcc only) and test code https://github.com/ktprime/emhash/blob/master/bench/btest.cpp

//download my git emhash and run test with gcc 7.5
root@ubuntu:/emhash/bench g++ -O3 -march=native -mtune=native -I.. -I../thirdparty -std=c++17 btest.cpp -o btest
root@ubuntu:
/emhash/bench ./btest

tsl_robin_map:

Consecutive insert: 538 ms (s=0, size=2000000)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

Program received signal SIGABRT, Aborted.
0x00007ffff7446438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff7446438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff744803a in __GI_abort () at abort.c:89
#2 0x00007ffff7a8cdde in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff7a987a6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff7a98811 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff7a98a65 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff7a8ca7e in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000000000409f42 in tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::robin_hash(unsigned long, std::hash const&, std::equal_to const&, std::allocator<std::pair<unsigned long, unsigned int> > const&, float, float) ()
#8 0x000000000040e00c in tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::rehash_impl(unsigned long) ()
#9 0x000000000040e40a in std::pair<tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::robin_iterator, bool> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, unsigned int>, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::insert_impl<unsigned long, std::pair<unsigned long, unsigned int> >(unsigned long const&, std::pair<unsigned long, unsigned int>&&) ()
#10 0x000000000040e6b5 in void test_insert<tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> > >(tsl::robin_map<unsigned long, unsigned int, std::hash, std::equal_to, std::allocator<std::pair<unsigned long, unsigned int> >, false, tsl::rh::power_of_two_growth_policy<2ul> >&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >&) ()
#11 0x000000000040e7f6 in void test<tsl_robin_map>(char const*) ()
#12 0x000000000040198d in main ()

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thank you for the report.

From a first glance I have troubles to understand how m_load_threshold ends up with such a large value. The only place where the variable is set (outside of copy/move constructor/operator and swap) is in max_load_factor and the value there is bounded by m_bucket_count and the clamped between [0.1;0.95] m_max_load_factor. As the m_bucket_count has a normal value of 8 and the m_max_load_factor is 0.5, m_load_threshold should be equal to 4. How does it end up with such value? The copy/move constructor/operator and swap seem correct. Could there be a memory bug somewhere in the implementation that somehow overrides the memory where m_load_threshold is?

I'll try to investigate more when I have some time.

In the meantime could you eventually try the latest version and define the TSL_DEBUG variable with assert enabled (so NDEBUG not defined)? It'll enable the assertions in the project and we might catch something with them.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

I really can't find how m_load_threshold could overflow like that considering how the variable is assigned. Could some parts of the program that uses the library write to some wrong memory locations? Could you try to compile it with the AddressSanitizer or run it with valgrind?

I'll try to investigate more as it's a worrying bug. Please let me know if you can eventually reproduce the problem in a self-contained example. Thanks.

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

I have experienced the issue multiple times in the past which effectively prevents us from using this map implementation. I thought I filed an issue about this but looks like I haven't. It would be great if this is resolved.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thank you, I'll try to get a more in depth look then as it's really worrying if you also had the problem.

I have tried to insert the keys mentioned above in every way possible + tried with some extra ones and I can't reproduce the problem. If any of you eventually has a reproducible self-contained example it'd be immensely useful.

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

I can reproduce it, but it is practically impossible to share the code (its for out embeddable columnar store thing, that's not OSS). Certain SQL queries would cause this -- tried every sanitizer, nothing came up really (so almost certainly not a corruption). Had this issue with other programs (our application server, some other utilities etc) -- but I unfortunately can't share that codebase either. It is almost certainly not a matter of some memory corruption/buffer overrun issue though.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thank you very much for the useful infos. I'll look more into depth then as I currently can't understand how m_load_threshold could end up with a value of 18446744069414584320.

from robin-map.

szhang0119 avatar szhang0119 commented on August 24, 2024

In our tests, we also saw a different stuck stack.

In this case, m_bucket_count=0, and it stuck in a different while loop here.

Here is some gdb information:

(gdb) p *this
$2 = {<std::hash<unsigned long>> = {<std::__hash_base<unsigned long, unsigned long>> = {<No data fields>}, <No data fields>}, <std::equal_to<unsigned long>> = {<std::binary_function<unsigned long, unsigned long, bool>> = {<No data fields>}, <No data fields>}, <tsl::rh::power_of_two_growth_policy<2>> = {m_mask = 0}, static STORE_HASH = false, static USE_STORED_HASH_ON_LOOKUP = false, static DEFAULT_INIT_BUCKETS_SIZE = 0, 
  static DEFAULT_MAX_LOAD_FACTOR = 0.5, static MINIMUM_MAX_LOAD_FACTOR = 0.200000003, static MAXIMUM_MAX_LOAD_FACTOR = 0.949999988, 
  static DEFAULT_MIN_LOAD_FACTOR = 0, static MINIMUM_MIN_LOAD_FACTOR = 0, static MAXIMUM_MIN_LOAD_FACTOR = 0.150000006, 
  static REHASH_ON_HIGH_NB_PROBES__NPROBES = 128, static REHASH_ON_HIGH_NB_PROBES__MIN_LOAD_FACTOR = 0.150000006, 
  m_buckets_data = {<std::_Vector_base<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false>, std::allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> > >> = {
      _M_impl = {<std::allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> >> = {<__gnu_cxx::new_allocator<tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, false> >> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}, 
  m_buckets = 0x400a4d6e2010 <tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, MapGraph::Blocks::DataBlock*>, tsl::robin_map<unsigned long, MapGraph::Blocks::DataBlock*, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, MapGraph::Blocks::DataBlock*> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::KeySelect, tsl::robin_map<unsigned long, MapGraph::Blocks::DataBlock*, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, MapGraph::Blocks::DataBlock*> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, MapGraph::Blocks::DataBlock*> >, false, tsl::rh::power_of_two_growth_policy<2ul> >::static_empty_bucket_ptr()::empty_bucket>, m_bucket_count = 0, m_nb_elements = 0, 
  m_load_threshold = 0, m_max_load_factor = 0.5, m_grow_on_next_insert = false, m_min_load_factor = 0, m_try_skrink_on_next_insert = false}

Since the m_bucket_count is 0, the below information does not quite make sense. But I just want to show m_dist_from_ideal_bucket is also MAX_SHORT.

(gdb) p/x *reinterpret_cast<uint64_t*>(m_buckets[0].m_value.__data)
$4 = 0x2a000000000002
(gdb) p m_buckets[0]
$5 = {<tsl::detail_robin_hash::bucket_entry_hash<false>> = {<No data fields>}, static EMPTY_MARKER_DIST_FROM_IDEAL_BUCKET = -1, 
  m_dist_from_ideal_bucket = 32767, m_last_bucket = true, m_value = {__data = "\002\000\000\000\000\000*\000\000\000\000\000\000\000\000", 
    __align = {<No data fields>}}}

We also tried to upgrade to latest version (v1.0.1), but can still see the above stuck stacks in our tests.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

Hi Tessil; I am from same team with Song. For us the bug occurs rarely and there is always an infinite loop. We suspected that maybe some memory corruption on our side may mess robin map memory and cause teh infinite loop and we spent already two weeks investigating this. What is specific to our case is that in a highly multithreaded scenario each thread creates, inserts and destroys teh map. We do guard with mutexes the operations on map as the datastructure itself is not thread safe.

So now I wonder : do you in your tests also have this scenario where we create map/insert elements/delete map in a loop ; and maybe from lots of threads (each thread creates, inserts and destroy its own robin map). Maybe there is an issue on the delete part which I am thinking may not be heavily tested???

At this point we started replacing the robin map with STL unordered to see if we observe any hangs anymore :(

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

For what it's worth, in cases where it fails, we don't erase any KVs. It's all insertions. It occurs when inserting elements.

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

Also, we experimented with various many different hash maps implementations. None of them failed so it's perhaps safe to suggest that the issue's
specific to this implementation.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thanks all for the information.

I was not able to reproduce the problem but I looked carefully through the implementation. I couldn't find a problem in the insertion logic of the code but I noticed that the library can invoke some undefined behaviour with C++17.

P0137R1 introduced a change in the object model in C++17 which now requires to std::launder the reinterpreted pointer from std::aligned_storage. From my understanding and the discussion I have read, the change seems to be backward incompatible and could potentially make previously well-defined code become undefined. See the following discussion for some interesting details: https://stackoverflow.com/questions/47735657/does-reinterpret-casting-stdaligned-storage-to-t-without-stdlaunder-violat and https://groups.google.com/a/isocpp.org/g/std-discussion/c/xaAwFR6qUmY

I have fixed the problem with commit 4abcc97, could you try these latests changes? It may not be the reason of the bug you encounter but it would be tremendously useful for me to at least try as I can't reproduce the problem myself.

I also added some extra assertions so running the code with TSL_DEBUG defined and assertions enabled may also help to pinpoint the problem.

Sorry for the caused inconveniences.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

@markpapadakis Do you think you can try this patch and confirm that it fixes your problem. Now I am in the middle of replacing and benchmarking with an alternative. Also for some weird reason for us I need to let the server run with a constant load for a week or more before I notice a query stuck in infinite loop inside robin map.

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

@gabrieltanase42: I just tried this commit and still fails.

from robin-map.

thompsonbry avatar thompsonbry commented on August 24, 2024

I am with the same team as Gabi. We were not using c++17 when we observed the problem. We have only been able to reproduce this in multi-day tests of the integrated software stack. We lack a clean reproducer based on a simple set of inputs and some sequence of their presentation. We do use a lock to guard access to the map from multiple threads in the code where we are observing the problem. We have only observed this for relatively small maps with numerically ordered keys which should be presented more or less in order (there might be some races involved, but there is a guarding lock).

We have also been using the robin map for several years in hash joins and have never observed an issue with that code. However, unless there is something very odd about the case where the map is behind a lock, we have to assume that we can encounter the same issue when building a hash index for a join.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

So I reviewed all the code of the library to check any eventual problem and I couldn't find anything really suspicious.

One part that could be problematic in a pre-C++11 multi-threaded program is the static initialization in https://github.com/Tessil/robin-map/blob/master/include/tsl/robin_hash.h#L1590 But it's thread-safe in C++11 and forward and this static variable is never modified once initialized. There are no other non-const shared variables.

Another slight doubt I have is the const_cast in https://github.com/Tessil/robin-map/blob/master/include/tsl/robin_hash.h#L1108 but as pos.m_bucket should always point to the non-const m_buckets_data/m_buckets or the end of it, it should be perfectly safe to cast-away the constness of it in non-const methods. One alternative would be to use the slower return iterator(m_buckets + (pos.m_bucket - m_buckets));.

I have not been able to reproduce the bug in a non-threaded program. I'll try to run some tests in a multi-threaded one as it seems it's a common premise both @szhang0119 and @markpapadakis have and maybe the static_empty_bucket_ptr has something to do with it.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

It'd be good to eventually test the map by replacing https://github.com/Tessil/robin-map/blob/master/include/tsl/robin_hash.h#L1591 with static thread_local bucket_entry empty_bucket(true);. Though if it works I would be a bit puzzled as the library only do concurrent reads from empty_bucket (and if there was a write due to a bug, it should also show weird behaviour in a non-threaded program).

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Tried to reproduce the bug in a multi-threaded test program that do insertions, erasures and maps creations but no luck unfortunately

from robin-map.

hunjmes avatar hunjmes commented on August 24, 2024

I work on the same team as Gabriel-- it's interesting to me (although I don't know what it means) that in the second bug report, from 13 days ago, the static empty_bucket has its m_dist_from_ideal_bucket set to 32767-- no one should ever be modifying it! Also, in the initial bug report, from 27 days ago, all 8 buckets have their m_dist_from_ideal_bucket set to 32767.

Since the comparison involving m_dist_from_ideal_bucket is a <= involving signed 16-bit integers, the comparison will always succeed, which explains the infinite loop. But it's not much of an explanation, since:

  • In the 2nd case, the static empty bucket should never be modified, and its m_dist_from_ideal_bucket should always be -1. Somehow it went from -1 to 32767...
  • In the 1st case, while it is true that the maximum possible value for m_dist_from_ideal_bucket is 32767, it's also the case that trying to access a bucket that has that maximum value would seem to lead to an infinite loop. So, how did the 1st case end up with four(!) "infinite-loop" buckets, without hanging on, say, the first?

I don't have answers to any of these questions, unfortunately--

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Yes, it's really strange. m_dist_from_ideal_bucket should also never reach a value of 32767 as once m_dist_from_ideal_bucket reaches DIST_FROM_IDEAL_BUCKET_LIMIT=4096 (which should only happen with a really, really bad hash), a rehash to grow the map is forced at next insertion.

How did it end up to reach a value of 32767 with only 8 buckets (and thus a max distance of 7) and how was static_empty_bucket_ptr ever modified to not be empty anymore (I double-checked the code and can't see how it would happen outside of some undefined behaviour)? I'll try to re-read the code again when I have some time as I must be missing something. I never had such behaviour when using the library.

To be sure, are all multi-threaded accesses and writes protected by a mutex? Even move, swap, copy, creation?

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

IIRC it always failed in MT programs for us and it's always about maps protected with either an std::mutex or an std::shared_mutex. As I mentioned in a previous comment, no other hash map implementation I tried failed.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

So I tried to reproduce the bug with the following program that has multiple threads inserting, erasing, reading random values in multiple maps in parallel to try to reproduce the problem but to no avail. Also tried to use sequential values instead of random ones but still working fine. Both with clang 13.0 and g++ 11.3 with -O3.

I you could eventually try to reproduce the problem with TSL_DEBUG defined as well as assertions enabled, it may help to pinpoint the problem if one of the asserts fail.

#include <tsl/robin_map.h>

#include <cstdlib>
#include <iostream>
#include <mutex>
#include <random>
#include <shared_mutex>
#include <thread>

const std::size_t nb_iterations = 100000000000;
const std::size_t nb_maps = 4;
std::vector<std::shared_mutex> mutexes(nb_maps);
std::vector<tsl::robin_map<std::uint64_t, int>> maps(nb_maps);
std::mutex stdout_mutex;
thread_local std::random_device rd;
thread_local std::mt19937 gen(rd());

void test() {
  std::uniform_int_distribution<std::size_t> map_choice(0, 3);
  std::uniform_int_distribution<std::uint64_t> value_gen(0, 10000000);
  std::uniform_int_distribution<int> should_insert(0, 3);
  std::uniform_int_distribution<int> should_erase(0, 7);
  std::uniform_int_distribution<int> should_reset(0, 10000000);

  std::uint64_t total = 0;
  for (std::size_t i = 0; i < nb_iterations; i++) {
    const std::uint64_t insert_val = value_gen(gen);
    const std::uint64_t erase_val = value_gen(gen);
    const std::uint64_t read_val = value_gen(gen);
    const bool insert = should_insert(gen) == 0;
    const bool erase = should_erase(gen) == 0;
    const bool reset = should_reset(gen) == 0;

    if (insert) {
      const std::size_t imap = map_choice(gen);
      std::unique_lock<std::shared_mutex> lock(mutexes[imap]);
      maps[imap].insert({insert_val, 1});
    }

    if (erase) {
      const std::size_t imap = map_choice(gen);
      std::unique_lock<std::shared_mutex> lock(mutexes[imap]);
      maps[imap].erase(erase_val);
    }

    {
      const std::size_t imap = map_choice(gen);
      std::shared_lock<std::shared_mutex> lock(mutexes[imap]);
      auto it = maps[imap].find(read_val);
      if (it != maps[imap].end()) {
        total += 1;
      }
    }

    if (reset) {
      const std::size_t imap = map_choice(gen);
      std::unique_lock<std::shared_mutex> lock(mutexes[imap]);
      tsl::robin_map<std::uint64_t, int> empty;
      maps[imap].swap(empty);
      total = 0;
    }

    if (i % 10000000ull == 0) {
      std::unique_lock<std::mutex> lock(stdout_mutex);
      std::cout << i << ": " << total << std::endl;
    }
  }
}

int main(int, char**) {
  std::size_t nthreads = 16;
  std::vector<std::thread> threads(nthreads);

  for (std::size_t i = 0; i < threads.size(); i++) {
    threads[i] = std::thread(test);
  }

  for (std::size_t i = 0; i < threads.size(); i++) {
    threads[i].join();
  }
}

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

I also tried to reproduce with something simple but no luck. Here is my test in case it may help somebody:

╰─○ cat robin.cc
#include <cstdint>
#include <iostream>
#include <string>
#include <tsl/robin_map.h>
#include <tsl/robin_set.h>

#include <thread>
#include <mutex>
#include <unistd.h>

using namespace std;

class MyMap {
    int count;
    tsl::robin_map<int, char*> *data;
    std::mutex m;
    
public:
    MyMap(){
        data = new tsl::robin_map<int, char*>();
    }
    
    void rotate(){
        m.lock();

        auto new_data = new tsl::robin_map<int, char*>();
        
        auto mit = data->begin();
        while (mit != data->end()){
            delete mit.value();
            ++mit;
        }
        delete data;

        data = new_data;
        
        m.unlock();
        
    }

    void op() {
        m.lock();
        int sz = rand()%16384+8;
        char* bytes = (char*) malloc(sz);
        for (int i=0;i<sz;++i)
            bytes[i] = i%256;
        (*data)[count++] = bytes;
        m.unlock();

        m.lock();
        if (data->size()>10){
            auto mit = data->begin();
            delete mit.value();
            data->erase(mit.key());
        }
        m.unlock();
    }

};


void th_func(int id, MyMap* map, vector<size_t>* counts){
    cout<<"Thread :"<<id<<"\n";
    size_t opid = 0;
    do {
        map->op();
        counts->at(id)++;
    } while ( true);
}

int main() 
{
    const int P = 128;
    
    vector<std::thread> vt(P);
    vector<size_t> counts(P,0);
    vector<size_t> oldcounts(P,0);
        
    MyMap* mdata = new MyMap();
    
    for (int i=0;i<P;++i){
        vt[i] = std::thread(th_func, i, mdata, &counts);
    }

    int s = 0;
    do{
        sleep(1);
        cout<<s++<<":";
        for (int i=0;i<P;++i){
            //cout << counts[i]<<" ";
            if (counts[i] - oldcounts[i] > 100000 && s>2){
                cout<<"Error:: count difference too big on slot"<<i<<":::"<<"\n";       
            }
        }
        cout<<"\n";
        oldcounts = counts;

        if (s%10==0){
            mdata->rotate();
        }
    }while (true);
    
    for (int i=0;i<P;++i){
        vt[i].join();
    }
}

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

So I tried to reproduce it again with both scripts, ran it for 24 hours with random values and a few hours with sequential values using 32 threads and couldn't reproduce the problem.

I re-read the code again and can't find where it could come from outside of a potential race condition or memory corruption.

Does the bug happen after moving, swapping, copying a map, serialization/deserialization or anything like that?

If anyone encounters the problem, has more information or a way to reproduce, please let me know. One way to help would be to enable assertions and define TSL_DEBUG with the latests changes that add some assertions.

For now I marked the issue as cannot reproduce as there isn't much more I can do.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

The error seems unrelated (not an infinite loop) and more like a lack of memory when rehashing (hence the std::bad_alloc exception). How much RAM do you have?

from robin-map.

ktprime avatar ktprime commented on August 24, 2024

32GB server. it's a bad hash function cause many rehash ?

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

It could be due to the indices3 that are powers-of-two + the std::hash identity function that causes too much collisions and end-up to make the distance variable becomes too large forcing a rehash. See comment in the README:

The first two are faster and use a power of two growth policy, the last two use a prime growth policy instead and are able to cope better with a poor hash function. Use the prime version if there is a chance of repeating patterns in the lower bits of your hash (e.g. you are storing pointers with an identity hash function). See GrowthPolicy for details.

You can thus try with another GrowthPolicy of hash function to avoid the problem of using power-of-two hashes with tsl::rh::power_of_two_growth_policy.

from robin-map.

markpapadakis avatar markpapadakis commented on August 24, 2024

@Tessil I just ran again this service/program that fails consistently (as described in other comments). I switched from a CitHash variant to xxHash, and even tried a differ GP, but it still fails (effectively, rehashes indefinitely, until it runs out of memory):

        template <typename KT, typename VT>
        using agg_fast_map = tsl::robin_map<KT, VT
                , std::hash<KT>
                , std::equal_to<KT>
                , std::allocator<std::pair<KT, VT>>
                , not std::is_arithmetic_v<KT>
                , tsl::rh::mod_growth_policy<>>;

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thanks.

So I checked and added a try/catch on the std::bad_alloc to check the distribution of the values in the buckets (save the bucket for each value in a file and analyze it after).

With the power-of-two growth there's a bucket that has to manage 456(!) collisions. With a prime growth, the maximum is 57 collisions for one bucket.

The problem is that the robin-hood collision resolution keeps track of the distance from the ideal bucket in a variable which has a limit in size and can't store unlimited distances. So when the distance reaches DIST_FROM_IDEAL_BUCKET_LIMIT (4096) (meaning that we have a collision chain of 4096 values!) the hash map is rehashed to a larger size. There isn't much that can be done and such behaviour is synonymous of a really bad hash function.

I'll try with CityHash and xxHash, not sure how they behave with integers.

Note though that from my understanding this problem has nothing to do with the infinite loop issue and is a kind of expected behaviour with a bad hash function (I checked the cause, and it's effectively because the dist_from_ideal_bucket reaches 4096 and not a value like 32767). Or @markpapadakis and @gabrieltanase42 do you also have such long collisions in your programs that cause rehashes?

from robin-map.

yhding456 avatar yhding456 commented on August 24, 2024

Hi @Tessil , I can recreate the OOM problem using the following test case.

#include
#include <unordered_map>
#include "robin_map.h"

struct KeyEQ {
bool operator()(const std::pair<uint32_t, uint32_t>&r1, const std::pair<uint32_t, uint32_t> &r2) const {
if (r1.first == r2.first && r1.second == r2.second) return true;
else return false;
}
};

struct KeyHasher {
std::size_t operator()(const std::pair<uint32_t, uint32_t> &key) const {
using std::hash;
using std::size_t;
size_t res = 17;

res = 31*res + std::hash<int32_t>()(key.first);
res = 31*res + std::hash<int32_t>()(key.second);

return res;

}
};

struct KeyHasher2 {
std::size_t operator()(const std::pair<uint32_t, uint32_t> &key) const {
using std::hash;
using std::size_t;
size_t res = 17;

res = res + std::hash<int32_t>()(key.first);
res = res + std::hash<int32_t>()(key.second);
res = res * 2654435761;

return res;

}
};

//tsl::robin_map<std::pair<uint32_t, uint32_t>, uint32_t, KeyHasher, KeyEQ> robin_map9; // OOM occurs!!!
tsl::robin_map<std::pair<uint32_t, uint32_t>, uint32_t, KeyHasher2, KeyEQ> robin_map9;
//std::unordered_map<std::pair<uint32_t, uint32_t>, uint32_t, KeyHasher, KeyEQ> hash_map9;
std::unordered_map<std::pair<uint32_t, uint32_t>, uint32_t, KeyHasher2, KeyEQ> hash_map9;

static void testCollision()
{
size_t hash_total0 = 0;
size_t hash_max_collisions0 = 0;
size_t robin_total0 = 0;
size_t robin_max_collisions0 = 0;
hash_map9.clear();
robin_map9.clear();
for (unsigned int i = 2450816; i < 2450816 + 1825; i++) {
for (unsigned int j = 1; j < 76; j++) {
std::pair<uint32_t, uint32_t> p = {i, j};
if (hash_map9.find(p) != hash_map9.end()) {
hash_map9[p]++;
hash_total0++;
if (hash_map9[p] > hash_max_collisions0) hash_max_collisions0 = hash_map9[p];
} else {
hash_map9.insert({p, 1});
}
if (robin_map9.find(p) != robin_map9.end()) {
robin_map9[p]++;
robin_total0++;
if (robin_map9[p] > robin_max_collisions0) robin_max_collisions0 = robin_map9[p];
} else {
robin_map9.insert({p, 1});
}
}
}
std::cout << "hash_map9 without applying any MASK!!! Total collisions: " << hash_total0 << ", Max bucket collisions: " << hash_max_collisions0 << "\n";
std::cout << "robin_map9 without applying any MASK!!! Total collisions: " << robin_total0 << ", Max bucket collisions: " << robin_max_collisions0 << "\n";
std::cout << "std::unordered_map table size: " << hash_map9.size() << ", max_size: " << hash_map9.max_size() << ", bucket count: " << hash_map9.bucket_count() << ", max_bucket_count: " << hash_map9.max_bucket_count() << ", load_factor: " << hash_map9.load_factor() << ", max_load_factor: " << hash_map9.max_load_factor() << "\n";
std::cout << "robin_map9 table size: " << robin_map9.size() << ", max_size: " << robin_map9.max_size() << ", bucket count: " << robin_map9.bucket_count() << ", max_bucket_count: " << robin_map9.max_bucket_count() << ", load_factor: " << robin_map9.load_factor() << ", max_load_factor: " << robin_map9.max_load_factor() << "\n";

return;

}

int main()
{
testCollision();
}

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

I'll close the issue as I wasn't able to reproduce the original error after extensive testing.

For the std::bad_alloc errors, the issue is different as it's an (unfortunately) excepted behaviour when a bad hash function is used and the number of collisions for one bucket is more than DIST_FROM_IDEAL_BUCKET_LIMIT (raised it to 8192). If it occurs, please change your hash function as this would cause a massive slow-down.

from robin-map.

thompsonbry avatar thompsonbry commented on August 24, 2024

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

I have tested this extensively as the original bug worried me but I have not be able to reproduce the original bug at all. I will be happy to reopen the issue if there is any new information.

Changing the hash function is mainly for the the second problem mentioned in the issue which isn't related to the original one. Each bucket stores its distance from the original bucket in an int16_t integer which means that there is a hard limit from which a bucket can be from its original bucket (less than std::numeric_limits<distance_type>::max() though in practice I limit it to DIST_FROM_IDEAL_BUCKET_LIMIT=8192). If we ever reach this limit, the hash map is grown to the next size. This can lead to std::bad_alloc exceptions if the values keeps being assigned to the same buckets after the resize. In this case, a better hash function is needed (a good hash function would not generate a 8192 collisions chain).

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

Tessil, today we were able to create a test that can reproduce infinite loop in a deterministic way. Like every time we run a benchmark the program hangs (spinning in the loop below). We flipped again from robin map to std map and things work perfectly. In this situation however the robin map has a relatively large number of elements. I am looking now at the previous comments and I see that you mention some potential issues and that the user should try different hash functions etc. Probably we could try these but in my mind we could still hit this unfortunate situation where code spins forever in this method:

`template
const_iterator find_impl(const K& key, std::size_t hash) const {
std::size_t ibucket = bucket_for_hash(hash);
distance_type dist_from_ideal_bucket = 0;

while (dist_from_ideal_bucket <=
       m_buckets[ibucket].dist_from_ideal_bucket()) {
  if (TSL_RH_LIKELY(
          (!USE_STORED_HASH_ON_LOOKUP ||
           m_buckets[ibucket].bucket_hash_equal(hash)) &&
          compare_keys(KeySelect()(m_buckets[ibucket].value()), key))) {
    return const_iterator(m_buckets + ibucket);
  }

  ibucket = next_bucket(ibucket);
  dist_from_ideal_bucket++;
}

return cend();

}`

I asked permission with my team and if you have time one of these days we could have an interactive session where we can look at the code inside debugger.

In this case the map is from uint64_t to bool (tsl::robin_map<gui_t, bool> ); The Hash function is the default one;
When attaching the debugger I could see dist_from_ideal_bucket being negative;
Today I'll spend more time with it but I wanted to post this first to maybe reopen the ticket and see if we could meet virtually to look at the code briefly.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

So after I enable asserts , the program runs for like 30 mins inside that loop and it asserts the following:

TestSuites: /home/igtanase/brazil-pkg-cache/packages/TessilRobinMap/TessilRobinMap-1.0.x.344.0/AL2_x86_64/DEV.STD.PTHREAD/build/include/tsl/robin_hash.h:251: tsl::detail_robin_hash::bucket_entry<ValueType, StoreHash>::value_type& tsl::detail_robin_hash::bucket_entry<ValueTyp
e, StoreHash>::value() [with ValueType = std::pair<long unsigned int, bool>; bool StoreHash = false; tsl::detail_robin_hash::bucket_entry<ValueType, StoreHash>::value_type = std::pair<long unsigned int, bool>]: Assertion `!empty()' failed.
[1/1] P8DataFlowTest.BFSDFGBenchmarkLiveJournal returned/aborted with exit code -6 (132077 ms)

That corresponds to the following function:

value_type& value() noexcept { tsl_rh_assert(!empty()); return *reinterpret_cast<value_type*>(std::addressof(m_value)); }

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

The bad hash issue is different and is unrelated to the problem you mention. It is when the collision chain keeps being higher than DIST_FROM_IDEAL_BUCKET_LIMIT=8192 due to a poor hash function as I mentioned in my previous comment and would not cause an infinite loop but a growth of the map which would lead to a std::bad_alloc during insertion if the collision chain keeps happening. Changing the hash is effectively not a solution in your case and the problem is somewhere else.

It seems dist_from_ideal_bucket variable is overflowing if it becomes negative which is quite strange as the maximum value of m_buckets[ibucket].dist_from_ideal_bucket() should be DIST_FROM_IDEAL_BUCKET_LIMIT=8192 (or 4096 before) which is lower than std::numeric_limits<int16_t>::max(). The loop should thus terminate once dist_from_ideal_bucket goes over DIST_FROM_IDEAL_BUCKET_LIMIT in any case. Could you check if any bucket has a value larger than DIST_FROM_IDEAL_BUCKET_LIMIT? Or lower than -1?

I asked permission with my team and if you have time one of these days we could have an interactive session where we can look at the code inside debugger.

Sorry, I would prefer to avoid looking into closed-source code for legal reasons.

Could you try to compile the code with clang address and undefined sanitizers (and eventually thread and memory ones)? Or just run it through Valgrind for a first pass?

What kind of operations are done on the map in the benchmark? Is the map moved or copied? Is it eventually serialized/deserialized? As I see a mention of PTHREAD I suppose multi-threading is used, as the map doesn't support multi-threading, are the locks all in place?

I reopened the ticket.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

I'll keep digging more. Here are few more details. We do have the code split in multiple .so libs. The robin_map is allocated in one .so and it is passed as argument to a function that is implemented in a a different .so.

The reason I mentioned this is because at one point I tried to replace robin with abseil hash and that effort failed because of this exact reason ( abseil/abseil-cpp#1193 abseil/abseil-cpp#834)

With regards to multithreading. In this scenario is much simpler. The map is accessed by one thread only. I am doing two operations. Find and insert (key, value).

I did run with address sanitizer flag enabled. did'n print out anything.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thanks for the info.

Do you know the stacktrace of the failing assert in value()? What operation was done at that moment?

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

Do you think you can prepare a print_info function for me that I can invoke inside that tsl_rh_assert() ?
Or point me to an existing one if there ?
Maybe if we print the state of all critical variables we may get a clue....

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

The problem is that some of the critical infos may be in the m_buckets array which from what I understand is quite large.

As the bug occurs with mono-thread insertions and finds only (no deletions, map copy/move/swap, serialization or other) and is deterministic, would it eventually be possible to have a compressed text file with all the inserted gui_t in order of insertion or is there anything private in them? I could then try to reproduce the bug on my side.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

I just realized we are also using clear() to erase the map; Essentially is a BFS algorithm. And we are using two maps; One is a set of input vertices. During the kernel in question we traverse this input map and we compute an output map with new vertices. We need the map to make sure we won;t visit a vertex twice.

After the kernel is over, the output map becomes the new input map and the old input map is erased; then we invoke the kernel again;

stack:: the version of the code is 1.0.1 but it is prior to you adding those std::launder

* frame #0: 0x00007f757c5e7d71 libmapgraph-p8-operators.so`tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, bool>, false>::value(this=0x00007f73e04c8000) at robin_hash.h:258
    frame #1: 0x00007f757c5ec0fe libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<true> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find_impl<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88, hash=11529781552120594432) const at robin_hash.h:1171
    frame #2: 0x00007f757c5eb475 libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<true> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88, hash=11529781552120594432) const at robin_hash.h:1014
    frame #3: 0x00007f757c5ea83d libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<false> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find_impl<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88, hash=11529781552120594432) at robin_hash.h:1160
    frame #4: 0x00007f757c5e997f libmapgraph-p8-operators.so`tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::robin_iterator<false> tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::ValueSelect, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find<unsigned long>(this=0x00007f7561e3b280, key=0x00007f7561e26f88) at robin_hash.h:999
    frame #5: 0x00007f757c5e88f1 libmapgraph-p8-operators.so`tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>::find(this=0x00007f7561e3b280, key=0x00007f7561e26f88) at robin_map.h:488
    frame #6: 0x00007f757c5e763d libmapgraph-p8-operators.so`unsigned long MapGraph::p8::spmspvAdjacencyOnlyKernelMapInput<(MapGraph::p8::GraphDataType)1, bool, (MapGraph::Execution::SemiRingAddition)1, (MapGraph::Execution::SemiRingMultiplication)6, true, false, false>(REL_E=0x00007f7561e27810, inputWeightMap=0x00007f7561e3b140, validVertexMap=0x00007f7575148e40, cc=0x00007f7561e27708, edgep=11529781176126406656, outputMap=0x00007f7561e3b280)1>&, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>&, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false, tsl::rh::power_of_two_growth_policy<2ul>>&, MapGraph::p8::Catalog::CatalogConfig&, unsigned long, tsl::robin_map<unsigned long, bool, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long, bool>>, false

I also print the state of the hash map:

(lldb) p *this
(tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::ValueSelect, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >) $8 = {
  tsl::rh::power_of_two_growth_policy<8> = (m_mask = 65535)
  m_buckets_data = size=0 {}
  m_buckets = 0x00007f73e0408000
  m_bucket_count = 65536
  m_nb_elements = 32768
  m_load_threshold = 32768
  m_min_load_factor = 0
  m_max_load_factor = 0.5
  m_grow_on_next_insert = false
  m_try_shrink_on_next_insert = false

So most likely the insert before this find reached the 32768 size (power of two) and the resizing may mess up the internals;

Here are the variables in the last find_impl:

(lldb) p dist_from_ideal_bucket
(tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::ValueSelect, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::distance_type) $9 = -32768
(lldb) p ibucket
(std::size_t) $10 = 32768
(lldb) p m_buckets[ibucket].dist_from_ideal_bucket()
(tsl::detail_robin_hash::bucket_entry<std::pair<unsigned long, bool>, true>::distance_type) $11 = -1

(lldb) p m_buckets[ibucket]
(tsl::detail_robin_hash::robin_hash<std::pair<unsigned long, bool>, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::KeySelect, tsl::robin_map<unsigned long, bool, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::ValueSelect, std::hash<>, std::equal_to<unsigned long>, std::allocator<>, true, tsl::rh::power_of_two_growth_policy<8> >::bucket_entry) $12 = {
  m_dist_from_ideal_bucket = -1
  m_last_bucket = false
  m_value = (__data = "", __align = std::aligned_storage<8, 8>::type::(unnamed struct) @ 0x0000000006f396c8)
}

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thank you for all the info.

The m_buckets_data of size 0 with a m_bucket_count of 65536 is really worrying. The m_buckets_data should be of size m_bucket_count and m_buckets should point to m_buckets_data.data() if m_bucket_count > 0 (otherwise static_empty_bucket_ptr()).

I'll take a deeper look tomorrow.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

I think I got the trace of operations to get it to expose the issue. you can replay the trace with something like this:


#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>

// test code

            using namespace std;
            using namespace boost;

            std::string inputPath = "log";
            ifstream in(inputPath.c_str());
            ASSERT_TRUE(in.is_open());

            tsl::robin_map<gui_t, bool> rmap;
            typedef tsl::robin_map<gui_t, bool>::value_type pair_type;

            typedef boost::tokenizer<boost::char_separator<char> > Tokenizer;
            vector< string > vec;
            string line;

            while (getline(in,line))
            {

                if (rmap.size() == 32767){
                   // this here is so you can attach a debugger ; if not need comment out this if
                   // also replace LOG(ERROR) with cout
                    LOG(ERROR)<<"Getting close::"<<rmap.size()<<"\n";
                    int i=0;
                    while (i == 0){
                        usleep(2);
                    }
                }
                boost::char_separator<char> sep{"#"};
                Tokenizer tok(line,sep);
                vec.assign(tok.begin(),tok.end());
                // vector now contains strings from one row, output to cout here
                for (auto& s:vec){
                    LOG(ERROR)<<s<<"#";
                }
                LOG(ERROR) << "\n----------------------" << endl;
                if (vec[1]=="clear") {
                    rmap.clear();
                }
                if (vec[1]=="find") {
                    uint64_t num = lexical_cast<uint64_t>(vec[2]);
                    rmap.find(num);
                }
                if (vec[1]=="insert") {
                    uint64_t num = lexical_cast<uint64_t>(vec[2]);
                    rmap.insert(pair_type(num,true));
                }
                if (vec[1]=="reassign") {
                    uint64_t num = lexical_cast<uint64_t>(vec[2]);
                    rmap[num] = true;
                }

            }

you need to use # as separator. Now the values in each line are like this:

E1228 22:53:38.103422 37107 spmspvKernels.h:276] %%%0x7fa0c523b280#insert#11529778670445002752#1#1

the name of the operation (clear, find, insert, reassign), followed by Key (uint64_t) followed by value (always 1 in this case) followed by the size of the map before the operation is performed. There is some info before the operation name but ignore that;

You can unzip : gzip -d log.gz
log.gz

Tomorrow I'll try to reproduce this outside our framework, with the latest code release;

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

Standalone test also hangs so we have a reproducer :)
I just did a git pull in my local copy of robin map:

commit 6775231bbf11c4a78ac2d25b8ae5791da8b35422 (HEAD -> master, origin/master, origin/HEAD)
Author: Thibaut Goetghebuer-Planchon <[email protected]>
Date:   Sun Dec 18 18:18:01 2022 +0000

    Disable CMake install rule if robin_map is used as subproject (#60)
~ cat bug.cc
// compile: g++ -o exe -g -O3 -std=c++17 -I/home/igtanase/brazil-pkg-cache/packages/Boost/Boost-3.0.394338.0/AL2_x86_64/DEV.STD.PTHREAD/build/include/ -Irobin-map/include bug.cc

#include <cstdint>
#include <iostream>
#include <fstream>
#include <string>
#include <tsl/robin_map.h>
#include <tsl/robin_set.h>

#include <thread>
#include <mutex>
#include <unistd.h>


#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>

using namespace std;

typedef uint64_t gui_t;

int main() 
{
    using namespace boost;

    std::string inputPath = "log";
    ifstream in(inputPath.c_str());
    assert(in.is_open());
    
    tsl::robin_map<gui_t, bool> rmap;
    typedef tsl::robin_map<gui_t, bool>::value_type pair_type;
    
    typedef boost::tokenizer<boost::char_separator<char> > Tokenizer;
    vector< string > vec;
    string line;
    
    while (getline(in,line)) {
            
        if (rmap.size() == 32767){
            // this here is so you can attach a debugger ; if not need comment out this if
            cout<<"Getting close::"<<rmap.size()<<"\n";
            //int i=0;
            //while (i == 0){
            //    usleep(2);
            // }
        }
        boost::char_separator<char> sep{"#"};
        Tokenizer tok(line,sep);
        vec.assign(tok.begin(),tok.end());
        // vector now contains strings from one row, output to cout here
        for (auto& s:vec){
            cout<<s<<"#";
        }
        cout << "\n----------------------" << endl;
        if (vec[1]=="clear") {
            rmap.clear();
        }
        if (vec[1]=="find") {
            uint64_t num = lexical_cast<uint64_t>(vec[2]);
            rmap.find(num);
        }
        if (vec[1]=="insert") {
            uint64_t num = lexical_cast<uint64_t>(vec[2]);
            rmap.insert(pair_type(num,true));
        }
        if (vec[1]=="reassign") {
            uint64_t num = lexical_cast<uint64_t>(vec[2]);
            rmap[num] = true;
        }
        
    }
}

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

Thank you very much for the example and your help. I was able to successfully reproduce the bug and there was effectively a problem in the code, I'm really sorry for that.

I create a PR for the fix if you could try it out. I still have to add some tests before merging it.

The problem was in the end due to a high collisions chain due to a poor hash function which the code didn't manage properly. I did check for potential overflows of dist_from_ideal_bucket during the insertion but only during the robin swap while it should also be done beforehand to make sure that we don't store a m_dist_from_ideal_bucket that overflows in the bucket_entry.

Could you check if the PR fixes your issue?

I would recommend though to change the hash function as unfortunately GCC and clang use an identity function for integer types for std::hash and this can lead to poor distribution with a power of two map size. With the current hash you may still end up with std::bad_alloc due to excessive rehash growth.

I will try to create a tsl::hash project when I have more time to replace it as default hash function of the tsl libs and have a better default hash for integer types with a fallback to std::hash for other types. Eventually a stateful one that would change between rehashes.

Thanks again for your help and patience.

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

Thx for your help with this. I'll definitely check your fix in a minute. I guess our luck (christmass present) is that we found a trace to reproduce it. Now I am hoping that this fix also fixes other people's issues. @markpapadakis can you also give it a try with the new branch that @Tessil mentions (fix_issue_52).

@Tessil we do use the hash map in multiple scenarios and we do have our own hash functions for different situations. Our first thoutgh was also to change the hash function when we bumped into this latest bug but that would have only masked the problem till the next crash.

Here is a link to a hashing function that will work better than std::hash

https://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm

public int hash32shiftmult(int key)
{
  int c2=0x27d4eb2d; // a prime or an odd constant
  key = (key ^ 61) ^ (key >>> 16);
  key = key + (key << 3);
  key = key ^ (key >>> 4);
  key = key * c2;
  key = key ^ (key >>> 15);
  return key;
}

public long hash64shift(long key)
{
  key = (~key) + (key << 21); // key = (key << 21) - key - 1;
  key = key ^ (key >>> 24);
  key = (key + (key << 3)) + (key << 8); // key * 265
  key = key ^ (key >>> 14);
  key = (key + (key << 2)) + (key << 4); // key * 21
  key = key ^ (key >>> 28);
  key = key + (key << 31);
  return key;
}

from robin-map.

gabrieltanase42 avatar gabrieltanase42 commented on August 24, 2024

Some initial testing shows that the code is not crashing anymore. Next week I'll run more in depth tests

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

I guess our luck (christmass present) is that we found a trace to reproduce it.

Yes, that was really useful. I did consider the dist_from_ideal_bucket reaching its size limit in case of very high collisions during the design but I didn't consider this case. Changing the hash function would effectively have masked the bug and I'm glad we found it as it was a worrying bug.

For the robin-map v2 I'm eventually considering to just saturate the distance to std::numeric_limits<distance_type>::max() and take that into account during the find where we would have to continue looking into the collision chain if the limit is reached. Something like:

distance_type next_distance(distance_type distance) const noexcept {
    // To optimize
    if (distance == std::numeric_limits<distance_type>::max()) {
        return distance;
    }

    return distance + 1;
}

template <class K>
const_iterator find_impl(const K& key, std::size_t hash) const {
    ...
    while (dist_from_ideal_bucket <= m_buckets[ibucket].dist_from_ideal_bucket()) {
        if (compare_keys(KeySelect()(m_buckets[ibucket].value()), key))) {
            return const_iterator(m_buckets + ibucket);
        }

        ibucket = next_bucket(ibucket);
        dist_from_ideal_bucket = next_distance(dist_from_ideal_bucket);
    }
    ...
}

During the insertion and erasure robin swap, we would have to pay attention to use the non-saturated real distance (which would mean potentially rehashing some values) instead of using the stored one.

It would avoid the excessive resizes in case of catastrophical collision chains but performance will probably be reduced a bit. Need to check the performance impact and if it's worth it for such a corner case (if this case occurs, it means the hash function is quite poor).

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

I merged the PR. Don't hesitate to let me know if you still encounter some problems even after this fix.

I will create a new release in the next few days and close the PR if everything is working right.

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

I created the new release with the bugfix. I strongly recommend to update to this new version.

I will close the issue for now, let me know if the bug still occurs, we can then reopen it.

Note that you may still end up with std::bad_alloc due to aggressive rehashes if the hash function is very poor and create long collision chains (> 8192). Solving this would require some design change (see my #52 (comment)) which may decrease some of the performance which I'm not keen to affect for such extreme bad hash function case. I will still do some tests and eventually provide a 2.0 release if the performances are adequate.

from robin-map.

yhding456 avatar yhding456 commented on August 24, 2024

from robin-map.

Tessil avatar Tessil commented on August 24, 2024

See #52 (comment) comment. To avoid overflowing the dist_from_ideal_bucket, the map is rehashed when it reaches the DIST_FROM_IDEAL_BUCKET_LIMIT limit (8192).

A multiplicative hash is not recommended for open-addressing hash as it may lead to a very high number of collisions depending on the input distribution. The KeyHasher you use will only output odd hash for any even input.

Changing the hash function is the best course of action here, as even if I implement #52 (comment) you'll have terrible performances with the current hash function. If you have a dist_from_ideal_bucket of 8192 for example (or more) for one bucket, it means that the map has to go through 8192 buckets before finding the value in this bucket.

from robin-map.

yhding456 avatar yhding456 commented on August 24, 2024

from robin-map.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.