
Comments (6)

zhaoweijie12 commented on June 30, 2024

Please check the following bithash.h. You may need to use hash2kv in the data reader.

#pragma once

#include <cstdint>
#include <random>
#include <utility>
#include <vector>

typedef float bithash_t;

class BitHash{
public:
        std::vector<bithash_t> hash_matrix;
        int p = 0, k = 0; // initialized so a default-constructed BitHash is well defined

        BitHash(){}

        BitHash(int p,int k,int seed = 123) : p(p),k(k){
                std::default_random_engine generator (seed);
                std::normal_distribution<bithash_t> distribution(0.0,1.0);
                hash_matrix.resize(p * k);
                for(std::size_t i = 0;i < hash_matrix.size();++i)
                        hash_matrix[i] = distribution(generator);
        }

        std::vector<bool> hash2vecbool(const std::vector<std::pair<int,value_t>>& point){
                std::vector<bool> ret(k);
                for(int i = 0;i < k;++i){
                        bithash_t sum = 0;
                        for(const auto& pp : point){
                                sum += hash_matrix[i * p + pp.first] * pp.second;
                        }
                        ret[i] = (sum >= 0);
                }
                return ret;
        }

        uint8_t hash2uint8(const std::vector<std::pair<int,value_t>>& point){
                uint8_t ret = 0;
                for(int i = 0;i < k;++i){
                        bithash_t sum = 0;
                        for(const auto& pp : point){
                                sum += hash_matrix[i * p + pp.first] * pp.second;
                        }
                        ret |= ((sum >= 0) << i);
                }
                return ret;
        }

        std::vector<std::pair<int,value_t>> hash2kv(const std::vector<std::pair<int,value_t>>& point){
                std::vector<std::pair<int,value_t>> ret(k/sizeof(value_t)/8);
        #ifdef __ENABLE_HASH
                for(int i = 0;i < k / sizeof(value_t) / 8;++i){
                        ret[i].first = i;
                        ret[i].second = 0;
                }
                for(int i = 0;i < k;++i){
                        bithash_t sum = 0;
                        for(const auto& pp : point){
                                sum += hash_matrix[i * p + pp.first] * pp.second;
                        }
                        ret[i / sizeof(value_t) / 8].second |= (sum >= 0) ? (static_cast<value_t>(1) << (i % (sizeof(value_t) * 8))) : 0;
                }
        #endif
                return ret;
        }

};
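
The bit-packing step in hash2kv can be looked at in isolation. The following is a minimal sketch, not code from SONG: `pack_sign_bits` is a hypothetical helper, and it assumes 32-bit words and that the k projection sums have already been computed. Note that hash2kv sizes its output as k / sizeof(value_t) / 8, which truncates when k is not a multiple of the word width; the sketch rounds up instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of SONG: packs one sign bit per projection
// into 32-bit words, least significant bit first (bit i lands in word i / 32).
std::vector<std::uint32_t> pack_sign_bits(const std::vector<float>& projections) {
    const std::size_t bits_per_word = 32;
    // Round up so a bit count that is not a multiple of 32 still fits.
    const std::size_t num_words =
        (projections.size() + bits_per_word - 1) / bits_per_word;
    std::vector<std::uint32_t> words(num_words, 0u);
    for (std::size_t i = 0; i < projections.size(); ++i) {
        // Same convention as bithash.h: a sum >= 0 sets the bit.
        if (projections[i] >= 0.0f) {
            words[i / bits_per_word] |= std::uint32_t{1} << (i % bits_per_word);
        }
    }
    return words;
}
```

The unsigned 32-bit literal in the shift keeps bit 31 well defined, which a plain `1 << 31` on a signed int does not guarantee.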

from song.

saimkhan1509 commented on June 30, 2024

Hi Weijie,
Thanks for the reply. Really helpful!
With the bithash.h file you shared, I was able to build the hashed dataset using the SONG CPU binary (main.cc), but not with the GPU binary. Even after making the proposed changes, the SONG GPU binary (main.cu) produced the non-hashed dataset file.

To perform the search, I ran the SONG GPU binary on the hashed dataset (and graph) produced by the CPU binary. However, for every query, the k returned neighbor IDs are all zeroes. (On the same hashed dataset, the CPU binary returns actual neighbor IDs.)

Digging deeper into the code, I found that the __ENABLE_HASH flag (in config.h) does not affect the GPU code at all. To fix this, I changed the code so that main.cu uses hashkernelgraph.h the same way main.cc uses graph.h. Even with this change, the GPU binary's search output is all zeroes. I want to run queries with the GPU binary on the hashed dataset. Can you point me in the right direction? Thanks a lot!


zhaoweijie12 commented on June 30, 2024

Sorry for the late reply.
You should define __ENABLE_HASH as a compilation macro and use hash_warp_no_heap_astar_accelerator.h and hashkernelgraph.h in main.cu. Also, make sure the DIM macro in these files matches the dimension of your data.

Weijie
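
A macro defined inside a header only reaches the translation units that include that header before the code that tests it, whereas a macro passed on the compiler command line (for example `-D__ENABLE_HASH`, an option both g++ and nvcc accept) reaches every translation unit. The sketch below is one way to confirm which case a given binary was built in. `kHashEnabled` and `build_mode` are hypothetical names, not part of SONG.

```cpp
// Records at compile time whether this translation unit saw __ENABLE_HASH.
#ifdef __ENABLE_HASH
constexpr bool kHashEnabled = true;
#else
constexpr bool kHashEnabled = false;
#endif

// Hypothetical helper for a startup log line; not part of SONG.
const char* build_mode() {
    return kHashEnabled ? "hash" : "plain";
}
```

Printing `build_mode()` once at startup in both main.cc and main.cu would show whether the CPU and GPU binaries were compiled with the same setting.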


saimkhan1509 commented on June 30, 2024

Thanks for the reply, Weijie.

We have incorporated all of your suggestions about __ENABLE_HASH, hash_warp_no_heap_astar_accelerator.h, hashkernelgraph.h and DIM. We can now generate the hashed dataset with both main.cu and main.cc. We have also run the GPU search on the hashed dataset. Unlike last time, the output is meaningful (the k neighbour IDs for each query are within the dataset range).

However, the recall values for this output do not match the ones reported in the SONG paper for the MNIST8M dataset. For example, for hash-32 we get a maximum recall@1 of 3.8%, and for hash-512 a maximum recall@1 of ~32%. These values are far below the ones reported in the SONG paper (30%-99%). As a sanity check, we also calculated recall@10 and recall@5 for hash-512, and the results are 70.56% and 59.24% respectively.

Could you kindly suggest what we might be missing?

Here is the code we are using (for your reference): https://github.com/saimkhan1509/SONG_Clone
(mnist8m folder can be found at: https://drive.google.com/drive/folders/1hxcwgJz3weZUU5CcV7jNBh4BTHrgFVfs?usp=sharing)
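
The thread does not say how recall@k was computed, so the following is only a sketch of one common definition: the fraction of queries whose true nearest neighbor appears among the first k returned IDs. `recall_at_k` and its parameters are hypothetical names, and the ground-truth IDs are assumed to have been computed separately by exact search.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of queries whose true nearest neighbor is among the first k
// returned IDs. Returns 0 when the inputs are empty or mismatched in size.
double recall_at_k(const std::vector<std::vector<int>>& returned_ids,
                   const std::vector<int>& true_nearest,
                   std::size_t k) {
    if (returned_ids.empty() || returned_ids.size() != true_nearest.size()) {
        return 0.0;
    }
    std::size_t hits = 0;
    for (std::size_t q = 0; q < returned_ids.size(); ++q) {
        const std::vector<int>& ids = returned_ids[q];
        // Only look at the first k results (or fewer if the list is shorter).
        const std::ptrdiff_t limit =
            static_cast<std::ptrdiff_t>(std::min(k, ids.size()));
        const auto end = ids.begin() + limit;
        if (std::find(ids.begin(), end, true_nearest[q]) != end) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(returned_ids.size());
}
```

Under this definition recall@k can only grow with k, so recall@1 ≤ recall@5 ≤ recall@10 for the same result lists.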


zhaoweijie12 commented on June 30, 2024

I saw you were using L2 as the measure.
This hashing is specifically for cosine similarity.
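
To illustrate why the choice of measure matters: on vectors that are not normalized, the nearest neighbor under L2 distance and under cosine similarity can be different points. The sketch below is a toy example, not SONG code; all names are hypothetical, and it assumes non-empty data and non-zero vectors of equal length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance; smaller means closer.
float l2_squared(const Vec& a, const Vec& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Cosine similarity; larger means closer. Assumes non-zero vectors.
float cosine_similarity(const Vec& a, const Vec& b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Index of the point with the smallest L2 distance to q.
std::size_t nearest_by_l2(const Vec& q, const std::vector<Vec>& data) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < data.size(); ++i) {
        if (l2_squared(q, data[i]) < l2_squared(q, data[best])) best = i;
    }
    return best;
}

// Index of the point with the largest cosine similarity to q.
std::size_t nearest_by_cosine(const Vec& q, const std::vector<Vec>& data) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < data.size(); ++i) {
        if (cosine_similarity(q, data[i]) > cosine_similarity(q, data[best])) best = i;
    }
    return best;
}
```

For the query (1, 0) and the points (10, 0) and (1, 1), cosine similarity picks (10, 0) because it points in the same direction, while L2 picks (1, 1) because it is closer in space. Ground truth computed with one measure therefore penalizes a search that is correct under the other.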


saimkhan1509 commented on June 30, 2024

Thanks a lot for the reply, Weijie.
Uncommenting __ENABLE_HASH sets the distance measure to Hamming distance, irrespective of the distance measure we pass on the command line. Following your suggestion, we tweaked the code and forced it to use cosine_similarity as the distance measure. We still get the same recall@1 of 0.3314 for the MNIST8M dataset.
I have updated the code (at https://github.com/saimkhan1509/SONG_Clone) accordingly. I would be grateful if you could take a look again.
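
For context on why Hamming distance appears once hashing is on: with sign random projections drawn from a Gaussian, as in bithash.h, two vectors at angle θ disagree on any single bit with probability θ/π. The Hamming distance h between two k-bit codes therefore estimates θ as π·h/k, and cos(π·h/k) estimates the cosine similarity of the original vectors. The sketch below assumes codes packed into 32-bit words; the function names are hypothetical and not taken from SONG.

```cpp
#include <algorithm>
#include <bitset>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of differing bits between two packed codes (compares the common prefix).
std::size_t hamming_distance(const std::vector<std::uint32_t>& a,
                             const std::vector<std::uint32_t>& b) {
    std::size_t dist = 0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        // XOR leaves a 1 wherever the codes disagree; count those bits.
        dist += std::bitset<32>(a[i] ^ b[i]).count();
    }
    return dist;
}

// Cosine estimate from a Hamming distance over num_bits sign bits.
double estimate_cosine(std::size_t hamming, std::size_t num_bits) {
    const double pi = std::acos(-1.0);
    return std::cos(pi * static_cast<double>(hamming) /
                    static_cast<double>(num_bits));
}
```

A smaller Hamming distance corresponds to a larger estimated cosine, so ranking hashed points by Hamming distance approximates a cosine ranking of the original vectors, with an estimate that gets noisier as the number of bits shrinks.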

