Comments (6)
Please check the following bithash.h. You may need to use hash2kv in the data reader.
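#pragma once
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// value_t is not defined in this header; it is expected to come from the
// project code that includes it.
typedef float bithash_t;

// Sign random projection hash: bit i is the sign of the dot product between
// the input point and the i-th random Gaussian row of hash_matrix.
class BitHash{
public:
    std::vector<bithash_t> hash_matrix;  // k rows of p entries, row-major
    int p,k;                             // p: input dimension, k: number of hash bits

    BitHash() : p(0),k(0){}

    BitHash(int p,int k,int seed = 123) : p(p),k(k){
        std::default_random_engine generator (seed);
        std::normal_distribution<bithash_t> distribution(0.0,1.0);
        hash_matrix.resize(p * k);
        for(std::size_t i = 0;i < hash_matrix.size();++i)
            hash_matrix[i] = distribution(generator);
    }

    // Returns the k hash bits as a vector<bool>.
    std::vector<bool> hash2vecbool(const std::vector<std::pair<int,value_t>>& point){
        std::vector<bool> ret(k);
        for(int i = 0;i < k;++i){
            bithash_t sum = 0;
            for(const auto& pp : point){
                sum += hash_matrix[i * p + pp.first] * pp.second;
            }
            ret[i] = (sum >= 0);
        }
        return ret;
    }

    // Returns the hash bits packed into one byte; only meaningful for k <= 8.
    uint8_t hash2uint8(const std::vector<std::pair<int,value_t>>& point){
        uint8_t ret = 0;
        for(int i = 0;i < k;++i){
            bithash_t sum = 0;
            for(const auto& pp : point){
                sum += hash_matrix[i * p + pp.first] * pp.second;
            }
            ret |= static_cast<uint8_t>((sum >= 0) << i);
        }
        return ret;
    }

    // Returns the hash bits packed into value_t words as (word index, word) pairs.
    // k is assumed to be a multiple of the number of bits in value_t.
    std::vector<std::pair<int,value_t>> hash2kv(const std::vector<std::pair<int,value_t>>& point){
        const int word_bits = sizeof(value_t) * 8;
        std::vector<std::pair<int,value_t>> ret(k / word_bits);
#ifdef __ENABLE_HASH
        for(int i = 0;i < k / word_bits;++i){
            ret[i].first = i;
            ret[i].second = 0;
        }
        for(int i = 0;i < k;++i){
            bithash_t sum = 0;
            for(const auto& pp : point){
                sum += hash_matrix[i * p + pp.first] * pp.second;
            }
            // Shift a value_t-wide 1 so the shift stays in range for wide words.
            if(sum >= 0)
                ret[i / word_bits].second |= static_cast<value_t>(1) << (i % word_bits);
        }
#endif
        return ret;
    }
};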
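To make the bit packing in hash2kv concrete, here is a minimal standalone sketch. It assumes a 32-bit word in place of value_t (the real type comes from the project), and pack_sign_bits is a hypothetical helper for illustration, not part of SONG.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of SONG): packs sign bits into 32-bit words
// with the same index arithmetic as hash2kv, using uint32_t as a stand-in
// for value_t. Bit i lands in word i / 32 at position i % 32.
std::vector<std::uint32_t> pack_sign_bits(const std::vector<bool>& bits) {
    const std::size_t word_bits = 32;
    // Assumption: the number of bits is a whole number of words.
    assert(bits.size() % word_bits == 0);
    std::vector<std::uint32_t> words(bits.size() / word_bits, 0u);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) {
            words[i / word_bits] |= std::uint32_t{1} << (i % word_bits);
        }
    }
    return words;
}
```

With 64 bits where only bits 0 and 33 are set, the first word holds bit 0 and the second word holds bit 1, which is the layout a data reader would have to expect.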
from song.
Hi Weijie,
Thanks for the reply. Really helpful!
After switching to the bithash.h file you shared, I was able to build the hashed dataset with the SONG CPU binary (main.cc) but not with the GPU binary: even after making the proposed changes, the SONG GPU binary (main.cu) produced the non-hashed dataset file.
For the search operation, I tried running the SONG GPU binary on the hashed dataset (and graph) produced by the CPU binary. However, for every query, the k returned neighbor IDs are all zeroes. (On the same hashed dataset, the CPU binary returned actual neighbor IDs.)
Digging deeper into the code, I found that the __ENABLE_HASH flag (in config.h) does not affect the GPU code at all. To fix this, I changed the code to link main.cu to hashkernelgraph.h the same way main.cc is linked to graph.h. Even with this change, the search output from the GPU binary is all zeroes. I want to run queries with the GPU binary on the hashed dataset. Can you point me in the right direction? Thanks a lot!
Sorry for the late reply.
You should define __ENABLE_HASH as a compilation macro and use hash_warp_no_heap_astar_accelerator.h and hashkernelgraph.h in main.cu. Also, make sure the DIM macro in these files matches the dimension of your data.
Weijie
Thanks for the reply, Weijie.
We have incorporated all of your suggestions regarding __ENABLE_HASH, hash_warp_no_heap_astar_accelerator.h, hashkernelgraph.h and DIM. We can now generate the hashed dataset with both main.cu and main.cc. We have also run the GPU search on the hashed dataset. Unlike last time, the output is meaningful (the k neighbour IDs for each query are within the dataset range).
However, surprisingly, the recall values on this output do not match those reported in the SONG paper for the MNIST8M dataset. For example, for hash-32 we get a maximum recall@1 of 3.8%, and for hash-512 a maximum recall@1 of ~32%. These are far below the values reported in the SONG paper (30%-99%). As a sanity check, we also calculated recall@10 and recall@5 for hash-512; the results are 70.56% and 59.24% respectively.
Could you kindly suggest what we might be missing?
Here is the code we are using (for your reference): https://github.com/saimkhan1509/SONG_Clone
(mnist8m folder can be found at: https://drive.google.com/drive/folders/1hxcwgJz3weZUU5CcV7jNBh4BTHrgFVfs?usp=sharing)
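Because recall@1, recall@5 and recall@10 depend on how recall is defined, it may help to pin the definition down. The sketch below assumes recall@k is the fraction of the true k nearest neighbours that appear among the first k returned IDs, averaged over queries. The SONG paper and evaluation code may use a different definition, and recall_at_k is a hypothetical helper, not a function from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: recall@k under the assumption stated above.
// truth[q] holds the ground-truth neighbour IDs of query q (nearest first),
// found[q] holds the IDs returned by the search (best first).
double recall_at_k(const std::vector<std::vector<int>>& truth,
                   const std::vector<std::vector<int>>& found,
                   std::size_t k) {
    assert(truth.size() == found.size());
    if (truth.empty() || k == 0) return 0.0;
    std::size_t hits = 0;
    for (std::size_t q = 0; q < truth.size(); ++q) {
        const std::size_t kt = std::min(k, truth[q].size());
        const std::size_t kf = std::min(k, found[q].size());
        const auto first = found[q].begin();
        const auto last = first + static_cast<std::ptrdiff_t>(kf);
        for (std::size_t i = 0; i < kt; ++i) {
            // A hit: the i-th true neighbour is among the first kf results.
            if (std::find(first, last, truth[q][i]) != last) ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(truth.size() * k);
}
```

Whatever definition is used, the ground-truth lists should be computed with the same similarity measure that the index is meant to approximate.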
I saw you were using L2 as the measure.
This hashing is specifically for cosine similarity.
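One way to see why the measure matters: a sign random projection keeps only the sign of each dot product, so the code depends on a vector's direction and not on its length. The sketch below uses dense vectors and two hypothetical helpers, sign_hash and l2_distance (neither is SONG's BitHash), to show that scaling a vector leaves its code unchanged while its L2 distance to the original is large.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: k-bit sign random projection of a dense vector.
// The same seed reproduces the same Gaussian projections on every call.
std::vector<bool> sign_hash(const std::vector<float>& x, int k, unsigned seed = 123) {
    std::default_random_engine gen(seed);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<bool> code(k);
    for (int i = 0; i < k; ++i) {
        float sum = 0.0f;
        for (float v : x) sum += dist(gen) * v;  // dot product with row i
        code[i] = (sum >= 0.0f);
    }
    return code;
}

// Hypothetical helper: Euclidean (L2) distance between two dense vectors.
float l2_distance(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```

For x = {1, -2, 0.5, 3} and y = 4x, sign_hash gives identical codes, yet l2_distance(x, y) is about 11.3. Ground truth computed with L2 can therefore rank neighbours very differently from what the hash is able to recover.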
Thanks a lot for the reply, Weijie.
Uncommenting __ENABLE_HASH sets the distance measure to Hamming distance, irrespective of the measure given on the command line. Following your suggestion, we tweaked the code to force cosine_similarity as the distance measure. We still get the same recall@1 of 0.3314 for the MNIST8M dataset.
I have updated the code (at https://github.com/saimkhan1509/SONG_Clone) accordingly. I would be obliged if you could take another look.
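For reference, Hamming distance on packed hash codes is the popcount of the XOR of corresponding words. For sign random projections, the expected fraction of differing bits equals the angle between the two vectors divided by pi, so ranking by Hamming distance approximates ranking by cosine similarity. The sketch below assumes codes stored as 32-bit words; hamming_distance and estimated_angle are hypothetical helpers, and SONG's own kernels may be organised differently.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of differing bits between two packed codes.
std::size_t hamming_distance(const std::vector<std::uint32_t>& a,
                             const std::vector<std::uint32_t>& b) {
    assert(a.size() == b.size());
    std::size_t dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // XOR marks the differing bits; bitset::count() counts them.
        dist += std::bitset<32>(a[i] ^ b[i]).count();
    }
    return dist;
}

// Hypothetical helper: angle estimate (radians) from a Hamming distance,
// using E[differing bits / total bits] = angle / pi.
double estimated_angle(std::size_t hamming, std::size_t total_bits) {
    assert(total_bits > 0);
    const double pi = 3.14159265358979323846;
    return pi * static_cast<double>(hamming) / static_cast<double>(total_bits);
}
```

If half of the bits differ, the estimated angle is pi/2, that is, the two vectors look orthogonal under this estimate.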