Giter Club home page Giter Club logo

snn's Introduction

SNN: Fast and exact fixed-radius neighbor search

C/C++ CI !pypi License: MIT DOI

SNN is a fast and exact fixed-radius nearest neighbor search algorithm [1]. It uses the first principal component of the data to prune the search space and speeds up Euclidean distance computations using high-level BLAS routines. SNN is implemented in native Python. On many problems, SNN is faster than KDtree and Balltree in the scikit-learn package. There is also a C++ implementation of SNN.

Reproducibility

To reproduce the experiments from the paper [1], see the instructions and code in the exp subfolder.

Python installation

The native Python implementation of SNN can be installed by:

pip install snnpy

Python API

Here is an example illustrating the use of SNN:

import numpy as np
from snnpy import *
from time import time

n_samples = 100000
n_dim =  100
radius = 3.5
rng = np.random.RandomState(0)
X = rng.random_sample((n_samples, n_dim))  

# build SNN model
st = time()
snn_model = build_snn_model(X)  
print("SNN index time:", time()-st)
# will be faster if return_dist is False, then no distance information come out

# query neighbors of X[0]
st = time()
ind,dist = snn_model.query_radius(X[0], radius, return_distance=True)
# If remove the returning of the associated distance, use: ind, dist = snn_model.query_radius(X[0], radius, return_distance=False)
sort_ind = np.argsort(dist)
print("SNN query time:", time()-st)

# print total number and top five indices
print("number of neighbors:", len(ind))
print("indices of closest five:", ", ".join([str(i) for i in ind[sort_ind][:5]]))

# EXAMPLE OUTPUT
# SNN index time: 0.2224433422088623
# SNN query time: 0.009207725524902344
# number of neighbors: 550
# indices of closest five: 0, 27279, 69983, 65906, 97095

Compare this to sklearn's KDTree:

from sklearn.neighbors import KDTree
st = time()
tree = KDTree(X)    
print("KDTree index time:", time()-st)
st = time()
ind2 = tree.query_radius(X[0].reshape(1, -1), radius)
print("KDTree query time:", time()-st)
print("number of neighbors:", len(ind2[0]))

# KDTree index time: 7.597502946853638
# KDTree query time: 0.08962678909301758
# number of neighbors: 550

We also support multi-point queries which exploit single-threaded BLAS-3 (multi-threading is under development):

ind = snn_model.radius_batch_query(X[:10], radius) 

snn_model.radius_batch_query uses Numba for further speedup. For the first run, Numba has to go through the code and optimize it, adding extra overhead. Each subsequent run of batch queries will be much faster.

C++ installation

The C++ version of SNN has dependencies on CBLAS, LAPACK and openMP. Reference LAPACK is available from GitHub. LAPACK releases are available on netlib (using MKL can yield further speedup).

Please modify the CMakeList.txt file to reflect your LAPACK location. If liblpacke-dev as well as libopenblas-dev were installed using your package manager (for example apt) there is no need to further modify the CMakeLists.txt.

git clone https://github.com/nla-group/snn.git
cd snn
cmake -S . -B ./build/
cmake --build ./build/ 
cp *.a /usr/lib
cp include/*.h /usr/include

After installation, you can include "snn.h" in your code, and compile it by linking the libsnn.a, CBLAS and LAPACK libraries. For example, you can use g++ with GSL BLAS and LAPACK by typing g++ your_code.cpp libsnn.a -o output -llapacke -lgslcblas -lm -W in Ubuntu.

C++ API

SNN has an easy-to-use API, and it works with int, float and double data types stored in column major order.

We first prepare the data:

int rows = 10;
int cols = 3;
double df[rows*cols] = {
  0.5488135 , 0.54488318, 0.43758721, 0.38344152, 0.56804456,
  0.0871293 , 0.77815675, 0.79915856, 0.11827443, 0.94466892,
  0.71518937, 0.4236548 , 0.891773, 0.79172504, 0.92559664,
  0.0202184 , 0.87001215, 0.46147936, 0.63992102, 0.52184832,
  0.60276338, 0.64589411, 0.96366276, 0.52889492, 0.07103606,
  0.83261985, 0.97861834, 0.78052918, 0.14335329, 0.41466194
}; 

/* data -- column major order

0.548813 0.715189 0.602763 
0.544883 0.423655 0.645894 
0.437587 0.891773 0.963663 
0.383442 0.791725 0.528895 
0.568045 0.925597 0.0710361 
0.0871293 0.0202184 0.83262 
0.778157 0.870012 0.978618 
0.799159 0.461479 0.780529 
0.118274 0.639921 0.143353 
0.944669 0.521848 0.414662 
*

Now we create the SNN index, specifying the number of objects and feature dimensions in the data:

// index SNN model
SNN_MODEL<double, double> snn_model_test(df, rows, cols);

Querying neighbors of a single data point:

// query data
double query[cols] = {0.5488135 , 0.71518937, 0.60276338}; 

// create the variable storing neighbors ID 
vector<int> knnID; 

// create the variable storing neighbors' distance to the query
vector<double> knnDist; 

// employ single query, the number 0.4 refers to the search radius (range) 
snn_model_test.radius_single_query(query, 0.4, &knnID, &knnDist);

We can also query neighbors of multiple data points at once:

 // employ two queries, parallel compute by openMP
double query_batch[2*cols] = {0.5488135, 0.944669, 0.71518937, 0.521848, 0.60276338, 0.414662};

vector<vector<int> > batch_knnID;
vector<vector<double> > batch_knnDist;

snn_model_test.radius_batch_query(query_batch, 0.4, &batch_knnID, &batch_knnDist, 2);

for (int j=0; j<2; j++){
    cout << "knnID" << endl;
    for (auto i: batch_knnID[j]){
        cout << i << " ";
    }

    cout << "\nknnDist" << endl;
    for (auto i: batch_knnDist[j]){
        cout << i << " ";
    }

    cout << endl;
}

/* output
  knnID
  3 0 1 7 
  knnDist
  0.196627 0 0.294734 0.398299 

  knnID
  9 7 
  knnDist
  3.35193e-07 0.398342
*/

License

All the content in this repository is licensed under the MIT License.

Reference

@article{CG24b,
  title   = {Fast and exact fixed-radius neighbor search based on sorting},
  author  = {Chen, Xinye and G\"{u}ttel, Stefan},
  year    = {2024},
  volume  = {},
  number  = {},
  pages   = {},
  journal = {To appear in PeerJ Computer Science},
  url     = {https://arxiv.org/abs/2212.07679},
  webpdf  = {https://arxiv.org/abs/2212.07679}
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.