src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

License: Other


kmcuda's Introduction

"Yinyang" K-means and K-nn using NVIDIA CUDA

The K-means implementation is based on "Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup". While it introduces some overhead and many conditional clauses, which are bad for CUDA, it still shows a 1.6-2x speedup over the Lloyd algorithm. The K-nearest neighbors search employs the same triangle inequality idea and requires precalculated centroids and cluster assignments, similar to a flattened ball tree.

Benchmarks	sklearn KMeans	KMeansRex	KMeansRex OpenMP	Serban	kmcuda	kmcuda 2 GPUs
speed	1x	4.5x	8.2x	15.5x	17.8x	29.8x
memory	1x	2x	2x	0.6x	0.6x	0.6x

Technically, this project is a shared library which exports two functions defined in kmcuda.h: kmeans_cuda and knn_cuda. It has built-in Python3 and R native extension support, so you can from libKMCUDA import kmeans_cuda or dyn.load("libKMCUDA.so").

How was this created?

K-means
K-nn
Notes
Building
- macOS
Testing
Benchmarks
- 100,000x256@1024
  - Configuration
  - Contestants
  - Data
  - Notes
- 8,000,000x256@1024
  - Data
  - Notes
Python examples
Python API
R examples
- K-means
- K-nn
R API
C examples
C API
License

K-means

The major difference between this project and others is that kmcuda is optimized for low memory consumption and a large number of clusters. E.g., kmcuda can sort 4M samples in 480 dimensions into 40000 clusters (if you have several days and 12 GB of GPU memory); 300K samples are grouped into 5000 clusters in 4½ minutes on NVIDIA Titan X (15 iterations); 3M samples and 1000 clusters take 20 minutes (33 iterations). Yinyang can be turned off to save GPU memory, but the slower Lloyd algorithm is used then. Four centroid initialization schemes are supported: random, k-means++, AFKMC2 and import. Two distance metrics are supported: L2 (the usual one) and angular (arccos of the scalar product). L1 is in development. 16-bit float support delivers 2x memory compression. If you have several GPUs, they can be used together, which gives a corresponding linear speedup for both Lloyd and Yinyang.

The code has been thoroughly tested to yield bit-to-bit identical results from Yinyang and Lloyd. The AFKMC2 initialization ("Fast and Provably Good Seedings for k-Means") was adapted from the reference code.

Read the articles: 1, 2.

K-nn

Centroid distance matrix C_ij is calculated together with clusters' radiuses R_i (the maximum distance from the centroid to the corresponding cluster's members). Given sample S in cluster A, we avoid calculating the distances from S to another cluster B's members if C_AB - SA - R_B is greater than the current maximum K-nn distance. This resembles the ball tree algorithm.
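The pruning rule above can be sketched in a few lines of plain Python. This is only an illustration of the triangle inequality bound, not kmcuda's CUDA kernel; the function name `can_skip_cluster` and the 1-D toy data are assumptions made for the example.

```python
def can_skip_cluster(c_ab, s_a, r_b, kth_dist):
    """Return True if no member of cluster B can beat the current k-th neighbor.

    c_ab: distance between the centroids of clusters A and B
    s_a: distance from sample S to its own centroid A
    r_b: radius of cluster B (max distance from centroid B to its members)
    kth_dist: current largest distance in S's k-nearest list
    """
    # By the triangle inequality, every member of B is at least
    # c_ab - s_a - r_b away from S.
    return c_ab - s_a - r_b > kth_dist


# Toy 1-D data (hypothetical): cluster A sits around 0, cluster B around 10.
centroid_a, centroid_b = 0.0, 10.0
members_b = [9.0, 10.5, 11.0]
sample = 1.0

c_ab = abs(centroid_a - centroid_b)
s_a = abs(sample - centroid_a)
r_b = max(abs(m - centroid_b) for m in members_b)
lower_bound = c_ab - s_a - r_b
closest_in_b = min(abs(sample - m) for m in members_b)

skip_tight = can_skip_cluster(c_ab, s_a, r_b, kth_dist=2.0)
skip_loose = can_skip_cluster(c_ab, s_a, r_b, kth_dist=9.0)
```

The bound never exceeds the true distance to the closest member of B, so skipping a cluster whenever the bound is larger than the current k-th distance cannot discard a real neighbor.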

The implemented algorithm is tolerant to NaNs. There are two variants depending on whether k is small enough to fit the sample's neighbors into CUDA shared memory. Internally, the neighbors list is a binary heap, which reduces the complexity multiplier from O(k) to O(log k).
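To illustrate why a heap helps, here is a minimal CPU sketch of a bounded neighbor list built on Python's `heapq`. It is not the CUDA implementation: the function name `k_nearest` is hypothetical, and skipping NaN distances is an assumption about what "tolerant" means.

```python
import heapq


def k_nearest(distances, k):
    """Return the indices of the k smallest distances, nearest first."""
    # heapq is a min-heap, so distances are stored negated:
    # heap[0] is then the current farthest of the kept neighbors.
    heap = []
    for index, dist in enumerate(distances):
        if dist != dist:  # NaN never equals itself; skip it (assumption)
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-dist, index))      # O(log k)
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, index))   # O(log k)
    # Order the survivors by increasing distance.
    return [index for _, index in sorted(heap, reverse=True)]


neighbors = k_nearest([0.5, float("nan"), 0.1, 0.9, 0.3], 3)
print(neighbors)
```

Each candidate costs at most one O(log k) heap operation, whereas keeping a sorted array would require an O(k) insertion.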

The implementation yields identical results to sklearn.neighbors.NearestNeighbors except in cases where adjacent distances are equal and the order is undefined. That is, the returned indices are sorted in the increasing order of the corresponding distances.

Notes

Lloyd is tolerant to samples with NaN features while Yinyang is not. It may happen that some of the resulting clusters contain zero elements. In such cases, their features are set to NaN.
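Since empty clusters are reported through NaN features, they can be detected after clustering with a small check like the one below. This is a sketch; the helper name `empty_clusters` is hypothetical, and it works on nested lists as well as on any row-iterable 2-D array.

```python
import math


def empty_clusters(centroids):
    """Return the indices of centroids whose features are all NaN."""
    return [i for i, row in enumerate(centroids)
            if all(math.isnan(value) for value in row)]


# Hypothetical result with three clusters, the second one empty.
centroids = [[0.1, 0.2], [float("nan"), float("nan")], [0.7, 0.9]]
print(empty_clusters(centroids))
```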

Angular (cosine) distance metric effectively results in Spherical K-Means behavior. The samples must be normalized to unit L2 norm before clustering; this is not done automatically. The actual formula is:

$D(A, B)=\arccos\left(\frac{A\cdot B}{|A||B|}\right)$
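The formula and the normalization requirement can be checked with a short pure-Python sketch. The helper names `normalize` and `angular_distance` are assumptions made for the example; clamping the cosine to [-1, 1] guards against rounding errors that would otherwise make `acos` fail.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def angular_distance(a, b):
    """arccos of the cosine similarity between a and b, in radians."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding
    return math.acos(cosine)


unit = normalize([3.0, 4.0])
right_angle = angular_distance([1.0, 0.0], [0.0, 1.0])
```

For already normalized samples the denominators equal 1, so the distance reduces to the arccos of the plain scalar product.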

If you get OOM with the default parameters, set yinyang_t to 0 which forces Lloyd. verbosity 2 will print the memory allocation statistics (all GPU allocation happens at startup).

Data type is either 32- or 16-bit float. Number of samples is limited by 2^32, clusters by 2^32 and features by 2^16 (2^17 for fp16). Besides, the product of clusters number and features number may not exceed 2^32.
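A pre-flight check of these limits might look like the following sketch. The function name `fits_limits` is hypothetical, and treating every bound as inclusive is an assumption; the text above does not say whether the limits are strict.

```python
def fits_limits(samples, features, clusters, fp16=False):
    """Check a problem size against the documented kmcuda limits."""
    max_features = 2 ** 17 if fp16 else 2 ** 16
    return (samples <= 2 ** 32
            and clusters <= 2 ** 32
            and features <= max_features
            # the clusters x features product has its own limit
            and clusters * features <= 2 ** 32)


print(fits_limits(8000000, 256, 1024))
```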

In the case of 16-bit floats, the reduced precision often leads to a slightly increased number of iterations; Yinyang is especially sensitive to that. In some cases, there may be overflows and the clustering may fail completely.
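Both effects can be reproduced on the CPU with the standard `struct` module, whose `e` format is IEEE 754 half precision. This sketch only shows the properties of the number format; where exactly kmcuda loses precision or overflows is not claimed here, and the helper names are hypothetical.

```python
import struct


def to_half(value):
    """Round-trip a Python float through 16-bit half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_in_half(value):
    """Return True if value is representable without overflow."""
    try:
        struct.pack("<e", value)
        return True
    except (OverflowError, struct.error):
        return False


# Precision loss: 0.1 is stored only approximately.
rounding_error = abs(to_half(0.1) - 0.1)

# Overflow: the largest finite half is 65504, so an intermediate value
# such as a squared feature of 300 no longer fits.
feature = 300.0
squared = feature * feature
print(rounding_error, fits_in_half(feature), fits_in_half(squared))
```

Scaling the features down before converting to float16 is one way to keep intermediate values in range.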

Building

git clone https://github.com/src-d/kmcuda
cd kmcuda/src
cmake -DCMAKE_BUILD_TYPE=Release . && make

It requires cudart 8.0 / Pascal and an OpenMP 4.0 capable compiler. The build has been tested primarily on Linux, but it works on macOS too with some bells and whistles (see "macOS" subsection). If you do not want to build the Python native module, add -D DISABLE_PYTHON=y. If you do not want to build the R native module, add -D DISABLE_R=y. If CUDA is not automatically found, add -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0 (change the path to the actual one). By default, CUDA kernels are compiled for the architecture 60 (Pascal). It is possible to override it via -D CUDA_ARCH=52, but fp16 support will be disabled then.

Python users:

CUDA_ARCH=61 pip install libKMCUDA
# replace 61 with your device version

Or install it from source:

CUDA_ARCH=61 pip install git+https://github.com/src-d/kmcuda.git#subdirectory=src
# replace 61 with your device version

Binary Python packages are quite hard to provide because they depend on CUDA and device architecture versions. PRs welcome!

macOS

macOS build is tricky, but possible. The instructions below correspond to the state from 1 year ago and may be different now. Please help with updates!

Install Homebrew and the Command Line Developer Tools which are compatible with your CUDA installation. E.g., CUDA 8.0 does not support the latest 8.x and works with 7.3.1 and below. Install clang with OpenMP support and Python with numpy:

brew install llvm --with-clang
brew install python3
pip3 install numpy

Execute this magic command which builds kmcuda afterwards:

CC=/usr/local/opt/llvm/bin/clang CXX=/usr/local/opt/llvm/bin/clang++ LDFLAGS=-L/usr/local/opt/llvm/lib/ cmake -DCMAKE_BUILD_TYPE=Release .

And make the last important step - rename *.dylib to *.so so that Python is able to import the native extension:

mv libKMCUDA.{dylib,so}

Testing

test.py contains the unit tests based on unittest. They require either cuda4py or pycuda and scikit-learn. test.R contains R integration tests and shall be run with Rscript.

Benchmarks

100000x256@1024

	sklearn KMeans	KMeansRex	KMeansRex OpenMP	Serban	kmcuda	kmcuda 2 GPUs
time, s	164	36	20	10.6	9.2	5.5
memory, GB	1	2	2	0.6	0.6	0.6

Configuration

16-core (32 threads) Intel Xeon E5-2620 v4 @ 2.10GHz
256 GB RAM Samsung M393A2K40BB1
Nvidia Titan X 2016

Contestants

sklearn.cluster.KMeans@0.18.1; KMeans(n_clusters=1024, init="random", max_iter=15, random_state=0, n_jobs=1, n_init=1).
KMeansRex@288c40a with -march=native and Eigen 3.3; KMeansRex.RunKMeans(data, 1024, Niter=15, initname=b"random").
KMeansRex with additional -fopenmp.
Serban KMeans@83e76bf built for arch 6.1; ./cuda_main -b -i serban.bin -n 1024 -t 0.0028 -o
kmcuda v6.1 built for arch 6.1; libKMCUDA.kmeans_cuda(dataset, 1024, tolerance=0.002, seed=777, init="random", verbosity=2, yinyang_t=0, device=0)
kmcuda running on 2 GPUs.

Data

100000 random samples uniformly distributed between 0 and 1 in 256 dimensions.

Notes

100000 is the maximum size Serban KMeans can handle.

8000000x256@1024

	sklearn KMeans	KMeansRex	KMeansRex OpenMP	Serban	kmcuda 2 GPU	kmcuda Yinyang 2 GPUs
time	please no	-	6h 34m	fail	44m	36m
memory, GB	-	-	205	fail	8.7	10.4

kmeans++ initialization, 93 iterations (1% reassignments equivalent).

Data

8,000,000 secret production samples.

Notes

KMeansRex consumed 205 GB of RAM at peak; it uses dynamic memory, so its usage constantly bounced between 100 GB and 200 GB.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

Python examples

K-means, L2 (Euclidean) distance

import numpy
from matplotlib import pyplot
from libKMCUDA import kmeans_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
arr[:2500] = numpy.random.rand(2500, 2) + [0, 2]
arr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]
arr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]
arr[7500:] = numpy.random.rand(2500, 2) - [2, 0]
centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
print(centroids)
pyplot.scatter(arr[:, 0], arr[:, 1], c=assignments)
pyplot.scatter(centroids[:, 0], centroids[:, 1], c="white", s=150)
pyplot.show()

You should see something like this:

K-means, angular (cosine) distance + average

import numpy
from matplotlib import pyplot
from libKMCUDA import kmeans_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
angs = numpy.random.rand(10000) * 2 * numpy.pi
for i in range(10000):
    arr[i] = numpy.sin(angs[i]), numpy.cos(angs[i])
centroids, assignments, avg_distance = kmeans_cuda(
    arr, 4, metric="cos", verbosity=1, seed=3, average_distance=True)
print("Average distance between centroids and members:", avg_distance)
print(centroids)
pyplot.scatter(arr[:, 0], arr[:, 1], c=assignments)
pyplot.scatter(centroids[:, 0], centroids[:, 1], c="white", s=150)
pyplot.show()

You should see something like this:

K-nn

import numpy
from libKMCUDA import kmeans_cuda, knn_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
angs = numpy.random.rand(10000) * 2 * numpy.pi
for i in range(10000):
    arr[i] = numpy.sin(angs[i]), numpy.cos(angs[i])
ca = kmeans_cuda(arr, 4, metric="cos", verbosity=1, seed=3)
neighbors = knn_cuda(10, arr, *ca, metric="cos", verbosity=1, device=1)
print(neighbors[0])

You should see

reassignments threshold: 100
performing kmeans++...
done
too few clusters for this yinyang_t => Lloyd
iteration 1: 10000 reassignments
iteration 2: 926 reassignments
iteration 3: 416 reassignments
iteration 4: 187 reassignments
iteration 5: 87 reassignments
initializing the inverse assignments...
calculating the cluster radiuses...
calculating the centroid distance matrix...
searching for the nearest neighbors...
calculated 0.276552 of all the distances
[1279 1206 9846 9886 9412 9823 7019 7075 6453 8933]
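The "reassignments threshold: 100" line in the log above is consistent with the default tolerance of 0.01 applied to 10000 samples. The sketch below replays the logged reassignment counts against that rule. The helper names are hypothetical, and both the truncation to an integer and the strict "less than" comparison are assumptions that merely agree with this log.

```python
def reassignment_threshold(samples_size, tolerance):
    """Absolute number of reassignments that corresponds to the tolerance."""
    return int(samples_size * tolerance)


def stopping_iteration(reassignments, samples_size, tolerance):
    """Return the 1-based iteration at which clustering would stop."""
    threshold = reassignment_threshold(samples_size, tolerance)
    for iteration, moved in enumerate(reassignments, start=1):
        if moved < threshold:
            return iteration
    return None  # did not converge within the given iterations


# Reassignment counts copied from the log above.
log = [10000, 926, 416, 187, 87]
print(reassignment_threshold(10000, 0.01), stopping_iteration(log, 10000, 0.01))
```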

Python API

def kmeans_cuda(samples, clusters, tolerance=0.01, init="k-means++",
                yinyang_t=0.1, metric="L2", average_distance=False,
                seed=time(), device=0, verbosity=0)

samples numpy array of shape [number of samples, number of features] or tuple(raw device pointer (int), device index (int), shape (tuple(number of samples, number of features[, fp16x2 marker]))). In the latter case, negative device index means host pointer. Optionally, the tuple can be 2 items longer with preallocated device pointers for centroids and assignments. dtype must be either float16 or convertible to float32.

clusters integer, the number of clusters.

tolerance float, if the relative number of reassignments drops below this value, algorithm stops.

init string or numpy array, sets the method for centroids initialization, may be "k-means++", "afk-mc2", "random" or numpy array of shape [clusters, number of features]. dtype must be float32.

yinyang_t float, the relative number of cluster groups, usually 0.1. 0 disables Yinyang refinement.

metric str, the name of the distance metric to use. The default is Euclidean (L2); it can be changed to "cos" to switch the algorithm to Spherical K-means with the angular distance. Please note that samples must be normalized in the latter case.

average_distance boolean, the value indicating whether to calculate the average distance between cluster elements and the corresponding centroids. Useful for finding the best K. Returned as the third tuple element.

seed integer, random generator seed for reproducible results.

device integer, bitwise OR-ed CUDA device indices, e.g. 1 means first device, 2 means second device, 3 means using first and second device. Special value 0 enables all available devices. The default is 0.

verbosity integer, 0 means complete silence, 1 means mere progress logging, 2 means lots of output.

return tuple(centroids, assignments, [average_distance]). If samples was a numpy array or a host pointer tuple, the types are numpy arrays, otherwise, raw pointers (integers) allocated on the same device. If samples are float16, the returned centroids are float16 too.
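The device parameter described above is a bit mask, so a value for a given set of GPUs can be composed with shifts. A minimal sketch, with hypothetical helper names `device_mask` and `devices_in_mask` and zero-based device indices:

```python
def device_mask(indices):
    """Build the OR-ed device value from zero-based CUDA device indices."""
    mask = 0
    for index in indices:
        mask |= 1 << index
    return mask


def devices_in_mask(mask, available):
    """List the zero-based device indices selected by a mask."""
    if mask == 0:  # special value: all available devices
        return list(range(available))
    return [i for i in range(available) if mask & (1 << i)]


print(device_mask([0, 1]), devices_in_mask(0, 4))
```

For example, `device_mask([0, 1])` gives the value 3 that selects the first and the second device, matching the description of the parameter.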

def knn_cuda(k, samples, centroids, assignments, metric="L2", device=0, verbosity=0)

k integer, the number of neighbors to search for each sample. Must be ≤ 2¹⁶.

samples numpy array of shape [number of samples, number of features] or tuple(raw device pointer (int), device index (int), shape (tuple(number of samples, number of features[, fp16x2 marker]))). In the latter case, negative device index means host pointer. Optionally, the tuple can be 1 item longer with the preallocated device pointer for neighbors. dtype must be either float16 or convertible to float32.

centroids numpy array with precalculated clusters' centroids (e.g., using K-means/kmcuda/kmeans_cuda()). dtype must match samples. If samples is a tuple then centroids must be a length-2 tuple, the first element is the pointer and the second is the number of clusters. The shape is (number of clusters, number of features).

assignments numpy array with sample-cluster associations. dtype is expected to be compatible with uint32. If samples is a tuple then assignments is a pointer. The shape is (number of samples,).

verbosity integer, 0 means complete silence, 1 means mere progress logging, 2 means lots of output.

return neighbor indices. If samples was a numpy array or a host pointer tuple, the return type is numpy array, otherwise, a raw pointer (integer) allocated on the same device. The shape is (number of samples, k).

R examples

K-means

dyn.load("libKMCUDA.so")
samples = replicate(4, runif(16000))
result = .External("kmeans_cuda", samples, 50, tolerance=0.01,
                   seed=777, verbosity=1, average_distance=TRUE)
print(result$average_distance)
print(result$centroids[1:10,])
print(result$assignments[1:10])

K-nn

dyn.load("libKMCUDA.so")
samples = replicate(4, runif(16000))
cls = .External("kmeans_cuda", samples, 50, tolerance=0.01,
                seed=777, verbosity=1)
result = .External("knn_cuda", 20, samples, cls$centroids, cls$assignments,
                   verbosity=1)
print(result[1:10,])

R API

function kmeans_cuda(
    samples, clusters, tolerance=0.01, init="k-means++", yinyang_t=0.1,
    metric="L2", average_distance=FALSE, seed=Sys.time(), device=0, verbosity=0)

samples real matrix of shape [number of samples, number of features] or list of real matrices which are rbind()-ed internally. No more than INT32_MAX samples and UINT16_MAX features are supported.

clusters integer, the number of clusters.

tolerance real, if the relative number of reassignments drops below this value, algorithm stops.

init character vector or real matrix, sets the method for centroids initialization, may be "k-means++", "afk-mc2", "random" or real matrix, of shape [clusters, number of features].

yinyang_t real, the relative number of cluster groups, usually 0.1. 0 disables Yinyang refinement.

metric character vector, the name of the distance metric to use. The default is Euclidean (L2); it can be changed to "cos" to switch the algorithm to Spherical K-means with the angular distance. Please note that samples must be normalized in the latter case.

average_distance logical, the value indicating whether to calculate the average distance between cluster elements and the corresponding centroids. Useful for finding the best K. Returned as the third list element.

seed integer, random generator seed for reproducible results.

verbosity integer, 0 means complete silence, 1 means mere progress logging, 2 means lots of output.

return list(centroids, assignments[, average_distance]). Indices in assignments start from 1.

function knn_cuda(k, samples, centroids, assignments, metric="L2", device=0, verbosity=0)

k integer, the number of neighbors to search for each sample. Must be ≤ 2¹⁶.

samples real matrix of shape [number of samples, number of features] or list of real matrices which are rbind()-ed internally. In the latter case, it is possible to pass in more than INT32_MAX samples.

centroids real matrix with precalculated clusters' centroids (e.g., using kmeans() or kmeans_cuda()).

assignments integer vector with sample-cluster associations. Indices start from 1.

metric character vector, the name of the distance metric to use. The default is Euclidean (L2); it can be changed to "cos" to behave as Spherical K-means with the angular distance. Please note that samples must be normalized in that case.

verbosity integer, 0 means complete silence, 1 means mere progress logging, 2 means lots of output.

return integer matrix with neighbor indices. The shape is (number of samples, k). Indices start from 1.

C examples

example.c:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <kmcuda.h>

// ./example /path/to/data <number of clusters>
int main(int argc, const char **argv) {
  assert(argc == 3);
  // we open the binary file with the data
  // [samples_size][features_size][samples_size x features_size]
  FILE *fin = fopen(argv[1], "rb");
  assert(fin);
  uint32_t samples_size, features_size;
  assert(fread(&samples_size, sizeof(samples_size), 1, fin) == 1);
  assert(fread(&features_size, sizeof(features_size), 1, fin) == 1);
  uint64_t total_size = ((uint64_t)samples_size) * features_size;
  float *samples = malloc(total_size * sizeof(float));
  assert(samples);
  assert(fread(samples, sizeof(float), total_size, fin) == total_size);
  fclose(fin);
  int clusters_size = atoi(argv[2]);
  // we will store cluster centers here
  float *centroids = malloc(clusters_size * features_size * sizeof(float));
  assert(centroids);
  // we will store assignments of every sample here
  uint32_t *assignments = malloc(((uint64_t)samples_size) * sizeof(uint32_t));
  assert(assignments);
  float average_distance;
  KMCUDAResult result = kmeans_cuda(
      kmcudaInitMethodPlusPlus, NULL,  // kmeans++ centroids initialization
      0.01,                            // less than 1% of the samples are reassigned in the end
      0.1,                             // activate Yinyang refinement with 0.1 threshold
      kmcudaDistanceMetricL2,          // Euclidean distance
      samples_size, features_size, clusters_size,
      0xDEADBEEF,                      // random generator seed
      0,                               // use all available CUDA devices
      -1,                              // samples are supplied from host
      0,                               // not in float16x2 mode
      1,                               // moderate verbosity
      samples, centroids, assignments, &average_distance);
  free(samples);
  free(centroids);
  free(assignments);
  assert(result == kmcudaSuccess);
  printf("Average distance between a centroid and the corresponding "
         "cluster members: %f\n", average_distance);
  return 0;
}

Build:

gcc -std=c99 -O2 example.c -I/path/to/kmcuda.h/dir -L/path/to/libKMCUDA.so/dir -l KMCUDA -Wl,-rpath,. -o example

Run:

./example serban.bin 1024

The file format is the same as in serban/kmeans.
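An input file for example.c can be produced from Python with the standard `struct` module. The layout follows the comment in example.c: two uint32 values (samples_size, features_size) followed by the float32 samples in row major order. Little-endian byte order is an assumption (example.c reads in the machine's native order), and the helper names and the toy data are hypothetical.

```python
import os
import struct
import tempfile


def write_samples(path, samples):
    """Write [samples_size][features_size][data] as uint32, uint32, float32."""
    samples_size = len(samples)
    features_size = len(samples[0])
    with open(path, "wb") as fout:
        fout.write(struct.pack("<II", samples_size, features_size))
        for row in samples:
            fout.write(struct.pack("<%df" % features_size, *row))


def read_header(path):
    """Read back (samples_size, features_size)."""
    with open(path, "rb") as fin:
        return struct.unpack("<II", fin.read(8))


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_samples(path, [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
header = read_header(path)
size = os.path.getsize(path)
print(header, size)
```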

C API

KMCUDAResult kmeans_cuda(
    KMCUDAInitMethod init, float tolerance, float yinyang_t,
    KMCUDADistanceMetric metric, uint32_t samples_size, uint16_t features_size,
    uint32_t clusters_size, uint32_t seed, uint32_t device, int32_t device_ptrs,
    int32_t fp16x2, int32_t verbosity, const float *samples, float *centroids,
    uint32_t *assignments, float *average_distance)

init specifies the centroids initialization method: k-means++, random or import (in the latter case, centroids is read).

tolerance if the number of reassignments drops below this ratio, stop.

yinyang_t the relative number of cluster groups, usually 0.1.

metric The distance metric to use. The default is Euclidean (L2), can be changed to cosine to behave as Spherical K-means with the angular distance. Please note that samples must be normalized in that case.

samples_size number of samples.

features_size number of features. if fp16x2 is set, one half of the number of features.

clusters_size number of clusters.

seed random generator seed passed to srand().

device CUDA device OR-ed indices - usually 1. For example, 1 means using first device, 2 means second device, 3 means first and second device (2x speedup). Special value 0 enables all available devices.

device_ptrs configures the location of input and output. If it is negative, samples and returned arrays are on host, otherwise, they belong to the corresponding device. E.g., if device_ptrs is 0, samples is expected to be a pointer to device #0's memory and the resulting centroids and assignments are expected to be preallocated on device #0 as well. Usually this value is -1.

fp16x2 activates fp16 mode, two half-floats are packed into a single 32-bit float, features_size becomes effectively 2 times bigger, the returned centroids are fp16x2 too.

verbosity 0 - no output; 1 - progress output; >=2 - debug output.

samples input array of size samples_size x features_size in row major format.

centroids output array of centroids of size clusters_size x features_size in row major format.

assignments output array of cluster indices for each sample of size samples_size x 1.

average_distance output mean distance between cluster elements and the corresponding centroids. If nullptr, not calculated.

Returns KMCUDAResult (see kmcuda.h).
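The fp16x2 layout described above can be illustrated on the host side: every pair of half-floats occupies one 32-bit slot, so the features_size passed to the function is half the real number of features. In this sketch the helper name `pack_fp16x2` is hypothetical, and storing consecutive features next to each other in little-endian order is an assumption, not something the description above confirms.

```python
import struct


def pack_fp16x2(row):
    """Pack an even number of features as consecutive half-floats."""
    if len(row) % 2:
        raise ValueError("fp16x2 needs an even number of features")
    return struct.pack("<%de" % len(row), *row)


row = [1.0, 2.0, 3.0, 4.0]
packed = pack_fp16x2(row)
# Each 32-bit slot carries two features.
features_size = len(packed) // 4
print(len(packed), features_size)
```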

KMCUDAResult knn_cuda(
    uint16_t k, KMCUDADistanceMetric metric, uint32_t samples_size,
    uint16_t features_size, uint32_t clusters_size, uint32_t device,
    int32_t device_ptrs, int32_t fp16x2, int32_t verbosity,
    const float *samples, const float *centroids, const uint32_t *assignments,
    uint32_t *neighbors);

k integer, the number of neighbors to search for each sample.

samples_size number of samples.

features_size number of features. if fp16x2 is set, one half of the number of features.

clusters_size number of clusters.

device_ptrs configures the location of input and output. If it is negative, samples, centroids, assignments and the returned array are on host, otherwise, they belong to the corresponding device. E.g., if device_ptrs is 0, samples, centroids and assignments are expected to be pointers to device #0's memory and the resulting neighbors is expected to be preallocated on device #0 as well. Usually this value is -1.

fp16x2 activates fp16 mode, two half-floats are packed into a single 32-bit float, features_size becomes effectively 2 times bigger, affects samples and centroids.

verbosity 0 - no output; 1 - progress output; >=2 - debug output.

samples input array of size samples_size x features_size in row major format.

centroids input array of centroids of size clusters_size x features_size in row major format.

assignments input array of cluster indices for each sample of size samples_size x 1.

neighbors output array with the nearest neighbors of size samples_size x k in row major format.

Returns KMCUDAResult (see kmcuda.h).

kmcuda's Issues

lots of warning

Hi
I have built it successfully, but I got lots of warnings such as "warning : p2p 4<->0 is impossible". Please tell me how to fix it.
Thanks!

Issues running in fp16

Even though the fp16 unit tests run successfully, I cannot get clustering to work on my sample in fp16 mode.

When I use my data with .astype(np.float32), the clustering output looks something like:

data loaded
reassignments threshold: 50000
transposing the samples...
performing kmeans++...
done
too few clusters for this yinyang_t => Lloyd
iteration 1: 1000000 reassignments
iteration 2: 205760 reassignments
iteration 3: 103270 reassignments
iteration 4: 70388 reassignments
iteration 5: 53987 reassignments
iteration 6: 43521 reassignments
clustering done

With the exact same data but .astype(np.float16), however, I get:

iteration 1: 1000000 reassignments
iteration 2: 727174 reassignments
iteration 3: 691319 reassignments
iteration 4: 676380 reassignments
iteration 5: 663719 reassignments
iteration 6: 657045 reassignments
iteration 7: 648330 reassignments
iteration 8: 644405 reassignments
iteration 9: 645016 reassignments
iteration 10: 639175 reassignments
iteration 11: 636870 reassignments
iteration 12: 637613 reassignments
iteration 13: 623391 reassignments
iteration 14: 626752 reassignments
iteration 15: 635901 reassignments
iteration 16: 639952 reassignments
iteration 17: 640327 reassignments
iteration 18: 639718 reassignments
iteration 19: 652200 reassignments
iteration 20: 662045 reassignments
iteration 21: 676625 reassignments
iteration 22: 701241 reassignments
iteration 23: 693161 reassignments
iteration 24: 713261 reassignments
iteration 25: 709713 reassignments
iteration 26: 717198 reassignments
iteration 27: 735580 reassignments
iteration 28: 743289 reassignments
iteration 29: 745265 reassignments
iteration 30: 761803 reassignments
iteration 31: 762372 reassignments
iteration 32: 779398 reassignments
iteration 33: 778028 reassignments
iteration 34: 781619 reassignments
iteration 35: 786291 reassignments
iteration 36: 792249 reassignments
iteration 37: 799609 reassignments
iteration 38: 804822 reassignments
iteration 39: 799691 reassignments
iteration 40: 804918 reassignments
iteration 41: 823586 reassignments
iteration 42: 810885 reassignments
iteration 43: 827755 reassignments
iteration 44: 833164 reassignments
[...]

It does not converge....

I also tried casting the data to fp16 and then back to fp32 to lose precision on the dataset, but that still converged fine.

Any ideas?

(I tried on Tesla P100 as well as Titan X and 1080 pascal cards)

cuda error: the launch timed out and was terminated

The dataset dim is: (581977, 2000)

Here are the logs. Same happens with a single GPU.

Issue goes away when using smaller dataset.

arguments: 1 0x7ffef9c84304 0.010 0.00 1 581977 2000 10000 1550770014 0 0 2 0x7f3704cb7010 0x7f3964fda010 0x310f210 (nil)
reassignments threshold: 5819
yinyang groups: 0
GPU #0 memory: used 4836229120 bytes (40.3%), free 7160725504 bytes, total 11996954624 bytes
GPU #1 memory: used 4820631552 bytes (40.2%), free 7176323072 bytes, total 11996954624 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(18187, 63), (8, 32)>>> 581977, 2000, xyswap
performing kmeans++...
done
too few clusters for this yinyang_t => Lloyd
plans: [(0, 290992), (290992, 290985)]
planc: [(0, 5000), (5000, 5000)]
cudaMemcpyFromSymbol(&my_changed, d_changed_number, sizeof(my_changed))
/tmp/pip-req-build-bqlfsbw6/src/kmeans.cu:705 -> the launch timed out and was terminated
kmeans_cuda_yy failed: the launch timed out and was terminated

Could NOT find OpenMP (missing: OpenMP_C_FLAGS OpenMP_CXX_FLAGS)

I installed openmpi. Machine: OS X 10.12

grok-machine:kmcuda dendisuhubdy$ cmake -DCMAKE_BUILD_TYPE=Release . && make
-- Try OpenMP C flag = [ ]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-fopenmp=libomp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [/openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-Qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-xopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [+Oopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-qsmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-mp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [ ]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-fopenmp=libomp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [/openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-Qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-xopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [+Oopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-qsmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-mp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
CMake Error at /usr/local/Cellar/cmake/3.7.2/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:138 (message):
  Could NOT find OpenMP (missing: OpenMP_C_FLAGS OpenMP_CXX_FLAGS)
Call Stack (most recent call first):
  /usr/local/Cellar/cmake/3.7.2/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  /usr/local/Cellar/cmake/3.7.2/share/cmake/Modules/FindOpenMP.cmake:316 (find_package_handle_standard_args)
  CMakeLists.txt:3 (find_package)

predict method

Thanks for the lib! Is there a method like kmeans_gpu.predict that I can use to predict which cluster a new point belongs to?

Install issue: Recipe for target 'all' failed

Hi,

I have been trying to install kmcuda on my system which has Nvidia's GP102 Pascal TitanX and cuda 9.2. But, for some weird reason, I have been facing the following issue when I run the command as:

cmake -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH=61 - DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 . && make

And, this is what I get:

[ 33%] Building NVCC (Device) object CMakeFiles/KMCUDA.dir/KMCUDA_generated_kmeans.cu.o
/usr/local/cuda/include/cuda_fp16.hpp(161): error: invalid redeclaration of member function "__half::operator=(const __half_raw &)"
(160): here

/usr/local/cuda/include/cuda_fp16.hpp(162): error: invalid redeclaration of member function "__half::operator=(const __half_raw &)"
(160): here

/usr/local/cuda/include/cuda_fp16.hpp(164): error: invalid redeclaration of member function "__half::operator __half_raw() const"
(163): here

3 errors detected in the compilation of "/tmp/tmpxft_00008f6c_00000000-6_kmeans.cpp1.ii".
CMake Error at KMCUDA_generated_kmeans.cu.o.Release.cmake:279 (message):
  Error generating file
  /mlodata1/sidak/projects/kmcuda_cuda=61/kmcuda/src/CMakeFiles/KMCUDA.dir//./KMCUDA_generated_kmeans.cu.o


CMakeFiles/KMCUDA.dir/build.make:63: recipe for target 'CMakeFiles/KMCUDA.dir/KMCUDA_generated_kmeans.cu.o' failed
make[2]: *** [CMakeFiles/KMCUDA.dir/KMCUDA_generated_kmeans.cu.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/KMCUDA.dir/all' failed
make[1]: *** [CMakeFiles/KMCUDA.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

It would be great if someone can help me on this! Thanks a lot! 😄

PS: I really like kmcuda's support for fast clustering as well as the angular distance. I have now moved to a different machine and am stuck on this.

[SOLVED] Build from source on RHEL 7.5 (GCC 4.8.5) and CUDA 9.0

Hi,

In case somebody is in a similar situation: when compiling from source I got this error:

Scanning dependencies of target KMCUDA
[ 66%] Building CXX object CMakeFiles/KMCUDA.dir/kmcuda.cc.o
/opt/kmcuda/kmcuda_2018-09-17/src/kmcuda.cc:206:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
       #pragma omp simd
 ^
/opt/kmcuda/kmcuda_2018-09-17/src/kmcuda.cc:310:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
           #pragma omp simd reduction(+:dist_sum2)
 ^
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/KMCUDA.dir/kmcuda.cc.o] Error 1
make[1]: *** [CMakeFiles/KMCUDA.dir/all] Error 2
make: *** [all] Error 2

After switching to my manually compiled GCC 5.5.0, the build finishes at 100%. Here is the cmake3 invocation; adapt the Python and GCC paths to your system.

$ sudo git clone https://github.com/src-d/kmcuda
$ cd kmcuda; sudo mkdir build; cd build
$ sudo cmake3 -DCMAKE_BUILD_TYPE=Release \
-DPYTHON_EXECUTABLE:FILEPATH=/opt/intel/intelpython35/bin/python3.5 \
-DPYTHON_INCLUDE_DIR:PATH=/opt/intel/intelpython35/include/python3.5m \
-DCUDA_NVCC_FLAGS="-ccbin /opt/gcc/GCC-5.5.0/bin/gcc" \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.0 \
-DCMAKE_CXX_COMPILER:FILEPATH=/opt/gcc/GCC-5.5.0/bin/c++ \
-DCMAKE_CXX_COMPILER_AR:FILEPATH=/opt/gcc/GCC-5.5.0/bin/gcc-ar \
-DCMAKE_CXX_COMPILER_RANLIB:FILEPATH=/opt/gcc/GCC-5.5.0/bin/gcc-ranlib \
-DCMAKE_C_COMPILER:FILEPATH=/opt/gcc/GCC-5.5.0/bin/gcc \
-DCMAKE_C_COMPILER_AR:FILEPATH=/opt/gcc/GCC-5.5.0/bin/gcc-ar \
-DCMAKE_C_COMPILER_RANLIB:FILEPATH=/opt/gcc/GCC-5.5.0/bin/gcc-ranlib \
-DDISABLE_R=y \
../src/

Importing in Python worked, and 3 examples from the main page worked as well. I haven't run the unit tests or other examples yet.

internal bug inside kmeans_init_centroids: dist_sum is NaN

Hello! Thanks for your code, which is really helpful.

I have encountered a problem. I can't pass line 296 in kmcuda.cc 'assert(dist_sum == dist_sum); ' because 'dist_sum is NaN'. I don't know why.

My sample_size is 1.5M and my feature_size is 8. When I set the cluster_size to 1000, it works fine, but when I set the cluster_size to 1500 or 2000, it fails.

Is it because of this? "Lloyd is tolerant to samples with NaN features while Yinyang is not. It may happen that some of the resulting clusters contain zero elements. In such cases, their features are set to NaN."

My GPU is a Tesla P100 and it has enough memory.

Thanks for your attention!
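One thing that may be worth ruling out first is NaN or infinite values in the input itself, since a single such value would propagate into a distance sum. This is only a guess at the cause, not a confirmed diagnosis. A minimal, dependency-free sketch (`non_finite_rows` is a hypothetical helper name; with NumPy arrays, `numpy.isfinite` would be the faster route):

```python
import math


def non_finite_rows(samples):
    """Return the indices of rows that contain NaN or +/-inf."""
    return [
        i for i, row in enumerate(samples)
        if not all(math.isfinite(x) for x in row)
    ]
```

If the list comes back empty, the input is clean and the NaN has to originate somewhere else.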

Buffered input for very large dataset.

How about adding buffered input for large datasets?
For example, ~500M samples in 64 dimensions into ~150M clusters.

Using more than 2 GPUs

Hello!
I have more than two GPUs and I want to run kmcuda on 3 and 4 devices. How can I do it?
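As far as I understand the README, `device` is a bitmask of CUDA device indices (1 is the first GPU, 2 the second, 3 the first two), and 0 selects all available devices. Under that assumption, the mask for any set of GPUs can be built like this (`device_mask` is a hypothetical helper, not part of libKMCUDA):

```python
def device_mask(indices):
    """Combine zero-based CUDA device indices into a bitmask."""
    mask = 0
    for i in indices:
        if i < 0:
            raise ValueError("device index must be non-negative")
        mask |= 1 << i  # bit i selects device i
    return mask
```

For example, `device_mask([0, 1, 2])` is 7 and `device_mask(range(4))` is 15, which would then be passed as `device=...` to kmeans_cuda.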

cudaMemcpy failed

Hi,

I'm trying to get the KMCuda library running in Python, but I am getting the following error:

root:INFO:07:04:43 Clustering users into 50 clusters...
reassignments threshold: 470
transposing the samples...
performing kmeans++...
done            
too few clusters for this yinyang_t => Lloyd
iteration 1: 47094 reassignments
iteration 2: 10551 reassignments
iteration 3: 6259 reassignments
iteration 4: 4474 reassignments
iteration 5: 3621 reassignments
iteration 6: 2952 reassignments
iteration 7: 2536 reassignments
iteration 8: 2156 reassignments
iteration 9: 1895 reassignments
iteration 10: 1757 reassignments
iteration 11: 1527 reassignments
iteration 12: 1370 reassignments
iteration 13: 1213 reassignments
iteration 14: 1111 reassignments
iteration 15: 1038 reassignments
iteration 16: 903 reassignments
iteration 17: 919 reassignments
iteration 18: 870 reassignments
iteration 19: 835 reassignments
iteration 20: 766 reassignments
iteration 21: 725 reassignments
iteration 22: 673 reassignments
iteration 23: 637 reassignments
iteration 24: 580 reassignments
iteration 25: 580 reassignments
iteration 26: 555 reassignments
iteration 27: 557 reassignments
iteration 28: 503 reassignments
iteration 29: 495 reassignments
iteration 30: 509 reassignments
iteration 31: 486 reassignments
iteration 32: 454 reassignments
/home/sourced/Projects/kmcuda/src/kmcuda.cc:515 -> invalid argument
  File "train.py", line 900, in do_cluster
    a, assignments = kmeans_cuda(data, clusters=n_clusters, device=2, verbosity=1, yinyang_t=0)
RuntimeError: cudaMemcpy failed

I'm running this on a machine with 4 GPUs. After some experimenting, I get that error whenever I set device to something other than {1,3,7,15}, i.e. there seems to be some relation between this error and not using the first GPU. Any hints?

Thanks in advance!!

CUDA device arch 37 does not support fp16

I get the following error (AWS p2.xlarge instance) with a list of descriptors returned by OpenCV SIFT extraction (des in kp, des = sift.detectAndCompute(image, None)).

arguments: 1 0x7fff855fff14 0.010 0.10 0 72718 64 1000 3 0 1 3 0x7f379c9da010 0x3c923c0 0x3cd0bd0 (nil)
CUDA device arch 37 does not support fp16

This example is working though:

import numpy
from libKMCUDA import kmeans_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
arr[:2500] = numpy.random.rand(2500, 2) + [0, 2]
arr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]
arr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]
arr[7500:] = numpy.random.rand(2500, 2) - [2, 0]
centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
print(centroids)

So I'm guessing that I need to transform the numpy array returned by opencv? Sorry, pretty new to Python.

Does this project have C++ API?

Hi, I want to use custom initial cluster centers, like this:

initCluCenter=np.array([[0,0],[4,8],[6,3]]).astype(np.float32)
centroids, assignments, avg_distance = kmeans_cuda(samplesMatrix_,len(initCluCenter), init=initCluCenter, verbosity=0, seed=3)

Trying to install on Ubuntu 16.04

Hello, thanks for making a very cool package.

I am trying to install it on Ubuntu 16.04. I installed python3-dev and that resolved some of the python dependency issues but it is now failing with the following errors:

charles@srv-hanson-ml1:~/kmcuda-master$ cmake -DCMAKE_BUILD_TYPE=Release . && make
-- Configuring done
-- Generating done
-- Build files have been written to: /home/charles/kmcuda-master
[ 25%] Building CXX object CMakeFiles/KMCUDA.dir/python.cpp.o
/home/charles/kmcuda-master/python.cpp: In function ‘PyObject* py_kmeans_cuda(PyObject*, PyObject*, PyObject*)’:
/home/charles/kmcuda-master/python.cpp:109:53: error: invalid conversion from ‘int’ to ‘const float*’ [-fpermissive]
       verbosity, -1, samples, centroids, assignments);
                                                     ^
/home/charles/kmcuda-master/python.cpp:109:53: error: cannot convert ‘float*’ to ‘uint32_t* {aka unsigned int*}’ for argument ‘12’ to ‘int kmeans_cuda(bool, float, float, uint32_t, uint16_t, uint32_t, uint32_t, uint32_t, int32_t, const float*, float*, uint32_t*)’
CMakeFiles/KMCUDA.dir/build.make:378: recipe for target 'CMakeFiles/KMCUDA.dir/python.cpp.o' failed
make[2]: *** [CMakeFiles/KMCUDA.dir/python.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/KMCUDA.dir/all' failed
make[1]: *** [CMakeFiles/KMCUDA.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

Any help appreciated!

Installation Issue : subprocess.CalledProcessError: Command '('make', '-j8')' returned non-zero exit status 2

Hello,

I am trying to set up this project to use the K-means GPU implementation in Python.
During setup, I get an error when running the command:

pip install git+https://github.com/src-d/kmcuda.git#subdirectory=src

Error:
Collecting git+https://github.com/src-d/kmcuda.git#subdirectory=src
Cloning https://github.com/src-d/kmcuda.git to /tmp/pip-NGyzQG-build
Requirement already satisfied: numpy in /usr/local/lib64/python2.7/site-packages (from libKMCUDA==6.2.1)
Installing collected packages: libKMCUDA
Running setup.py install for libKMCUDA ... error
Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-NGyzQG-build/src/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-PJz8TS-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
-- The C compiler identification is GNU 4.8.3
-- The CXX compiler identification is GNU 4.8.3
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.4.3", minimum required is "3")
-- Found PythonLibs: /usr/lib64/libpython3.4m.so (found suitable version "3.4.3", minimum required is "3")
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/pip-NGyzQG-build/src
[ 16%] Building NVCC (Device) object CMakeFiles/KMCUDA.dir/KMCUDA_generated_transpose.cu.o
[ 33%] Building NVCC (Device) object CMakeFiles/KMCUDA.dir/KMCUDA_generated_kmeans.cu.o
[ 50%] Building NVCC (Device) object CMakeFiles/KMCUDA.dir/KMCUDA_generated_knn.cu.o
nvcc fatal : Value 'sm_61' is not defined for option 'gpu-architecture'
nvcc fatal : Value 'sm_61' is not defined for option 'gpu-architecture'
nvcc fatal : Value 'sm_61' is not defined for option 'gpu-architecture'
CMake Error at KMCUDA_generated_transpose.cu.o.Release.cmake:207 (message):
Error generating
/tmp/pip-NGyzQG-build/src/CMakeFiles/KMCUDA.dir//./KMCUDA_generated_transpose.cu.o

CMake Error at KMCUDA_generated_kmeans.cu.o.Release.cmake:207 (message):
Error generating
/tmp/pip-NGyzQG-build/src/CMakeFiles/KMCUDA.dir//./KMCUDA_generated_kmeans.cu.o

CMake Error at KMCUDA_generated_knn.cu.o.Release.cmake:207 (message):
Error generating
/tmp/pip-NGyzQG-build/src/CMakeFiles/KMCUDA.dir//./KMCUDA_generated_knn.cu.o

make[2]: *** [CMakeFiles/KMCUDA.dir/KMCUDA_generated_transpose.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [CMakeFiles/KMCUDA.dir/KMCUDA_generated_kmeans.cu.o] Error 1
make[2]: *** [CMakeFiles/KMCUDA.dir/KMCUDA_generated_knn.cu.o] Error 1
make[1]: *** [CMakeFiles/KMCUDA.dir/all] Error 2
make: *** [all] Error 2
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-NGyzQG-build/src/setup.py", line 80, in
"Programming Language :: Python :: 3.6",
File "/usr/lib64/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/usr/lib64/python2.7/distutils/command/install.py", line 604, in run
self.run_command('build')
File "/usr/lib64/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib64/python2.7/distutils/command/build.py", line 127, in run
self.run_command(cmd_name)
File "/usr/lib64/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/tmp/pip-NGyzQG-build/src/setup.py", line 19, in run
self._build()
File "/tmp/pip-NGyzQG-build/src/setup.py", line 44, in _build
check_call(("make", "-j%d" % cpu_count()))
File "/usr/lib64/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('make', '-j8')' returned non-zero exit status 2

Command "/usr/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-NGyzQG-build/src/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-PJz8TS-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-NGyzQG-build/src

Can anybody please help me resolve this issue?
I am using CUDA 7.5.

Thanks,
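The `nvcc fatal : Value 'sm_61' is not defined` lines are consistent with the toolkit being too old for the requested architecture: Pascal targets (sm_60, sm_61) were introduced in CUDA 8.0, so CUDA 7.5 cannot compile for them. Here is a small sketch of that check. The table is hand-written from NVIDIA release history and covers only a few architectures, and `toolkit_supports` is a hypothetical helper, not part of kmcuda:

```python
# First CUDA toolkit release that can target each architecture (partial table).
MIN_CUDA_FOR_ARCH = {
    60: (8, 0),
    61: (8, 0),
    70: (9, 0),
    75: (10, 0),
    80: (11, 0),
}


def toolkit_supports(arch, cuda_version):
    """Return True if a toolkit such as "7.5" can compile for sm_<arch>."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    return (major, minor) >= MIN_CUDA_FOR_ARCH[arch]
```

Here `toolkit_supports(61, "7.5")` is False, so the options would be a newer toolkit or an older target architecture that matches the actual GPU.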

Is it possible to find the nearest neighbor of a new data point

Great library, thanks a lot!

I would like to use it for solving the following problem: I have two sets of points e.g. in 3D-space. With the first set I perform a kmeans-clustering and then I would like to identify for each point in the second set its nearest neighbor in the first. As far as I understand, the array "samples" in knn_cuda is the same as in kmeans_cuda and the function returns the nearest neighbors within "samples". I could not see a way to provide a second array.

I hope that I did not miss something obvious. Any help would be highly appreciated.
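I don't see a second-array argument either, so here is a host-side sketch of the same triangle-inequality idea, assuming the L2 metric and the centroids and assignments returned by kmeans_cuda on the first set. `build_index` and `nearest_neighbor` are hypothetical helpers, not part of libKMCUDA, and `math.dist` needs Python 3.8 or newer:

```python
import math


def build_index(samples, centroids, assignments):
    """Group sample indices by cluster and record each cluster's radius."""
    members = [[] for _ in centroids]
    radii = [0.0] * len(centroids)
    for i, k in enumerate(assignments):
        members[k].append(i)
        radii[k] = max(radii[k], math.dist(samples[i], centroids[k]))
    return members, radii


def nearest_neighbor(query, samples, centroids, members, radii):
    """Exact nearest sample to `query`, pruning whole clusters where possible."""
    # Visit non-empty clusters from the closest centroid outwards.
    order = sorted(
        (math.dist(query, c), k) for k, c in enumerate(centroids) if members[k]
    )
    best_i, best_d = -1, math.inf
    for d_qc, k in order:
        # Triangle inequality: every member s of cluster k satisfies
        # dist(query, s) >= dist(query, c_k) - radius_k.
        if d_qc - radii[k] >= best_d:
            continue
        for i in members[k]:
            d = math.dist(query, samples[i])
            if d < best_d:
                best_i, best_d = i, d
    return best_i, best_d
```

Build the index once for the first set, then call `nearest_neighbor` for each point in the second set. For large inputs this loop would need to be vectorized or moved to the GPU.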

Illegal instruction (core dumped)

Hi, @vmarkovtsev
I ran into the same problem, "Segmentation fault (core dumped)", when running test.py.
I use Ubuntu 16.04, Python 3.5.2 and CUDA 8.0.
I tried pip install libKMCUDA;
the log shows:

The directory '/home/deeptexas-1/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/deeptexas-1/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting libKMCUDA
  Downloading https://files.pythonhosted.org/packages/b9/ae/db37581df07f7e225fac11e62cb8231437949780f41e8542979bed7aaba3/libKMCUDA-6.2.1-cp35-cp35m-manylinux1_x86_64.whl (643kB)
    100% |████████████████████████████████| 645kB 61kB/s 
Requirement already satisfied: numpy in ./.local/lib/python3.5/site-packages (from libKMCUDA) (1.15.0)
Installing collected packages: libKMCUDA
Successfully installed libKMCUDA-6.2.1

but when I try to run this sample

import numpy
from matplotlib import pyplot
from libKMCUDA import kmeans_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
arr[:2500] = numpy.random.rand(2500, 2) + [0, 2]
arr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]
arr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]
arr[7500:] = numpy.random.rand(2500, 2) - [2, 0]
centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
print(centroids)
pyplot.scatter(arr[:, 0], arr[:, 1], c=assignments)
pyplot.scatter(centroids[:, 0], centroids[:, 1], c="white", s=150)

it shows this error:

arguments: 1 0x7ffe7faab5e4 0.010 0.10 0 10000 2 4 3 0 0 2 0x32b4ff0 0x32f3ef0 0x33585e0 (nil)
Illegal instruction (core dumped)

So I tried to build from source with the command:

cmake -DCMAKE_BUILD_TYPE=Release -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0 .
make

and the build log is in
kmcuda_build_log.txt

My CPU is an E5-1650, the motherboard has an Intel C602 chipset, and there is 32 GB of DDR3 RAM.

Thanks!

compute capability mismatch for device 0: wanted 6.1, have 6.0

I am getting this error on rhel 7.5 with Python 3.6.5, CUDA 8.0 (V8.0.61), and gcc 4.9.2.
It looks as though I may need CUDA V8.0.60 instead of V8.0.61?

[root@e35559eae255 kmcuda]# python pythontest.py
reassignments threshold: 100
compute capability mismatch for device 0: wanted 6.1, have 6.0

you may want to build kmcuda with -DCUDA_ARCH=60 (refer to "Building" in README.md)
compute capability mismatch for device 1: wanted 6.1, have 6.0
you may want to build kmcuda with -DCUDA_ARCH=60 (refer to "Building" in README.md)
compute capability mismatch for device 2: wanted 6.1, have 6.0
you may want to build kmcuda with -DCUDA_ARCH=60 (refer to "Building" in README.md)
Traceback (most recent call last):
File "pythontest.py", line 11, in
centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
ValueError: No such CUDA device exists

[root@e35559eae255 kmcuda]# cat pythontest.py
import numpy
from matplotlib import pyplot
from libKMCUDA import kmeans_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
arr[:2500] = numpy.random.rand(2500, 2) + [0, 2]
arr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]
arr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]
arr[7500:] = numpy.random.rand(2500, 2) - [2, 0]
centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
print(centroids)
#pyplot.scatter(arr[:, 0], arr[:, 1], c=assignments)
#pyplot.scatter(centroids[:, 0], centroids[:, 1], c="white", s=150)

I have the following packages installed:

[root@e35559eae255 kmcuda]# pip list installed
Package Version

absl-py 0.7.0
appdirs 1.4.3
astor 0.7.1
atomicwrites 1.3.0
attrs 18.2.0
cycler 0.10.0
decorator 4.3.2
gast 0.2.2
grpcio 1.18.0
h5py 2.9.0
javapackages 4.3.2
Keras-Applications 1.0.7
Keras-Preprocessing 1.0.9
kiwisolver 1.0.1
libKMCUDA 6.2.2
Mako 1.0.7
Markdown 3.0.1
MarkupSafe 1.1.0
matplotlib 3.0.2
more-itertools 6.0.0
numpy 1.16.1
pip 19.0.3
pluggy 0.8.1
protobuf 3.6.1
py 1.8.0
pycuda 2018.1.1
pyparsing 2.3.1
pytest 4.3.0
python-dateutil 2.8.0
pytools 2019.1
PyXB 1.2.4
scikit-learn 0.20.2
scipy 1.2.1
setuptools 40.8.0
six 1.12.0
tensorboard 1.12.2
tensorflow 1.12.0
termcolor 1.1.0
Werkzeug 0.14.1
wheel 0.33.1

I have tried the following CMake command to build, but it has not resolved the problem. Suggestions?
cmake -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH=60 . && make
and installed with
pip install git+https://github.com/src-d/kmcuda.git#subdirectory=src
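The message itself names the fix: the library was built for compute capability 6.1, while the devices report 6.0. Note that the pip install from git runs its own build, so flags given to an earlier manual cmake run presumably do not carry over to it (an assumption, not verified here). As a small aid, this sketch pulls the needed `CUDA_ARCH` value out of such a log. `suggested_archs` is a hypothetical helper and the pattern is modelled only on the lines shown above:

```python
import re

MISMATCH = re.compile(
    r"compute capability mismatch for device (\d+): "
    r"wanted (\d+)\.(\d+), have (\d+)\.(\d+)"
)


def suggested_archs(log):
    """Return the sorted CUDA_ARCH values implied by the 'have X.Y' fields."""
    return sorted(
        {int(m.group(4)) * 10 + int(m.group(5)) for m in MISMATCH.finditer(log)}
    )
```

For the log above this yields `[60]`, matching the `-DCUDA_ARCH=60` hint printed by the library.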

test.py fails

Running python3 test.py gives this traceback:

EEE
arguments: 1 0.010 0.10 1 1000 256 10 3 0 0 3 0x7f50873bd010 0x3305110 0x3307920
reassignments threshold: 10
yinyang groups: 1
[0] device_samples: 0x4214380000 - 0x421447a000 (1024000)
[1] device_samples: 0x4214480000 - 0x421457a000 (1024000)
[2] device_samples: 0x4214580000 - 0x421467a000 (1024000)
[3] device_samples: 0x4214680000 - 0x421477a000 (1024000)
[0] device_centroids: 0x4214780000 - 0x4214782800 (10240)
[1] device_centroids: 0x4214880000 - 0x4214882800 (10240)
[2] device_centroids: 0x4214980000 - 0x4214982800 (10240)
[3] device_centroids: 0x4214a80000 - 0x4214a82800 (10240)
[0] device_assignments: 0x4214b80000 - 0x4214b80fa0 (4000)
[1] device_assignments: 0x4214c80000 - 0x4214c80fa0 (4000)
[2] device_assignments: 0x4214d80000 - 0x4214d80fa0 (4000)
[3] device_assignments: 0x4214e80000 - 0x4214e80fa0 (4000)
[0] device_assignments_prev: 0x4214b81000 - 0x4214b81fa0 (4000)
[1] device_assignments_prev: 0x4214c81000 - 0x4214c81fa0 (4000)
[2] device_assignments_prev: 0x4214d81000 - 0x4214d81fa0 (4000)
[3] device_assignments_prev: 0x4214e81000 - 0x4214e81fa0 (4000)
[0] device_ccounts: 0x4214f80000 - 0x4214f80028 (40)
[1] device_ccounts: 0x4215080000 - 0x4215080028 (40)
[2] device_ccounts: 0x4215180000 - 0x4215180028 (40)
[3] device_ccounts: 0x4215280000 - 0x4215280028 (40)
[0] device_assignments_yy: 0x4214f80200 - 0x4214f80228 (40)
[1] device_assignments_yy: 0x4215080200 - 0x4215080228 (40)
[2] device_assignments_yy: 0x4215180200 - 0x4215180228 (40)
[3] device_assignments_yy: 0x4215280200 - 0x4215280228 (40)
[0] device_bounds_yy: 0x4214b82000 - 0x4214b827d0 (2000)
[1] device_bounds_yy: 0x4214c82000 - 0x4214c827d0 (2000)
[2] device_bounds_yy: 0x4214d82000 - 0x4214d827d0 (2000)
[3] device_bounds_yy: 0x4214e82000 - 0x4214e827d0 (2000)
[0] device_drifts_yy: 0x4214782800 - 0x4214785028 (10280)
[1] device_drifts_yy: 0x4214882800 - 0x4214885028 (10280)
[2] device_drifts_yy: 0x4214982800 - 0x4214985028 (10280)
[3] device_drifts_yy: 0x4214a82800 - 0x4214a85028 (10280)
[0] device_passed_yy: 0x4214f80400 - 0x4214f807e8 (1000)
[1] device_passed_yy: 0x4215080400 - 0x42150807e8 (1000)
[2] device_passed_yy: 0x4215180400 - 0x42151807e8 (1000)
[3] device_passed_yy: 0x4215280400 - 0x42152807e8 (1000)
[0] device_centroids_yy: 0x4214f80800 - 0x4214f80c00 (1024)
[1] device_centroids_yy: 0x4215080800 - 0x4215080c00 (1024)
[2] device_centroids_yy: 0x4215180800 - 0x4215180c00 (1024)
[3] device_centroids_yy: 0x4215280800 - 0x4215280c00 (1024)
GPU #0 memory: used 281280512 bytes (2.3%), free 11714297856 bytes, total 11995578368 bytes
GPU #1 memory: used 70647808 bytes (0.6%), free 11924930560 bytes, total 11995578368 bytes
GPU #2 memory: used 70647808 bytes (0.6%), free 11924930560 bytes, total 11995578368 bytes
GPU #3 memory: used 72744960 bytes (0.6%), free 11922833408 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/mgr/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.010 0.10 1 10000 2 4 3 1 0 2 0x44ca270 0x44c84d0 0x44f1390
reassignments threshold: 100
yinyang groups: 0
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.010 0.10 1 10000 1 4 3 1 1 2 0x44ca010 0x2f5b7b0 0x44e74f0
reassignments threshold: 100
yinyang groups: 0
GPU #0 memory: used 279183360 bytes (2.3%), free 11716395008 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.050 0.00 0 13000 1 50 3 1 1 2 0x44ca010 0x32e77c0 0x44d6b40
reassignments threshold: 650
yinyang groups: 0
GPU #0 memory: used 279183360 bytes (2.3%), free 11716395008 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
EEEEEEEEE
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 1.000 0.00 0 13000 2 50 3 1 0 2 0x32e78b0 0x3305df0 0x44ca010
reassignments threshold: 13000
yinyang groups: 0
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.010 0.10 0 13000 1 50 3 1 1 2 0x44ca010 0x32e77c0 0x44d6b40
reassignments threshold: 130
yinyang groups: 5
reusing passed_yy for centroids_yy
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 1 50 3 1 1 2 0x44ca010 0x32e77c0 0x44d6b40
reassignments threshold: 650
yinyang groups: 0
GPU #0 memory: used 279183360 bytes (2.3%), free 11716395008 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.250 0.00 0 13000 2 50 3 1 0 2 0x32e78b0 0x33078d0 0x44ca010
reassignments threshold: 3250
yinyang groups: 0
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.050 0.00 0 13000 2 50 3 1 0 2 0x32e78b0 0x44c0c80 0x44ca010
reassignments threshold: 650
yinyang groups: 0
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.050 0.00 0 13000 2 50 3 3 0 2 0x32e78b0 0x44c0c80 0x44ca010
reassignments threshold: 650
yinyang groups: 0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #1
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
GPU #1 memory: used 69599232 bytes (0.6%), free 11925979136 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 1 0.010 0.10 0 13000 2 50 3 1 0 2 0x32e78b0 0x44c1c20 0x44ca010
reassignments threshold: 130
yinyang groups: 5
reusing passed_yy for centroids_yy
GPU #0 memory: used 281280512 bytes (2.3%), free 11714297856 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 2 50 3 1 0 2 0x32e78b0 0x44c1e00 0x44ca010
reassignments threshold: 650
yinyang groups: 0
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 2 50 3 3 0 2 0x32e78b0 0x44c14b0 0x44ca010
reassignments threshold: 650
yinyang groups: 0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #1
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
GPU #1 memory: used 69599232 bytes (0.6%), free 11925979136 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
E.EEEE
arguments: 0 0.050 0.00 0 13000 2 50 3 65535 0 2 0x32e78b0 0x3306680 0x44ca010
arguments: 0 0.050 0.00 0 13000 2 50 3 0 0 2 0x32e78b0 0x44c97e0 0x44ca010
reassignments threshold: 650
yinyang groups: 0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
GPU #1 memory: used 69599232 bytes (0.6%), free 11925979136 bytes, total 11995578368 bytes
GPU #2 memory: used 69599232 bytes (0.6%), free 11925979136 bytes, total 11995578368 bytes
GPU #3 memory: used 71696384 bytes (0.6%), free 11923881984 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 2 50 3 2 0 2 0x4228700000 0x4228800000 0x4228900000
reassignments threshold: 650
yinyang groups: 0
p2p is already enabled on gpu #1
p2p is already enabled on gpu #0
GPU #1 memory: used 136052736 bytes (1.1%), free 11859525632 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 2 50 3 0 0 2 0x32e78b0 0x569ac60 0x45168d0
reassignments threshold: 650
yinyang groups: 0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
GPU #0 memory: used 280231936 bytes (2.3%), free 11715346432 bytes, total 11995578368 bytes
GPU #1 memory: used 71696384 bytes (0.6%), free 11923881984 bytes, total 11995578368 bytes
GPU #2 memory: used 71696384 bytes (0.6%), free 11923881984 bytes, total 11995578368 bytes
GPU #3 memory: used 71696384 bytes (0.6%), free 11923881984 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 2 50 3 1 0 2 0x4228700000 0x4228800200 0x422890cc00
reassignments threshold: 650
yinyang groups: 0
GPU #0 memory: used 344588288 bytes (2.9%), free 11650990080 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol
arguments: 0 0.050 0.00 0 13000 2 50 3 0 0 2 0x4228700000 0x4228800400 0x4228919800
reassignments threshold: 650
yinyang groups: 0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #1
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #2
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
p2p is already enabled on gpu #3
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
p2p is already enabled on gpu #0
GPU #0 memory: used 344588288 bytes (2.9%), free 11650990080 bytes, total 11995578368 bytes
GPU #1 memory: used 136052736 bytes (1.1%), free 11859525632 bytes, total 11995578368 bytes
GPU #2 memory: used 13605273
E
======================================================================
ERROR: test_256_features (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 369, in test_256_features
    yinyang_t=0.1, seed=3)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_cosine_metric (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 347, in test_cosine_metric
    seed=3)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_fp16_cosine_metric (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 438, in test_fp16_cosine_metric
    seed=3)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_fp16_kmeanspp_lloyd (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 397, in test_fp16_kmeanspp_lloyd
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_fp16_kmeanspp_validate (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 407, in test_fp16_kmeanspp_validate
    verbosity=2, seed=3, tolerance=1.0, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_fp16_kmeanspp_yinyang (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 422, in test_fp16_kmeanspp_yinyang
    verbosity=2, seed=3, tolerance=0.01, yinyang_t=0.1)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_fp16_random_lloyd (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 379, in test_fp16_random_lloyd
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_import_lloyd (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 211, in test_import_lloyd
    verbosity=2, seed=3, tolerance=0.25, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_kmeanspp_lloyd (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 195, in test_kmeanspp_lloyd
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_kmeanspp_lloyd_2gpus (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 233, in test_kmeanspp_lloyd_2gpus
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_kmeanspp_yinyang (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 203, in test_kmeanspp_yinyang
    verbosity=2, seed=3, tolerance=0.01, yinyang_t=0.1)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 182, in test_random_lloyd
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd_2gpus (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 223, in test_random_lloyd_2gpus
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd_all_gpus (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 249, in test_random_lloyd_all_gpus
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd_different_device_ptr (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 322, in test_random_lloyd_different_device_ptr
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd_host_ptr (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 261, in test_random_lloyd_host_ptr
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd_same_device_ptr (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 274, in test_random_lloyd_same_device_ptr
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

======================================================================
ERROR: test_random_lloyd_same_device_ptr_all_devs (__main__.KMCUDATests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 298, in test_random_lloyd_same_device_ptr_all_devs
    verbosity=2, seed=3, tolerance=0.05, yinyang_t=0)
RuntimeError: cudaMemcpy failed

----------------------------------------------------------------------
Ran 19 tests in 8.968s

FAILED (errors=18)
6 bytes (1.1%), free 11859525632 bytes, total 11995578368 bytes
GPU #3 memory: used 136052736 bytes (1.1%), free 11859525632 bytes, total 11995578368 bytes
cudaMemcpyToSymbol(d_samples_size, &h_samples_size, sizeof(h_samples_size))
/home/blabla/projects/kmcuda/kernel.cu:640 -> invalid device symbol
kmeans_cuda_setup failed: invalid device symbol

Hardware: Tesla K80, CUDA 8.0, Python 3, Ubuntu 14.04

"Segmentation fault (core dumped)" When running test.py

When I run test.py, I get "Segmentation fault (core dumped)".
I use Python 3.6.5 and CUDA 9.0.
The build log is attached:
kmcuda.log

Does anyone have the same issue?

passing samples data to kmeans_cuda

Hello,

I have a question about the samples data type. The k-means function signature is as follows:

KMCUDAResult kmeans_cuda(
    KMCUDAInitMethod init, const void *init_params, float tolerance, float yinyang_t,
    KMCUDADistanceMetric metric, uint32_t samples_size, uint16_t features_size,
    uint32_t clusters_size, uint32_t seed, uint32_t device, int32_t device_ptrs,
    int32_t fp16x2, int32_t verbosity, const float *samples, float *centroids,
    uint32_t *assignments, float *average_distance)

where samples data is represented by 3 parameters:

uint32_t samples_size
uint16_t features_size
const float *samples

As I understand it:

uint32_t samples_size is the size of the 1st dimension.
uint16_t features_size is the size of the 2nd dimension.

Then what exactly is const float *samples? I can see that it is a pointer/array type.
Does that mean it is a one-dimensional array? Where did the second dimension go?

For example, suppose that on the host machine I have the following array of samples (samples_size = 2, features_size = 3):

[
   [1, 2, 3],
   [9, 10, 11],
]

Does that mean that, to pass this array to kmeans_cuda, I need to flatten it into a single dimension, i.e.:

[1, 2, 3, 9, 10, 11]

Or can you actually pass a 2D array to a parameter of type const float *?
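
The flattening described in the question can be sketched as follows. This is only an illustration under an assumption: that kmeans_cuda reads samples as one contiguous buffer in row-major (C) order, so that element [sample][feature] sits at offset sample * features_size + feature. The helper `element` and the variable names are hypothetical and not part of the library.

```python
import ctypes
from array import array

# Hypothetical host data: samples_size = 2, features_size = 3.
samples_2d = [
    [1.0, 2.0, 3.0],
    [9.0, 10.0, 11.0],
]
samples_size = len(samples_2d)
features_size = len(samples_2d[0])

# Flatten row by row (row-major / C order) into one contiguous float32 buffer.
flat = array("f", (value for row in samples_2d for value in row))


def element(buf, sample, feature):
    """Read samples[sample][feature] back from the flat buffer."""
    return buf[sample * features_size + feature]


# A ctypes view over the same memory: this is the kind of pointer
# a `const float *` parameter would receive (assumption: row-major input).
FloatBuffer = ctypes.c_float * len(flat)
c_samples = FloatBuffer.from_buffer(flat)
samples_ptr = ctypes.cast(c_samples, ctypes.POINTER(ctypes.c_float))

print(list(flat))           # [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
print(element(flat, 1, 0))  # 9.0
```

In C, a statically declared `float samples[2][3]` already has this contiguous layout, so `&samples[0][0]` points at the same six values; an array of row pointers (`float **`) does not, and would have to be copied into a flat buffer first.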

Error: spurious trailing ‘%’ in format

I encountered this problem when compiling with gcc 5.3.0 and cuda 8.0.

/home/kmcuda/src/python.cc: In function ‘bool validate_features_size(uint32_t)’:
/home/kmcuda/src/python.cc:108:45: error: expected ‘)’ before ‘PRIu32’:
     sprintf(msg, "\"samples\": more than %" PRIu32 " features is not supported",
                                             ^~~~~~
/home/kmcuda/src/python.cc:109:26: error: spurious trailing ‘%’ in format [-Werror=format=]

This page explains why; after adding #define __STDC_FORMAT_MACROS at the beginning of the python.cc file, I can compile successfully.

I also tried gcc 6.3.0 and CUDA 9.0, and the problem still exists.

RuntimeError: cudaMemcpy failed

Hello,

I am randomly getting the error "RuntimeError: cudaMemcpy failed". It occurs intermittently on the same data set. I can get kmeans to finish on the data set but it only works maybe 1 out of 3 times. The data set has 450K samples with 16 features each.

I am running on a GTX 1080 with 8 GB of RAM. Watching the process execute, it consumes at most about 7% of the GPU's memory. No other processes are using the GPU, so it does not seem that the GPU is running out of memory.

running Lloyd until reassignments drop below 49180
iteration 1: 447094 reassignments
iteration 2: 445995 reassignments
iteration 3: 130666 reassignments
iteration 4: 69232 reassignments
iteration 5: 46072 reassignments
performing kmeans++...
step 20kmeans_init_centroids() failed for yinyang groups: invalid argument
Traceback (most recent call last):
  File "~/models/pattern_recognition/KMeans/batch.py", line 45, in <module>
    run_id = build.partition(clusters=1000)
  File "~/models/pattern_recognition/KMeans/KMeansConstructor.py", line 151, in partition
    centroids, labels = kmeans_cuda(data, clusters, kmpp=True, verbosity=1)
RuntimeError: cudaMemcpy failed

cuda build issues on Windows

Hi,

I am having trouble building the CUDA files.

The following prerequisites were set up:

platform toolset: Visual Studio 2015 (v140)
configuration type: Static library (.lib)
precompiled header: Not Using Precompiled Headers
target machine platform: 64-bit (--machine 64) (library was also set to x64 in build menu)
code generation: compute_61,sm_61;%(CodeGeneration)
additional library directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2\lib\x64;%(AdditionalLibraryDirectories)
additional dependencies: cudart.lib;%(AdditionalDependencies)
Build customizations: CUDA 9.2
all files with .cu extension have CUDA C/C++ ItemType
master branch files are used
for the sake of simplicity, knn-related methods are commented out; only k-means is active
Python and R libraries were removed from the solution, hence they are not causing any issues

The errors are as follows:

File: metric_abstraction.h
Code:

	FPATTR static typename HALF<F>::type distance(
		F sqr1 __attribute__((unused)), F sqr2 __attribute__((unused)), F prod) {
		float fp = _float(_fin(prod));
		if (fp >= 1.f) return _half<F>(0.f);
		if (fp <= -1.f) return _half<F>(M_PI);
		return _half<F>(acos(fp));
	}

Error:

c:\users\lucky\source\repos\core\core3\metric_abstraction.h(172): error : expected a ")"

Comment:
line with ((unused))

File: metric_abstraction.h
Code:

	FPATTR static void normalize(uint32_t count __attribute__((unused)), float *vec) {
		// Kahan summation with inverted c
		float norm = 0, corr = 0;
#pragma unroll 4
		for (int f = 0; f < d_features_size; f++) {
			float v = vec[f];
			float y = _fma(corr, v, v);
			float t = norm + y;
			corr = y - (t - norm);
			norm = t;
		}
		norm = _reciprocal(_sqrt(norm));

#pragma unroll 4
		for (int f = 0; f < d_features_size; f++) {
			vec[f] = vec[f] * norm;
		}
	}

Error:

c:\users\lucky\source\repos\core\core3\metric_abstraction.h(255): error : expected a ")"

Comment:
line with ((unused))

File: wrappers.h
Code:

template <typename T>
class unique_devptr : public unique_devptr_parent<T> {
public:
	explicit unique_devptr(T *ptr, bool fake = false) : unique_devptr_parent<T>(
		ptr, fake ? [](T*) {} : [](T *p) { cudaFree(p); }) {}
};

Error:

1> c:\users\lucky\source\repos\core\core3\wrappers.h(20): error : more than one operator "?" matches these operands:
1> built-in operator "expression ? pointer : pointer"
1> built-in operator "expression ? pointer : pointer"
1> operand types are: lambda [](float *)->void : lambda [](float *)->void
1> detected during:
1> instantiation of "unique_devptr::unique_devptr(T *, __nv_bool) [with T=float]"

Comment:
line with ptr, fake ? [](T*) {} : [](T *p) { cudaFree(p); }) {}

File: metric_abstraction.h
Code:

	FPATTR static float distance_t(const F *__restrict__ v1, const F *__restrict__ v2,
		uint64_t v1_size, uint64_t v1_index) {
		// Kahan summation with inverted c
		F prod = _const<F>(0), corr = _const<F>(0);
#pragma unroll 4
		for (uint64_t f = 0; f < d_features_size; f++) {
			F yprod = _fma(corr, v1[v1_size * f + v1_index], v2[f]);
			F tprod = _add(prod, yprod);
			corr = _sub(yprod, _sub(tprod, prod));
			prod = tprod;
		}
		return _float(distance(_const<F>(1), _const<F>(1), prod));
	}

Error:

1> c:\users\lucky\source\repos\core\core3\metric_abstraction.h(203): error : no instance of overloaded function "METRIC<(KMCUDADistanceMetric)1, F>::distance [with F=float]" matches the argument list
1> argument types are: (float, float, float)
1> detected during:
1> instantiation of "float METRIC<(KMCUDADistanceMetric)1, F>::distance_t(const F *, const F *, uint64_t, uint64_t) [with F=float]"

Comment:
line with return _float(distance(_const<F>(1), _const<F>(1), prod));

Similar errors in same file:

1> C:/Users/Lucky/source/repos/Core/Core3/kmeans.cu(269): error : no instance of overloaded function "METRIC<(KMCUDADistanceMetric)1, F>::distance [with F=float]" matches the argument list
1> argument types are: (float, float, float)
1> detected during instantiation of "void kmeans_assign_lloyd_smallc<M,F>(uint32_t, uint32_t, const F *, const F *, uint32_t *, uint32_t *) [with M=kmcudaDistanceMetricCosine, F=float]"
1> (954): here
1>
1> C:/Users/Lucky/source/repos/Core/Core3/kmeans.cu(342): error : no instance of overloaded function "METRIC<(KMCUDADistanceMetric)1, F>::distance [with F=float]" matches the argument list
1> argument types are: (float, float, float)
1> detected during instantiation of "void kmeans_assign_lloyd<M,F>(uint32_t, uint32_t, const F *, const F *, uint32_t *, uint32_t *) [with M=kmcudaDistanceMetricCosine, F=float]"
1> (954): here

File: same
Code:

FPATTR static void normalize(uint32_t count __attribute__((unused)), float *vec) {
		// Kahan summation with inverted c
		float norm = 0, corr = 0;
#pragma unroll 4
		for (int f = 0; f < d_features_size; f++) {
			float v = vec[f];
			float y = _fma(corr, v, v);
			float t = norm + y;
			corr = y - (t - norm);
			norm = t;
		}
		norm = _reciprocal(_sqrt(norm));

#pragma unroll 4
		for (int f = 0; f < d_features_size; f++) {
			vec[f] = vec[f] * norm;
		}
	}

Error:

1> c:\users\lucky\source\repos\core\core3\metric_abstraction.h(260): error : identifier "vec" is undefined
1> detected during:
1> instantiation of "void METRIC<(KMCUDADistanceMetric)1, F>::normalize(uint32_t) [with F=float]"

File: kmeans.cu
Code:

template <KMCUDADistanceMetric M, typename F>
__global__ void kmeans_adjust(
    const uint32_t coffset, const uint32_t length,
    const F *__restrict__ samples,
    const uint32_t *__restrict__ assignments_prev,
    const uint32_t *__restrict__ assignments,
    F *__restrict__ centroids, uint32_t *__restrict__ ccounts) {
  uint32_t c = blockIdx.x * blockDim.x + threadIdx.x;
  if (c >= length) {
    return;
  }
  c += coffset;
  uint32_t my_count = ccounts[c];
  {
    F fmy_count = _const<F>(my_count);
    centroids += c * d_features_size;
    for (int f = 0; f < d_features_size; f++) {
      centroids[f] = _mul(centroids[f], fmy_count);
    }
  }
  extern __shared__ uint32_t ass[];
  int step = d_shmem_size / 2;
  F corr = _const<F>(0);
  for (uint32_t sbase = 0; sbase < d_samples_size; sbase += step) {
    __syncthreads();
    if (threadIdx.x == 0) {
      int pos = sbase;
      for (int i = 0; i < step && sbase + i < d_samples_size; i++) {
        ass[2 * i] = assignments[pos + i];
        ass[2 * i + 1] = assignments_prev[pos + i];
      }
    }
    __syncthreads();
    for (int i = 0; i < step && sbase + i < d_samples_size; i++) {
      uint32_t this_ass = ass[2 * i];
      uint32_t  prev_ass = ass[2 * i + 1];
      int sign = 0;
      if (prev_ass == c && this_ass != c) {
        sign = -1;
        my_count--;
      } else if (prev_ass != c && this_ass == c) {
        sign = 1;
        my_count++;
      }
      if (sign != 0) {
        F fsign = _const<F>(sign);
        #pragma unroll 4
        for (uint64_t f = 0; f < d_features_size; f++) {
          F centroid = centroids[f];
          F y = _fma(corr,
                     samples[static_cast<uint64_t>(d_samples_size) * f + sbase + i],
                     fsign);
          F t = _add(centroid, y);
          corr = _sub(y, _sub(t, centroid));
          centroids[f] = t;
        }
      }
    }
  }
  // my_count can be 0 => we get NaN with L2 and never use this cluster again
  // this is a feature, not a bug
  METRIC<M, F>::normalize(my_count, centroids);
  ccounts[c] = my_count;
}

Error:

1> C:/Users/Lucky/source/repos/Core/Core3/kmeans.cu(427): error : too many arguments in function call
1> detected during instantiation of "void kmeans_adjust<M,F>(uint32_t, uint32_t, const F *, const uint32_t *, const uint32_t *, F *, uint32_t *) [with M=kmcudaDistanceMetricCosine, F=float]"
1> (1002): here

Comment: error on line METRIC<M, F>::normalize(my_count, centroids);

File: tricks.cuh
Code:

__device__ __forceinline__ uint32_t atomicAggInc(uint32_t *ctr) {
  int mask = ballot(1);
  int leader = __ffs(mask) - 1;
  uint32_t res;
  if ((threadIdx.x % warpSize) == leader) {
    res = atomicAdd(ctr, __popc(mask));
  }
  res = shfl(res, leader);
  return res + __popc(mask & ((1 << (threadIdx.x % warpSize)) - 1));
}

Error:

1> C:/Users/Lucky/source/repos/Core/Core3/tricks.cuh(31): error : identifier "shfl" is undefined

Fix for error 7: after I added #include "private.h" to the file, the errors went away. I wonder why this change wasn't committed. It seems like a very obvious fix, and yet the master branch does not include it.

Is the CUDA project supposed to work at the moment? I would appreciate any comments on the errors that I have encountered.

Thanks in advance

building error on TITAN X

Ubuntu 14.04
CUDA 8.0
cmake 3.7.2

sudo cmake -DCMAKE_BUILD_TYPE=Release -DDISABLE_PYTHON=y -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda . && make

Configuring done
-- Generating done
-- Build files have been written to: /home/ubuntu/cuda-workspace/kmcuda-master
Building NVCC (Device) object CMakeFiles/KMCUDA.dir/KMCUDA_generated_knn.cu.o
nvcc fatal : Could not open output file /home/ubuntu/cuda-workspace/kmcuda-master/CMakeFiles/KMCUDA.dir//KMCUDA_generated_knn.cu.o.NVCC-depend
CMake Error at KMCUDA_generated_knn.cu.o.Release.cmake:222 (message):
Error generating
/home/ubuntu/cuda-workspace/kmcuda-master/CMakeFiles/KMCUDA.dir//./KMCUDA_generated_knn.cu.o
make[2]: *** [CMakeFiles/KMCUDA.dir/KMCUDA_generated_knn.cu.o] ERROR 1
make[1]: *** [CMakeFiles/KMCUDA.dir/all] ERROR 2
make: *** [all] ERROR 2

Make error on Anaconda

Hi, awesome work! I got the following make error. My environment is Anaconda 4.4.0, Python 3.6, CUDA 8.0. Any idea?

[ 33%] Building CXX object CMakeFiles/KMCUDA.dir/python.cc.o
/home/westwell/repo/kmcuda/src/kmcuda.cc:206:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
       #pragma omp simd
 ^
/home/westwell/repo/kmcuda/src/kmcuda.cc:310:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
           #pragma omp simd reduction(+:dist_sum2)
 ^
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/KMCUDA.dir/kmcuda.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/home/westwell/repo/kmcuda/src/python.cc: In function ‘PyObject* py_kmeans_cuda(PyObject*, PyObject*, PyObject*)’:
/home/westwell/repo/kmcuda/src/python.cc:358:42: error: ‘init’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
       adflag? &average_distance : nullptr);
                                          ^
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/KMCUDA.dir/python.cc.o] Error 1
make[1]: *** [CMakeFiles/KMCUDA.dir/all] Error 2
make: *** [all] Error 2
(python3) westwell@szhou-westwell:~/repo/kmcuda/src$ make -j20
[ 16%] Building CXX object CMakeFiles/KMCUDA.dir/kmcuda.cc.o
[ 33%] Building CXX object CMakeFiles/KMCUDA.dir/python.cc.o
/home/westwell/repo/kmcuda/src/kmcuda.cc:206:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
       #pragma omp simd
 ^
/home/westwell/repo/kmcuda/src/kmcuda.cc:310:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
           #pragma omp simd reduction(+:dist_sum2)
 ^
/home/westwell/repo/kmcuda/src/python.cc: In function ‘PyObject* py_kmeans_cuda(PyObject*, PyObject*, PyObject*)’:
/home/westwell/repo/kmcuda/src/python.cc:358:42: error: ‘init’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
       adflag? &average_distance : nullptr);
                                          ^
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/KMCUDA.dir/kmcuda.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/KMCUDA.dir/python.cc.o] Error 1
make[1]: *** [CMakeFiles/KMCUDA.dir/all] Error 2
make: *** [all] Error 2

Any idea why the following errors occur when running make?

-- Found OpenMP_C: -fopenmp
-- Found OpenMP_CXX: -fopenmp
-- Could NOT find R (missing: R_EXECUTABLE R_INCLUDE_DIR R_LIBRARY)
-- Configuring done
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_C_COMPILER= /usr/bin/gcc-4.8
CMAKE_CXX_COMPILER= /usr/bin/g++-4.8

-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/gcc-4.8
-- Check for working C compiler: /usr/bin/gcc-4.8 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/g++-4.8
-- Check for working CXX compiler: /usr/bin/g++-4.8 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.6.2", minimum required is "3")
-- Found PythonLibs: /usr/lib/libpython3.6m.so (found suitable version "3.6.2", minimum required is "3")
-- Could NOT find R (missing: R_EXECUTABLE R_INCLUDE_DIR R_LIBRARY)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/yxchng/git/kmcuda/src
[ 16%] Building NVCC (Device) object CMakeFiles/KMCUDA.dir/KMCUDA_generated_transpose.cu.o
/home/yxchng/git/kmcuda/src/private.h(280): error: identifier "printf" is undefined

/home/yxchng/git/kmcuda/src/transpose.cu(79): error: identifier "printf" is undefined

/home/yxchng/git/kmcuda/src/transpose.cu(86): error: identifier "printf" is undefined

/home/yxchng/git/kmcuda/src/transpose.cu(89): error: identifier "printf" is undefined

/home/yxchng/git/kmcuda/src/transpose.cu(92): error: identifier "printf" is undefined

/home/yxchng/git/kmcuda/src/transpose.cu(106): error: identifier "printf" is undefined

/home/yxchng/git/kmcuda/src/transpose.cu(115): error: identifier "printf" is undefined

11 errors detected in the compilation of "/tmp/tmpxft_00000581_00000000-5_transpose.cpp4.ii".
CMake Error at KMCUDA_generated_transpose.cu.o.cmake:282 (message):
Error generating file
/home/yxchng/git/kmcuda/src/CMakeFiles/KMCUDA.dir//./KMCUDA_generated_transpose.cu.o

Is there any plan for Python 2 support?

And what is the reason for restricting usage to Python 3? I would like to write a version compatible with both Python 2 and Python 3 if possible.

Compile error

This package looks great, but I can't build it. I'm getting this error when building on Ubuntu 14.04.5 with cuda 8. Any help would be greatly appreciated. Thanks.

kmcuda/src/kmcuda.cc:206:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
#pragma omp simd
^
kmcuda/src/kmcuda.cc:310:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
#pragma omp simd reduction(+:dist_sum2)
^
cc1plus: all warnings being treated as errors

Space requirements

Could you elaborate on the space requirements on the GPU? I'm trying to run a 2.9Mx300 instance with K=32000 on a P100 GPU with 16 GB of RAM, and I'm getting out-of-memory errors. This instance should be smaller than what you have described in the readme with 8Mx256 and K=1024. Here is the output:

arguments: 1 (nil) 0.000 0.10 0 2953377 300 32000 1234 1 0 2 0x7f72854f7010 0x7f727f453010 0xa16d2a0 0x7fff3ebcc0dc
reassignments threshold: 0
yinyang groups: 3200
cudaMalloc(&__ptr, __size)
/home/563/mp9691/kmcuda/src/kmcuda.cc:456 -> out of memory
failed to allocate 9453759777 bytes for device_bounds_yy
Status: MemoryAllocationFailure
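
The figures in this log can be cross-checked with a little arithmetic. The sketch below uses only the numbers printed above. Two things are assumptions rather than facts from the log: that the ratio of clusters to Yinyang groups seen here (32000 clusters, 3200 groups) also applies to the README-sized problem, and that samples are stored as 4-byte float32 values.

```python
# Values copied from the log in this report.
samples_size = 2_953_377
features_size = 300
clusters_size = 32_000
yinyang_groups = 3_200
reported_bounds = 9_453_759_777  # size quoted for device_bounds_yy

# Observation: the quoted size is exactly samples * (groups + 1).
bounds = samples_size * (yinyang_groups + 1)
print(bounds == reported_bounds)  # True

# Assumption: the same clusters-per-group ratio holds for the README-sized
# problem (8M samples, K=1024).
clusters_per_group = clusters_size // yinyang_groups  # 10 in this log
readme_groups = 1024 // clusters_per_group
readme_bounds = 8_000_000 * (readme_groups + 1)
print(round(bounds / readme_bounds, 1))  # 11.5

# Assumption: samples are stored as float32 (4 bytes per value).
samples_bytes = samples_size * features_size * 4
print(f"{samples_bytes / 1e9:.2f} GB")  # 3.54 GB
```

If this reading is right, the Yinyang bounds grow with samples × groups rather than with samples × features, which would explain why an instance with fewer samples but far more clusters can need much more memory. The README notes that Yinyang can be turned off to save GPU memory, at the cost of falling back to the slower Lloyd algorithm; the Lloyd tests quoted earlier on this page pass yinyang_t=0.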

using Python 2.7

I was wondering if there is a workaround to use libKMCUDA in Python 2.7?

Hi, how do I get the distance from each sample to the other samples in its cluster?
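
One way to approach this outside the library, as a minimal sketch: group sample indices by their cluster assignment, then compute pairwise L2 distances inside each group. It assumes you already hold the input samples and the per-sample assignments returned by the clustering call; `intra_cluster_distances` is a hypothetical helper, not part of libKMCUDA, and the data below is made up for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def intra_cluster_distances(samples, assignments):
    """Return {cluster: {(i, j): L2 distance}} for sample pairs sharing a cluster."""
    members = defaultdict(list)
    for index, cluster in enumerate(assignments):
        members[cluster].append(index)
    result = {}
    for cluster, indices in members.items():
        # Each unordered pair (i, j) with i < j is computed once.
        result[cluster] = {
            (i, j): math.dist(samples[i], samples[j])
            for i, j in combinations(indices, 2)
        }
    return result


# Made-up data: four 2D samples in two clusters.
samples = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [10.0, 13.0]]
assignments = [0, 0, 1, 1]
distances = intra_cluster_distances(samples, assignments)
print(distances[0][(0, 1)])  # 5.0
print(distances[1][(2, 3)])  # 3.0
```

The cost is quadratic in the size of each cluster, so for large clusters this pure-Python version is only practical on a subset of the data.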

Problem installing libKMCUDA with pip

Hello everybody.
I am using CUDA 8.0, GCC 7.3 and CMake 3.12.

I am trying to install libKMCUDA with pip, but I get this error:

pip install libKMCUDA

Collecting libKMCUDA
  Using cached https://files.pythonhosted.org/packages/c0/07/66fce78d11f1ba57fdc14a29f3fcda04685cddee7d754fb47ce470ff1c1e/libKMCUDA-6.2.1.tar.gz
Requirement already satisfied: numpy in /home/gabrielgomes/anaconda3/lib/python3.6/site-packages (from libKMCUDA) (1.14.3)
Building wheels for collected packages: libKMCUDA
  Running setup.py bdist_wheel for libKMCUDA ... error
  Complete output from command /home/gabrielgomes/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-_44ad4nk --python-tag cp36:
  running bdist_wheel
  running build
  running build_py
  CMake Error: The source directory "/tmp/pip-install-r8jx0k34/libKMCUDA" does not appear to contain CMakeLists.txt.
  Specify --help for usage, or press the help button on the CMake GUI.
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py", line 80, in <module>
      "Programming Language :: Python :: 3.6",
    File "/home/gabrielgomes/anaconda3/lib/python3.6/site-packages/setuptools/__init__.py", line 129, in setup
      return distutils.core.setup(**attrs)
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 955, in run_commands
      self.run_command(cmd)
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/home/gabrielgomes/anaconda3/lib/python3.6/site-packages/wheel/bdist_wheel.py", line 202, in run
      self.run_command('build')
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py", line 19, in run
      self._build()
    File "/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py", line 34, in _build
      "-DCUDA_TOOLKIT_ROOT_DIR=%s" % cuda_toolkit_dir, "."))
    File "/home/gabrielgomes/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '('cmake', '-DCMAKE_BUILD_TYPE=Release', '-DDISABLE_R=y', '-DCUDA_TOOLKIT_ROOT_DIR=/home/cuda-8.0/', '.')' returned non-zero exit status 1.
  
  ----------------------------------------
  Failed building wheel for libKMCUDA
  Running setup.py clean for libKMCUDA
Failed to build libKMCUDA
Installing collected packages: libKMCUDA
  Running setup.py install for libKMCUDA ... error
    Complete output from command /home/gabrielgomes/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-_4gwjv2t/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    CMake Error: The source directory "/tmp/pip-install-r8jx0k34/libKMCUDA" does not appear to contain CMakeLists.txt.
    Specify --help for usage, or press the help button on the CMake GUI.
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py", line 80, in <module>
        "Programming Language :: Python :: 3.6",
      File "/home/gabrielgomes/anaconda3/lib/python3.6/site-packages/setuptools/__init__.py", line 129, in setup
        return distutils.core.setup(**attrs)
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/home/gabrielgomes/anaconda3/lib/python3.6/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/gabrielgomes/anaconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py", line 19, in run
        self._build()
      File "/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py", line 34, in _build
        "-DCUDA_TOOLKIT_ROOT_DIR=%s" % cuda_toolkit_dir, "."))
      File "/home/gabrielgomes/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '('cmake', '-DCMAKE_BUILD_TYPE=Release', '-DDISABLE_R=y', '-DCUDA_TOOLKIT_ROOT_DIR=/home/cuda-8.0/', '.')' returned non-zero exit status 1.
    
    ----------------------------------------
Command "/home/gabrielgomes/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-r8jx0k34/libKMCUDA/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-_4gwjv2t/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-r8jx0k34/libKMCUDA/

Can anybody help me?

Thanks very much!!

C# wrapper article

Hello,

A question.

I have been wanting to write an educational article on how to write a C++/CLI wrapper on top of your C++ library. Do you have any objections to this idea?

I don't yet know where this tutorial will be posted, so if you have a preference I would welcome suggestions.

knn_cuda much slower than CPU implementation

Hi,

I was trying to use this as a GPU-accelerated k-nn implementation, for example to find the closest neighbor of each point in a set.

import time
import numpy
from libKMCUDA import knn_cuda

numpy.random.seed(0)
arr = numpy.random.rand(10000, 3).astype(numpy.float32)
ca  = numpy.arange(10000, dtype=numpy.uint32)
timings = []
for i in range(10):
    if i < 5: continue
    start = time.time()
    neighbors1 = knn_cuda(1, arr, arr, ca, metric="L2", verbosity=0, device=0)
    timings.append(time.time() - start)

print("CUDA:", numpy.mean(timings))
print(neighbors1)

from scipy.spatial import cKDTree

timings = []
for i in range(10):
    if i < 5: continue
    start = time.time()
    tree = cKDTree(arr)
    neighbors2 = tree.query(arr, 2)
    timings.append(time.time() - start)

print("Scipy:", numpy.mean(timings))
print(neighbors2)

assert numpy.all(numpy.array(neighbors1[:, 0]) == neighbors2[1][:, 1])

I get something like:

CUDA: 0.26066293716430666
[[9005]
 [5850]
 [3095]
 ...
 [6974]
 [2078]
 [ 603]]
Scipy: 0.013833379745483399
(array([[0.        , 0.03410776],
       [0.        , 0.02017157],
       [0.        , 0.02988571],
       ...,
       [0.        , 0.02020105],
       [0.        , 0.01357323],
       [0.        , 0.0398004 ]]), array([[   0, 9005],
       [   1, 5850],
       [   2, 3095],
       ...,
       [9997, 6974],
       [9998, 2078],
       [9999,  603]]))

So the CUDA version is about 19x slower than the CPU SciPy version?
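
One caveat about the measurement above: `if i < 5: continue` skips the first five iterations entirely instead of running them as warm-up, so the first timed call may still include one-time setup costs. A minimal sketch of a harness that runs untimed warm-up calls first is shown below. `bench` and `brute_force_nn` are hypothetical helper names, and the NumPy brute-force search only stands in for whichever implementation is being measured.

```python
import time
import numpy

def bench(fn, warmup=5, repeats=5):
    """Call fn() `warmup` times untimed, then return the mean of `repeats` timed calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

def brute_force_nn(points):
    """Index of the nearest *other* point for each row (O(n^2) memory, small n only)."""
    sq = (points ** 2).sum(axis=1)
    # Squared L2 distances via |a|^2 + |b|^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    numpy.fill_diagonal(d2, numpy.inf)  # exclude each point matching itself
    return d2.argmin(axis=1)

numpy.random.seed(0)
arr = numpy.random.rand(500, 3).astype(numpy.float32)
mean_seconds = bench(lambda: brute_force_nn(arr))
print("brute force:", mean_seconds)
```

The same `bench` call can wrap the knn_cuda and cKDTree invocations so that both are measured after identical warm-up.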

Install instruction gives an error

On Mac with Python 3, the pip install command doesn't work:

$ pip install git+https://github.com/src-d/kmcuda.git
Collecting git+https://github.com/src-d/kmcuda.git
  Cloning https://github.com/src-d/kmcuda.git to /private/var/folders/67/wj749ptn3jlb4_nlf9nphyy00000gn/T/pip-_77ac4rm-build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/tokenize.py", line 452, in open
        buffer = _builtin_open(filename, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/67/wj749ptn3jlb4_nlf9nphyy00000gn/T/pip-_77ac4rm-build/setup.py'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/67/wj749ptn3jlb4_nlf9nphyy00000gn/T/pip-_77ac4rm-build/

compute capability mismatch for device 0: wanted 6.1, have 3.7

I'm getting this error on a fresh EC2 machine (p2.xlarge, ubuntu 16.04) on which I've installed cuda, python3, numpy and libKMCUDA with pip.

(venv) ubuntu@ip-172-31-17-253:~/code/km-test$ python test.py
reassignments threshold: 100
compute capability mismatch for device 0: wanted 6.1, have 3.7
>>>> you may want to build kmcuda with -DCUDA_ARCH=37 (refer to "Building" in README.md)
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
ValueError: No such CUDA device exists

Here's how I installed CUDA:

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
apt-get update
apt-get install cuda

And here's test.py:

import numpy
from libKMCUDA import kmeans_cuda

numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
arr[:2500] = numpy.random.rand(2500, 2) + [0, 2]
arr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]
arr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]
arr[7500:] = numpy.random.rand(2500, 2) - [2, 0]
centroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)
print(centroids)
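
The hint in the log maps the device's compute capability to the CUDA_ARCH build option by dropping the dot (3.7 becomes 37). As a small illustration, the sketch below extracts that value from the error line. `arch_flag_from_error` is a hypothetical helper, not part of libKMCUDA, and the only flag it produces is the -DCUDA_ARCH one quoted in the log.

```python
import re

def arch_flag_from_error(line):
    """Return the -DCUDA_ARCH flag for the capability the device has, or None."""
    match = re.search(r"wanted (\d+)\.(\d+), have (\d+)\.(\d+)", line)
    if match is None:
        return None
    # Groups 3 and 4 are the major and minor version the device reports.
    return "-DCUDA_ARCH=%s%s" % (match.group(3), match.group(4))

message = "compute capability mismatch for device 0: wanted 6.1, have 3.7"
print(arch_flag_from_error(message))
```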

Random NaN issue

I get a "cudaMemcpy failed" error that pops up randomly when my init is k-means++. Full log:

INFO:quantise:Processing slice 51/128
arguments: 1 0x7ffed93d5414 0.010 0.10 0 124557 4 8192 0 0 0 2 0x2621470 0x460b2a0 0x462b2b0 (nil)
reassignments threshold: 1245
yinyang groups: 819
reusing passed_yy for centroids_yy
GPU #0 memory: used 531759104 bytes (4.2%), free 12268404736 bytes, total 12800163840 bytes
GPU #0 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(3893, 1), (8, 32)>>> 124557, 4, xyswap
performing kmeans++...
done
running Lloyd until reassignments drop below 13701
plans: [(0, 124557)]
planc: [(0, 8192)]
iteration 1: 124557 reassignments
iteration 2: 20600 reassignments
iteration 3: 11307 reassignments
transposing the samples...
transpose <<<(256, 1), (8, 32)>>> 8192, 4, xyswap
performing kmeans++...
step 1
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 2
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 3
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 4
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 5
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 6
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 7
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 8
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 9
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 10
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 11
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 12
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 13
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 14
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 15
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 16
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 17
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 18
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 19
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 20
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 21
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 22
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 23
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 24
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 25
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 26
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 27
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 28
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 29
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 30
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 31
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 32
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 33
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 34
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 35
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 36
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 37
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 38
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 39
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 40
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 41
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 42
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 43
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 44
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 45
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 46
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 47
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 48
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 49
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 50
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 51
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 52
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 53
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 54
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 55
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 56
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 57
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 58
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 59
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 60
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 61
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 62
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 63
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 64
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 65
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 66
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 67
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 68
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 69
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 70
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 71
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 72
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 73
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 74
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 75
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 76
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 77
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 78
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 79
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 80
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 81
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 82
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 83
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 84
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 85
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 86
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 87
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 88
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 89
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 90
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 91
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 92
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 93
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 94
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 95
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 96
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 97
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 98
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 99
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 100
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 101
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 102
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 103
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 104
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 105
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 106
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 107
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 108
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 109
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 110
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 111
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 112
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 113
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 114
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 115
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 116
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 117
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 118
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 119
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 120
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 121
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 122
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 123
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 124
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 125
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 126
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 127
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 128
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 129
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 130
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 131
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 132
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 133
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 134
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 135
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 136
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 137
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 138
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 139
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 140
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 141
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 142
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 143
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 144
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 145
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 146
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 147
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 148
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 149
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 150
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 151
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 152
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 153
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 154
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 155
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 156
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 157
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 158
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 159
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 160
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 161
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 162
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 163
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 164
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 165
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 166
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 167
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 168
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 169
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 170
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 171
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 172
internal bug inside kmeans_init_centroids: dist_sum is NaN

internal bug in kmeans_init_centroids: j = 0
step 173cudaMemcpyAsync( host_dists + offset, (*dists)[devi].get(), length * sizeof(float), cudaMemcpyDeviceToHost)
/tmp/kmcuda/src/kmeans.cu:814 -> an illegal memory access was encountered

kmeans_cuda_plus_plus failed
kmeans_init_centroids() failed for yinyang groups: an illegal memory access was encountered
kmeans_cuda_yy failed: no error
Traceback (most recent call last):
  File "./tflm/quantise.py", line 162, in <module>
    main()
  File "./tflm/quantise.py", line 103, in main
    codebook, labels = kmeans_cuda(M, K, seed=0, verbosity=2, init="k-means++")
RuntimeError: cudaMemcpy failed

Mini-batch K-means?

First of all, your code is brilliant!
To my understanding, most users of your project have a massive dataset to run k-means on (in my case, ~20M points in a ~10-dimensional space, ~4M clusters).
For such datasets, mini-batch k-means is a better fit (significantly faster, little accuracy loss). Wouldn't it be great to have that implemented?
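
For reference, the mini-batch idea can be sketched in a few lines of NumPy on the CPU. This is not part of kmcuda. It is a minimal illustration that assumes the per-centroid learning rate 1/count update commonly used for mini-batch k-means, and `minibatch_kmeans`, `nearest_centroid`, `batch_size` and `n_iter` are names made up for the example.

```python
import numpy

def nearest_centroid(points, centroids):
    """Index of the closest centroid (squared L2) for every point."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def minibatch_kmeans(samples, k, batch_size=256, n_iter=100, seed=0):
    """CPU mini-batch k-means sketch; returns (centroids, assignments)."""
    rng = numpy.random.RandomState(seed)
    # Random initialization: pick k distinct samples as the starting centroids.
    start = rng.choice(len(samples), k, replace=False)
    centroids = samples[start].astype(numpy.float64)
    counts = numpy.zeros(k)
    for _ in range(n_iter):
        # Each iteration only touches a small random batch of the data.
        batch = samples[rng.randint(0, len(samples), batch_size)]
        nearest = nearest_centroid(batch, centroids)
        for x, c in zip(batch, nearest):
            counts[c] += 1
            # Learning rate 1/count keeps a running mean of the assigned samples.
            centroids[c] += (x - centroids[c]) / counts[c]
    return centroids, nearest_centroid(samples, centroids)

# Four synthetic blobs, small enough for the dense distance matrix above.
numpy.random.seed(0)
arr = numpy.empty((10000, 2), dtype=numpy.float32)
arr[:2500] = numpy.random.rand(2500, 2) + [0, 2]
arr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]
arr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]
arr[7500:] = numpy.random.rand(2500, 2) - [2, 0]
centroids, assignments = minibatch_kmeans(arr, 4)
print(centroids)
```

Each iteration costs batch_size x k distance computations instead of samples x k, which is where the speedup on very large datasets comes from.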

Out of memory

I'm trying to run k-means on 7,732,159 samples in 128 dimensions into 10000 clusters on an AWS p2.xlarge instance which has 12 GB of GPU memory and am getting this error message:

arguments: 1 0x7ffcf2138984 0.010 0.10 0 7732159 128 10000 3 0 0 3 0x7f89a0088010 0x7f8bba61d010 0x7f8bb889e010 (nil)
reassignments threshold: 77321
yinyang groups: 1000
[0] *dest: 0x12052e0000 - 0x12f1257e00 (3958865408)
[0] device_centroids: 0x12f1260000 - 0x12f1742000 (5120000)
[0] device_assignments: 0x12f1760000 - 0x12f34deefc (30928636)
[0] device_assignments_prev: 0x12f34e0000 - 0x12f525eefc (30928636)
[0] device_ccounts: 0x12f5260000 - 0x12f5269c40 (40000)
[0] device_assignments_yy: 0x12f5269e00 - 0x12f5273a40 (40000)
cudaMalloc(&__ptr, __size)
/home/ubuntu/code/kmcuda/src/kmcuda.cc:455 -> out of memory
failed to allocate 7739891159 bytes for device_bounds_yy
Traceback (most recent call last):
  File "./bin/bow.py", line 83, in <module>
    }[args.action]()
  File "./bin/bow.py", line 57, in train
    engine.train()
  File "/home/ubuntu/code/cv/cv/bow.py", line 82, in train
    centroids = self.kmeans.fit(features)
  File "/home/ubuntu/code/cv/cv/kmeans/kmcuda.py", line 33, in fit
    seed = self.seed,
MemoryError: Failed to allocate memory on GPU

"failed to allocate 7739891159 bytes for device_bounds_yy" is ~7.7 GB, which should fit in the GPU's memory (12 GB). It also seems comparable to this example in the README:

kmcuda can sort 4M samples in 480 dimensions into 40000 clusters (if you have several days and 12 GB of GPU memory)

Any idea why I'm out of memory? Also, since 7732159 * 128 * 4 bytes = 3.96 GB, is it normal that it allocates almost double that?
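
Adding up the sizes printed in the log gives a rough picture of the total request. The sketch below only does that arithmetic. Reading the failed request as samples x (yinyang groups + 1) is an observation about the numbers, not a statement about kmcuda internals, and the log does not show how much of the 12 GB card was actually free at that point.

```python
samples, features, clusters, yy_groups = 7732159, 128, 10000, 1000

# Sizes of the buffers already allocated, copied from the log (bytes).
allocated = {
    "*dest": 3958865408,
    "device_centroids": 5120000,
    "device_assignments": 30928636,
    "device_assignments_prev": 30928636,
    "device_ccounts": 40000,
    "device_assignments_yy": 40000,
}
requested = 7739891159  # the failed device_bounds_yy request

# The sample buffer matches samples * features * 4 bytes (float32).
assert allocated["*dest"] == samples * features * 4
# The failed request matches samples * (yinyang groups + 1).
assert requested == samples * (yy_groups + 1)

already = sum(allocated.values())
total = already + requested
print("already allocated: %.2f GB" % (already / 1e9))
print("total with bounds: %.2f GB" % (total / 1e9))
```

So the sample buffer is not doubled: the second large block is a separate per-sample bounds buffer, and the two together come to about 11.8 GB before counting whatever the driver and other allocations take. If the logged figure is an element count rather than a byte count (an assumption this log cannot confirm), the real request would be larger still.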

RuntimeError: cudaMemcpy failed

I'm getting the "cudaMemcpy failed" error without any other information despite verbosity=2. I have a rather big dataset (100k instances) and am trying to find k=10k clusters. CUDA memory use doesn't go above 600 MB (4-GPU configuration).

Why do K-means centroids contain NaN?

I use kmcuda to run k-means on a large dataset. The dataset contains 21138972 128-dimensional vectors, and the target number of centroids is 30k. I use the following code:

import time
import numpy as np
from libKMCUDA import kmeans_cuda
n_clusters = 300000
data = np.load('dataset.npy')
t0 = time.time()
centroids,_=kmeans_cuda(data, n_clusters, verbosity=1, device=0, yinyang_t=0)
t_mini_batch = time.time() - t0
print('This iteration takes ', t_mini_batch, 's')

and it outputs:

reassignments threshold: 211389
transposing the samples...
performing kmeans++...
done
too few clusters for this yinyang_t => Lloyd
iteration 1: 21138972 reassignments
iteration 2: 8493647 reassignments
iteration 3: 4261612 reassignments
iteration 4: 2966279 reassignments
iteration 5: 2296016 reassignments
iteration 6: 1872182 reassignments
iteration 7: 1583827 reassignments
iteration 8: 1369100 reassignments
iteration 9: 1203159 reassignments
iteration 10: 1073277 reassignments
iteration 11: 964482 reassignments
iteration 12: 874202 reassignments
iteration 13: 798805 reassignments
iteration 14: 734298 reassignments
iteration 15: 676895 reassignments
iteration 16: 629083 reassignments
iteration 17: 585535 reassignments
iteration 18: 549050 reassignments
iteration 19: 514284 reassignments
iteration 20: 482728 reassignments
iteration 21: 455612 reassignments
iteration 22: 429360 reassignments
iteration 23: 405368 reassignments
iteration 24: 385095 reassignments
iteration 25: 367105 reassignments
iteration 26: 349844 reassignments
iteration 27: 333942 reassignments
iteration 28: 319134 reassignments
iteration 29: 305205 reassignments
iteration 30: 292143 reassignments
iteration 31: 280165 reassignments
iteration 32: 267917 reassignments
iteration 33: 257570 reassignments
iteration 34: 247747 reassignments
iteration 35: 237848 reassignments
iteration 36: 228632 reassignments
iteration 37: 219988 reassignments
iteration 38: 210930 reassignments
This iteration takes  20040.06494617462 s

I use a server with 4 TITAN X (Pascal) GPUs, each with 12192 MB of memory.

But I find that the centroids matrix contains NaN. Here are the dataset (compressed with tar czf) and the centroid result. Am I doing something wrong?

Thanks for any help!
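
A quick way to narrow this down is to check which centroids are NaN and whether the input itself contains NaN. The sketch below assumes only NumPy. `summarize_nan_centroids` is a hypothetical helper, and the tiny arrays stand in for the real data, centroids and assignments returned by kmeans_cuda. A NaN row paired with zero assigned samples would point to an empty cluster rather than corrupted input, though that interpretation is an assumption to verify against the README notes.

```python
import numpy

def summarize_nan_centroids(data, centroids, assignments):
    """Report NaN rows in centroids and how many samples each such cluster holds."""
    nan_rows = numpy.flatnonzero(numpy.isnan(centroids).any(axis=1))
    # Cast first: bincount needs a signed integer type.
    labels = numpy.asarray(assignments, dtype=numpy.int64)
    counts = numpy.bincount(labels, minlength=len(centroids))
    return {
        "input_has_nan": bool(numpy.isnan(data).any()),
        "nan_centroids": nan_rows.tolist(),
        "sizes_of_nan_clusters": counts[nan_rows].tolist(),
    }

# Stand-in arrays: three samples, three centroids, the last centroid is NaN.
data = numpy.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], dtype=numpy.float32)
centroids = numpy.array(
    [[0.05, 0.0], [5.0, 5.0], [numpy.nan, numpy.nan]], dtype=numpy.float32)
assignments = numpy.array([0, 0, 1], dtype=numpy.uint32)
report = summarize_nan_centroids(data, centroids, assignments)
print(report)
```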

install issue

I created a virtual env with conda, with python 3.6. Then I installed as instructed:

pip install libKMCUDA

This gives me the following error:

Collecting libKMCUDA
Could not find a version that satisfies the requirement libKMCUDA (from versions: 1.0.0.linux-x86_64, 1.0.1.linux-x86_64, 1.0.2.linux-x86_64)
No matching distribution found for libKMCUDA

Then I also tried:

pip install git+https://github.com/src-d/kmcuda.git#subdirectory=src

This gives me the following error:

Collecting git+https://github.com/src-d/kmcuda.git#subdirectory=src
Cloning https://github.com/src-d/kmcuda.git to /tmp/pip-ab5m6_e8-build
Requirement already satisfied: numpy in ./py3_env/lib/python3.4/site-packages (from libKMCUDA==6.2.0)
Installing collected packages: libKMCUDA
Running setup.py install for libKMCUDA ... error
Complete output from command /home/qlw/miniconda2/envs/py3_env/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-ab5m6_e8-build/src/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-r1fc33gf-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
CMake Error at /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:148 (message):
Could NOT find CUDA (missing: CUDA_CUDART_LIBRARY) (found version "9.0")
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:388 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.5/Modules/FindCUDA.cmake:949 (find_package_handle_standard_args)
CMakeLists.txt:10 (find_package)

-- Configuring incomplete, errors occurred!
See also "/tmp/pip-ab5m6_e8-build/src/CMakeFiles/CMakeOutput.log".
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-ab5m6_e8-build/src/setup.py", line 79, in <module>
    "Programming Language :: Python :: 3.5"
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/site-packages/setuptools-27.2.0-py3.4.egg/setuptools/command/install.py", line 61, in run
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/command/install.py", line 539, in run
    self.run_command('build')
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/command/build.py", line 126, in run
    self.run_command(cmd_name)
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/tmp/pip-ab5m6_e8-build/src/setup.py", line 19, in run
    self._build()
  File "/tmp/pip-ab5m6_e8-build/src/setup.py", line 34, in _build
    "-DCUDA_TOOLKIT_ROOT_DIR=%s" % cuda_toolkit_dir, "."))
  File "/home/qlw/miniconda2/envs/py3_env/lib/python3.4/subprocess.py", line 558, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('cmake', '-DCMAKE_BUILD_TYPE=Release', '-DDISABLE_R=y', '-DCUDA_TOOLKIT_ROOT_DIR=:/usr/local/cuda/bin', '.')' returned non-zero exit status 1

----------------------------------------

Command "/home/qlw/miniconda2/envs/py3_env/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-ab5m6_e8-build/src/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-r1fc33gf-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-ab5m6_e8-build/src

Any help? Thanks
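
The cmake command at the end of the traceback receives -DCUDA_TOOLKIT_ROOT_DIR=:/usr/local/cuda/bin, which looks like a PATH-style value (leading colon, trailing bin) rather than a single toolkit directory. A minimal sketch of a sanity check on such a value is shown below. `suspicious_toolkit_root` is a hypothetical helper, and its two rules are heuristics, not checks performed by kmcuda or CMake.

```python
import posixpath

def suspicious_toolkit_root(value):
    """Return a list of reasons why value is unlikely to be a CUDA toolkit root."""
    reasons = []
    if ":" in value:
        reasons.append("contains ':', so it looks like a PATH-style list")
    if posixpath.basename(value.rstrip("/")) == "bin":
        reasons.append("points at a bin directory instead of its parent")
    return reasons

print(suspicious_toolkit_root(":/usr/local/cuda/bin"))
print(suspicious_toolkit_root("/usr/local/cuda"))
```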

Error installing LightGBM GPU version for Windows. Error in cmake.

Free memory in K-means

I have a question: when you allocate memory with CUMALLOC, do you free it before the algorithm finishes? I cannot find cudaFree or anything similar. I'm afraid that if I call kmeans_cuda from my own code, it could exhaust the GPU memory.

Thank you.

weird problem

My data is a 183497 x 600 matrix. With k = 4590 I get these errors:
'internal bug inside kmeans_init_centroids: dist_sum is NaN'
'/src/kmeans.cu:814 -> an illegal memory access was encountered'
'cudaMemcpy failed'
But if I set k to another number, such as k = 4591, k-means works.
I thought this might occur randomly.
I have no idea where the problem is.

Examples for C API

Can you add examples of how to include and use the kmcuda C API?

OOM on the module

Hi! I tried to use this on my model and it runs out of memory. The input is a 2M x 6000 matrix:

reassignments threshold: 14050
/tmp/kmcuda/src/kmcuda.cc:152 -> out of memory
failed to allocate 8430000000 bytes for *dest

Traceback (most recent call last):
  File "KMCUDA_MSDSWHReduce.py", line 66, in <module>
    cen, assign = kmtrain(X, num_clusters)
  File "KMCUDA_MSDSWHReduce.py", line 52, in kmtrain
    centroids, assignments = kmeans_cuda(X, num_clusters, verbosity=1, yinyang_t=0, seed=3)
MemoryError: Failed to allocate memory on GPU

I set yinyang_t = 0.

The GPU I use is a Tesla P100.
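
A back-of-the-envelope check helps here. The reassignments threshold of 14050 is consistent with 1,405,000 samples at the 0.01 tolerance seen in the other logs on this page (an inference, not something this log states), and 1,405,000 x 6000 equals the 8430000000 in the error message. The sketch below estimates the footprint of a dense sample matrix. `dense_bytes` is a hypothetical helper and the 16 GB capacity is an assumed figure for the card.

```python
def dense_bytes(samples, features, itemsize=4):
    """Bytes needed to hold a dense samples x features matrix (float32 by default)."""
    return samples * features * itemsize

samples = 14050 * 100          # inferred from the threshold at 1% tolerance
features = 6000
assert samples * features == 8430000000  # the figure in the error message

gpu_bytes = 16 * 10**9         # assumed capacity, adjust for the actual card
for name, itemsize in (("float32", 4), ("float16", 2)):
    need = dense_bytes(samples, features, itemsize)
    print("%s: %.1f GB, fits: %s" % (name, need / 1e9, need < gpu_bytes))
```

Under these assumptions the samples alone exceed the card's memory even as 16-bit floats, so turning Yinyang off cannot be enough on its own. The matrix would have to shrink (fewer samples or fewer dimensions) to fit.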

K-means in C#

Seeing that there is an API for R, would it be possible to use this library from C# as well?

In C# we have the managedCuda library, which can offload tasks to the GPU, so I was wondering whether adapting kmcuda to C# is already possible or planned.

Currently, it takes me days to cluster data on the CPU.
