
pqkmeans's Introduction

PQk-means

Project | Paper | Tutorial

Figures: a 2D example using both k-means and PQk-means; a large-scale evaluation.

PQk-means [Matsui, Ogaki, Yamasaki, and Aizawa, ACMMM 17] is a Python library for efficient clustering of large-scale data. By first compressing input vectors into short product-quantized (PQ) codes, PQk-means achieves fast and memory-efficient clustering, even for high-dimensional vectors. Similar to k-means, PQk-means repeats the assignment and update steps, both of which can be performed in the PQ-code domain.
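For intuition, the assignment step can be sketched as follows: because both the input vectors and the cluster centers are represented as PQ codes, the distance between any two codes reduces to a sum of per-subspace table lookups. The snippet below is an illustrative NumPy sketch only, not the library's C++ implementation; pq_codes (shape (N, M)), center_codes (shape (K, M)), and codewords (shape (M, Ks, Ds)) are assumed inputs.

import numpy as np

def assign_by_lookup(pq_codes, center_codes, codewords):
    # codewords[m] holds the Ks sub-codewords of subspace m (shape (Ks, Ds)).
    M = codewords.shape[0]
    # Precompute, per subspace, the squared distances between every pair of sub-codewords.
    tables = np.stack([
        ((codewords[m][:, None, :] - codewords[m][None, :, :]) ** 2).sum(-1)
        for m in range(M)
    ])  # shape (M, Ks, Ks)
    # The distance between a data code and a center code is a sum of M table lookups.
    dists = np.zeros((len(pq_codes), len(center_codes)))
    for m in range(M):
        dists += tables[m][pq_codes[:, m][:, None], center_codes[:, m][None, :]]
    return dists.argmin(axis=1)  # index of the nearest center for each input code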

For comparison, we provide ITQ encoding for binary conversion and Binary k-means [Gong+, CVPR 15] for clustering binary codes.

The main algorithm is written in C++, with Python wrappers. All encoding/clustering code is compatible with scikit-learn.

Summary of features

  • Approximation of k-means
  • Tens to hundreds of times faster than k-means
  • Tens to hundreds of times more memory efficient than k-means
  • Compatible with scikit-learn
  • Portable; one-line installation

Installation

Requisites

  • CMake
    • brew install cmake for OS X
    • sudo apt install cmake for Ubuntu
  • OpenMP (Optional)
    • If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation (see the sketch after this list for controlling the thread count).
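The thread count used by OpenMP can usually be controlled through the standard OMP_NUM_THREADS environment variable. This is a general OpenMP convention rather than a documented feature of this library; a minimal Python sketch:

import os

# General OpenMP convention (not specific to pqkmeans): set the variable
# before the native extension is imported so the OpenMP runtime picks it up.
os.environ["OMP_NUM_THREADS"] = "4"

import pqkmeans  # imported after setting the variable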

Build & install

You can install the library from PyPI:

pip install pqkmeans

Or, if you would like to use the current master version, you can build and install the library manually:

git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install

Run samples

# evaluation needs extra texmex package
pip install pqkmeans[texmex]
# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100

Test

python setup.py test

Usage

For PQk-means

import pqkmeans
import numpy as np
X = np.random.random((100000, 128))  # 100,000 samples, each 128-dimensional

# Train a PQ encoder.
# Each vector is divided into 4 parts, and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8.
# You can train the encoder and transform the input vectors to PQ codes beforehand.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the id of the assigned center for the first input PQ code (X_pqcode[0]).
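As a quick follow-up (a sketch under the assumptions noted in the comments, not additional official API documentation), the PQ codes can be inspected directly, and new vectors can be assigned to the learned clusters after encoding them:

print(X_pqcode.shape, X_pqcode.dtype)  # e.g. (100000, 4) uint8

# Assumes the fitted PQKMeans object exposes a predict() that accepts PQ codes,
# analogously to fit_predict above.
X_new = np.random.random((1000, 128))  # new 128-dimensional vectors
labels_new = kmeans.predict(encoder.transform(X_new))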

Note that an instance of PQ-encoder (encoder) and an instance of clustering (kmeans) can be pickled and reused later.

import pickle

# An instance of PQ-encoder.
pickle.dump(encoder, open('encoder.pkl', 'wb'))
encoder_dumped = pickle.load(open('encoder.pkl', 'rb'))

# An instance of clustering. This can be reused as a vector quantizer later.
pickle.dump(kmeans, open('kmeans.pkl', 'wb'))
kmeans_dumped = pickle.load(open('kmeans.pkl', 'rb'))
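As a usage note, the dumped instances can later be loaded and used together as a vector quantizer; a sketch, assuming the predict() method discussed above:

X_query = np.random.random((10, 128))
ids = kmeans_dumped.predict(encoder_dumped.transform(X_query))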

For Bk-means

In almost the same manner as for PQk-means,

import pqkmeans
import numpy as np
X = np.random.random((100000, 128))  # 100,000 samples, each 128-dimensional

# Train an ITQ binary encoder
encoder = pqkmeans.encoder.ITQEncoder(num_bit=32)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to binary codes
X_itq = encoder.transform(X)

# Run clustering
kmeans = pqkmeans.clustering.BKMeans(k=5, input_dim=32)
clustered = kmeans.fit_predict(X_itq)
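The same pickling pattern shown for the PQ objects above can in principle be used here as well; a sketch, assuming that ITQEncoder and BKMeans instances pickle the same way as their PQ counterparts:

import pickle

# Assumption: ITQEncoder / BKMeans instances pickle like the PQ objects above.
pickle.dump(encoder, open('itq_encoder.pkl', 'wb'))
pickle.dump(kmeans, open('bkmeans.pkl', 'wb'))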

Please see more examples in the tutorial.

Note

  • This repository contains a re-implemented version of PQk-means with a Python interface. There may be differences between this repository and the pure C++ implementation used in the paper.
  • We tested this library with Python 3 on OS X and Ubuntu 16.04.

Authors

  • Keisuke Ogaki designed the whole structure of the library, and implemented most of the Bk-means clustering
  • Yusuke Matsui implemented most of the PQk-means clustering

Reference

@inproceedings{pqkmeans,
    author = {Yusuke Matsui and Keisuke Ogaki and Toshihiko Yamasaki and Kiyoharu Aizawa},
    title = {PQk-means: Billion-scale Clustering for Product-quantized Codes},
    booktitle = {ACM International Conference on Multimedia (ACMMM)},
    year = {2017},
}

Todo

  • Evaluation script for billion-scale data
  • Nearest neighbor search with PQTable
  • Documentation


pqkmeans's Issues

Update cluster

Hi there. First of all, thanks for this incredibly amazing project!! This is such a helpful and well-designed codebase.

This is not an issue but rather a question, so sorry about using this channel for this:

Is there a way of efficiently updating the k-means clusters when new data arrives? I'm guessing the answer is no and that I have to regenerate the clustering, as would be the usual path, because new data can slowly shift all the centroids. Is that correct, or is there some clever way of doing this?

Thanks!

How to speed up the "predict" function?

After training the PQ-kmeans (with num_subdim=16, Ks=256, kmeans.K=25,000), I want to assign ~1,000,000 vectors to their clusters, but kmeans.predict(new_vectors) takes a long time to run. Why is this slow, and is there any way to speed it up?

Comparison to Faiss

Faiss implemented k-means clustering in PQ space.

According to https://github.com/facebookresearch/faiss/wiki/Low-level-benchmarks, it shows very competitive results at million scale thanks to GPU acceleration.

However, there is a lack of billion-scale benchmark data, so I cannot compare which one is faster.

In "Billion-scale similarity search with GPUs" by Jeff Johnson, Matthijs Douze, and Hervé Jégou:

I have two observations suggesting that Faiss may be faster than PQk-means.

1. Million scale test

Large scale. We can also compare to [3], an approximate CPU method that clusters 100M 128-d vectors to 85k centroids. Their clustering method runs in 46 minutes, but requires 56 minutes (at least) of pre-processing to encode the vectors. Our method performs exact k-means on 4 GPUs in 52 minutes without any pre-processing.

The quote says that N=100M, 128-D (SIFT dimension), K≈10^5 took about 1 hour, excluding the PQ encoding part (number of iterations unknown).
Table 5 of your paper shows that for SIFT1B (128-D), N=10^9, K=10^5 took 14 hours for 5 iterations.

If I naively multiply the Faiss result by 10, because its N is 10x smaller than in your benchmark, it comes to about 10 hours.

2. 1-nearest-neighbor / assignment-step speed

Another point is the assignment step in Table 3 of your paper: for SIFT1M with K = 10^3, the assignment step took 6 seconds.

The corresponding figure from Faiss indicates it took less than 1 second.

Questions

Do you have any data for a rigorous comparison? Your PQk-means and the Faiss implementation are both competitive, and I want to see which wins in terms of speed and accuracy.

One more question: if your proposed update step could be combined with the Faiss implementation, would it produce a state-of-the-art PQk-means clustering implementation?

Get subvectors to use with other algorithms

Hi Author,
I'm now using PQKmeans to compress high-dimensional data into a lower dimension with product quantization.
But my professor told me that I have to try the dataset with several other algorithms after the product quantization step, because he wants to experiment with each sub-vector.
That means that after the Encoder.transform() function I have to get the sub-vectors.
E.g.: 10,000 elements in 100 dimensions ----> after product quantization -> 5 x 10,000 in 20 dimensions (the number of divisions is 5).
Then I would like to get the separate sub-vectors.
I have searched your package and Google, but I did not find a solution for this.
I would like to ask:
Is it possible to get the sub-vectors from product quantization?

thank you very much
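For illustration only (a sketch, not an official answer from the maintainers): the code array returned by encoder.transform has one column per subspace, so the per-subspace codes can at least be separated by column slicing. Recovering the quantized sub-vectors themselves would additionally require the trained sub-codewords; the attribute name used in the commented line below is an assumption.

X_pqcode = encoder.transform(X)  # shape (N, num_subdim), dtype uint8
sub_codes_0 = X_pqcode[:, 0]     # codes belonging to the first subspace

# Hypothetical (attribute name assumed): if the trained codewords are exposed,
# the quantized sub-vectors would be the looked-up codewords.
# sub_vectors_0 = encoder.codewords[0][sub_codes_0]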

Cannot install through pip

Thanks for this great library, but I could not install it through pip. I have cmake installed on my Ubuntu system, and when I try to install it with pip3.8 install pqkmeans, the error that I get is:

(env_38) user@user-LPC:~/projects/tmp$ pip3.8 install pqkmeans
Collecting pqkmeans
  Using cached pqkmeans-1.0.5.tar.gz (161 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting scipy
  Using cached scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting numpy
  Using cached numpy-1.24.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting texmex-python==1.0.0
  Using cached texmex_python-1.0.0.tar.gz (2.5 kB)
  Preparing metadata (setup.py) ... done
Collecting pipe
  Using cached pipe-2.0-py3-none-any.whl (8.8 kB)
Collecting six
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)
Collecting lshash3
  Using cached lshash3-0.0.8.tar.gz (9.4 kB)
  Preparing metadata (setup.py) ... done
Collecting joblib>=1.1.1
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting bitarray
  Using cached bitarray-2.7.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (271 kB)
Building wheels for collected packages: pqkmeans
  Building wheel for pqkmeans (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for pqkmeans (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [86 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-38
      creating build/lib.linux-x86_64-cpython-38/pqkmeans
      copying pqkmeans/evaluation.py -> build/lib.linux-x86_64-cpython-38/pqkmeans
      copying pqkmeans/__init__.py -> build/lib.linux-x86_64-cpython-38/pqkmeans
      creating build/lib.linux-x86_64-cpython-38/test
      copying test/__init__.py -> build/lib.linux-x86_64-cpython-38/test
      creating build/lib.linux-x86_64-cpython-38/pqkmeans/clustering
      copying pqkmeans/clustering/__init__.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/clustering
      copying pqkmeans/clustering/cpp_implemented_clustering_sample.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/clustering
      copying pqkmeans/clustering/pure_python_clustering_sample.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/clustering
      copying pqkmeans/clustering/bkmeans.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/clustering
      copying pqkmeans/clustering/pqkmeans.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/clustering
      creating build/lib.linux-x86_64-cpython-38/pqkmeans/encoder
      copying pqkmeans/encoder/encoder_sample.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/encoder
      copying pqkmeans/encoder/__init__.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/encoder
      copying pqkmeans/encoder/itq_encoder.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/encoder
      copying pqkmeans/encoder/encoder_base.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/encoder
      copying pqkmeans/encoder/pq_encoder.py -> build/lib.linux-x86_64-cpython-38/pqkmeans/encoder
      creating build/lib.linux-x86_64-cpython-38/test/clustering
      copying test/clustering/test_pqkmeans.py -> build/lib.linux-x86_64-cpython-38/test/clustering
      copying test/clustering/__init__.py -> build/lib.linux-x86_64-cpython-38/test/clustering
      copying test/clustering/test_cpp_implemented_clustering_sample.py -> build/lib.linux-x86_64-cpython-38/test/clustering
      copying test/clustering/test_pure_python_clustering_sample.py -> build/lib.linux-x86_64-cpython-38/test/clustering
      copying test/clustering/test_bkmeans.py -> build/lib.linux-x86_64-cpython-38/test/clustering
      creating build/lib.linux-x86_64-cpython-38/test/encoder
      copying test/encoder/__init__.py -> build/lib.linux-x86_64-cpython-38/test/encoder
      copying test/encoder/test_pq_encoder.py -> build/lib.linux-x86_64-cpython-38/test/encoder
      copying test/encoder/test_encoder_sample.py -> build/lib.linux-x86_64-cpython-38/test/encoder
      copying test/encoder/test_itq_encoder.py -> build/lib.linux-x86_64-cpython-38/test/encoder
      running build_ext
      Traceback (most recent call last):
        File "/home/user/projects/tmp/env_38/bin/cmake", line 5, in <module>
          from cmake import cmake
      ModuleNotFoundError: No module named 'cmake'
      Traceback (most recent call last):
        File "/home/user/projects/tmp/env_38/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/user/projects/tmp/env_38/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/user/projects/tmp/env_38/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
          return _build_backend().build_wheel(wheel_directory, config_settings,
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 413, in build_wheel
          return self._build_with_temp_dir(['bdist_wheel'], '.whl',
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 398, in _build_with_temp_dir
          self.run_setup()
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 335, in run_setup
          exec(code, locals())
        File "<string>", line 76, in <module>
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/__init__.py", line 108, in setup
          return distutils.core.setup(**attrs)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-5x4zvxen/normal/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 343, in run
          self.run_command("build")
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/command/build.py", line 131, in run
          self.run_command(cmd_name)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-5x4zvxen/overlay/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "<string>", line 37, in run
        File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/usr/lib/python3.8/subprocess.py", line 516, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['cmake', '--version']' returned non-zero exit status 1.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pqkmeans
Failed to build pqkmeans
ERROR: Could not build wheels for pqkmeans, which is required to install pyproject.toml-based projects

How to fix it?

Missing LICENSE

What is the license for this code?
The repository does not contain license headers or a LICENSE file. Could you add those, please?

init PQkmeans class with cluster centers

This awesome repo saves me a lot of time that would otherwise be wasted if I had stuck to AK-means clustering. Still, it takes a long time to fit a PQkmeans object. I want to save its result, namely the found cluster centers, and then use the saved result to assign any descriptor to the nearest center point.

First, I tried pickle, but the PQkmeans object cannot be pickled.

Second, I tried saving only the found cluster centers in .npy format. However, there is no way to initialize a PQkmeans object with initial center points. The reason I want to create a PQkmeans clustering object is to use its efficient assignment step.

My suggestions are:

  • Create an option to initialize a PQkmeans object with initial center points, so the user can predict with the object.
  • Provide a static method for the assignment step, so the user can assign the nearest cluster center to a batch of descriptors.

Do you have any plan to implement this feature? It would make this work more usable and complete.

Cannot install the dependency: lshash3

My package depends on pqkmeans. Some time ago, it started producing errors during installation, because pqkmeans requires lshash3, and for some reason lshash3 no longer installs.

When I run pip install pqkmeans (Ubuntu, Python 3.10, reproducible in this notebook), I get the following error:

Trace:

Collecting pqkmeans==1.0.4
  Downloading pqkmeans-1.0.4.tar.gz (158 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.5/158.5 kB 3.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from pqkmeans==1.0.4) (1.23.5)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from pqkmeans==1.0.4) (1.2.2)
Collecting pipe (from pqkmeans==1.0.4)
  Using cached pipe-2.0-py3-none-any.whl (8.8 kB)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from pqkmeans==1.0.4) (1.11.3)
Collecting typing (from pqkmeans==1.0.4)
  Downloading typing-3.7.4.3.tar.gz (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.6/78.6 kB 8.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from pqkmeans==1.0.4) (1.16.0)
Collecting texmex_python==1.0.0 (from pqkmeans==1.0.4)
  Using cached texmex_python-1.0.0-py3-none-any.whl
Collecting lshash3 (from texmex_python==1.0.0->pqkmeans==1.0.4)
  Using cached lshash3-0.0.8.tar.gz (9.4 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->pqkmeans==1.0.4) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->pqkmeans==1.0.4) (3.2.0)
Collecting bitarray (from lshash3->texmex_python==1.0.0->pqkmeans==1.0.4)
  Using cached bitarray-2.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (286 kB)
Building wheels for collected packages: pqkmeans, typing, lshash3
  Building wheel for pqkmeans (setup.py) ... done
  Created wheel for pqkmeans: filename=pqkmeans-1.0.4-cp310-cp310-linux_x86_64.whl size=236311 sha256=940d560d65bd2e8ecec652ff7756502f85e47a99a9101b0b500afac859cb8948
  Stored in directory: /root/.cache/pip/wheels/84/fa/fe/7cf22e3d319bb9c9652768e4761b2b219541ad0334b127f340
  Building wheel for typing (setup.py) ... done
  Created wheel for typing: filename=typing-3.7.4.3-py3-none-any.whl size=26305 sha256=a64e1e1c1f212dc750d7dffea39c73c41cb5d2e96e166917de49adeefd665c72
  Stored in directory: /root/.cache/pip/wheels/7c/d0/9e/1f26ebb66d9e1732e4098bc5a6c2d91f6c9a529838f0284890
  error: subprocess-exited-with-error
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for lshash3 (setup.py) ... error
  ERROR: Failed building wheel for lshash3
  Running setup.py clean for lshash3
Successfully built pqkmeans typing
Failed to build lshash3
ERROR: Could not build wheels for lshash3, which is required to install pyproject.toml-based projects

Here is where we have discovered the problem: avidale/compress-fasttext#19

Can you please fix this error, e.g. by switching to a well-maintained implementation of lshash, such as https://github.com/guofei9987/pyLSHash?

pip build fails on mac

While running pip install pqkmeans:

[ 28%] Building CXX object CMakeFiles/_pqkmeans.dir/src/clustering/bkmeans.cpp.o
    In file included from /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/src/_pqkmeans.cpp:3:
    In file included from /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/src/encoder/encoder_sample.h:6:
    In file included from /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/ext/pybind/include/pybind11/stl.h:34:
    /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/experimental/optional:18:3: warning: "<experimental/optional> has been removed. Use <optional> instead." [-W#warnings]
    # warning "<experimental/optional> has been removed. Use <optional> instead."
      ^
    In file included from /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/src/_pqkmeans.cpp:3:
    In file included from /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/src/encoder/encoder_sample.h:6:
    /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/ext/pybind/include/pybind11/stl.h:283:46: error: no member named 'experimental' in namespace 'std'
    template<typename T> struct type_caster<std::experimental::optional<T>>
                                            ~~~~~^
    /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/ext/pybind/include/pybind11/stl.h:283:69: error: 'T' does not refer to a value
    template<typename T> struct type_caster<std::experimental::optional<T>>
                                                                        ^
    /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/ext/pybind/include/pybind11/stl.h:283:19: note: declared here
    template<typename T> struct type_caster<std::experimental::optional<T>>
                      ^
    /private/var/folders/pz/ckv4fct52c7_826fg4x48vgm0000gn/T/pip-install-khtx799x/pqkmeans/ext/pybind/include/pybind11/stl.h:286:36: error: no member named 'experimental' in namespace 'std'
    template<> struct type_caster<std::experimental::nullopt_t>
                                  ~~~~~^
    1 warning and 3 errors generated.
    make[2]: *** [CMakeFiles/_pqkmeans.dir/src/_pqkmeans.cpp.o] Error 1
    make[2]: *** Waiting for unfinished jobs....
    make[1]: *** [CMakeFiles/_pqkmeans.dir/all] Error 2
    make: *** [all] Error 2

My environment details are:

ProductName:	Mac OS X
ProductVersion:	10.15
BuildVersion:	19A583

Darwin MacBook-Pro.local 19.0.0 Darwin Kernel Version 19.0.0: Wed Sep 25 20:18:50 PDT 2019; root:xnu-6153.11.26~2/RELEASE_X86_64 x86_64

Pickling the encoder fails

Hi guys,

first of all thanks for the lib. I'm currently trying it out but having a hard time persisting the encoder:

The following snippet

    import pickle
    import pqkmeans

    # features: the training vectors, e.g. a NumPy array of shape (N, D)
    encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)

    encoder.fit(features)

    with open('/Users/schumajs/Desktop/encoder.pkl', 'wb') as f:
        pickle.dump(encoder, f)

returns

pickle.PicklingError: Can't pickle <class 'pqkmeans.encoder.pq_encoder.TrainedPQEncoder'>: it's not found as pqkmeans.encoder.pq_encoder.TrainedPQEncoder

Any ideas?

Thank you

Jens

EDIT

Saving the encoder works with Python 3. I had been using Python 2. Sorry about that.

On a different matter: Would it be possible to save the trained pqkmeans model as well?

Can't pip install

Error message: FileNotFoundError: [Errno 2] No such file or directory: 'cmake': 'cmake'

Does this method work with other PQ methods (OPQ, LOPQ)?

My main goal is to minimize the error of PQk-means clustering that comes from quantization.
I want to use PQk-means clustering over AK-means (approximate k-means clustering with a kd-tree forest) because of its fantastic speed. At the same time, I want to minimize the error discrepancy between PQk-means and AK-means.

There are a few parameters I know of to do that:

  • Increase M and L to make a more accurate approximation of the corresponding original vector.

However, from this picture from http://image.ntua.gr/iva/research/lopq/ I can see that PQ has an inherent limitation for approximate quantization compared to LOPQ. That is why I came up with the idea of using LOPQ with PQk-means clustering.

I am new to the PQ world, so I am not sure whether PQk-means's assignment and update steps are still valid for other methods such as OPQ and LOPQ.

Do you have any idea?

Sparse Vector

Just wondering if pqkmeans can be optimized for sparse vectors (CSR format) instead of dense ones.

thank you

create pqkmeans conda package version from pypi

I am failing to create a conda package of pqkmeans from pypi:

$conda skeleton pypi pqkmeans
$conda build pqkmeans

It ends with error exit code 1.
I don't know how to solve the problem; sorry for my ignorance!

I really need pqkmeans in a conda environment!
Is there any known Anaconda channel from which I can conda install pqkmeans?

logging errors

--- Logging error ---
Traceback (most recent call last):
  File "//.pyenv/versions/3.7.2/lib/python3.7/logging/__init__.py", line 1034, in emit
    msg = self.format(record)
  File "//.pyenv/versions/3.7.2/lib/python3.7/logging/__init__.py", line 880, in format
    return fmt.format(record)
  File "//.pyenv/versions/3.7.2/lib/python3.7/logging/__init__.py", line 619, in format
    record.message = record.getMessage()
  File "//.pyenv/versions/3.7.2/lib/python3.7/logging/__init__.py", line 380, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting

Lines like logging.debug(" R: ", R.shape) seem to need to be changed.

python 3.7.2
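For reference, the error occurs because the standard library's logging module formats the message as msg % args; the conventional fix is to pass a format string with lazy %-style arguments. A small sketch of the kind of change that would be needed (R_shape is a placeholder value for illustration):

import logging

logging.basicConfig(level=logging.DEBUG)
R_shape = (100, 32)  # placeholder value for illustration

# Raises "not all arguments converted during string formatting",
# because logging computes msg % args internally:
# logging.debug(" R: ", R_shape)

# Conventional form: a format string plus arguments.
logging.debug(" R: %s", R_shape)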
