genbattle / dkm Goto Github PK

View Code? Open in Web Editor NEW

203.0 203.0 46.0 833 KB

A generic C++11 k-means clustering implementation

License: MIT License

CMake 2.06% C++ 97.94%

dkm's People

Contributors

Stargazers

Watchers

dkm's Issues

converting std::array<T,N> to std::vector<T>

@genbattle, don't you think it would be a better idea if std::vector<T> is used instead of std::array<T,N>. This way one can have a more dynamic design.

I understand std::array is probably a bit faster than std::vector but perhaps not too much.

I have changed std::array<T,N> to std::vector<T> in my code because I needed it post-compilation based on dataset.

Are you aware of any significant speed drops?

invalid probability vector for discrete_distribution

Hello,

Thanks for this useful project. I think I have found a suspiscious behavior, you may be interested:

I have this assert raised in Microsoft Visual Studio 2017 when running the following code in debug:

void ReproduceAssert()
{
    std::vector<std::array<double, 1>> data;
    data.push_back(std::array<double,1>({1181.43797999999310377461}));
    data.push_back(std::array<double,1>({35.40983650000003990499}));
    data.push_back(std::array<double,1>({15.00000000000000000000}));
    data.push_back(std::array<double,1>({15.00000000000000000000}));
    data.push_back(std::array<double,1>({264.58983650000004672620}));
    data.push_back(std::array<double,1>({18.56102000000697671567}));
    data.push_back(std::array<double,1>({96.22983649999991939694}));
    data.push_back(std::array<double,1>({15.00000000000000000000}));
    data.push_back(std::array<double,1>({10.34999999999968167685}));
    data.push_back(std::array<double,1>({185.00000000000000000000}));
    data.push_back(std::array<double,1>({10.34999999999968167685}));
    data.push_back(std::array<double,1>({89.65000000000031832315}));
    data.push_back(std::array<double,1>({89.65000000000031832315}));
    data.push_back(std::array<double,1>({185.00000000000000000000}));
    data.push_back(std::array<double,1>({10.34999999999968167685}));
    data.push_back(std::array<double,1>({185.00000000000000000000}));
    data.push_back(std::array<double,1>({89.65000000000031832315}));
    dkm::clustering_parameters<double> params(10);
    params.set_random_seed(0);
    params.set_max_iteration(1000);
    auto result = dkm::kmeans_lloyd<double, 1>(data, params);
}

Utility functions for commonly occurring problems

First of all, I would like to thank you for your work implementing dkm. I have used it in one of my projects and it was a breeze to set up and use.

During the time I was in the process of incorporating dkm into the project, I have written several utility functions that helped me use the library more efficiently. Currently, I have functions to

calculate distance of each point to cluster center
calculate sum of distances of each point to cluster center
return all points that belong to the given cluster
calculate the inertia of a given clustering using sum of distances method
return the best clustering based on inertia
calculate kmeans by running dkm::kmeans_lloyd function a given number of times and return the best clustering

Would you like me to send a pull request with these functions added to dkm.hpp file under a util namespace?

P.S.: All the functions are template functions; so the library would still be header-only.

[Feature Request] Please add dkm::kmeans_lloyd that take std::vector<std::vector<..>>

Hi,
thanks for your great library!
Would it be possible for you to implement an overloaded version of the dkm::kmeans_lloyd method that
accepts a std::vector<std::vector> for the data instead of a std::vector<std::array<>>. I think that would make it much easier to use this implementation for certain code bases that require a dynamic amount of dimension (or at least you don't need to pass that number as a template parameter which makes it more error prone because you need to change it at multiple locations in the code)

Invalid std::vector::reserve argument in closest_distance

The issue is minor

In closest_distance:
distances.reserve(k);
but should be
distances.reserve(data.size());

Using std::array forces the dimension of the points to be defined in compile time

Description

I'd like to use dkm in a system that can cluster different data sets that have their own dimensions. This cannot be done dynamically because the size of std::array needs to be defined in compile time.

Proposed Solution

Use std::vector instead of std::array to contain point data.

Using unsigned fixed width integer types for data point arrays causes compile-time error

Description

Using fixed width unsigned integer types like uint8_t, uint16_t, or uint32_t causes a compile-time error.

Example

For instance, using uint8_t as the array data type:

array<uint8_t, 625> points;

And then invoking:

auto cluster_data = dkm::kmeans_lloyd(data, K);

Will throw the following compile-time error:

dkm/include/dkm.hpp: In instantiation of ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, const dkm::clustering_parameters<T>&) [with T = unsigned char; long unsigned int N = 625]’:
dkm/include/dkm.hpp:315:21:   required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t, T) [with T = unsigned char; long unsigned int N = 625; uint32_t = unsigned int; uint64_t = long unsigned int]’
src/main.cpp:148:76:   required from here
dkm/include/dkm.hpp:273:2: error: static assertion failed: kmeans_lloyd requires the template parameter T to be a signed arithmetic type (e.g. float, double, int)
  static_assert(std::is_arithmetic<T>::value && std::is_signed<T>::value,
  ^~~~~~~~~~~~~

Implement N-dimensional array class to replace `std::vector<std::array<T, N>>`

Issues (#8 and #11) have been raised about the flexibility of the current std::vector<std::array<T, N>> pattern used for passing data matrices in and out of this library.

A more flexible alternative is to have a dedicated N-dimensional array class (dkm::ndarray<T>) which has convenient constructors that take both the existing std::vector<std::array<T, N>> and std::vector<std::vector<T>>, which has been requested for some use cases. This class would also have methods to explicitly convert back to these data types after running the kmeans algorithm.

Make code citable with a Digital Object Identifiers (DOI)

Description

The source code should be made citable for academic references: https://guides.github.com/activities/citable-code/

Proposed Solution

Apply a DOI:

A DOI, or Digital Object Identifier, is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. A DOI will help your reader easily locate a document from your citation. - source.

See GitHub instructions on "making your code citable": https://guides.github.com/activities/citable-code/

Is all the source code in include?

Hi. it seems that you wrote a templatized version.

I am assuming that all the library source code (for Kmeans) is in include folder and opencv is only required for an example. Further, we don't need openMP if we don't intend to use parallel version. And I am thinking about building it with VS2015 so won't need CMAKE either.

Are my assumptions correct?

static_assert(__which < 0, /* needs to be dependent */

Hi,

I was building your library on a 64bit x86 machine with GCC, no issues. Cross-compiling for a 32bit ARM platform however gives a very confusing error:

In file included from /usr/arm-linux-gnueabihf/include/c++/8/random:49,
                 from /mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:9,
                 from /mnt/project_ws/src/pasta/camera/applications/tracking/src/pattern_extractor.cc:8:
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h: In instantiation of ‘struct std::__detail::_Select_uint_least_t<99, 1>’:
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:114:40:   required from ‘struct std::__detail::_Mod<long long unsigned int, 18446744073709551615, 25214903917, 11, false, false>’
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:147:48:   required from ‘_Tp std::__detail::__mod(_Tp) [with _Tp = long long unsigned int; _Tp __m = 18446744073709551615; _Tp __a = 25214903917; _Tp __c = 11]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:325:50:   required from ‘std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type std::linear_congruential_engine<_UIntType, __a, __c, __m>::operator()() [with _UIntType = long long unsigned int; _UIntType __a = 25214903917; _UIntType __c = 11; _UIntType __m = 18446744073709551615; std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type = long long unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:243:31:   required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&, const std::uniform_int_distribution<_IntType>::param_type&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:166:51:   required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:81:43:   required from ‘std::vector<std::array<_Tp, _Nm> > dkm::details::random_plusplus(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:278:65:   required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, const dkm::clustering_parameters<T>&) [with T = float; unsigned int N = 3]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:316:22:   required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t, T) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/pattern_extractor.cc:47:48:   required from here
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:84:24: error: static assertion failed: sorry, would be too much trouble for a slow result
  static_assert(__which < 0, /* needs to be dependent */
                ~~~~~~~~^~~
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h: In instantiation of ‘struct std::__detail::_Mod<long long unsigned int, 18446744073709551615, 25214903917, 11, false, false>’:
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:147:48:   required from ‘_Tp std::__detail::__mod(_Tp) [with _Tp = long long unsigned int; _Tp __m = 18446744073709551615; _Tp __a = 25214903917; _Tp __c = 11]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:325:50:   required from ‘std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type std::linear_congruential_engine<_UIntType, __a, __c, __m>::operator()() [with _UIntType = long long unsigned int; _UIntType __a = 25214903917; _UIntType __c = 11; _UIntType __m = 18446744073709551615; std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type = long long unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:243:31:   required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&, const std::uniform_int_distribution<_IntType>::param_type&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:166:51:   required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:81:43:   required from ‘std::vector<std::array<_Tp, _Nm> > dkm::details::random_plusplus(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:278:65:   required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, const dkm::clustering_parameters<T>&) [with T = float; unsigned int N = 3]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:316:22:   required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t, T) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/pattern_extractor.cc:47:48:   required from here
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:114:40: error: no type named ‘type’ in ‘struct std::__detail::_Select_uint_least_t<99, 1>’
            + std::__lg(__m) + 2>::type _Tp2;
                                        ^~~~
make[2]: *** [CMakeFiles/tracking.dir/build.make:89: CMakeFiles/tracking.dir/src/pattern_extractor.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:137: CMakeFiles/tracking.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

Changing line 76 in dkm.hpp from

  // std::linear_congruential_engine<uint64_t, 6364136223846793005, 1442695040888963407, UINT64_MAX> rand_engine(seed);

  std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> rand_engine(seed);

I can put a pull request if you like, but if you have a particular preference for some other random engine, I can also put that one in.

Building for visual studio 2017 community edition

JFYI:

It won't be able to find opencv out-of-the-box. So you have to put is manually:
mkdir build
cd build
cmake -G"Visual Studio 15 2017 Win64" -DOpenCV_DIR=D:\opencv\build ../
The generated project won't build, you have to load dkm.sln and remove extra compilation options for each project.

readme wrong information for return type of kmeans_lloyd()

in section Usage of readme, the return type of kmeans_lloyd() should be
std::tuple<std::vector<std::array<T, N>>, std::vector<uint32_t>>
not
std::tuple<std::array<T, N>, std::vector<uint32_t>>
which make me confuse at the first look

Can you expand in the readme to give more instructions?

I just found your repo because I am trying to make clusters from x, y coordinates that I extract from frames from ffmpeg and a remote camera. I ran some live data through the example given in the readme and I am not sure what the returned data is exactly. I have virtually no C++ experience, so please excuse me. When accessing std::cout<<std::get<0>(means)<<std::endl; I get an array of 2 arrays that are structured like [[482, 212], [191, 393]]. Are these the centroids of a cluster? Can your code be used to create a bounding box around clusters of coordinates and if yes, can you please show me? Thanks for the code!

Fix CI build

Fix Travis integration or migrate to Github Actions.

genbattle / dkm Goto Github PK

dkm's People

Contributors

Stargazers

Watchers

Forkers

dkm's Issues

Description

Proposed Solution

Description

Example

Description

Proposed Solution

Recommend Projects

Recommend Topics

Recommend Org