genbattle / dkm Goto Github PK
View Code? Open in Web Editor NEWA generic C++11 k-means clustering implementation
License: MIT License
A generic C++11 k-means clustering implementation
License: MIT License
@genbattle, don't you think it would be a better idea if std::vector<T>
is used instead of std::array<T,N>
. This way one can have a more dynamic design.
I understand std::array
is probably a bit faster than std::vector
but perhaps not too much.
I have changed std::array<T,N> to std::vector<T>
in my code because I needed it post-compilation based on dataset.
Are you aware of any significant speed drops?
Hello,
Thanks for this useful project. I think I have found a suspiscious behavior, you may be interested:
I have this assert raised in Microsoft Visual Studio 2017 when running the following code in debug:
void ReproduceAssert()
{
std::vector<std::array<double, 1>> data;
data.push_back(std::array<double,1>({1181.43797999999310377461}));
data.push_back(std::array<double,1>({35.40983650000003990499}));
data.push_back(std::array<double,1>({15.00000000000000000000}));
data.push_back(std::array<double,1>({15.00000000000000000000}));
data.push_back(std::array<double,1>({264.58983650000004672620}));
data.push_back(std::array<double,1>({18.56102000000697671567}));
data.push_back(std::array<double,1>({96.22983649999991939694}));
data.push_back(std::array<double,1>({15.00000000000000000000}));
data.push_back(std::array<double,1>({10.34999999999968167685}));
data.push_back(std::array<double,1>({185.00000000000000000000}));
data.push_back(std::array<double,1>({10.34999999999968167685}));
data.push_back(std::array<double,1>({89.65000000000031832315}));
data.push_back(std::array<double,1>({89.65000000000031832315}));
data.push_back(std::array<double,1>({185.00000000000000000000}));
data.push_back(std::array<double,1>({10.34999999999968167685}));
data.push_back(std::array<double,1>({185.00000000000000000000}));
data.push_back(std::array<double,1>({89.65000000000031832315}));
dkm::clustering_parameters<double> params(10);
params.set_random_seed(0);
params.set_max_iteration(1000);
auto result = dkm::kmeans_lloyd<double, 1>(data, params);
}
First of all, I would like to thank you for your work implementing dkm. I have used it in one of my projects and it was a breeze to set up and use.
During the time I was in the process of incorporating dkm into the project, I have written several utility functions that helped me use the library more efficiently. Currently, I have functions to
Would you like me to send a pull request with these functions added to dkm.hpp file under a util namespace?
P.S.: All the functions are template functions; so the library would still be header-only.
Hi,
thanks for your great library!
Would it be possible for you to implement an overloaded version of the dkm::kmeans_lloyd method that
accepts a std::vector<std::vector> for the data instead of a std::vector<std::array<>>. I think that would make it much easier to use this implementation for certain code bases that require a dynamic amount of dimension (or at least you don't need to pass that number as a template parameter which makes it more error prone because you need to change it at multiple locations in the code)
The issue is minor
In closest_distance:
distances.reserve(k);
but should be
distances.reserve(data.size());
I'd like to use dkm in a system that can cluster different data sets that have their own dimensions. This cannot be done dynamically because the size of std::array needs to be defined in compile time.
Use std::vector instead of std::array to contain point data.
Using fixed width unsigned integer types like uint8_t
, uint16_t
, or uint32_t
causes a compile-time error.
For instance, using uint8_t
as the array data type:
array<uint8_t, 625> points;
And then invoking:
auto cluster_data = dkm::kmeans_lloyd(data, K);
Will throw the following compile-time error:
dkm/include/dkm.hpp: In instantiation of ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, const dkm::clustering_parameters<T>&) [with T = unsigned char; long unsigned int N = 625]’:
dkm/include/dkm.hpp:315:21: required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t, T) [with T = unsigned char; long unsigned int N = 625; uint32_t = unsigned int; uint64_t = long unsigned int]’
src/main.cpp:148:76: required from here
dkm/include/dkm.hpp:273:2: error: static assertion failed: kmeans_lloyd requires the template parameter T to be a signed arithmetic type (e.g. float, double, int)
static_assert(std::is_arithmetic<T>::value && std::is_signed<T>::value,
^~~~~~~~~~~~~
Issues (#8 and #11) have been raised about the flexibility of the current std::vector<std::array<T, N>>
pattern used for passing data matrices in and out of this library.
A more flexible alternative is to have a dedicated N-dimensional array class (dkm::ndarray<T>
) which has convenient constructors that take both the existing std::vector<std::array<T, N>>
and std::vector<std::vector<T>>
, which has been requested for some use cases. This class would also have methods to explicitly convert back to these data types after running the kmeans algorithm.
The source code should be made citable for academic references: https://guides.github.com/activities/citable-code/
Apply a DOI:
A DOI, or Digital Object Identifier, is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. A DOI will help your reader easily locate a document from your citation. - source.
See GitHub instructions on "making your code citable": https://guides.github.com/activities/citable-code/
Hi. it seems that you wrote a templatized version.
I am assuming that all the library source code (for Kmeans) is in include
folder and opencv is only required for an example. Further, we don't need openMP if we don't intend to use parallel version. And I am thinking about building it with VS2015 so won't need CMAKE either.
Are my assumptions correct?
Hi,
I was building your library on a 64bit x86 machine with GCC, no issues. Cross-compiling for a 32bit ARM platform however gives a very confusing error:
In file included from /usr/arm-linux-gnueabihf/include/c++/8/random:49,
from /mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:9,
from /mnt/project_ws/src/pasta/camera/applications/tracking/src/pattern_extractor.cc:8:
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h: In instantiation of ‘struct std::__detail::_Select_uint_least_t<99, 1>’:
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:114:40: required from ‘struct std::__detail::_Mod<long long unsigned int, 18446744073709551615, 25214903917, 11, false, false>’
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:147:48: required from ‘_Tp std::__detail::__mod(_Tp) [with _Tp = long long unsigned int; _Tp __m = 18446744073709551615; _Tp __a = 25214903917; _Tp __c = 11]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:325:50: required from ‘std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type std::linear_congruential_engine<_UIntType, __a, __c, __m>::operator()() [with _UIntType = long long unsigned int; _UIntType __a = 25214903917; _UIntType __c = 11; _UIntType __m = 18446744073709551615; std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type = long long unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:243:31: required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&, const std::uniform_int_distribution<_IntType>::param_type&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:166:51: required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:81:43: required from ‘std::vector<std::array<_Tp, _Nm> > dkm::details::random_plusplus(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:278:65: required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, const dkm::clustering_parameters<T>&) [with T = float; unsigned int N = 3]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:316:22: required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t, T) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/pattern_extractor.cc:47:48: required from here
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:84:24: error: static assertion failed: sorry, would be too much trouble for a slow result
static_assert(__which < 0, /* needs to be dependent */
~~~~~~~~^~~
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h: In instantiation of ‘struct std::__detail::_Mod<long long unsigned int, 18446744073709551615, 25214903917, 11, false, false>’:
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:147:48: required from ‘_Tp std::__detail::__mod(_Tp) [with _Tp = long long unsigned int; _Tp __m = 18446744073709551615; _Tp __a = 25214903917; _Tp __c = 11]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:325:50: required from ‘std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type std::linear_congruential_engine<_UIntType, __a, __c, __m>::operator()() [with _UIntType = long long unsigned int; _UIntType __a = 25214903917; _UIntType __c = 11; _UIntType __m = 18446744073709551615; std::linear_congruential_engine<_UIntType, __a, __c, __m>::result_type = long long unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:243:31: required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&, const std::uniform_int_distribution<_IntType>::param_type&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/usr/arm-linux-gnueabihf/include/c++/8/bits/uniform_int_dist.h:166:51: required from ‘std::uniform_int_distribution<_IntType>::result_type std::uniform_int_distribution<_IntType>::operator()(_UniformRandomNumberGenerator&) [with _UniformRandomNumberGenerator = std::linear_congruential_engine<long long unsigned int, 25214903917, 11, 18446744073709551615>; _IntType = unsigned int; std::uniform_int_distribution<_IntType>::result_type = unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:81:43: required from ‘std::vector<std::array<_Tp, _Nm> > dkm::details::random_plusplus(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:278:65: required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, const dkm::clustering_parameters<T>&) [with T = float; unsigned int N = 3]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/dkm.hpp:316:22: required from ‘std::tuple<std::vector<std::array<_Tp, _Nm>, std::allocator<std::array<_Tp, _Nm> > >, std::vector<unsigned int, std::allocator<unsigned int> > > dkm::kmeans_lloyd(const std::vector<std::array<_Tp, _Nm> >&, uint32_t, uint64_t, T) [with T = float; unsigned int N = 3; uint32_t = unsigned int; uint64_t = long long unsigned int]’
/mnt/project_ws/src/pasta/camera/applications/tracking/src/pattern_extractor.cc:47:48: required from here
/usr/arm-linux-gnueabihf/include/c++/8/bits/random.h:114:40: error: no type named ‘type’ in ‘struct std::__detail::_Select_uint_least_t<99, 1>’
+ std::__lg(__m) + 2>::type _Tp2;
^~~~
make[2]: *** [CMakeFiles/tracking.dir/build.make:89: CMakeFiles/tracking.dir/src/pattern_extractor.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:137: CMakeFiles/tracking.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
Changing line 76 in dkm.hpp from
// std::linear_congruential_engine<uint64_t, 6364136223846793005, 1442695040888963407, UINT64_MAX> rand_engine(seed);
to
std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> rand_engine(seed);
I can put a pull request if you like, but if you have a particular preference for some other random engine, I can also put that one in.
JFYI:
in section Usage of readme, the return type of kmeans_lloyd() should be
std::tuple<std::vector<std::array<T, N>>, std::vector<uint32_t>>
not
std::tuple<std::array<T, N>, std::vector<uint32_t>>
which make me confuse at the first look
I just found your repo because I am trying to make clusters from x, y coordinates that I extract from frames from ffmpeg and a remote camera. I ran some live data through the example given in the readme and I am not sure what the returned data is exactly. I have virtually no C++ experience, so please excuse me. When accessing std::cout<<std::get<0>(means)<<std::endl;
I get an array of 2 arrays that are structured like [[482, 212], [191, 393]]
. Are these the centroids of a cluster? Can your code be used to create a bounding box around clusters of coordinates and if yes, can you please show me? Thanks for the code!
Fix Travis integration or migrate to Github Actions.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.