Giter Club home page Giter Club logo

cfxkmeans's Introduction

CFXKMeans

CFXKMeans library is an open-source, optimized library for K-means clustering analysis developed by Colfax Research. Development of the library is described in detail in the white paper, "Optimization of Hamerly’s K-Means Clustering Algorithm: CFXKMeans Library" webpage.

Installation

C/C++ API

The installation requires either a GNU C/C++ Compiler or Intel C/C++ Compiler with OpenMP 4.0 support. It has been verified on Intel Compiler 2017 and GCC-6.3.0. To compile CFXKMeans, first configure the Makefile by selecting the compiler and the vecorization flags for your architecture:

########### Intel compiler ###################
CC=icc
CXX = icpc
### Enable vectorization support
## Intel Xeon Phi processor        -> -xMIC_AVX512
## Intel Xeon processor (HSW, BDW) -> -xCORE_AVX2
## Intel Xeon processor (SNB, IVB) -> -xAVX
VECFLAG=-xMIC_AVX512
CXXFLAGS=$(COMMONFLAGS) $(VECFLAG) -qopenmp
########### GCC (> 4.9.1) ####################

Then run make:

%  make

The archive file will be created in lib/ directory, and the header file is in include/.

Python API

Python API requires python and cython to compile. Numpy is rquired for CFXKMeans library. The Makefile has support for compiling python extension:

%  make python-build     # creates a build/ with the extension
%  make python-install   # attempts to install

If you choose to use the setup.py manually, remember to set CC and ```CXX`` to the same compilers used for making the archive file.

Usage

C/C++

See includes/kmeans.h for the definition of the KMeans class. Example coming soon.

Python

Example usage can be found in example/kmeans.py.

import cfxkmeans
cfxk = cfxkmeans.KMeans(k, init).fit(data)
print(cfxk.labels_)
print(cfxk.centroids_)

Note that, currently, the library does not support creating the initial centroids. See the example/kmeans.py for an example on how to create inital centroids.

Running the application

CFXKMeans library uses OpenMP parallel framework for threading. OpenMP framework supports various environment variables to control threading. The performance of the library can be drastically different for some systems if the environment variables are not set correctly.

For Intel Xeon Phi processors we have found the folloing to give the best performance

% # For Xeon Phi 7250 processor
% export OMP_NUM_THREADS=68 # 1 thread per core
% export OMP_PLACES=cores
% export OMP_PROC_BIND=spread
% numactl -m 1 python kmeans.py #Using MCDRAM

For Intel Xeon processors we have found the folloing to give the best performance

% # For Xeon Phi 7250 processor
% export OMP_PLACES=cores
% export OMP_PROC_BIND=spread
% python kmeans.py

Design and TODO

CFXKMeans library was originally developed as an excersize in code modernization (see white paper) of the computational workload (minimizing within-cluster variance). So features such as choosing the initial centroids or computing the final variance total have not been implemented yet. These features may be added in the future.

cfxkmeans's People

Contributors

rasai avatar

Watchers

James Cloos avatar Vikram Dutt avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.