Giter Club home page Giter Club logo

kmpp's Introduction

kmpp

Travis CI

When dealing with lots of data points, clustering algorithms may be used to group them. The k-means algorithm partitions n data points into k clusters and finds the centroids of these clusters incrementally.

The algorithm assigns data points to the closest cluster, and the centroids of each cluster are re-calculated. These steps are repeated until the centroids do not changing anymore.

The basic k-means algorithm is initialized with k centroids at random positions. This implementation addresses some disadvantages of the arbitrary initialization method with the k-means++ algorithm (see "Further reading" at the end).

Installation

Installing via npm

Install kmpp as Node.js module via NPM:

$ npm install kmpp

Example

var kmpp = require('kmpp');

kmpp([
  [x1, y1, ...],
  [x2, y2, ...],
  [x3, y3, ...],
  ...
], {
  k: 4
});

// =>
// { converged: true,
//   centroids: [[xm1, ym1, ...], [xm2, ym2, ...], [xm3, ym3, ...]],
//   counts: [ 7, 6, 7 ],
//   assignments: [ 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 ]
// }

API

kmpp(points[, opts)

Exectes the k-means++ algorithm on points.

Arguments:

  • points (Array): An array-of-arrays containing the points in format [[x1, y1, ...], [x2, y2, ...], [x3, y3, ...], ...]
  • opts: object containing configuration parameters. Parameters are
    • distance (function): Optional function that takes two points and returns the distance between them.
    • initialize (Boolean): Perform initialization. If false, uses the initial state provided in centroids and assignments. Otherwise discards any initial state and performs initialization.
    • k (Number): number of centroids. If not provided, sqrt(n / 2) is used, where n is the number of points.
    • kmpp (Boolean, default: true): If true, uses k-means++ initialization. Otherwise uses naive random assignment.
    • maxIterations (Number, default: 100): Maximum allowed number of iterations.
    • norm (Number, default: 2): L-norm used for distance computation. 1 is Manhattan norm, 2 is Euclidean norm. Ignored if distance function is provided.
    • centroids (Array): An array of centroids. If initialize is false, used as initialization for the algorithm, otherwise overwritten in-place if of the correct size.
    • assignments (Array): An array of assignments. Used for initialization, otherwise overwritten.
    • counts (Array): An output array used to avoid extra allocation. Values are discarded and overwritten.

Returns an object containing information about the centroids and point assignments. Values are:

  • converged: true if the algorithm converged successfully
  • centroids: a list of centroids
  • counts: the number of points assigned to each respective centroid
  • assignments: a list of integer assignments of each point to the respective centroid
  • iterations: number of iterations used

Credits

  • Jared Harkins improved the performance by reducing the amount of function calls, reverting to Manhattan distance for measurements and improved the random initialization by choosing from points

  • Ricky Reusser refactored API

Further reading

License

© 2017-2019. MIT License.

kmpp's People

Contributors

cmtt avatar dependabot[bot] avatar lpmi-13 avatar rreusser avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

kmpp's Issues

Suspicious type cast in k-means++ initialization

kmpp/kmpp.js

Lines 256 to 265 in b579100

for (i = 0; i < ntries; ++i) {
var rndVal = ~~(RANDOM() * Dsum);
var n = 0;
for (; n < l; ++n) {
if (rndVal <= D[n]) {
break;
} else {
rndVal -= D[n];
}
}

On line 257, it looks like k-means++ initialization computes a random number from 0 to DSum and then floors it (via ~~) to an integer. rndVal is only used in comparison to other distances. This seems strange since I don't see any particular connection between distances and integers. If things were normalized, that would be different, but if distances are on the order of ~1e-3 and n ~ 1000, won't rndVal always be ~0? Is it possible this is a typo?

Can we add hardcap to centroid ?

Can we add hardcap to centroid ?

  • Number of max assignment to each centroid
  • Max distance between point and centroid
  • Dynamic centroid number with previous configuration

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.