
skmeans's Introduction

skmeans

Super fast, simple k-means and k-means++ implementation for unidimensional and multidimensional data. Works in Node.js and in the browser.

Installation

npm install skmeans

Usage

NodeJS

const skmeans = require("skmeans");

var data = [1,12,13,4,25,21,22,3,14,5,11,2,23,24,15];
var res = skmeans(data,3);

Browser

<!doctype html>
<html>
<head>
	<script src="skmeans.js"></script>
</head>
<body>
	<script>
		var data = [1,12,13,4,25,21,22,3,14,5,11,2,23,24,15];
		var res = skmeans(data,3);

		console.log(res);
	</script>
</body>
</html>

Results

{
	it: 2,
	k: 3,
	idxs: [ 2, 0, 0, 2, 1, 1, 1, 2, 0, 2, 0, 2, 1, 1, 0 ],
	centroids: [ 13, 23, 3 ]
}

API

skmeans(data,k,[centroids],[iterations],[distance function])

Calculates unidimensional and multidimensional k-means clustering on data. Parameters are:

  • data Unidimensional or multidimensional array of values to be clustered. For unidimensional data, it takes the form of a simple array [1,2,3.....,n]. For multidimensional data, it takes an NxM array [[1,2],[2,3]....[n,m]]
  • k Number of clusters
  • centroids Optional. Initial centroid values. If not provided, the algorithm will try to choose appropriate ones. Alternative values can be:
    • "kmrand" Cluster initialization will be random, but with extra checking, so that no two initial centroids are equal.
    • "kmpp" The algorithm will use the k-means++ cluster initialization method.
  • iterations Optional. Maximum number of iterations. If not provided, it will be set to 10000.
  • distance function Optional. Custom distance function. Takes two points as arguments and returns a scalar number.

The function will return an object with the following data:

  • it The number of iterations performed until the algorithm has converged
  • k The number of clusters
  • centroids The value of each cluster centroid
  • idxs For each value of the data array, the index of the centroid it is assigned to
  • test Function to test new point membership

Examples

// k-means with 3 clusters. Random initialization
var res = skmeans(data,3);

// k-means with 3 clusters. Initial centroids provided
var res = skmeans(data,3,[1,5,9]);

// k-means with 3 clusters. k-means++ cluster initialization
var res = skmeans(data,3,"kmpp");

// k-means with 3 clusters. Random initialization. 10 max iterations
var res = skmeans(data,3,null,10);

// k-means with 3 clusters. Custom distance function
var res = skmeans(data,3,null,null,(x1,x2)=>Math.abs(x1-x2));

// Test new point
var res = skmeans(data,3,null,10);
res.test(6);

// Test new point with custom distance
var res = skmeans(data,3,null,10);
res.test(6,(x1,x2)=>Math.abs(x1-x2));
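
For multidimensional data, the custom distance function receives two arrays rather than two numbers. As a minimal sketch, a Manhattan (city-block) distance could look like the following. The `manhattan` name and the sample points are illustrative assumptions, not part of the library.

```javascript
// Manhattan (L1) distance between two points of equal dimension.
// Illustrative helper; `manhattan` is not provided by skmeans.
function manhattan(p, q) {
  let sum = 0;
  for (let i = 0; i < p.length; i++) {
    sum += Math.abs(p[i] - q[i]);
  }
  return sum;
}

// Assumed usage, following the parameter order documented in the API section:
// var res = skmeans([[1,2],[2,3],[10,11],[11,12]], 2, null, null, manhattan);
```

Any function that takes two points and returns a scalar should fit the same slot.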

skmeans's People

Contributors

solzimer

skmeans's Issues

Undefined centroid

When using a custom distance function, the centroid passed to the function is undefined. The index used to get the centroid should be j instead of k.

var dist = fndist ? fndist(data[i],ks[k]) :

FR: custom distance function

Sometimes two data points should never be put in the same cluster. In those cases I'd like to assign them a high distance value, which I believe would require a custom function. Ideally it would be something where you could return 1.0 for "big distance" and call the super method for the real distance in all other cases.

kmpp cluster initialization option does not use custom distance function

The function:

// kinit.js
kmpp(data,k) {
  var distance = data[0].length? eudist : dist;
  var ks = [], len = data.length;
  var multi = data[0].length>0;
  var map = {};
// ...
}

does not use the user-provided distance function. Something like this would remedy it:

// kinit.js
kmpp(data,k,fndist) {
  var distance = fndist? fndist : data[0].length? eudist : dist;
// ...

// main.js
else if(initial=="kmpp") {
  ks = kmpp(data,k,fndist);
}
// ...
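
To make the idea concrete, here is a self-contained sketch of k-means++ seeding that honors a caller-supplied distance function. It is not the library's kinit.js; the names `kmppInit` and `defaultDistance`, and the injectable `random` argument, are assumptions made for illustration.

```javascript
// Fallback distance: absolute difference for numbers, Euclidean for arrays.
function defaultDistance(a, b) {
  if (!Array.isArray(a)) return Math.abs(a - b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// k-means++ seeding sketch (hypothetical helper, not the skmeans source).
// `fndist` is the optional user distance; `random` is injectable for testing.
function kmppInit(data, k, fndist, random = Math.random) {
  const distance = fndist || defaultDistance;
  // First centroid: a uniformly random data point.
  const centroids = [data[Math.floor(random() * data.length)]];

  while (centroids.length < k) {
    // Squared distance from each point to its nearest chosen centroid.
    const d2 = data.map(p => Math.min(...centroids.map(c => distance(p, c))) ** 2);
    const total = d2.reduce((s, v) => s + v, 0);
    if (total === 0) break; // every point coincides with a chosen centroid

    // Pick the next centroid with probability proportional to d2.
    let r = random() * total;
    let idx = 0;
    for (; idx < d2.length - 1; idx++) {
      r -= d2[idx];
      if (r < 0) break;
    }
    centroids.push(data[idx]);
  }
  return centroids;
}
```

Because the distance is resolved once at the top, the same seeding code works for the built-in metrics and for a custom one.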

[0, 0] centroid

@solzimer I used this dataset (500 random points) for this simple test:

var result = skmeans(dataset(), 16);
console.log(result.centroids);

However, I noticed that sometimes one or two centroids are set to [0, 0], which is completely out of range (here a graphical representation of the result).
I tried to debug the code to trace the issue (I believe it's a bug), but I couldn't find any useful hint.

Could you please check this out?
Because the output is non-deterministic, unfortunately I can't give you a more specific example to better highlight the issue, but please let me know if I can be of help.

License?

There doesn't seem to be any license attributed to this repo. Please clarify the licensing of this library. BTW, thank you @solzimer for this super simple implementation!

How to tell which centroid is most common?

This might be a stupid question.

I was reading https://adamspannbauer.github.io/2018/03/02/app-icon-dominant-colors/ and stumbled across this library.

The code on that blog article that I'm questioning is:

    #cluster and assign labels to the pixels 
    clt = KMeans(n_clusters = k)
    labels = clt.fit_predict(image)

    #count labels to find most popular
    label_counts = Counter(labels)

    #subset out most popular centroid
    dominant_color = clt.cluster_centers_[label_counts.most_common(1)[0][0]]

Is it possible to see which centroid cluster is the most common, or is that not really part of the k-means method?
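
One way to approach this with the documented result object is to count how often each centroid index appears in idxs. The sketch below assumes only the idxs and centroids fields described in the API section; `mostCommonCentroid` is a hypothetical helper name, and ties resolve to the lowest index.

```javascript
// Count cluster memberships from `idxs` and return the largest cluster.
// Hypothetical helper; works on any object shaped like the skmeans result.
function mostCommonCentroid(res) {
  const counts = new Array(res.centroids.length).fill(0);
  for (const idx of res.idxs) {
    counts[idx]++;
  }
  let best = 0;
  for (let i = 1; i < counts.length; i++) {
    if (counts[i] > counts[best]) best = i; // strict ">" keeps the first on ties
  }
  return { index: best, centroid: res.centroids[best], count: counts[best] };
}

// Assumed usage:
// var res = skmeans(pixels, k);
// var dominant = mostCommonCentroid(res).centroid;
```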

Possible mutation-related bug

On this line, a data point is assigned to the centroids array by reference:

ks[z++] = data[idx];

This means that as the centroids change later in the algorithm, the underlying data array is also changed. This is undesirable for two reasons: mutating the data array which is passed in could have downstream effects, and even if the data array isn't used subsequently, it makes the algorithm incorrect.

Assuming I'm not missing something here (very possible, as I am not a JS developer), data[idx].slice() would be the fix I would suggest.
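
The difference between the two assignments can be seen in isolation, without the library. This is a small demonstration of JavaScript reference semantics, not a patch to the skmeans source; the variable names are illustrative.

```javascript
// Two multidimensional points, as they would appear in the data array.
const data = [[1, 2], [3, 4]];

const byReference = data[0];     // same array object as data[0]
const byCopy = data[1].slice();  // shallow copy, independent of data[1]

// Simulate the algorithm updating its centroids in place.
byReference[0] = 99;
byCopy[0] = 99;

console.log(data[0]); // changed: the input was mutated through the reference
console.log(data[1]); // unchanged: the copy absorbed the update
```

Note that for unidimensional data the values are plain numbers, which are copied by value and have no slice method, so a fix along these lines would presumably need to copy only when the point is an array.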

`result.it` is wrong if specifying max iterations

I assume result.it should be <= maxIterations, but it is instead a large number (close to MAX).

const skmeans = require("skmeans")
const data = [1,12,13,4,25,21,22,3,14,5,11,2,23,24,15];

const result = skmeans(data,3,null,10);
console.log(result.it) // 9991

I see that it is calculated using MAX - it, where MAX = 10000, and it was counted backwards from 10 in this case. The calculation should use maxit instead when it is set.

skmeans/main.js

Line 175 in 6c8ffeb

it : MAX-it,
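
The suggested correction amounts to subtracting the remaining count from the limit that was actually in effect. A minimal sketch, assuming the countdown described above; `iterationsPerformed` and its parameter names are hypothetical and do not appear in main.js.

```javascript
const MAX = 10000; // default iteration limit stated in the README

// Iterations performed = effective limit minus the remaining countdown.
// `remaining` stands for the internal counter after the loop ends (assumption).
function iterationsPerformed(remaining, maxit) {
  const limit = maxit || MAX; // fall back to MAX when no limit is given
  return limit - remaining;
}

// With maxit = 10 and 9 iterations left, the reported value becomes 1,
// whereas MAX - 9 yields the 9991 seen in the report.
console.log(iterationsPerformed(9, 10));
```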

Same run gives different results

For me each run of the same dataset gives different results. Is this due to the kmpp initialisation algorithm? Is there a seed that can be passed in this case? Thanks.

Publish dist/skmeans.js as main

I'm trying to import your library into a pure ES5 library, however your main in package.json is pointing to your ES6 main.js instead of your dist/skmeans.js (which is ES5 built by browserify/babel).

Would it be possible to replace your main with your distributed file?

Pass in a random value seed to cluster initialization functions

I'm using this library in a context where I need the results to be deterministic. Would it be possible to provide the option to pass a seed into the cluster initialization functions?

I was thinking that an optional argument called seed could be passed into the skmeans function, which would then pass it on to kmpp and kmrand.

I am able to implement this functionality, I just want to make sure that this is behavior that library owners want.

Thanks!
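
Until such an option exists, one workaround that relies only on the documented centroids argument is to pick the initial centroids yourself with a seeded generator. The sketch below uses the mulberry32 PRNG; `seededCentroids` is a hypothetical helper, and it assumes k is not larger than the number of data points.

```javascript
// mulberry32: a small seeded pseudo-random generator returning values in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick k data points at distinct positions, deterministically for a given seed.
// Hypothetical helper, not part of skmeans. Points are copied so the
// input array is never shared with the centroid list.
function seededCentroids(data, k, seed) {
  const random = mulberry32(seed);
  const indices = data.map((_, i) => i);
  // Partial Fisher-Yates shuffle: only the first k slots are needed.
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(random() * (indices.length - i));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  return indices
    .slice(0, k)
    .map(i => (Array.isArray(data[i]) ? data[i].slice() : data[i]));
}

// Assumed usage, passing explicit initial centroids as documented:
// var res = skmeans(data, 3, seededCentroids(data, 3, 42));
```

With fixed initial centroids, repeated runs on the same data should converge to the same result. If the data contains duplicate values, two chosen centroids may still be equal even though their positions differ.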

Types in the @types/skmeans package are wrong

Hi,

First off, love your package, it's helped me out a lot!

Secondly, I was using your package to run some k-means clustering and I noticed that the types published under the @types/skmeans package are not correct.

Specifically, the skmeans function is typed as returning Data, and Data is typed as follows:

interface Data {
    it: number;
    k: number;
    centroids: number;
    idxs: number[];
    test: (x: number, point?: (x1: number, x2: number) => number) => void;
}

https://github.com/adamzerella/DefinitelyTyped/blob/e84211cf43b463bb00759e3156ea7010516ee9b1/types/skmeans/index.d.ts#L11

But centroids: number is not correct. As far as I can see, the centroids property returned by skmeans is an array of centroids, so the correct type should be centroids: number[] | number[][]
