
Clustering.jl's Issues

Warnings with Julia v"0.4.0"

I'm seeing quite a few warnings from Clustering.jl on Julia 0.4.0

e.g.

WARNING: [a] concatenation is deprecated; use collect(a) instead
 in depwarn at deprecated.jl:73
 in oldstyle_vcat_warning at ./abstractarray.jl:29
 in hclust at /home/jeff/.julia/v0.4/Clustering/src/hclust.jl:334
 in hclust at /home/jeff/.julia/v0.4/Clustering/src/hclust.jl:342
 [inlined code] from /home/jeff/.julia/v0.4/Clustering/test/hclust.jl:7
 in anonymous at no file:0
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 [inlined code] from /home/jeff/.julia/v0.4/Clustering/test/runtests.jl:15
 in anonymous at no file:14
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 in process_options at ./client.jl:308
 in _start at ./client.jl:411
while loading /home/jeff/.julia/v0.4/Clustering/test/hclust.jl, in expression starting on line 6

and

WARNING: int(x) is deprecated, use Int(x) instead.
 in depwarn at deprecated.jl:73
 in int at deprecated.jl:50
 in kmeans! at /home/jeff/.julia/v0.4/Clustering/src/kmeans.jl:37
 in kmeans at /home/jeff/.julia/v0.4/Clustering/src/kmeans.jl:53
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 [inlined code] from /home/jeff/.julia/v0.4/Clustering/test/runtests.jl:15
 in anonymous at no file:14
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 in process_options at ./client.jl:308
 in _start at ./client.jl:411
while loading /home/jeff/.julia/v0.4/Clustering/test/kmeans.jl, in expression starting on line 15

and

WARNING: deprecated syntax "{a=>b, ...}" at /home/jeff/.julia/v0.4/Clustering/test/hclust_generated_examples.jl:558.
Use "Dict{Any,Any}(a=>b, ...)" instead.

I think these warnings come from changes to the Julia language that appeared around the time of the release candidates. I'm using version 0.4.0 of Clustering.jl, but I get the warnings when I check out master too.

test failure on OS X with 0.6

The cause is mcl: here is a simple reproducible example:

adj_matrix = [1.0   0.125 1.0  0.0  0.16 0.0;
              0.125 1.0   0.0  0.25 0.0  0.16;
              1.0   0.0   1.0  0.0  0.2  0.0;
              0.0   0.25  0.0  1.0  0.0  0.5;
              0.16  0.0   0.2  0.0  1.0  0.0;
              0.0   0.16  0.0  0.5  0.0  1.0]
mcl(adj_matrix, display=:verbose, inflation=1.5, expansion=1.5, save_final_matrix=true)

I had a quick look through; my guess is that a slight change in BLAS behaviour causes eig to give subtly different values, and that this error then gets amplified, but I don't understand the algorithm well enough to be sure.

DBSCAN gives error on Float32 array

I can only use the DBSCAN algorithm with a Float64 array, not with the Float32 array that is actually my input.

using Clustering
positions = zeros(Float32, 3, 10)
clusters = dbscan(positions, 0.3, min_neighbors=1, min_cluster_size=1, leafsize=20)

ERROR: MethodError: no method matching _dbscan(::NearestNeighbors.KDTree{StaticArrays.SVector{3,Float32},Distances.Euclidean,Float32}, ::Array{Float32,2}, ::Float64; min_neighbors=1, min_cluster_size=1)
Closest candidates are:
_dbscan{T<:AbstractFloat}(::NearestNeighbors.KDTree{V<:AbstractArray{T,1},M<:Union{Distances.Chebyshev,Distances.Cityblock,Distances.Euclidean,Distances.Minkowski},T}, ::Array{T<:AbstractFloat,2}, ::T<:AbstractFloat; min_neighbors, min_cluster_size) at /user/.julia/v0.5/Clustering/src/dbscan.jl:144
in #dbscan#6(::Int64, ::Array{Any,1}, ::Function, ::Array{Float32,2}, ::Float64) at /user/.julia/v0.5/Clustering/src/dbscan.jl:137
in (::Clustering.#kw##dbscan)(::Array{Any,1}, ::Clustering.#dbscan, ::Array{Float32,2}, ::Float64) at ./:0
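A hedged workaround until the signature is relaxed: the trace suggests the mismatch is between the Float64 radius and the Float32 element type of the tree, so converting either side should make dispatch succeed. A minimal sketch, assuming that diagnosis:

using Clustering

positions = zeros(Float32, 3, 10)

# Option 1 (known to work, per the report): promote the data to Float64
clusters = dbscan(convert(Matrix{Float64}, positions), 0.3,
                  min_neighbors=1, min_cluster_size=1, leafsize=20)

# Option 2 (untested guess): pass the radius as a Float32 literal so that it
# matches the element type of the data
clusters = dbscan(positions, 0.3f0,
                  min_neighbors=1, min_cluster_size=1, leafsize=20)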

Offering non-classical hierarchical clustering techniques within scope?

Is it within the scope of this package to provide some more modern HC techniques (e.g. ROCK, CURE, BIRCH)? Classical HC techniques (single linkage, centroid linkage, etc.) lack robustness and are sensitive to noise/outliers, and their quadratic computational complexity is problematic when applying them to large datasets.

More modern algorithms like CURE can better handle multidimensional data and sophisticated cluster shapes. CURE has 2000+ citations on Google Scholar, so there is clearly a large demand for HC techniques that can handle "big data". Wikipedia has the algorithm's pseudocode (I haven't checked its validity).

predict and score functions

For some use cases, predict and score functions would be helpful.
The predict function should return the assigned clusters for a set of observations; for kmeans it could look like this:

function predict(kmresult, X)
    # pairwise squared Euclidean distances between the fitted centers
    # (columns of kmresult.centers) and the observations (columns of X)
    dmat = Distances.pairwise(Distances.SqEuclidean(), kmresult.centers, X)
    # findmin over dim 1 gives linear indices of the per-column minima;
    # reduce them to row indices, i.e. the nearest center for each observation
    mod(findmin(dmat, 1)[2] .- 1, size(dmat, 1)) .+ 1
end

The score function should assign the given observations and return the 1/totalcost for these observations; it could look like this for kmeans:

function score(kmresult, X)
    # pairwise squared Euclidean distances between centers and observations
    dmat = Distances.pairwise(Distances.SqEuclidean(), kmresult.centers, X)
    # total cost: sum over observations of the squared distance to the
    # nearest center (take the reciprocal for the 1/totalcost score above)
    sum(findmin(dmat, 1)[1])
end
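For context, a hypothetical usage of these sketches with a fitted kmeans result (the data here is random and purely illustrative):

using Clustering, Distances

X = rand(2, 100)             # training observations in columns
res = kmeans(X, 3)

Xnew = rand(2, 10)           # new observations to assign
labels = predict(res, Xnew)  # index of the nearest center per observation
s = score(res, Xnew)         # total cost of that assignment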

Of course, it would be great to have those functions for all of the available clustering algorithms.

ERROR: sample_by_weights not defined

I did not see this error earlier; it appeared after I pulled the latest Julia.

julia(113)% julia ~/.julia/Clustering/test/kmeans_t1.jl
non-weighted
ERROR: sample_by_weights not defined
in kmeanspp_initialize! at /home/dr/.julia/Clustering/src/seeding.jl:24
in kmeans at /home/dr/.julia/Clustering/src/kmeans.jl:399
in kmeans at /home/dr/.julia/Clustering/src/kmeans.jl:403
in include_from_node1 at loading.jl:92
in process_options at client.jl:274
in _start at client.jl:349
at /home/dr/.julia/Clustering/test/kmeans_t1.jl:12

Failing on 0.2 according to PackageEvaluator

http://status.julialang.org/

Just tried it on 0.2 myself manually:

idunning@IAIN-DESKTOP:~/.../JuMP/test$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0 (2013-11-16 23:48 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org release
|__/                   |  x86_64-linux-gnu

julia> Pkg.add("Clustering")
INFO: Cloning cache of Clustering from git://github.com/johnmyleswhite/Clustering.jl.git
INFO: Cloning cache of Distance from git://github.com/JuliaStats/Distance.jl.git
INFO: Cloning cache of NumericExtensions from git://github.com/lindahua/NumericExtensions.jl.git
INFO: Cloning cache of StatsBase from git://github.com/JuliaStats/StatsBase.jl.git
INFO: Installing Clustering v0.2.4
INFO: Installing Distance v0.2.6
INFO: Installing NumericExtensions v0.3.6
INFO: Installing StatsBase v0.2.10
INFO: REQUIRE updated.

julia> using Clustering
Warning: could not import Base.foldl into NumericExtensions
Warning: could not import Base.foldr into NumericExtensions
Warning: could not import Base.sum! into NumericExtensions
Warning: could not import Base.maximum! into NumericExtensions
Warning: could not import Base.minimum! into NumericExtensions
ERROR: Distributions not found
 in require at loading.jl:39
 in include at boot.jl:238
at /home/idunning/.julia/Clustering/src/Clustering.jl:4

Bounds Error with Hclust on large Matrix

I'm trying to do hierarchical clustering on large-ish distance matrices. The following works fine:

using Distances
using Clustering

m1 = rand(100,100)
d1 = pairwise(Jaccard(), m1)
c1s = hclust(d1, :single)
c1a = hclust(d1, :average)

I was able to do it on a random table as big as 10k x 10k, but for my actual data table, which is about 12k x 12k, :single works while the :average hclust produces an error (eventually, after a rather long time):

BoundsError: attempt to access 8947-element Array{Any,1} at index [8984]
hclust(::Symmetric{Float64,Array{Float64,2}}, ::Symbol) at hclust.jl:338
hclust(::Array{Float64,2}, ::Symbol) at hclust.jl:351
include_string(::String, ::String) at loading.jl:515
include_string(::String, ::String, ::Int64) at eval.jl:30
include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N} where N) at eval.jl:34
(::Atom.##49#53{String,Int64,String})() at eval.jl:50
withpath(::Atom.##49#53{String,Int64,String}, ::String) at utils.jl:30
withpath(::Function, ::String) at eval.jl:38
macro expansion at eval.jl:49 [inlined]
(::Atom.##48#52{Dict{String,Any}})() at task.jl:80

Is this a memory error? I can do hclust in R for the same data, so I think it should work in principle.

Move to JuliaStats

@johnmyleswhite I am wondering whether you are happy with moving this package to JuliaStats.

Clustering.jl is one of the ML packages that has received relatively broad attention. Moving it might make it easier to get more support from the community.

How to use k-medoids?

The documentation for k-medoids requires a cost matrix C and a parameter k, the number of clusters. But C must be a k×m matrix, so k can be inferred from C; why are both necessary? Also, the MATLAB version doesn't require C as input at all. And C must be recalculated whenever a new candidate medoid is selected; I don't see any hooks to allow this. Is it possible that this is half-completed and doesn't do step 5 as described in the wiki? Or maybe step 5 is expected to be done outside.
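For what it's worth, a hedged usage sketch of the current API as I read it: C is the n×n pairwise cost matrix over the samples themselves, so k still has to be chosen separately.

using Clustering, Distances

X = rand(3, 100)                 # samples in columns
C = pairwise(SqEuclidean(), X)   # 100x100 pairwise cost matrix
res = kmedoids(C, 5)             # k = number of medoids, chosen by the user
res.medoids                      # indices of the samples picked as medoids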

[PackageEvaluator.jl] Your package Clustering may have a testing issue.

This issue is being filed by a script, but if you reply, I will see it.

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).

The results of this script are used to generate a package listing enhanced with testing results.

The status of this package, Clustering, on...

  • Julia 0.2 is 'No tests, but package loads.' PackageEvaluator.jl
  • Julia 0.3 is 'No tests, but package loads.' PackageEvaluator.jl

'No tests, but package loads.' can be due to there being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find your tests. Consider adding a test/runtests.jl file.

'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.

This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.

Implement fast optimal leaf ordering for Hclust

I recently had need of an implementation of the method in this paper:

We present the first practical algorithm for the optimal linear leaf ordering of trees that are generated by hierarchical clustering. Hierarchical clustering has been extensively used to analyze gene expression data, and we show how optimal leaf ordering can reveal biological structure that is not observed with an existing heuristic ordering method. For a tree with n leaves, there are 2^(n-1) linear orderings consistent with the structure of the tree. Our optimal leaf ordering algorithm runs in time O(n^4), and we present further improvements that make the running time of our algorithm practical.

I'm not sure I did it in the most efficient way possible, but for an hclust of a 5k x 5k distance matrix it took ~50 ms (generating the hclust itself was ~1000 ms). See the Jupyter notebook here.

I initially wrote it for my Microbiome package, but I think it makes more sense for it to live here, if you're up for a PR.

Plot recipe PR?

I'm in the process of writing a user recipe for Plots.jl to enable plotting of Hclust objects (see here). Generally, it makes sense for plotting recipes to live in the package that generates the object, but it would require accepting RecipesBase.jl as a dependency.

Before I get too far into the development I was wondering if this would be a PR that you would be willing to take on.

Docs

doc/source/varinfo.rst is not showing up on the readthedocs site for the latest build; when I built the docs locally, varinfo was there.
Are there any other functions missing from the documentation?

Nondeterministic methods should take an n_init parameter for how many times to run

Hiya,
In Python's sklearn, methods like K-Means and K-Medoids take an n_init parameter.
To use their description:

n_init : int, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs ...

I think it would be good to have that here,
particularly since a single run of k-means is a fairly poor method for clustering data.
Running it several times and taking the best is common practice.
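Until then, this can be emulated on the user side; a minimal sketch, assuming only kmeans and the totalcost field of its result (kmeans_best is a hypothetical helper, not part of Clustering.jl):

function kmeans_best(X, k; n_init=10, kwargs...)
    # run kmeans n_init times and keep the run with the lowest total cost
    best = kmeans(X, k; kwargs...)
    for _ in 2:n_init
        r = kmeans(X, k; kwargs...)
        if r.totalcost < best.totalcost
            best = r
        end
    end
    best
end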

sparse matrices fail

I get this error when calling kmeans on a sparse matrix:

julia> kmeans(x', 50)                                                                                                                                                         
ERROR: no method kmeans(SparseMatrixCSC{Float32,Int32}, Int64) 

Could this be due to the StoredArray change in Julia?
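A hedged workaround in the meantime, assuming the matrix fits in memory once densified:

kmeans(full(x'), 50)   # full() materializes the sparse matrix as dense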

Specific Docs on algo used for K-Means?

Hi, is the algorithm used for implementing K-Means the naive Lloyd iteration? Are there any benefits to, or need for, trying other algorithms like Pelleg-Moore or Hamerly? I would like to get started on some API work in Julia and found the K-Means code to be a good place to start; is it okay if I take it up and work on it?

`kmeans` dispatch problem

The current implementation of kmeans has the following declaration:

function kmeans(X::Matrix, k::Int;
                weights=nothing,
                init=_kmeans_default_init,
                maxiter::Integer=_kmeans_default_maxiter,
                tol::Real=_kmeans_default_tol,
                display::Symbol=_kmeans_default_display)

although it calls the function kmeans! whose declaration is:

function kmeans!{T<:AbstractFloat}(X::Matrix{T}, centers::Matrix{T};
                                   weights=nothing,
                                   maxiter::Integer=_kmeans_default_maxiter,
                                   tol::Real=_kmeans_default_tol,
                                   display::Symbol=_kmeans_default_display)

Here T is a subtype of AbstractFloat. This constraint on T is not present in kmeans, which allows us to call kmeans as:

kmeans(rand(Int,3,100), 5)

which throws an error.

I also require kmeans{T} to be constrained with T<:AbstractFloat because I am dispatching on kmeans for ImageSegmentation as:

kmeans{T<:Colorant,N}(x::AbstractArray{T,N}, args...; kwargs...)
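A minimal sketch of the requested fix, in the same pre-0.6 parametric syntax as the declarations above (the body is unchanged and elided):

function kmeans{T<:AbstractFloat}(X::Matrix{T}, k::Int;
                weights=nothing,
                init=_kmeans_default_init,
                maxiter::Integer=_kmeans_default_maxiter,
                tol::Real=_kmeans_default_tol,
                display::Symbol=_kmeans_default_display)
    # ... body unchanged; with the constraint, kmeans(rand(Int, 3, 100), 5)
    # fails at dispatch time instead of erroring inside kmeans! ...
end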

Generic clustering interface

Continuing the off-topic conversation in #12:

In R's cluster package, partitioning method results inherit from a common class which contains cluster assignments, silhouette information, value of the objective at the clustering, dissimilarity matrix, and sometimes the original data matrix. Hierarchical methods also inherit from a common class but there's not much information about it in the manual and I haven't looked at the code closely.

Maybe it's more useful to think about what methods should operate on the result of a clustering operation. The obvious one is cluster assignment. Even that is ambiguous for hierarchical methods without specifying some cutoff criterion or number of clusters. Others might be silhouette widths or objective value. In principle those could be applied to any clustering algorithm given a dissimilarity matrix (once cluster memberships are assigned), but for some algorithms you don't necessarily have a dissimilarity matrix sitting there. Fuzzy and model based algorithms would have additional methods.

Based on this brainstorm, types and methods might be:

ClusterPartition
    store cluster memberships
    method to return cluster memberships
    method to return silhouette
ClusterHierarchy
    store the clustering tree?
    method to reduce to a ClusterPartition given some criterion
    methods to summarize the hierarchy (I'm not too familiar with common approaches here)
ClusterFuzzy
    store weights for each cluster/observation
    method to reduce to a ClusterPartition, probably just argmax of cluster weights
    summarization
Maybe ClusterModel?
    method to reduce to ClusterFuzzy.
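One possible shape for this brainstorm, sketched in the pre-0.6 syntax of the time (all names are hypothetical, not an existing API):

abstract ClusteringResult

type ClusterPartition <: ClusteringResult
    assignments::Vector{Int}    # cluster index per observation
end

assignments(r::ClusterPartition) = r.assignments

type ClusterFuzzy <: ClusteringResult
    weights::Matrix{Float64}    # observations-by-clusters membership weights
end

# reduce fuzzy memberships to a hard partition via argmax of the weights
ClusterPartition(r::ClusterFuzzy) =
    ClusterPartition([indmax(r.weights[i, :]) for i in 1:size(r.weights, 1)])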

assignments of Fuzzy C-mean

I tried to use fuzzy_cmeans and the function seems to have worked.
However, there are no assignments or counts in FuzzyCMeansResult.
How can we get these values for each dataset?
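Until such fields exist, hard assignments and counts can be derived from the membership weights; a hedged sketch, assuming result.weights is an n-by-C matrix (the data here is illustrative):

using Clustering

X = rand(5, 200)
result = fuzzy_cmeans(X, 3, 2.0)    # 3 clusters, fuzziness 2.0

n, C = size(result.weights)
hard = [indmax(result.weights[i, :]) for i in 1:n]    # argmax per observation
cnts = [count(a -> a == c, hard) for c in 1:C]        # cluster sizes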

affinity propagation result is not consistent with sklearn in python

The two clustering results are different. The Julia version did not do any clustering, since the assignment is just the index of each object! My similarity matrix is too large to show here.

using Clustering

@time affinityPropResult = Clustering.affinityprop(similarityMatrix)

affinityPropResult.assignments
using PyCall

@pyimport sklearn.cluster as cl
af = cl.AffinityPropagation(affinity="precomputed")[:fit](similarityMatrix)

labels = af[:labels_]

The Travis tests also do not verify the correctness of the result.

kmeans extremely slow in julia 0.5

Hello,

While trying to get GaussianMixtures working with Julia v0.5, I am stumbling over an extremely slow kmeans, which I use for initializing the Gaussians.

v0.5:

@time kmeans(rand(10,10000), 5)
 19.641323 seconds (3.55 M allocations: 545.293 MB, 0.37% gc time)

v0.4:

@time kmeans(rand(10,10000), 5)
  0.084591 seconds (152.87 k allocations: 14.603 MB, 7.42% gc time)

It might be related to

WARNING: slice is deprecated, use view instead.
 in depwarn(::String, ::Symbol) at ./deprecated.jl:64
 in slice(::Array{Float64,2}, ::Vararg{Any,N}) at ./deprecated.jl:30
 in colwise!(::Array{Float64,1}, ::Distances.SqEuclidean, ::SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}, ::Array{Float64,2}) at /Users/david/.julia/v0.5/Distances/src/generic.jl:36
 in initseeds!(::Array{Int64,1}, ::Clustering.KmppAlg, ::Array{Float64,2}, ::Distances.SqEuclidean) at /Users/david/.julia/v0.5/Clustering/src/seeding.jl:98
 in initseeds(::Clustering.KmppAlg, ::Array{Float64,2}, ::Int64) at /Users/david/.julia/v0.5/Clustering/src/seeding.jl:22
 in initseeds(::Symbol, ::Array{Float64,2}, ::Int64) at /Users/david/.julia/v0.5/Clustering/src/seeding.jl:34
 in #kmeans#2(::Void, ::Symbol, ::Int64, ::Float64, ::Symbol, ::Function, ::Array{Float64,2}, ::Int64) at /Users/david/.julia/v0.5/Clustering/src/kmeans.jl:51
 in kmeans(::Array{Float64,2}, ::Int64) at /Users/david/.julia/v0.5/Clustering/src/kmeans.jl:49

as (repeated) warnings tend to make Julia very slow.

Feature Request: Mean shift clustering

It would be great to have an implementation of an algorithm that finds modes of kernel density estimates. The most common is the mean-shift algorithm:

Comaniciu, Dorin; Peter Meer (May 2002). "Mean Shift: A Robust Approach Toward Feature Space Analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE) 24 (5): 603–619. doi:10.1109/34.1000236.

A great short (2 page) guide to using mean shift algorithm for clustering http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
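For concreteness, the core of the method is just an iterated weighted mean; a minimal sketch of one update step with a Gaussian kernel (the function name and bandwidth h are illustrative, not a proposed API):

# One mean-shift update: move y to the kernel-weighted mean of the data
# (columns of X). Iterating this to a fixed point finds a mode of the KDE.
function meanshift_step(X::Matrix{Float64}, y::Vector{Float64}, h::Float64)
    w = [exp(-sum(abs2, X[:, i] - y) / (2h^2)) for i in 1:size(X, 2)]
    (X * w) / sum(w)
end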

I initially posted the request at KernelDensity.jl (JuliaStats/KernelDensity.jl#11) but thought it might be better suited here.

I made a pull request: #43

Why is the kmeans algorithm column-oriented instead of row-oriented?

In the docs (below), the kmeans algorithm takes a matrix where each column X[:, i] corresponds to an observed sample. This goes against the idea of tidy data and differs from Python's scikit-learn implementation of kmeans as well as R's base implementation.

Is there a good reason for this? Should this algorithm be changed from column-oriented to row-oriented so as to be consistent with R and Python as well as with the concept of tidy data?

URL: http://clusteringjl.readthedocs.io/en/stable/overview.html

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Sample matrix X, where each column X[:,i] corresponds to an observed sample.
  • Distance matrix D, where D[i,j] indicates the distance between samples i and j, or the cost of assigning one to the other.
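In the meantime, row-oriented (tidy) data just needs a transpose before the call; a minimal sketch:

using Clustering

Xrows = rand(1000, 10)    # tidy layout: 1000 observations x 10 features
res = kmeans(Xrows', 5)   # Clustering.jl expects observations in columns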

tag a new release of Clustering.jl?

I submitted a PR last week that suppressed warnings from Julia 0.4, but these changes have never been released. Without them, Clustering.jl generates so many warnings that it's hard to use (running its unit tests currently produces over 16,000 lines of warning messages). Will you tag a new release? Thanks much!

varinfo() clashes with InteractiveUtils.varinfo()

Unless there are better ideas, I suggest renaming it to variatinfo(), because varinfo really sounds like information about a variable.
What should the roadmap for renaming be? Introduce the new name and deprecate varinfo() in the next minor release, then remove varinfo() after some period of time (6 months or so)?

cc @ararslan

kmeans() error with float32 vectors

I get this error when my input vectors are Float32:

ERROR: no method _kmeans!(Array{Float32,2}, Nothing, Array{Float32,2}, Array{Int64,1}, Array{Float64,1}, Array{Int64,1}, Array{Float32,1}, KmeansOpts)                        
 in kmeans! at /Users/swade/.julia/Clustering/src/kmeans.jl:367
 in kmeans at /Users/swade/.julia/Clustering/src/kmeans.jl:387
 in kmeans at /Users/swade/.julia/Clustering/src/kmeans.jl:390
 in include_from_node1 at loading.jl:120

The error goes away when the vectors are Float64. Looking at the code, this does not seem to be intended.

Improve k-means

The current structure looks good to me, but it can be further extended to offer more options.

I am considering several improvements to k_means:

Change from row-based to column-based

Currently, it treats each "row" as a sample -- this is not cache friendly, since Julia matrices are column-major. For a large data matrix, operating by rows may incur a very severe penalty due to cache misses.

Also, in the typical machine learning literature, samples are generally treated as column vectors.

Additional Interface

Currently, it is

k_means(x, k, opts)

We can add an additional function, as

k_means(x, init_centers, opts)

This function allows users to directly supply their own set of initial centers -- in practice, users may well come up with a better initial guess based on their domain-specific knowledge.

Also, you don't have to provide the number k here, as it can be inferred immediately from the number of columns of init_centers.

The original k_means(x, k, opts) can then simply initialize a set of centers (using kmeans++) and call the function above.
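A sketch of the proposed overload (opts left untyped, as in the existing interface; the seeding helper named in the comment is hypothetical):

function k_means(x::Matrix{Float64}, init_centers::Matrix{Float64}, opts)
    k = size(init_centers, 2)    # k is implied by the supplied centers
    # ... run the standard iterative refinement starting from init_centers ...
end

# the existing entry point then reduces to seeding plus a call to the above:
# k_means(x, k, opts) = k_means(x, kmeanspp_seed(x, k), opts)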

Use Distance.jl to compute distances

My benchmarks show this can lead to over a 100x performance gain. Pairwise distance computation is the performance bottleneck of k-means algorithms.

Add more options

  • weights: allowing users to assign weights to samples
  • replicates: allowing users to specify the number of times to run k-means (the function eventually returns the best result)
  • allowing the user to specify what to do if a center gets no samples during an iterative update (this is possible in practice); the default can be to redraw a new center using a kmeans++ scheme

Provide Elkan's method as an optional choice,

which takes advantage of triangle inequality to reduce the computation of distances.
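For reference, the key pruning bound follows directly from the triangle inequality: for a point x and centers c and c', d(x, c') >= d(c, c') - d(x, c), so whenever d(c, c') >= 2 d(x, c) the center c' cannot be closer to x than c, and d(x, c') need not be computed at all.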


Would you please let me know if you have any feedback on this proposal?

There are two ways that I can contribute to this:

(1) If you grant me commit privileges, I can create a new branch for this development and merge it into master when we both agree that it is ready.

(2) I can fork it and submit a pull request later, but that can be a hassle if I have to modify it in the future for bug fixes or further improvements.

WARNING: both ArrayViews and Base export "view"

Testing on 0.5.0-dev+5478 gives:

WARNING: both ArrayViews and Base export "view"; uses of it in module Clustering must be qualified

Although ArrayViews.jl now says:

By and large, this package is no longer necessary: base julia now has efficient SubArrays (i.e., sub and slice).

Eventually giving errors like this:

ERROR: LoadError: UndefVarError: view not defined
 in initseeds!(::Array{Int64,1}, ::Clustering.KmppAlg, ::Array{Float64,2}, ::Distances.SqEuclidean) at /Users/me/.julia/v0.5/Clustering/src/seeding.jl:98

Function dbscan does not accept keyword arguments

Not sure if this is a Julia v0.5 issue:

I just installed Clustering and ran the following commands in the REPL:

using Clustering;
clusters= dbscan(randn(3,10000), 0.05, min_neighbors=3, min_cluster_size=20);

ERROR: function dbscan does not accept keyword arguments
in kwfunc(::Any) at .\boot.jl:236

Please tag latest version in METADATA

Hello,

I am in the process of making GaussianMixtures compatible with Julia v0.5. GaussianMixtures depends on Clustering and needs the latest commits in order to compile. I don't see a way to depend on a specific commit, so the request is to tag the latest commit in METADATA.

Thanks a lot,

---david

Pkg.add("Clustering") is not working in julia 0.3.5

Dear All,

I tried to add the Clustering package and Julia gives this:

ERROR: unknown package Clustering
in wait at task.jl:51
in sync_end at task.jl:311
in add at pkg/entry.jl:55
in anonymous at pkg/dir.jl:28
in cd at file.jl:30
in cd at pkg/dir.jl:28

The computing environment:

  • Julia version: 0.3.5
  • OS: Windows 8.1 with Bing, 32-bit.

Best Regards.

Hierarchical clustering?

Any plans / interest to add hierarchical clustering to this package? Or is that more appropriate for a different package? In that case, this package should be renamed to Kmeans or some such.
