
dbcv's Introduction

DBCV

Python implementation of Density-Based Clustering Validation

Source

Moulavi, Davoud, et al. "Density-based clustering validation." Proceedings of the 2014 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2014.

PDF

What is DBCV

How do you validate clustering assignments from unsupervised learning algorithms? A common method is the Silhouette Method, which provides an objective score between -1 and 1 on the quality of a clustering. The silhouette value measures how well an object fits its own cluster compared to neighboring clusters. The silhouette (and most other popular methods) works very well on globular clusters, but can fail on non-globular clusters such as:

non-globular

Here, we implement DBCV, which can validate clustering assignments on non-globular, arbitrarily shaped clusters (such as the example above). In essence, DBCV computes two values:

  • The density within a cluster
  • The density between clusters

High density within a cluster and low density between clusters indicate good clustering assignments.
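As a rough illustration of how these two quantities combine into a single score, here is a minimal sketch of the weighting scheme described in the paper. The names dsc (density sparseness within a cluster) and dspc (density separation between pairs of clusters) stand for precomputed values and are placeholders of mine, not functions from this package.

def cluster_validity(i, dsc, dspc):
    # Validity of cluster i: separation to its closest other cluster
    # compared with the sparseness within the cluster itself.
    min_separation = min(d for j, d in dspc[i].items() if j != i)
    return (min_separation - dsc[i]) / max(min_separation, dsc[i])

def dbcv_score(cluster_sizes, dsc, dspc):
    # DBCV is the size-weighted average of the per-cluster validities.
    n = sum(cluster_sizes.values())
    return sum(cluster_sizes[i] / n * cluster_validity(i, dsc, dspc)
               for i in cluster_sizes)

# Example with two clusters of 75 points each and illustrative values:
print(dbcv_score({0: 75, 1: 75},
                 dsc={0: 0.2, 1: 0.3},
                 dspc={0: {1: 0.9}, 1: {0: 0.9}}))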

Example

Here, I deliberately picked an example of clusters that density-based clustering works well on.

from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

n_samples=150
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
X = noisy_moons[0]
plt.scatter(X[:,0], X[:,1])
plt.show()

moons

What happens when we try K-means clustering on these non-globular clusters?

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans_labels = kmeans.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=kmeans_labels)
plt.show()

kmeans

...Not so great. What about HDBSCAN, a density based clustering method?

import hdbscan

hdbscanner = hdbscan.HDBSCAN()
hdbscan_labels = hdbscanner.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=hdbscan_labels)
plt.show()

hdbscan

That's pretty good. To assess the quality of the clustering using Density-Based Clustering Validation, we call DBCV:

from scipy.spatial.distance import euclidean
from DBCV import DBCV  # or: from DBCV.DBCV import DBCV, depending on how the package exposes it

kmeans_score = DBCV(X, kmeans_labels, dist_function=euclidean)
hdbscan_score = DBCV(X, hdbscan_labels, dist_function=euclidean)
print(kmeans_score, hdbscan_score)

K means returns a DBCV score of -0.71, and HDBSCAN returns a score of 0.60.

dbcv's People

Contributors

christopherjenness, galeone


dbcv's Issues

Question about the Core Distance of an Object formula

Thank you very much for providing the code for the DBCV index.

I noticed that in the _core_dist function you have defined the number of neighbours (n_neighbors) to equal the dimensionality of the dataset, np.shape(neighbors)[1] (line 57 of DBCV.py). Shouldn't this have been np.shape(neighbors)[0]?

Also, based on the formula of Moulavi et al. (definition 1, equation 3.1), shouldn't line 62 of your code have been core_dist = (numerator / (n_neighbors - 1)) ** (-1/n_features)?
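For reference, here is a hedged sketch of one reading of the all-points core distance from the paper (definition 1, equation 3.1). It is illustrative only, not the _core_dist function from this repository, and it assumes cluster_members holds the other objects of the same cluster with the point itself excluded.

import numpy as np

def all_points_core_dist(point, cluster_members, dist_function):
    # cluster_members: the other objects of the point's cluster, shape (|C|-1, d)
    n_other, n_features = cluster_members.shape
    # KNN(o, i): distance from the object to each other member of its cluster
    knn_dists = np.array([dist_function(point, m) for m in cluster_members])
    numerator = ((1.0 / knn_dists) ** n_features).sum()
    # ( sum_i (1 / KNN(o, i))^d / (|C| - 1) ) ^ (-1 / d)
    return (numerator / n_other) ** (-1.0 / n_features)

Under this reading, the denominator is the number of other members of the cluster (|C| - 1), which is what the question above is getting at.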

Results don't match with reference implementation in Matlab

Hello,

Thanks for this implementation of DBCV in Python. However, the results from this method don't match the reference implementation in Matlab by Moulavi et al.
This is partly because your implementation treats outliers as a cluster, but even fixing this leads to completely different results. The first example dataset of the reference implementation gives -0.2986 for your implementation, 0.5074 for your implementation with the correct outlier processing, and 0.6149 for the reference implementation.

I think these quite significant differences discourage use of this implementation in scientific contexts until this is fixed.
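For anyone who wants to approximate the "correct outlier processing" mentioned above from the outside, a hedged workaround is to drop noise points (label -1) before scoring, so that noise is not treated as its own cluster. This is only a sketch of that workaround, not a fix inside the package:

import numpy as np
from scipy.spatial.distance import euclidean
from DBCV import DBCV  # assuming the package exposes DBCV this way

def dbcv_without_noise(X, labels):
    # Exclude points labelled -1 (noise) so they are not scored as a cluster.
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1
    return DBCV(X[mask], labels[mask], dist_function=euclidean)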

Error running DBCV

Hi!

I am running the following code:
db = DBSCAN(eps=5, min_samples=9).fit(df)
labels = db.labels_
dbscan_score = DBCV(df, labels, dist_function=euclidean)
print(dbscan_score)

but I am having the following error:
File "*\DBScan.py", line 68, in
dbscan_score = DBCV(df, labels, dist_function=euclidean)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 30, in DBCV
graph = _mutual_reach_dist_graph(X, labels, dist_function)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 113, in _mutual_reach_dist_graph
point_i = X[row]
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2927, in getitem
indexer = self.columns.get_loc(key)
File "C:\Python27\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
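The KeyError here comes from indexing a pandas DataFrame positionally: inside the package, X[row] is interpreted as a column lookup, so row index 0 is not found. A hedged workaround is to pass a plain numpy array instead of the DataFrame, along these lines (df being the feature DataFrame from the report above):

from scipy.spatial.distance import euclidean
from sklearn.cluster import DBSCAN
from DBCV import DBCV  # assuming the package exposes DBCV this way

# .values converts the DataFrame to a numpy array, so X[row] indexes rows
# rather than column labels.
db = DBSCAN(eps=5, min_samples=9).fit(df.values)
dbscan_score = DBCV(df.values, db.labels_, dist_function=euclidean)
print(dbscan_score)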

Incomplete requirement list to run tests

Hi @christopherjenness, I think it would be nice if there was a requirements list to go through to install everything needed for the test file. Here is what I had to do to set that up on my system:

pip install -U scikit-learn to install sklearn
pip install pytest
pip install hdbscan or conda install -c conda-forge hdbscan

I actually also expected the test folder to provide an example of the code's application, not just the assertions; I would add an example.py for that (e.g. using the code in the README).

The execution time is very slow

Your solution is interesting. Unfortunately, it is not scalable. I ran it on 200 two-dimensional points and it took almost 6 seconds. For thousands of points I can't keep it running any more.

nan in result

In my program your DBCV code returns nan in some cases (sklearn's Calinski-Harabasz and Silhouette indices work well with this data: 3-dimensional, about 200-1000 points).

Issues with installation

Hello! I would like to hear your input about what's the best option for installing this package in an Anaconda environment.
I've tried this code in my Anaconda Prompt:

conda config --set ssl_verify false
pip install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV

with the following output:

  fatal: unable to access 'https://github.com/christopherjenness/DBCV.git/': SSL certificate problem: self signed certificate in certificate chain
  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
│ exit code: 128
╰─> See above for output.

Thanks in advance!

Travis CI - use conda

HDBSCAN is inaccessible with pip, so conda is required. This is causing Travis CI issues:

The command "conda update --yes conda" failed and exited with 127 during .

Minimum spanning tree for each cluster vs. entire data set?

Thank you for publishing this DBCV implementation. I believe, however, that there is an error in the logic. On page 842 of the paper, regarding the minimum spanning tree computations, the paper states:

Based on the MRDs, a Minimum Spanning Tree (MST_MRD) is then built. This process is repeated for all the clusters in the partition, resulting in l minimum spanning trees, one for each cluster.

In this implementation, however, it appears that only one MST is being created for the entire data set: https://github.com/christopherjenness/DBCV/blob/master/DBCV/DBCV.py#L90
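For reference, a hedged sketch of the per-cluster construction the paper describes: restrict the mutual-reachability-distance matrix to one cluster's points, take its minimum spanning tree, and repeat for every cluster. The mrd matrix is assumed to be precomputed, and this is not how DBCV.py is currently structured:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def per_cluster_msts(mrd, labels):
    # mrd: (n, n) matrix of mutual reachability distances; labels: cluster ids.
    msts = {}
    for cluster in np.unique(labels):
        if cluster == -1:  # skip noise points
            continue
        idx = np.where(labels == cluster)[0]
        within = mrd[np.ix_(idx, idx)]  # MRDs among this cluster's points only
        msts[cluster] = minimum_spanning_tree(within).toarray()
    return msts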

If `np.shape(neighbors)[0]` is taken instead of `np.shape(neighbors)[1]` (as it should be), the resultant index always has a low value (hardly ever a positive one)... even when evaluating good clustering results such as the one obtained running hdbscan with the noisy moons dataset (provided by the author).

Does anyone know why?

Originally posted by @onofricamila in #10 (comment)

I also got a negative DBCV score for a good clustering from HDBSCAN. Is this expected?

Add installation instructions to README.md

I 100% appreciate the care that was given to make sure that this package is pip-installable from a well-organized GH repo, but I was surprised to find that I had to dig the egg name out of setup.py. It was a minor inconvenience, but just adding an "Installation" section to the README with the following line will probably be very helpful for others too. Cheers!

pipenv install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV

Using with precomputed similarity matrix?

Hello,

Is it possible to use this with a precomputed similarity matrix? I suppose I could set X to a dummy matrix of index values and use a distance function that does a simple matrix lookup?

Ross
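The lookup idea suggested above can be sketched roughly as follows: D is a precomputed (n, n) distance matrix, the single-column X just carries the row indices, and the distance "function" reads entries from D. None of this has been verified against the package internals:

import numpy as np
from DBCV import DBCV  # assuming the package exposes DBCV this way

def score_from_precomputed(D, labels):
    # X is a dummy one-column matrix of point indices.
    indices = np.arange(D.shape[0]).reshape(-1, 1)

    def lookup_dist(a, b):
        # a and b are one-element index rows; return the stored distance.
        return D[int(a[0]), int(b[0])]

    return DBCV(indices, labels, dist_function=lookup_dist)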

How to import DBCV

Hello,
I am a newbie in the data science environment. I want to use your DBCV library in my project, but I did not find how to import it in a conda environment.

Thanks,
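Assuming the package was installed from GitHub as shown in the installation issues above, and given the module layout visible in the tracebacks on this page (a DBCV package containing DBCV.py with a DBCV function), the import would look roughly like this, reusing the README's moons example:

from sklearn import datasets
import hdbscan
from scipy.spatial.distance import euclidean
from DBCV import DBCV  # or: from DBCV.DBCV import DBCV, depending on the package __init__

X, _ = datasets.make_moons(n_samples=150, noise=.05)
labels = hdbscan.HDBSCAN().fit_predict(X)
print(DBCV(X, labels, dist_function=euclidean))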

Parallelization for speed improvement

I just wanted to calculate DBCV for my HDBSCAN result (312 points) and this now takes forever. Looking into the code, it seems it may be rather simple to parallelize e.g. the computation of the mutual reachability graph, as that takes by far the most time... I might fork and make a pull request then.
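As a rough illustration of that idea, here is a hedged sketch of computing the mutual reachability distances row by row in parallel with joblib. The helper names and the assumption that per-point core distances are already available are mine, not code from this repository:

import numpy as np
from joblib import Parallel, delayed
from scipy.spatial.distance import euclidean

def mutual_reach_row(i, X, core_dists):
    # mutual reachability dist(i, j) = max(core(i), core(j), d(i, j))
    return np.array([max(core_dists[i], core_dists[j], euclidean(X[i], X[j]))
                     for j in range(len(X))])

def mutual_reach_graph_parallel(X, core_dists, n_jobs=-1):
    # Each row of the graph is independent, so rows can be computed in parallel.
    rows = Parallel(n_jobs=n_jobs)(
        delayed(mutual_reach_row)(i, X, core_dists) for i in range(len(X)))
    return np.vstack(rows)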

About apts core distance numerator calculation

This is the original formula to calculate the core distance of a given object:

[image: the all-points core distance formula from the paper]

"KNN (o, i) be the distance between object o and its i th nearest neighbor.",

says the paper.

So, my question is: shouldn't we divide 1 by the i-th KNN distance instead of the distance to the i-th element, which is what we are currently doing? Thanks.

[image: the current numerator calculation in the code]
