
dbcv's Introduction

DBCV

Python implementation of Density-Based Clustering Validation

Source

Moulavi, Davoud, et al. "Density-based clustering validation." Proceedings of the 2014 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2014.

PDF

What is DBCV

How do you validate clustering assignments from unsupervised learning algorithms? A common method is the Silhouette Method, which provides an objective score between -1 and 1 on the quality of a clustering. The silhouette value measures how well an object fits its own cluster compared to neighboring clusters. The silhouette (and most other popular methods) works very well on globular clusters, but can fail on non-globular clusters such as:

non-globular

Here, we implement DBCV, which can validate clustering assignments on non-globular, arbitrarily shaped clusters (such as the example above). In essence, DBCV computes two values:

  • The density within a cluster
  • The density between clusters

High density within a cluster and low density between clusters indicate good clustering assignments.
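As a rough illustration of how these two quantities combine into a single score, here is a minimal sketch of the weighting scheme described in the paper. The names dsc (density sparseness within a cluster) and dspc (density separation between pairs of clusters) stand for precomputed values and are placeholders of mine, not functions from this package.

def cluster_validity(i, dsc, dspc):
    # Validity of cluster i: separation to its closest other cluster
    # compared with the sparseness within the cluster itself.
    min_separation = min(d for j, d in dspc[i].items() if j != i)
    return (min_separation - dsc[i]) / max(min_separation, dsc[i])

def dbcv_score(cluster_sizes, dsc, dspc):
    # DBCV is the size-weighted average of the per-cluster validities.
    n = sum(cluster_sizes.values())
    return sum(cluster_sizes[i] / n * cluster_validity(i, dsc, dspc)
               for i in cluster_sizes)

# Example with two clusters of 75 points each and illustrative values:
print(dbcv_score({0: 75, 1: 75},
                 dsc={0: 0.2, 1: 0.3},
                 dspc={0: {1: 0.9}, 1: {0: 0.9}}))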

Example

Here, I deliberately picked an example of clusters that density-based clustering works well on.

from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

n_samples=150
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
X = noisy_moons[0]
plt.scatter(X[:,0], X[:,1])
plt.show()

moons

What happens when we try K-means clustering on these non-globular clusters?

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans_labels = kmeans.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=kmeans_labels)
plt.show()

kmeans

...Not so great. What about HDBSCAN, a density based clustering method?

import hdbscan

hdbscanner = hdbscan.HDBSCAN()
hdbscan_labels = hdbscanner.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=hdbscan_labels)
plt.show()

hdbscan

That's pretty good. To assess the quality of the clustering using Density-Based Clustering Validation, we call DBCV:

from scipy.spatial.distance import euclidean
from DBCV import DBCV  # or: from DBCV.DBCV import DBCV, depending on how the package exposes it

kmeans_score = DBCV(X, kmeans_labels, dist_function=euclidean)
hdbscan_score = DBCV(X, hdbscan_labels, dist_function=euclidean)
print(kmeans_score, hdbscan_score)

K means returns a DBCV score of -0.71, and HDBSCAN returns a score of 0.60.

dbcv's People

Contributors

christopherjenness, galeone


dbcv's Issues

Question about the Core Distance of an Object formula

Thank you very much for providing the code for the DBCV index.

I noticed that in the _core_dist function you have defined the number of neighbours (n_neighbors) to equal the dimensionality of the dataset, np.shape(neighbors)[1] (line 57 of DBCV.py). Shouldn't this have been np.shape(neighbors)[0]?

Also, based on the formula of Moulavi et al. (definition 1, equation 3.1), shouldn't line 62 of your code have been core_dist = (numerator / (n_neighbors - 1)) ** (-1/n_features)?
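For reference, here is a hedged sketch of one reading of the all-points core distance from the paper (definition 1, equation 3.1). It is illustrative only, not the _core_dist function from this repository, and it assumes cluster_members holds the other objects of the same cluster with the point itself excluded.

import numpy as np

def all_points_core_dist(point, cluster_members, dist_function):
    # cluster_members: the other objects of the point's cluster, shape (|C|-1, d)
    n_other, n_features = cluster_members.shape
    # KNN(o, i): distance from the object to each other member of its cluster
    knn_dists = np.array([dist_function(point, m) for m in cluster_members])
    numerator = ((1.0 / knn_dists) ** n_features).sum()
    # ( sum_i (1 / KNN(o, i))^d / (|C| - 1) ) ^ (-1 / d)
    return (numerator / n_other) ** (-1.0 / n_features)

Under this reading, the denominator is the number of other members of the cluster (|C| - 1), which is what the question above is getting at.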

Results don't match with reference implementation in Matlab

Hello,

Thanks for this implementation of DBCV in Python. However, the results from this method don't match the reference implementation in Matlab by Moulavi et al.
This is partly because your implementation treats outliers as a cluster, but even fixing this leads to completely different results. The first example dataset of the reference implementation gives -0.2986 for your implementation, 0.5074 for your implementation with the correct outlier processing, and 0.6149 for the reference implementation.

I think these quite significant differences discourage use of this implementation in scientific contexts until this is fixed.
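For anyone who wants to approximate the "correct outlier processing" mentioned above from the outside, a hedged workaround is to drop noise points (label -1) before scoring, so that noise is not treated as its own cluster. This is only a sketch of that workaround, not a fix inside the package:

import numpy as np
from scipy.spatial.distance import euclidean
from DBCV import DBCV  # assuming the package exposes DBCV this way

def dbcv_without_noise(X, labels):
    # Exclude points labelled -1 (noise) so they are not scored as a cluster.
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1
    return DBCV(X[mask], labels[mask], dist_function=euclidean)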

Error running DBCV

Hi!

I am running the following code:
db = DBSCAN(eps=5, min_samples=9).fit(df)
labels = db.labels_
dbscan_score = DBCV(df, labels, dist_function=euclidean)
print(dbscan_score)

but I am having the following error:
File "*\DBScan.py", line 68, in
dbscan_score = DBCV(df, labels, dist_function=euclidean)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 30, in DBCV
graph = _mutual_reach_dist_graph(X, labels, dist_function)
File "C:\Python27\lib\site-packages\DBCV\DBCV.py", line 113, in _mutual_reach_dist_graph
point_i = X[row]
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2927, in getitem
indexer = self.columns.get_loc(key)
File "C:\Python27\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
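The KeyError here comes from indexing a pandas DataFrame positionally: inside the package, X[row] is interpreted as a column lookup, so row index 0 is not found. A hedged workaround is to pass a plain numpy array instead of the DataFrame, along these lines (df being the feature DataFrame from the report above):

from scipy.spatial.distance import euclidean
from sklearn.cluster import DBSCAN
from DBCV import DBCV  # assuming the package exposes DBCV this way

# .values converts the DataFrame to a numpy array, so X[row] indexes rows
# rather than column labels.
db = DBSCAN(eps=5, min_samples=9).fit(df.values)
dbscan_score = DBCV(df.values, db.labels_, dist_function=euclidean)
print(dbscan_score)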

Incomplete requirement list to run tests

Hi @christopherjenness, I think it would be nice if there was a requirements list to go through to install everything needed for the test file. Here is what I had to do to set that up on my system:

pip install -U scikit-learn to install sklearn
pip install pytest
pip install hdbscan or conda install -c conda-forge hdbscan

I actually also expected the test folder to provide an example of the code's application, not just the assertions; I would add an example.py for that (e.g. using the code in the README).

The execution time is very slow

Your solution is interesting. Unfortunately, it is not scalable. I ran it on 200 two-dimensional points and it took almost 6 seconds. For thousands of points I can't keep it running any more.

nan in result

In my program your DBCV code returns nan in some cases (sklearn's Calinski-Harabasz and Silhouette indices work well with this data: 3-dimensional, about 200-1000 points).

Issues with installation

Hello! I would like to hear your input about what's the best option for installing this package in an Anaconda environment.
I've tried this code in my Anaconda Prompt:

conda config --set ssl_verify false
pip install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV

with the following output:

  fatal: unable to access 'https://github.com/christopherjenness/DBCV.git/': SSL certificate problem: self signed certificate in certificate chain
  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/christopherjenness/DBCV.git 'C:\Users\s41534\AppData\Local\Temp\pip-install-f3_se_hc\dbcv_0cc75111be504782ad7112db4e48e064' did not run successfully.
│ exit code: 128
╰─> See above for output.

Thanks in advance!

Travis CI - use conda

HDBSCAN is inaccessible with pip, so conda is required. This is causing Travis CI issues:

The command "conda update --yes conda" failed and exited with 127 during .

Minimum spanning tree for each cluster vs. entire data set?

Thank you for publishing this DBCV implementation. I believe, however, that there is an error in the logic. On page 842 of the paper, regarding the minimum spanning tree computations, the paper states:

Based on the MRDs, a Minimum Spanning Tree (MST_MRD) is then built. This process is repeated for all the clusters in the partition, resulting in l minimum spanning trees, one for each cluster.

In this implementation, however, it appears that only one MST is being created for the entire data set: https://github.com/christopherjenness/DBCV/blob/master/DBCV/DBCV.py#L90
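For reference, a hedged sketch of the per-cluster construction the paper describes: restrict the mutual-reachability-distance matrix to one cluster's points, take its minimum spanning tree, and repeat for every cluster. The mrd matrix is assumed to be precomputed, and this is not how DBCV.py is currently structured:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def per_cluster_msts(mrd, labels):
    # mrd: (n, n) matrix of mutual reachability distances; labels: cluster ids.
    msts = {}
    for cluster in np.unique(labels):
        if cluster == -1:  # skip noise points
            continue
        idx = np.where(labels == cluster)[0]
        within = mrd[np.ix_(idx, idx)]  # MRDs among this cluster's points only
        msts[cluster] = minimum_spanning_tree(within).toarray()
    return msts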

If `np.shape(neighbors)[0]` is taken instead of `np.shape(neighbors)[1]` (as it should be), the resultant index always has a low value (hardly ever a positive one)... even when evaluating good clustering results such as the one obtained running hdbscan with the noisy moons dataset (provided by the author).

Does anyone know why?

Originally posted by @onofricamila in #10 (comment)

I also got a negative DBCV score for a good clustering from HDBSCAN. Is this expected?

Add installation instructions to README.md

I 100% appreciate the care that was given to make sure that this package is pip-installable from a well-organized GH repo, but I was surprised to find that I had to dig the egg name out of setup.py. It was a minor inconvenience, but just adding an "Installation" section to the README with the following line will probably be very helpful for others too. Cheers!

pipenv install git+https://github.com/christopherjenness/DBCV.git#egg=DBCV

Using with precomputed similarity matrix?

Hello,

Is it possible to use this with a precomputed similarity matrix? I suppose I could set X to a dummy matrix of index values and use a distance function that does a simple matrix lookup?

Ross
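The lookup idea suggested above can be sketched roughly as follows: D is a precomputed (n, n) distance matrix, the single-column X just carries the row indices, and the distance "function" reads entries from D. None of this has been verified against the package internals:

import numpy as np
from DBCV import DBCV  # assuming the package exposes DBCV this way

def score_from_precomputed(D, labels):
    # X is a dummy one-column matrix of point indices.
    indices = np.arange(D.shape[0]).reshape(-1, 1)

    def lookup_dist(a, b):
        # a and b are one-element index rows; return the stored distance.
        return D[int(a[0]), int(b[0])]

    return DBCV(indices, labels, dist_function=lookup_dist)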

How to import DBCV

Hello,
I am a newbie in the data science environment. I want to use your DBCV library in my project, but I did not find how to import it in a conda environment.

Thanks,
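Assuming the package was installed from GitHub as shown in the installation issues above, and given the module layout visible in the tracebacks on this page (a DBCV package containing DBCV.py with a DBCV function), the import would look roughly like this, reusing the README's moons example:

from sklearn import datasets
import hdbscan
from scipy.spatial.distance import euclidean
from DBCV import DBCV  # or: from DBCV.DBCV import DBCV, depending on the package __init__

X, _ = datasets.make_moons(n_samples=150, noise=.05)
labels = hdbscan.HDBSCAN().fit_predict(X)
print(DBCV(X, labels, dist_function=euclidean))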

Parallelization for speed improvement

I just wanted to calculate DBCV for my HDBSCAN result (312 points) and this now takes forever. Looking into the code, it seems it may be rather simple to parallelize e.g. the computation of the mutual reachability graph, as that takes by far the most time... I might fork and make a pull request then.
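As a rough illustration of that idea, here is a hedged sketch of computing the mutual reachability distances row by row in parallel with joblib. The helper names and the assumption that per-point core distances are already available are mine, not code from this repository:

import numpy as np
from joblib import Parallel, delayed
from scipy.spatial.distance import euclidean

def mutual_reach_row(i, X, core_dists):
    # mutual reachability dist(i, j) = max(core(i), core(j), d(i, j))
    return np.array([max(core_dists[i], core_dists[j], euclidean(X[i], X[j]))
                     for j in range(len(X))])

def mutual_reach_graph_parallel(X, core_dists, n_jobs=-1):
    # Each row of the graph is independent, so rows can be computed in parallel.
    rows = Parallel(n_jobs=n_jobs)(
        delayed(mutual_reach_row)(i, X, core_dists) for i in range(len(X)))
    return np.vstack(rows)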

About apts core distance numerator calculation

This is the original formula to calculate the core distance of a given object:

[image: the all-points core distance formula from the paper]

"KNN (o, i) be the distance between object o and its i th nearest neighbor.",

says the paper.

So, my question is: shouldn't we divide 1 by the i-th KNN distance instead of the distance to the i-th element, which is what we are currently doing? Thanks.

[image: the current numerator calculation in the code]
