Giter Club home page Giter Club logo

cluster-analysis's Introduction

Cluster Analysis

Introduction

Clustering is the most important unsupervised learning. It finds structures in a collection of unlabeled data. Actually, it is a process of organizing objects into groups whose elements are similar in some way. As a result, a cluster is similar to the objects belonging to it and dissimilar to the objects belonging to other clusters. There are several techniques for clustering. A comparison on different clustering algorithms using different datasets with performance measurements is shown here.

Clustering Algorithms

Partitional clustering, hierarchical clustering and density-based clustering techniques have been used to cluster analysis. The algorithms that are used are listed below:

  • Partitional Clustering Approach

    • KMeans
    • KMedoids
  • Hierarchical Clustering Approach

    • Agglomerative Nesting ( AGNES )
    • Balanced Iterative Reducing & Clustering Using Hierarchies ( BIRCH )
  • Density-based Clustering Approach

    • Density-Based Spatial Clustering of Applications with Noise ( DBSCAN )

For implementation purpose, these algorithms are used from sklearn library in python.[1]

Datasets

The datasets that are used to experiment different clustering algorithms are generated from sklearn datasets library [2]. The name of the datasets is listed below:

  • Sklearn dataset blobs
  • Sklearn dataset moons
  • Sklearn dataset circles

Each dataset contains 1500 samples with 2 features. Artificial noises have been added. Each dataset has been preprocessed using feature scaling before using to the clustering algorithms.[3]

GitHub Logo

Evaluating Metrics

Different evaluating metrics are used to show the performance of different algorithms on different datasets. Silhouette score, Adjusted Rand Index (ARI) score and Normalized Mutual Information (NMI) score have been used. Short description about these metrics is given below:

  • Silhouette score’s value ranges from -1 to 1. 1 means clusters are well apart from each other and clearly distinguished. 0 Means clusters are indifferent or the distance between clusters is not significant and -1 Means clusters are assigned in the wrong way.[4]

  • ARI score’s value ranges from 0 to 1. 0 means random labelling and 1 means perfect labelling.[5]

  • NMI score’s value also ranges from 0 to 1 and have same meaning like ARI.[6]

Experiment on Sklearn Blobs Dataset

KMeans, KMedoids, AGNES, BIRCH and DBSCAN algorithms have been used on sklearn blobs dataset. The results are shown in Figure 2. It shows that all algorithms perform very good clustering on it. Partitional clustering approaches such as KMeans and KMedoids perform well because cluster shapes are very simple with same density and spherical.

GitHub Logo

Evaluating metrics of KMeans, KMedoids, AGNES, BIRCH and DBSCAN algorithms on sklearn blobs dataset is shown in Table 1. ARI score and NMI score show that all algorithms perform well on this dataset but silhouette score fails to show it. However, the table shows that all algorithms perform well on this dataset.

GitHub Logo

Experiment on Sklearn Moons Dataset

KMeans, KMedoids, AGNES, BIRCH and DBSCAN algorithms have been used on sklearn moons dataset. The results are shown in Figure 3. It shows that DBSCAN performs very well on this dataset because it is capable of finding clusters of arbitrary shapes and sizes. But the KMeans, KMedoids, AGNES and BIRCH fail to cluster this dataset because it has complex cluster shapes and sizes.

GitHub Logo

Evaluating metrics of KMeans, KMedoids, AGNES, BIRCH and DBSCAN algorithms on sklearn moons dataset is shown in Table 2. All the metrics show that DBSCAN performs very well on this complex clustering shapes and other algorithms fail.

GitHub Logo

Experiment on Sklearn Circles Dataset

KMeans, KMedoids, AGNES, BIRCH and DBSCAN algorithms have been used on sklearn circles dataset. The results are shown in Figure 4. It shows that DBSCAN performs very well on this dataset because it can perform well on arbitrary shape and size. But the KMeans, KMedoids, AGNES and BIRCH fail to cluster this dataset because it has a circle shaped cluster which is very complex and these algorithms can’t deal with complex cluster shapes.

GitHub Logo

Evaluating metrics of KMeans, KMedoids, AGNES, BIRCH and DBSCAN algorithms on sklearn circles dataset is shown in Table 3. It shows DBSCAN performs very well and other algorithms fail to cluster this dataset.

GitHub Logo

References

[1]https://scikit-learn.org/stable/modules/clustering.html

[2]https://scikit-learn.org/stable/datasets/sample_generators.html

[3]https://realpython.com/k-means-clustering-python/

[4]https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c?fbclid=IwAR19fWzXXF_8t8oy1Oe7o1eCjqIpRsEul3Mu9h3_fKPGDG4_674sB0PecF4

[5]https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html

[6]https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html

cluster-analysis's People

Contributors

wnoyan avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.