This notebook is dedicated to exploring the optimal number of clusters (K) in K-Means clustering, an important step in unsupervised learning. It uses two renowned methods: the Elbow Method and Silhouette Analysis, providing insights into their mathematical formulas and practical applications.
The choice of K is pivotal in clustering:
- An underestimated K may lead to the merging of distinct groups, obscuring valuable insights ๐.
- An overestimated K could result in overfitting, capturing noise rather than the actual patterns, potentially forming meaningless clusters ๐ซ.
- Elbow Method: This method involves plotting the Within-Cluster Sum of Squares (
WCSS
) and identifying the 'elbow' point where the rate of decrease sharply changes. This point suggests a suitable number of clusters ๐. - Silhouette Analysis: This technique evaluates how similar a data point is to its own cluster compared to others. It calculates a silhouette score for each point, aiding in assessing the separation distance between the resulting clusters ๐.
An example is provided in the notebook, illustrating the practical application of these methods on a dataset. It includes generating mock data (where we already know the value of K), applying both Elbow and Silhouette analyses, and interpreting the results to determine the K again.