View Code? Open in Web Editor
NEW
Clustering methods in Machine Learning includes both theory and python code of each algorithm. Algorithms include K Mean, K Mode, Hierarchical, DB Scan and Gaussian Mixture Model GMM. Interview questions on clustering are also added in the end.
clustering-in-python's Introduction
Welcome to Clustering (Theory & Code)
01 Unsupervised Learning (Theory)
What is Unsupervised Learning & Goals of Unsupervised Learning
Type of Unsupervised Learning: 1.Clustering, 2.Association Rule & 3.Dimensionality Reduction
Definition and Application of Clustering
4 methods: 1.K Means 2.Hierarchical 3.DBScan & 4.Gaussian Mixture
03 Euclidean & Manhattan Distance (Theory)
Two points are near to each other, chances they are similar
Distance Measure between two points
Euclidean Distance: Under-root of Square distance between two points
Manhattan Distance: Absolute Distance between points
04 K-Means Clustering (Theory)
How Algorithim works (Step Wise Calculation)
Pre-processing required for K Means
Determining optimal number of K: 1.Profiling Approach & 2.Elbow Method
Working of Elbow Method with Example
3 concepts: 1.Total Error, 2.Variance/Total Squared Error & 3.Within Cluster Sum of Square (WCSS)
06 K Means Clustering (Python Code)
Define number of clusters, take centroids and measure distance
Euclidean Distance : Measure distance between points
Number of Clusters defined by Elbow Method
Elbow Method : WCSS vs Number of Cluster
Silhouette Score : Goodness of Clustering
07 Hierarchical Clustering (Theory)
Two Approaches: 1.Agglomerative(Botton-Up) & 2.Divisive(Top-Down)
Types of Linkages:
Single Linkage - Nearest Neighbour (Minimal intercluster dissimilarity)
Complete Linkage - Farthest Neighbour (Maximal intercluster dissimilarity)
Average Linkage - Average Distance (Mean intercluster dissimilarity)
Steps in Agglomerative Hierarchical Clustering with Single Linkage
Determining optimal number of Cluster: Dendogram
Hierarchical relationship between objects
Optimal number of Clusters for Hierarchical Clustering
09 Hierarchical Clustering (Python Code)
Type of HC
Agglomerative : Bottom Up approach
Divisive : Top Down approach
Number of Clusters defined by Dendogram
Dendogram : Joining datapoints based on distance & creating clusters
Linkage : To calculate distance between two points of two clusters
Single linkage : Minimum Distance between two clusters
Complete linkage : Maximum Distance between two clusters
Average linkage : Average Distance between two clusters
10 DBScan Clustering (Theory)
Density Based Clustering
Kmeans & Hierarchical good for compact & well seperated Data
Both are sensitive to Outliers & Noise
DBScan overcome all the issue & works well with Outliers
2 important parameters -
eps: Distance between 2 points is lower/equal to eps they are neighbours
MinPts: Minimum number of neighbours/data points with eps radius
11 DBScan Clustering (Python Code)
No need to give pre-define clusters
Distance metric is Euclidean Distance
Need to give 2 parameters
eps : Radius of the circle
min_samples : minimum data points to consider it as clusters
12 GMM Clustering (Theory)
Weakness of K Means
Expectation Maximization(EM) method
13 Gausian Mixture Model Clustering (Python Code)
Probablistic Model
Uses Expectation-Minimization (EM) steps:
E Step : Probability of datapoint of each cluster
M Step : For each cluster,revise parameter based on proabability
14 Cluster Adjustment (Theory)
2 Steps we normally do for Cluster Adjustement
Quality of Clustering (Cardinality & Magnitude)
Performance of Similiarity Measure (Euclidean Distance)
15 Silhouette Coefficient - Cluster Validation (Theory)
Clusters are well apart from each other as the silhouette score is closer to 1
It is a metric used to calculate the goodness of a clustering technique
Its value ranges from -1 to 1.
1: Means clusters are well apart from each other and clearly distinguished
0: Means clusters are indifferent, or distance between clusters is not significant
-1: Means clusters are assigned in the wrong way
16 Disadvantage & Choosing Right Clustering Method (Theory)
Disadvantage of each clustering techniques respectively
Based on the data, which is the right clustering method
17 Clustering Revision (Theory)
Short Description of Each Clustering Alogrithim
Advantage, Disadvantage
When to use what
18 Interview Questions on Clustering (Theory)
Commonly asked question on Clustering
For Categorical variable clustering, use K Modes
It uses the dissimilarities(total mismatch) between data points
Lesser the dissimilarities, the more our data points are closer
It uses Mode for most value in the column
clustering-in-python's People
Contributors