Clustering algorithms
Unsupervised machine learning
K-Means
A basic K-Means configuration is placed under the k-means key:
k-means:
  run: True
  k-min: 2
  k-max: 5
  n-init: 10 # number of initializations of k means algorithm
  elbow method: True # whether to use elbow method to find optimal number of clusters
  silhouette method: True # whether to use silhouette method to find optimal number of clusters
run: Whether to run K-Means clustering.
k-min: Minimum number of clusters.
k-max: Maximum number of clusters.
n-init: Number of initializations of the K-Means algorithm.
elbow method: Whether to use the elbow method to find the optimal number of clusters.
silhouette method: Whether to use the silhouette method to find the optimal number of clusters.
Elbow method
The elbow method is a simple heuristic for choosing the number of clusters. It is based on the within-cluster sum of squares (WCSS), which is the sum of the squared distances between each point and its cluster center. The WCSS generally decreases as the number of clusters grows, so it cannot simply be minimized: the smallest value is always found at the largest number of clusters. Instead, the elbow method plots the WCSS as a function of the number of clusters and takes the optimal number of clusters to be the point where the curve bends and the WCSS starts to level off, so that adding further clusters yields only small reductions. This point is often referred to as the "elbow point".
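The WCSS definition and the search for the elbow point can be sketched in plain Python. This is only an illustration, not necessarily how the tool implements it: the function names wcss and elbow_k are hypothetical, and the rule used to locate the elbow (the point farthest from the straight line joining the first and last points of the curve) is one common heuristic chosen here as an assumption.

```python
def wcss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((p_i - c_i) ** 2 for p_i, c_i in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


def elbow_k(ks, wcss_values):
    """Return the k whose (k, WCSS) point lies farthest from the chord
    joining the first and last points of the curve (assumed heuristic)."""
    if len(ks) < 3:
        raise ValueError("need at least three k values to locate an elbow")
    x0, y0 = ks[0], wcss_values[0]
    x1, y1 = ks[-1], wcss_values[-1]
    best_k, best_gap = ks[0], -1.0
    for k, w in zip(ks, wcss_values):
        # Proportional to the perpendicular distance from (k, w) to the chord;
        # the constant denominator is dropped because only the argmax matters.
        gap = abs((y1 - y0) * (k - x0) - (x1 - x0) * (w - y0))
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

In practice one would compute wcss once per candidate number of clusters between k-min and k-max and pass the resulting list to elbow_k. Because the elbow is a visual judgment, any automatic rule like this one is best checked against the plot.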
Silhouette method
The silhouette method is another way to find the optimal number of clusters. It is based on the idea that the optimal number of clusters is the one that maximizes the mean silhouette score. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1: a value near 1 means the object is well matched to its own cluster and poorly matched to neighboring clusters, a value near 0 means the object lies close to the boundary between two clusters, and a negative value suggests the object may have been assigned to the wrong cluster. The silhouette method plots the mean silhouette score as a function of the number of clusters and selects the number of clusters with the highest score.
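The silhouette score described above can be computed directly from its definition. The following is a minimal sketch under stated assumptions, not the tool's own implementation: it uses Euclidean distance, the function name silhouette_score is chosen here for illustration, and points in single-member clusters are given a score of 0 by convention.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the listed members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # single-member cluster: score 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)
```

To pick the number of clusters, one would cluster the data once for each candidate value and keep the value whose labelling gives the highest score. The nested loops over all pairs of points make the cost grow roughly quadratically with the number of points.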
Running the silhouette method is computationally expensive because it compares every point with every other point, so it is recommended only for small datasets.
For large k-max values, one can first use the elbow method over the full range, then run the silhouette method over a smaller range of values around the elbow point to find the optimal number of clusters.
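The two-stage approach can be sketched as follows. This is an illustration only: candidate_range, refine_with_silhouette, the margin parameter, and the silhouette_for_k callable are all hypothetical names, and the scorer in the usage lines is a toy stand-in rather than real clustering output.

```python
def candidate_range(elbow_k, k_min, k_max, margin=1):
    """k values within `margin` of the elbow estimate, clipped to [k_min, k_max]."""
    low = max(k_min, elbow_k - margin)
    high = min(k_max, elbow_k + margin)
    return list(range(low, high + 1))


def refine_with_silhouette(candidates, silhouette_for_k):
    """Score only the candidate k values and return (best k, scores by k)."""
    scores = {k: silhouette_for_k(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Suppose a cheap elbow sweep over k = 2..20 suggested an elbow at k = 4.
candidates = candidate_range(elbow_k=4, k_min=2, k_max=20, margin=2)

# Toy stand-in scorer that peaks at k = 5. A real scorer would cluster the
# data with k clusters and return the mean silhouette score.
best_k, scores = refine_with_silhouette(candidates, lambda k: -abs(k - 5))
```

With this split, the expensive silhouette computation runs for only a handful of values instead of the whole range from k-min to k-max.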