Clustering algorithms

Unsupervised machine learning

K-Means

K-Means is configured under the k-means key. A basic configuration looks like this:

k-means:
  run: True
  k-min: 2
  k-max: 5
  n-init: 10 # number of initializations of the K-Means algorithm
  elbow method: True # whether to use the elbow method to find the optimal number of clusters
  silhouette method: True # whether to use the silhouette method to find the optimal number of clusters

  • run: Whether to run K-Means clustering.
  • k-min: Minimum number of clusters to try.
  • k-max: Maximum number of clusters to try.
  • n-init: Number of initializations of the K-Means algorithm.
  • elbow method: Whether to use the elbow method to find the optimal number of clusters.
  • silhouette method: Whether to use the silhouette method to find the optimal number of clusters.
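
As a rough illustration of how these settings could interact, the sketch below runs a plain-Python K-Means for every k from k-min to k-max and keeps the best of n-init random initializations. The function names (lloyd, kmeans_sweep), the seed parameter, and the choice of random data points as initial centers are assumptions made for this example, not the tool's actual implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, k, rng, max_iter=100):
    """One K-Means run from a random initialization; returns (wcss, labels)."""
    centers = rng.sample(points, k)  # assumption: random data points as initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return wcss, labels


def kmeans_sweep(points, k_min=2, k_max=5, n_init=10, seed=0):
    """For each k in [k_min, k_max], keep the lowest-WCSS run out of n_init."""
    rng = random.Random(seed)
    results = {}
    for k in range(k_min, k_max + 1):
        runs = (lloyd(points, k, rng) for _ in range(n_init))
        results[k] = min(runs, key=lambda run: run[0])
    return results
```

Calling kmeans_sweep(points) with a list of coordinate tuples returns a dictionary that maps each k to its best (WCSS, labels) pair. Those per-k WCSS values are the input the elbow method works on.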

Elbow method

The elbow method is a simple heuristic for choosing the number of clusters. It is based on the within-cluster sum of squares (WCSS), which is the sum of the squared distances between each point and its cluster center. Because the WCSS generally keeps decreasing as more clusters are added, it cannot simply be minimized. Instead, the elbow method plots the WCSS as a function of the number of clusters and takes as optimal the point where the WCSS starts to level off, that is, where adding another cluster yields only a small further reduction. This point is often referred to as the "elbow point".
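
The elbow is usually read off the plot by eye, but it can also be located automatically. The sketch below uses one common heuristic, assumed here purely for illustration: rescale both axes to [0, 1], draw a straight line from the first to the last point of the curve, and pick the k whose WCSS lies furthest below that line. The function name elbow_k is hypothetical, and the tool may use a different rule.

```python
def elbow_k(ks, wcss):
    """Return the k whose WCSS lies furthest below the line joining
    the first and last points of the (k, WCSS) curve.

    ks   -- increasing cluster counts, e.g. [2, 3, 4, 5]
    wcss -- WCSS value measured for each entry of ks
    """
    if len(ks) != len(wcss) or len(ks) < 3:
        raise ValueError("need at least three (k, WCSS) pairs of equal length")
    x_span = ks[-1] - ks[0]
    y_span = wcss[0] - wcss[-1]
    if x_span <= 0 or y_span <= 0:
        raise ValueError("ks must increase and WCSS must decrease overall")

    best_k, best_gap = ks[0], 0.0
    for k, w in zip(ks, wcss):
        x = (k - ks[0]) / x_span      # 0 at the first k, 1 at the last
        y = (w - wcss[-1]) / y_span   # 1 at the first k, 0 at the last
        gap = 1.0 - x - y             # how far the point sits below the line x + y = 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

For example, with ks = [2, 3, 4, 5] and WCSS values [400, 100, 80, 70], the curve drops sharply from k = 2 to k = 3 and flattens afterwards, so elbow_k returns 3.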

Silhouette method

The silhouette method is another way to find the optimal number of clusters. It is based on the idea that the optimal number of clusters is the one that maximizes the average silhouette score. The silhouette score of an object measures how similar it is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1: a value near 1 means the object is well matched to its own cluster and poorly matched to neighboring clusters, a value near 0 means it lies between two clusters, and a negative value suggests it may have been assigned to the wrong cluster. The silhouette method computes the average silhouette score for each number of clusters and selects the number of clusters with the highest score.
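
To make the score concrete, here is a minimal plain-Python sketch of the standard silhouette definition: for each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to the members of any other cluster, and the point's score is (b - a) / max(a, b). The function names silhouette_score and best_k are chosen for this example, and Euclidean distance is an assumption. The nested loops over all points also show why the cost grows quadratically with the dataset size.

```python
import math


def silhouette_score(points, labels):
    """Average silhouette score of a labeling, using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0 by convention
        # Cohesion: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is in the sum, hence len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(scores):
    """Pick the number of clusters with the highest average silhouette score."""
    return max(scores, key=scores.get)
```

Given a dictionary of average scores per number of clusters, such as {2: 0.41, 3: 0.62, 4: 0.55} (made-up values), best_k returns 3.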

Running the silhouette method is computationally expensive because it compares every point with every other point, so it is recommended only for small datasets.

For large k-max values, one can first use the elbow method to narrow down the range, then run the silhouette method on that smaller range to find the optimal number of clusters.
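
One possible way to derive the smaller range is sketched below: take the elbow found on the wide scan and keep only a few neighboring values of k for the silhouette pass. The helper name narrowed_range and its margin parameter are hypothetical and not part of the configuration described above.

```python
def narrowed_range(elbow, k_min, k_max, margin=1):
    """Cluster counts within `margin` of the elbow, clipped to [k_min, k_max].

    The lower bound never drops below 2, because the silhouette score
    is undefined for a single cluster.
    """
    low = max(k_min, 2, elbow - margin)
    high = min(k_max, elbow + margin)
    return list(range(low, high + 1))
```

The resulting bounds could then be used as k-min and k-max in a second run with the silhouette method enabled.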

DBSCAN

HDBSCAN

Birch