• Home
  • how-to-evaluate-clustering-algorithms

How To Evaluate Clustering Algorithms


A step-by-step guide to ensuring your clustering algorithm is efficient, accurate and effective.

The world of Artificial Intelligence has one algorithm after another. Each algorithm or model is designed to cater to the various problems that can either be automated, solved, or enhanced by the utilization of these models. Anyone who has not seen the rise of Machine Learning is not up to date with where the world is headed. The advancement in this particular domain is not only undeniable, but inevitable and necessary.

Amongst the models of ML, there are very useful ones that have been used since the advent of AI. One type of these models are the clustering algorithms. Their purpose is to classify something based on the distance of the new data point in comparison with the rest of the data points in order to identify the group it belongs to. 

Today, our goal is not to delve further into these algorithms and how they work. We are more concerned with how to evaluate these models and whether they are actually performing the way we like or need or not.

To do so, we have a set of metrics that we can employ. Let us have a look at them in detail.

Why Are Cluster Validity/Evaluation Indices Required?

The reasons for us to see if the clustering algorithms are valid or not (how good the clusters are?) can be summed up into the following three:

  • To compare two cluster sets.
  • To compare two clusters and determine which one is more compact and linked.
  • To see if there is any random structure in the data owing to noise.

Classification Of Cluster Validity Measures

Cluster validity measures are often divided into three categories:

1) Internal Cluster Validation: The clustering result is assessed solely on the basis of the data clustered (internal information), with no reference to external data.

2) External Cluster Validation: Clustering results are assessed using an externally known outcome, such as class labels provided by the user.

3) Relative Cluster Validation: For the same algorithm, the clustering results are validated by modifying different parameters (e.g., changing the number of clusters).

We also need to know about the inter-cluster distance d(a, b) between two clusters a, b, and the intra-cluster index D(a) of cluster a, in addition to the term cluster validity index.

There are a few categories that inter-cluster distance d(a, b) between two clusters a and b can be put into:

  • Single Linkage Distance: This is the closest distance between two objects belonging to a and b. 
  • Complete Linkage Distance: This is the distance between the two most remote objects belonging to a and b. 
  • Average Linkage Distance: This is the average distance between all the objects belonging to a and b. 
  • Centroid linkage distance: This is the distance between the centroid of the two clusters a and b. 

Intra-cluster distance D(a) of a cluster a can be:

  • Complete Diameter Linkage Distance: This is the distance between two farthest objects belonging to cluster a. 
  • Average Diameter Linkage Distance: This is the average distance between all the objects belonging to cluster a.  
  • Centroid Diameter Linkage Distance: This is twice the average distance between all the objects and the centroid of the cluster a.


Metrics For Evaluation of Clustering Algorithms

One of the keys to understand here is that these evaluation metrics are specifically for clustering algorithms and not others. It aims to give detail about how well clustering algorithms are performing only. 

Clustering algorithms are of the unsupervised learning type. Therefore, we cannot use the usual confusion matrix, accuracy, and other metrics of that kind. This is because there is no target variable (actual value) that we can compare with the predicted variable (model's output value). 

Henceforth, some common examples of clustering algorithms include products with similar characteristics, finding customers with similar behavior patterns, and other tasks where the goal is to find groups with distinct features in a dataset.

The metrics or techniques for the evaluation of clustering algorithms are:


The following formula calculates the DB Index:

where n stands for the number of clusters, while σi stands for the average distance of all points inside cluster i from the cluster centroid ci.

The DB index encapsulates the idea that clusters that are: 

1) Well-spaced from one another

2) Dense themselves are likely to be 'excellent' clusters.

The reason is that the measure's 'max' statement selects the values where the average point is farthest from its centroid and the centroids are closest together, which occurs repeatedly. Clustering is considered 'better' when the DB index shrinks.

So to sum it up, this index/score/metric measures the similarity of the clusters. This means that the lower the score, the better the separation there is between your clusters. 

It can be calculated/implemented using scikit-learn in the following way:



The following formula follows the Dunn-Index:

where i, j, and k are indices that stand for clusters, while d measures the inter-cluster distance, and d’ measures the intra-cluster difference.

The Dunn Index is similar to the DB Index in that it improves when clusters are well-spaced and dense. However, as performance improves, the Dunn Index rises.

What's different is how this issue is approached. The Dunn Index only analyses the worst instances in clustering. It, therefore, analyses the clusters that are closest together and the single most scattered cluster. The DB index considers the dispersion and separation of all clusters. The change in objective may cause unanticipated problems depending on your application.

It's entirely up to you to choose between the two approaches, whether it be the DB Index or Dunn Index.

It can be implemented/calculated using the jqmcvi library using the following python code:

Silhouette Coefficient

The Silhouette Coefficient is calculated as follows:

where a(i) is the average distance between point i and all other points in its cluster, and b(i) is the shortest average distance between point i and all other points in any other cluster. To clarify, b(i) is calculated by measuring average distance of i from every point in cluster A, and then measuring the average distance of i from every point in cluster B, then taking the smaller of the two values.

Silhouette analysis is a technique for interpreting and validating consistency within data clusters. When compared to other clusters (separation), the silhouette value is a measure of how similar an object is to its own cluster (cohesion). It can be used to figure out how far apart the generated clusters are. The silhouette plot shows how close each point in one cluster is to points in neighboring clusters, and so it allows you to examine factors like cluster count visually.

The Silhouette Coefficient indicates how well each individual point is assigned. If S(i) is near to 0, the inflection point between two clusters is present. We would have definitely been better off assigning it to the other cluster if it was closer to -1. If S(i) is close to 1, the point has been well-assigned and can be considered to be part of a 'suitable' cluster.


So to sum it up, the range of silhouette index value, S(i), will lie between [-1, 1]:

  1. If the silhouette value is close to 1, the sample is well-clustered and already assigned to a very appropriate cluster.
  2. If the silhouette value is about 0, the sample could be assigned to a cluster closest to it, and therefore the sample lies equally far away from both the clusters. That means it indicates overlapping clusters.
  3. If the silhouette value is close to –1, the sample is misclassified and is merely placed somewhere in between the clusters.


The Silhouette Coefficient is an innovative and straightforward distance measurement. Its drawback is that computing on all n points can be pretty expensive. This is due to the fact that for each i, we must compute the distance of i from every other n — 1 point, resulting in a complexity of O(n2).

Many practitioners may scoff at that cautious assessment and shrug it off, claiming that it isn't as good as NP. They are not quite right on this as this time complexity can become unmanageable for huge datasets.

A way to implement/calculate the Silhouette Coefficient using scikit-learn in python is:

Calinski-Harabaz Index

When ground truth labels are unknown, the Calinski-Harabasz (CH) Index (developed by Calinski and Harabasz) can be used to evaluate the model. The CH Index (also known as the Variance Ratio Criterion) is a metric that compares how similar an object is to its own cluster (cohesion) to other clusters (separation). The distances between data points in a cluster and the cluster centroid are used to assess cohesiveness. In contrast, the distance between cluster centroids and the global centroid is used to estimate separation. (a. Separation)/(b. Cohesion) is the formula for the CH index, where a and b are the weights.


The CH index for K number of clusters on a dataset D =[ d1 , d2 , d3 , … dN ] is defined as:


where, nk  and ck  are the no. of points and centroid of the kth cluster, respectively, c is the global centroid, N is the total no. of data points.

Although there is no "good" cut-off number for the CH index, a higher value indicates that the clusters are dense and well divided. On the line plot of CH indices, we must choose the solution that produces a peak or, at the very least, an abrupt elbow. If the line is smooth (horizontal, ascending, or descending), however, there is no reason to choose one option over another.

An implementation of the CH Index in python is as shown below:



One of the most significant drawbacks of any clustering technique is the difficulty in evaluating its effectiveness. Therefore, another metric created to address this issue is the V-Measure.

The V-Measure must is calculated using the formula below.:


h is the homogeneity


c is the completeness

Homogeneity: A perfectly homogenous clustering is one in which all data points belong to the same class label. Homogeneity refers to how near the clustering method comes to achieving this perfection.

Completeness: In a totally complete clustering, all data points from the same class are grouped together in the same cluster. The clustering algorithm's completeness describes how close it comes to this perfection.

Trivial Homogeneity: Trivial homogeneity occurs when the number of clusters equals the number of data points, and each data point belongs to just one cluster. When homogeneity is highest, and completeness is lowest, this is the extreme case.

Trivial Completeness: When all of the data points are clustered into one cluster, this is known as trivial completeness. When homogeneity is at a minimum and completeness is at a maximum, this is the extreme case.

Note: Homogeneity differs from completeness. When we talk about homogeneity, the basic idea is the respective cluster, and we verify whether each data point in each cluster has the same class label. When we talk about completeness, we're talking about the relevant class label, and we're looking to see if the data points for each class label are in the same cluster.

Let us assume that there are N data samples, C different class labels, K clusters, and the number of data points belonging to the class c and cluster k. Then the homogeneity h is given by the following:-

The implementation of V-Measure in python is:

We can summarize the above in a very successful fashion by understanding what algorithm works best in what situations. One method of doing so can be by seeing various implementations on the internet of the clustering algorithms. Although these implementations may seem complicated, they are pretty simple when you really get down to it. The key here is that these models are of the unsupervised learning type, and that is what causes the difference in their implementations, what they are assessing, and what is desired of them. Therefore, all these things must be considered before choosing an evaluation metric or technique to see if our clustering algorithm or model is performing up to the expectations that we desire.


Ref: https://www.geeksforgeeks.org