Hierarchical Clustering in Python

9 min read

By Vibhu Singh

With the abundance of raw data and the need for analysis, the concept of unsupervised learning became popular over time. The main goal of unsupervised learning is to discover hidden and exciting patterns in unlabeled data. The most common unsupervised learning algorithm is clustering. Applications for cluster analysis ranges from medical to face recognition to stock market analysis. In this blog, we talk about Heirarchical Clustering.  

This article is structured as follows.

What is Hierarchical Clustering?

Hierarchical clustering is a type of unsupervised learning that groups similar data points or objects into groups called clusters. The question that comes in your mind is what are clusters and unsupervised learning. Let’s try to find this.

Difference between Clustering & Classification

Both classification and clustering try to group the data points into one or more classes based on the similarity of various features. The difference lies in the way both works. Classification is a supervised algorithm, where there are predefined labels (yi) assigned to each input data point (Xi).

Whereas, clustering is an unsupervised algorithm where labels are missing meaning the dataset contains only input data points (Xi).

The other major difference is since classification techniques have labels, there is a need for training and test dataset to verify the model. In clustering, there are no labels so there is no need for training and test dataset.

Popular examples of classification algorithms are:

  1. Logistic Regression
  2. Support Vector Classifier
  3. Naive Bayes
  4. Decision Trees
  5. Random Forest
  6. Neural Networks

Examples of clustering algorithms are:

  1. Hierarchical clustering
  2. K-Means Clustering
  3. Mean Shift Clustering
  4. Spectral Clustering

In this article, we will deep dive into the details of the Hierarchical clustering.

Difference between K-Means & Hierarchical Clustering

The answer to why we need Hierarchical clustering lies in the process of K-means clustering.

We will understand the K-means clustering in a layman's language.

Consider this unlabeled data for our problem. Our task is to group the unlabeled data into clusters using K-means clustering.

Step 1

The first step is to decide the number of clusters (k). Let’s say we have decided to divide the data into two clusters.

Step 2

Once the clusters are decided, we randomly initialize two points, called the cluster centroids.

Step 3

In the third step, the algorithm goes to each of the data points and divides the points into respective classes, depending on whether it is closer to the red cluster centroid or green cluster centroid.

Step 4

In the fourth step, we move the centroid step. We compute the mean of all the red points and move the red cluster centroid there and do the same for the green cluster.

We will do steps 3 and 4  till the cluster centroid will not move any further. That is in this example, the colours of the point will not change any further.

The K-means process looks good, right?

Yes, but there is one problem or we can say limitation of this process. At the beginning of the algorithm, we need to decide the number of clusters. But we don’t know how many clusters we need in the start.

Hierarchical clustering bridges this gap. In hierarchical clustering, we don’t need to define the number of clusters at the beginning. Let’s see how it works.

Types of Hierarchical Clustering

There are two types of hierarchical clustering:

  • Agglomerative hierarchical clustering
  • Divisive hierarchical clustering

Agglomerative Hierarchical Clustering

The Agglomerative Hierarchical Clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It's a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

How does Agglomerative Hierarchical Clustering work?

Suppose you have data points which you want to group in similar clusters.

Step 1: The first step is to consider each data point to be a cluster.

Step 2: Identify the two clusters that are similar and make them one cluster.

Step 3: Repeat the process until only single clusters remains

How to identify if two clusters are similar?

One of the ways to do so is to find the distance between clusters.

Measure of distance (similarity)

The distance between two clusters can be computed based on the length of the straight line drawn from one cluster to another. This is commonly known as the Euclidean distance.

Euclidean distance: The Euclidean distance between two points in either the plane or 3-dimensional space measures the length of a segment connecting the two points. It is the most obvious way of representing distance between two points.

If the (x1,y1) and (x2,y2) are points in 2-dimensional space then the euclidean distance between is

(x2-x1)2 - (y2-y1)2

Other than Euclidean distance, several other metrics have been developed to measure distance such as:

  • Hamming Distance
  • Manhattan Distance (Taxicab or City Block)
  • Minkowski Distance

The choice of distance metrics should be based on the field of study or the problem that you are trying to solve.

For example, if you are trying to measure distance between objects on a uniform grid, like a chessboard or city blocks. Then Manhattan distance would be an apt choice.

Linkage Criterion

After selecting a distance metric, it is necessary to determine from where distance is computed. Some of the common linkage methods are:

Single-Linkage: Single linkage or nearest linkage is the shortest distance between a pair of observations in two clusters.  

Complete-linkage: Complete linkage or farthest linkage is the farthest distance between a pair of observations in two clusters.

Average-linkage: Average linkage is the distance between each observation in one cluster to every observation in the other cluster.

Centroid-linkage: Centroid linkage is the distance between the centroids of two clusters. In this, you need to find the centroid of two clusters and then calculates the distance between them before merging.

Ward’s-linkage: Ward’s method or minimum variance method or Ward’s minimum variance clustering method calculates the distance between two clusters as the increase in the error sum of squares after merging two clusters into a single cluster. This method seeks to choose the successive clustering steps so as to minimize the increase in sum of squares error at each step.

The choice of linkage criterion is based on the domain application. Average-linkage and complete-linkage are the two most popular distance metrics in hierarchical clustering. However, when there are no clear theoretical justifications for the choice of linkage criteria, Ward’s method is the default option.

How to choose the number of clusters?

To choose the number of clusters in hierarchical clustering, we make use of concept called dendrogram.

What is a Dendrogram?

Dendrogram is a tree like diagram that shows the hierarchical relationship between the observations. It contains the memory of hierarchical clustering algorithms.

Just by looking at the Dendrogram you can tell how the cluster is formed. Let see how to form the dendrogram for the below data points.

The observations E and F are closest to each other by any other points. So, they are combined into one cluster and also the height of the link that joins them together is the smallest. The next observations that are closest to each other are A and B which are combined together.

This can also be observed in the dendrogram as the height of the block between A and B is slightly bigger than E and F. Similarly, D can be merged into E and F clusters and then C can be combined to that. Finally A and B combined to C, D, E and F to form a single cluster.

The important point to note while reading dendrogram is that:

  1. Height of the blocks represents the distance between clusters, and
  2. Distance between observations represents dissimilarities.

But the question still remains the same, how do we find the number of clusters using a dendrogram or where should we stop merging the clusters. Observations are allocated to clusters by drawing a horizontal line through the dendrogram.

Generally, we cut the dendrogram in such a way that it cuts the tallest vertical line. In the above example, we have two clusters. One cluster has observations A and B, and a second cluster has C, D, E, and F.

Divisive Hierarchical Clustering

Divisive hierarchical clustering is not used much in solving real-world problems. It works in the opposite way of agglomerative clustering. In this, we start with all the data points as a single cluster.

At each iteration, we separate the farthest points or clusters which are not similar until each data point is considered as an individual cluster. Here we are dividing the single clusters into n clusters, therefore the name divisive clustering.

Hierarchical Clustering in Python

To demonstrate the application of hierarchical clustering in Python, we will use the Iris dataset. Iris dataset is one of the most common datasets that is used in machine learning for illustration purposes.

The Iris data has three types of Iris flowers which are three classes in the dependent variable. And it contains four independent variables which are sepal length, sepal width, petal length and petal width, all in cm. We will compare the original classes with the classes formed using hierarchical clustering method.

Import data

We will import the dataset from the sklearn library.

Iris Datset

Visualise the classes

The above scatter plot shows that all three classes of Iris flowers are overlapping with each other. Our task is to form the cluster using hierarchical clustering and compare them with the original classes.

Create a dendrogram

We start by importing the library that will help to create dendrograms. Dendrogram helps to give a rough idea of the number of clusters.

By looking at the above dendrogram, we divide the data into three clusters.

Fit the model

We instantiate the AgglomerativeClustering. Pass euclidean distance as the measure of the distance between points and ward linkage to calculate clusters' proximity. Then we fit the model on our data points. Finally, we return an array of integers where the values correspond to the distinct categories using lables_ property.

The above output shows the values of 0s, 1s, 2s, since we defined 4 clusters. 0 represents the points that belong to the first cluster and 1 represents points in the second cluster. Similarly, third represents points in the third cluster.

Visualise the cluster

There is still an overlap between Type 1 and Type 3 clusters. But if you compare with the original clusters, the classification has improved quite a bit.

Pros and Cons of Hierarchical Clustering

1. Like K-means clustering, we need not to specify the number of clusters required for the algorithm.

2. It doesn’t work well on the large dataset. It is generally applicable to the smaller data. If you have a large dataset, it can become difficult to determine the correct number of clusters by the dendrogram.

3. In comparison to K-means, hierarchical clustering is computationally heavy and takes longer time to run.


Despite the limitations of hierarchical clustering when it comes to large datasets, it is still a great tool to deal with small to medium dataset and find patterns in them. In this article, we have dealt with the basic concepts of hierarchical clustering, which is a type of unsupervised learning algorithm and its implementation in Python.

If you are interested in learning about supervised and reinforcement learning algorithms and their implementation in the trading domain, you can check out the courses on Quantra.

Disclaimer: The information in this project is true and complete to the best of our Student’s knowledge. All recommendations are made without guarantee on the part of the student or QuantInsti®. The student and QuantInsti® disclaim any liability in connection with the use of this information. All content provided in this project is for informational purposes only and we do not guarantee that by using the guidance you will derive a certain profit.

Learning Track: Machine Learning & Deep Learning in Financial Markets