K-Means Clustering For Pair Selection In Python - Part I


By Lamarcus Coleman

In this series, we will cover what K-Means clustering is, how it can be used for solving the age-old problem of pair selection for Statistical Arbitrage, and the advantage of using K-Means for pair selection compared to using a brute force method. We will also create a Statistical Arbitrage strategy using K-Means for pair selection and implement the elbow technique to determine the value of K.

Let’s get started!

Part I – Life Without K-Means

To understand why we might want to use K-Means for pair selection, we will first attempt to implement a Statistical Arbitrage strategy as if there were no K-Means. This means that we will develop a brute force solution to our pair selection problem and then apply that solution within our Statistical Arbitrage strategy.

Let's take a moment to think about why K-Means could be used for trading. What's the benefit of using K-Means to form subgroups of possible pairs? I mean couldn't we just come up with the pairs ourselves?

This is a great question and one you have undoubtedly wondered about. To better understand the strength of using a technique like K-Means for Statistical Arbitrage, we'll walk through trading a Statistical Arbitrage strategy as if there were no K-Means. I'll be your ghost of trading past, so to speak.

First, let's identify the key components of any Statistical Arbitrage trading strategy.

  1. We must identify assets that have a tradable relationship
  2. We must calculate the Z-Score of the spread of these assets, as well as the hedge ratio for position sizing
  3. We generate buy and sell decisions when the Z-Score exceeds some upper or lower bound
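
The three components above can be sketched on synthetic data before we touch any real prices. This is only a minimal illustration: the series, the OLS hedge ratio via np.polyfit, and the +/-2 Z-Score bounds are all assumptions made for the example, not values taken from the strategy in this series.

```python
import numpy as np

def hedge_ratio(y, x):
    """OLS slope of y regressed on x (with an intercept)."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

def zscore(series):
    """Standardize a series to mean 0 and standard deviation 1."""
    return (series - series.mean()) / series.std()

def signals(z, upper=2.0, lower=-2.0):
    """-1 = short the spread, +1 = long the spread, 0 = no position.
    The bounds are illustrative assumptions."""
    return np.where(z > upper, -1, np.where(z < lower, 1, 0))

# Synthetic pair: x is a random walk, y is tied to x with a true ratio of 1.5
rng = np.random.default_rng(42)
x = 50 + np.cumsum(rng.normal(0, 1, 750))
y = 1.5 * x + rng.normal(0, 1, 750)

beta = hedge_ratio(y, x)      # step 2: hedge ratio for position sizing
spread = y - beta * x         # spread of the pair
z = zscore(spread)            # step 2: Z-Score of the spread
sig = signals(z)              # step 3: buy/sell decisions at the bounds

print(f"estimated hedge ratio: {beta:.3f}")
print(f"days with an open signal: {int(np.count_nonzero(sig))}")
```

Step 1, finding a pair whose spread is actually stable, is the hard part, and it is what the rest of this post is about.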

To begin, we need some pairs to trade. But we can't trade Statistical Arbitrage without knowing whether or not the pairs we select are cointegrated. Cointegration means that some linear combination of the two price series, such as their spread, is stationary: its mean and variance stay roughly stable over time. Even if the two assets each move randomly, we can count on the relationship between them to hold, at least most of the time.

Traditionally, in a world with no K-Means, we had to solve the problem of pair selection by brute force, or trial and error. This was usually done by grouping together stocks that were merely in the same sector or industry. The idea was that if these stocks belonged to companies in similar industries, and thus had similar operations, their prices should move similarly as well. But, as we shall see, this is not necessarily the case.
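
To get a feel for how quickly brute force grows, here is a small sketch that counts candidate pairs. The four tickers are simply the ones we look at in this post, and 500 is a round stand-in for the size of the index.

```python
from itertools import combinations
from math import comb

# The handful of tickers examined by hand in this post
tickers = ['WMT', 'TGT', 'DLTR', 'DG']
candidate_pairs = list(combinations(tickers, 2))
print(len(candidate_pairs))   # 6 pairs from just 4 tickers

# Every possible pair in a universe of 500 stocks
n_pairs = comb(500, 2)
print(n_pairs)                # 124750
```

Testing each of those pairs by hand is clearly not practical, which is why sector-based guessing was the usual shortcut.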

The first step is to think of some pairs of stocks that should yield a tradable relationship. We'll use stocks in the S&P 500, but this process could be applied to any stocks within any index. Hmm, how about Walmart and Target? They are both retailers and direct competitors. Surely they should be cointegrated and thus allow us to trade them in a Statistical Arbitrage strategy.

Let's begin by importing the necessary libraries as well as the data that we will need. We will use 2014-2016 as our analysis period.

#importing necessary libraries
#data analysis/manipulation
import numpy as np
import pandas as pd
#importing pandas datareader to get our data
import pandas_datareader as pdr
#plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
#importing the Augmented Dickey Fuller Test to check for cointegration
from statsmodels.tsa.api import adfuller

Now that we have our libraries, let's get our data.

#setting start and end dates
#importing Walmart and Target using pandas datareader
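
The code for this step is not shown above, so here is a minimal sketch of what it could look like. The exact dates are an assumption based on the stated 2014-2016 analysis period, and load_prices is a hypothetical helper that wraps the same pdr.get_data_yahoo call used later in this post.

```python
import datetime

# Assumed window: the post only says "2014-2016", so these exact dates are a guess
start = datetime.datetime(2014, 1, 1)
end = datetime.datetime(2016, 12, 31)

def load_prices(ticker, start=start, end=end):
    """Hypothetical helper: download daily prices for one ticker.

    pandas_datareader is imported inside the function so that defining
    the helper does not require a network connection or a working
    Yahoo endpoint.
    """
    import pandas_datareader as pdr
    return pdr.get_data_yahoo(ticker, start, end)
```

With a working data source, wmt = load_prices('WMT') and tgt = load_prices('TGT') would give us the two dataframes used in the rest of this post.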

Before testing our two stocks for cointegration, let's take a look at their performance over the period. We'll create a plot of Walmart and Target.

#Creating a figure to plot on
plt.figure()
#Creating WMT and TGT plots
plt.plot(wmt['Close'],label='Walmart')
plt.plot(tgt['Close'],label='Target')
plt.title('Walmart and Target Over 2014-2016')
plt.legend(loc=0)
plt.show()

Walmart and Target graph

In the above plot, we can see a slight correlation at the beginning of 2014. But this doesn't really give us a clear idea of the relationship between Walmart and Target. To get a clearer idea of the relationship between the two stocks, we'll create a correlation heatmap.

To begin creating our correlation heatmap, we must first place the Walmart and Target prices in the same dataframe. Let's create a new dataframe for our stocks.

#initializing newDF as a pandas dataframe
newDF=pd.DataFrame()
#adding WMT closing prices as a column to the newDF
newDF['WMT']=wmt['Close']
#adding TGT closing prices as a column to the newDF
newDF['TGT']=tgt['Close']

Now that we have created a new dataframe to hold our Walmart and Target stock prices, let's take a look at it.

Prices of Stocks

We can see that we have the prices of both our stocks in one place. We are now ready to create a correlation heatmap of our stocks. To do this, we will use Python's Seaborn library. Recall that we imported Seaborn earlier as sns.

#using seaborn as sns to create a correlation heatmap of WMT and TGT
sns.heatmap(newDF.corr())

Correlation heatmap

In the above plot, we called the corr() method on our newDF and passed the result into Seaborn's heatmap function. From this visualization, we can see that our two stocks are not that correlated. Let's create a final visualization to assess this relationship. We'll use a scatter plot for this.

Earlier we plotted our prices with Matplotlib. Now we'll introduce Seaborn's scatter plot method. Note that Seaborn is built on top of Matplotlib, and thus Matplotlib's functionality can be applied to Seaborn.

#Creating a scatter plot using Seaborn

Seaborn's scatter plot

One feature that I like about Seaborn's scatter plot is that it reports the Correlation Coefficient and P-Value. Looking at this pearsonr value, we can see that WMT and TGT were not positively correlated over the period. Now that we have a better understanding of our two stocks, let's check to see if a tradable relationship exists.
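
As far as I know, newer Seaborn releases no longer print the pearsonr annotation on their scatter plots, so here is a minimal sketch of computing the same two numbers directly with scipy.stats.pearsonr. The prices below are synthetic stand-ins, not the real WMT and TGT data, so the printed values say nothing about the actual stocks.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for newDF: two independent random-walk "prices"
rng = np.random.default_rng(0)
prices = pd.DataFrame({
    'WMT': 70 + np.cumsum(rng.normal(0, 1, 500)),
    'TGT': 65 + np.cumsum(rng.normal(0, 1, 500)),
})

# Correlation coefficient and the p-value of the no-correlation hypothesis
r, p_value = pearsonr(prices['WMT'], prices['TGT'])
print(f"pearsonr = {r:.2f}; p = {p_value:.3g}")
```

The coefficient here matches what newDF.corr() reports for the same two columns, which is the number the heatmap colored in.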

We'll use the Augmented Dickey Fuller Test to determine if our stocks can be traded within a Statistical Arbitrage strategy. Recall that we imported the adfuller test from the statsmodels.tsa.api package earlier.

To perform the ADF test, we must first create the spread of our stocks. We'll add this to our existing newDF dataframe.

#adding the spread column to the newDF dataframe
newDF['Spread']=newDF['WMT']-newDF['TGT']
#instantiating the adfuller test
adf=adfuller(newDF['Spread'])

We have now performed the ADF test on our spread and need to determine whether or not our stocks are cointegrated. Let's write some logic to determine the results of our test.

#Logic that states if our test statistic is less than
#a specific critical value, then the pair is cointegrated at that
#level, else the pair is not cointegrated

if adf[0] < adf[4]['1%']:
    print('Spread is Cointegrated at 1% Significance Level')
elif adf[0] < adf[4]['5%']:
    print('Spread is Cointegrated at 5% Significance Level')
elif adf[0] < adf[4]['10%']:
    print('Spread is Cointegrated at 10% Significance Level')
else:
    print('Spread is not Cointegrated')

Spread is not Cointegrated

The results of the Augmented Dickey Fuller Test showed that Walmart and Target were not cointegrated. We know this because the test statistic is not less than any of the critical values. If you would like to view the full output of the ADF test, you can do so by evaluating adf. In the above example, we use indexing to pick out the test statistic (adf[0]) and the critical values (adf[4]). The statsmodels ADF test provides you with other useful information, such as the p-value (adf[1]). You can learn more about the ADF test in the statsmodels documentation for adfuller.

#printing out the results of the adf test
{'1%': -3.4434175660489905,
'10%': -2.5698395516760275,
'5%': -2.8673031724657454},

Okay, let's try one more. Maybe we'll have better luck identifying a tradable relationship in a brute force manner. How about Dollar Tree and Dollar General? They're both discount retailers and, look, they both even have a dollar in their names. Since we've gotten the hang of things, we'll jump right into the ADF test.

Let's first import the data for DLTR and DG.

#importing dltr and dg
dltr=pdr.get_data_yahoo('DLTR',start, end)
dg=pdr.get_data_yahoo('DG',start, end)

Now that we've gotten our data, let's add these stocks to our newDF and create their spread.

#adding dltr and dg to our newDF dataframe
newDF['DLTR']=dltr['Close']
newDF['DG']=dg['Close']
#creating the dltr and dg spread as a column in our newDF dataframe
newDF['Spread_2']=newDF['DLTR']-newDF['DG']

We've now added the DLTR and DG stocks as well as their spread to our newDF dataframe. Let's take a quick look at our dataframe.

Spread_2 Chart

Now that we have Spread_2, the spread of DLTR and DG, we can create adf2, a second ADF test for these two stocks.

#Creating another adfuller instance
adf2=adfuller(newDF['Spread_2'])

We've just run the ADF test on our DLTR and DG spread. We can now repeat our earlier logic to determine if the spread yields a tradable relationship.

if adf2[0] < adf2[4]['1%']:
    print('Spread is Cointegrated at 1% Significance Level')
elif adf2[0] < adf2[4]['5%']:
    print('Spread is Cointegrated at 5% Significance Level')
elif adf2[0] < adf2[4]['10%']:
    print('Spread is Cointegrated at 10% Significance Level')
else:
    print('Spread is not Cointegrated')

Spread is not Cointegrated

To view the complete print out of the second ADF test, we can evaluate adf2.
{'1%': -3.4434437319767452,
'10%': -2.5698456884811351,
'5%': -2.8673146875484368},
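
Since we have now written the same threshold logic twice, it could be wrapped in a small helper. The function name below is hypothetical, and the example input is a made-up test statistic paired with critical values rounded from the print out above; it is not the actual result for either pair.

```python
def cointegration_message(adf_result):
    """Report the strictest level at which the ADF test statistic
    falls below the critical value.

    adf_result is a tuple laid out like the adfuller output:
    index 0 is the test statistic, index 4 the critical values.
    """
    stat, crit = adf_result[0], adf_result[4]
    for level in ('1%', '5%', '10%'):
        if stat < crit[level]:
            return f'Spread is Cointegrated at {level} Significance Level'
    return 'Spread is not Cointegrated'

# Made-up statistic with rounded critical values, for illustration only
example_crit = {'1%': -3.44, '5%': -2.87, '10%': -2.57}
example_result = (-3.0, None, None, None, example_crit, None)
print(cointegration_message(example_result))
```

With this in place, each new candidate pair needs only one call, for example print(cointegration_message(adf2)).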

To recap, in this post we began our journey toward understanding the efficacy of K-Means for pair selection and Statistical Arbitrage by attempting to develop a Statistical Arbitrage strategy in a world with no K-Means.

We learned that in a Statistical Arbitrage trading world without K-Means, we are left to our own devices for solving the historic problem of pair selection. We also learned that two stocks being related on a fundamental level doesn't necessarily imply that they will provide a tradable relationship.

In Part II of this series, we will get a better understanding of what K-Means is and then prepare to apply it to our own Statistical Arbitrage strategy.


We have noticed that some users are facing challenges while downloading the market data from Yahoo and Google Finance platforms. In case you are looking for an alternative source for market data, you can use Quandl for the same. 

Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.

