K-Means For Pair Selection In Python Part IV



By Lamarcus Coleman

We've covered a lot thus far, and we're almost to the point we've been waiting for: coding our strategy.

Part IV: K-Means StatArb Implementation

We're going to create a class that will allow us to clean our data, test for cointegration, and run our strategy simply by calling methods on our Statistical Arbitrage object.

[Images: Create price spread; Check for cointegration; Generate signals; Create returns]

Let's briefly walk through what the above code does. We begin by creating an instance of our statarb class. We pass in the dataframes of the two stocks that we want to trade, along with parameters for the moving average, the floor (buy level) and ceiling (sell level) for the z-score, the beta lookback period for our hedge ratio, and the exit level for the z-score. By default, the exit level is set to 0.
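To make those parameters concrete, here is a minimal sketch of how a rolling hedge ratio, spread, and z-score could be computed. It is not the original class from this series. The function name `create_spread`, the column names, the default window sizes, and the choice of rolling covariance over rolling variance for the hedge ratio are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def create_spread(x, y, ma=20, beta_lookback=5):
    """Return a dataframe holding the hedge ratio, spread, and z-score.

    x, y: closing-price Series that share the same index (assumed inputs).
    ma: window for the rolling mean and standard deviation of the spread.
    beta_lookback: window for the rolling hedge ratio.
    """
    df = pd.DataFrame({"X": x, "Y": y})
    # Rolling hedge ratio: cov(Y, X) / var(X) over the lookback window.
    rolling_cov = df["Y"].rolling(beta_lookback).cov(df["X"])
    rolling_var = df["X"].rolling(beta_lookback).var()
    df["beta"] = rolling_cov / rolling_var
    # The spread is what remains of Y after hedging with beta units of X.
    df["spread"] = df["Y"] - df["beta"] * df["X"]
    # Z-score of the spread relative to its own rolling mean and deviation.
    window = df["spread"].rolling(ma)
    df["zscore"] = (df["spread"] - window.mean()) / window.std()
    return df


# Synthetic prices, for illustration only.
rng = np.random.default_rng(0)
x = pd.Series(50 + rng.normal(0, 1, 300).cumsum())
y = 2.0 * x + rng.normal(0, 1, 300)
spread_df = create_spread(x, y)
print(spread_df.tail())
```

The first rows of `beta`, `spread`, and `zscore` are NaN because the rolling windows need a full lookback before they produce a value.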

Once we've created our statarb object, we can access its methods. We begin by cleaning our data within our create spread method, as we did earlier. Once our data has been cleaned, we can call our check for cointegration method. This method takes the spread created in our create spread method, passes it into the ADF test, and returns whether or not the two stocks are cointegrated. If our stocks are cointegrated, we can call our generate signals method, which implements the Statistical Arbitrage strategy. From here we can call our create returns method. This method requires that we pass in an allocation amount. This is the starting value of our hypothetical portfolio and is used to create our equity curve. After our equity curve is created, we have access to other data such as our hit ratio, wins and losses, and our Sharpe ratio.

To implement the strategy, we must keep track of our current position while iterating over the dataframe to check for signals. For a good walkthrough of how to implement the Statistical Arbitrage strategy, see the article on Quantstart.com, where Michael does a great job providing insight into each step of the implementation.
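Tracking the current position while iterating can be sketched as a small state machine over the z-score. This is a simplified illustration, not the original implementation. The function name `generate_signals` and the floor and ceiling defaults of -2 and 2 are assumptions; only the exit level of 0 comes from the description above.

```python
import pandas as pd


def generate_signals(zscore, floor=-2.0, ceiling=2.0, exit_level=0.0):
    """Return positions in the spread: 1 is long, -1 is short, 0 is flat."""
    positions = []
    current = 0
    for z in zscore:
        if pd.isna(z):
            pass  # no information yet, so keep the current position
        elif current == 0:
            if z <= floor:
                current = 1  # spread is cheap: go long the spread
            elif z >= ceiling:
                current = -1  # spread is rich: go short the spread
        elif current == 1 and z >= exit_level:
            current = 0  # long position has reverted to the exit level
        elif current == -1 and z <= exit_level:
            current = 0  # short position has reverted to the exit level
        positions.append(current)
    return pd.Series(positions, index=zscore.index, name="position")


z = pd.Series([0.5, -2.3, -1.0, 0.1, 1.0, 2.4, 1.2, -0.2, 0.3])
print(generate_signals(z).tolist())  # [0, 1, 1, 0, 0, -1, -1, 0, 0]
```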

We'll set K equal to 100. Let's begin by instantiating our K-Means object.

[Image: Create an instance]

Now let's fit our model on the features of our data.

[Images: Fitting the model; K-Means algorithm]

Now that we have fitted our model to our features, we can get our clusters and add them to our stockDataCopy dataframe. Also, note that we called the fillna() method on our features. We did this because some of our features had NaN values, and these can't be passed into our model. We didn't call dropna() to drop these values because that would have changed the length of our features and caused problems when adding the clusters back to our dataframe.
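The steps above can be sketched with scikit-learn's `KMeans`. The table below is made-up toy data, the fill value of 0 is an assumption (the fill value is not stated here), and K is set to 2 only because the toy table is tiny, whereas we set K to 100 in the walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy fundamentals table (values made up for illustration).
stock_data = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "PE": [10.0, 11.0, np.nan, 40.0, 42.0, 41.0],
    "DividendYield": [3.0, 3.2, 2.9, 0.5, np.nan, 0.4],
})
raw_features = stock_data[["PE", "DividendYield"]]

# fillna keeps one row per symbol; dropna would shorten the features,
# so the cluster labels would no longer line up with the dataframe.
features = raw_features.fillna(0)
print(len(features), len(raw_features.dropna()))  # 6 4

# Instantiate and fit the model, then attach one label per symbol.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(features)
stock_data["Clusters"] = k_means.labels_
print(stock_data)
```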

Let's take a look at our clusters.

[Image: Strategy means array]

Okay now that we have our clusters, we can add them back to our data. Let's first take another look at our dataframe.

[Image: Head of the stockDataCopy dataframe]

Let's add a column for our clusters to our dataframe.


Let's review our dataframe once more.

[Images: View ten rows; stockDataCopy dataframe]

Now that we have our clusters in our dataframe, our next task is to identify tradable pairs within our clusters.

Let's take a look at how many stocks were placed in each cluster. To do this we will import the Counter class from the collections module.

[Images: Import Counter; Create variable]

Now that we have created a variable to count the number of symbols in each cluster, we can print it.
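As a small illustration of the counting step, here is a sketch using `collections.Counter`. The label list is hypothetical; in our analysis it would be the Clusters column of the dataframe.

```python
from collections import Counter

# Hypothetical cluster labels, one per symbol.
clusters = [0, 0, 1, 2, 2, 2, 3]

# Count how many symbols landed in each cluster.
cluster_counts = Counter(clusters)
print(cluster_counts)

# Clusters with a single symbol cannot form a pair.
singletons = [label for label, count in cluster_counts.items() if count == 1]
print(singletons)  # [1, 3]
```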

[Images: Print cluster; Counter chart]

Notice above, when looking at how many symbols are in each of our clusters, that some clusters contain only 1 symbol. We would like to eliminate these and view only clusters that have 2 or more symbols. We would also like to order our data by cluster so that we can view every symbol within a respective cluster.

To achieve this, we will create a new dataframe by using the pandas concat function to concatenate the groups of our stockDataCopy dataframe, grouped by the Clusters column, while filtering out those clusters that contain fewer than 2 symbols.
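One way to express that operation is sketched below on a toy table. The variable name `cluster_pairs` and the data are assumptions, and the original code may have been written differently.

```python
import pandas as pd

# Toy dataframe with one cluster label per symbol (made-up values).
stock_data = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Clusters": [1, 0, 1, 2, 0],
})

# Group by cluster, keep only groups with at least two symbols,
# and stack the surviving groups back together in cluster order.
cluster_pairs = pd.concat(
    [group for _, group in stock_data.groupby("Clusters") if len(group) >= 2]
)
print(cluster_pairs)
```

The index of the result still holds each symbol's original position in the dataframe, which makes it easy to trace a row back to its source.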

[Images: Cluster pairs; Cluster pairs chart]

We can save our concatenation to a variable so that we can perform further operations on it.

[Images: Test 1; Test 2; Test chart]

The above dataframe shows us five symbols in Cluster 0. We can see that the index represents their original position in our stock dataframe. We’re also able to make a brief comparison of their features.

We can now scroll or iterate through our dataframe and see which symbols are in each cluster, with every cluster holding at least two symbols. Let's use our statarb class to test a pair of symbols for cointegration and develop a Statistical Arbitrage strategy.

We will begin by creating an instance of our statarb object. We will randomly select two stocks from cluster 0 for this analysis. We must first import the data for each symbol over our testing period.

[Image: Importing stocks]

Now that we have imported our data, let's take a quick look at it.

[Images: Plotting the data; Data chart]

Okay, now let's use our statarb class to test the symbols for cointegration and build a strategy.


Okay, in the above line of code, we have just created an instance of our statarb strategy class. We can now call our create_spread method to create the spread of our data. We passed in the entire dataframes of bbby and gt because this method will parse the closing prices and create the spread for us. Afterwards, we can call the remaining methods to complete our Statistical Arbitrage analysis.

[Images: Calling create spread; Dataframe spread]

In the above dataframe, we can see that we’ve added our spread. The values at the head are NaN which is reflective of the lookback that we are using. We can scroll down to see that values have been populated for each column.

[Image: Spreading the pair]

Now that we have created the spread of our pair, let's check to see if they are cointegrated.

[Images: Check cointegration; Spread is cointegrated]

Our pair is highly cointegrated. Now that we have confirmed this, we can call our generate signals and create returns methods to see how our strategy would have performed over our testing period.

[Images: Generating signals; Added signals in dataframe]

We can see that we have added our signals to our dataframe.

Now that we have generated signals for our pair, let's use our create returns method to calculate our returns and print our equity curve. Recall that this method takes in an allocation amount. This is the starting value of our portfolio. This method also requires that we pass in a name for our pair as a string to be included in our plot.
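To show how an allocation amount turns into an equity curve, here is a minimal sketch. It assumes we already have a series of positions and a series of daily returns for the pair. The function name `create_returns`, the one-bar delay between signal and trade, and the toy numbers are all assumptions, and the plotting step is left out.

```python
import pandas as pd


def create_returns(positions, pair_returns, allocation):
    """Return strategy returns and the equity curve for a starting allocation."""
    # Act on the bar after the signal so no future information is used.
    strategy_returns = positions.shift(1).fillna(0) * pair_returns
    # Compound the returns on top of the starting portfolio value.
    equity = allocation * (1 + strategy_returns).cumprod()
    return strategy_returns, equity


# Toy inputs (made up for illustration).
positions = pd.Series([0, 1, 1, 0, -1])
pair_returns = pd.Series([0.01, 0.02, -0.01, 0.03, 0.01])
strategy_returns, equity = create_returns(positions, pair_returns, 10_000)

# Wins, losses, and hit ratio over the days that had a non-zero return.
wins = int((strategy_returns > 0).sum())
losses = int((strategy_returns < 0).sum())
hit_ratio = wins / (wins + losses)
print(equity.round(2).tolist(), wins, losses, hit_ratio)
```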

[Images: Create strategy returns; Strategy returns]

We can see that our strategy did well for a while before the pair appeared to lose its cointegration. Let's take a look at our Sharpe Ratio.
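For reference, one common way to compute an annualized Sharpe ratio from daily strategy returns is sketched below. The zero risk-free rate and the 252 trading days per year are assumptions, and the original class may compute it differently.

```python
import numpy as np


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    returns = np.asarray(returns, dtype=float)
    # Mean return per period divided by its sample standard deviation,
    # scaled by the square root of the number of periods in a year.
    return np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)


# Toy daily returns (made up for illustration).
daily_returns = [0.01, -0.005, 0.002, 0.007, -0.003]
annualized = sharpe_ratio(daily_returns)
print(round(annualized, 2))
```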

[Image: Checking Sharpe ratio]

Challenge: See If You Can Improve This Strategy

Try your hand at improving this strategy. Our analysis showed that these two stocks were highly cointegrated. However, after the strategy performed well for the majority of our testing period, the pair appears to have lost cointegration. This shows a couple of things. First, it shows that Statistical Arbitrage is not a riskless trading strategy. Second, it underscores the importance of the parameters used when trading. These are what are truly proprietary.


Wow! We have covered an immense amount of information in a short time. To recap, we began by gaining an understanding of K-Means. We created our own toy data in which we initialized our own clusters. We then applied K-Means to our toy data to see if it would be able to identify the clusters that we created.

Next, we took a walk through a Statistical Arbitrage world without K-Means. We brute forced the creation of a couple of pairs and learned that identifying tradeable relationships involved a little more than finding pairs in the same sector. We then used real stock data from the S&P 500, namely Dividend Yields, P/E, MarketCap, EPS and EBITDA, as features to begin creating a real-world K-Means analysis for trading Statistical Arbitrage.

We then added our clusters to our dataframe and manipulated it so that we could test the pairs in each cluster for cointegration via the ADF test. We randomly selected BBBY and GT from cluster 0 of our analysis and found that they were cointegrated at the 99% confidence level. Afterwards, we used the statarb class we created to backtest our newfound pair. Whew!

This analysis also showed the strength of K-Means for finding non-traditional pairs for trading Statistical Arbitrage. BBBY is the ticker symbol of Bed Bath and Beyond and GT is the ticker symbol for Goodyear Tire & Rubber Co. These two stocks appear to have nothing in common on the surface but have been cointegrated at the 1% critical value in the past.


Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.