Statistics Behind Pair Trading (I): Understanding Correlation and Cointegration


In pair trading, a pair of stocks is usually traded in a market-neutral strategy: it doesn’t matter whether the market is trending upwards or downwards, because the two open positions, one in each stock, hedge against each other. To be able to pair trade, the key challenges are to:

  • Choose a pair which will give you good statistical arbitrage opportunities over time
  • Choose the entry/exit points
In this post, we will discuss in detail how statistics plays a crucial role in the first challenge: deciding which pair to trade. The pair is commonly chosen from the same basket of stocks, for instance Microsoft and Google (technology), ICICI and Axis (banking), or the Nifty index and the MSCI index (market indices). Within each domain, thousands of pairs are possible. The best ones are those selected using mathematical or statistical tests.


Though not common, a few pair trading strategies look at correlation to find a suitable pair to trade. Correlation is a measure of the relationship between two variables, in this case the log returns of the prices of stocks A and B. If the correlation is high, say 0.8, traders may choose that pair. A high value represents a strong relationship between the two stocks: if A goes up, the chances of B going up are also quite high. Based on this assumption, a market-neutral strategy is played in which A is bought and B is sold; the buy and sell decisions are based on their individual patterns.
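As a minimal sketch of the correlation measure described above, the snippet below computes log returns for two short price series and their Pearson correlation. The prices are made up for illustration, and the helper names `log_returns` and `correlation` are hypothetical, not taken from any particular library.

```python
import math

def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up closing prices for stocks A and B (illustration only).
prices_a = [100.0, 101.0, 100.5, 102.0, 103.0, 102.5]
prices_b = [50.0, 50.6, 50.2, 51.1, 51.4, 51.3]

rho = correlation(log_returns(prices_a), log_returns(prices_b))
print(f"correlation of log returns: {rho:.3f}")
```

With real data you would use a much longer price history; six observations are far too few to draw any conclusion.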

Looking at correlation alone might give you spurious results. For instance, if your strategy is based on the spread between the prices of the two stocks, it is possible that both prices keep on increasing while the spread between them never reverts to its mean.

Spread = log(a) – n·log(b), where ‘a’ and ‘b’ are the prices of stocks A and B respectively. For each share of A bought, you have sold n shares of B.

Now suppose both ‘a’ and ‘b’ increase in such a way that the value of the spread decreases. This results in a loss, since stock A, which you hold long, is rising at a lower rate than stock B, which you are short.
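A small worked example may make the loss concrete. The numbers below are hypothetical: the hedge ratio n is assumed to be 1, both stocks start at 100, A rises 10% and B rises 30%.

```python
import math

n = 1                            # assumed hedge ratio (hypothetical)
a_start, a_end = 100.0, 110.0    # stock A rises 10%
b_start, b_end = 100.0, 130.0    # stock B rises 30%

def spread(a, b, n):
    """Spread = log(a) - n * log(b), as defined in the text."""
    return math.log(a) - n * math.log(b)

spread_start = spread(a_start, b_start, n)   # 0.0
spread_end = spread(a_end, b_end, n)         # about -0.167

# Long 1 share of A, short n shares of B.
pnl = (a_end - a_start) - n * (b_end - b_start)   # 10 - 30 = -20

print(f"spread: {spread_start:.3f} -> {spread_end:.3f}, P&L per unit: {pnl:.2f}")
```

Both prices went up, yet the position lost money because the spread kept falling instead of reverting.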


The most common test for pair trading is the cointegration test. Cointegration is a statistical property of two or more time series which indicates that a linear combination of them is stationary. Let us unpack this statement. The two time series in this case are the logs of the prices of stocks A and B. A linear combination of these variables is the linear equation defining the spread:

Spread = log(a) – n·log(b), where ‘a’ and ‘b’ are the prices of stocks A and B respectively. For each share of A bought, you have sold n shares of B.

If A and B are cointegrated, the spread defined above is stationary. A stationary process has very valuable features for modelling pair trading strategies. In particular, if the spread is stationary, its mean and variance remain constant over time. So if we choose ‘n’, which is called the hedge ratio, so that the spread is 0 on average, stationarity implies that the expected value of the spread remains 0. Any deviation from this expected value is a statistical abnormality, and hence a case for trading!
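To illustrate what "mean and variance remain constant over time" looks like, here is a rough sketch that compares a simulated stationary series with a random walk by computing the mean and variance of each half of the sample. The AR(1) model, its parameters and the helper names are assumptions chosen for illustration, not part of any pair trading recipe.

```python
import random
import statistics

random.seed(0)

def ar1(length, phi, sigma=1.0):
    """Simulate x_t = phi * x_(t-1) + noise; phi = 1 gives a random walk."""
    value, path = 0.0, []
    for _ in range(length):
        value = phi * value + random.gauss(0.0, sigma)
        path.append(value)
    return path

def half_stats(series):
    """Return (mean, variance) for the first and the second half of the series."""
    mid = len(series) // 2
    return [(statistics.mean(part), statistics.pvariance(part))
            for part in (series[:mid], series[mid:])]

stationary_spread = ar1(2000, phi=0.5)   # mean-reverting, stationary
random_walk = ar1(2000, phi=1.0)         # non-stationary

for name, series in [("stationary", stationary_spread), ("random walk", random_walk)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name:12s} first half: mean={m1:7.2f} var={v1:8.2f} | "
          f"second half: mean={m2:7.2f} var={v2:8.2f}")
```

For the stationary series the two halves should look alike, while for the random walk the mean and variance typically differ a lot from one half to the other.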

How to choose a pair of stocks for trading?

  1. For any pair of stocks, define the spread as below:
Spread = log(a) – n·log(b), where ‘a’ and ‘b’ are the prices of stocks A and B respectively.

Assumption: n, the hedge ratio, is a constant.

  2. Calculate ‘n’ using regression so that the spread is as close to 0 as possible. Since the spread is defined on log prices, we regress log(a) on log(b); the slope of the fitted line is the hedge ratio.
Theory: In regression, we get residuals, which represent the distance of each observed value from the fitted line, i.e. from the estimated value. These residuals tell us how much the actual value of the spread deviates from 0 for the calculated ‘n’. We study the residuals to see whether or not they form a trend. If they do not, the spread moves around 0 without drifting away, which is consistent with it being stationary.
  3. Run the Dickey-Fuller test on the spread values computed with the estimated ‘n’ (a more elaborate and more popular version is the Augmented Dickey-Fuller test, or ADF). The DF test is a hypothesis test whose null hypothesis is that the series is non-stationary (has a unit root); it returns a p-value. If the p-value is less than 0.05 or 0.01, we can reject the null hypothesis at the 5% or 1% significance level, treat the spread as stationary, and choose this pair.
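
The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a production test: the price data is synthetic, the helper names `ols` and `dickey_fuller_t` are hypothetical, the regression includes an intercept (which the spread formula above omits), and instead of a p-value the sketch compares the simple Dickey-Fuller t-statistic with an approximate 5% critical value of -3.34 for residuals from a two-series regression. For real work you would normally rely on a statistics library, for example `adfuller` or `coint` in statsmodels, which handle lag selection and p-values.

```python
import math
import random

random.seed(42)

def ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def dickey_fuller_t(series):
    """t-statistic of gamma in: diff(s_t) = c + gamma * s_(t-1) + error (no extra lags)."""
    lagged = series[:-1]
    diffs = [series[i + 1] - series[i] for i in range(len(series) - 1)]
    c, gamma = ols(lagged, diffs)
    resid = [d - c - gamma * l for l, d in zip(lagged, diffs)]
    m = len(diffs)
    sigma2 = sum(r * r for r in resid) / (m - 2)
    mean_l = sum(lagged) / m
    sxx = sum((l - mean_l) ** 2 for l in lagged)
    return gamma / math.sqrt(sigma2 / sxx)

# Synthetic data (made up): log(b) is a random walk with drift, and
# log(a) = 0.1 + 1.5 * log(b) + a mean-reverting AR(1) error.
TRUE_N = 1.5
log_a, log_b = [], []
level, err = math.log(50.0), 0.0
for _ in range(1000):
    level += 0.001 + random.gauss(0.0, 0.01)
    err = 0.5 * err + random.gauss(0.0, 0.01)
    log_b.append(level)
    log_a.append(0.1 + TRUE_N * level + err)

# Step 2: regress log(a) on log(b); the slope is the hedge ratio n.
intercept, n = ols(log_b, log_a)

# Step 1 with the estimated n; subtracting the intercept centres the spread on 0.
spread = [la - n * lb - intercept for la, lb in zip(log_a, log_b)]

# Step 3: Dickey-Fuller t-statistic on the spread.
t_stat = dickey_fuller_t(spread)
CRITICAL_5PCT = -3.34  # approximate value, assumed here for illustration
is_stationary = t_stat < CRITICAL_5PCT
print(f"hedge ratio n = {n:.3f}, DF t-statistic = {t_stat:.2f}, stationary: {is_stationary}")
```

A more negative t-statistic is stronger evidence against a unit root, so a value below the critical value plays the same role as a p-value below 0.05 in the description above.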
In our next blog, we will work out the statistics involved in deciding the entry and exit signals of a pair trading strategy.
