Grokking Linear Regression Analysis in Finance

By Vivek Krishnamoorthy

Linear regression, simple linear regression, ordinary least squares, multiple linear, OLS, multivariate, ...

You've probably come across these names when you encountered regression. If that isn't enough, you even have stranger ones like lasso, ridge, quantile, mixed linear, etc.

My series of articles is meant for those who have some exposure to regression: you've used it or seen it used, so you probably have a fuzzy idea of what it is but haven't spent much time looking at it closely. There is plenty of material on regression online (including on the QI blog), each piece emphasizing different aspects of the subject.

We have a post that shows you how to use regression analysis to create a trend following trading strategy. We also have one that touches upon using the scikit-learn library to build and regularize linear regression models. There are posts that show how to use it on forex data, gold prices and stock prices, framing it as a machine learning problem.

My emphasis here is on building some level of intuition with a brief exposure to background theory. I will then go on to present examples to demonstrate the techniques we can use and what inferences we can draw from them.

I have intentionally steered clear of any derivations, because they have already been tackled well elsewhere (check the references section). There's enough going on here for you to feel the heat a little bit.

This is the first article on the subject. In it, we will work our way from the equation of a straight line to what models are, why they are called linear, where regression fits in, the nomenclature, the main types of linear regression, and the idea behind OLS estimation.


Some high school math

Most of us have seen the equation of a straight line in high school.

$$
y = mx + c
$$
where

  • $x$ and $y$ are the $X$- and $Y$- coordinates of any point on the line respectively,
  • $m$ is the slope of the line,
  • $c$ is the $y$- intercept (i.e. the point where the line cuts the $Y$-axis)

The relationships among $x, y, m$ and $c$ are deterministic, i.e. if we know the value of any three, we can precisely calculate the value of the unknown fourth variable.
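
To see this determinism in a couple of lines of Python (the numbers below are arbitrary): fix any three of the four quantities and the fourth is pinned down exactly.

```python
# A deterministic straight line: y = m*x + c
m, c = 2.0, 1.0            # slope and intercept (arbitrary values)
x = 3.0
y = m * x + c              # y follows exactly: 7.0

# Knowing y, x and c, the slope is recovered just as exactly
m_recovered = (y - c) / x
print(y, m_recovered)      # 7.0 2.0
```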

All linear models in econometrics (a fancy name for statistics applied to economics and finance) start from here with two crucial differences from what we studied in high school.

  1. The unknowns now are always $m$ and $c$
  2. When we calculate our unknowns, it's only our 'best' guess at what their values are. In fact, we don't calculate; we estimate the unknowns.

Before moving on to the meat of the subject, I'd like to unpack the term linear models.

We start with the second word.


What are models?

Generally speaking, models are educated guesses about the working of a phenomenon. They reduce or simplify reality. They do so to help us understand the world better. If we didn't work with a reduced form of the subject under investigation, we might as well work with reality itself. But that's neither feasible nor particularly helpful.

In the material world, a model is a simplified version of the object that we study. This version is created such that we capture its main features. The model of a human eye reconstructs it to include its main parts and their relationships with each other.

Similarly, the model of the moon (based on who is studying it) would focus on features relevant to that field of study (such as the topography of its surface, or its chemical composition or the gravitational forces it is subject to etc.).

However, in economics and finance (and other social sciences), our models are slightly peculiar. Here too, a model performs a similar function. But instead of dissecting an actual object, we are investigating social or economic phenomena.

Like what happens to the price of a stock when inflation is high or when there's a drop in GDP growth (or a combination of both). We only have raw observed data to go by. But that in itself doesn't tell us much. So we try to find a suitable and faithful approximation of our data to help make sense of it.

We embody this approximation in a mathematical expression with variables (or more precisely, parameters) that have to be estimated from our data set. These types of models are data-driven (or statistical) in nature.

In both cases, we wilfully delude ourselves with stories to help us interpret what we see.

In finance, we have no idea how the phenomenon is wired. But our models are useful mathematical abstractions, and for the most part, they work satisfactorily. As the statistician George Box said, “All models are wrong, but some are useful”. Otherwise, we wouldn’t be using them. :)

These finance models stripped to their bones can be seen as

$$
data = model + error
$$
or

$$
data = signal + noise
$$
It is useful to think of the modeling exercise as a means to unearth the structure of the hidden data-generating process (which is the process that causes the data to appear the way it does). Here, the model (if specified and estimated suitably) would be our best proxy to reveal this process.

I also find it helpful to think of working with data as a quest to extract the signal from the noise.
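
To make the data = signal + noise picture concrete, here is a minimal simulated sketch in Python (using NumPy; all numbers are made up for illustration). We fabricate a known straight-line signal, add random noise, and what results is the kind of 'data' a model then tries to make sense of.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 250
x = np.linspace(0, 1, n)

signal = 1.5 + 0.8 * x                  # the hidden data-generating process (made up)
noise = rng.normal(0, 0.3, size=n)      # the part we cannot explain
data = signal + noise                   # what we actually observe

# A modeller only ever sees `x` and `data`; the job is to recover the signal from the noise.
print(data[:5])
```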


Why linear?

Because the most commonly used statistical and mathematical models we encounter are either linear or can be transformed into a quasi-linear form. I speak of general ones like simple or multiple linear regression, logistic regression, etc. or even finance-specific ones like the CAPM, the Fama-French or the Carhart factor models.


Where does regression fit in?

Regression analysis is the fundamental method used in fitting models to our data set, and linear regression is its most commonly used form.

Here, the basic idea is to measure the linear relationship between variables whose behavior (with each other) we are interested in.

Both correlation and regression can help here. However, correlation summarizes the relationship in a single number, which by itself doesn't tell us much. Regression, on the other hand, gives us a mathematical expression that is richer and easier to interpret. So we prefer to work with it.

Linear regression assumes that the variable of our interest (the dependent variable) can be modeled as a linear function of the independent variable(s) (or explanatory variable(s)).

Francis Galton coined the name in the nineteenth century when he compared the heights of parents and their children. He observed that the children of unusually tall parents tended to be shorter than their parents, and the children of unusually short parents tended to be taller than theirs. Over generations, heights drifted towards the average. He referred to this phenomenon as 'regressing to the mean'.

The objective of regression analysis is to:

  • either measure the strength of relationships (between the response variable and one or more explanatory variables), or
  • forecast future values of the response variable


Nomenclature

When we read and learn about regression (and econometrics), every term or concept goes by a variety of names. So I've created a table here that you can check whenever you come across a new term (in this post or elsewhere).

Don’t spend much time on it at first glance. A scan should do. I expect this to be of help in the same way a human language dictionary is. You look at it when you see something unfamiliar. But you don’t usually read dictionaries cover to cover.

For each term below, I list the other names it goes by, its conventional expression, and a brief explanation.

Term: Simple linear regression
Also known as: linear regression, OLS regression, univariate regression, bivariate regression
Conventional expression: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ (scalar form), where $i = 1, 2, ..., n$ indexes the $n$ observations; $\mathbf{Y} = \mathbf{XB} + \mathbf{\epsilon}$ (matrix form)
Explanation: In the scalar form, $Y_1, Y_2, ..., Y_n$ are the values of the response variable, $X_1, X_2, ..., X_n$ are the values of the explanatory variable, $\epsilon_1, \epsilon_2, ..., \epsilon_n$ are the error terms for each observation, and $\beta_0$ and $\beta_1$ are the regression parameters. In the matrix form (I use $\mathbf{bold}$ type to denote vectors and matrices), $\mathbf{Y}$ is an $n \times 1$ response vector, $\mathbf{X}$ is an $n \times 2$ regressor matrix, $\mathbf{B}$ is a $2 \times 1$ vector of parameters, and $\mathbf{\epsilon}$ is an $n \times 1$ vector of error terms.

Term: Linear regression
Also known as: multiple regression, multiple OLS regression, multivariate regression
Conventional expression: $Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + ... + \beta_{k-1} X_{k-1,i} + \epsilon_i$ (scalar form), where $i = 1, 2, ..., n$ indexes the $n$ observations; $\mathbf{Y} = \mathbf{XB} + \mathbf{\epsilon}$ (matrix form)
Explanation: In the scalar form, $Y_1, Y_2, ..., Y_n$ are the values of the response variable, $X_{1,i}, X_{2,i}, ..., X_{k-1,i}$ are the values of the explanatory variables for the $i^{th}$ observation, $\epsilon_1, \epsilon_2, ..., \epsilon_n$ are the error terms for each observation, and $\beta_0, \beta_1, ..., \beta_{k-1}$ are the regression parameters. In the matrix form, $\mathbf{Y}$ is an $n \times 1$ response vector, $\mathbf{X}$ is an $n \times k$ regressor matrix, $\mathbf{B}$ is a $k \times 1$ vector of parameters, and $\mathbf{\epsilon}$ is an $n \times 1$ vector of error terms.

Term: Explanatory variable(s)
Also known as: independent variable(s), covariate(s), feature(s), predictor(s), input(s), X-variable(s), regressor(s)
Conventional expression: $x$, $x_i$, $X$ or $X_i$ ($x_i$ or $X_i$ are used when there is more than one explanatory variable; the subscript $i = 1, 2, 3, ...$ depends on the model used)
Explanation: The variable(s) which should tell us something about the response variable. For example, in a model where the returns on the IBM (NYSE : IBM) stock are driven by the returns on the SPDR S&P 500 ETF (NYSEARCA : SPY) and the Microsoft (NASDAQ : MSFT) stock,
$Return_{IBM} = \beta_0 + \beta_1 Return_{SPY} + \beta_2 Return_{MSFT} + \epsilon$,
$Return_{SPY}$ and $Return_{MSFT}$ are the explanatory variables and $Return_{IBM}$ is the response variable.

Term: Response variable
Also known as: dependent variable, output, label/value, outcome variable, Y-variable, predicted variable, regressand
Conventional expression: $y$ or $Y$ (there is usually only one response variable, hence no subscript; if there is more than one, we use $Y_i$ or $y_i$, where the subscript $i = 1, 2, 3, ...$ depends on the model used)
Explanation: The variable we are interested in. In the same example, $Return_{IBM} = \beta_0 + \beta_1 Return_{SPY} + \beta_2 Return_{MSFT} + \epsilon$, the response variable is $Return_{IBM}$.

Term: Model parameters
Also known as: regression parameters, population parameters, unknown parameters, regression coefficients
Conventional expression: $\beta_0, \beta_1, \beta_2$, or more generally $\beta_i$, $\alpha$, or $b_0, b_1$, etc.
Explanation: The variables internal to the model that are estimated from the data set. For example, in $y = \beta_0 + \beta_1 x + \epsilon$, we model the relationship between $x$ and $y$, and $\beta_0$ and $\beta_1$ describe that relationship.

Term: Model estimates
Also known as: slopes, estimates, regression estimates, parameter estimates
Conventional expression: $\hat\beta_0, \hat\beta_1, \hat\beta_2$, or more generally $\hat\beta_i, \hat\alpha, \hat b_0, \hat b_1, ...$
Explanation: The estimates of the model parameters such as $\beta_0, \beta_1$, etc. For example, in $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, we calculate the fitted values of the response variable, and $\hat\beta_0$ and $\hat\beta_1$ are the model estimates.

Term: Intercept
Also known as: y-intercept, constant
Conventional expression: $\beta_0, \alpha, a, b_0$
Explanation: In the fitted equation $\hat{Y_i} = \hat\beta_0 + \hat\beta_1 X_{1,i} + \hat\beta_2 X_{2,i}$, the intercept is the predicted value of the response variable ($\hat{Y_i}$) when all the $X$'s (in this case $X_{1,i}$ and $X_{2,i}$) are zero. If the $X$'s can never jointly be zero, the intercept has no interpretable meaning; it is a plug value needed for making predictions of the response variable. It is the equivalent of $c$ in the equation $y = mx + c$.

Term: Errors
Also known as: noise, disturbances, innovations, (loosely) residuals
Conventional expression: $\epsilon_i, \epsilon, e_i, e, u, u_i$
Explanation: The part of the response variable that the model does not explain. In a model such as $Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i$, the errors are the deviations of the observations from the true model. Strictly speaking, what is left after fitting the model to the data are the residuals, i.e. the differences between the observed and the fitted values of the response variable ($e_i = Y_i - \hat{Y_i}$); in practice, the two terms are often used interchangeably.

Note: Features and labels/values are machine learning terminology used when referring to explanatory variables and response variables respectively.

We now look at the main types of regression analysis.


Types of linear regression

1. Simple linear regression

Imagine that we hold the Coca-Cola (NYSE : KO) stock and are interested in its returns. Conventionally, we denote our variable of interest with the letter $\textbf{Y}$. We usually have multiple observations (taken to be $n$) of it. So, the $\textbf{Y}$ that we previously mentioned is an $n$-dimensional vector containing the values $Y_i$.

Here and throughout this post, I use the scalar versions of the equations. You can refer to the nomenclature table above to view the matrix forms. You can also read a more detailed treatment of the analytical expressions and derivations in standard econometrics textbooks like Baltagi (2011), Wooldridge (2015) and Greene (2018).

We want to examine the relationship between our stock's returns ($\textbf{Y}$) and the market returns (denoted as $\textbf{X}$). We believe the market returns, i.e. the returns on the SPDR S&P 500 ETF (NYSEARCA : SPY), should tell us something about KO's returns. For each observation $i$,
$$
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\label{eq1}
\tag{1}
$$

$\beta_0$ and $\beta_1$ are called the model parameters.

Equation $\ref{eq1}$ is just a dolled-up version of the $y = mx + c$ we saw earlier, with an additional $\epsilon_i$ term. In it, $\beta_0$ and $\beta_1$ are commonly referred to as the intercept and the slope respectively.

This is the simple linear regression model.

We call it simple, since there is only one explanatory variable here; and we call it linear, since the equation is that of a straight line. It's easy to visualize in our mind's eye, since $X$ and $Y$ are like the coordinates of points on a Cartesian plane.

A linear regression is linear in its regression coefficients.
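
As a rough sketch of how equation $\ref{eq1}$ is estimated in practice, here is a Python example using the statsmodels library (just one of several ways to run an OLS regression). The KO and SPY return series below are simulated stand-ins with made-up parameter values, not real market data; with actual data you would plug in your computed daily returns instead.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-ins for daily returns (purely illustrative numbers)
n = 500
spy_returns = rng.normal(0.0005, 0.010, size=n)                          # "market" returns
ko_returns = 0.0002 + 0.6 * spy_returns + rng.normal(0, 0.008, size=n)   # "KO" returns

# Y_i = beta_0 + beta_1 * X_i + eps_i
X = sm.add_constant(spy_returns)     # adds the column of ones for the intercept (beta_0)
results = sm.OLS(ko_returns, X).fit()

print(results.params)                # [beta_0_hat, beta_1_hat]
print(results.summary())             # full regression output
```

The fitted values should land close to the 0.0002 and 0.6 baked into the simulation, which is the whole point: OLS gives us our best guess at the unknown parameters.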

A natural extension to this model is the multiple linear regression model.

2. Multiple linear regression

Let’s now say we believe there are multiple factors that tell us something about KO's returns. They could be SPY's returns, its competitor PepsiCo’s (NASDAQ : PEP) returns, and the US Dollar index (ICE : DX) returns. We denote these variables with the letter $\mathbf{X}$ and add subscripts for each of them. We use the notation $X_{i,1}, X_{i,2}$ and $X_{i,3}$ to refer to the $i^{th}$ observation of SPY, PEP and DX returns respectively.

Like before, let’s put them all in an equation format to make things explicit.

$$
Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \epsilon_i
\label{ref2}
\tag{2}
$$

$\beta_0, \beta_1, \beta_2$ and $\beta_3$ are the model parameters in equation $\ref{ref2}$.

Here, we have a multiple linear regression model to describe the relation between $\mathbf{Y}$ (the returns on KO) and $\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3$ (the returns on SPY, PEP, and DX respectively).

We call it multiple, since there is more than one explanatory variable (three, in this case); and we call it linear, since the coefficients are linear.

When we go from one to two explanatory variables, we can visualize it as a 2-D plane (which is the generalization of a line) in three dimensions.

For example, $Y = 3 - 2X_1 + 4X_2$ can be plotted as shown below.

[Figure: the plane $Y = 3 - 2X_1 + 4X_2$ plotted in three dimensions]
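
If you would like to reproduce such a picture yourself, here is a minimal matplotlib sketch (my own illustrative code, not the original figure) of the plane $Y = 3 - 2X_1 + 4X_2$.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of (X1, X2) values and the corresponding plane Y = 3 - 2*X1 + 4*X2
x1, x2 = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
y = 3 - 2 * x1 + 4 * x2

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # 3-D axes (matplotlib >= 3.2)
ax.plot_surface(x1, x2, y, alpha=0.6)
ax.set_xlabel("$X_1$")
ax.set_ylabel("$X_2$")
ax.set_zlabel("$Y$")
plt.show()
```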

As we add more explanatory variables, say $k$ of them, we move to $k$-dimensional planes (called hyperplanes) in $(k+1)$ dimensions, which are much harder to visualize (anything above three dimensions is). Nevertheless, they would still be linear in their coefficients and hence the name.

The objective of multiple linear regression is to find the “best” possible values for $\beta_0, \beta_1, \beta_2$, and $\beta_3$ such that the formula can “accurately” calculate the value of $Y_i$.

In our example here, we have three $\mathbf{X}'s$.

Multiple regression allows for any number of $\mathbf{X}'s$ (as long as they are less than the number of observations).
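
Here is a similarly rough sketch of equation $\ref{ref2}$, with three simulated series standing in for the SPY, PEP and DX returns. Again, all numbers are fabricated; the point is only the mechanics of fitting a multiple linear regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Simulated stand-ins for the three explanatory return series
spy = rng.normal(0.0005, 0.010, size=n)
pep = rng.normal(0.0004, 0.012, size=n)
dx = rng.normal(0.0000, 0.005, size=n)

# A made-up data-generating process for KO's returns
ko = 0.0001 + 0.5 * spy + 0.3 * pep - 0.2 * dx + rng.normal(0, 0.008, size=n)

# Y_i = beta_0 + beta_1*X_{i,1} + beta_2*X_{i,2} + beta_3*X_{i,3} + eps_i
X = sm.add_constant(np.column_stack([spy, pep, dx]))
results = sm.OLS(ko, X).fit()

print(results.params)    # [beta_0_hat, beta_1_hat, beta_2_hat, beta_3_hat]
```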

3. Linear regression of a non-linear relationship

Suppose we have a model like so:
$$
Y_i = AL_i^\beta K_i^{\alpha}
$$
For the curious reader, this is the Cobb-Douglas production function, where

  • $Y_i$ - Total production in the $i^{th}$ economy
  • $L_i$ - Labor input in the $i^{th}$ economy
  • $K_i$ - Capital input in the $i^{th}$ economy
  • $A$ - Total factor productivity

We can linearize it by taking logarithms on both sides to get

$$
\log Y_i = \log A + \beta \log L_i + \alpha \log K_i
$$

This is still a multiple linear regression equation, since it is linear in the coefficients $\alpha$ and $\beta$ (i.e. they appear with degree 1).

We can use standard procedures like OLS (details below) to estimate them if we have the data for $\textbf{Y}$, $\textbf{L}$ and $\textbf{K}$.
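
As an illustration of the log-linearization at work, here is a short Python sketch on simulated data. The 'true' values $A = 2$, $\beta = 0.7$ and $\alpha = 0.3$ are made up for the example; with real data you would supply observed output, labour and capital series.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200

# Simulated labour and capital inputs (arbitrary positive values)
L = rng.uniform(10, 100, size=n)
K = rng.uniform(10, 100, size=n)

# Made-up Cobb-Douglas economy: A = 2, beta = 0.7, alpha = 0.3, with multiplicative noise
Y = 2.0 * L**0.7 * K**0.3 * np.exp(rng.normal(0, 0.05, size=n))

# log Y = log A + beta*log L + alpha*log K  -- linear in the coefficients
X = sm.add_constant(np.column_stack([np.log(L), np.log(K)]))
results = sm.OLS(np.log(Y), X).fit()

print(results.params)    # [log(A)_hat, beta_hat, alpha_hat]
```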


Model parameters and model estimates

In equation $\ref{eq1}$, the values of $Y_i$ and $X_i$ can be easily computed from an OHLC data set for each day. However, that is not the case with $\beta_0, \beta_1$ and $\epsilon_i$. We need to estimate them from the data.

Estimation theory is at the heart of how we do it. We use Ordinary Least Squares (or Maximum Likelihood Estimation) to get a handle on the values of $\beta_0$ and $\beta_1$. We call the process of finding the best estimates for the model parameters "fitting" or "training" the model.

Estimates, however, are still estimates. We never know the actual theoretical values of the model parameters (i.e. $\beta_0$ and $\beta_1$). OLS only helps us make an educated conjecture about what their values are. The hats we put over them (i.e. $\hat\beta_0$ and $\hat\beta_1$) denote that they are model estimates, not the parameters themselves.

In quantitative finance, our data sets are small, mostly numerical, and have a low signal-to-noise ratio. Therefore, our parameter estimates have a high margin of error.


So what’s OLS?

OLS stands for Ordinary Least Squares. It is the most widely used technique for estimating the unknown parameters in a linear regression model.

I’d earlier mentioned choosing the ‘best’ possible values for the model parameters so that the formula can be as ‘accurate’ as possible.

OLS has a particular way of describing 'best' and 'accurate'. Here goes.

It defines the 'best' coefficients as those that minimize the sum of the squared differences between the predicted values $\hat{Y_i}$ (as per the formula) and the actual values $Y_i$.
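
In symbols, for the simple linear regression of equation $\ref{eq1}$, OLS solves the minimization below. The resulting closed-form estimates are standard (their derivation is in the references), and I state them here only for completeness.

$$
(\hat\beta_0, \hat\beta_1) = \underset{\beta_0,\, \beta_1}{\operatorname{arg\,min}} \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2
$$

which gives

$$
\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}
$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of the $X_i$ and $Y_i$ respectively.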


What's next?

As promised, the good news is that I will not delve into the analytical derivations of model parameter estimates. We will instead defer to the better judgement of our statistician and economist friends, and in our next post, get to implementing what we've learned.

Until next time!


References

  1. Baltagi, Badi H., Econometrics, Springer, 2011.
  2. Greene, William H., Econometric Analysis, Pearson Education, 2018.
  3. Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Cengage Learning, 2015.

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!


Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.
