The exotic flavours of regression in finance - A first glance


By Vivek Krishnamoorthy and Udisha Alok

Regression is a technique to unearth the relationship between dependent and independent variables. It is routinely seen in machine learning and used mainly for predictive modelling. In the final installment of this series, we expand our scope to cover other types of regression analysis and their uses in finance.


Previously, we covered linear regression in great detail. We explored how linear regression analysis can be used in finance, applied it to financial data, and looked at its assumptions and limitations. Be sure to give those posts a read.

Linear regression

We have covered linear regression in detail in the preceding blogs in this series. We present a capsule version of it here before moving on to the newer stuff. You can skip this section if you’ve spent sufficient time with it earlier.

Simple linear regression

Simple linear regression allows us to study the relationship between two continuous variables: an independent variable and a dependent variable.

Linear regression: Source

The generic form of the simple linear regression equation is as follows:

\(y_{i} = β_{0} + β_{1}X_{i} + ϵ_{i}\)                           - (1)

where \(β_{0}\) is the intercept, \(β_{1}\) is the slope, and \(ϵ_{i}\) is the error term. In this equation, ‘y’ is the dependent variable, and ‘X’ is the independent variable. The error term captures all the other factors that influence the dependent variable other than the regressors.
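As a minimal sketch of equation (1), the intercept and slope can be estimated with the textbook least-squares formulas. The function name `fit_simple_ols` and the return figures below are hypothetical, made up purely for illustration.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = sum of cross-deviations of (x, y) over sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: market returns (X) and a stock's returns (y), in percent.
market = [-1.0, 0.0, 1.0, 2.0, 3.0]
stock = [-1.1, 0.2, 1.4, 2.3, 3.7]

b0, b1 = fit_simple_ols(market, stock)
# The residuals are the sample counterpart of the error term in equation (1).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(market, stock)]
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

With an intercept in the model, the residuals sum to zero (up to floating-point error), which is a quick sanity check on any OLS fit.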

Multiple linear regression

We study the linear relationships between more than two variables in multiple linear regression. Here more than one independent variable is used to predict the dependent variable.

The equation for multiple linear regression can be written as:

\(y_{i} = β_{0} + β_{1}X_{i1} + β_{2}X_{i2} + β_{3}X_{i3} + ϵ_{i}\)                                   -(2)

where, \(β_{0}\), \(β_{1}\), \(β_{2}\) and \(β_{3}\) are the model parameters, and \(ϵ_{i}\) is the error term.

Polynomial regression

Linear regression works well for modelling linear relationships between the dependent and independent variables. But what if the relationship is non-linear?

In such cases, we can add polynomial terms to the linear regression equation to make it model the data better. This is called polynomial regression. Since the model is linear in parameters, it’s still, strictly speaking, linear regression.

Linear vs Polynomial regression: Source

Using polynomial regression, we can model the relationship between the independent and dependent variables in the form of a polynomial equation.

The equation for a \(k^{th}\) order polynomial can be written as:

\(y_{i} = β_{0} + β_{1}X_{i} + β_{2}X_{i}^{2} + β_{3}X_{i}^{3} + β_{4}X_{i}^{4} + \dots + β_{k}X_{i}^{k} + ϵ_{i}\)                 -(3)

Choosing the polynomial order is crucial, as a higher degree polynomial could overfit the data. So we try to keep the order of the polynomial model as low as possible.

There are two approaches to choosing the order of the model:

  • Forward selection procedure, where we successively fit models in increasing order and test the significance of the coefficients at each iteration till the t-test for the highest order term is not significant.
  • Backward elimination procedure, where we start with the highest order polynomial and successively decrease the order in each iteration till the highest order term has a  significant t-statistic.
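The forward selection procedure can be sketched as follows, assuming NumPy is available. The helper names, the simulated data, and the rough cut-off of |t| > 2 for significance are illustrative assumptions, not a prescribed rule.

```python
import numpy as np


def highest_term_t_stat(x, y, order):
    """t-statistic of the highest-order coefficient in a polynomial fit."""
    # Design matrix with columns 1, x, x^2, ..., x^order.
    X = np.vander(x, order + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the coefficients
    return beta[-1] / np.sqrt(cov[-1, -1])


def forward_select_order(x, y, max_order=5, t_crit=2.0):
    """Raise the order until the highest-order term stops being significant."""
    chosen = 0
    for order in range(1, max_order + 1):
        if abs(highest_term_t_stat(x, y, order)) < t_crit:
            break
        chosen = order
    return chosen


# Simulated data with a genuinely quadratic relationship plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 100)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

order = forward_select_order(x, y)
print("selected order:", order)
```

Backward elimination would run the same test in the opposite direction, starting from `max_order` and dropping the highest term while it is not significant.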

The most commonly used polynomial regression models are the first- and second-order polynomials.

Polynomial regression is more suited when we have a large number of observations. However, it is sensitive to the presence of outliers.

The polynomial regression model can be used to model non-linear data such as stock prices. You can read more about polynomial regression and its use in predicting stock prices here.

Logistic regression

Logistic regression, also known as logit regression, is an analytical method to predict the binary outcome of an event based on past data.

When the dependent variable is qualitative and takes binary values, it is called a dichotomous variable.

If we use linear regression to predict such a variable, it can produce values outside the range of 0 to 1. Also, since a dichotomous variable can take on only two values, the residuals will not be normally distributed about the predicted line.

Logistic regression is a non-linear model that produces a logistic (S-shaped) curve whose values always lie between 0 and 1. These values are interpreted as the probability that an observation belongs to class 1.

This probability is compared to a threshold value, typically 0.5, to decide the final classification. So if the predicted probability is more than 0.5, the observation is labeled as 1, else 0.
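A minimal sketch of this thresholding step is shown below. The coefficients `b0` and `b1` are hypothetical values picked for illustration; in practice they would be estimated from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, b0, b1):
    """Probability of class 1 for a single predictor value x."""
    return sigmoid(b0 + b1 * x)


def classify(x, b0, b1, threshold=0.5):
    """Label as 1 if the predicted probability exceeds the threshold, else 0."""
    return 1 if predict_proba(x, b0, b1) > threshold else 0


# Hypothetical coefficients (assumed, not estimated from any dataset).
b0, b1 = -0.2, 1.5

for x in (-2.0, 0.0, 2.0):
    print(x, round(predict_proba(x, b0, b1), 3), classify(x, b0, b1))
```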

One of the use cases of logistic regression in finance is that it can be used to predict the performance of stocks.

You can read more about logistic regression along with Python code on how to use it to predict stock movement on this blog.

Logistic regression: Source

Quantile regression

As we saw in our last blog, the linear regression model has several limitations when dealing with financial time series data, such as skewness and the presence of outliers.

In 1978, Koenker and Bassett proposed quantile regression as a tool that allows us to explore the entire data distribution. So, we can examine the relationship between the independent and dependent variables at different parts of the distribution, say, the 10th percentile, the median, the 99th percentile, etc.

Quantile regression estimates the conditional median or any other conditional quantile of the dependent variable for the given independent variables.

Quantile regression: Source

Classical linear regression attempts to predict the mean value of the dependent variable for given values of the independent variable(s). The OLS coefficient of an independent variable signifies the change in the conditional mean of the dependent variable from a one-unit change in that predictor. Similarly, a quantile regression coefficient denotes the change in the specified quantile of the dependent variable from a one-unit change in the associated predictor.

Quantiles and percentiles are used to divide the data samples into different groups. The linear regression model works on the assumption that the errors are normally distributed. However, this approach may fail when we have significant outliers, that is, when the distribution has a fat tail. Quantile regression is more robust to outliers than linear regression and is better able to describe the tails of the distribution.

In quantile regression, the conditional median function is estimated by the median estimator, which minimizes the sum of absolute errors. Other quantiles are estimated by minimizing an asymmetrically weighted sum of absolute errors.
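To make the loss function concrete, here is a small sketch of the asymmetric absolute loss (often called the pinball or check loss) for the simplest case of a model with only a constant. The numbers are made up for illustration and include one deliberate outlier.

```python
def pinball_loss(y, q, tau):
    """Sum of asymmetrically weighted absolute errors around the value q."""
    total = 0.0
    for yi in y:
        u = yi - q
        # Under-predictions are weighted by tau, over-predictions by (1 - tau).
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total


def best_constant(y, tau):
    """Constant minimising the pinball loss, searched over the data points.

    The loss is piecewise linear in q with kinks at the observations,
    so a minimiser can always be found among the data points themselves.
    """
    return min(sorted(y), key=lambda q: pinball_loss(y, q, tau))


# Hypothetical sample with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

mean_value = sum(data) / len(data)      # pulled towards the outlier
median_fit = best_constant(data, 0.5)   # tau = 0.5 gives the median
upper_fit = best_constant(data, 0.9)    # tau = 0.9 targets the upper tail
print(mean_value, median_fit, upper_fit)
```

With `tau = 0.5` the loss is proportional to the sum of absolute errors, so the fit is the sample median and is unaffected by how extreme the outlier is, unlike the mean. A full quantile regression minimizes the same loss over the regression coefficients instead of a single constant.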

Quantile regression can help risk managers in managing the tail-risk in a better manner. So it is used in risk management, especially in the context of the Value at Risk (VaR), which is, by definition, a conditional quantile.

The VaR can be interpreted as the loss on a portfolio that will not be exceeded with a given probability over a time period. We can also identify the periods of higher risk exposure based on quantile regression.

Quantile regression can be used to forecast returns and for portfolio construction too.

Ridge regression

As we discussed previously, linear regression assumes there is no multicollinearity in the data. Hence, it is not suitable when the predictor variables are correlated. Multicollinearity can cause wide swings in the regression model coefficients.

Ridge regression is suitable to be used in such a scenario. It is especially useful when the number of predictor variables is larger than the number of observations and when each predictor contributes to predicting the dependent variable.

Ridge regression aims to reduce the variance of the coefficient estimates (and hence their standard errors) by constraining the size of the coefficients.

It does so by adding a penalty term to the least-squares objective that equals lambda (𝜆) times the sum of the squared coefficients. The penalty discourages large regression coefficients, and as the value of lambda increases, so does the penalty. Since the penalty is based on the squared (L2) norm of the coefficients, ridge regression is also known as L2 regularization.

An important point to note is that while the OLS estimator is scale-invariant, ridge regression is not. So, we need to scale the variables before applying ridge regression.

Ridge regression decreases the model complexity but does not reduce the number of variables, as it can shrink the coefficients close to zero but does not make them exactly zero. Hence, it cannot be used for feature selection.
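The shrinkage behaviour can be illustrated with the closed-form ridge solution, assuming NumPy is available. The simulated predictors (two of them nearly collinear) and the lambda values are arbitrary choices for this sketch.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution on standardised predictors and centred y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # ridge is not scale-invariant
    yc = y - y.mean()
    p = Xs.shape[1]
    # Solve (X'X + lam * I) b = X'y; lam = 0 reduces to ordinary least squares.
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)


rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 * x1 + 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

b_ols = ridge_coefficients(X, y, 0.0)
b_small = ridge_coefficients(X, y, 10.0)
b_large = ridge_coefficients(X, y, 1000.0)

for name, b in (("lambda=0", b_ols), ("lambda=10", b_small), ("lambda=1000", b_large)):
    print(name, np.round(b, 3))
```

As lambda grows, the overall size of the coefficient vector falls, but none of the coefficients becomes exactly zero.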

You can read more about ridge regression here.

Lasso regression

Lasso stands for Least Absolute Shrinkage and Selection Operator.

It is a close cousin of ridge regression and is also used to regularize the coefficients in a regression model. Regularization is done to avoid overfitting when we have a large number of predictor variables that make the model more complex.

The lasso regression’s penalty term is equal to lambda (𝜆) times the sum of the absolute values of the coefficients.

Lasso regression is also known as L1 regularization.

As its name suggests, the lasso regression can shrink some of the coefficients to exactly zero. Hence, it can be used for feature selection.
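One common way to fit a lasso model is cyclic coordinate descent with a soft-thresholding step, which is where the exact zeros come from. The following is a bare-bones sketch assuming NumPy, with simulated data in which only the first predictor matters; it is not a production solver.

```python
import numpy as np


def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum|b_j| by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back.
            partial_resid = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_resid / n
            z = X[:, j] @ X[:, j] / n
            b[j] = soft_threshold(rho, lam) / z
    return b


# Simulated data: five standardised predictors, only the first one is relevant.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
y = y - y.mean()

b_hat = lasso_coordinate_descent(X, y, lam=0.5)
print(np.round(b_hat, 3))
```

The irrelevant predictors end up with coefficients of exactly zero, while the relevant one is kept but shrunk towards zero.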

Ridge vs Lasso regression: Source

Comparison between Ridge regression and Lasso regression

Ridge regression and lasso regression can be compared as follows:

  • Lasso regression can be used for feature selection while ridge regression cannot.
  • While both ridge and lasso regression work well to deal with multicollinearity in the data, they deal with it differently. While ridge regression shrinks the coefficients of all correlated variables, making them similar, lasso regression retains one of the correlated variables with a larger coefficient, while the remaining tend to zero.
  • Ridge regression works well in cases where there are a large number of significant predictor variables. Lasso regression is effective in cases where there are many predictor variables, but only a few are significant.
  • Both these models can be used for stock prediction. However, since Lasso regression performs feature selection and selects only the non-zero coefficients for training the model, it may be a better choice in some cases. You can read this paper to know more about using Lasso regression for stock market analysis.

Elastic net regression

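Lasso regression’s feature selection may not be reliable as it is dependent on the data. For example, when predictors are highly correlated, which of them it retains can change from one sample to another. Elastic net regression is a combination of the ridge and lasso regression models. It combines the penalty terms from both these models and often performs better when the predictors are correlated.

Conceptually, elastic net regression can be viewed as a two-stage procedure: we first compute the ridge regression coefficients, which are then shrunk using lasso regression. In practice, both penalties are usually applied together in a single optimization.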

Elastic net regression can be used for regularization as well as feature selection.
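The three penalty terms can be written down in a few lines. The mixing parameter `alpha` used here is one common parameterisation and an assumption of this sketch; some libraries use a slightly different scaling (for example, an extra factor of 1/2 on the squared term).

```python
def ridge_penalty(beta, lam):
    """L2 penalty: lambda times the sum of squared coefficients."""
    return lam * sum(b * b for b in beta)


def lasso_penalty(beta, lam):
    """L1 penalty: lambda times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in beta)


def elastic_net_penalty(beta, lam, alpha):
    """Weighted mix of both: alpha = 1 is pure lasso, alpha = 0 is pure ridge."""
    return alpha * lasso_penalty(beta, lam) + (1.0 - alpha) * ridge_penalty(beta, lam)


# Hypothetical coefficient vector and penalty strength.
beta = [0.5, -2.0, 0.0]
lam = 0.1

print("ridge      :", ridge_penalty(beta, lam))
print("lasso      :", lasso_penalty(beta, lam))
print("elastic net:", elastic_net_penalty(beta, lam, alpha=0.5))
```

Each penalty is added to the residual sum of squares to form the objective that the corresponding model minimizes.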

Read this blog to learn more about the ridge, lasso and elastic net regressions along with their implementation in Python.

Penalty terms for Ridge, Lasso, and Elastic net regression: Source

Least angle regression

As we saw earlier, lasso regression constrains the coefficients of a model by applying a bias, hence avoiding overfitting. However, we need to provide a hyperparameter lambda (𝛌) to the model, which controls the weight of the penalty of the function.

The Least Angle Regression (LARS) is an alternative approach to solve the problem of overfitting in a linear regression model, which can be tuned to perform lasso regression without providing a hyperparameter.

LARS is used when we have high-dimensional data, i.e., data that has a large number of features. It is similar to the forward stepwise regression.

In LARS, we start with all coefficients equal to zero and find the explanatory variable that is most correlated with the response variable. We then take the largest step possible in the direction of this explanatory variable until another explanatory variable has a similar correlation with the residual.

Now, LARS proceeds in an equiangular direction between these two explanatory variables till a third explanatory variable pops up with the same correlation with the residual.

As earlier, we move forth equiangularly (with the least angle) in the direction of these three explanatory variables. This is done till all the explanatory variables are in the model.
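The first stage of this procedure can be sketched as below, assuming NumPy. Note the simplification: actual LARS computes the step length in closed form, whereas this illustration creeps forward in small increments until a second predictor catches up. The function name and the simulated data are hypothetical.

```python
import numpy as np


def first_lars_stage(X, y, step=0.001):
    """Pick the most correlated predictor and move along it until another ties."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    n = len(y)
    residual = y - y.mean()
    # With unit-variance columns, these values are proportional to correlations.
    corr = Xs.T @ residual / n
    first = int(np.argmax(np.abs(corr)))
    direction = np.sign(corr[first])
    coef = 0.0
    while np.abs(corr[first]) > np.delete(np.abs(corr), first).max():
        coef += direction * step
        residual = residual - direction * step * Xs[:, first]
        corr = Xs.T @ residual / n
    rivals = np.abs(corr).copy()
    rivals[first] = -np.inf
    second = int(np.argmax(rivals))
    return first, second, coef, corr


# Simulated data: column 2 is the strongest driver, column 0 a weaker one.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=100)

first, second, coef, corr = first_lars_stage(X, y)
print("first:", first, "second:", second, "coefficient so far:", round(coef, 3))
```

From this point, LARS would move both coefficients together along the equiangular direction until a third predictor ties with them.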

However, it must be noted that the LARS model is sensitive to noise.

Geometric representation of LARS: Source

Principal components regression

Principal component analysis (PCA) is used to represent data parsimoniously with the least amount of information loss. The aim of PCA is to find principal components, which are linear combinations of the predictors that are mutually orthogonal and capture the maximum variance. Two principal components are said to be orthogonal if the scalar product of their vectors is equal to zero.

Principal component regression involves using PCA for dimensionality reduction on the original data and then conducting regression on the top principal components and discarding the remaining.
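A compact sketch of these two steps, assuming NumPy, is given below. The function name `pcr_fit` and the simulated, strongly correlated predictors are illustrative assumptions.

```python
import numpy as np


def pcr_fit(X, y, k):
    """Regress y on the top-k principal components of X.

    Returns the intercept and the coefficients expressed on the original predictors.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                        # p x k loading matrix
    scores = Xc @ V_k                     # n x k principal component scores
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    beta = V_k @ gamma                    # map back to the original predictors
    intercept = y_mean - x_mean @ beta
    return intercept, beta


# Simulated data: four predictors driven by one common factor plus noise.
rng = np.random.default_rng(4)
factor = rng.normal(size=150)
X = np.column_stack([factor + rng.normal(scale=0.3, size=150) for _ in range(4)])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=0.5, size=150)

intercept_2, beta_2 = pcr_fit(X, y, k=2)      # keep the top two components
intercept_all, beta_all = pcr_fit(X, y, k=4)  # keep every component

# Ordinary least squares on the centred data, for comparison.
beta_ols, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)
print(np.round(beta_2, 3), np.round(beta_all, 3))
```

When all the components are kept, principal component regression reproduces the OLS coefficients; the regularizing effect comes entirely from discarding the low-variance components.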

Image representing principal component analysis: Source

Comparison between Multiple Linear regression and PCA

Principal component regression is an alternative to multiple linear regression, which has some major disadvantages.

MLR cannot handle multicollinearity among the predictors and assumes that the predictors are measured accurately and without noise. It cannot handle missing values.

Also, if the number of predictors is larger than the number of observations, MLR cannot be used.

PCA replaces a large number of predictors with a smaller number of principal components that capture the maximum variance represented by the predictors. It reduces the complexity of the model while retaining most of the information. Some variants of PCA can also handle missing data.

Comparison between Ridge regression and PCA

Ridge regression and principal component regression are similar. Conceptually, ridge regression can be imagined as projecting the response onto the principal components of the predictors and then shrinking the coefficient of each component, with the low-variance components shrunk the most.

This shrinks the coefficients of all the principal components but does not shrink any of them completely to zero. Principal component regression, in contrast, effectively shrinks the coefficients of some principal components to zero (the ones that get excluded) and does not shrink the others at all.

Decision trees regression

Decision trees split the datasets into smaller and smaller subsets at the nodes, thereby creating a tree-like structure. Each of the nodes where the data is split based on a criterion is called an internal/split node, and the final subsets are called the terminal/leaf nodes.

Decision trees can be used for solving classification problems like predicting whether the price of a financial instrument will go up or down. They can also be used to predict the price itself.

Decision tree regression is the use of a decision tree model for a regression task, that is, to predict continuous values instead of discrete ones.

Decision trees follow a top-down greedy approach known as recursive binary splitting. It is a greedy approach because, at each step, the best split is made at that particular node instead of looking ahead and picking a split that may lead to a better tree in the future.

Each node is split to maximize the information gain. The information gain is defined as the difference between the impurity of the parent node and the weighted sum of the impurities of the child nodes.

For regression trees, the two popular measures of impurity are:

  • Least squares: Each split is chosen to minimize the residual sum of squares (RSS) between the observation and the mean at each node.
  • Least absolute deviations: This method minimizes the mean absolute deviation from the median within each node. This method is more robust to outliers but may be insensitive when dealing with a dataset with a large number of zero values.
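The least squares criterion can be illustrated for a single predictor with a short, pure-Python sketch. The function names and the toy data are hypothetical; a real regression tree would apply the same search recursively within each child node.

```python
def rss(values):
    """Residual sum of squares around the mean of the values."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Threshold on x that minimises the combined RSS of the two child nodes."""
    pairs = sorted(zip(x, y))
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid threshold between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss


# Toy data with an obvious jump in y between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]

threshold, split_rss = best_split(x, y)
print("split at x <=", threshold, "with RSS", round(split_rss, 4))
```

Replacing the mean with the median and the squared errors with absolute errors in `rss` would give the least absolute deviations variant.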

If a highly nonlinear and complex relationship exists between the explanatory variables and the response variable, a decision tree may outperform classical approaches.

Decision trees are easier to interpret, have a nice visual representation, and can easily handle qualitative predictors without the need to create dummy variables.

However, they are not robust and have poor predictive accuracy compared to some of the other regression models. Also, they are prone to overfitting for a dataset with many predictor variables.

By using ensemble methods like bagging, boosting, and random forests, we can improve the predictive performance of decision trees.

Random forest regression

Random forest regression is an ensemble method of regression that gives a significantly better performance than an individual decision tree. It goes with the simple logic of applying the ‘wisdom of the crowd’. It takes many different decision trees, constructed in a ‘random’ way, and then makes them vote. In regression, this vote is the average of the individual trees’ predictions.

Multiple regression trees are built on bootstrapped training samples, and each time a split is considered in a tree, a random sample of predictors is selected from the total number of predictors.

This means that when building a tree in the random forest, the algorithm is not even allowed to consider the entire set of predictors available. So, if we have one strong predictor and some moderately strong predictors, some of the trees in the random forest will be constructed without even considering the strong predictor, giving the other predictors a better chance.

This is essentially like introducing some de-correlation among the trees, thereby making the results more reliable.
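The two sources of randomness (bootstrapped rows and a random subset of predictors) can be sketched with a deliberately tiny forest, assuming NumPy. To keep the example short, every tree here is a stump with a single split, which is a simplification of this sketch; real random forests grow much deeper trees.

```python
import numpy as np


def fit_stump(X, y, feature_ids):
    """Best single split (by RSS), searching only the allowed features."""
    best = None
    for j in feature_ids:
        values = np.unique(X[:, j])
        for threshold in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= threshold
            left_mean, right_mean = y[left].mean(), y[~left].mean()
            rss = ((y[left] - left_mean) ** 2).sum() + ((y[~left] - right_mean) ** 2).sum()
            if best is None or rss < best[0]:
                best = (rss, j, threshold, left_mean, right_mean)
    return best[1:]


def fit_forest(X, y, n_trees=50, n_features=2, seed=0):
    """Each stump sees a bootstrap sample and a random subset of predictors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                        # bootstrap sample
        feats = rng.choice(p, size=n_features, replace=False)    # random predictors
        forest.append(fit_stump(X[rows], y[rows], feats))
    return forest


def predict_forest(forest, X):
    """Average the predictions of all the stumps."""
    preds = [np.where(X[:, j] <= t, lm, rm) for j, t, lm, rm in forest]
    return np.mean(preds, axis=0)


# Simulated data: predictor 0 is strong, predictor 1 moderate, predictor 2 is noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = 2.0 * (X[:, 0] > 0) + 1.0 * (X[:, 1] > 0.5) + rng.normal(scale=0.2, size=200)

forest = fit_forest(X, y)
preds = predict_forest(forest, X)
mse = np.mean((y - preds) ** 2)
used = {int(j) for j, *_ in forest}
print("in-sample MSE:", round(float(mse), 3), "features used:", sorted(used))
```

Because some stumps never get to see the strong predictor, the moderate predictor also contributes to the averaged prediction.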

Read this post if you want to learn more about random forests and how they can be used in trading.

Image representation of a Random forest regressor: Source

Support vector regression

Support Vector Regression (SVR) applies the principles of the support vector machine (SVM) to predict a continuous value rather than a class label. It attempts to find the hyperplane that has the maximum number of data points within its margin of tolerance. You can learn more about how support vector machines can be used in trading here.

Unlike other regression algorithms that attempt to minimize the error between the predicted and actual values of the response variable, the SVR tries to fit the hyperplane within a margin of tolerance (ε), which is used to create a pair of boundary lines. Errors smaller than ε are ignored.

The SVR uses different mathematical functions (kernels) to transform the input data, which are used to find a hyperplane in a higher-dimensional space. Some of the commonly used kernels are linear, polynomial, and radial basis function (RBF). The type of kernel to be used depends on the dataset.

SVR uses a symmetric loss function that penalizes both the higher and lower misestimates. The complexity of the SVR model makes it difficult to use on larger datasets. Therefore, the linear kernel function is used if we are working with a big dataset.
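The symmetric loss referred to here is commonly the ε-insensitive loss, which can be written in one line. The price levels and the tolerance below are hypothetical values for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tolerance band, linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical actual price of 10.0 and a tolerance (epsilon) of 0.5.
inside_band = epsilon_insensitive_loss(10.0, 10.3, 0.5)   # within the margin
over_estimate = epsilon_insensitive_loss(10.0, 11.0, 0.5)
under_estimate = epsilon_insensitive_loss(10.0, 9.0, 0.5)
print(inside_band, over_estimate, under_estimate)
```

Predictions inside the band incur no loss, while equal-sized errors above and below the actual value are penalized equally.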

The SVR is robust to outliers and has high predictive accuracy. You can read more about using SVR, linear, and polynomial regression models for stock market prediction here.

Image representation of Support vector regression: Source




In this blog, we have covered some important types of regression that are used in the financial world. Each comes with its own strengths and some challenges.

We hope you enjoyed reading about these and will go ahead and try some of them out to implement your ideas.


Until next time!

Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.
