# Autoregression: Model, Autocorrelation and Python Implementation

Time series modelling is a powerful tool for forecasting future values of time-based data. Time-based data is data observed at different timestamps (time intervals) and is called a time series. These time intervals can be regular or irregular. Based on the patterns and trends observed in the past data, a time series model predicts the value in the next time period.

A time series model is trained on past data to make future predictions. There are a number of time series models available to make these predictions. In this article you will learn about one such model, the Autoregression Model or the AR Model.

## Autoregression Model

Before you learn what autoregression is, let’s recall what a regression model is.

A regression model is a statistical technique to estimate the relationship between a dependent variable (y) and an independent variable (X). Thus, while working with a regression model, you deal with two variables.

For example, you have the stock prices of Bank of America (referred to as BAC) and the stock prices of J.P. Morgan (referred to as JPM).

• Now you want to predict the stock price of JPM based on the stock price of BAC.
• Here, the stock price of JPM would become the dependent variable, y and the stock price of BAC would be the independent variable X.
• Assuming that there is a linear relationship between X and y, the regression model estimating the relationship between them is of the form
$$y = mX + c$$

where m is the slope of the equation and c is the constant.
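As a concrete sketch of this idea, you could estimate the slope m and the constant c with ordinary least squares in numpy. The data here is made up for illustration; it only stands in for the BAC and JPM prices.

```python
import numpy as np

# Hypothetical example: estimate the slope m and constant c of y = m*X + c
# from noisy synthetic data (standing in for the BAC and JPM prices).
rng = np.random.default_rng(42)
X = rng.uniform(20, 40, size=200)                # synthetic "BAC" prices
y = 1.5 * X + 10 + rng.normal(0, 1, size=200)    # synthetic "JPM" prices

# Fit a degree-1 polynomial: np.polyfit returns [m, c]
m, c = np.polyfit(X, y, 1)
print(f"m = {m:.2f}, c = {c:.2f}")  # close to the true values 1.5 and 10
```

With enough observations, the estimates recover the true slope and constant up to the noise in the data.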

Now, let’s suppose you only have one series of data, say, the stock price of JPM. Instead of using a second time series (the stock price of BAC), the current value of the stock price of JPM will be estimated from its own past values. Let the observation at any point in time $t$ be denoted $y_t$. You can estimate the relationship between the value at time $t$, $y_t$, and the value at time $t-1$, $y_{t-1}$, using the regression model below.

$$\text{AR}(1): \quad y_t = \phi_1 y_{t-1} + c$$

where $\phi_1$ is the parameter of the model and c is the constant. This is the autoregression model of order 1. The term autoregression means regression of a variable against its own past values.
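The AR(1) recursion above is easy to simulate, and the parameters can then be recovered by regressing the series on its own lagged values. This is an illustrative sketch on simulated data, not the article's JPM example.

```python
import numpy as np

# Illustrative sketch: simulate an AR(1) process y_t = phi_1*y_{t-1} + c + noise
# and recover phi_1 and c by regressing y_t on y_{t-1}.
rng = np.random.default_rng(0)
phi_true, c_true, n = 0.8, 2.0, 2000

y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + c_true + rng.normal()

# Regress y_t on y_{t-1} (ordinary least squares via polyfit)
phi_hat, c_hat = np.polyfit(y[:-1], y[1:], 1)
print(f"phi_1 ~ {phi_hat:.2f}, c ~ {c_hat:.2f}")  # close to 0.8 and 2.0
```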

Like the linear regression model, the autoregression model assumes that there is a linear relationship between $y_t$ and $y_{t-1}$. This is called the autocorrelation. You will be learning more about this later.

Let us have a look at other orders of the autoregression model.

## Autoregression Models of Order p

In the above example, you saw how the value of $y_t$ is estimated using one past value (lag term), $y_{t-1}$. You can use more than one past term to predict the value at time $t$. Let’s see the form of the model when two lag values are used.

$$\text{AR}(2): \quad y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + c$$

where $\phi_1$ and $\phi_2$ are the parameters of the model and c is the constant. This is the autoregression model of order 2. Here $y_t$ is assumed to be correlated with its past two values and is predicted as a linear combination of them.

This model can be generalised to order p as below.

$$\text{AR}(p): \quad y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + c$$

where $\phi_i$ ($i = 1, 2, ..., p$) are the parameters of the model and c is the constant. This is the autoregression model of order p. Here $y_t$ is assumed to be correlated with its past p values and is predicted as a linear combination of them.
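The AR(p) prediction itself is just a weighted sum, as a small hypothetical helper makes clear (the function name and signature are inventions for illustration):

```python
# Hypothetical helper: one-step AR(p) prediction given coefficients phi_1..phi_p,
# a constant c, and the most recent observations (latest value last).
def ar_predict(phis, c, history):
    # y_t = c + phi_1*y_{t-1} + phi_2*y_{t-2} + ... + phi_p*y_{t-p}
    return c + sum(phi * y for phi, y in zip(phis, reversed(history)))

# AR(2) example with phi_1 = 0.5, phi_2 = 0.3, c = 1 and history [..., 2, 3]:
# y_t = 1 + 0.5*3 + 0.3*2 = 3.1
print(round(ar_predict([0.5, 0.3], 1.0, [2, 3]), 2))
```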

There are a number of ways to compute the parameters $\phi_i$.

Some commonly used techniques are the Ordinary Least Squares method and the Yule-Walker equations. Determining these coefficients mathematically is a little complex and is beyond the scope of this article. Later you will learn how to estimate these parameters using the statsmodels library in Python.

You must be wondering: how do you determine the order p of the autoregression model?

Remember, while learning about the AR(1) model, you came across the term autocorrelation. This is a very important concept in autoregression models and is also used to determine the order p. Let’s see how.

## Autocorrelation & Partial Autocorrelation

As you have already seen, an autoregression model predicts the current value based on past values. That means that the model assumes that the past values of the time series are affecting its current value. This is called the autocorrelation.

In other words, autocorrelation is nothing but a correlation coefficient. But the correlation here is not measured between two different variables. It is measured between a time series and its own lagged values over successive time intervals.

As the name suggests, autocorrelation means correlation with itself. Sometimes, autocorrelation is also referred to as serial correlation or lagged correlation.

Just like the correlation coefficient, autocorrelation also measures the degree of relationship between a variable’s current value and its past values. The value of autocorrelation lies between -1 and +1.

• -1 means a perfect negative autocorrelation,
• +1 means a perfect positive autocorrelation, and
• 0 means no autocorrelation.
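The definition above can be computed directly: the lag-k autocorrelation is simply the correlation coefficient between the series and a shifted copy of itself. A minimal sketch on simulated data:

```python
import numpy as np

# Sketch: compute the lag-k autocorrelation of a series "by hand" with numpy.
def autocorr(series, lag):
    series = np.asarray(series, dtype=float)
    # Correlate the series with itself shifted by `lag` observations
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=500))   # a persistent (random-walk) series
noise = rng.normal(size=500)             # white noise

print(f"lag-1 autocorrelation of the walk:  {autocorr(walk, 1):.2f}")   # near +1
print(f"lag-1 autocorrelation of the noise: {autocorr(noise, 1):.2f}")  # near 0
```

The persistent series has a lag-1 autocorrelation close to +1, while the purely random series has one close to 0, which is exactly what makes autocorrelation useful for measuring randomness.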

As autocorrelation measures the relationship between the current value and the past value of the data, it is very useful in measuring the randomness of data. This randomness in the data can be detected using the autocorrelation function (ACF) plot.


Let us now understand what partial autocorrelation is.

Partial autocorrelation is the conditional correlation between a variable and its lagged value. That means the partial autocorrelation between the current value of the time series, $y_t$, and its lagged value $y_{t-h}$ is the correlation between $y_t$ and $y_{t-h}$ conditional on all the lag terms in between, i.e. $y_{t-1}, y_{t-2}, ..., y_{t-h+1}$. This means that, unlike autocorrelation values, partial autocorrelation values control for the intermediate lags and remove their effect.

A partial autocorrelation function (PACF) plot is used to identify the order of the autoregression model.

Let us now move forward and explore the ACF plot and the PACF plot.

## Autocorrelation Function (ACF) Plot & Partial Autocorrelation Function (PACF) Plot

An autocorrelation function plot is the plot of the autocorrelation for different lagged values. $r_1$ measures the correlation between the variable and its first lagged value, i.e. $y_t$ and $y_{t-1}$. Similarly, $r_2$ measures the correlation between the variable and its second lagged value, i.e. $y_t$ and $y_{t-2}$. And so on.

An ACF plot plots the values of $r_0, r_1, r_2, ...$ against the respective lag orders 0, 1, 2, ... These values are plotted with a confidence band that helps us identify whether a value is statistically significant or not. The same has been illustrated while visualising the ACF plot in Python.

Note that $r_0$ is the correlation of the variable with itself, and hence will always be equal to 1.

The ACF plot is a good indicator of the randomness of the data. For non-random data, the value of the autocorrelation for at least one lagged term would be statistically significant (significantly non-zero). However, this is not the only measure of randomness. Zero autocorrelation for all the lagged terms does not necessarily mean random data and vice versa.

Just like an autocorrelation function plot, a partial autocorrelation function plot is the plot of the partial autocorrelation for different lagged terms. They are also plotted with a confidence band that helps us identify the significant lagged terms, which then becomes the order of the autoregression model.

You will next learn how to visualise the ACF and PACF plot in Python.

## Visualising ACF Plot and PACF Plot in Python

To visualise the plots, we will download the stock price data of J.P. Morgan from January 2019 to April 2020 using the yfinance library.

You can plot the ACF and PACF plots using the plot_acf and plot_pacf methods from the statsmodels library respectively.

From the above plot, you can see that the value of autocorrelation at lag 0 is 1 (as it is the correlation of the variable with itself). The blue region that you see is the confidence band and autocorrelation up to lag 20 lies outside this blue region.

This means that values up to lag 20 are statistically significant, that is, they affect the current price. Also, the autocorrelation gradually approaches zero as the lag increases. This means that the farther back we go, the weaker the correlation.

From the above plot you can see that lags 1, 2, 3, 4, etc. are outside the confidence band (blue region) and hence are statistically significant.

Finally, you will learn to estimate the parameters of the autoregression models. But before that, let’s see if there are any challenges.

## Limitations of Autoregression Models

A very important point to note is that an autoregression model assumes that the underlying data comes from a stationary process. A stationary time series is one whose statistical properties, such as mean and variance, are independent of the point in time at which it is observed.

As most real-life time series are non-stationary, AR models cannot be applied to them without first transforming the data.

## Python Implementation of Autoregression Models

As one can use the AR model only on stationary data, let’s first check whether the JPM stock price series is stationary or not. You can use the adfuller method from the statsmodels library to check this.

For the JPM price series, the output of the test is

p-value: 0.21

Since the p-value is greater than 0.05, the time series is not stationary. Let’s calculate the first-order difference of the series and test for stationarity again.

For the differenced series, the output of the test is

p-value: 0.00

Since the p-value is less than 0.05, the time series is stationary. You can now apply the autoregression model on the transformed series. But before that, let’s find the order of the AR model using the PACF plot on the transformed series.

From the above plot, the significant lags lie outside the confidence band (blue region). The plot suggests that we can fit an autoregression model of order 1 on the differenced series.

You can use the ARIMA method from the statsmodels library to fit the model.

Fitting the model produces a summary of the estimated parameters. From this output, you can see that the fitted model is

$$\text{AR}(1): \quad y_t = 113.42 + 0.99\,y_{t-1}$$

Similarly, you can replace the value of p in the code above and fit an AR model of a different order.

## Conclusion

Autoregression is one of the most commonly used tools in time series analysis. An autoregression model works on the principle that the value of a time series at any given point in time is related to its past values.

In this blog, you have learnt about the structure, the order and the limitations of an autoregression model. The statsmodels library in Python has a method called ARIMA, which can be used to fit an AR model with the proper parameters.

Another very commonly used model in time series analysis is the moving average model. Together with the moving average model, the autoregression model forms the Autoregressive Moving Average (ARMA) model.

Not only the ARMA model, but autoregression forms the foundation of many other time series models. To explore more about time series analysis and various time series models, check out our course on Financial Time Series Analysis for Trading.

Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.