Autoregression: Time Series, Models, Trading, Python and more


By José Carlos Gonzáles Tanaka and Chainika Thakar (Originally written by Satyapriya Chaudhari)

Autoregression is a powerful tool for anticipating future values in time-based data. This data, known as a time series, consists of observations collected at various timestamps, regularly or irregularly. By leveraging historical trends, patterns, and other hidden influences, autoregression models can forecast the value for the next time step.

These models (including various options beyond autoregression) predict future outcomes by analyzing and learning from past data. This article delves deeper into one particular type: the autoregression model, often abbreviated as the AR model.

Prerequisite Blogs

Before delving into this blog, it’s ideal to follow a structured learning track covering foundational to advanced topics.

Start with the basics in Introduction to Time Series and a comparative deep-learning perspective in the Time Series Vs LSTM Models.

Next, establish the essentials of Stationarity, the Hurst Exponent, and Mean Reversion to understand how and why time‐series data exhibit long‐term memory.

Once you’re comfortable with these, progress to advanced or multivariate methods, including Vector Autoregression (VAR), Johansen Cointegration, and Time-Varying-Parameter VAR.

This comprehensive roadmap equips you with the necessary background to fully appreciate this blog.

You are expected to know how to use these models to forecast time series. You should also have a basic understanding of R or Python for time series analysis.



What is Autoregression?

Autoregression models time-series data as a linear function of its past values. It assumes that the value of a variable today is a weighted sum of its previous values.

For example, analyzing the past month’s performance of Apple (ticker: AAPL) can help predict its future performance.


Formula of Autoregression

In simpler terms, first-order autoregression says: "Today's value depends on yesterday's value". We express this relationship mathematically using a formula:

$$y_t = c + \phi_1 y_{t-1} + \epsilon_t$$
Where,
• yt is the current value in the time series.
• c is a constant or intercept term.
• ϕ1 is the autoregressive coefficient.
• yt-1 is the previous value of the time series.
• ϵt is the error term representing the random fluctuations or unobserved factors.
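
To make the formula concrete, here is a minimal sketch that simulates an AR(1) series with NumPy. The function name simulate_ar1, the parameter values, and the Gaussian noise are illustrative assumptions, not part of the original article's code.

```python
import numpy as np


def simulate_ar1(c, phi, n, sigma=1.0, seed=0):
    """Simulate y_t = c + phi * y_{t-1} + eps_t with Gaussian noise."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)
    y = np.empty(n)
    # Start at the long-run mean c / (1 - phi), which exists when |phi| < 1.
    y[0] = c / (1.0 - phi)
    for t in range(1, n):
        y[t] = c + phi * y[t - 1] + eps[t]
    return y


y = simulate_ar1(c=1.0, phi=0.6, n=5000)
print(round(y.mean(), 2))  # should be close to c / (1 - phi) = 2.5
```

With |ϕ1| < 1, the simulated series fluctuates around its long-run mean c / (1 − ϕ1), which is 2.5 for the values assumed here.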

Autoregression Calculation

The autoregressive coefficient, ϕ1, is estimated using statistical methods like maximum likelihood estimation, Yule-Walker estimation, two-step regression estimation, and conditional least squares.

In the context of autoregressive (AR) models, the coefficients represent the weights assigned to the lagged values of the time series to predict the current value. These coefficients capture the relationship between the current observation and its past values.

The goal is to find the coefficients that best fit the historical data, allowing the model to capture the underlying patterns in the time series accurately. Once the coefficients are determined, they help forecast future values in the time series based on the observed values from previous time points. Hence, the autoregression calculation helps to create an autoregressive model for time series forecasting.
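
As an illustration of two of the estimation methods mentioned above, the sketch below simulates an AR(1) series and recovers its coefficient with conditional least squares (a regression of yt on yt-1) and with the Yule-Walker relation, which for an AR(1) reduces to the lag-1 autocorrelation. The data and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
true_c, true_phi, n = 0.5, 0.7, 5000

# Simulate an AR(1) series so there is something to estimate (hypothetical data).
y = np.zeros(n)
for t in range(1, n):
    y[t] = true_c + true_phi * y[t - 1] + rng.normal()

# Conditional least squares: regress y_t on a constant and y_{t-1}.
X = np.column_stack([np.ones(n - 1), y[:-1]])
c_hat, phi_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# Yule-Walker for AR(1): phi equals the lag-1 autocorrelation.
d = y - y.mean()
phi_yw = (d[1:] @ d[:-1]) / (d @ d)

print(round(c_hat, 2), round(phi_hat, 2), round(phi_yw, 2))
```

Both estimates should land near the coefficient used in the simulation; with shorter samples the two methods can differ more noticeably.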



Autoregression Model

Before delving into autoregression, it's beneficial to revisit the concept of a regression model.

A regression model is a statistical method to determine the association between a dependent variable (often denoted as y) and an independent variable (typically represented as X). Thus, in regression analysis, the focus is on understanding the relationship between these two variables.

For instance, consider having the stock prices of Bank of America (ticker: BAC) and J.P. Morgan (ticker: JPM).

If the objective is to forecast the stock price of JPM based on BAC's stock price, then JPM's stock price would be the dependent variable, y, while BAC's stock price would act as the independent variable, X. Assuming a linear association between X and y, the regression equation would be:

$$y=mX + c$$

Here,

m represents the slope, and c denotes the intercept of the equation.
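
A minimal sketch of this regression, using NumPy and synthetic numbers in place of real BAC and JPM prices (the slope of 3 and intercept of 10 are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the two price series (not real BAC/JPM data).
X = rng.uniform(20.0, 40.0, size=500)
y = 3.0 * X + 10.0 + rng.normal(0.0, 2.0, size=500)

# Fit y = m*X + c by ordinary least squares (a degree-1 polynomial fit).
m, c = np.polyfit(X, y, deg=1)
print(round(m, 2), round(c, 2))
```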

However, if you possess only one set of data, such as the stock prices of JPM, and wish to forecast its future values based on its past values, you can employ the autoregression model explained in the previous section.

Like linear regression, the autoregressive model presupposes a linear connection between yt and yt−1, termed autocorrelation. A deeper exploration of this concept will follow subsequently.


Autoregression Models of Order 2 and Generalise to Order p

Let's delve into autoregression models, starting with order 2 and then generalising to order p.

Autoregression Model of Order 2 (AR(2))

In an autoregression model of order 2 (AR(2)), the current value yt is predicted based on its two most recent lagged values, that is, yt-1 and yt-2.

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t$$
Where,
• c is a constant.
• ϕ1 and ϕ2 are the autoregressive coefficients for the first and second lags, respectively.
• ϵt represents the error term.

Generalising to order p (AR(p))

For an autoregression model of order p (AR(p)), the current value yt is predicted based on its p most recent lagged values.

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} +...+ \phi_p y_{t-p} + \epsilon_t$$
Where,
• c is a constant.
• ϕ1, ϕ2,..., ϕp are the autoregressive coefficients for the respective lagged terms yt-1,yt-2, ..., yt-p.
• ϵt represents the error term.

In essence, an AR(p) model considers the influence of the p previous observations on the current value. The choice of p depends on the specific time series data and is often determined using methods like information criteria or examination of autocorrelation and partial autocorrelation plots.

The higher the order p, the more complex the model becomes, capturing more historical information but also potentially becoming more prone to overfitting. Therefore, it's essential to strike a balance and select an appropriate p based on the data characteristics and model diagnostics.
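
One way to see what an AR(p) model does is to build the lagged regressors by hand and fit them with ordinary least squares. The helper fit_ar_ols below is a hypothetical name and the AR(2) data is simulated; in practice you would likely rely on a library such as statsmodels.

```python
import numpy as np


def fit_ar_ols(y, p):
    """Least-squares fit of y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p} + eps_t.

    Returns (c, phis), where phis[k-1] is the coefficient on lag k.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column k holds y_{t-k} for t = p, ..., n-1.
    lags = np.column_stack([y[p - k : n - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(n - p), lags])
    beta = np.linalg.lstsq(X, y[p:], rcond=None)[0]
    return beta[0], beta[1:]


# Simulated AR(2) with phi_1 = 0.5 and phi_2 = 0.3 (assumed values).
rng = np.random.default_rng(0)
series = np.zeros(5000)
for t in range(2, len(series)):
    series[t] = 0.5 * series[t - 1] + 0.3 * series[t - 2] + rng.normal()

c_hat, phis = fit_ar_ols(series, p=2)
print(round(c_hat, 2), phis.round(2))
```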


Autoregression vs Autocorrelation


Now, let us find the difference between autoregression and autocorrelation in a simplified manner below.

| Aspect | Autoregression | Autocorrelation |
|---|---|---|
| Modelling | Incorporates past observations to predict future values. | Describes the linear relationship between a variable and its lags. |
| Output | Model coefficients (lags) and forecasted values. | Correlation coefficients at various lags. |
| Diagnostics | ACF and PACF plots to determine model order. | ACF plot to visualise autocorrelation at different lags. |
| Applications | Stock price forecasting, weather prediction, etc. | Signal processing, econometrics, quality control, etc. |


Autoregression vs Linear Regression

Now, let us see the difference between autoregression and linear regression below.

| Aspect | Autoregression | Linear Regression |
|---|---|---|
| Model Type | Specifically for time series data where past values predict the future. | Generalised for any data with independent and dependent variables. |
| Predictors | Past values of the same variable (lags). | Independent variables can be diverse (not necessarily past values). |
| Purpose | Forecasting future values based on historical data. | Predicting an outcome based on one or more input variables. |
| Assumptions | Time series stationarity, no multicollinearity among lags. | Linearity, independence, homoscedasticity, no multicollinearity. |
| Diagnostics | ACF and PACF mainly. | Residual plots, Quantile-Quantile plots, etc. |
| Applications | Stock price prediction, economic forecasting, etc. | Marketing analytics, medical research, machine learning, etc. |


Autocorrelation Function and Partial Autocorrelation Function

Let's walk through how to create Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots using Python's statsmodels library and then interpret them with examples.

Step 1: Install Required Libraries

First, ensure you have the necessary libraries installed:

Step 2: Import Libraries

Step 3: Create Sample Time Series Data

Let's create a simple synthetic time series for demonstration:

Step 4: Plot ACF and PACF

Now, plot the ACF and PACF plots for the time series:

Output:

Auto Correlation Function and Partial Auto Correlation Function in autoregressive model

Interpretation

  • The ACF measures the correlation between a time series and its lagged values. A significant ACF at a given lag suggests that the value at that lag is related to today’s value.
  • The longer the ACF stays significant, the further back the past values that still influence today’s value. That is what this plot shows: the ACF decreases slowly and is still high at lag 40.
  • The PACF drops off after lag 1. A slowly decreasing ACF combined with a PACF that is significant only at lag 1 is a strong sign of a random-walk process, i.e., the time series is not stationary.
  • By examining the ACF and PACF plots and their significant lags, you can gain insights into the temporal dependencies within the time series and make informed decisions about model specification.
  • The example given is a price series following a random-walk process, i.e., it is not stationary.

Let’s see below how to estimate a stationary AR model.


Steps to Build an Autoregressive Model

Building an autoregressive model involves several steps to ensure that the model is appropriately specified, validated, and optimized for forecasting. Here are the steps to build an autoregressive model:

Step 1: Data Collection

  • Gather historical time series data for the variable of interest.
  • Ensure the data covers a sufficiently long period and is consistent in frequency (e.g., daily, monthly).

Step 2: Data Exploration and Visualisation

  • Plot the time series data to visualize trends, seasonality, and other patterns.
  • Check for outliers or missing values that may require preprocessing.

Step 3: Data Preprocessing

  • Handle missing values using appropriate methods such as interpolation or imputation.
  • Ensure the data is stationary, since autoregressive models assume stationarity. If it is not, difference or de-trend the data.

Step 4: Model Specification

  • Determine the appropriate lag order (p) based on the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.
  • Decide on including any exogenous variables or external predictors that may improve the model's forecasting ability.

Step 5: Model Estimation

  • Estimate the coefficients with one of the methods described above (e.g., maximum likelihood or conditional least squares). Almost all statistical packages can estimate an ARMA model.

Step 6: Forecasting

  • Split the data into training and test sets.
  • Fit the model on the training data.
  • Compute error metrics such as the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) on the test data to assess the model's predictive accuracy.

Step 7: Model Refinement

  • If the model performance is unsatisfactory for new data streams, consider returning to step 3.

Step 8: Documentation and Communication

  • Document the model's specifications, assumptions, and validation results.
  • Communicate the model's findings, limitations, and implications to stakeholders or end-users.

By following these steps systematically and iteratively refining the model as needed, you can develop a robust autoregressive model tailored to your time series data's specific characteristics and requirements.


Example of Autoregressive Model in Python for Trading

Below is a step-by-step example demonstrating how to build an autoregressive (AR) model for time series forecasting in trading using Python. We'll use historical stock price data for Apple (ticker: AAPL) and the statsmodels library to construct the AR model.

Let us now see the steps in Python below.

Step 1: Install Required Packages

If you haven't already, install the necessary Python packages:

Step 2: Import Libraries

Step 3: Load Historical Stock Price Data

A few notes:

  • We use Apple (AAPL) stock data from 2000 to January 2025.
  • We save the window size used as the train span for estimating the AR model as “rolling_window”.

Output:

AAPL Stock prices

Step 4: Find the Order of Integration of the price series

You need a stationary time series to estimate an AR model. So you first need to find the order of integration “d” of the price series, i.e., the number of times the data must be differenced to make it stationary. To find “d”, you can apply an Augmented Dickey-Fuller (ADF) test to the price series, its first difference, and its second difference (based on stylized facts, going up to the second difference is enough). See below:

We use the adfuller function provided in the statsmodels library and output its second result, the p-value. The null hypothesis of the test is that the series has a unit root (is non-stationary), so whenever the p-value is less than 5%, we reject the null and treat the time series as stationary.

Output:
(0.9987469346686696, 1.2195696223837154e-26, 0.0)
 

As we can see, the price, its first difference, and the second difference are non-stationary, stationary, and stationary, respectively. This price series needs to be first differenced to make it stationary. This makes us understand that the price has an order of integration 1, i.e., I(1).

So, to run an AR model, we need to estimate it on the first difference, which in the ARIMA class of statsmodels means d=1. Here we estimate a stationary AR(1) on the differenced series, i.e., an ARIMA(1,1,0), as described below.

Step 5: Train the AR model using ARIMA

Let us train the AR(1) model using the ARIMA method from the statsmodels library.

The ARIMA method can be imported as shown below

Using the ARIMA method, the autoregressive model can be trained as

ARIMA(data, order=(p, d, q))

The tuple is passed through the order keyword because the second positional argument of statsmodels.tsa.arima.model.ARIMA is exog, not order.

where

  • p is the AR parameter that needs to be defined.
  • d is the difference parameter. This will be zero in case we’re sure the time series is stationary, 1 in case the time series is I(1), 2 in case the time series is I(2), and so on. Since we found that our price series is I(1), we set d as 1.
  • q is the MA parameter. This will also be zero in the case of an AR model. You will learn about this later.

Hence, the autoregressive model can be trained as

ARIMA(data, order=(p, 1, 0))

Output:
ar.L1     0.01
sigma2    0.05
dtype: float64
 

From the output above, you can see that

  • \( \phi_1 = 0.01 \)
  • Variance of the residuals: \( \sigma^2 = 0.05 \) (reported as sigma2)

Therefore, the model becomes

$$\Delta y_t = 0.01\,\Delta y_{t-1} + \epsilon_t$$

where \( \Delta y_t = y_t - y_{t-1} \) is the first difference of the price. Remember that the AR model should have a stationary time series as input.

Let’s estimate an AR model for each day and forecast the next-day price. You can do it quickly using pandas.DataFrame.rolling.apply. Let’s create a function to estimate the model and return a forecast for the next day.

Let’s run the model for each day, using the rolling_window variable as the train span. Thus, the first rolling_window days will be NaN values.

The forecast for tomorrow is computed with data up to today and is therefore stored in today’s row. Consequently, we shift predicted_price by one day so that each forecast lines up with the price it predicts.

Step 6: Evaluate model performance

In this function, we compute the following for a specific year:

  • The Mean Absolute Error
  • The Mean Squared Error
  • The Root Mean Squared Error
  • The Mean Absolute Percentage Error
  • Plot of the actual and forecasted prices
  • Plot of the residuals
  • Plot of the ACF
  • Plot of the PACF
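
The four error metrics can be sketched with NumPy as below. The function name forecast_errors and the three-point example are hypothetical, and expressing the MAPE in percent is an assumption; the figures reported in this article come from the authors' original code, not from this sketch.

```python
import numpy as np


def forecast_errors(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (in percent) for two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    mse = np.mean(err ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAPE": 100.0 * np.mean(np.abs(err / actual)),
    }


metrics = forecast_errors([100.0, 102.0, 104.0], [101.0, 101.0, 106.0])
for name, value in metrics.items():
    print(f"The {name} is {value:.2f}")
```
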
Output:
The Mean Absolute Error is 2.63
The Mean Squared Error is 11.41
The Root Mean Squared Error is 3.38
The Mean Absolute Percentage Error is 1.74
 
Model performance

The first plot above shows that the predicted values are close to the observed value. However, the forecasted prices don’t exactly follow the actual prices.

Tip: Whenever you compare actual prices against forecasted prices, do not compare them for a big data span. People usually compare those prices, e.g., from 1990 to 2025. When you see those plots, you’ll tend to think the forecasted prices follow exactly the actual prices’ behavior. But that’s not a good way to go. If you want to compare them well, a zoom-in inspection will be needed, e.g., compare the two prices for a specific month if the data frequency is daily, and so on.

From the third and fourth plots above, you can see that the model captures most of the price behavior, because very few significant ACF and PACF values remain across the lags. To formally choose the correct model, you can follow the Box-Jenkins methodology and do it graphically each day, or you can do it algorithmically by selecting the best model with an information criterion, as described below.

**Note: You can log into quantra.quantinsti.com and enroll in the course on  Financial Time Series to find out the detailed autoregressive model in Python.**

Forecasting is a statistical process, so the variance of the forecast error will be greater than zero, i.e., the forecasted prices can deviate from the actual prices.

Here are some reasons why your autoregressive model can have poor performance:

  • Model Misspecification: The AR model's assumptions or specifications may not align with the true data-generating process, leading to biased forecasts.
  • Lag Selection: Incorrectly specifying the lag order in the AR model can result in misleading predictions. Including too many or too few lags may distort the model's predictive accuracy.
  • Missed Trends or Seasonality: The AR model may not adequately capture underlying trends, seasonality, or other temporal patterns in the data, leading to inaccurate predictions.
  • External Factors: Unaccounted external variables or events that influence the time series but are not included in the model can lead to discrepancies between predicted and actual prices.
  • Data Anomalies: Outliers, anomalies, or sudden shocks in the data that were not accounted for in the model can distort the predictions, especially if the model is sensitive to extreme values.
  • Stationarity Assumption: If the time series is not stationary, applying an AR model can produce unreliable forecasts. Stationarity is a key assumption for the validity of AR models.

Applications of Autoregression Model in Trading

Autoregression (AR) models have been applied in various ways within trading and finance. Here are some applications of autoregression in trading:

  • Price prediction: As previously shown, traders often use autoregressive models to analyze historical price data and identify patterns to forecast prices or price direction. This is the most common use case of AR models.
  • Risk Management: Autoregression can contribute to modelling and forecasting volatility in financial markets. To forecast variance, however, the AR model needs to be combined with a GARCH model; with both, you can do proper risk management.
  • Market Microstructure: Autoregression can be used to model the behavior of market disturbances, such as in high-frequency trading.

Common Challenges of Autoregression Models

The following are common challenges of the autoregression model:

  • Overfitting: Autoregressive models can become too complex and fit the noise in the data rather than the underlying trend or pattern. This can lead to poor out-of-sample performance and unreliable forecasts. That’s why a parsimonious model is the best choice for estimating AR models.
  • Stationarity: Many financial time series exhibit non-stationary behavior, meaning their statistical properties (like mean and variance) change over time. Autoregressive models assume stationarity, so failure to account for non-stationarity can result in inaccurate model estimates.
  • Model Specification: Determining an autoregressive model's appropriate lag order (p) is challenging. Too few lags might miss important information, while too many can introduce unnecessary complexity. A parsimonious model helps with this type of issue.
  • Seasonality and Periodicity: Autoregressive models might not capture seasonal patterns or other periodic effects in the data, leading to biased forecasts. You might need to de-seasonalize the data before you apply the AR model.

Tips for Optimizing Autoregressive Model Performance Algorithmically

Now, let us see some tips for optimizing the autoregressive model’s performance below.

  • Data Preprocessing: Ensure the data is stationary or apply techniques like differencing or de-trending to achieve stationarity before fitting the autoregressive model.
  • Model Selection: Usually, you apply the Box-Jenkins methodology to select the appropriate number of lags of the AR model. This methodology uses a graphical inspection of the ACF and PACF to derive the best model. In algorithmic trading, you can simply estimate multiple AR models and select the best one using an information criterion (e.g., the Akaike Information Criterion, AIC, or the Bayesian Information Criterion, BIC).
  • Include Exogenous Variables: AR models are usually estimated only with the lags of the time series. However, you can also incorporate relevant external factors or predictors that might improve the model's forecasting accuracy.
  • Continuous Monitoring and Updating: Financial markets and economic conditions evolve over time; such shifts are called regime changes. Regularly re-evaluate and update the model to incorporate new data and adapt to changing dynamics.

By addressing these challenges and following the optimization tips, practitioners can develop more robust and reliable autoregressive models for forecasting and decision-making in trading and finance.


Expanding on the AR Model

We have covered a lot of ground on autoregressive models. But what if we also include a lag of the error term? That is, we can write something like:

$$y_t = c + \phi_1y_{t-1} + \epsilon_t + \theta \epsilon_{t-1} $$

This is the so-called ARMA model; specifically, it’s an ARMA(1,1) model, because we have the first lag of the time series (the AR component) and the first lag of the model error (the MA component).

In case you want to:

  • Understand thoroughly what an ARMA/ARIMA model is.
  • Correctly identify the number of lags using the ACF and PACF graphically.
  • Learn how to estimate the ARMA model.
  • Learn how to choose the best number of lags for the AR and MA components.
  • Create a backtesting code using this model as a strategy.
  • Learn how to improve the model’s performance.

I would suggest reading the following 3 blog articles, where you’ll have everything you need to know about this type of model:


Conclusion

Utilizing time series modeling, specifically Autoregression (AR), offers insights into predicting future values based on historical data. We comprehensively covered the AR model, its formula, calculations, and applications in trading.

By understanding the nuances between autoregression, autocorrelation, and linear regression, traders can make informed decisions, optimize model performance, and navigate challenges in forecasting financial markets. Last but not least, continuous monitoring, model refinement, and incorporating domain knowledge are vital for enhancing predictive accuracy and adapting to dynamic market conditions.

You can learn more in our course on Financial Time Series Analysis for Trading, which covers the analysis of financial time series in detail.

With this course, you will learn the concepts of Time Series Analysis and how to implement them in live trading markets. Starting from basic AR and MA models to advanced models like SARIMA, ARCH, and GARCH, this course will help you learn it all. Also, after learning from this course, you can apply time series analysis to data exhibiting characteristics like seasonality and non-constant volatility.

Continue Learning

  1. Strengthen your grasp by looking into Autocorrelation & Autocovariance to see how data points relate over time, then deepen your knowledge with fundamental models such as Autoregression (AR), ARMA, ARIMA and ARFIMA
  2. If your goal is to discover alpha, you may want to experiment with a variety of techniques, such as technical analysis, trading risk management, pairs trading basics, and Market microstructure. By combining these approaches, you can develop and refine trading strategies that better adapt to market dynamics.
  3. For a structured approach to algo trading—and to master advanced statistics for quant strategies—consider the Executive Programme in Algorithmic Trading (EPAT). This rigorous course covers time series fundamentals (stationarity, ACF, PACF), advanced modelling (ARIMA, ARCH, GARCH), and practical Python‐based strategy building, providing the in‐depth skills needed to excel in today’s financial markets.

File in the download:

  • The Python code snippets for implementing the model, including installing the libraries, downloading the data, and creating the functions for model fitting and for evaluating forecasting performance.


Note: The original post was revamped on 11th Feb 2025 for recency and accuracy.


Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.
