
AutoRegressive Fractionally Integrated Moving Average (ARFIMA) model

16 min read

By José Carlos Gonzáles Tanaka

Usually, in algorithmic trading, we talk about AutoRegressive Integrated Moving Average (ARIMA) models. They’re very popular in the industry. You might remember that the “d” component in the model can be 0, 1, 2, etc. What if “d” could take fractional values? We’ll learn about such models, i.e. AutoRegressive Fractionally Integrated Moving Average (ARFIMA) models, here.

Let’s dive in and enjoy!


What is an ARFIMA model?

Do you remember an ARIMA(p,d,q) model? Let’s write its equation:

(1-L)^d y_t = c+\phi_1 y_{t-1}+\phi_2 y_{t-2}+...+\phi_p y_{t-p}+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+...+\theta_q\epsilon_{t-q}

Usually “d” is 0 when we model asset returns, 1 when we model asset prices, 2 when only the second differences of the original series are stationary, and so on.

An ARFIMA(p,d,q) is the same as an ARIMA(p,d,q). The only difference is that for the ARFIMA model, “d” can take float values between -1 and 1.


Purpose of the ARFIMA model

While learning about algo trading, you might have learned that, in order to apply an ML model or an econometric model such as an ARMA, GARCH, etc., it’s really important to convert the usually-non-stationary series into a stationary one by finding the integer order of integration and differencing the series “d” times.

Understanding the stationary data definition is crucial for effectively applying models like ARFIMA, ARMA, or GARCH. By ensuring your data is stationary, you lay the foundation for more accurate predictions and stronger models. Explore further to see how this concept can refine your analysis and enhance the reliability of your trading strategies.

Well, we say this because ML models, like statistical models in general, need to be used on data that has a constant mean, variance and autocovariance (with respect to time).

However, when we convert prices to returns to make them stationary, the returns do have the good statistical properties required as input for an ML model, but they lose what is called “memory”.

This is nothing else than the persistence of autocorrelation that we usually find in asset prices. Then, a researcher in the 80s came up with an interesting model.

Hosking (1981) generalized the ARIMA process by allowing the “d” value to be non-integer, i.e., he let the differencing degree take fractional values. This is what we call the ARFIMA model.

The model’s main purpose is to account for the persistence of memory that we find in economic and financial prices in levels, while estimating an ARMA model on them.

This means we can fit an ARMA-type model that incorporates the long-term persistence we usually lose when applying ARMA models to prices in first differences. This feature increases the forecasting power relative to simple ARIMA models.
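
To make “long-term persistence” precise, recall a standard result from Hosking (1981): for a stationary ARFIMA(0,d,0) process with 0 < d < 0.5, the autocorrelations decay hyperbolically,

\rho_k \sim C\,k^{2d-1}, \quad k \to \infty

whereas for a pure ARMA process they are bounded by a geometrically decaying term. Fractional integration is what lets the model keep memory at long lags.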

According to López de Prado (2018), there is scarce literature related to ARFIMA models. Applications in retail trading are minimal, too. Let’s try our hand at it and see what emerges.


The ARFIMA Model Specification

Let’s understand the model better with algebraic formulations as it’s usually done with the ARMA model. We follow López de Prado's (2018) book notation and formulas.

You learned about lag operators in the ARMA article. Let’s leverage that knowledge and use the lag operator here to explain the ARFIMA model.

Let’s reprise the mechanics of the lag operator. If X(t) is a time series, we have

L^k X_t = X_{t-k}, \text{ for } k \geq 0

For d = 2, expanding the polynomial gives:

(1-L)^2 = 1 - 2L + L^2 \quad\Rightarrow\quad (1-L)^2 X_t = X_t - 2X_{t-1} + X_{t-2}
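
As a quick numerical sanity check (an illustrative snippet of ours, not part of the original code), we can verify in Python that applying the expanded polynomial 1 - 2L + L^2 to a series gives exactly its second differences:

import numpy as np

# A toy price-like series
x = np.array([10., 12., 15., 11., 14., 18.])

# Second differences: apply (1 - L) twice
second_diff = np.diff(x, n=2)

# Same thing via the expanded polynomial 1 - 2L + L^2
expanded = x[2:] - 2 * x[1:-1] + x[:-2]

# Both give [ 1. -7.  7.  1.]
print(np.allclose(second_diff, expanded))  # True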

In order to explain the ARFIMA model lag coefficients in detail, let’s recall the binomial theorem: for any positive integer “n”,

(x+y)^n=\sum_{k=0}^{n}\begin{pmatrix} n \\ k \end{pmatrix}x^k y^{n-k}=\sum_{k=0}^{n}\begin{pmatrix} n \\ k \end{pmatrix}x^{n-k}y^k

Where

\begin{pmatrix} n \\ k \end{pmatrix}=\frac{n!}{k!(n-k)!}

Besides, for any real number “d”:

(1+x)^d=\sum_{k=0}^{\infty}\begin{pmatrix} d \\ k \end{pmatrix}x^k

Now, an ARFIMA(0,d,0) model can be described as:

(1-B)^d Y_t = \epsilon_t

With d float values between -1 and 1.

The polynomial

(1-B)^d

can be converted to a binomial series expansion as:

\begin{align}(1-B)^d = \sum_{k=0}^{\infty} {\begin{pmatrix} d \\ k \end{pmatrix}(-B)^k} &= \sum_{k=0}^{\infty} \frac{\prod_{i=0}^{k-1}(d-i)}{k!}(-B)^k\\ &=\sum_{k=0}^{\infty}(-B)^k\prod_{i=0}^{k-1}\frac{d-i}{k-i}\\ &=1-dB+\frac{d(d-1)}{2!}B^2-\frac{d(d-1)(d-2)}{3!}B^3+...\end{align}

Besides, the ARFIMA(0,d,0) model residuals \tilde{X}_t can be described as:

\tilde{X}_t = \sum_{k=0}^{\infty}\omega_k X_{t-k}

Where the coefficients (the weights) per each X, using all the above formulas, are described as

\omega = \left\{1,-d,\frac{d(d-1)}{2!},-\frac{d(d-1)(d-2)}{3!},...,(-1)^k\frac{\prod_{i=0}^{k-1}(d-i)}{k!},...\right\}

At this stage, you might want to quit this article.

I don’t understand these formulas!

Don’t worry! We have the following formula to iterate through each weight:

\omega_k = -\omega_{k-1}\frac{d-k+1}{k}

Where "k" represents the lag of price series. This is a nicer formula to create the weights, right? So whenever you have a specific “d”, you can use the above formula to create your ARFIMA residuals.

However, if you want to estimate an ARFIMA(p,d,q) model, then you need to estimate the parameters with an optimization algorithm.

Should I recreate the estimation from scratch?

No! There are great R libraries called “arfima” and “rugarch”. Let’s continue to the next section to learn more about this.


Estimation of an ARFIMA model in R

We’re going to estimate an ARFIMA(1,d,1) model and also create an ARFIMA-based trading strategy.

Let’s do it!

First, we install and import the necessary libraries:

install.packages('TTR')
install.packages('quantmod')
install.packages('stats')
install.packages('lubridate')
install.packages('dplyr')
install.packages('ggplot2')
install.packages('forecast')
install.packages('arfima')
install.packages('rugarch')
install.packages('openxlsx')
install.packages('parallel')
install.packages('R.utils')
library('TTR')
library('quantmod')
library('stats')
library('lubridate')
library('dplyr')
library('ggplot2')
library('forecast')
library('openxlsx')
library('arfima')
library('rugarch')
library('parallel')
library('R.utils')

Step 1: Import data

We import the Microsoft stock data from 1990 to 2023-09-09 and pass it into a dataframe.

# Set start and end dates
start = "1990-01-01"
end = "2023-09-09"
# Create an environment to save the data
df <- new.env()
# Import the data
getSymbols('MSFT',src='yahoo',env = df, from=start,to=end,auto.assign=TRUE)
# Create a dataframe to save the Microsoft stock adjusted close prices
data <- data.frame(coredata(fortify.zoo(Ad(df[['MSFT']]))))
# Change the column names
colnames(data) = c('date','MSFT')

Step 2: Estimate ARFIMA

We estimate an ARFIMA(1,d,1) with the “arfima” function provided by the “arfima” package.

# Estimate an ARFIMA(1,d,1) model
arfima_fit <- arfima(data$MSFT, order=c(1,0,1))
# Print the estimation summary
summary(arfima_fit)

Let’s show the summary

[Figure: ARFIMA(1,d,1) estimation summary]

In the “Mode 1 Coefficients” section, you will see the coefficients. In this case we estimated an ARFIMA(1,d,1) model. The phi estimate represents, as in the ARMA model, the autoregressive component. It’s significant.

The theta estimate represents the moving average component, which is also significant. The “d.f” value is the fractional integration parameter; its estimate is 0.49623, which is also significant. The fitted mean is the mean of the model-based in-sample prediction values. Sigma^2 is the variance of the residuals.

In the Numerical Correlations of Coefficients section, you see the correlation values between each parameter. In the Theoretical Correlations of Coefficients, you see the correlation obtained by the Quinn (2013) algorithm.

The theoretical correlations are the ones we should expect if the model is correctly specified. The last result, the Expected Fisher Information Matrix of Coefficients, is the covariance matrix associated with the maximum-likelihood estimates.

Don’t worry about these concepts. We only need to take care of the coefficients, their values and their statistical significance.

Step 3: Plotting

Let’s compute the residuals and plot them.

# Compute the ARFIMA residuals
data$resids <- residuals(arfima_fit)$Mode1
# Plot the arfima(1,d,1) residuals
ggplot(data=data, aes(x = date)) +
# Choose the residuals as the y-axis variable
geom_line(aes(y = resids, color="Residuals")) +
# Set the title, x and y labels
ggtitle("ARFIMA Residuals") + xlab("Date") + ylab("Residuals") + theme(plot.title = element_text(hjust = 0.5, size=30), legend.position="bottom", legend.text = element_text(size=20)) + scale_x_date(date_labels = "%b %y") +
theme(
axis.title.x = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
axis.text.y = element_text(size = 20),)
ggsave("arfim-1d1-residuals.png")
[Figure: ARFIMA(1,d,1) residuals]

The ARFIMA(1,d,1) model might not be the best model for the MSFT prices. The parameters p and q might be other numbers. That’s why we’re going to use the “autoarfima” function provided by the “rugarch” library.

We’re going to estimate a range of ARFIMA models, varying their orders, and pick the best one. In order to do that, we’re going to use all the CPU cores available on our computer.

That’s why we’re going to use the parallel package.

Parallel Package

Step 1: Find the number of cores

We’re going to use the total number of cores minus one, so the remaining core stays free to manage the whole operation.

# Set the number of cores to be used for parallelising the ARFIMA estimation
nCores <- detectCores() - 1

Step 2: Create clusters

We’re going to create a cluster of these cores with the following function.

# Make the cluster based on the previous number of cores
cluster <- makePSOCKcluster(nCores)

Step 3: Estimate ARFIMA models

Now we estimate several ARFIMA(p,d,q) models, varying their parameters, using only the first 1000 observations (approx. 4 years). There are many inputs for the autoarfima function:

  • We choose to estimate the ARFIMA models with p and q values ranging from 0 to 5.
  • To select the best model, we use the BIC.
  • We select the “partial” method, in the sense that, e.g., if we estimate an ARFIMA(2,d,0), then we only estimate a single model with 2 lags in the AR component. If we selected the “full” method, then for the ARFIMA(2,d,0) we would estimate 3 models: one with only the first lag, one with only the second lag, and one with both lags.
  • If we set the arfima input to FALSE, then we would estimate an ARMA model, so we set it to TRUE.
  • We set include.mean to FALSE, so no mean is included for the series.
  • We use the cluster created above to parallelize the estimation and gain speed with it.
  • We use the normal distribution setting distribution.model to “norm”.
  • Estimate the models with the general nonlinear programming algorithm setting the solver to “solnp”. You can also choose “hybrid” so the estimation is done with all the possible algorithms.
  • Set return.all to FALSE because we don’t want to return all the fitted models, only the best one selected with the BIC.
# Estimate the ARFIMA models and choose the best with the Bayesian information criteria
arfima_best_fit <- autoarfima(data$MSFT[1:1000], ar.max = 5, ma.max = 5, criterion = "BIC", method = "partial",
arfima = TRUE, include.mean = FALSE, cluster = cluster,
distribution.model = "norm", solver = "solnp", return.all = FALSE)

Let’s write the code to plot the ARFIMA residuals and then show it

# Obtain the ARFIMA residuals
residuals <- data.frame(date = data$date[1:1000], residuals = arfima_best_fit$fit@fit['residuals']$residuals)
# Plot the best ARFIMA model residuals
ggplot(data=residuals, aes(x = date)) +
# Choose the residuals as the y-axis variable
geom_line(aes(y = residuals, color="Residuals")) +
# Set the title, x and y labels
ggtitle("Best ARFIMA model Residuals") + xlab("Date") + ylab("Residuals") + theme(plot.title = element_text(hjust = 0.5, size=30), legend.position="bottom", legend.text = element_text(size=20)) + scale_x_date(date_labels = "%b %y") +
theme(
axis.title.x = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
axis.text.y = element_text(size = 20),)
ggsave("best-arfima-residuals.png")
[Figure: Best ARFIMA model residuals]

Don't worry about the first value; it's actually the first asset price value. You can take the residuals from row 2 onwards in case you want to do something else with them.

Estimating the ARFIMA model in Python

Up to now, there’s no Python library for estimating an ARFIMA model. So what do you do?

There are several workarounds. Here we’re going to build our own: we’ll run the estimation in R and drive it from Python, without using any Python ARFIMA library.

First, you’ll have two files

A Python file with the following code:

  1. Import libraries
  2. Import data from yahoo finance
  3. Save the data into an xlsx file with the name “data_to_estimate_arfima”
  4. Call the R script called “Estimate_ARFIMA.R”.
  5. Import the dataframe that results from the R file as “df”
  6. Plot the residuals from “df”

An R script with the following code:

  1. Import libraries
  2. Set the working directory the same as the script’s folder
  3. Import the data called “data_to_estimate_arfima.xlsx” and save it in data
  4. Estimate the best ARFIMA model with the autoarfima function from the “rugarch” package.
  5. Create another dataframe called “data2” to save the dates and the residuals
  6. Save “data2” into an Excel file named “arfima_residuals_R.xlsx”

The whole procedure will be based on the Python file steps. In Python step 4, we’re going to use the R script. Once the R script finishes running, then we continue with Python step 5 and onwards.

Let’s present the R script file called “Estimate_ARFIMA.R”. It uses a lot of the code learned above.

library('openxlsx')
library('rugarch')
library('parallel')
# Set the working directory. Change the address as per where your folder is located in your PC
setwd("/home/josgt/Documents/Quantinsti/EPAT/Blog Articles/Article 004")
data <- read.xlsx(xlsxFile = "data_to_estimate_arfima.xlsx", sheet = 1, detectDates = TRUE)
data$Date <- as.Date(data$Date, origin = "1899-12-30")
# Set the number of cores to be used to parallelize the ARFIMA estimation
nCores <- detectCores() - 1
# Make the cluster based on the previous number of cores
cluster <- makePSOCKcluster(nCores)
# Estimate the ARFIMA models and choose the best with the Bayesian information criteria
arfima_best_fit <- autoarfima(data$Close, ar.max = 5, ma.max = 5, criterion = "BIC", method = "partial",
arfima = TRUE, include.mean = FALSE, cluster = cluster,
distribution.model = "norm", solver = "solnp", return.all = FALSE)
# Stop the CPU multithreading
stopCluster(cluster)
data2 <- data.frame(date = data$Date, residuals=arfima_best_fit$fit@fit['residuals']$residuals)
write.xlsx(data2, file = "arfima_residuals_R.xlsx", colNames = TRUE)

Let’s now go through each step in the jupyter notebook

Step 1: Import libraries

First, we import the necessary libraries

# To use the R script
import subprocess
# For data manipulation
import pandas as pd
import yfinance as yf
# For data visualisation
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')

Step 2: Getting the data

Then we download Apple data from Yahoo Finance for the years 2020 to September 2023 and save it in an Excel file

# Import the data from yahoo finance
data = yf.download('AAPL',start='2020-01-01', end='2023-09-15', auto_adjust=True)
# Convert the index type to datetime
data.index = pd.to_datetime(data.index)
# Save the data into an excel file
data.to_excel('data_to_estimate_arfima.xlsx')

Then, we call the subprocess module and use the “run” function. It has two inputs. The second one is the R script, which should be located in the same folder as the jupyter notebook.

And the first input is the “Rscript” application from the R programming language, which allows you to make this whole process happen. I'm using Linux, so you just need to put "Rscript".

In the case of Windows, you'll need to specify the "Rscript" address. The address “C:\Program Files\R\R-4.2.2\bin” is from my personal computer. Search your own computer for where Rscript is located.

# Call the R script and run it
subprocess.run(['Rscript',"Estimate_ARFIMA.R"])
# Call the R script in Windows. Change the address as per where your Rscript is located on your PC
#address = 'C:/Program Files/R/R-4.2.2/bin/'
#subprocess.run([address+'Rscript',"Estimate_ARFIMA.R"])

Voilà! There you go! You have run an R script without opening any R IDE. Just using a jupyter notebook!

Now, let’s plot the residuals. First, we import the created dataframe. It will be saved in the same working directory. We drop the first row because it can be ignored.

# Import the residuals dataframe obtained from the R script
df = pd.read_excel('arfima_residuals_R.xlsx', index_col=0)
df = df[1:]

Now, let’s plot the residuals

# Set the figure size
plt.figure(figsize=(15,7))
# Plot the residuals
plt.plot(df.index, df['residuals'], label = "Best ARFIMA Residuals")
# Set the title of the graph
plt.title('Residuals from the Best ARFIMA model', fontsize=16)
# Set the x- and y- axis labels and ticks sizes
plt.xlabel('Year-Month', fontsize=15)
plt.ylabel('Residuals', fontsize=15)
plt.tick_params(axis='both', labelsize=15)
# Set the plot legend location
plt.legend(loc=2, prop={'size': 15}, bbox_to_anchor=(0,1))
plt.show()
[Figure: Best ARFIMA residuals obtained with Python and R]

An ARMA-based vs an ARFIMA-based model strategy performance comparison

We’re going to compare an ARMA-based and an ARFIMA-based model trading strategy to see which one performs better!

We’re going to use the Apple price time series from 1990 to 2023-09-09.

Step 1: Import libraries

We'll install and import the necessary libraries.

install.packages('TTR')
install.packages('quantmod')
install.packages('stats')
install.packages('lubridate')
install.packages('dplyr')
install.packages('ggplot2')
install.packages('forecast')
install.packages('arfima')
install.packages('rugarch')
install.packages('parallel')
install.packages('R.utils')
library('TTR')
library('quantmod')
library('stats')
library('lubridate')
library('dplyr')
library('ggplot2')
library('forecast')
library('openxlsx')
library('rugarch')
library('parallel')
library('R.utils')

Step 2: Downloading data

Download the data and create the adjusted close price returns.

# Set start and end dates
start = "1990-01-01"
end = "2023-09-09"
df <- new.env()
# Import the data
getSymbols('AAPL',src='yahoo',env = df, from=start,to=end,auto.assign=TRUE)
# Create a dataframe to save the Apple stock adjusted close prices
data <- data.frame(coredata(fortify.zoo(Ad(df[['AAPL']]))))
colnames(data) = c('date','AAPL')
# Compute the log returns for each asset
data$returns <- c(0,log(data$AAPL[-1]/data$AAPL[-length(data$AAPL)]))

Step 3: Estimating the ARFIMA model

Create a “df_forecasts” dataframe in which we will save the ARFIMA residuals and the trading strategy signals. We choose (arbitrarily) to estimate the ARFIMA model using a span of 1500 observations (6 years).

# Group the data dates by year and month
options(dplyr.summarise.inform = FALSE)
dates <- data %>%
mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
group_by(year, month) %>% summarise(first_dates = first(date))
# Get the first date of Jan-2019
initial_date = subset(dates, (dates$year=='2019') & (dates$month=='01'))$first_dates
# Set the initial iloc for the forecast data
initial_iloc_to_forecast <- which(data$date==as.Date(initial_date))
# Subset the data dataframe to only our forecast data
df_forecasts <- subset(data[c('date', 'AAPL', 'returns')], data$date>=as.Date(initial_date))
# Create the signal column
df_forecasts$signal <- 0.0
# Select 6 years of data
span <- 1500

Step 4: Estimating the ARFIMA model each day

Create a loop to estimate the ARFIMA model each day. We make use of CPU-multithreading as we did before. We go long whenever the forecasted price is higher than the last historical price. We choose the best model based on the BIC.

Before the for loop, we present 2 functions. The first one estimates the ARFIMA model. The second one is a wrapper function which allows us to stop the estimation whenever it takes more than 10 minutes. There could be some cases where the estimation doesn't converge, or takes too long to converge. In those cases, we can stop the estimation with this wrapper function.

# Set the number of cores to be used to parallelize the ARFIMA estimation
nCores <- detectCores() - 1
# A function to estimate the ARFIMA model
arfima_func <- function(data) {
# Estimate the ARFIMA model
arfima_fit <- autoarfima(data, ar.max = 5, ma.max = 5, criterion = "BIC", method = "partial",
arfima = TRUE, include.mean = TRUE, cluster = cluster,
distribution.model = "norm", solver = "hybrid", return.all = FALSE)
return(arfima_fit)
}
# A function to stop running the ARFIMA estimation in case it takes more than 10 minutes
myWrapperFunction <- function(data) {
result = tryCatch(expr = withTimeout(arfima_func(data), timeout = 10*60),
TimeoutException = function(ex) "TimedOut")
return(result)
}
# The for loop to estimate the ARFIMA model each day
for (i in (initial_iloc_to_forecast):nrow(data)) {
start_time <- Sys.time()
# Select the corresponding 6-year previous data to train the ARFIMA model
#data_sample <- subset(data, !(data$date %in% dates_to_be_removed))
data_sample <- data$AAPL[((i-1)-(span-1)):(i-1)]
# Print some days info
print(paste0(strrep('=',50)))
print(paste0("Forecast ",(i+1-initial_iloc_to_forecast), ' out of ', length(df_forecasts$date)))
# Make the cluster based on the previous number of cores
cluster <- makePSOCKcluster(nCores)
# Estimate the ARFIMA model for this day
arfima_fit <- myWrapperFunction(data_sample)
# Stop the CPU multithreading
stopCluster(cluster)
if (typeof(arfima_fit)!="list") {
# Print a statement saying the estimation took too long
print(paste0("Estimation took longer than expected. Today's signal will be 0."))
# Signal will be 0 this day
df_forecasts[(i+1-initial_iloc_to_forecast),'signal'] <- 0.0
# Save the df_forecasts in an excel file
write.xlsx(df_forecasts, 'df_forecasts_in_R_v01.xlsx')
# Print some important things for log
print(paste0("Date is ",df_forecasts[(i+1-initial_iloc_to_forecast),]$date))
print(paste0("Signal is ",df_forecasts[(i+1-initial_iloc_to_forecast),'signal']))
print(paste0("Start time is ",start_time))
end_time <- Sys.time()
print(paste0("End time is ",end_time))
#calculate time difference in minutes
diff <- difftime(end_time, start_time, units="mins")
print(paste0("Time passed is ",round(diff,2)," minutes"))
} else {
# Obtain the data sample Apple last price
last_apple_price <- tail(data_sample,1)[[1]]
# Forecast the next day Apple stock close price forecast
forecast <- arfimaforecast(arfima_fit$fit, data=data_sample, n.ahead=1)@forecast$seriesFor[1]
# Go long if the forecast is higher than the previous-day stock close price
df_forecasts[(i+1-initial_iloc_to_forecast),'signal'] <- if (forecast>=last_apple_price) 1 else 0
# Save the df_forecasts in an excel file
write.xlsx(df_forecasts, 'df_forecasts_in_R_v01.xlsx')
print(paste0("Date is ",df_forecasts[(i+1-initial_iloc_to_forecast),]$date))
print(paste0('Price forecast is ',forecast))
print(paste0('Last AAPL price is ',last_apple_price))
print(paste0("Signal is ",df_forecasts[(i+1-initial_iloc_to_forecast),'signal']))
print(paste0("Start time is ",start_time))
end_time <- Sys.time()
print(paste0("End time is ",end_time))
#calculate time difference in minutes
diff <- difftime(end_time, start_time, units="mins")
print(paste0("Time passed is ",round(diff,2)," minutes"))
}
}

Step 5: Creating signals

Estimate the ARMA model and create signals based on the best model chosen by the BIC, too. We also use a 1500-observation span. We keep using the df_forecasts dataframe from before to save the signals.

# Create ARMA model signal column
df_forecasts$arma_signal <- 0.0
# The for loop to estimate the ARMA model each day
for (i in initial_iloc_to_forecast:nrow(data)) {
start_time <- Sys.time()
# Select the corresponding 6-year previous data to train the ARMA model
data_sample <- data$returns[((i-1)-(span-1)):(i-1)]
# Estimate the models
arimaFit = auto.arima(
data_sample,
max.p = 5,
max.q = 5,
stationary = TRUE,
seasonal = FALSE,
ic = c("bic"),
stepwise = FALSE,
allowdrift = FALSE,
allowmean = FALSE,
parallel = TRUE,
num.cores = 1
)
# Forecast one step ahead
forecast1 = forecast(arimaFit,1)$mean[1]
# Go long if the forecasted return is positive
df_forecasts[(i+1-initial_iloc_to_forecast),'arma_signal'] <- if (forecast1>0) 1 else 0
}

Step 6: Cumulative returns

Create the ARFIMA-based and ARMA-based cumulative returns.

# Compute the ARFIMA-based strategy returns
df_forecasts$stra_returns <- df_forecasts$returns * df_forecasts$signal
# Create the ARFIMA-based strategy cumulative returns
df_forecasts$stra_cum_returns <- exp(cumsum(df_forecasts$stra_returns))
# Compute the ARMA-based strategy returns
df_forecasts$arma_stra_returns <- df_forecasts$returns * df_forecasts$arma_signal
# Create the ARMA-based strategy cumulative returns
df_forecasts$arma_stra_cum_returns <- exp(cumsum(df_forecasts$arma_stra_returns))
# Create the buy and hold strategy cumulative returns
df_forecasts$bnh_cum_returns <- exp(cumsum(df_forecasts$returns))
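
A quick note on why exponentiating the cumulated log returns gives the equity curve: the log returns telescope, so

\exp\left(\sum_{t=1}^{T}\log\frac{P_t}{P_{t-1}}\right)=\frac{P_T}{P_0}

which is the gross cumulative return (for the strategy versions, each log return is first multiplied by its signal).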

Let’s write the code to plot both strategies’ cumulative returns together with the buy-and-hold strategy.

# Plot both the buy-and-hold and the strategy cumulative returns
ggplot(data=df_forecasts, aes(x = df_forecasts$date)) +
# Choose the ARFIMA-based strategy cumulative returns as the y-axis variable
geom_line(aes(y = df_forecasts$stra_cum_returns, color="ARFIMA-based Strategy Cumulative Returns")) +
# Choose the ARMA-based strategy cumulative returns as the second y-axis variable
geom_line(aes(y = df_forecasts$arma_stra_cum_returns,color="ARMA-based Strategy Cumulative Returns")) +
# Choose the buy-and-hold benchmark cumulative returns as the third y-axis variable
geom_line(aes(y = df_forecasts$bnh_cum_returns,color="B&H Cumulative Returns")) +
# Set the title, x and y labels
ggtitle("Buy and Hold and Strategy Cumulative Returns") + xlab("Date") + ylab("Returns") +
# Set the title and legends labels and sizes
theme(plot.title = element_text(hjust = 0.5, size=30), legend.position="bottom", legend.text = element_text(size=20)) +
# Set the x and y labels and tick sizes
scale_x_date(date_labels = "%b %y") +
theme(
axis.title.x = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
axis.text.y = element_text(size = 20),)
[Figure: ARFIMA-based vs ARMA-based vs benchmark cumulative returns]

In terms of the equity curves’ final values, the buy-and-hold strategy performs the best of the three.

Let’s compute the statistics of each strategy:

Statistic            Buy and Hold    ARFIMA model    ARMA model
Annual Return        39.18%          27.98%          23.21%
Cumulative Returns   369.60%         217.18%         165.51%
Annual Volatility    33.02%          26.38%          20.12%
Sharpe Ratio         1.17            1.07            1.14
Calmar Ratio         1.25            0.71            1.37
Max Drawdown         -31.43%         -39.22%         -16.97%
Sortino Ratio        1.72            1.67            1.81

According to the table, the buy-and-hold strategy delivers the highest annual and cumulative returns and the highest Sharpe ratio. The ARFIMA strategy, for its part, has lower volatility than buy-and-hold. The ARMA strategy beats buy-and-hold on every remaining statistic: it has lower volatility, a smaller maximum drawdown, and higher Calmar and Sortino ratios.

Besides, the ARMA strategy is also superior to the ARFIMA strategy on every statistic except the annual and cumulative returns, where the latter performs better than the former.
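
If you want to reproduce statistics like these yourself, here’s a minimal Python sketch. It assumes you exported df_forecasts (including the signal and arma_signal columns) to df_forecasts_in_R_v01.xlsx after Step 5, uses a zero risk-free rate, and annualises with 252 trading days; the exact figures can differ slightly from the table depending on the conventions of the library you use (e.g., pyfolio).

import numpy as np
import pandas as pd

# Load the signals and log returns saved from R (assumed to include arma_signal)
df = pd.read_excel('df_forecasts_in_R_v01.xlsx')

def summary_stats(log_returns, periods=252):
    # Annualised return from the mean daily log return
    annual_return = np.exp(log_returns.mean() * periods) - 1
    # Annualised volatility
    annual_vol = log_returns.std() * np.sqrt(periods)
    # Sharpe ratio with a zero risk-free rate
    sharpe = log_returns.mean() / log_returns.std() * np.sqrt(periods)
    # Maximum drawdown of the equity curve
    equity = np.exp(log_returns.cumsum())
    max_dd = (equity / equity.cummax() - 1).min()
    return pd.Series({'Annual Return': annual_return, 'Annual Volatility': annual_vol,
                      'Sharpe Ratio': sharpe, 'Max Drawdown': max_dd})

print(summary_stats(df['returns'] * df['signal']))       # ARFIMA-based strategy
print(summary_stats(df['returns'] * df['arma_signal']))  # ARMA-based strategy
print(summary_stats(df['returns']))                      # Buy and hold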

Some considerations are to be taken into account. We didn’t:

  • Incorporate slippage and commissions.
  • Incorporate a risk-management process.
  • Optimize the span
  • Use other information criteria such as Akaike, HQ, etc.

Suggestion by Lopez de Prado

Having stationarity and memory preservation at the same time

ARFIMA model residuals might not always turn out to be a stationary time series. If the ARFIMA estimation gives a “d” within the range [-0.5, 0.5], then the ARFIMA model will be stationary; otherwise, the model won’t be stationary.

So, even though the ARFIMA model captures the long memory of the price series, the model won’t necessarily provide stationary residuals.

López de Prado suggests a way to ensure a stationary process while preserving the long memory of the price series.

How?
Well, instead of estimating the ARFIMA model, we calibrate the “d” parameter with an ADF test in order to find the best “d” that makes the ARFIMA residuals stationary while also keeping the memory persistence of the asset prices. To evaluate the memory preservation, we compute the correlation between the asset prices and their ARFIMA-based residuals.

Step 1: Import libraries

Let’s show the code. First we import the respective libraries

# For fetching data
import yfinance as yf
# For data manipulation
import pandas as pd
import numpy as np
# For time series analysis
from statsmodels.tsa.stattools import adfuller
# For data visualisation
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')
# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

Step 2: Import data

Then, we import the Apple price data from 2001 to September 2023.

# Download the adjusted close price data
data = yf.download('AAPL', start='2001-01-01', end='2023-09-15',
auto_adjust=True, threads=True)

Step 3: Functions

We use the following two functions, taken from López de Prado’s book with slight modifications.

# Function to get the ARFIMA weights
def getWeights_FFD(d, thres, lim):
    # Set w as a list and k as one
    w, k = [1.], 1
    ctr = 0
    while True:
        # Create the new weight value
        w_ = -w[-1] / k * (d - k + 1)
        # End the loop in case the threshold is breached
        if abs(w_) < thres:
            break
        # Append the new value of w
        w.append(w_)
        # Update k and ctr
        k += 1
        ctr += 1
        # End the loop in case it breaches the limit
        if ctr == lim - 1:
            break
    # Convert the w from list to a numpy array
    w = np.array(w[::-1]).reshape(-1, 1)
    return w

def fracDiff_FFD(series, d, thres=1e-5, w=None):
    """
    Constant width window (new solution)
    Note 1: thres determines the cut-off weight for the window
    Note 2: d can be any positive fractional, not necessarily bounded [0,1].
    """
    # 1) Compute weights for the longest series
    # Get the length of the series
    length = len(series)
    # In case we don't have the weights, estimate them with the previous function
    if w is None:
        w = getWeights_FFD(d, thres, length)
    # Set this variable to define the span of the ARFIMA residuals
    width = len(w) - 1
    # 2) Apply weights to values
    df = {}
    # Create a forward-filled seriesF and an empty series df_
    seriesF, df_ = series.fillna(method='ffill').dropna(), pd.Series()
    # Loop through each observation of the seriesF
    for iloc1 in range(width, len(seriesF.index)):
        # Get the locs that are the first and last locs for the ARFIMA residuals
        loc0, loc1 = seriesF.index[iloc1 - width], seriesF.index[iloc1]
        # Exclude NAs
        if not np.isfinite(series.loc[loc1]):
            continue
        # Compute the ARFIMA residual for loc1
        df_[loc1] = np.dot(w.T, seriesF.loc[loc0:loc1])[0]
    # Create the ARFIMA residual of the respective OHLC dataframe
    df['Close'] = df_.copy(deep=True)
    # Convert the series into a dataframe
    df = pd.concat(df, axis=1)
    return df

The first function is to compute the weights described above.

  • We use as inputs the chosen “d” and a threshold used to truncate the number of weights.
  • As you go further back into the past, the weights become tiny numbers; in order to avoid such tiny numbers, we truncate the number of weights.
  • The “lim” input is a limit on the number of weights, acting as another truncation device to select a specific number of weights.

The second function is to compute the ARFIMA residuals based on the chosen “d”.

  • The first input is the price series; then you input the chosen “d”, next the threshold described above, and the last input is the weights array (see the quick usage check after this list).
  • In case you input the weights array, this second function will use it; otherwise, the function will use the first one to compute the weights.
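
As a quick usage check of both functions (an illustrative snippet of ours, reusing the data dataframe from Step 2), you can inspect how many weights survive the truncation for a given “d” and then fractionally differentiate the log prices with them:

# Compute the weights for d = 0.4 with a 1e-5 truncation threshold
w = getWeights_FFD(0.4, 1e-5, len(data))
print(len(w))          # number of weights kept after truncation
print(w[-4:].ravel())  # most recent weights; the last one is omega_0 = 1
# Fractionally differentiate the log prices reusing those weights
fd_series = fracDiff_FFD(np.log(data['Close']), 0.4, w=w)
print(fd_series.head())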

The next function is also provided by López de Prado’s book.

  • It’s used to compute a dataframe in which, for a range of “d” values, we store the ADF test statistic, the p-value, the number of autoregressive lags used in the ADF equation, the number of observations in the ARFIMA residuals, the 95% critical value and the correlation.
def d_estimates_db(DF, minimum, maximum):
    # Create the output dataframe
    out = pd.DataFrame(columns=['adfStat', 'pVal', 'lags', 'nObs', '95% conf', 'corr'])
    # Copy the dataframe in order to not change it
    df0 = DF.copy()
    # Create a range of possible d values and loop through them
    for d in np.linspace(minimum, maximum, 11):
        # Create the log of close prices
        df1 = np.log(df0['Close'])
        # Estimate the ARFIMA residuals
        df2 = fracDiff_FFD(df1, d, thres=1e-5)
        # Compute the correlation between the log prices and the ARFIMA residuals
        corr = np.corrcoef(df1.loc[df2.index], df2['Close'])[0, 1]
        # Estimate a unit root test on the ARFIMA residuals
        df2 = adfuller(df2['Close'], regression='c', autolag='AIC')
        # Save all the results in the out dataframe
        out.loc[d] = list(df2[:4]) + [df2[4]['5%']] + [corr]  # with critical value
    # Return the out dataframe
    return out

Step 4: Applying the functions

Let’s use all these functions. We apply the last function to the Apple prices. See the results:

# Get d estimation summary
out = d_estimates_db(data,0,1)
d      adfStat   pVal   lags   nObs   95% conf   corr
0.0    -0.80     0.82   15     5695   -2.86      1.00
0.1    -1.30     0.63   9      1626   -2.86      0.99
0.2    -1.80     0.38   9      2320   -2.86      0.95
0.3    -2.79     0.06   15     3421   -2.86      0.86
0.4    -4.17     0.00   16     4237   -2.86      0.75
0.5    -6.06     0.00   15     4769   -2.86      0.56
0.6    -7.80     0.00   15     5106   -2.86      0.47
0.7    -7.36     0.00   27     5312   -2.86      0.37
0.8    -8.54     0.00   27     5456   -2.86      0.23
0.9    -13.27    0.00   19     5567   -2.86      0.12
1.0    -19.71    0.00   14     5695   -2.86      0.00

As you can see, we get the ARFIMA residuals’ statistics for a range of 11 “d” values.

As per López de Prado, the best d is chosen from the ARFIMA model whose residuals are stationary while also preserving memory. How do we detect that?

Let’s see. In order to choose the “d” values which make the ARFIMA residuals stationary, we need an ADF p-value that is below 5%. As you can see in the table, we’re going to consider the “d” values from 0.3 to 1, because these “d” values make the ARFIMA residuals stationary.

But that’s not all: we also need to take care of the long memory preservation. How do we check that?

Well, the “corr” value is the correlation between the ARFIMA model residuals and the price series. Whenever there is a high correlation, we can be sure the long memory is preserved.

Note: If you compute the autocorrelation of the price return series, you will see that it is very low; we leave that as an exercise.

So, are you ready? Did you guess it?

The best “d” according to this range of “d” values is 0.3, because the ARFIMA residuals for this “d” are stationary (the p-value is 0.049919) and it preserves the long memory of the price series (the correlation is 0.85073).

Now, we know you are a good student and you say:
Why would we need to settle for this 0.3 value?
We can get an even better value with more decimals!

Yes, that’s right.

That’s why we present you below a function which will estimate the best “d” with better precision. The function uses the following inputs:

  • DF: The dataframe of the asset prices.
  • Alpha: The significance level used to compare the ARFIMA models’ results.
  • Minimum: The minimum value of the range of “d” values.
  • Maximum: The maximum value of the range of “d” values.

The function procedure goes like this:

Step 1: Copy the dataframe of the asset prices.

Step 2: Use the “d_estimates_db” function from above with the minimum and maximum values.

Step 3: Open a try-except block in which:

  • Select the “d” value from the “out” dataframe whose p-value is the closest to and higher than the “alpha” level and save it in d1.
  • Select the “d” value from the “out” dataframe whose p-value is the closest to and lower than the “alpha” level and save it in d2.
  • Proceed to estimate again the “out” dataframe having as minimum and maximum values for the range of “d” values the d1 and d2 numbers.
  • Open a try-except block to repeat the process. This new block has another try-except block nested inside it.

This whole process assures that we get a “d” number whose ARFIMA model is stationary as per the “alpha” significance level.

We show the function now:

def optimal_d(DF, alpha=0.035, minimum=0, maximum=1):
    # Copy the dataframe
    df = DF.copy()
    # Estimate the best d based on the basic d range
    out = d_estimates_db(df, minimum, maximum)
    out.to_excel('out1.xlsx')
    # A try-except block to handle errors while getting a d with more decimals
    try:
        # Get the d value whose p-value is closest to and higher than alpha
        d1 = out[out['pVal'] > alpha].index.values[-1]
        # Get the d value whose p-value is closest to and lower than alpha
        d2 = out[out['pVal'] < alpha].index.values[0]
        # Estimate the best d between d1 and d2
        out = d_estimates_db(df, d1, d2)
        out.to_excel('out2.xlsx')
        try:
            # Get the d value whose p-value is closest to and higher than alpha
            d1 = out[out['pVal'] > alpha].index.values[-1]
            # Get the d value whose p-value is closest to and lower than alpha
            d2 = out[out['pVal'] < alpha].index.values[0]
            # Estimate the best d between d1 and d2
            out = d_estimates_db(df, d1, d2)
            out.to_excel('out3.xlsx')
            try:
                # Get the d value whose p-value is closest to and higher than alpha
                d1 = out[out['pVal'] > alpha].index.values[-1]
                # Get the d value whose p-value is closest to and lower than alpha
                d2 = out[out['pVal'] < alpha].index.values[0]
                # Set d as the average between d1 and d2 with four decimals
                d = round((d1 + d2) / 2, 4)
            except:
                # In case the try fails, assign d the value whose p-value is closest to and lower than alpha
                d = out[out['pVal'] < alpha].index.values[0]
        except:
            # In case the try fails, assign d the value whose p-value is closest to and lower than alpha
            d = out[out['pVal'] < alpha].index.values[0]
    except:
        # In case the try fails, assign d the value whose p-value is closest to and lower than alpha
        d = out[out['pVal'] < alpha].index.values[0]
    # Return the d value
    return d

We could have created a single, finer range of “d” values instead of these nested try-except blocks, but this process assures we don’t estimate too many ARFIMA models whose specific “d” values would be useless.

Let’s use this function to compute the best “d” for the Apple prices. Our p-value threshold will be 0.04. This number is chosen arbitrarily; as long as it is below 5% (and close to it), everything will be OK. The closeness should be evaluated as per the correlation between the asset prices and the ARFIMA-based residuals.

# Get the optimal ARFIMA d parameter
d = optimal_d(data, 0.04, 0, 1)

The best “d” is 0.3145 for the Apple prices between 2001 and September 2023.

Let’s use the fracDiff_FFD function to compute the ARFIMA residuals:

# Estimate ARFIMA residuals with optimal d
arfima_residuals = fracDiff_FFD(np.log(data['Close']),d)

And let’s plot these residuals:

# Set the figure size
plt.figure(figsize=(15,7))
# Plot the residuals
plt.plot(arfima_residuals.index, arfima_residuals, label = "Best d ARFIMA Residuals")
# Set the title of the graph
plt.title('Residuals from the Best d ARFIMA model', fontsize=16)
# Set the x- and y- axis labels and ticks sizes
plt.xlabel('Year-Month', fontsize=15)
plt.ylabel('Residuals', fontsize=15)
plt.tick_params(axis='both', labelsize=15)
# Set the plot legend location
plt.legend(loc=2, prop={'size': 15}, bbox_to_anchor=(0,1))
plt.show()
[Figure: Residuals from the best-d ARFIMA model]

Just to confirm, we apply an ADF test to these residuals

df2 = adfuller(arfima_residuals['Close'], regression='c', autolag='AIC')
p_value = df2[1]
p_value

The p-value is 2.6%, which is less than 5%. Thus, these ARFIMA residuals are stationary.

Let's see the correlation between these residuals and the price series

arfima_residuals.rename(columns={'Close':'Residuals'}, inplace=True)
new_df = pd.concat([data['Close'], arfima_residuals], axis=1).dropna()
round(new_df.corr().loc['Close','Residuals'],2)

The correlation is 0.84, which is high. This means the ARFIMA residuals preserve the long memory persistence of the price series.

How would we use these ARFIMA residuals in our trading strategies?
López de Prado suggests using these residuals as the prediction feature in any machine-learning model to create an algorithmic trading strategy on any asset price.
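
For instance, a minimal sketch of that idea could look like the snippet below. The feature/label construction and the logistic regression are our own simplistic choices for illustration, not López de Prado’s full workflow; we reuse the new_df dataframe built above.

from sklearn.linear_model import LogisticRegression

# Feature: the memory-preserving, stationary fractionally differentiated series
ml_df = pd.DataFrame({'feature': new_df['Residuals']})
# Label: 1 if the next day's close is higher than today's, else 0
ml_df['label'] = (new_df['Close'].shift(-1) > new_df['Close']).astype(int)
# Drop the last row, whose label is unknown
ml_df = ml_df.iloc[:-1]

# Chronological train-test split to avoid look-ahead bias
split = int(len(ml_df) * 0.8)
model = LogisticRegression()
model.fit(ml_df[['feature']].iloc[:split], ml_df['label'].iloc[:split])
# Out-of-sample hit ratio of the up/down classifier
print(model.score(ml_df[['feature']].iloc[split:], ml_df['label'].iloc[split:]))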


Conclusion

Everything’s been so cool, right? We spent time learning the basic theory of the ARFIMA model, estimated it and also used it as a strategy to forecast price returns. Don’t forget also that you can use the model’s residuals as a prediction feature for a machine learning model.

In case you want to learn more about time series models, you can profit from our course Financial Time Series Analysis for Trading. Here you will learn everything about the econometrics of time series. Don’t lose the opportunity to improve your strategy performance!


Files in the download

  • ARFIMA for article Part 1
  • ARFIMA for article Part 2
  • Estimate ARFIMA
  • Estimate ARFIMA with R in Python
  • Lopez de Prado ARFIMA


Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.


Jose Carlos Gonzales Tanaka

José Carlos is a Quantitative Researcher at QuantInsti who focuses on algorithmic trading and advanced analytics. He built a setup to trade forex using the Interactive Brokers API.