Gold Price Prediction: Step By Step Guide Using Python Machine Learning


By Ishan Shah and Rekhit Pachanekar

Is it possible to predict where the Gold price is headed?

Yes, let’s use machine learning regression techniques to predict the price of one of the most important precious metals: gold.

We will create a machine learning linear regression model that takes information from past Gold ETF (GLD) prices and returns a Gold price prediction for the next day.

GLD is the largest ETF to invest directly in physical gold. (Source)

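We will cover the following topics in our journey to predict gold prices using machine learning in Python:

  1. Import the libraries and read the Gold ETF data
  2. Define explanatory variables
  3. Define dependent variable
  4. Non-stationary variables in linear regression
  5. Split the data into train and test dataset
  6. Create a linear regression model
  7. Predict the Gold ETF prices
  8. Plotting cumulative returns
  9. How to use this model to predict daily moves?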


Import the libraries and read the Gold ETF data

First things first: import all the necessary libraries which are required to implement this strategy.

Then, we read the past 12 years of daily Gold ETF price data and store it in Df. We remove the columns which are not relevant and drop the NaN values using the dropna() function. Finally, we plot the Gold ETF close price.

Output:

Figure: Gold ETF (Ticker: GLD) price series from 2012 to 2024
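The reading and cleaning step could look roughly like the sketch below. It assumes the prices are available as CSV data with a Date column and a Close column; that layout, the helper name load_close_prices and the three sample rows are made up for illustration and are not taken from the original notebook.

```python
import io

import pandas as pd


def load_close_prices(source):
    """Read daily price data and keep only a clean 'Close' column.

    `source` is a path or file-like object with CSV data that has a
    'Date' column and a 'Close' column (an assumed layout).
    """
    df = pd.read_csv(source, index_col="Date", parse_dates=True)
    df = df[["Close"]]  # remove the columns which are not relevant
    df = df.dropna()    # drop rows with missing prices
    return df.sort_index()


# Tiny made-up sample standing in for a real GLD price file.
sample_csv = io.StringIO(
    "Date,Open,Close\n"
    "2024-01-02,190.0,191.2\n"
    "2024-01-03,191.0,\n"
    "2024-01-04,189.5,189.9\n"
)
Df = load_close_prices(sample_csv)
print(Df)
```

The row with the missing close is dropped, so only two rows remain. With a real data file you would pass its path instead of the in-memory sample.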

Define explanatory variables

An explanatory variable is a variable that is used to explain or predict the value of the Gold ETF price the next day. Simply put, explanatory variables are the features which we want to use to predict the Gold ETF price.

The explanatory variables in this strategy are the moving averages of the past 3 days and 9 days. We drop the NaN values using the dropna() function and store the feature variables in X.

However, you can add more variables to X which you think are useful to predict the prices of the Gold ETF. These variables can be technical indicators, the price of another ETF such as Gold miners ETF (GDX) or Oil ETF (USO), or US economic data.
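As a minimal sketch of the two moving-average features, the block below uses a synthetic random-walk price series in place of real GLD data. The column names S_3 and S_9 follow the names used in the outputs of this article; the data itself is made up.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices standing in for GLD closes (made-up data).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=30)
Df = pd.DataFrame({"Close": 150 + rng.normal(0, 1, 30).cumsum()}, index=dates)

# Moving averages of the closing price over the past 3 and 9 days.
Df["S_3"] = Df["Close"].rolling(window=3).mean()
Df["S_9"] = Df["Close"].rolling(window=9).mean()

# The first 8 rows have no 9-day average yet, so they are dropped.
Df = Df.dropna()
X = Df[["S_3", "S_9"]]
print(X.head())
```

Any extra feature, such as the price of another ETF, would simply be one more column in X.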


Define dependent variable

Similarly, the dependent variable depends on the values of the explanatory variables. Simply put, it is the Gold ETF price which we are trying to predict, that is, the price on the next day. We store it in y.
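One way to build such a target is to shift the close price by one row. The sketch below uses five made-up prices; the column name next_day_price matches the name that appears in the cointegration output later in this article.

```python
import pandas as pd

# Five made-up closing prices.
Df = pd.DataFrame({"Close": [150.0, 151.0, 149.5, 152.0, 153.5]})

# shift(-1) moves each price up one row, so row t holds the close of day t+1.
Df["next_day_price"] = Df["Close"].shift(-1)

# The last row has no next-day price yet, so it is dropped.
Df = Df.dropna()
y = Df["next_day_price"]
print(y.tolist())  # [151.0, 149.5, 152.0, 153.5]
```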


Non-stationary variables in linear regression

Have you noticed that we are using the price series directly to create the explanatory and dependent variables? You may have heard warning bells in your head!

You’re correct when you say that linear regression typically requires stationary data, as non-stationarity can lead to issues such as spurious relationships, biased coefficients, and unreliable predictions.

However, as you explore more advanced nuances, you will encounter situations where non-stationary data can still be used effectively, provided certain conditions are met.

Broadly, there are two approaches to working with non-stationary data in this context:

Differencing: You can transform the data by differencing to make it stationary. You can also calculate the daily returns to transform the price series into a stationary returns series.

Residual Stationarity: Another approach is to work with non-stationary data but check if the residuals from the regression are stationary. Stationary residuals suggest that the non-stationary variables are cointegrated, meaning a genuine relationship exists. In other words, if the data is non-stationary but the residuals are stationary, the regression is capturing a real long-run relationship between the series rather than a spurious one.

Non-stationarity is common in financial time series, so caution is indeed essential when applying linear regression.

We will try the second approach and check for residual stationarity.

Output:

Cointegration p-value between S_3 and next_day_price: 0.0

Cointegration p-value between S_9 and next_day_price: 5.54437936472372e-26

S_3 and next_day_price are cointegrated.

S_9 and next_day_price are cointegrated.

The time series S_3 (3-day moving average) and next_day_price, as well as S_9 (9-day moving average) and next_day_price, are cointegrated. Thus, we can proceed with running a linear regression directly without transforming the series to achieve stationarity.

Why Can You Run the Regression Directly?

Cointegration implies that there is a stable, long-term relationship between the two non-stationary series. This means that while the individual series may each contain unit roots (i.e., be non-stationary), their linear combination is stationary and running an Ordinary Least Squares (OLS) regression will not lead to a spurious regression. This is because the residuals of the regression (i.e., the difference between the predicted and actual values) will be stationary.

Key Points to Remember

Cointegration already ensures a valid statistical relationship, which makes OLS appropriate for estimating the parameters. There is therefore no need to difference the series to make them stationary before running the regression.

The regression run between S_3 (or S_9) and next_day_price will capture a valid long-term equilibrium relationship, which cointegration confirms.


Split the data into train and test dataset

In this step, we split the predictors and output data into train and test data. The training data is used to create the linear regression model, by pairing the input with expected output.

The test data is used to estimate how well the model has been trained.

  1. The first 80% of the data is used for training and the remaining data for testing
  2. X_train & y_train are the training dataset
  3. X_test & y_test are the test dataset
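A chronological 80/20 split of this kind can be sketched as below, using ten made-up rows. The data is not shuffled, so the test set always lies after the training set in time.

```python
import numpy as np
import pandas as pd

# Ten made-up rows of features and target, already ordered by time.
n = 10
X = pd.DataFrame({
    "S_3": np.arange(n, dtype=float),
    "S_9": np.arange(n, dtype=float) + 0.5,
})
y = pd.Series(np.arange(n, dtype=float) + 1.0, name="next_day_price")

# Index of the first test row: the first 80% of rows go to training.
t = int(0.8 * len(X))

X_train, y_train = X.iloc[:t], y.iloc[:t]
X_test, y_test = X.iloc[t:], y.iloc[t:]
print(len(X_train), len(X_test))  # 8 2
```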

Create a linear regression model

We will now create a linear regression model. But, what is linear regression?

Linear regression captures a mathematical relationship between ‘x’ and ‘y’ variables that “best” explains the observed values of ‘y’ in terms of the observed values of ‘x’, by fitting a line through a scatter plot of the data.


To break it down further, regression explains the variation in a dependent variable in terms of independent variables. The dependent variable - ‘y’ is the variable that you want to predict. The independent variables - ‘x’ are the explanatory variables that you use to predict the dependent variable. The following regression equation describes that relation:

y = m1 * x1 + m2 * x2 + c
Gold ETF price = m1 * 3 days moving average + m2 * 9 days moving average + c

Then we use the fit method to fit the independent and dependent variables (x’s and y’s) and generate the coefficients (m1 and m2) and the constant (c) of the regression.

Output:

Linear Regression model

Gold ETF Price (y) = 1.17 * 3 Days Moving Average (x1) + -0.17 * 9 Days Moving Average (x2) + 0.28 (constant)
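A minimal sketch of the fitting step, assuming scikit-learn's LinearRegression is the model in use. The training data here is made up so that the target is an exact linear function of the two features, which lets you see that fit recovers the coefficients that were used to build it. These coefficients are illustrative and are not the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up features and a target built from known coefficients.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "S_3": rng.uniform(100, 200, 50),
    "S_9": rng.uniform(100, 200, 50),
})
y_train = 1.2 * X_train["S_3"] - 0.2 * X_train["S_9"] + 0.5

# fit estimates the coefficients (m1, m2) and the constant (c).
linear = LinearRegression().fit(X_train, y_train)
m1, m2 = linear.coef_
c = linear.intercept_

print("Linear Regression model")
print(
    f"Gold ETF Price (y) = {m1:.2f} * 3 Days Moving Average (x1) "
    f"+ {m2:.2f} * 9 Days Moving Average (x2) + {c:.2f} (constant)"
)
```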


Predict the Gold ETF prices

Now, it’s time to check if the model works on the test dataset. We predict the Gold ETF prices using the linear model fitted on the train dataset. The predict method returns the Gold ETF price (y) for the given explanatory variables X.

Output:

Figure: Gold ETF (GLD) predicted price versus actual price, using linear regression

The graph shows the predicted and actual price of the Gold ETF.
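The prediction step can be sketched end to end on synthetic data, as below. The random-walk prices, the seed and the column names predicted_price and actual_price are assumptions for illustration. Plotting is left out, and the comparison is printed as a table instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic random-walk prices standing in for GLD closes (made-up data).
rng = np.random.default_rng(5)
Df = pd.DataFrame({"Close": 150 + rng.normal(0, 1, 400).cumsum()})
Df["S_3"] = Df["Close"].rolling(window=3).mean()
Df["S_9"] = Df["Close"].rolling(window=9).mean()
Df["next_day_price"] = Df["Close"].shift(-1)
Df = Df.dropna()
X, y = Df[["S_3", "S_9"]], Df["next_day_price"]

# Fit on the first 80% of rows, predict on the remaining 20%.
t = int(0.8 * len(X))
linear = LinearRegression().fit(X.iloc[:t], y.iloc[:t])

predicted_price = pd.DataFrame(
    linear.predict(X.iloc[t:]),
    index=y.iloc[t:].index,
    columns=["predicted_price"],
)
comparison = predicted_price.assign(actual_price=y.iloc[t:])
print(comparison.head())
```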

Now, let’s compute the goodness of the fit using the score() function.

Output:

99.32

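As can be seen, the R-squared of the model is 99.32%. For a linear regression with an intercept, R-squared lies between 0 and 100% on the data used to fit the model; on unseen test data it can also turn negative, when the model performs worse than simply predicting the mean. A score close to 100% indicates that the model explains the Gold ETF prices well.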
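To make the meaning of this number concrete, the sketch below compares the value returned by scikit-learn's score method with R-squared computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data is made up and unrelated to GLD.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: a noisy linear relationship between two features and y.
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 2))
y = 1.2 * X[:, 0] - 0.2 * X[:, 1] + 0.3 + rng.normal(0, 0.1, 200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]
linear = LinearRegression().fit(X_train, y_train)

# R-squared on the test data, as a percentage.
r2_score_pct = linear.score(X_test, y_test) * 100

# The same quantity by hand: 1 - SS_residual / SS_total.
residual_ss = np.sum((y_test - linear.predict(X_test)) ** 2)
total_ss = np.sum((y_test - y_test.mean()) ** 2)
r2_manual_pct = (1 - residual_ss / total_ss) * 100

print(f"{r2_score_pct:.2f}")
```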


Plotting cumulative returns

Let’s calculate the cumulative returns of this strategy to analyse its performance.

The steps to calculate the cumulative returns are as follows:

  1. Generate the daily percentage change of the gold price.
  2. Create a buy trading signal, represented by “1”, when the next day’s predicted price is higher than the current day’s predicted price. No position is taken otherwise.
  3. Calculate the strategy returns by multiplying the daily percentage change by the trading signal.
  4. Finally, plot the cumulative returns graph.

The output is given below:

Figure: Cumulative returns of Gold ETF price prediction using linear regression
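The calculation steps for the cumulative returns can be sketched on a few made-up rows. One detail is an assumption here: the daily percentage change is shifted back by one row so that each signal is multiplied by the return earned on the following day, which avoids using information that was not yet available. The column names are hypothetical, and the final plot is replaced by a printout.

```python
import numpy as np
import pandas as pd

# Made-up closing prices and made-up model predictions for the next day.
gold = pd.DataFrame({
    "Close": [100.0, 102.0, 101.0, 104.0, 103.0],
    "predicted_price_next_day": [101.0, 103.0, 102.0, 105.0, 104.0],
})

# Step 1: daily percentage change, shifted so that each row holds the return
# from that day's close to the next day's close (assumed alignment).
gold["gold_returns"] = gold["Close"].pct_change().shift(-1)

# Step 2: signal is 1 when the prediction is higher than the previous one.
gold["signal"] = np.where(
    gold["predicted_price_next_day"].shift(1) < gold["predicted_price_next_day"],
    1,
    0,
)

# Step 3: the strategy earns the market return only when the signal is 1.
gold["strategy_returns"] = gold["signal"] * gold["gold_returns"]

# Step 4: compound the daily strategy returns into a cumulative series.
cumulative_returns = (1 + gold["strategy_returns"].dropna()).cumprod()
print(cumulative_returns)
```

In this toy example the strategy is long on two days, both of which happen to be down days, so the cumulative return ends slightly below 1.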

We will also calculate the Sharpe ratio.

The output is given below:

'Sharpe Ratio 0.76'
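The exact formula behind this figure is not shown here. A common convention, assumed in the sketch below, is to divide the mean daily strategy return by its standard deviation and annualise with the square root of 252 trading days, taking the risk-free rate as zero. The helper name sharpe_ratio and the sample returns are made up.

```python
import numpy as np
import pandas as pd


def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualised Sharpe ratio of daily returns, assuming a zero risk-free rate."""
    daily_returns = pd.Series(daily_returns).dropna()
    return daily_returns.mean() / daily_returns.std() * np.sqrt(periods_per_year)


# Made-up daily strategy returns.
strategy_returns = pd.Series([0.004, -0.002, 0.0, 0.006, -0.001])
print(f"Sharpe Ratio {sharpe_ratio(strategy_returns):.2f}")
```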




How to use this model to predict daily moves?

You can use the following code to predict the gold prices and give a trading signal whether we should buy GLD or take no position.

The output is as shown below:

Figure: Signal generation using linear regression (buy or hold no position in the Gold ETF)
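A signal-generation step of this kind might look like the sketch below. It runs on a synthetic random walk instead of freshly downloaded GLD prices, and the column names predicted_gold_price and signal, as well as the "Buy" and "No Position" labels, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic random-walk prices standing in for recent GLD closes (made-up data).
rng = np.random.default_rng(3)
dates = pd.bdate_range("2023-01-02", periods=300)
data = pd.DataFrame({"Close": 180 + rng.normal(0, 1, 300).cumsum()}, index=dates)
data["S_3"] = data["Close"].rolling(window=3).mean()
data["S_9"] = data["Close"].rolling(window=9).mean()
data = data.dropna()

# Fit on the rows whose next-day price is already known (all but the last).
features = data[["S_3", "S_9"]]
target = data["Close"].shift(-1)
model = LinearRegression().fit(features.iloc[:-1], target.iloc[:-1])

# Predict the next-day price for every row, including the most recent one.
data["predicted_gold_price"] = model.predict(features)

# Buy when the prediction is higher than the previous day's prediction.
data["signal"] = np.where(
    data["predicted_gold_price"].shift(1) < data["predicted_gold_price"],
    "Buy",
    "No Position",
)
print(data[["Close", "predicted_gold_price", "signal"]].tail())
```

The last row of the table is the one that matters for trading: it holds the prediction and the signal for the next trading day.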

Congrats! You just learned a fundamental yet strong machine learning technique, with an example of Gold price prediction. Thanks for reading!



Access the Github link below: Gold Price Prediction Strategy Jupyter Notebook


Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.
