Reinforcement Learning in Trading: Build Smarter Strategies with Q-Learning & Experience Replay

By Ishan Shah

Initially, AI research focused on simulating human thinking, only faster. Today, we've reached a point where AI "thinking" amazes even human experts. As a perfect example, DeepMind's AlphaZero revolutionised chess strategy by demonstrating that winning doesn't require preserving pieces—it's about achieving checkmate, even at the cost of short-term losses.

This concept of "delayed gratification" in AI strategy sparked interest in exploring reinforcement learning for trading applications. This article explores how reinforcement learning can solve trading problems that might be impossible through traditional machine learning approaches.

Prerequisites

Before exploring the concepts in this blog, it’s important to build a strong foundation in machine learning, particularly in its application to financial markets.

Begin with Machine Learning Basics or Machine Learning for Algorithmic Trading in Python to understand the fundamentals, such as training data, features, and model evaluation. Then, deepen your understanding with the Top 10 Machine Learning Algorithms for Beginners, which covers key ML models like decision trees, SVMs, and ensemble methods.

Learn the difference between supervised techniques via Machine Learning Classification and regression-based price prediction in Predicting Stock Prices Using Regression.

Also, review Unsupervised Learning to understand clustering and anomaly detection, crucial for identifying patterns without labelled data.

This guide is based on notes from Deep Reinforcement Learning in Trading by Dr Tom Starke and is structured as follows.

What is Reinforcement Learning?
How to Apply Reinforcement Learning in Trading
How is Reinforcement Learning Different from Traditional ML?
Components of Reinforcement Learning
Putting It All Together
Q-Table and Q-Learning
Experience Replay and Advanced Techniques in RL
Challenges in Reinforcement Learning for Trading

What is Reinforcement Learning?

Despite sounding complex, reinforcement learning employs a simple concept we all understand from childhood. Remember receiving rewards for good grades or scolding for misbehavior? Those experiences shaped your behavior through positive and negative reinforcement.

Like humans, RL agents learn for themselves to achieve successful strategies that lead to the greatest long-term rewards. This paradigm of learning by trial-and-error, solely from rewards or punishments, is known as reinforcement learning (RL).

How to Apply Reinforcement Learning in Trading

In trading, RL can be applied to various objectives:

Maximising profit
Optimising portfolio allocation

The distinguishing advantage of RL is its ability to learn strategies that maximise long-term rewards, even when it means accepting short-term losses.

Consider Amazon's stock price, which remained relatively stable from late 2018 to early 2020, suggesting a mean-reverting strategy might work well.

Amazon's stock price- Reinforcement learning in trading

However, from early 2020, the price began trending upward. Deploying a mean-reverting strategy at this point would have resulted in losses, causing many traders to exit the market.

Amazon's stock price in early 2020- Reinforcement learning in trading

An RL model, however, could recognise larger patterns from previous years (2017-2018) and continue holding positions for substantial future profits—exemplifying delayed gratification in action.

How is Reinforcement Learning Different from Traditional ML?

Unlike traditional machine learning algorithms, RL doesn't require labels at each time step. Instead:

The RL algorithm learns through trial and error
It receives rewards only when trades are closed
It optimises strategy to maximise long-term rewards

Traditional ML requires labels at specific intervals (e.g., hourly or daily) and focuses on regression to predict the next candle percentage returns or classification to predict whether to buy or sell a stock. This makes solving the delayed gratification problem particularly challenging through conventional ML approaches.

Components of Reinforcement Learning

This guide focuses on the conceptual understanding of Reinforcement Learning components rather than their implementation. If you're interested in coding these concepts, you can explore the Deep Reinforcement Learning course on Quantra.

Actions

Actions define what the RL algorithm can do to solve a problem. For trading, actions might be Buy, Sell, and Hold. For portfolio management, actions would be capital allocations across asset classes.

Policy

Policies help the RL model decide which actions to take:

Exploration policy: When the agent knows nothing, it decides actions randomly and learns from experiences. This initial phase is driven by experimentation—trying different actions and observing the outcomes.
Exploitation policy: The agent uses past experiences to map states to actions that maximise long-term rewards.

In trading, it is crucial to maintain a balance between exploration and exploitation. A simple mathematical expression that decays exploration over time while retaining a small exploratory chance can be written as:

$\varepsilon_t = \frac{1}{t^k} + \varepsilon_{\min}$

Here, εₜ is the exploration rate at trade number t, k controls the rate of decay, and εₘᵢₙ ensures we never stop exploring entirely.

Here, $ε_{t}$ is the exploration rate at trade number t, k controls the rate of decay, and $ε_{\min}$ ensures we never stop exploring entirely.

State

The state provides meaningful information for decision-making. For example, when deciding whether to buy Apple stock, useful information might include:

Technical indicators
Historical price data
Sentiment data
Fundamental data

All this information constitutes the state. For effective analysis, the data should be weakly predictive and weakly stationary (having constant mean and variance), as ML algorithms generally perform better on stationary data.

Rewards

Rewards represent the end objective of your RL system. Common metrics include:

Profit per tick
Sharpe Ratio
Profit per trade

When it comes to trading, using just the PnL sign (positive/negative) as the reward works better as the model learns faster. This binary reward structure allows the model to focus on consistently making profitable trades rather than chasing larger but potentially riskier gains.

Environment

The environment is the world that allows the RL agent to observe states. When the agent applies an action, the environment processes that action, calculates rewards, and transitions to the next state.

RL Agent

The agent is the RL model that takes input features/state and decides which action to take. For instance, an RL agent might take RSI and 10-minute returns as input to determine whether to go long on Apple stock or close an existing position.

Putting It All Together

Let's see how these components work together:

Step 1:

State & Action: Apple's closing price was $92 on Jan 24, 2025. Based on the state (RSI and 10-day returns), the agent gives a buy signal.
Environment: The order is placed at the open on the next trading day (Jan 27) and filled at $92.
Reward: No reward is given as the trade is still open.

Step 2:

State & Action: The next state reflects the latest price data. On Jan 27, the price reached $94. The agent analyses this state and decides to sell.
Environment: A sell order is placed to close the long position.
Reward: A reward of 2.1% is given to the agent.

Date	Closing price	Action	Reward (% returns)
Jan 24	$92	Buy	-
Jan 27	$94	Sell	2.1

Q-Table and Q-Learning

At each time step, the RL agent needs to decide which action to take. The Q-table helps by showing which action will give the maximum reward. In this table:

Rows represent states (days)
Columns represent actions (hold/sell)
Values are Q-values indicating expected future rewards

Example Q-table:

Date	Sell	Hold
23-01-2025	0.954	0.966
24-01-2025	0.954	0.985
27-01-2025	0.954	1.005
28-01-2025	0.954	1.026
29-01-2025	0.954	1.047
30-01-2025	0.954	1.068
31-01-2025	0.954	1.090

On Jan 23, the agent would choose "hold" since its Q-value (0.966) exceeds the Q-value for "sell" (0.954).

Creating a Q-Table

Let's create a Q-table using Apple's price data from Jan 22-31, 2025:

Date	Closing Price	% Returns	Cumulative Returns
22-01-2025	97.2	-	-
23-01-2025	92.8	-4.53%	0.95
24-01-2025	92.6	-0.22%	0.95
27-01-2025	94.8	2.38%	0.98
28-01-2025	93.3	-1.58%	0.96
29-01-2025	95.0	1.82%	0.98
30-01-2025	96.2	1.26%	0.99
31-01-2025	106.3	10.50%	1.09

If we've bought one Apple share with no remaining capital, our only choices are "hold" or "sell." We first create a reward table:

State/Action	Sell	Hold
22-01-2025	0	0
23-01-2025	0.95	0
24-01-2025	0.95	0
27-01-2025	0.98	0
28-01-2025	0.96	0
29-01-2025	0.98	0
30-01-2025	0.99	0
31-01-2025	1.09	1.09

Using only this reward table, the RL model would sell the stock and get a reward of 0.95. However, the price is expected to increase to $106 on Jan 31, resulting in a 9% gain, so holding would be better.

To represent this future information, we create a Q-table using the Bellman equation:

Q (s, a) = R (s, a) + γ max [Q (s^{'}, a^{'})]

Where:

s is the state
a is a set of actions at time t
a' is a specific action
R is the reward table
Q is the state-action table that's constantly updated
γ is the learning rate

Starting with Jan 30's Hold action:

The reward for this action (from R-table) is 0
Assuming γ = 0.98, the maximum Q-value for actions on Jan 31 is 1.09
The Q-value for Hold on Jan 30 is 0 + 0.98(1.09) = 1.068

Completing this process for all rows gives us our Q-table:

Date	Sell	Hold
23-01-2025	0.95	0.966
24-01-2025	0.95	0.985
27-01-2025	0.98	1.005
28-01-2025	0.96	1.026
29-01-2025	0.98	1.047
30-01-2025	0.99	1.068
31-01-2025	1.09	1.090

The RL model will now select "hold" to maximise Q-value. This process of updating the Q-table is called Q-learning.

In real-world scenarios with vast state spaces, building complete Q-tables becomes impractical. To overcome this, we can use Deep Q Networks (DQNs)—neural networks that learn Q-tables from past experiences and provide Q-values for actions when given a state as input.

Experience Replay and Advanced Techniques in RL

Experience Replay

Stores (state, action, reward, next_state) tuples in a replay buffer
Trains the network on random batches from this buffer
Benefits: breaks correlations between samples, improves data efficiency, stabilises training

Double Q-Networks (DDQN)

Uses two networks: primary for action selection, target for value estimation
Reduces overestimation bias in Q-values
More stable learning and better policies

Other Key Advancements

Prioritised Experience Replay: Samples important transitions more frequently
Dueling Networks: Separates state value and action advantage estimation
Distributional RL: Models the entire return distribution instead of just the expected value
Rainbow DQN: Combines multiple improvements for state-of-the-art performance
Soft Actor-Critic: Adds entropy regularisation for robust exploration

These techniques address fundamental challenges in deep RL, improving efficiency, stability, and performance across complex environments.

Challenges in Reinforcement Learning for Trading

Type 2 Chaos

While training, the RL model works in isolation without interacting with the market. Once deployed, we don't know how it will affect the market. Type 2 chaos occurs when an observer can influence the situation they're observing. Although difficult to quantify during training, we can assume the RL model will continue learning after deployment and adjust accordingly.

Noise in Financial Data

RL models might interpret random noise in financial data as actionable signals, leading to inaccurate trading recommendations. While methods exist to remove noise, we must balance noise reduction against a potential loss of important data.

Conclusion

We've introduced the fundamental components of reinforcement learning systems for trading. The next step would be implementing your own RL system to backtest and paper trade using real-world market data.

For a deeper dive into RL and to create your own reinforcement learning trading strategies, consider specialised courses in Deep Reinforcement Learning on Quantra.

Explore Now >

References & Further Readings

Once you’re comfortable with the foundational ML concepts, you can explore advanced reinforcement learning and its role in trading through more structured learning experiences. Start with the Machine Learning & Deep Learning in Trading learning track, which offers hands-on tutorials on AI model design, data preprocessing, and financial market modelling.
For those looking for an advanced, structured approach to quantitative trading and machine learning, the Executive Programme in Algorithmic Trading (EPAT) is an excellent choice. This program covers classical ML algorithms (such as SVM, k-means clustering, decision trees, and random forests), deep learning fundamentals (including neural networks and gradient descent), and Python-based strategy development. You will also explore statistical arbitrage using PCA, alternative data sources, and reinforcement learning applied to trading.
Once you have mastered these concepts, you can apply your knowledge in real-world trading using Blueshift. Blueshift is an all-in-one automated trading platform that offers institutional-grade infrastructure for investment research, backtesting, and algorithmic trading. It is a fast, flexible, and reliable platform, agnostic to asset class and trading style, helping you turn your ideas into investment-worthy opportunities.

Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments, is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.

EPAT Walkthrough & Live Q&A