By Ishan Shah
Initially, AI research focused on simulating human thinking, only faster. Today, we've reached a point where AI "thinking" amazes even human experts. As a perfect example, DeepMind's AlphaZero revolutionised chess strategy by demonstrating that winning doesn't require preserving pieces—it's about achieving checkmate, even at the cost of short-term losses.
This concept of "delayed gratification" in AI strategy sparked interest in exploring reinforcement learning for trading applications. This article explores how reinforcement learning can solve trading problems that might be impossible through traditional machine learning approaches.
Prerequisites
Before exploring the concepts in this blog, it’s important to build a strong foundation in machine learning, particularly in its application to financial markets.
Begin with Machine Learning Basics or Machine Learning for Algorithmic Trading in Python to understand the fundamentals, such as training data, features, and model evaluation. Then, deepen your understanding with the Top 10 Machine Learning Algorithms for Beginners, which covers key ML models like decision trees, SVMs, and ensemble methods.
Learn the difference between supervised techniques via Machine Learning Classification and regression-based price prediction in Predicting Stock Prices Using Regression.
Also, review Unsupervised Learning to understand clustering and anomaly detection, crucial for identifying patterns without labelled data.
This guide is based on notes from Deep Reinforcement Learning in Trading by Dr Tom Starke and is structured as follows.
- What is Reinforcement Learning?
- How to Apply Reinforcement Learning in Trading
- How is Reinforcement Learning Different from Traditional ML?
- Components of Reinforcement Learning
- Putting It All Together
- Q-Table and Q-Learning
- Experience Replay and Advanced Techniques in RL
- Challenges in Reinforcement Learning for Trading
What is Reinforcement Learning?
Despite sounding complex, reinforcement learning employs a simple concept we all understand from childhood. Remember receiving rewards for good grades or a scolding for misbehaviour? Those experiences shaped your behaviour through positive and negative reinforcement.
Like humans, RL agents learn for themselves which strategies lead to the greatest long-term rewards. This paradigm of learning by trial and error, solely from rewards or punishments, is known as reinforcement learning (RL).
How to Apply Reinforcement Learning in Trading
In trading, RL can be applied to various objectives:
- Maximising profit
- Optimising portfolio allocation
The distinguishing advantage of RL is its ability to learn strategies that maximise long-term rewards, even when it means accepting short-term losses.
Consider Amazon's stock price, which remained relatively stable from late 2018 to early 2020, suggesting a mean-reverting strategy might work well.

However, from early 2020, the price began trending upward. Deploying a mean-reverting strategy at this point would have resulted in losses, causing many traders to exit the market.

An RL model, however, could recognise larger patterns from previous years (2017-2018) and continue holding positions for substantial future profits—exemplifying delayed gratification in action.
How is Reinforcement Learning Different from Traditional ML?
Unlike traditional machine learning algorithms, RL doesn't require labels at each time step. Instead:
- The RL algorithm learns through trial and error
- It receives rewards only when trades are closed
- It optimises strategy to maximise long-term rewards
Traditional ML requires labels at specific intervals (e.g., hourly or daily) and focuses on regression to predict the next candle's percentage return, or classification to predict whether to buy or sell a stock. This makes the delayed gratification problem particularly difficult to solve with conventional ML approaches.
Components of Reinforcement Learning
This guide focuses on the conceptual understanding of Reinforcement Learning components rather than their implementation. If you're interested in coding these concepts, you can explore the Deep Reinforcement Learning course on Quantra.
Actions
Actions define what the RL algorithm can do to solve a problem. For trading, actions might be Buy, Sell, and Hold. For portfolio management, actions would be capital allocations across asset classes.
Policy
Policies help the RL model decide which actions to take:
- Exploration policy: When the agent knows nothing, it decides actions randomly and learns from experiences. This initial phase is driven by experimentation—trying different actions and observing the outcomes.
- Exploitation policy: The agent uses past experiences to map states to actions that maximise long-term rewards.
In trading, it is crucial to maintain a balance between exploration and exploitation. A simple mathematical expression that decays exploration over time while retaining a small exploratory chance can be written as:

εₜ = max(εₘᵢₙ, e^(−kt))

Here, εₜ is the exploration rate at trade number t, k controls the rate of decay, and εₘᵢₙ ensures we never stop exploring entirely.
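A minimal sketch of what this decaying ε-greedy policy could look like in Python is shown below; the decay constant, the exploration floor, and the action set are illustrative assumptions rather than values from the article.

```python
import math
import random

def epsilon(t, k=0.01, eps_min=0.05):
    """Exploration rate at trade number t: decays with t but never falls below eps_min."""
    return max(eps_min, math.exp(-k * t))

def choose_action(q_values, t, actions=("buy", "sell", "hold")):
    """Epsilon-greedy policy: explore with probability epsilon(t), otherwise exploit.

    q_values is a dict mapping each action to its current Q-value estimate.
    """
    if random.random() < epsilon(t):
        return random.choice(actions)                    # exploration: random action
    return max(actions, key=lambda a: q_values[a])       # exploitation: best known action

# Early trades explore often, later trades mostly exploit
print(epsilon(0), epsilon(100), epsilon(1000))           # 1.0, ~0.37, 0.05
```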
State
The state provides meaningful information for decision-making. For example, when deciding whether to buy Apple stock, useful information might include:
- Technical indicators
- Historical price data
- Sentiment data
- Fundamental data
All this information constitutes the state. For effective analysis, the data should be weakly predictive and weakly stationary (having constant mean and variance), as ML algorithms generally perform better on stationary data.
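As an illustration, the sketch below constructs two roughly stationary state features, percentage returns and a 14-period RSI, from a simulated price series and checks them with an Augmented Dickey-Fuller test; the simulated data and window lengths are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Simulated daily closing prices (in practice, load these from your data source)
close = pd.Series(np.random.lognormal(mean=0.0005, sigma=0.01, size=500)).cumprod() * 100

# Raw prices trend and are non-stationary; percentage returns are usually close to stationary
returns = close.pct_change()

# A simple 14-period RSI, another commonly used (roughly stationary) state feature
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = -delta.clip(upper=0).rolling(14).mean()
rsi = 100 - 100 / (1 + gain / loss)

state = pd.concat([returns, rsi], axis=1, keys=["returns", "rsi"])
state = state.replace([np.inf, -np.inf], np.nan).dropna()

# ADF test: a small p-value suggests the feature is stationary
for col in state:
    pvalue = adfuller(state[col])[1]
    print(f"{col}: ADF p-value = {pvalue:.4f}")
```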
Rewards
Rewards represent the end objective of your RL system. Common metrics include:
- Profit per tick
- Sharpe Ratio
- Profit per trade
In trading, using just the sign of the PnL (positive or negative) as the reward often works well because the model learns faster. This binary reward structure lets the model focus on consistently making profitable trades rather than chasing larger but potentially riskier gains.
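A minimal sketch of such a binary reward, computed only when a trade is closed; the function name and its arguments are hypothetical:

```python
import numpy as np

def trade_reward(entry_price, exit_price, side=1):
    """Reward given only when a trade is closed.

    side = +1 for a long trade, -1 for a short trade.
    Returns +1 for a profitable trade, -1 for a losing one, 0 otherwise.
    """
    pnl = side * (exit_price - entry_price)
    return float(np.sign(pnl))

print(trade_reward(92.0, 94.0))   # +1.0: profitable long trade
print(trade_reward(92.0, 90.5))   # -1.0: losing long trade
```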
Environment
The environment is the world that allows the RL agent to observe states. When the agent applies an action, the environment processes that action, calculates rewards, and transitions to the next state.
RL Agent
The agent is the RL model that takes input features/state and decides which action to take. For instance, an RL agent might take RSI and 10-minute returns as input to determine whether to go long on Apple stock or close an existing position.
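To make the interaction concrete, here is a schematic sketch of one episode of the agent-environment loop; the `env` and `agent` interfaces (reset, step, act, learn) are assumed, gym-style conventions rather than a specific library.

```python
def run_episode(env, agent):
    """One pass through the environment with an RL agent (interfaces are illustrative)."""
    state = env.reset()              # environment provides the initial state (e.g. RSI, returns)
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)                        # agent maps state -> action (buy/sell/hold)
        next_state, reward, done = env.step(action)      # environment fills orders, computes reward
        agent.learn(state, action, reward, next_state)   # agent updates its estimates
        state = next_state
        total_reward += reward
    return total_reward
```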
Putting It All Together
Let's see how these components work together:
Step 1:
- State & Action: Apple's closing price was $92 on Jan 24, 2025. Based on the state (RSI and 10-day returns), the agent gives a buy signal.
- Environment: The order is placed at the open on the next trading day (Jan 27) and filled at $92.
- Reward: No reward is given as the trade is still open.
Step 2:
- State & Action: The next state reflects the latest price data. On Jan 27, the price reached $94. The agent analyses this state and decides to sell.
- Environment: A sell order is placed to close the long position.
- Reward: A reward of 2.1% is given to the agent.
| Date | Closing price | Action | Reward (% returns) |
|--------|---------------|--------|---------------------|
| Jan 24 | $92 | Buy | - |
| Jan 27 | $94 | Sell | 2.1 |
Q-Table and Q-Learning
At each time step, the RL agent needs to decide which action to take. The Q-table helps by showing which action will give the maximum reward. In this table:
- Rows represent states (days)
- Columns represent actions (hold/sell)
- Values are Q-values indicating expected future rewards
Example Q-table:
| Date | Sell | Hold |
|------------|-------|-------|
| 23-01-2025 | 0.954 | 0.966 |
| 24-01-2025 | 0.954 | 0.985 |
| 27-01-2025 | 0.954 | 1.005 |
| 28-01-2025 | 0.954 | 1.026 |
| 29-01-2025 | 0.954 | 1.047 |
| 30-01-2025 | 0.954 | 1.068 |
| 31-01-2025 | 0.954 | 1.090 |
On Jan 23, the agent would choose "hold" since its Q-value (0.966) exceeds the Q-value for "sell" (0.954).
Creating a Q-Table
Let's create a Q-table using Apple's price data from Jan 22-31, 2025:
| Date | Closing Price | % Returns | Cumulative Returns |
|------------|---------------|-----------|--------------------|
| 22-01-2025 | 97.2 | - | - |
| 23-01-2025 | 92.8 | -4.53% | 0.95 |
| 24-01-2025 | 92.6 | -0.22% | 0.95 |
| 27-01-2025 | 94.8 | 2.38% | 0.98 |
| 28-01-2025 | 93.3 | -1.58% | 0.96 |
| 29-01-2025 | 95.0 | 1.82% | 0.98 |
| 30-01-2025 | 96.2 | 1.26% | 0.99 |
| 31-01-2025 | 106.3 | 10.50% | 1.09 |
If we've bought one Apple share with no remaining capital, our only choices are "hold" or "sell." We first create a reward table:
| State/Action | Sell | Hold |
|--------------|------|------|
| 22-01-2025 | 0 | 0 |
| 23-01-2025 | 0.95 | 0 |
| 24-01-2025 | 0.95 | 0 |
| 27-01-2025 | 0.98 | 0 |
| 28-01-2025 | 0.96 | 0 |
| 29-01-2025 | 0.98 | 0 |
| 30-01-2025 | 0.99 | 0 |
| 31-01-2025 | 1.09 | 1.09 |
Using only this reward table, the RL model would sell the stock and get a reward of 0.95. However, the price is expected to increase to $106 on Jan 31, resulting in a 9% gain, so holding would be better.
To represent this future information, we create a Q-table using the Bellman equation:

Q(s, a) = R(s, a) + γ · maxₐ′ Q(s′, a′)

Where:
- s is the current state and s′ is the next state
- a is the action taken in state s, and a′ ranges over the actions available in the next state
- R is the reward table
- Q is the state-action table that is constantly updated
- γ is the discount factor
Starting with Jan 30's Hold action:
- The reward for this action (from R-table) is 0
- Assuming γ = 0.98, the maximum Q-value for actions on Jan 31 is 1.09
- The Q-value for Hold on Jan 30 is 0 + 0.98(1.09) = 1.068
Completing this process for all rows gives us our Q-table:
| Date | Sell | Hold |
|------------|------|-------|
| 23-01-2025 | 0.95 | 0.966 |
| 24-01-2025 | 0.95 | 0.985 |
| 27-01-2025 | 0.98 | 1.005 |
| 28-01-2025 | 0.96 | 1.026 |
| 29-01-2025 | 0.98 | 1.047 |
| 30-01-2025 | 0.99 | 1.068 |
| 31-01-2025 | 1.09 | 1.090 |
The RL model will now select "hold" to maximise Q-value. This process of updating the Q-table is called Q-learning.
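This backward pass can be reproduced in a few lines of Python. The sketch below rebuilds the Q-table above from the reward table, using γ = 0.98 and treating a sell as closing the trade (so its Q-value is just the immediate reward):

```python
import pandas as pd

gamma = 0.98  # discount factor

# Reward table from the example: selling pays the cumulative return and ends the
# episode; holding pays nothing until the final day, when the position is closed.
dates = ["23-01-2025", "24-01-2025", "27-01-2025", "28-01-2025",
         "29-01-2025", "30-01-2025", "31-01-2025"]
R = pd.DataFrame({"Sell": [0.95, 0.95, 0.98, 0.96, 0.98, 0.99, 1.09],
                  "Hold": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.09]},
                 index=dates)

Q = R.copy()
# Holding carries the discounted value of the best action on the next day:
# Q(s, Hold) = R(s, Hold) + gamma * max_a' Q(s', a')
for i in range(len(dates) - 2, -1, -1):
    Q.loc[dates[i], "Hold"] = R.loc[dates[i], "Hold"] + gamma * Q.iloc[i + 1].max()

print(Q.round(3))  # reproduces the Q-table shown above
```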
In real-world scenarios with vast state spaces, building complete Q-tables becomes impractical. To overcome this, we can use Deep Q Networks (DQNs): neural networks that approximate the Q-table from past experiences and output a Q-value for each action when given a state as input.
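As a rough illustration of the idea, a DQN can be a small feed-forward network that maps a state vector to one Q-value per action; the layer sizes, feature count, and example state below are arbitrary assumptions, not a production architecture.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector (e.g. RSI and recent returns) to one Q-value per action."""
    def __init__(self, n_features=4, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # Q-values for buy, sell, hold
        )

    def forward(self, state):
        return self.net(state)

# Greedy action for a single (hypothetical) state vector
model = DQN()
state = torch.tensor([[55.0, 0.002, -0.001, 0.004]])  # e.g. RSI and three lagged returns
action = model(state).argmax(dim=1).item()
print(action)  # index of the action with the highest predicted Q-value
```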
Experience Replay and Advanced Techniques in RL
Experience Replay
- Stores (state, action, reward, next_state) tuples in a replay buffer
- Trains the network on random batches from this buffer
- Benefits: breaks correlations between samples, improves data efficiency, stabilises training
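A minimal replay buffer sketch, assuming transitions are stored as plain tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples random batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive market observations
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```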
Double Q-Networks (DDQN)
- Uses two networks: primary for action selection, target for value estimation
- Reduces overestimation bias in Q-values
- More stable learning and better policies
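A sketch of the double-DQN target computation, assuming `online_net` and `target_net` are two copies of a Q-network such as the one above, and that `rewards`, `next_states`, and `dones` are batched tensors:

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network picks the next action, the target network values it."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)       # action selection
        next_values = target_net(next_states).gather(1, next_actions).squeeze(1)  # value estimation
    # dones is a 0/1 float tensor: no future value beyond terminal transitions
    return rewards + gamma * next_values * (1.0 - dones)
```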
Other Key Advancements
- Prioritised Experience Replay: Samples important transitions more frequently
- Dueling Networks: Separates state value and action advantage estimation
- Distributional RL: Models the entire return distribution instead of just the expected value
- Rainbow DQN: Combines multiple improvements for state-of-the-art performance
- Soft Actor-Critic: Adds entropy regularisation for robust exploration
These techniques address fundamental challenges in deep RL, improving efficiency, stability, and performance across complex environments.
Challenges in Reinforcement Learning for Trading
Type 2 Chaos
While training, the RL model works in isolation without interacting with the market. Once deployed, we don't know how it will affect the market. Type 2 chaos occurs when an observer can influence the situation they're observing. Although difficult to quantify during training, we can assume the RL model will continue learning after deployment and adjust accordingly.
Noise in Financial Data
RL models might interpret random noise in financial data as actionable signals, leading to inaccurate trading recommendations. While methods exist to remove noise, we must balance noise reduction against a potential loss of important data.
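As a simple illustration of that trade-off, exponentially weighted smoothing with different spans shows how aggressive denoising also dampens the genuine moves an agent may need to see; the price series below is simulated:

```python
import numpy as np
import pandas as pd

# Simulated noisy price series
prices = pd.Series(100 * np.cumprod(1 + np.random.normal(0, 0.01, 500)))

# A larger span removes more noise, but also delays and flattens real moves
light_smoothing = prices.ewm(span=5).mean()
heavy_smoothing = prices.ewm(span=50).mean()
```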
Conclusion
We've introduced the fundamental components of reinforcement learning systems for trading. The next step would be implementing your own RL system to backtest and paper trade using real-world market data.
For a deeper dive into RL and to create your own reinforcement learning trading strategies, consider specialised courses in Deep Reinforcement Learning on Quantra.
References & Further Readings
- Once you’re comfortable with the foundational ML concepts, you can explore advanced reinforcement learning and its role in trading through more structured learning experiences. Start with the Machine Learning & Deep Learning in Trading learning track, which offers hands-on tutorials on AI model design, data preprocessing, and financial market modelling.
- For those looking for an advanced, structured approach to quantitative trading and machine learning, the Executive Programme in Algorithmic Trading (EPAT) is an excellent choice. This program covers classical ML algorithms (such as SVM, k-means clustering, decision trees, and random forests), deep learning fundamentals (including neural networks and gradient descent), and Python-based strategy development. You will also explore statistical arbitrage using PCA, alternative data sources, and reinforcement learning applied to trading.
- Once you have mastered these concepts, you can apply your knowledge in real-world trading using Blueshift. Blueshift is an all-in-one automated trading platform that offers institutional-grade infrastructure for investment research, backtesting, and algorithmic trading. It is a fast, flexible, and reliable platform, agnostic to asset class and trading style, helping you turn your ideas into investment-worthy opportunities.
Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments, is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.