Learning to Trade via Direct Reinforcement
John Moody
International Computer Science Institute, Berkeley
&
J E Moody & Company LLC, Portland
[email protected]@JEMoody.Com
Global Derivatives Trading & Risk Management, Paris, May 2008
Global Derivatives Trading & Risk Management – May 2008

What is Reinforcement Learning?
RL Considers:
• A Goal-Directed “Learning” Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility
RL is an Active Paradigm:
• Agent “Learns” by “Trial & Error” Discovery
• Actions result in Reinforcement
RL Paradigms:
• Value Function Learning (Dynamic Programming)
• Direct Reinforcement (Adaptive Control)

I. Why Direct Reinforcement?
Direct Reinforcement Learning:
• Finds predictive structure in financial data
• Integrates Forecasting w/ Decision Making
• Balances Risk vs. Reward
• Incorporates Transaction Costs
Discover Trading Strategies!

Optimizing Trades based on Forecasts
Indirect Approach:
• Two sets of parameters
• Forecast error is not Utility
• Forecaster ignores transaction costs
• Information bottleneck

Learning to Trade via Direct Reinforcement
Trader Properties:
• One set of parameters
• A single utility function
• U includes transaction costs
• Direct mapping from inputs to actions

Direct RL Trader (USD/GBP): Return_A = 15%, SR_A = 2.3, DDR_A = 3.3

II. Direct Reinforcement: Algorithms & Illustrations
Algorithms:
• Recurrent Reinforcement Learning (RRL)
• Stochastic Direct Reinforcement (SDR)
Illustrations:
• Sensitivity to Transaction Costs
• Risk-Averse Reinforcement

Learning to Trade via Direct Reinforcement
DR Trader:
• Recurrent policy (Trading signals, Portfolio weights): $F_t = F(\theta_t; F_{t-1}, I_t)$
• Takes action, Receives reward (Trading Return w/ Transaction Costs): $R_t = R(F_t, F_{t-1}; S_t, \delta)$
• Causal performance function (Generally path-dependent): $U(R_t, R_{t-1}, \ldots, R_1)$
• Learn policy by varying $\theta$
GOAL: Maximize performance $U_T$ or marginal performance $D_t \propto \Delta U_t = U_t - U_{t-1}$

Recurrent Reinforcement Learning (RRL) (Moody & Wu 1997)
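Deterministic gradient (batch):
$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta} \right\}$$
with recursion:
$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{d\theta}$$
Stochastic gradient (on-line):
$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
stochastic recursion:
$$\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}$$
Stochastic parameter update (on-line):
$$\Delta\theta_t = \rho\, \frac{dU_t(\theta_t)}{d\theta_t}$$
Constant $\rho$: adaptive learning. Declining $\rho$: stochastic approximation.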
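
The on-line recursion and parameter update above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the policy is taken to be a single tanh unit (so the position is continuous in (-1, 1) rather than discrete), the performance function is the plain sum of returns (so $dU_t/dR_t = 1$), and the names `rrl_online`, `m`, `rho`, and `delta` are hypothetical.

```python
import numpy as np

def rrl_online(r, m=4, delta=0.001, rho=0.05, seed=0):
    """On-line RRL sketch. Returns final parameters and the per-step returns R_t."""
    r = np.asarray(r, dtype=float)
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=m + 2)    # [w_1..w_m, u, b]; u weights F_{t-1}
    F_prev = 0.0
    dF_prev = np.zeros(m + 2)                    # running estimate of dF_{t-1}/dtheta
    rewards = []
    for t in range(m - 1, len(r)):
        # Inputs I_t: the m most recent returns, the previous position, a bias term
        x = np.concatenate([r[t - m + 1:t + 1], [F_prev, 1.0]])
        F = np.tanh(theta @ x)                   # assumed policy F_t = tanh(theta . x_t)
        # Stochastic recursion: dF_t/dtheta ~ dF_t/dtheta (partial) + dF_t/dF_{t-1} * dF_{t-1}/dtheta
        dF = (1.0 - F**2) * (x + theta[m] * dF_prev)
        # Reward R_t = F_{t-1} r_t - delta |F_t - F_{t-1}| and its two partial derivatives
        R = F_prev * r[t] - delta * abs(F - F_prev)
        s = np.sign(F - F_prev)
        dR_dF, dR_dFprev = -delta * s, r[t] + delta * s
        # Assumed utility U_t = sum of returns, so dU_t/dR_t = 1
        grad = dR_dF * dF + dR_dFprev * dF_prev
        theta = theta + rho * grad               # constant rho: adaptive learning
        rewards.append(R)
        F_prev, dF_prev = F, dF
    return theta, np.array(rewards)

# Synthetic returns, purely for illustration
rng = np.random.default_rng(1)
r = 0.01 * rng.standard_normal(500)
theta, rewards = rrl_online(r)
```

Swapping the additive utility for a risk-adjusted one only changes the factor $dU_t/dR_t$ multiplying `grad`; the recursion for `dF` is unaffected.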

Structure of Traders
• Single Asset
  - Price series $z_t$
  - Return series $r_t = z_t - z_{t-1}$
• Traders
  - Discrete position size $F_t \in \{-1, 0, 1\}$
  - Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Observations: $I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$
  – Full system State is not known
• Simple Trading Returns and Profit:
  $R_t = F_{t-1} r_t - \delta\, |F_t - F_{t-1}|$
  $P_T = \sum_{t=1}^{T} R_t$
• Transaction Costs: represented by $\delta$.
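
As a small worked example of the return and profit definitions on this slide, the sketch below computes $R_t$ and $P_T$ for a short price path. The function name `trading_returns` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def trading_returns(prices, positions, delta):
    """Per-period returns R_t = F_{t-1} r_t - delta |F_t - F_{t-1}|.

    prices:    z_0, ..., z_T (length T+1)
    positions: F_0, ..., F_T (length T+1), each in {-1, 0, 1}
    delta:     transaction cost per unit change in position
    """
    z = np.asarray(prices, dtype=float)
    F = np.asarray(positions, dtype=float)
    r = np.diff(z)                               # r_t = z_t - z_{t-1}
    return F[:-1] * r - delta * np.abs(np.diff(F))

prices = [100.0, 101.0, 100.5, 102.0]
positions = [0, 1, 1, -1]                        # flat, long, long, short
R = trading_returns(prices, positions, delta=0.1)
P_T = R.sum()                                    # cumulative profit over the path
```

The position held over a period earns that period's price change, while the cost is charged at the moment the position changes, so reversing from long to short costs twice as much as entering from flat.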

Risk-Averse Reinforcement: Financial Performance Measures
Performance Functions:
• Path independent: $U_t = U(W_t)$ (Standard Utility Functions)
• Path dependent: $U_t = U(R_t, R_{t-1}, \ldots, W_0)$
Performance Ratios:
• Sharpe Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$
• Downside Deviation Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$
For Learning:
• Per-Period Returns: $R_t$
• Marginal Performance: $D_t \propto \Delta U_t = U_t - U_{t-1}$, e.g. Differential Sharpe Ratio.
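
The two performance ratios can be computed directly from a return series. The sketch below is illustrative only: the slide does not define downside deviation, so it is assumed here to be the root mean square of the negative returns, and both ratios are left unannualized with a population standard deviation.

```python
import numpy as np

def sharpe_ratio(returns):
    """Average(R_t) / Standard Deviation(R_t), unannualized."""
    R = np.asarray(returns, dtype=float)
    return R.mean() / R.std()

def downside_deviation_ratio(returns):
    """Average(R_t) / Downside Deviation(R_t).

    Assumption: downside deviation = sqrt(mean(min(R_t, 0)^2)).
    """
    R = np.asarray(returns, dtype=float)
    downside = np.sqrt(np.mean(np.minimum(R, 0.0) ** 2))
    return R.mean() / downside

R = [0.02, -0.01, 0.03, -0.02, 0.01]
sr = sharpe_ratio(R)
ddr = downside_deviation_ratio(R)
```

Unlike the Sharpe ratio, the downside deviation ratio does not penalize variability coming from gains, which is why the two can rank the same return series differently.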

Long / Short Trader Simulation: Sensitivity to Transaction Costs
• Learns from scratch and on-line
• Moving average Sharpe Ratio with $\eta = 0.01$

Trader Simulation: Transaction Costs vs. Performance
100 Runs; Costs = 0.2%, 0.5%, and 1.0%
[Figure panels: Sharpe Ratio; Trading Frequency]

Minimizing Downside Risk: Artificial Price Series w/ Heavy Tails

Comparison of Risk-Averse Traders: Underwater Curves

Comparison of Risk-Averse Traders: Draw-Downs

III. Direct Reinforcement vs. Dynamic Programming
Algorithms:
• Value Function Method (Q-Learning)
• Direct Reinforcement Learning (RRL)
Illustration:
• Asset Allocation: S&P 500 & T-Bills
• RRL vs. Q-Learning

RL Paradigms Compared
Value Function Learning
• Origins: Dynamic Programming
• Learn “optimal” Q-Function
  Q: state × action → value
• Solve Bellman’s Equation
• Action: “Indirect”
Direct Reinforcement
• Origins: Adaptive Control
• Learn “good” Policy P
  P: observations → p(action)
• Optimize “Policy Gradient”
• Action: “Direct”
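Action selection:
• Value Function Learning: $\hat{a} = \arg\max_{b} Q(x, b)$
• Direct Reinforcement: $\hat{a}$ given by the policy $P(\text{obs}, b)$ over actions $b$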

S&P-500 / T-Bill Asset Allocation: Maximizing the Differential Sharpe Ratio

S&P-500: Opening Up the Black Box
85 series: Learned relationships are nonstationary over time

Closing Remarks
• Direct Reinforcement Learning:
  – Discovers Trading Opportunities in Markets
  – Integrates Forecasting w/ Trading
  – Maximizes Risk-Adjusted Returns
  – Optimizes Trading w/ Transaction Costs
• Direct Reinforcement Offers Advantages Over:
  – Trading based on Forecasts (Supervised Learning)
  – Dynamic Programming RL (Value Function Methods)
• Illustrations:
  – Controlled Simulations
  – FX Currency Trader
  – Asset Allocation: S&P 500 vs. Cash

Selected References:
[1] John Moody and Lizhong Wu. Optimization of trading systems and portfolios. Decision Technologies for Financial Engineering, 1997.
[2] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17:441-470, 1998.
[3] Jonathan Baxter and Peter L. Bartlett. Direct gradient-based reinforcement learning: Gradient estimation algorithms. 2001.
[4] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875-889, July 2001.
[5] Carl Gold. FX Trading via Recurrent Reinforcement Learning. Proceedings of IEEE CIFEr Conference, Hong Kong, 2003.
[6] John Moody, Y. Liu, M. Saffell and K.J. Youn. Stochastic Direct Reinforcement: Application to Simple Games with Recurrence. In Artificial Multiagent Learning, Sean Luke et al. eds, AAAI Press, 2004.

Supplemental Slides
• Differential Sharpe Ratio
• Portfolio Optimization
• Stochastic Direct Reinforcement (SDR)

Maximizing the Sharpe Ratio
Sharpe Ratio:
$$S_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$$
Exponential Moving Average Sharpe Ratio:
$$S(t) = \frac{A_t}{K_\eta \,(B_t - A_t^2)^{1/2}}$$
with time scale $\eta^{-1}$ and
$$A_t = A_{t-1} + \eta\,(R_t - A_{t-1}), \qquad B_t = B_{t-1} + \eta\,(R_t^2 - B_{t-1}), \qquad K_\eta = \left(\frac{1 - \eta/2}{1 - \eta}\right)^{1/2}$$
Motivation:
• EMA Sharpe ratio emphasizes recent patterns;
• can be updated incrementally.
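
The incremental update is what makes the EMA Sharpe ratio usable on-line: each new return touches only two running moments. A minimal sketch follows, assuming both moments start at zero and that the ratio is reported as 0.0 while the variance estimate is not yet positive (neither choice is specified on the slide); `ema_sharpe` is a hypothetical name.

```python
import math

def ema_sharpe(returns, eta=0.01, A0=0.0, B0=0.0):
    """Return the sequence S(t) = A_t / (K_eta * sqrt(B_t - A_t^2))."""
    A, B = A0, B0
    K = math.sqrt((1 - eta / 2) / (1 - eta))
    history = []
    for R in returns:
        A += eta * (R - A)          # EMA of returns
        B += eta * (R * R - B)      # EMA of squared returns
        var = B - A * A
        history.append(A / (K * math.sqrt(var)) if var > 0 else 0.0)
    return history

S = ema_sharpe([0.02, -0.01, 0.03, -0.02, 0.01], eta=0.5)
```

A larger `eta` shortens the time scale $\eta^{-1}$, so the estimate reacts faster to recent returns at the price of more noise.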

Differential Sharpe Ratio for Adaptive Optimization
Expand $S(t)$ to first order in $\eta$:
$$S(t) \approx S(t-1) + \eta \left.\frac{dS(t)}{d\eta}\right|_{\eta=0} + O(\eta^2).$$
Define Differential Sharpe Ratio as:
$$D(t) \equiv \frac{dS(t)}{d\eta} = \frac{B_{t-1}\,\Delta A_t - \frac{1}{2} A_{t-1}\,\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
where
$$\Delta A_t = R_t - A_{t-1}, \qquad \Delta B_t = R_t^2 - B_{t-1}.$$
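
A short numerical sketch of the differential Sharpe ratio, evaluated from the previous moment estimates and the newest return. The function name and the sample values are illustrative assumptions.

```python
def differential_sharpe(R, A_prev, B_prev):
    """D(t) = (B_{t-1} dA_t - 0.5 A_{t-1} dB_t) / (B_{t-1} - A_{t-1}^2)^(3/2)."""
    dA = R - A_prev                  # Delta A_t = R_t - A_{t-1}
    dB = R * R - B_prev              # Delta B_t = R_t^2 - B_{t-1}
    denom = (B_prev - A_prev ** 2) ** 1.5
    return (B_prev * dA - 0.5 * A_prev * dB) / denom

# Previous moment estimates and a new return, chosen only for illustration
A_prev, B_prev, R_t = 0.01, 0.0005, 0.03
D = differential_sharpe(R_t, A_prev, B_prev)
```

Because $D(t)$ depends only on $A_{t-1}$, $B_{t-1}$, and $R_t$, it can be evaluated at every step without revisiting the return history.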

Learning with the Differential SR
Evaluate “Marginal Utility” Gradient:
$$\frac{dD(t)}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
Motivation for DSR:
• isolates contribution of $R_t$ to $U_t$ (“marginal utility”);
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
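
The marginal utility gradient can be checked numerically against the differential Sharpe ratio itself. The sketch below restates $D(t)$ so the block is self-contained and compares the closed-form gradient with a central finite difference; the function names and sample values are illustrative assumptions.

```python
def differential_sharpe(R, A_prev, B_prev):
    """D(t) from the previous moment estimates and the new return R_t."""
    dA = R - A_prev
    dB = R * R - B_prev
    return (B_prev * dA - 0.5 * A_prev * dB) / (B_prev - A_prev ** 2) ** 1.5

def dsr_gradient(R, A_prev, B_prev):
    """dD(t)/dR_t = (B_{t-1} - A_{t-1} R_t) / (B_{t-1} - A_{t-1}^2)^(3/2)."""
    return (B_prev - A_prev * R) / (B_prev - A_prev ** 2) ** 1.5

A_prev, B_prev, R_t = 0.01, 0.0005, 0.03
grad = dsr_gradient(R_t, A_prev, B_prev)

# Central finite difference of D(t) with respect to R_t
h = 1e-6
fd = (differential_sharpe(R_t + h, A_prev, B_prev)
      - differential_sharpe(R_t - h, A_prev, B_prev)) / (2 * h)
```

The gradient changes sign at $R_t = B_{t-1}/A_{t-1}$: when $A_{t-1} > 0$, returns above that level reduce $D(t)$, because they add more to the variance estimate than to the mean.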

Trader Simulation: Transaction Costs vs. Performance
100 runs; Costs = 0.2%, 0.5%, and 1.0%
[Figure panels: Trading Frequency; Cumulative Profit; Sharpe Ratio]

Portfolio Optimization (3 Securities)

Stochastic Direct Reinforcement: Probabilistic Policies

Learning to Trade
• Single Asset
  - Price series $z_t$
  - Return series $r_t = z_t - z_{t-1}$
• Trader
  - Discrete position size $a_t \in \{-1, 0, 1\}$
  - Recurrent policy $a_t = P(\theta_t; a_{t-1}, I_t)$
• Observations: $I_t = \{r_t, r_{t-1}, r_{t-2}, \ldots; i_t, i_{t-1}, i_{t-2}, \ldots\}$
  – Full system State is not known
• Simple Trading Returns and Profit:
  $R_t = a_{t-1} r_t - \delta\, |a_t - a_{t-1}|$
  $P_T = \sum_{t=1}^{T} R_t$
• Transaction cost rate $\delta$.

Why does Reinforcement need Recurrence?
Consider a learning agent with stochastic policy function whose inputs include recent observations o and actions a:
$$P(a_t;\ o_{t-1}, o_{t-2}, \ldots;\ a_{t-1}, a_{t-2}, \ldots)$$
Why should past actions (recurrence) be included?
Examples:
• Games (observations o are opponent’s actions): Model opponent’s responses o to previous actions a
• Trading financial markets: Minimize transaction costs, market impact
In General:
Recurrence enables discovery of better policies that capture an agent’s impact on the world!

Stochastic Direct Reinforcement (SDR): Maximize Performance
Expected total performance of a sequence of T actions:
$$U_T = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t) \sum_{H_{t-1}} p(a_t \mid H_{t-1})\, p(H_{t-1})$$
Maximize performance via direct gradient ascent along $\dfrac{dU_t}{d\theta_t}$.
Must evaluate total policy gradient
$$\frac{d}{d\theta}\, p(a_t) = \sum_{H_{t-1}} \frac{d}{d\theta} \left\{ p(a_t \mid H_{t-1})\, p(H_{t-1}) \right\}$$
for a policy represented by
$$P(a_t; H_{t-1}^{(n,m)}) \quad \text{with} \quad H_{t-1}^{(n,m)} = (O_{t-1}^{(n)}, A_{t-1}^{(m)})$$

Stochastic Direct Reinforcement (SDR): Maximize Performance
The goal of SDR is to maximize expected total performance of a sequence of T actions
$$U_T = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t) \sum_{H_{t-1}} p(a_t \mid H_{t-1})\, p(H_{t-1})$$
via direct gradient ascent
$$\frac{dU_T}{d\theta} = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t) \sum_{H_{t-1}} \frac{d}{d\theta} \left\{ p(a_t \mid H_{t-1})\, p(H_{t-1}) \right\}$$
Must evaluate
$$\frac{d}{d\theta}\, p(a_t) = \sum_{H_{t-1}} \frac{d}{d\theta} \left\{ p(a_t \mid H_{t-1})\, p(H_{t-1}) \right\}$$
for a policy represented by
$$P(a_t; O_{t-1}^{(n)}, A_{t-1}^{(m)})$$
Notation: The complete history is denoted $H_t = (O_t, A_t)$. $H_t^{(n,m)} = (O_t^{(n)}, A_t^{(m)})$ is a partial history of length (n,m).

Stochastic Direct Reinforcement: First Order Recurrent Policy Gradient
For first order recurrence (m=1), conditional action probability is given by the policy:
$$p(a_t \mid a_{t-1}) = P(a_t; O_{t-1}^{(n)}, a_{t-1})$$
The probabilities of current actions depend upon the probabilities of prior actions:
$$p(a_t) = \sum_{a_{t-1}} p(a_t \mid a_{t-1})\, p(a_{t-1})$$
The total (recurrent) policy gradient is computed as:
$$\frac{dp(a_t)}{d\theta} = \sum_{a_{t-1}} \left\{ \frac{\partial p(a_t \mid a_{t-1})}{\partial\theta}\, p(a_{t-1}) + p(a_t \mid a_{t-1})\, \frac{dp(a_{t-1})}{d\theta} \right\}$$
with partial (naïve) policy gradient:
$$\frac{\partial}{\partial\theta}\, p(a_t \mid a_{t-1}; \ldots) = \frac{\partial}{\partial\theta}\, P(a_t; O_{t-1}^{(n)}, a_{t-1})$$
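
The first-order recursion can be illustrated with a small tabular policy. The sketch below is an assumption-laden toy, not the SDR implementation: observations are omitted, the conditional policy $p(a_t \mid a_{t-1})$ is a softmax over a table of logits `theta[a_prev, a]`, and the initial action distribution `p0` is taken to be independent of the parameters. All names are hypothetical.

```python
import numpy as np

def softmax_rows(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def recurrent_policy_gradient(theta, p0, T):
    """Return p(a_T) and dp(a_T)/dtheta for a first-order recurrent softmax policy."""
    K = theta.shape[0]
    M = softmax_rows(theta)                  # M[i, a] = p(a_t = a | a_{t-1} = i)
    # Partial (naive) gradient: dM[i, a]/dtheta[i, j] = M[i, a] * (1[a == j] - M[i, j])
    dM = np.zeros((K, K, K, K))              # indices: [i, a, k, l] for theta[k, l]
    for i in range(K):
        dM[i, :, i, :] = np.diag(M[i]) - np.outer(M[i], M[i])
    p = np.asarray(p0, dtype=float)
    dp = np.zeros((K, K, K))                 # dp[a, k, l] = d p(a) / d theta[k, l]
    for _ in range(T):
        # Total gradient: sum over a_{t-1} of (partial * p(a_{t-1}) + p(a_t|a_{t-1}) * dp(a_{t-1}))
        dp = np.einsum('i,iakl->akl', p, dM) + np.einsum('ia,ikl->akl', M, dp)
        p = p @ M                            # p(a_t) = sum_{a_{t-1}} p(a_t|a_{t-1}) p(a_{t-1})
    return p, dp

theta = np.array([[0.2, -0.1, 0.4], [0.0, 0.3, -0.5], [-0.3, 0.1, 0.2]])
p0 = np.array([0.5, 0.3, 0.2])
p, dp = recurrent_policy_gradient(theta, p0, T=3)
```

Dropping the second `einsum` term gives the naive gradient, which ignores how the parameters shaped the distribution of earlier actions.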

SDR Trader Simulation w/ Transaction Costs

Trading Frequency vs. Transaction Costs
[Figure panels: Recurrent SDR; Non-Recurrent]

Sharpe Ratio vs. Transaction Costs
[Figure panels: Recurrent SDR; Non-Recurrent]