BioIntelligence Lab. 1
Learning to Trade via Direct Reinforcement
John Moody and Matthew Saffell, IEEE Trans Neural Networks 12(4), pp. 875-889, 2001
Summarized by Jangmin O
Author
J. Moody: Director of the Computational Finance Program and a Professor of CSEE at Oregon Graduate Institute of Science and Technology; Founder & President of Nonlinear Prediction Systems; Program Co-Chair for Computational Finance 2000; a past General Chair and Program Chair of NIPS; a member of the editorial board of Quantitative Finance
I. Introduction
Optimizing Investment Performance
Characteristic: path-dependent
Methods: Direct Reinforcement learning (DR) / Recurrent Reinforcement Learning [1, 2]; no forecasting model needed; single security or asset allocation
Recurrent Reinforcement Learning (RRL): adaptive policy search; learns the investment strategy on-line; no need to learn a value function; immediate rewards are available in financial markets
Difference between RRL & Q or TD
The financial decision-making problem is well suited to RRL:
Immediate feedback is available
Performance criteria are risk-adjusted investment returns: Sharpe ratio, downside risk minimization
Differential form
Experimental Data
U.S. Dollar/British Pound foreign exchange market
S&P 500 Stock Index and Treasury Bills
RRL vs. Q: Bellman's curse of dimensionality
II. Trading Systems and Performance Criteria
Structure of Trading Systems (1)
Assumptions on the agent:
Trades a fixed position size in a single market
Trader at time t: $F_t \in \{+1, 0, -1\}$ (Long: buy, Neutral: stay out, Short: short sale)
Profit $R_t$: realized at the end of the period $(t-1, t]$; the profit/loss of holding position $F_{t-1}$, plus the transaction cost of moving from position $F_{t-1}$ to $F_t$
A recurrent structure is necessary in order to make decisions that account for transaction costs, market impact, taxes, etc.
Structure of Trading Systems (2)
A single asset trading system:
$$F_t = F(\theta_t; F_{t-1}, I_t), \qquad I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$$
$\theta_t$: system parameters at time t
$I_t$: information set at time t
$z_t$: price series; $y_t$: other external variable series
Simple example:
$$F_t = \mathrm{sign}(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + \cdots + v_m r_{t-m} + w)$$
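As an illustrative sketch of the threshold rule above (the weights u, v, w and the toy inputs are made up, not taken from the paper's experiments):

```python
import numpy as np

def trade_signal(u, v, w, F_prev, returns):
    """Recurrent rule F_t = sign(u*F_{t-1} + v_0*r_t + ... + v_m*r_{t-m} + w).

    returns: the m+1 most recent returns, newest first.
    """
    return np.sign(u * F_prev + np.dot(v, returns) + w)

# Toy usage: recent negative returns flip a long position to short.
F = trade_signal(u=0.1, v=np.array([0.5, 0.3, 0.2]), w=0.0,
                 F_prev=1.0, returns=np.array([-1.0, -0.5, 0.2]))
```

Note the recurrence: the previous position F_{t-1} is itself an input, which is what forces the recurrent (rather than feedforward) training procedure.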
Profit and Wealth for Trading Systems (1)
Performance function $U(\cdot)$ for a risk-insensitive trader = profit
Additive profits: for trading a fixed number of shares or contracts of a security
$r_t = z_t - z_{t-1}$: return of the risky asset
$r_t^f$: return of the risk-free asset (e.g., T-bills)
$\delta$: transaction cost rate
Trader's wealth: $W_T = W_0 + P_T$
$$P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left\{ r_t^f + F_{t-1}(r_t - r_t^f) - \delta |F_t - F_{t-1}| \right\}$$
Profit and Wealth for Trading Systems (2)
Multiplicative profits: a fixed fraction $\mu > 0$ of accumulated wealth is invested; $r_t = (z_t / z_{t-1} - 1)$
In case of no short sales, when $\mu = 1$:
$$W_T = W_0 \prod_{t=1}^{T} (1 + R_t)$$
$$1 + R_t = \left\{ 1 + (1 - F_{t-1}) r_t^f + F_{t-1} r_t \right\} \left( 1 - \delta |F_t - F_{t-1}| \right)$$
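A minimal sketch of both profit definitions above, assuming $\mu = 1$ and a tiny made-up position/return series:

```python
import numpy as np

def additive_profit(F, r, r_f, delta, mu=1.0):
    """P_T = sum_t mu*{r_f + F_{t-1}*(r_t - r_f) - delta*|F_t - F_{t-1}|}.

    F: positions F_0..F_T; r: risky returns r_1..r_T; r_f: risk-free return.
    """
    F, r = np.asarray(F, float), np.asarray(r, float)
    R = mu * (r_f + F[:-1] * (r - r_f) - delta * np.abs(np.diff(F)))
    return R.sum()

def multiplicative_wealth(F, r, r_f, delta, W0=1.0):
    """W_T = W_0 * prod_t {1 + (1-F_{t-1})*r_f + F_{t-1}*r_t}
                        * (1 - delta*|F_t - F_{t-1}|)."""
    F, r = np.asarray(F, float), np.asarray(r, float)
    growth = (1.0 + (1.0 - F[:-1]) * r_f + F[:-1] * r) \
             * (1.0 - delta * np.abs(np.diff(F)))
    return W0 * np.prod(growth)

# Toy usage: enter long at t=1 (one position change), then hold.
P = additive_profit(F=[0, 1, 1], r=[0.02, 0.01], r_f=0.001, delta=0.005)
W = multiplicative_wealth(F=[0, 1, 1], r=[0.02, 0.01], r_f=0.001, delta=0.005)
```

In both forms the $|F_t - F_{t-1}|$ term charges transaction costs only when the position actually changes.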
Performance Criteria
$U_T$ in the general form $U(R_T, \ldots, R_t, \ldots, R_2, R_1; W_0)$
Simple form $U(W_T)$: a standard economic utility
Path-dependent performance functions: Sharpe ratio etc.
Moody's interest: the marginal increase of $U_t$ caused by $R_t$ at each time step
Differential performance criterion:
$$D_t \equiv U_t - U_{t-1}$$
Differential Sharpe Ratio (1)
Sharpe ratio: risk-adjusted return
$$S_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$$
Differential Sharpe ratio: on-line learning requires computing the influence of $R_t$ at time t, so exponential moving averages are used.
First-order Taylor expansion in the adaptation rate $\eta$:
$$S_t \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2)$$
When $\eta = 0$, $S_t = S_{t-1}$.
Differential Sharpe Ratio (2)
Exponential moving averages with adaptation rate $\eta$:
$$A_t = A_{t-1} + \eta (R_t - A_{t-1}) = A_{t-1} + \eta\, \Delta A_t$$
$$B_t = B_{t-1} + \eta (R_t^2 - B_{t-1}) = B_{t-1} + \eta\, \Delta B_t$$
Sharpe ratio (with normalizing constant $K_\eta$):
$$S_t = \frac{A_t}{K_\eta (B_t - A_t^2)^{1/2}}$$
From the Taylor expansion:
$$D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta=0} = \frac{B_{t-1}\, \Delta A_t - \frac{1}{2} A_{t-1}\, \Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
$R_t > A_{t-1}$: increased reward
$R_t^2 > B_{t-1}$: increased risk
Differential Sharpe Ratio (3)
Derivative with respect to $R_t$:
$$\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
$D_t$ is maximized at $R_t = B_{t-1}/A_{t-1}$.
Meaning of the differential Sharpe ratio:
Makes on-line learning possible: easily computed from $A_{t-1}$ and $B_{t-1}$
Recursive updating is possible
Strong weighting of recent returns
Interpretability: the contribution of $R_t$ becomes visible
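The moving-average updates and $D_t$ above can be sketched as a single on-line step (the initial values of the moving averages and the toy return are illustrative):

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01):
    """One step of the differential Sharpe ratio:
    D_t = (B_{t-1}*dA - 0.5*A_{t-1}*dB) / (B_{t-1} - A_{t-1}^2)**1.5,
    then update the moving averages A and B with adaptation rate eta."""
    dA = R_t - A_prev
    dB = R_t ** 2 - B_prev
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / (B_prev - A_prev ** 2) ** 1.5
    return D_t, A_prev + eta * dA, B_prev + eta * dB

# A return above the running mean A_prev yields a positive D_t (reward up).
D, A, B = differential_sharpe(R_t=0.02, A_prev=0.01, B_prev=0.001)
```

Only $A_{t-1}$ and $B_{t-1}$ are carried between steps, which is what makes the criterion cheap enough for on-line learning.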
III. Learning to Trade
Reinforcement Framework
RL: maximizing the expected reward; trial-and-error exploration of the environment
Comparison with supervised learning [1, 2]: problematic with transaction costs; structural credit assignment vs. temporal credit assignment
Types of RL: DR (policy search), Q-learning (value function), actor-critic methods
Recurrent Reinforcement Learning (1)
Goal: for the trading system $F_t(\theta)$, find the parameters $\theta$ that maximize $U_T$
Example trading system:
$$F_t = F(\theta_t; F_{t-1}, I_t), \qquad I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$$
Trading return:
$$P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left( F_{t-1} r_t - \delta |F_t - F_{t-1}| \right)$$
Gradient after a sequence of T periods:
$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\}$$
Recurrent Reinforcement Learning (2)
Training technique: back-propagation through time (BPTT)
Temporal dependencies:
$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac{dF_{t-1}}{d\theta}$$
Stochastic (on-line) version: keep only the terms involving $R_t$:
$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
With the differential performance criterion $D_t$:
$$\frac{dD_t(\theta_t)}{d\theta_t} \approx \frac{dD_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
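A hedged sketch of one such on-line update for a one-layer tanh trader whose input vector ends with $F_{t-1}$ (this parameterization and all the constants are illustrative, not the paper's exact system):

```python
import numpy as np

def rrl_step(theta, x_t, F_prev, dFprev_dtheta, r_t, dDdR,
             mu=1.0, delta=0.001, rho=0.1):
    """One on-line RRL update for F_t = tanh(theta . x_t), x_t[-1] = F_{t-1}.

    Uses R_t = mu*(F_{t-1}*r_t - delta*|F_t - F_{t-1}|) and the recurrence
    dF_t/dtheta = partial dF_t/dtheta + (dF_t/dF_{t-1}) * dF_{t-1}/dtheta.
    """
    F_t = np.tanh(theta @ x_t)
    sech2 = 1.0 - F_t ** 2                        # derivative of tanh
    dF = sech2 * (x_t + theta[-1] * dFprev_dtheta)
    sgn = np.sign(F_t - F_prev)                   # subgradient of |.| (0 at 0)
    dR_dFt = -mu * delta * sgn
    dR_dFprev = mu * (r_t + delta * sgn)
    grad = dDdR * (dR_dFt * dF + dR_dFprev * dFprev_dtheta)
    return theta + rho * grad, F_t, dF            # gradient ascent on D_t

theta = np.array([0.1, -0.2, 0.05])
x_t = np.array([0.01, -0.02, 1.0])                # last entry is F_{t-1}
theta2, F_t, dF = rrl_step(theta, x_t, F_prev=1.0,
                           dFprev_dtheta=np.zeros(3), r_t=0.01, dDdR=1.0)
```

The returned dF must be fed back in as dFprev_dtheta on the next call; that carried derivative is the "recurrent" part of RRL.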
Recurrent Reinforcement Learning (3)
Reminder: Moody focuses on optimizing $D_t$, an immediate measure of the effect of a particular action
[1, 2]: portfolio optimization, etc.
Value Function (1)
Implicitly learning correct actions through value iteration
Value function: the discounted future reward received from state x when following policy $\pi$:
$$V^{\pi}(x) = \sum_a \pi(x, a) \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V^{\pi}(y) \right\}$$
$\pi(x, a)$: probability of taking action a in state x
$p_{xy}(a)$: probability of the state transition $x \to y$ under action a
$D(x, y, a)$: immediate reward for the transition $x \to y$ under action a
$\gamma$: discount factor between future and immediate rewards
Value Function (2)
Optimal value function & Bellman's optimality equation:
$$V^*(x) = \max_{\pi} V^{\pi}(x)$$
$$V^*(x) = \max_a \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V^*(y) \right\}$$
Value iteration update (converges to the optimal solution):
$$V_{t+1}(x) = \max_a \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V_t(y) \right\}$$
Optimal policy:
$$a^*(x) = \arg\max_a \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V^*(y) \right\}$$
Q-Learning
Q-function: the expected future reward of taking the current action in the current state:
$$Q^*(x, a) = \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma \max_b Q^*(y, b) \right\}$$
Value iteration update (converges to the optimal Q-function):
$$Q_{t+1}(x, a) = \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma \max_b Q_t(y, b) \right\}$$
Calculating the best action requires no knowledge of $p_{xy}(a)$:
$$a^*(x) = \arg\max_a Q^*(x, a)$$
Error function of the function approximator (e.g., a neural network):
$$\frac{1}{2} \left( D(x, y, a) + \gamma \max_b Q(y, b) - Q(x, a) \right)^2$$
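A minimal tabular sketch of the Q value-iteration update above, on a made-up two-state MDP (the states, transitions, and rewards are invented for illustration):

```python
import numpy as np

def q_value_iteration(p, D, gamma=0.9, iters=200):
    """Q_{t+1}(x,a) = sum_y p[x,a,y] * (D[x,a,y] + gamma * max_b Q_t(y,b))."""
    n_x, n_a, _ = p.shape
    Q = np.zeros((n_x, n_a))
    for _ in range(iters):
        V = Q.max(axis=1)                         # max_b Q_t(y, b), per state y
        Q = (p * (D + gamma * V[None, None, :])).sum(axis=2)
    return Q

# Toy MDP: in state 0, action 1 pays 1 and moves to absorbing state 1.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = 1.0                                  # action 0 in state 0: stay
p[0, 1, 1] = 1.0                                  # action 1 in state 0: leave
p[1, 0, 1] = p[1, 1, 1] = 1.0                     # state 1 is absorbing
D = np.zeros((2, 2, 2))
D[0, 1, 1] = 1.0                                  # reward for the 0 -> 1 move
Q = q_value_iteration(p, D)
policy = Q.argmax(axis=1)                         # greedy action per state
```

The final argmax is the point made on the slide: once Q is learned, the best action falls out without ever consulting $p_{xy}(a)$.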
IV. Empirical Results
1. Artificial price series
2. U.S. Dollar/British Pound Exchange rate
3. Monthly S&P 500 stock index
A trading system based on DR
Artificial Price Series
Data: autoregressive trend processes, 10,000 samples
$$p(t) = p(t-1) + \beta(t-1) + \kappa\, \epsilon(t)$$
$$\beta(t) = \alpha\, \beta(t-1) + \nu(t)$$
$$z(t) = \exp(p(t)/R)$$
Questions examined: Is RRL suitable as a tool for learning trading strategies? How does the number of trades change as the transaction cost increases?
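The trend process above can be simulated directly; the constants $\alpha$, $\kappa$, $R$ below are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def ar_trend_prices(n=10000, alpha=0.9, kappa=3.0, R=40.0, seed=0):
    """Autoregressive trend process:
    p(t) = p(t-1) + beta(t-1) + kappa*eps(t),
    beta(t) = alpha*beta(t-1) + nu(t), with prices z(t) = exp(p(t)/R)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)                  # price noise eps(t)
    nu = rng.standard_normal(n)                   # trend noise nu(t)
    p = np.zeros(n)
    beta = np.zeros(n)
    for t in range(1, n):
        beta[t] = alpha * beta[t - 1] + nu[t]
        p[t] = p[t - 1] + beta[t - 1] + kappa * eps[t]
    return np.exp(p / R)

z = ar_trend_prices(n=1000)                       # shorter series for a demo
```

The exponentiation keeps prices positive while the latent log-price carries a slowly-decaying trend, which gives the trader something learnable.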
[Figure: trading the 10,000-sample artificial series; {long, short} positions only; performance degrades over a span of roughly 2,000 periods]
[Figure: zoom of periods 9,000 onward; $\delta = 0.01$]
[Figure: number of trades, cumulative profit, and Sharpe ratio over 100 trials; 100 epochs of training plus on-line adaptation; transaction costs 0.2%, 0.5%, 1%]
U.S. Dollar/British Pound Foreign Exchange Trading
{long, neutral, short} trading system
30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data; traded 24 hours a day, 5 days a week; January-August 1996
Strategy: train on 2,000 data points, trade over the next 480 data points (2 weeks), then slide the window forward and retrain
Results: annualized 15% return with an annualized Sharpe ratio of 2.3; one trade per 5 hours on average
Not considered: peaked trading activity; market illiquidity
S&P 500/T-Bill Asset Allocation (1)
Overview:
Long position: a position in the S&P 500; no T-Bill interest is earned
Short position: earns twice the T-Bill rate
Dividends are reinvested (T-Bill interest, S&P 500 dividends)
S&P 500/T-Bill Asset Allocation (2)
Simulation:
Data (1950-1994): initial training (through 1969) + test (1970 on)
Training window: 10 years of training + 10 years of validation
Input features: 84 (financial + macroeconomic) series
RRL-trader: a single tanh unit, weight decay
Q-trader: bootstrap samples; 2-layer FNN (30 tanh units); bias/variance trade-off: selected among models with 10, 20, 30, 40 units
Voting methods
RRL: 30 runs; Q: 10 runs; transaction cost 0.5%; profits reinvested; multiplicative profit ratio
Buy and Hold: 1348%; Q-Trader: 3359%; RRL-Trader: 5860%
Underlying premise: over the 25 years 1970-1994, the U.S. stock/treasury markets were predictable.
[Figure annotations: oil shock, monetary tightening, market correction, market crash, Gulf War]
Statistically significant
Sensitivity Analysis
$$S_i = \left| \frac{dF}{dx_i} \right| \Big/ \max_j \left| \frac{dF}{dx_j} \right|$$
Most sensitive input: inflation expectations
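The normalized sensitivity above can be approximated for any trained model F by finite differences (the linear toy model here is purely illustrative):

```python
import numpy as np

def sensitivities(F, x, h=1e-5):
    """S_i = |dF/dx_i| / max_j |dF/dx_j|, via central finite differences."""
    x = np.asarray(x, float)
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2 * h)   # central difference
    g = np.abs(g)
    return g / g.max()

# Toy model: F is 3x more sensitive to x_0 than to x_1 at this point.
S = sensitivities(lambda v: 3.0 * v[0] + 1.0 * v[1], np.array([0.5, 0.5]))
```

Normalizing by the largest gradient magnitude puts every input on a 0-1 scale, so the dominant series (inflation expectations, in the slide) stands out directly.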
V. Learn the Policy or Learn the Value?
Immediate vs. Future Rewards
Reinforcement signal: immediate (RRL) or delayed (Q-learning, dynamic programming, or TD)
RRL: the policy is represented directly; learning a value function is bypassed
Q: the policy is represented indirectly
Policies vs. Values
Some limitations of the value function approach:
The original formulation of Q-learning assumes discrete action and state spaces
Curse of dimensionality
Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy
Large-scale noise and non-stationarity may cause severe problems
RRL's advantages:
The policy is represented directly: a simpler functional form is sufficient
Can produce real-valued actions
More robust in noisy environments; quick adaptation to non-stationarity
An Example
A simple trading system that {buys, sells} a single asset. Assumption: $r_{t+1}$ is known in advance.
No need for future rewards: $\gamma = 0$
The policy function is trivial: $a_t = \mathrm{sign}(r_{t+1})$; a single tanh unit is sufficient
The value function must be able to represent an XOR mapping: 2 tanh units are needed
Conclusion
How to train trading systems via DR: the RRL algorithm; the differential Sharpe ratio and differential downside deviation ratio
RRL is more efficient than Q-learning in the financial domain.