BioIntelligence Lab. 1
Learning to Trade via Direct Reinforcement
John Moody and Matthew Saffell, IEEE Trans Neural Networks 12(4), pp. 875-889, 2001
Summarized by Jangmin O
Author
J. Moody: Director of the Computational Finance Program and a Professor of CSEE at Oregon Graduate Institute of Science and Technology; Founder & President of Nonlinear Prediction Systems; Program Co-Chair for Computational Finance 2000; a past General Chair and Program Chair of NIPS; a member of the editorial board of Quantitative Finance
I. Introduction
Optimizing Investment Performance
Characteristic: path-dependent
Methods: Direct Reinforcement learning (DR) / Recurrent Reinforcement Learning [1, 2]; no forecasting model needed; single security or asset allocation
Recurrent Reinforcement Learning (RRL): adaptive policy search; learns the investment strategy on-line; no need to learn a value function; immediate rewards are available in financial markets
Difference between RRL & Q or TD
The financial decision-making problem is well suited to RRL:
Immediate feedback is available
Performance criteria are risk-adjusted investment returns: Sharpe ratio, downside risk minimization
Differential form
Experimental Data
U.S. Dollar/British Pound foreign exchange market
S&P 500 Stock Index and Treasury Bills
RRL vs. Q: Bellman's curse of dimensionality
II. Trading Systems and Performance Criteria
Structure of Trading Systems (1)
Assumptions on the agent:
Trades a fixed position size in a single market
Trader at time t: $F_t \in \{+1, 0, -1\}$ (Long: buy, Neutral: stay out, Short: short sale)
Profit $R_t$: realized at the end of the period $(t-1, t]$; the profit/loss of holding position $F_{t-1}$, plus the transaction cost of moving from position $F_{t-1}$ to $F_t$
A recurrent structure is necessary in order to make decisions that account for transaction costs, market impact, taxes, etc.
Structure of Trading Systems (2)
A single asset trading system:
$$F_t = F(\theta_t; F_{t-1}, I_t), \qquad I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$$
$\theta_t$: system parameters at time t
$I_t$: information set at time t
$z_t$: price series; $y_t$: other external variable series
Simple example:
$$F_t = \mathrm{sign}(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + \cdots + v_m r_{t-m} + w)$$
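As an illustrative sketch of the threshold rule above (the weights u, v, w and the toy inputs are made up, not taken from the paper's experiments):

```python
import numpy as np

def trade_signal(u, v, w, F_prev, returns):
    """Recurrent rule F_t = sign(u*F_{t-1} + v_0*r_t + ... + v_m*r_{t-m} + w).

    returns: the m+1 most recent returns, newest first.
    """
    return np.sign(u * F_prev + np.dot(v, returns) + w)

# Toy usage: recent negative returns flip a long position to short.
F = trade_signal(u=0.1, v=np.array([0.5, 0.3, 0.2]), w=0.0,
                 F_prev=1.0, returns=np.array([-1.0, -0.5, 0.2]))
```

Note the recurrence: the previous position F_{t-1} is itself an input, which is what forces the recurrent (rather than feedforward) training procedure.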
Profit and Wealth for Trading Systems (1)
Performance function $U(\cdot)$ for a risk-insensitive trader = profit
Additive profits: for trading a fixed number of shares or contracts of a security
$r_t = z_t - z_{t-1}$: return of the risky asset
$r_t^f$: return of the risk-free asset (e.g., T-bills)
$\delta$: transaction cost rate
Trader's wealth: $W_T = W_0 + P_T$
$$P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left\{ r_t^f + F_{t-1}(r_t - r_t^f) - \delta |F_t - F_{t-1}| \right\}$$
Profit and Wealth for Trading Systems (2)
Multiplicative profits: a fixed fraction $\mu > 0$ of accumulated wealth is invested; $r_t = (z_t / z_{t-1} - 1)$
In case of no short sales, when $\mu = 1$:
$$W_T = W_0 \prod_{t=1}^{T} (1 + R_t)$$
$$1 + R_t = \left\{ 1 + (1 - F_{t-1}) r_t^f + F_{t-1} r_t \right\} \left( 1 - \delta |F_t - F_{t-1}| \right)$$
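A minimal sketch of both profit definitions above, assuming $\mu = 1$ and a tiny made-up position/return series:

```python
import numpy as np

def additive_profit(F, r, r_f, delta, mu=1.0):
    """P_T = sum_t mu*{r_f + F_{t-1}*(r_t - r_f) - delta*|F_t - F_{t-1}|}.

    F: positions F_0..F_T; r: risky returns r_1..r_T; r_f: risk-free return.
    """
    F, r = np.asarray(F, float), np.asarray(r, float)
    R = mu * (r_f + F[:-1] * (r - r_f) - delta * np.abs(np.diff(F)))
    return R.sum()

def multiplicative_wealth(F, r, r_f, delta, W0=1.0):
    """W_T = W_0 * prod_t {1 + (1-F_{t-1})*r_f + F_{t-1}*r_t}
                        * (1 - delta*|F_t - F_{t-1}|)."""
    F, r = np.asarray(F, float), np.asarray(r, float)
    growth = (1.0 + (1.0 - F[:-1]) * r_f + F[:-1] * r) \
             * (1.0 - delta * np.abs(np.diff(F)))
    return W0 * np.prod(growth)

# Toy usage: enter long at t=1 (one position change), then hold.
P = additive_profit(F=[0, 1, 1], r=[0.02, 0.01], r_f=0.001, delta=0.005)
W = multiplicative_wealth(F=[0, 1, 1], r=[0.02, 0.01], r_f=0.001, delta=0.005)
```

In both forms the $|F_t - F_{t-1}|$ term charges transaction costs only when the position actually changes.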
Performance Criteria
$U_T$ in the general form $U(R_T, \ldots, R_t, \ldots, R_2, R_1; W_0)$
Simple form $U(W_T)$: a standard economic utility
Path-dependent performance functions: Sharpe ratio etc.
Moody's interest: the marginal increase of $U_t$ caused by $R_t$ at each time step
Differential performance criterion:
$$D_t \equiv U_t - U_{t-1}$$
Differential Sharpe Ratio (1)
Sharpe ratio: risk-adjusted return
$$S_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$$
Differential Sharpe ratio: on-line learning requires computing the influence of $R_t$ at time t, so exponential moving averages are used.
First-order Taylor expansion in the adaptation rate $\eta$:
$$S_t \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2)$$
When $\eta = 0$, $S_t = S_{t-1}$.
Differential Sharpe Ratio (2)
Exponential moving averages with adaptation rate $\eta$:
$$A_t = A_{t-1} + \eta (R_t - A_{t-1}) = A_{t-1} + \eta\, \Delta A_t$$
$$B_t = B_{t-1} + \eta (R_t^2 - B_{t-1}) = B_{t-1} + \eta\, \Delta B_t$$
Sharpe ratio (with normalizing constant $K_\eta$):
$$S_t = \frac{A_t}{K_\eta (B_t - A_t^2)^{1/2}}$$
From the Taylor expansion:
$$D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta=0} = \frac{B_{t-1}\, \Delta A_t - \frac{1}{2} A_{t-1}\, \Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
$R_t > A_{t-1}$: increased reward
$R_t^2 > B_{t-1}$: increased risk
Differential Sharpe Ratio (3)
Derivative with respect to $R_t$:
$$\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
$D_t$ is maximized at $R_t = B_{t-1}/A_{t-1}$.
Meaning of the differential Sharpe ratio:
Makes on-line learning possible: easily computed from $A_{t-1}$ and $B_{t-1}$
Recursive updating is possible
Strong weighting of recent returns
Interpretability: the contribution of $R_t$ becomes visible
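The moving-average updates and $D_t$ above can be sketched as a single on-line step (the initial values of the moving averages and the toy return are illustrative):

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01):
    """One step of the differential Sharpe ratio:
    D_t = (B_{t-1}*dA - 0.5*A_{t-1}*dB) / (B_{t-1} - A_{t-1}^2)**1.5,
    then update the moving averages A and B with adaptation rate eta."""
    dA = R_t - A_prev
    dB = R_t ** 2 - B_prev
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / (B_prev - A_prev ** 2) ** 1.5
    return D_t, A_prev + eta * dA, B_prev + eta * dB

# A return above the running mean A_prev yields a positive D_t (reward up).
D, A, B = differential_sharpe(R_t=0.02, A_prev=0.01, B_prev=0.001)
```

Only $A_{t-1}$ and $B_{t-1}$ are carried between steps, which is what makes the criterion cheap enough for on-line learning.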
III. Learning to Trade
Reinforcement Framework
RL: maximizing the expected reward; trial-and-error exploration of the environment
Comparison with supervised learning [1, 2]: problematic with transaction costs; structural credit assignment vs. temporal credit assignment
Types of RL: DR (policy search), Q-learning (value function), actor-critic methods
Recurrent Reinforcement Learning (1)
Goal: for the trading system $F_t(\theta)$, find the parameters $\theta$ that maximize $U_T$
Example trading system:
$$F_t = F(\theta_t; F_{t-1}, I_t), \qquad I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$$
Trading return:
$$P_T = \sum_{t=1}^{T} R_t, \qquad R_t = \mu \left( F_{t-1} r_t - \delta |F_t - F_{t-1}| \right)$$
Gradient after a sequence of T periods:
$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\}$$
Recurrent Reinforcement Learning (2)
Training technique: back-propagation through time (BPTT)
Temporal dependencies:
$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac{dF_{t-1}}{d\theta}$$
Stochastic (on-line) version: keep only the terms involving $R_t$:
$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
With the differential performance criterion $D_t$:
$$\frac{dD_t(\theta_t)}{d\theta_t} \approx \frac{dD_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
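A hedged sketch of one such on-line update for a one-layer tanh trader whose input vector ends with $F_{t-1}$ (this parameterization and all the constants are illustrative, not the paper's exact system):

```python
import numpy as np

def rrl_step(theta, x_t, F_prev, dFprev_dtheta, r_t, dDdR,
             mu=1.0, delta=0.001, rho=0.1):
    """One on-line RRL update for F_t = tanh(theta . x_t), x_t[-1] = F_{t-1}.

    Uses R_t = mu*(F_{t-1}*r_t - delta*|F_t - F_{t-1}|) and the recurrence
    dF_t/dtheta = partial dF_t/dtheta + (dF_t/dF_{t-1}) * dF_{t-1}/dtheta.
    """
    F_t = np.tanh(theta @ x_t)
    sech2 = 1.0 - F_t ** 2                        # derivative of tanh
    dF = sech2 * (x_t + theta[-1] * dFprev_dtheta)
    sgn = np.sign(F_t - F_prev)                   # subgradient of |.| (0 at 0)
    dR_dFt = -mu * delta * sgn
    dR_dFprev = mu * (r_t + delta * sgn)
    grad = dDdR * (dR_dFt * dF + dR_dFprev * dFprev_dtheta)
    return theta + rho * grad, F_t, dF            # gradient ascent on D_t

theta = np.array([0.1, -0.2, 0.05])
x_t = np.array([0.01, -0.02, 1.0])                # last entry is F_{t-1}
theta2, F_t, dF = rrl_step(theta, x_t, F_prev=1.0,
                           dFprev_dtheta=np.zeros(3), r_t=0.01, dDdR=1.0)
```

The returned dF must be fed back in as dFprev_dtheta on the next call; that carried derivative is the "recurrent" part of RRL.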
Recurrent Reinforcement Learning (3)
Reminder: Moody focuses on optimizing $D_t$, an immediate measure of the effect of a particular action
[1, 2]: portfolio optimization, etc.
Value Function (1)
Implicitly learning correct actions through value iteration
Value function: the discounted future reward received from state x when following policy $\pi$:
$$V^{\pi}(x) = \sum_a \pi(x, a) \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V^{\pi}(y) \right\}$$
$\pi(x, a)$: probability of taking action a in state x
$p_{xy}(a)$: probability of the state transition $x \to y$ under action a
$D(x, y, a)$: immediate reward for the transition $x \to y$ under action a
$\gamma$: discount factor between future and immediate rewards
Value Function (2)
Optimal value function & Bellman's optimality equation:
$$V^*(x) = \max_{\pi} V^{\pi}(x)$$
$$V^*(x) = \max_a \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V^*(y) \right\}$$
Value iteration update (converges to the optimal solution):
$$V_{t+1}(x) = \max_a \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V_t(y) \right\}$$
Optimal policy:
$$a^*(x) = \arg\max_a \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma V^*(y) \right\}$$
Q-Learning
Q-function: the expected future reward of taking the current action in the current state:
$$Q^*(x, a) = \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma \max_b Q^*(y, b) \right\}$$
Value iteration update (converges to the optimal Q-function):
$$Q_{t+1}(x, a) = \sum_y p_{xy}(a) \left\{ D(x, y, a) + \gamma \max_b Q_t(y, b) \right\}$$
Calculating the best action requires no knowledge of $p_{xy}(a)$:
$$a^*(x) = \arg\max_a Q^*(x, a)$$
Error function of the function approximator (e.g., a neural network):
$$\frac{1}{2} \left( D(x, y, a) + \gamma \max_b Q(y, b) - Q(x, a) \right)^2$$
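A minimal tabular sketch of the Q value-iteration update above, on a made-up two-state MDP (the states, transitions, and rewards are invented for illustration):

```python
import numpy as np

def q_value_iteration(p, D, gamma=0.9, iters=200):
    """Q_{t+1}(x,a) = sum_y p[x,a,y] * (D[x,a,y] + gamma * max_b Q_t(y,b))."""
    n_x, n_a, _ = p.shape
    Q = np.zeros((n_x, n_a))
    for _ in range(iters):
        V = Q.max(axis=1)                         # max_b Q_t(y, b), per state y
        Q = (p * (D + gamma * V[None, None, :])).sum(axis=2)
    return Q

# Toy MDP: in state 0, action 1 pays 1 and moves to absorbing state 1.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = 1.0                                  # action 0 in state 0: stay
p[0, 1, 1] = 1.0                                  # action 1 in state 0: leave
p[1, 0, 1] = p[1, 1, 1] = 1.0                     # state 1 is absorbing
D = np.zeros((2, 2, 2))
D[0, 1, 1] = 1.0                                  # reward for the 0 -> 1 move
Q = q_value_iteration(p, D)
policy = Q.argmax(axis=1)                         # greedy action per state
```

The final argmax is the point made on the slide: once Q is learned, the best action falls out without ever consulting $p_{xy}(a)$.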
IV. Empirical Results
1. Artificial price series
2. U.S. Dollar/British Pound Exchange rate
3. Monthly S&P 500 stock index
A trading system based on DR
Artificial Price Series
Data: autoregressive trend processes, 10,000 samples
$$p(t) = p(t-1) + \beta(t-1) + \kappa\, \epsilon(t)$$
$$\beta(t) = \alpha\, \beta(t-1) + \nu(t)$$
$$z(t) = \exp(p(t)/R)$$
Questions examined: Is RRL suitable as a tool for learning trading strategies? How does the number of trades change as the transaction cost increases?
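The trend process above can be simulated directly; the constants $\alpha$, $\kappa$, $R$ below are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def ar_trend_prices(n=10000, alpha=0.9, kappa=3.0, R=40.0, seed=0):
    """Autoregressive trend process:
    p(t) = p(t-1) + beta(t-1) + kappa*eps(t),
    beta(t) = alpha*beta(t-1) + nu(t), with prices z(t) = exp(p(t)/R)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)                  # price noise eps(t)
    nu = rng.standard_normal(n)                   # trend noise nu(t)
    p = np.zeros(n)
    beta = np.zeros(n)
    for t in range(1, n):
        beta[t] = alpha * beta[t - 1] + nu[t]
        p[t] = p[t - 1] + beta[t - 1] + kappa * eps[t]
    return np.exp(p / R)

z = ar_trend_prices(n=1000)                       # shorter series for a demo
```

The exponentiation keeps prices positive while the latent log-price carries a slowly-decaying trend, which gives the trader something learnable.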
[Figure: trading the 10,000-sample artificial series; {long, short} positions only; performance degrades over a span of roughly 2,000 periods]
[Figure: zoom of periods 9,000 onward; $\delta = 0.01$]
[Figure: number of trades, cumulative profit, and Sharpe ratio over 100 trials; 100 epochs of training plus on-line adaptation; transaction costs 0.2%, 0.5%, 1%]
U.S. Dollar/British Pound Foreign Exchange Trading
{long, neutral, short} trading system
30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data; traded 24 hours a day, 5 days a week; January-August 1996
Strategy: train on 2,000 data points, trade over the next 480 data points (2 weeks), then slide the window forward and retrain
Results: annualized 15% return with an annualized Sharpe ratio of 2.3; one trade per 5 hours on average
Not considered: peaked trading activity; market illiquidity
S&P 500/T-Bill Asset Allocation (1)
Overview:
Long position: a position in the S&P 500; no T-Bill interest is earned
Short position: earns twice the T-Bill rate
Dividends are reinvested (T-Bill interest, S&P 500 dividends)
S&P 500/T-Bill Asset Allocation (2)
Simulation:
Data (1950-1994): initial training (through 1969) + test (1970 on)
Training window: 10 years of training + 10 years of validation
Input features: 84 (financial + macroeconomic) series
RRL-trader: a single tanh unit, weight decay
Q-trader: bootstrap samples; 2-layer FNN (30 tanh units); bias/variance trade-off: selected among models with 10, 20, 30, 40 units
Voting methods
RRL: 30 runs; Q: 10 runs; transaction cost 0.5%; profits reinvested; multiplicative profit ratio
Buy and Hold: 1348%; Q-Trader: 3359%; RRL-Trader: 5860%
Underlying premise: over the 25 years 1970-1994, the U.S. stock/treasury markets were predictable.
[Figure annotations: oil shock, monetary tightening, market correction, market crash, Gulf War]
Statistically significant
Sensitivity Analysis
$$S_i = \left| \frac{dF}{dx_i} \right| \Big/ \max_j \left| \frac{dF}{dx_j} \right|$$
Most sensitive input: inflation expectations
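The normalized sensitivity above can be approximated for any trained model F by finite differences (the linear toy model here is purely illustrative):

```python
import numpy as np

def sensitivities(F, x, h=1e-5):
    """S_i = |dF/dx_i| / max_j |dF/dx_j|, via central finite differences."""
    x = np.asarray(x, float)
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2 * h)   # central difference
    g = np.abs(g)
    return g / g.max()

# Toy model: F is 3x more sensitive to x_0 than to x_1 at this point.
S = sensitivities(lambda v: 3.0 * v[0] + 1.0 * v[1], np.array([0.5, 0.5]))
```

Normalizing by the largest gradient magnitude puts every input on a 0-1 scale, so the dominant series (inflation expectations, in the slide) stands out directly.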
V. Learn the Policy or Learn the Value?
Immediate vs. Future Rewards
Reinforcement signal: immediate (RRL) or delayed (Q-learning, dynamic programming, or TD)
RRL: the policy is represented directly; learning a value function is bypassed
Q: the policy is represented indirectly
Policies vs. Values
Some limitations of the value function approach:
The original formulation of Q-learning assumes discrete action and state spaces
Curse of dimensionality
Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy
Large-scale noise and non-stationarity may cause severe problems
RRL's advantages:
The policy is represented directly: a simpler functional form is sufficient
Can produce real-valued actions
More robust in noisy environments; quick adaptation to non-stationarity
An Example
A simple trading system that {buys, sells} a single asset. Assumption: $r_{t+1}$ is known in advance.
No need for future rewards: $\gamma = 0$
The policy function is trivial: $a_t = \mathrm{sign}(r_{t+1})$; a single tanh unit is sufficient
The value function must be able to represent an XOR mapping: 2 tanh units are needed
Conclusion
How to train trading systems via DR: the RRL algorithm; the differential Sharpe ratio and differential downside deviation ratio
RRL is more efficient than Q-learning in the financial domain.