
Continuous Control with Stacked Deep Dynamic Recurrent Reinforcement Learning for Portfolio Optimization

Journal Pre-proof

Continuous Control with Stacked Deep Dynamic Recurrent Reinforcement Learning for Portfolio Optimization

Amine Mohamed Aboussalah, Chi-Guhn Lee

PII: S0957-4174(19)30607-4
DOI: https://doi.org/10.1016/j.eswa.2019.112891
Reference: ESWA 112891

To appear in: Expert Systems With Applications

Received date: 29 January 2019
Revised date: 22 July 2019
Accepted date: 19 August 2019

Please cite this article as: Amine Mohamed Aboussalah, Chi-Guhn Lee, Continuous Control with Stacked Deep Dynamic Recurrent Reinforcement Learning for Portfolio Optimization, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.112891

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Ltd.


Highlights

• Incorporating portfolio constraints into recurrent reinforcement learning framework

• Reinforcement learning algorithm with continuous trading actions over multiple assets

• Simultaneous control of the portfolio constraints and policy network hyperparameters

• Hourglass shape network architectures emerge as a natural choice for asset management


Continuous Control with Stacked Deep Dynamic Recurrent Reinforcement Learning for Portfolio Optimization

Amine Mohamed Aboussalah (a), Chi-Guhn Lee (a,*)

(a) Department of Mechanical and Industrial Engineering, University of Toronto, ON M5S 3G8, Canada

(*) Corresponding author. Email addresses: [email protected] (Amine Mohamed Aboussalah), [email protected] (Chi-Guhn Lee)

Abstract

Recurrent Reinforcement Learning (RRL) techniques have been used to optimize asset trading systems and have achieved outstanding results. However, the majority of the previous work has been dedicated to systems with discrete action spaces. To address the challenge of continuous action and multi-dimensional state spaces, we propose the so-called Stacked Deep Dynamic Recurrent Reinforcement Learning (SDDRRL) architecture to construct a real-time optimal portfolio. The algorithm captures the up-to-date market conditions and rebalances the portfolio accordingly. Under this general vision, the Sharpe ratio, which is one of the most widely accepted measures of risk-adjusted returns, has been used as a performance metric. Additionally, the performance of most machine learning algorithms highly depends on their hyperparameter settings. Therefore, we equipped SDDRRL with the ability to find the best possible architecture topology using an automated Gaussian Process (GP) with Expected Improvement (EI) as an acquisition function. This allows us to select the best architectures that maximize the total return while respecting the cardinality constraints. Finally, our system was trained and tested in an online manner for 20 successive rounds with data for ten selected stocks from different sectors of the S&P 500 from January 1st, 2013 to July 31st, 2017. The experiments reveal that the proposed SDDRRL achieves superior performance compared to three benchmarks: the rolling horizon Mean-Variance Optimization (MVO) model, the rolling horizon risk parity model, and the uniform buy-and-hold (UBAH) index.

Keywords: Reinforcement learning; Policy gradient; Deep learning; Sequential model-based optimization; Financial time series; Portfolio management; Trading systems

1. Introduction

The development of intelligent trading agents has attracted the attention of investors as it provides an alternative way to trade known as automated data-driven investment, which is distinct from traditional trading strategies developed based on microeconomic theories. The intelligent agents are trained by using historical data, and a variety of Machine Learning (ML) techniques have been applied to execute the training process. Examples include Reinforcement Learning (RL) approaches that have been developed to solve Markov decision problems. RL algorithms can be classified mainly into two categories: actor-based methods (sometimes called direct reinforcement or policy gradient / policy search methods) Williams (1992); Moody & Wu (1997); Moody et al. (1998); Ng & Jordan (2000); Baxter & Bartlett (2001), where the actions are learned directly, and critic-based methods (also known as value-function-based methods), where we directly estimate the value functions. The choice of a particular method depends upon the nature of the problem being addressed. One of the direct reinforcement techniques is called Recurrent Reinforcement Learning (RRL), and it is presented as a methodology to solve stochastic control problems in finance (Moody & Wu, 1997). RRL has the advantage of finding the best investment policy which maximizes certain utility functions without resorting to predicting price fluctuations, and it is often incorporated with a neural network to determine the relationship (mapping) between historical data and investment decision making strategies. It produces a simple and elegant representation of the underlying stochastic control problem while avoiding Bellman's curse of dimensionality.

In the past, there have been several attempts to use a value-based reinforcement learning approach in the financial industry: a TD(λ) approach has been applied in finance Van Roy (1999), and Neuneier (1996) applied Q-Learning to optimize asset allocation decisions. However, such value-function methods are less than ideal for online trading due to their inherently delayed feedback (Moody & Saffell, 2001) and also because they imply having a discrete action space. Moreover, the Q-learning approach turns out to be more unstable compared to the RRL approach when presented with noisy data (Moody & Saffell, 2001). In fact, the Q-learning algorithm is more sensitive to the value function selection, while the RRL algorithm offers more flexibility to choose between different utility functions that can be directly optimized, such as profit, wealth, or risk-adjusted performance measures. A comparison study between Direct Reinforcement and Q-Learning methods for asset allocation was conducted by (Moody & Saffell, 2001). Moreover, Moody & Saffell (2001); Deng et al. (2015) suggest an actor-based direct reinforcement learning that is able to provide immediate feedback of the market conditions to make optimal decisions. Therefore, it suits the nature of the market and dynamic trading better than Q-learning. Another recent paradigm, the Deep Q-Network designed initially to play Atari games Mnih et al. (2015), inspired the deep Q-trading system which learns the Q-value function for the control problem (Wang et al., 2017). Other papers using deep RL in portfolio management have recently been published. Liang et al. (2018) implemented three state-of-the-art continuous control RL algorithms, all of which are widely used in game playing and robotics. Jiang et al. (2017) presented a financial model-free RL framework to provide a deep ML solution to the portfolio management problem using an online stochastic batch learning scheme. They introduced the concepts of the Ensemble of Identical Independent Evaluators topology and the Portfolio-Vector Memory. Zarkias et al. (2019) introduced a novel price trailing formulation, where the RL agent is trained to trail the price of an asset rather than directly predicting the future price. Zhengyao & Liang (2017) used a convolutional neural network (CNN) trading based approach with historic prices of a set of financial assets from a cryptocurrency exchange to output the portfolio weights. Recent applications of RRL in algorithmic trading succeed in single asset trading. Maringer & Ramtohul (2012) presented the regime-switching RRL model and described its application to investment problems. A task-aware scheme was proposed by Deng et al. (2016) to tackle vanishing/exploding gradients in RRL, and Lu (2017) deploys long short-term memory (LSTM) to handle the same deficiency. Almahdi & Yang (2017) proposed an RRL method with a coherent risk adjusted performance objective function to obtain both buy and sell signals and asset allocation weights.

Multi-asset investment, also known as portfolio management, has a cardinality constraint that has to be satisfied as well, which requires portfolio weights to sum to one. Another major challenge concerns the return of investment, which is naturally path-dependent. Previous decisions drastically affect future decisions, which brings us to the question of how to take advantage of the history of the previous decisions without losing in terms of time complexity.

To address these issues, we introduce the Stacked Deep Dynamic Recurrent Reinforcement Learning (SDDRRL) algorithm that takes multiple continuous investment actions for each asset while enforcing the cardinality constraint. We use a gradient clipping sub-task based Backpropagation Through Time (BPTT) to address the problem of vanishing gradients that may occur due to the presence of a memory gate responsible for taking the previous investment decisions into account in the new ones (Bengio et al., 1994). Moreover, to find out how many past decisions should be incorporated into the model in order to compute the current optimal investment decisions without losing in terms of time efficiency, we define the concept of Time Recurrent Decomposition (TRD) that takes into account the temporal dependency. The number of time-stacks has been optimized by equipping the agent with the ability to find the best possible configuration of those time-stacks along with other hyperparameters using an automated Gaussian Process (GP). Moreover, a noteworthy pattern emerges following the application of the automated GP: the architectures presenting the best performances seem to present an hourglass shape topology (similar to autoencoders). Finally, another advantage of the proposed architecture is that it is by construction modular and readily deployable in real-time trading platforms. The remaining Sections are organized as follows: Section 2 describes the model formulation and Section 3 introduces the learning algorithm in more detail. Section 4 shows the experimental results including the Bayesian hyperparameter optimization, the distribution of portfolio weights generated by our algorithm and the performance comparison against some commonly used benchmarks. Finally, Section 5 concludes the article.

2. FORMULATION

The key framework of RRL is to find the optimal decisions δ_t(θ) in order to maximize a specific utility function U_T(·) that represents the wealth of investors. The simplest way is to directly maximize U_T(·) over a time horizon period T:

\max_\theta \; U_T(R_1, R_2, R_3, \ldots, R_T \mid \theta)    (1)

where θ denotes the optimal trading system parameters and R_t for t ∈ {1, 2, ..., T} the realized returns. The optimization aims to determine the parameter vector θ that gives the optimal decisions leading to a maximal utility.

2.1. Financial Objective Function

The dynamic nature of trading problems requires investors to make sequential decisions, and each of these decisions will result in an instantaneous reward/return R_t. The accumulated rewards generated from the beginning up to the current time step T define an economic utility function U_T(R_1, R_2, R_3, ..., R_T). In this context, a variety of financial objective functions have been developed and reported in the literature. By way of illustration, the most natural utility function used by risk-insensitive investors is the profit, which can be seen as the sum of total rewards. Others use the logarithmic cumulative return instead to maximize their wealth. However, maximizing the cumulative return does not mitigate the unseen risks in the investment, which is one of the major concerns of risk-averse investors. Alternatively, most modern fund managers would optimize the risk-adjusted return, which is an indicator that refines returns by measuring how much risk is involved in producing that return. The Sharpe ratio (Sharpe, 1994) is one of the most widely accepted measures of risk-adjusted returns. The Sharpe ratio is also known as the reward-to-variability ratio. It measures how much additional return will be received for the additional volatility of holding the risky assets over a risk-free asset. Under the setting of an investment with multiple periods, the Sharpe ratio is the average risk premium per unit of volatility in an investment:

U_T = \frac{\frac{1}{T}\sum_{t=1}^{T} R_t - r_f}{\sqrt{\frac{1}{T}\sum_{t=1}^{T} R_t^2 - \big(\frac{1}{T}\sum_{t=1}^{T} R_t\big)^2}}    (2)

3

Page 5: Continuous Control with Stacked Deep Dynamic Recurrent ...static.tongtianta.site/paper_pdf/47014998-cc8c-11e9-957d-00163e08… · Continuous Control with Stacked Deep Dynamic Recurrent

where r_f denotes the risk-free rate of return, defined as the theoretical rate of return of an investment with zero risk. The risk-free rate can be interpreted as the interest an investor would expect from an absolutely risk-free investment over a specified period of time. Since r_f is a constant in equation (2), it can be disregarded during the optimization phase (i.e. we consider r_f = 0 in the following Sections).
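As a concrete illustration of equation (2), the following minimal NumPy sketch (our own illustration, not the authors' code) computes U_T from a series of realized returns with r_f = 0:

```python
import numpy as np

def sharpe_utility(returns, r_f=0.0):
    """Sharpe-ratio utility U_T of equation (2): mean excess return
    divided by the standard deviation of the returns over the horizon."""
    R = np.asarray(returns, dtype=float)
    mean_excess = R.mean() - r_f
    volatility = np.sqrt((R ** 2).mean() - R.mean() ** 2)
    return mean_excess / volatility

# Example: hourly portfolio returns over a short horizon
print(sharpe_utility([0.002, -0.001, 0.003, 0.0005]))
```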

2.2. Portfolio Optimization Model

A signal that represents the current market condition will be fed into a neural network, pass through multiple layers, and finally output the decision vector δ_t at time t. Suppose that we have a set of assets {1, 2, ..., m} indexed by i throughout the paper. By combining the suggestion from Deng et al. (2016) with the additional setting of a multi-asset portfolio, the input signal is defined as f_t = {f_{1,t}, f_{2,t}, f_{3,t}, ..., f_{m,t}}, where m is the total number of the assets considered in the portfolio optimization problem. f_t represents the current market conditions in which each element f_{i,t} = {ΔP_{i,t-a-1}, ΔP_{i,t-a}, ..., ΔP_{i,t}} ∈ R^a indicates price changes within the decision making epoch. According to Merton (1969), a price change is a movement independent of its history. Therefore, using price changes instead of prices themselves as input signals for the policy network improves the learning efficiency, because it removes the trend from the signal and makes the data appear stationary. However, in Deng et al. (2016), f_{t+1} is obtained simply by sliding forward one element in each signal f_{i,t}, and consequently there exists a significant overlap between f_t and f_{t+1}. This will drastically hinder the learning efficiency as it learns insignificant information from f_t to f_{t+1}. Therefore, we shrink the intervals between each feature in the signals. In our definition, the decision will be made hourly, t ∈ {1, 2, ..., T}, where T is the total number of trading hours present in the dataset, and each element in f_{i,t} represents the 15-minute price change of stock i; as a result f_{i,t} ∈ R^4. The overlap between signals is eliminated under this setting, and the signal will be fed into a neural network so that the information it carries will be extracted gradually as it passes through the neural network layers.
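A minimal sketch of this signal construction (our own illustration; the function and variable names are hypothetical) builds f_{i,t} ∈ R^4 from the four non-overlapping 15-minute price changes inside each trading hour:

```python
import numpy as np

def build_hourly_signal(prices_15min):
    """prices_15min: array of shape (m, 4*T + 1) with 15-minute prices for m stocks.
    Returns an array of shape (T, m, 4): four non-overlapping 15-minute price
    changes per stock for each trading hour t."""
    changes = np.diff(prices_15min, axis=1)          # (m, 4*T) price changes
    m, n = changes.shape
    T = n // 4
    # group the changes into non-overlapping blocks of 4 per trading hour
    return changes[:, :4 * T].reshape(m, T, 4).transpose(1, 0, 2)

# Example with m = 2 stocks and T = 2 trading hours (9 price samples each)
prices = np.cumsum(np.random.randn(2, 9), axis=1) + 100.0
f = build_hourly_signal(prices)
print(f.shape)  # (T, m, 4) -> (2, 2, 4)
```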

Under the setting of a portfolio containing more than one asset, the investment decisions are represented by the vector δ_t = [δ_{1,t}, δ_{2,t}, δ_{3,t}, ..., δ_{m,t}] ∈ R^{1×m}, with m being the total number of assets in the portfolio. In plain words, each element in δ_t represents the weight of an asset in the portfolio at time step t, and δ_{i,t} ∈ [0, 1] for i ∈ {1, 2, ..., m} as short-selling is disallowed to avoid infinite losses. The immediate return R_t is defined as follows:

R_t = \delta_{t-1} \cdot Z_t - c \sum_{i=1}^{m} |\delta_{i,t} - \delta_{i,t-1}|    (3)

where Z_t = [Z_{1,t}, Z_{2,t}, Z_{3,t}, ..., Z_{m,t}]^T ∈ R^{m×1}, Z_{i,t} = P_{i,t}/P_{i,t-1} − 1 for i ∈ {1, 2, ..., m}, and P_{i,t} is the price of asset i at time t. Therefore, Z_{i,t} indicates the rate of price change within a trading period (one hour in our model) and c represents the transaction commission rate that is taken into consideration both during the training and testing periods. At each time step, rebalancing the portfolio will result in a transaction cost, and it is subtracted from the investment returns. Moreover, each decision in the vector δ_t represents the weight of a stock in the portfolio, and the cardinality constraint requires that \sum_{i=1}^{m} \delta_{i,t} = 1.
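A minimal sketch of equation (3) (our own illustration, not the authors' code) computes the immediate return of a rebalanced portfolio net of transaction costs:

```python
import numpy as np

def immediate_return(delta_prev, delta_new, prices_prev, prices_now, c=0.0055):
    """Equation (3): portfolio return earned by the previous weights minus
    the commission paid on the weight changes (c = 0.55% by default)."""
    Z = prices_now / prices_prev - 1.0                 # per-asset rate of price change
    turnover = np.abs(delta_new - delta_prev).sum()    # total weight rebalanced
    return float(delta_prev @ Z - c * turnover)

# Example with three assets
print(immediate_return(np.array([0.4, 0.3, 0.3]),
                       np.array([0.5, 0.2, 0.3]),
                       np.array([100.0, 50.0, 20.0]),
                       np.array([101.0, 49.5, 20.2])))
```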

The cardinality constraint leads to a constrained optimization problem. One way to enforce it is to apply a softmax transformation to the decision layer such that the constrained decisions become δ^c_{i,t} = exp(δ_{i,t}) / \sum_{j=1}^{m} exp(δ_{j,t}) (Moody et al., 1998). However, applying this transformation is equivalent to applying a multi-class discriminant function on the decision weights, which can drastically enlarge the difference between them. This could result in an undiversified portfolio that is therefore subject to a higher risk when a precipitous drop in price is encountered. In addition, it requires an extra layer in our architecture, which also makes the overall model more computationally expensive. Instead, the penalty method can be used to enforce the cardinality constraint. In addition, it is also known that using L2 regularization (Phaisangittisagul, 2016) along with dropout increases the accuracy (Srivastava et al., 2014). Thus, the portfolio optimization model can be formulated at time t as follows:

\max_\theta \; U_t(R_1, R_2, R_3, \ldots, R_t \mid \theta) - p\Big(1 - \sum_{i=1}^{m} \delta_{i,t}\Big)^2 - \beta \sum_{l=1}^{n} \sum_{k=1}^{N_l} w_{l,k}^2

s.t.  R_t = \delta_{t-1} \cdot Z_t - c \sum_{i=1}^{m} |\delta_{i,t} - \delta_{i,t-1}|
      \delta_t = \mathrm{sigmoid}\big(W^{(n)} h_{n-1} + b^{(n)} + v \odot \delta_{t-1}\big)
      h_{n-1} = \mathrm{ReLU}\big(W^{(n-1)} h_{n-2} + b^{(n-1)}\big)
      h_{n-2} = \mathrm{ReLU}\big(W^{(n-2)} h_{n-3} + b^{(n-2)}\big)
      \cdots
      h_1 = \mathrm{ReLU}\big(W^{(1)} f_t + b^{(1)}\big)    (4)

where:

− ⊙ represents the element-wise multiplication symbol;

− N_l denotes the number of neurons in a given layer l;

− W^{(l)} = [w^{(l)}_{i,j}] ∈ R^{N_l × N_{l−1}} is the weight matrix and b^{(l)} ∈ R^{N_l} the bias vector for layer l;

− θ = {(W^{(1)}, b^{(1)}), ..., (W^{(n)}, b^{(n)}, v)} represents the trading parameters of the policy network.

In our work, we did not use dropout since we do not have very deep neural networks. The improvement that we show in the paper is due to L2-regularization only (weight decay), where w_{l,k} is defined as the weight connecting the neuron at the kth position of the lth layer. The recurrent part in (4) is due to the presence of a memory gate at the decision layer δ_t, responsible for taking the previous investment decisions into account in the new ones. The penalty coefficient p and regularization coefficient β were treated as hyperparameters.


[Fig. 1. SDDRRL process: multiple stock prices are preprocessed into 15-minute price changes and fed to the trading system θ = {ω, v}; the resulting investment weights δ_t(θ) are normalized and combined with the realized return series and transaction costs to produce the utility U_t(θ), which is fed back as the reinforcement learning signal.]

The penalty term penalizes the utility function U_t whenever the cardinality constraint is unsatisfied, and the magnitude of the penalty can be controlled by the penalty coefficient p. The advantage of the penalty method is that it is universally applicable to any equality or inequality constraints (round-lot, asset class, return, cardinality, etc.) and fits more advanced portfolio optimization approaches. However, the addition of this penalty term to our objective function gives no guarantee that the cardinality constraint will be respected. Thus, to avoid any deficiency risk in our portfolio, we decided to add a normalization layer after the decision layer, as shown in Fig. 1. To bypass the use of a softmax function, each weight was simply divided by the sum of all the decision weights. Table 2 in Section 4 presents a detailed comparative study of the performance obtained with the five best online architectures that we found.
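To make the structure of equation (4) and the normalization step of Fig. 1 concrete, here is a minimal NumPy sketch of the forward pass (our own illustration; the layer sizes, variable names, and initialization are hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(theta, f_t, delta_prev):
    """One forward pass of the policy network in equation (4):
    ReLU hidden layers, a sigmoid decision layer with the memory gate
    v ⊙ δ_{t-1}, followed by the normalization layer of Fig. 1."""
    h = f_t
    for W, b in theta["hidden"]:                     # h_1, ..., h_{n-1}
        h = relu(W @ h + b)
    W_n, b_n, v = theta["decision"]
    delta = sigmoid(W_n @ h + b_n + v * delta_prev)  # raw decision weights
    return delta / delta.sum()                       # enforce sum-to-one

# Toy example: m = 3 assets, flattened input signal of size 3 * 4 = 12
rng = np.random.default_rng(0)
theta = {
    "hidden": [(rng.normal(size=(8, 12)), np.zeros(8)),
               (rng.normal(size=(5, 8)), np.zeros(5))],
    "decision": (rng.normal(size=(3, 5)), np.zeros(3), rng.normal(size=3)),
}
print(forward(theta, rng.normal(size=12), np.full(3, 1 / 3)))
```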

3. SDDRRL ARCHITECTURE

As aforementioned, the backpropagation when we have recurrent structures is slightly different from the regular one, since the computation of δ_t requires δ_{t−1} as an extra input at the decision layer. One simple way to backpropagate the flow of information through the network is BPTT. For instance, at time step T, the gradient of the objective function w.r.t. θ is obtained by the chain rule:

\frac{\partial U_T}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial U_T}{\partial R_t}\left(\frac{\partial R_t}{\partial \delta_t}\frac{d\delta_t}{d\theta} + \frac{\partial R_t}{\partial \delta_{t-1}}\frac{d\delta_{t-1}}{d\theta}\right)    (5)

As δ_t = sigmoid(W^{(n)} h_{n−1} + b^{(n)} + v ⊙ δ_{t−1}), the previous decision serves as an input to the calculation of the current decision; therefore,

\frac{d\delta_t}{d\theta} = \nabla_\theta \delta_t + \frac{\partial \delta_t}{\partial \delta_{t-1}}\frac{d\delta_{t-1}}{d\theta}    (6)

The above equations (5) and (6) assume differentiability of the trading decision function δ_t. According to the chain rule, the gradient at the current state involves the partial derivatives w.r.t. all decisions from the beginning to the current step, and the calculation needs to be evaluated recursively. By taking recurrence into account, the gradient is expressed as follows:

\frac{\partial U_T}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial U_T}{\partial R_t}\Bigg(\frac{\partial R_t}{\partial \delta_t}\bigg(\frac{\partial \delta_t}{\partial \theta} + \frac{\partial \delta_t}{\partial \delta_{t-1}}\Big(\frac{\partial \delta_{t-1}}{\partial \theta} + \frac{\partial \delta_{t-1}}{\partial \delta_{t-2}}\big(\frac{\partial \delta_{t-2}}{\partial \theta} \cdots \frac{d\delta_1}{d\theta}\big)\Big)\bigg) + \frac{\partial R_t}{\partial \delta_{t-1}}\bigg(\frac{\partial \delta_{t-1}}{\partial \theta} + \frac{\partial \delta_{t-1}}{\partial \delta_{t-2}}\Big(\frac{\partial \delta_{t-2}}{\partial \theta} + \frac{\partial \delta_{t-2}}{\partial \delta_{t-3}}\big(\frac{\partial \delta_{t-3}}{\partial \theta} \cdots \frac{d\delta_1}{d\theta}\big)\Big)\bigg)\Bigg)    (7)

Equations (5)–(7) imply that the further we go back in time, the less impact previous decisions have on the current one. This observation is consistent with the vanishing gradient issue in recurrent neural networks. In addition, unfolding the entire memory is computationally expensive, especially at large time steps.

In the typical recurrent reinforcement learning (RRL) approach, the training of the neural network requires the optimization of U_T, in which all trading decisions δ_t for t ∈ {1, 2, ..., T} need to be adjusted according to the new market conditions. However, old decisions are not as influential as new market conditions when it comes to making a new decision. Therefore, we introduce the concept of Time Recurrent Decomposition (TRD) that takes into account the necessary temporal dependency by stacking RRL structures as shown in Fig. 2, resulting in the Stacked Deep Dynamic Recurrent Reinforcement Learning (SDDRRL). In SDDRRL, we re-optimize only the recent decisions instead of those in the entire history, and the number of time-stacks (denoted as τ) specifies how many recent decisions should influence the current decision. That is, the number of stacks τ is the level of time dependency of U_T on previous decisions. For instance, τ = 2 means there are two time-stacks, i.e. the current decision δ_T needs to be computed and the most recent decision δ_{T−1} needs to be adjusted.

Consider the portfolio optimization problem given in Equation (4) at time T. The optimization problem is decomposed into τ tasks {V_1, V_2, ..., V_τ}, where V_k, k ∈ {1, 2, ..., τ}, is defined in equations (8) to (11), and V_k is assigned to a time-stack k. At time-stack k, V_k is optimized to find the optimal decision δ_{T−(τ−k)}. To optimize δ_{T−(τ−k)}, we consider only the terms of the objective function U_T involving δ_{T−(τ−k)}, and we compute the gradient to update the parameter vector θ. The task V_k is a combination of two components that include δ_{T−(τ−k)}: (1) the transaction cost in R_{T−(τ−k)}; (2) the realized return in R_{T−(τ−k)+1}.

V_k = U_T\big(R_{T-(\tau-k)}, R_{T-(\tau-k)+1}\big) - p\Big(1 - \sum_{i=1}^{m} \delta_{i,T-(\tau-k)}\Big)^2 - \beta \sum_{l=1}^{n} \sum_{k=1}^{N_l} w_{l,k}^2    (8)

where

U_T\big(R_{T-(\tau-k)}, R_{T-(\tau-k)+1}\big) = U_T\big(\ldots, R_{T-(\tau-k)}, R_{T-(\tau-k)+1}, \ldots\big)    (9)

with

R_{T-(\tau-k)} = \delta_{T-(\tau-k)-1} \cdot Z_{T-(\tau-k)} - c \sum_{i=1}^{m} |\delta_{i,T-(\tau-k)} - \delta_{i,T-(\tau-k)-1}|    (10)

and

R_{T-(\tau-k)+1} = \delta_{T-(\tau-k)} \cdot Z_{T-(\tau-k)+1} - c \sum_{i=1}^{m} |\delta_{i,T-(\tau-k)+1} - \delta_{i,T-(\tau-k)}|    (11)

U_T(R_{T−(τ−k)}, R_{T−(τ−k)+1}) represents the component of the utility function U_T involving only the realized returns R_{T−(τ−k)} and R_{T−(τ−k)+1}. The other realized returns that do not involve the computation of δ_{T−(τ−k)} are considered fixed. Once δ_{T−(τ−k)} is computed by the updated parameters from backpropagation, it is fed into the next time-stack to compute the optimal δ_{T−(τ−k)+1}, which is assigned the task V_{k+1}.

The pseudocode in Algorithm 1 summarizes the training phase and Fig. 2 illustrates the SDDRRL training process. The utility function U_T at time T is decomposed into V_1, V_2, ..., V_τ, which are assigned to stacks 1, 2, ..., τ. The red lines connect V_1, V_2, ..., V_τ to U_T, indicating the time-stack decomposition. The yellow lines show how instantaneous returns are defined with investment decisions. The green dotted lines show gradient information flowing from the decomposed tasks all the way down to the multi-layered neural network in the time-stacks, boxed by red dotted rectangles, to perform backpropagation. It is necessary to point out that the decision prior to the first RRL block (k = 1) is taken from the last time step (i.e. if we are currently maximizing U_T for example, then the out-of-stack decision will be from the last time step U_{T−1}). However, it is problematic that we will need to foresee the information coming from δ_{T−(τ−k)+1} to perform backpropagation at time-stack k.

Algorithm 1: Training algorithm for SDDRRL

assign: values to α, γ1, γ2, ε, T, N and τ
initialize: θ_0 ~ Normal(0, 1), m_0 = 0, v_0 = 0, i = 0, δ_last = 0      /* holdings before investment */
while t ≤ T do                                      /* iterate over all time steps */
    while i ≤ N do                                  /* iterate until convergence */
        i ← i + 1
        while k ≤ τ do                              /* iterate over all time-stacks */
            if k = 1 then
                δ_previous = δ_last                 /* out-of-stack: from last time step */
            else
                δ_previous = δ_{T−(τ−k)−1}          /* from last time-stack */
            if i = 1 then
                δ_next_approx = δ_{T−(τ−k)}         /* assume the next decision unchanged */
            else
                δ_next_approx = δ_{T−(τ−k)+1}
            determine the task V_k(θ_{i−1}, δ_previous, δ_next_approx)
            g_i = ∇_θ V_k(θ_{i−1}, δ_previous, δ_next_approx)
            m_i = γ1 m_{i−1} + (1 − γ1) g_i
            v_i = γ2 v_{i−1} + (1 − γ2) g_i^2
            α_i = α √(1 − γ2^i) / (1 − γ1^i)
            θ_i = θ_{i−1} − α_i m_i / (√v_i + ε)
            compute δ_{T−(τ−k)}(θ_i)                /* by forward-propagating the learned weights */
            compute δ^norm_{T−(τ−k)}(θ_i)           /* by normalizing the decision vector */
            return θ_i, δ^norm_{T−(τ−k)}(θ_i)
            k ← k + 1
    δ_last = δ^norm_{T−(τ−1)}(θ_N)                  /* update out-of-stack decision for next time step */

[Fig. 2. SDDRRL training phase at time step T: the utility U_T is decomposed into the tasks V_1, ..., V_τ assigned to time-stacks k = 1, ..., τ, which compute the decisions δ_{T−(τ−1)}, ..., δ_T starting from the out-of-stack decision of the last time step; the realized returns R_{T−(τ−1)}, ..., R_T define the decomposed tasks, and the iterations repeat until convergence to compute δ_T.]

At this time-stack, δ_{T−(τ−k)} is learned by solving the optimization problem defined in equation (4) for the task V_k as defined in (8). It is impossible in practice to explicitly know the next decision when the computation of the current decision is still ongoing. Therefore, instead of foreseeing the next decision magically, we can approximate its value by assuming that the next decision δ_{T−(τ−k)+1} will remain the same as δ_{T−(τ−k)}, which could be interpreted as temporarily canceling the transaction cost at the first iteration (i = 1). This is a conservative assumption often used in approximate dynamic programming methods Powell (2011) when the next signal f_{t+1} is temporarily absent at time step t. Afterward, δ_{T−(τ−k)} is generated and flows to the next time-stack to compute δ_{T−(τ−k)+1}. After the first iteration over all time-stacks (i.e. when k = τ), the system switches to the second iteration (i = 2) and repeats the calculation of δ_{T−(τ−k)}, starting once more from the first time-stack k = 1 up to the last time-stack k = τ; then we move on to the next optimizer iteration and the process repeats itself until convergence. However, when i = 2, the approximated value of δ_{T−(τ−k)+1} at time-stack k is replaced by the explicit result of δ_{T−(τ−k)+1} from the last iteration (i = 1). By doing so, the future market situation is captured during the training phase from the last optimizer iteration and helps to learn the correct decision at the current time-stack by taking future information into account. Therefore, besides the first iteration (i = 1), the value of the next decisions will be approximated by the explicitly computed values δ_{T−(τ−k)+1} from previous iterations i ∈ {2, ..., N}. In addition, the estimation of the next decision at k = τ is unnecessary since δ_T will be the last one. Hence, the task assigned to δ_T is:

V_\tau = U_T(R_T) - p\Big(1 - \sum_{i=1}^{m} \delta_{i,T}\Big)^2 - \beta \sum_{l=1}^{n} \sum_{k=1}^{N_l} w_{l,k}^2    (12)

where

R_T = \delta_{T-1} \cdot Z_T - c \sum_{i=1}^{m} |\delta_{i,T} - \delta_{i,T-1}|    (13)
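For concreteness, the parameter update inside Algorithm 1 follows the familiar Adam-style moment estimates. A minimal sketch of that inner step (our own illustration; `grad_task` stands in for the gradient ∇_θ V_k and is hypothetical) could look as follows:

```python
import numpy as np

def adam_step(theta, grad, m, v, i, alpha=1e-3, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One optimizer iteration of Algorithm 1: update the first and second
    moment estimates, apply the bias-corrected step size, and move θ
    against the gradient of the task V_k."""
    m = gamma1 * m + (1 - gamma1) * grad
    v = gamma2 * v + (1 - gamma2) * grad ** 2
    alpha_i = alpha * np.sqrt(1 - gamma2 ** i) / (1 - gamma1 ** i)
    theta = theta - alpha_i * m / (np.sqrt(v) + eps)
    return theta, m, v

# Toy usage: a few iterations on a random stand-in gradient
theta = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
for i in range(1, 4):
    grad_task = np.random.randn(4)   # stands in for ∇_θ V_k(θ, δ_previous, δ_next_approx)
    theta, m, v = adam_step(theta, grad_task, m, v, i)
print(theta)
```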

4. EXPERIMENTS

4.1. Dataset

In our experiments, the investment decisions are made hourly, and each element in the input signal f_t represents the 15-minute price changes within the trading hour. SDDRRL is trained and tested through twenty successive rounds. In each round, the training section covers 1200 trading hours while the testing section covers the next 300. Due to this testing mechanism, the testing periods are made short because the trained system will be mostly effective for short periods. In the first round, SDDRRL is trained on the first 1200 trading hours and tested from trading hour 1200 to 1500. In the next round, the training and testing data are shifted 300 trading hours forward (i.e. the second training round goes from hour 300 to 1500 and the second testing round from hour 1500 to 1800), and it moves ahead in this way for the rest of the rounds. In fact, the volatility of the market is a major concern in most ML-based trading systems. Models trained with historical data are not effective on testing periods since the new market conditions are not learned by the trained model. Therefore, SDDRRL is trained and tested in an online manner so that the model can quickly adapt to the new market conditions. It should be noted that our test periods include the transaction costs. We used the typical cost due to bid-ask spread and market impact, that is 0.55%. We believe these are reasonable transaction costs for the portfolio trades. For each round, SDDRRL will be trained with 1200 trading hour data points, and when the testing period starts, the last signal during testing will be added as an input to the system and the parameters will be updated to adapt to the most recent market conditions. However, the size of the training and testing windows should be optimized in order to potentially obtain better results. We have developed SDDRRL as a new architecture combining deep neural networks with recurrent reinforcement learning, tailored specifically for multi-asset portfolios. In this sense, future investigation of the size of the training and testing windows should be made.
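The rolling train/test scheme described above can be sketched as follows (our own illustration; the round boundaries follow the 1200/300-hour windows stated in the text):

```python
def rolling_rounds(n_rounds=20, train_hours=1200, test_hours=300):
    """Yield (train_start, train_end, test_start, test_end) index ranges:
    each round trains on 1200 trading hours, tests on the next 300,
    and the whole window slides forward by 300 hours per round."""
    for r in range(n_rounds):
        train_start = r * test_hours
        train_end = train_start + train_hours
        yield train_start, train_end, train_end, train_end + test_hours

for bounds in rolling_rounds():
    print(bounds)   # (0, 1200, 1200, 1500), (300, 1500, 1500, 1800), ...
```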

The SDDRRL model is evaluated on a portfolio consisting of the following ten selected stocks: American Tower Corp. (AMT), American Express Company (AXP), Boeing Company (BA), Chevron Corporation (CVX), Johnson & Johnson (JNJ), Coca-Cola Co (KO), McDonald's Corp. (MCD), Microsoft Corporation (MSFT), AT&T Inc. (T) and Walmart Inc. (WMT). To promote the diversification of the portfolio, these stocks are selected from different sectors of the S&P 500, so that they are as uncorrelated as possible, as shown in Fig. 3.


[Fig. 3. (Left): The hourly price change for the 10 companies in our portfolio during 7316 trading hours (entire dataset window). (Right): The hourly rate of return. The period between the two vertical dashed lines represents the testing window: 6,000 trading hours (3.5 years).]

The data source is the finam¹ database, which has intraday data for 42 of the most liquid US stocks on BATS Global Markets². Our dataset starts from January 1st, 2013 up to July 31st, 2017, resulting in 7928 trading hours. It should be noted that there was missing data for the 10 stocks at different time periods, which was cleaned from our raw data. Therefore, instead of having 7928 trading hours, there are 7316 data points. This represents a duration of 4 years and 7 months (4.58333 years), which ultimately comes down to approximately 1596.2 trading points per year, i.e. 245.57 trading days per year, thus 6.5 trading hours per day.

Experiments were run on a 40-core machine with 384GB of memory. All algorithms were implemented in Python using the Keras and TensorFlow libraries. Each method is executed in an asynchronously parallel set-up of 2-4 GPUs, that is, it can evaluate multiple models in parallel, with each model on a single GPU. When the evaluation of one model finishes, the methods can incorporate the result and immediately re-deploy the next job without waiting for the others to finish. We use 20 K80 (12GB) GPUs with a budget of 10 hours.

4.2. Bayesian optimization for hyperparameter tuning

Many optimization problems in ML are black-box optimization problems due to the unknown nature of the objective function f(x). If the objective function were inexpensive to evaluate, then we could sample at many points, e.g. via grid search, random search or numeric gradient estimation, where we explore the space of hyperparameters without any prior knowledge about the configurations seen before. If it were instead expensive, as is typical with tuning hyperparameters of deep neural networks in a time-sensitive application such as finance, then it would become crucial to minimize the number of samples drawn from the black-box function f. This is where Bayesian Optimization (BO) techniques are most useful. They attempt to find the global optimum in a minimum number of steps.

¹ https://www.finam.ru/profile/moex-akcii/gazprom/export

² http://markets.cboe.com/

BO incorporates a prior belief about f and updates the prior with samples drawn from f to get a posterior that better approximates f. The model used for approximating the objective function is called the surrogate model. One of the most popular surrogate models for BO is the Gaussian Process (GP). BO also uses an acquisition function that guides sampling in the search space towards areas where an improvement over the current best observation is likely (right plot in Fig. 4).

Automatic hyperparameter tuning methods aim to construct a mapping between the hyperparameter settings and model performance in order to rationally sample the next configuration of hyperparameters. The paradigm of automatic hyperparameter tuning belongs to a class known as Sequential Model-Based Optimization (SMBO) (Hutter et al., 2011). SMBO algorithms mainly differ in the way they take into account historical observations to model either the surrogate function or some kind of transformation applied on top of it. They also differ in the way they apply derivative-free methods while optimizing those surrogates. For our work, we used the BO approach, which is a subclass of SMBO. It was shown that BO can outperform human performance on many benchmark datasets Snoek et al. (2012), and its standard design is described below:

1. The surrogate is modeled by a Gaussian Process (GP): f(x) ∼ GP(µ(x), k(x, x′)). In other words, f is a sample from a GP with mean function µ and covariance function k, and x represents the best set of hyperparameters we are looking for. Here we used a Gaussian kernel as a dissimilarity measure in the sample space: k(x, x′) ∝ exp(−‖x − x′‖² / (2σ²)), where σ² is a parameter that reflects the degree of uncertainty in our model.

2. To find the next best point to sample from f, we choose the one that maximizes an acquisition function. One of the most popular acquisition functions is of the Expected Improvement type (EI), which represents the belief that new trials will improve upon the current best configuration. The one with the highest EI will be tried next. It is defined as EI(x) = E[max(0, f(x) − f(x̂))], where x̂ is the current optimal set of hyperparameters. Maximizing EI(x) informs us about the region from which we should sample in order to gain the maximum information about the location of the global maximum of f.

[Fig. 4. Left: Convergence plot of the total return with respect to the number of GP iterations. Right: Acquisition function (red curve) guiding the sampling of the learning rate using the Gaussian surrogate loss.]

Table 1
Range of hyperparameters

Parameters                      Search space                        Type
Learning rate                   0 - 1                               Float
Number of stacks                1 - 10                              Integer
Number of units per layer       10 - 2000                           Integer
Regularization coefficient: β   0 - 1                               Float
Penalty coefficient: p          0 - 10                              Integer
Optimizer                       0: Adam, 1: Adadelta, 2: Adagrad    Categorical

The hyperparameters for the SDDRRL architecture listed in Table 1 were optimized using BO. We use the GPyOpt (2016) Python routine, version 1.2.1, to implement the BO. The optimization was initialized with 25 random search iterations followed by up to 150 iterations of standard GP optimization, where the total return is the objective modeled by the surrogate and EI is the acquisition function. The results are reported in the left plot of Fig. 4, showing that after only a few iterations, we are able to get a total return of 15.65%. Random search then boosts the total return very quickly up to 53.59% after only 18 iterations, and it remains there until the end of the random search cycle (iteration #25). Using the GP, we can show that we constantly improve our process of searching for the best architecture that maximizes the overall return. In our case, we stopped at iteration #200, but nothing prevents us from exploring the hyperparameter configuration space even more. The goal of this Section is just to illustrate the methodology. At iteration #200, we find that the best architecture gives 94.71% total return and the top-5 architectures give more than 78.77% total return. It should be noted that with the GP, the agent comes very quickly to probe the region of interest that maximizes the total return and remarkably targets the most sensitive hyperparameters that lead to the best architecture. Next, the right plot in Fig. 4 shows the acquisition function in red, the red dots show the history of the points that have already been explored, and the surrogate function with ±96% confidence.
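A minimal sketch of this tuning loop with GPyOpt (our own illustration, not the authors' code; `evaluate_total_return` is a hypothetical stand-in for training SDDRRL with a given configuration, and the four per-layer unit counts are collapsed into one variable for brevity):

```python
import numpy as np
import GPyOpt

# Search space of Table 1, in GPyOpt's domain format
domain = [
    {"name": "learning_rate", "type": "continuous", "domain": (0.0, 1.0)},
    {"name": "n_stacks", "type": "discrete", "domain": tuple(range(1, 11))},
    {"name": "n_units", "type": "discrete", "domain": tuple(range(10, 2001))},
    {"name": "beta", "type": "continuous", "domain": (0.0, 1.0)},
    {"name": "penalty_p", "type": "discrete", "domain": tuple(range(0, 11))},
    {"name": "optimizer", "type": "categorical", "domain": (0, 1, 2)},  # Adam, Adadelta, Adagrad
]

def evaluate_total_return(x):
    # Hypothetical stand-in: train SDDRRL with configuration x and
    # return the achieved total return (dummy value here).
    return float(np.random.rand())

def objective(X):
    # GPyOpt passes a 2D array of candidate configurations; we minimize
    # the negative total return, i.e. maximize the total return.
    return np.array([[-evaluate_total_return(x)] for x in X])

bo = GPyOpt.methods.BayesianOptimization(
    f=objective, domain=domain, model_type="GP",
    acquisition_type="EI", initial_design_numdata=25)
bo.run_optimization(max_iter=150)
print(bo.x_opt, -bo.fx_opt)   # best configuration and its total return
```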

it is interesting to compute the loss surface as a function of thehyperparameters as shown in Fig. 5. Essentially, the agent wasable to probe most of the hyperparameter configuration space485

and this gives us precisely an estimate of where the true opti-mum of the loss surface is located.

In particular, an interesting key element deserves to be pointedout. The architectures that give the best performance have to acertain extent several features in common, such as the learn-490

ing rate (almost the same order of magnitude), the number oftime-stacks approximately being 5 or 6, and the type of opti-mizer. But the most important characteristic is that almost allof them typically present an hourglass shape architecture (simi-lar to denoising autoencoder) as a neural network candidate for495

SDDRRL (Table 2). This result opens up a new avenue of in-vestigation. Indeed, it would be interesting to understand whysuch a characteristic emerges essentially when processing non-deterministic and non-stationary signals such as financial data.An attempt for answering this question would be the subject of500

our next work. In addition, our experiments have shown thatwhen p = 0, the trader agent no longer respects the cardinal-ity constraint, while when β = 0, we overfit, which results ina poor performance due to the lack of generalization capacity.The Fig. 6 shows the optimal decision weights for the best SD-505

DRRL architecture over the test period.


Table 2
Best top five architectures

Learning rate   Number of stacks   Number of units in each layer   Regularization coefficient   Penalty coefficient   Optimizer   Absolute return   Annualized return
0.081           6                  1160 - 780 - 580 - 1420         0.597                        2                     Adam        78.77%            18.05%
0.024           5                  870 - 160 - 320 - 370           0.809                        9                     Adam        82.90%            18.83%
0.021           10                 790 - 1730 - 140 - 1410         0.767                        9                     Adam        88.01%            19.77%
0.075           7                  400 - 1410 - 1390 - 1920        0.154                        9                     Adam        90.81%            20.27%
0.063           6                  940 - 1810 - 1020 - 1730        0.158                        7                     Adam        94.71%            20.97%

[Fig. 5. Loss surfaces of the posterior mean, the posterior standard deviation (sd) and the acquisition function for: beta coefficient vs. learning rate, and penalty coefficient vs. learning rate.]

4.3. Alternative active trading strategies

4.3.1. Rolling horizon mean-variance optimization model

The mean-variance optimization (MVO) framework was proposed by (Markowitz, 1952, 2010). It is a quantitative tool traditionally used in the dynamic portfolio allocation problem, where risk and return are traded off. In order to make the portfolios' performance comparable, we use the same risk-adjusted measure of return used in SDDRRL, that is the Sharpe ratio (reward-to-variability ratio). As defined above in Section 2.1, the Sharpe ratio measures the excess return per unit of risk (deviation) in an investment asset or a portfolio. The MVO problem using the Sharpe ratio can be written in a general way as follows:

\max_{x \in \mathbb{R}^m} \; \frac{\mu^T x - r_f}{\sqrt{x^T Q x}}
s.t.  \sum_i x_i = 1
      l \le Ax \le u
      x \ge 0    (14)

where:

1. µ, the vector of mean returns.
2. Q, the covariance matrix.
3. ∑_i x_i = 1 (cardinality constraint).
4. l ≤ Ax ≤ u (other linear constraints if needed).
5. x ≥ 0, the portfolio weight vector³.
6. r_f, the risk-free rate of return.

³ x ≥ 0 means that short-selling is disallowed.

[Fig. 6. The distribution of portfolio weights of the best SDDRRL trading agent over the test period.]

The solution of the optimization problem above is difficult to obtain because of the nature of its objective: 1) non-linear, 2) possibly non-convex. However, under a reasonable assumption⁴, it can be reduced to a standard convex quadratic program (Cornuejols & Tutuncu, 2006):

\min_{y \in \mathbb{R}^m,\, \kappa \in \mathbb{R}} \; y^T Q y
s.t.  \sum_i (\mu_i - r_f) y_i = 1
      \sum_i y_i = \kappa
      l\kappa \le Ay \le u\kappa
      \kappa \ge 0    (15)

⁴ There exists a vector x satisfying the constraints (3)-(5) in (14) such that µᵀx − r_f > 0. In other terms, we assume that our universe of assets is able to beat the risk-free rate of return.

The optimal solution of problem (14) can be recovered from the solution of problem (15) as x* = y/κ.

As in the SDDRRL case, the rolling horizon version of problem (15) is considered using 20 successive rounds: the first round uses 1200 trading hours to estimate the portfolio decision weights and the next 300 hours to test the trading strategy. In the next round, the training and testing data are shifted 300 trading hours forward, and it moves ahead likewise for the rest of the rounds. The "ILOG CPLEX Optimization Studio" IBM (1988) has been used to solve (15). Fig. 7 shows the optimal decision weights for the rolling horizon MVO model.
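For illustration, the quadratic program (15) can be sketched with an open-source solver such as cvxpy (our own stand-in for the CPLEX setup used in the paper; the extra linear constraints l ≤ Ax ≤ u are omitted for brevity):

```python
import numpy as np
import cvxpy as cp

def max_sharpe_weights(mu, Q, r_f=0.0):
    """Solve the convex reformulation (15) and recover x* = y/kappa for (14)."""
    m = len(mu)
    y = cp.Variable(m)
    kappa = cp.Variable()
    constraints = [
        (mu - r_f) @ y == 1,      # sum_i (mu_i - r_f) y_i = 1
        cp.sum(y) == kappa,       # sum_i y_i = kappa
        y >= 0,                   # short-selling disallowed (x >= 0)
        kappa >= 0,
    ]
    cp.Problem(cp.Minimize(cp.quad_form(y, Q)), constraints).solve()
    return y.value / kappa.value

# Toy example with three assets
mu = np.array([0.001, 0.002, 0.0015])
Q = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.06]])
print(max_sharpe_weights(mu, Q))
```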

[Fig. 7. The distribution of portfolio weights of the rolling horizon MVO model over the test period.]

4.3.2. Risk parity model

The risk parity portfolio model is analogous to the equal weights "1/m" portfolio, but from a risk perspective. It attempts to diversify risk by ensuring each asset contributes the same level of risk. Risk parity is sometimes referred to as "Equal Risk Contribution" (ERC). The complete risk parity optimization model is formulated as a least-squares problem that minimizes the differences between the following terms:

\min_{x \in \mathbb{R}^m} \; \sum_i \sum_j \big(x_i (Qx)_i - x_j (Qx)_j\big)^2
s.t.  \sum_i x_i = 1
      l \le Ax \le u
      x \ge 0    (16)

where x is the portfolio weight vector, x_i(Qx)_i represents the individual risk contribution of asset i, and Q is the covariance matrix.

The optimization problem (16) was solved successively over 20 rounds⁵ using the "Interior Point OPTimizer (IPOPT)" (Wächter & Biegler, 2006), which is a software library for large-scale nonlinear optimization of continuous systems. The distribution of the optimal weights of the rolling horizon ERC model over the test period is shown in Fig. 8.

⁵ We use 1200 trading hours for training and 300 hours to test the trading strategy in the first round, then we apply a sliding window of 300 trading hours forward until the last round.
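A minimal sketch of the ERC problem (16) with an off-the-shelf SLSQP solver (our own stand-in for the IPOPT setup used in the paper; the optional constraints l ≤ Ax ≤ u are omitted):

```python
import numpy as np
from scipy.optimize import minimize

def risk_parity_weights(Q):
    """Least-squares ERC formulation (16): minimize the squared differences
    between the individual risk contributions x_i (Qx)_i, subject to the
    weights summing to one and no short-selling."""
    m = Q.shape[0]

    def objective(x):
        rc = x * (Q @ x)                     # individual risk contributions
        return np.sum((rc[:, None] - rc[None, :]) ** 2)

    result = minimize(
        objective, x0=np.full(m, 1.0 / m), method="SLSQP",
        bounds=[(0.0, 1.0)] * m,
        constraints=[{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}])
    return result.x

Q = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.06]])
print(risk_parity_weights(Q))
```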

[Fig. 8. The distribution of portfolio weights of the rolling horizon ERC model over the test period.]

4.4. Performance analysis

The performance of SDDRRL is evaluated based on the realized rate of returns of the 5 best architectures over the testing period horizon (approximately 3.5 years). SDDRRL's performance is compared with three benchmarks: the rolling horizon MVO model, the rolling horizon risk parity model, and the uniform buy-and-hold (UBAH) strategy with an initially equal-weighted setting among 10 stocks. In Fig. 9, we notice that even though the 5 architectures do not have the same total return, they seem to unanimously agree on the investment policy towards the end of the testing horizon. This fact demonstrates that the five architectures do not earn equally during the same period or under the same market conditions, and therefore, they can be seen as five separate experts with five different investment strategies. The best online SDDRRL architecture achieves a total return of 94.71% compared to our benchmarks: 39.8% for the UBAH index, 38.7% for the rolling horizon ERC, and 10.4% for the rolling horizon MVO. The poor performance of the rolling horizon MVO model is mainly due to its sensitivity to transaction costs. We used the typical transaction cost of 0.55% due to bid-ask spread, similar to what we used for SDDRRL. The only difference to report in the case of the rolling horizon MVO is that the rebalancing operation for our portfolio happens every 300 trading hour data points, i.e. after each sliding window, leading to only twenty portfolio rebalancing operations in total. Besides, the MVO model makes the overly restrictive assumption of independently and identically distributed (IID) returns across different periods.

[Fig. 9. Benchmarking the top 5 SDDRRL architectures against the MVO, risk parity, and UBAH index benchmarks: total return over the trading hours of the test period.]

Moreover, it is clear from Fig. 9 that most of the experts had some difficulty standing out from the market performance during the first trading hours. This was predictable since the agents had learned their investment policy during the first training cycle that took place in 2013, the year in which the S&P 500 index posted a performance of 29.60%, which corresponds to the 4th best performance in the history of the S&P 500. The following years saw a significant drop: 11.39% in 2014 and -0.73% in 2015. This has inevitably affected the online trading dynamics of our portfolio, as can be seen in the range between 3000 and 4000 trading hours corresponding to these particular years. In fact, the majority of stocks in our portfolio show a devaluation during those periods. In other words, the five experts were trying to replicate the decision rules learned in 2013 during the ensuing testing period when market conditions drastically changed, which implies they were applying sub-optimal policies. But through the online learning cycle, as time progresses, the experts receive feedback on the market conditions and gradually adjust their learned investment policies. Therefore, the experts clearly begin to gain the upper hand and perform well relative to our benchmarks. In any case, it should be noted that the SDDRRL trading experts do not perform worse than the benchmarks. Based on these facts, SDDRRL investment decisions can be manually centralized by selecting the best investment strategy at a given trading time t, or by adding an additional agent responsible for centralizing trading decisions by handing the action over to the best expert at each trading time period. The choice of one or the other is left to the discretion of the reader. One can also think about using advanced boosting techniques as surveyed by Zhou (2012) or the Rainbow integrated agent developed recently by Hessel et al. (2017) to convert selected good learners into a better one. Indeed, the existence of different investment strategies that are not duplicated, especially during periods of recession, is a central point for better reliability of the portfolio. In this way, it will be easier for us to avoid strategies that fail during specific volatile periods, which will be reflected in the total return at the end of the horizon.

The hourly rate of return presented in Fig. 3 shows that BA and MSFT have a good Return on Investment (ROI) towards the end of the test period. This fact is reflected in Fig. 6, where the SDDRRL trader agent is more likely to put more weight on these two stocks during the same period. This fact is also reflected in the total return as shown in Fig. 9.

5. Conclusions and future work

To the best of our knowledge, this is the first attempt at multi-asset dynamic and continuous control using deep recurrent neural networks with customized architectures. A gradient clipping sub-task based Backpropagation Through Time has been used to avoid the vanishing gradient problem, and a Bayesian optimization technique has been deployed to effectively probe the hyperparameter space in order to estimate the set of hyperparameter values that lead to the maximum utility function while respecting the cardinality constraint. As a consequence, hourglass shape architectures (similar to autoencoders) emerge and appear to be a natural choice for this kind of application. Still, it would be interesting to investigate why such a pattern seems to be an appropriate choice and to examine whether there is a particular connection with non-stationary signals more generally. The optimized number of time-stacks was found to be approximately equal to 5 or 6, leading to annualized returns of around 20% throughout our testing period. However, the size of the training and testing windows should be optimized and was left for future work.

Moreover, this procedure does not require any time series predictions, which makes the SDDRRL architecture relatively robust to price fluctuations. Also, SDDRRL is modular, so it can be used with different neural network models such as Convolutional Neural Networks (ConvNets), Long Short-Term Memory (LSTM) units, Gated Recurrent Units (GRUs) or any combination of these. A more comprehensive study using these models deserves to be done for comparison purposes.

Additionally, as illustrated above, different policy neural network architectures follow different investment strategies over the same period of time, which can be interpreted as having five different portfolio management experts. By aggregating the decisions coming from these experts, we can be more robust to market fluctuations. One way to do this would be to use techniques from the Ensemble Methods literature, namely Bagging (Breiman, 1996), Boosting (Zhou, 2012), Bayesian parameter averaging (BPA) (Hoeting et al., 1999), or Bayesian model combination (BMC) (Monteith et al., 2011), to combine high-quality architectures. Another potential way to boost the overall performance would be to consider a Multi-Armed Bandit (MAB) approach (Katehakis & Veinott, 1987; Auer et al., 2002) operating in a non-stationary environment, the aim being to select the right expert each time we interact with the market.
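As a sketch of this bandit idea, the snippet below implements a discounted epsilon-greedy selector that tracks an exponentially weighted estimate of each expert's reward, so that older observations fade in a non-stationary market; the constant step size and exploration rate are illustrative assumptions rather than tuned values.

    import numpy as np

    class DiscountedEpsilonGreedy:
        """Selects one of K experts per period, favouring recent performance."""
        def __init__(self, n_experts, step_size=0.1, epsilon=0.05, seed=0):
            self.values = np.zeros(n_experts)   # recency-weighted reward estimates
            self.step_size = step_size          # constant step => exponential forgetting
            self.epsilon = epsilon
            self.rng = np.random.default_rng(seed)

        def select(self):
            if self.rng.random() < self.epsilon:
                return int(self.rng.integers(len(self.values)))  # explore
            return int(np.argmax(self.values))                   # exploit

        def update(self, expert, reward):
            # Constant-step-size update keeps tracking a drifting reward signal.
            self.values[expert] += self.step_size * (reward - self.values[expert])

    # Usage per trading period: k = bandit.select(); trade with expert k;
    # observe the realized portfolio return r_t; then bandit.update(k, r_t).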

Acknowledgments

The authors are very grateful to Zixuan Wang and to Yassine Yaakoubi for their help and constructive comments regarding this work. This research is supported by the Fonds de recherche du Québec – Nature et technologies (FRQNT).

References

Almahdi, S., & Yang, S. Y. (2017). An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Systems with Applications, 87, 267–279.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.

Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15, 319–350.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.

Cornuejols, G., & Tutuncu, R. (2006). Optimization Methods in Finance. Cambridge.

Deng, Y., Bao, F., Kong, Y., Ren, Z., & Dai, Q. (2016). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28, 653–664.

Deng, Y., Kong, Y., Bao, F., & Dai, Q. (2015). Sparse coding-inspired optimal trading system for HFT industry. IEEE Transactions on Industrial Informatics, 11, 467–475.

GPyOpt (2016). A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2017). Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence.

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–401.

Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. International Conference on Learning and Intelligent Optimization, (pp. 507–523).

IBM (1988). ILOG CPLEX Optimization Studio. https://www.ibm.com/ca-fr/products/ilog-cplex-optimization-studio.

Jiang, Z., Xu, D., & Liang, J. (2017). A deep reinforcement learning framework for the financial portfolio management problem. arXiv:1706.10059.

Katehakis, M., & Veinott, A. F. (1987). The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research, 12, 262–268.

Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforcement learning in portfolio management. arXiv:1808.09940.

Lu, D. W. (2017). Agent inspired trading using recurrent reinforcement learning and LSTM neural networks. arXiv. https://arxiv.org/pdf/1707.07338.pdf.

Maringer, D., & Ramtohul, T. (2012). Regime-switching recurrent reinforcement learning for investment decision making. Computational Management Science, 9, 89–107.

Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7, 77–91.

Markowitz, H. (2010). Portfolio theory as I still see it. Annual Review of Financial Economics, 2, 1–23. URL: https://doi.org/10.1146/annurev-financial-011110-134602.

Merton, R. C. (1969). Lifetime portfolio selection under uncertainty: The continuous-time case. The Review of Economics and Statistics, 51, 247–257.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature.

Monteith, K., Carroll, J. L., Seppi, K., & Martinez, T. (2011). Turning Bayesian model averaging into Bayesian model combination. Proceedings of the International Joint Conference on Neural Networks IJCNN'11.

Moody, J., & Saffell, M. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12, 875–889.

Moody, J., & Wu, L. (1997). Optimization of trading systems and portfolios. Decision Technologies for Financial Engineering, (pp. 23–35).

Moody, J., Wu, L., Liao, Y., & Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17, 441–470.

Neuneier, R. (1996). Optimal asset allocation using adaptive dynamic programming. Advances in Neural Information Processing Systems.

Ng, A., & Jordan, M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence.

Phaisangittisagul, E. (2016). An analysis of the regularization between L2 and dropout in single hidden layer neural network. Intelligent Systems, Modelling and Simulation (ISMS).

Powell, W. B. (2011). Approximate Dynamic Programming: Solving the Curses of Dimensionality (Second Edition). Wiley Series in Probability and Statistics.

Sharpe, W. F. (1994). The Sharpe ratio. The Journal of Portfolio Management, 21, 49–58.

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NIPS Proceedings of the 25th International Conference on Neural Information Processing Systems, 2, 2951–2959.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.

Van Roy, B. (1999). Temporal-difference learning and applications in finance. Conference, Computational Finance.

Wang, Y., Wang, D., Zhang, S., Feng, Y., Li, S., & Zhou, Q. (2017). Deep Q-trading. CSLT Technical Report-20160036.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Wächter, A., & Biegler, L. (2006). On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Mathematical Programming, 106, 25–57.

Zarkias, K. S., Passalis, N., Tsantekidis, A., & Tefas, A. (2019). Deep reinforcement learning for financial trading using price trailing. International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE.

Zhengyao, J., & Liang, J. (2017). Cryptocurrency portfolio management with deep reinforcement learning. Intelligent Systems Conference (IntelliSys), IEEE.

Zhou, Z.-H. (2012). Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC.


CRediT author statement

Amine Mohamed Aboussalah: Data curation, Conceptualization, Methodology, Coding, Writing - Original draft preparation. Chi-Guhn Lee: Resources, Writing - Reviewing and Editing, Supervision.

