
Hidden Markov Models Applied To Intraday Momentum Trading With Side Information

Hugh Christensen a,∗, Richard Turner b, Simon Godsill a

a Signal Processing and Communications Laboratory, Engineering Department, Cambridge University, CB2 1PZ, UK
b Machine Learning Group, Engineering Department, Cambridge University, CB2 1PZ, UK

Abstract

A Hidden Markov Model for intraday momentum trading is presented which specifies a latent momentum state responsible for generating the observed securities' noisy returns. Existing momentum trading models suffer from time-lagging caused by the delayed frequency response of digital filters. Time-lagging results in a momentum signal of the wrong sign when the market changes trend direction. A key feature of this state space formulation is that no such lagging occurs, allowing for accurate shifts in signal sign at market change points. The number of latent states in the model is estimated using three techniques: cross validation, penalized likelihood criteria and simulation based model selection for the marginal likelihood. All three techniques suggest either 2 or 3 hidden states. Model parameters are then found using Baum-Welch and Markov chain Monte Carlo, whilst assuming a single (discretized) univariate Gaussian distribution for the emission matrix. Often a momentum trader will want to condition their trading signals on additional information. To reflect this, learning is also carried out in the presence of side information. Two sets of side information are considered, namely a ratio of realized volatilities and intraday seasonality. It is shown that splines can be used to capture statistically significant relationships from this information, allowing returns to be predicted. An Input Output Hidden Markov Model is used to incorporate these univariate predictive signals into the transition matrix, presenting a possible solution for dealing with the signal combination problem. Bayesian inference is then carried out to predict the security's t + 1 return using the forward algorithm. The model is simulated on one year's worth of e-mini S&P500 futures data at one minute sampling frequency, and it is shown that pre-cost the models have a Sharpe ratio in excess of 2.0. Simple modifications to the current framework allow for a fully non-parametric model with asynchronous prediction.

Keywords: Bayesian inference, trend following, high frequency futures trading, quantitative finance.

1. Introduction

An intraday momentum trading strategy is presented, consisting of a Hidden Markov Model (HMM) framework that has the ability to use side information from external predictors. The proposed framework is quite general and allows any predictors to be used in conjunction with the momentum model. An appealing aspect of this model is that all the computationally demanding learning is done off-line, allowing for a fast inference phase, meaning the model can be applied to high-frequency financial data.

∗Corresponding author. Email addresses: [email protected] (Hugh Christensen), [email protected] (Richard Turner), [email protected] (Simon Godsill)

Quantitative trading, namely the application of the scientific method to trading, is now well established in the financial markets. A sub-section of this field is termed algorithmic trading, where algorithms are responsible for the full trade cycle, including the decision of when to buy and sell. When this process is dependent on the prior behavior of the security, it historically was termed technical analysis (Lo et al., 2000). Momentum trading (or trend following) falls into this category and is the most popular hedge fund style trading strategy currently used. For example, the largest quantitative hedge funds by assets under management famously trade momentum strategies (Anon, 2011). It can be inferred from this that momentum is the most significant exploitable effect in the financial markets, and as a result there is a large body of literature published on the effect (Hong and Stein, 1999). Momentum (or trend) can

Preprint submitted to arXiv June 22, 2020

arXiv:2006.08307v1 [q-fin.TR] 15 Jun 2020

be defined as the rate of change of price. As a strategy, momentum trading aims to forecast future security returns by exploiting the positive autocorrelation structure of the data. Once a trend is detected by careful estimation of the mean return (in the presence of noise), it can be predicted. The most well known trend-following system is the moving-average convergence-divergence (MACD), introduced by Gerald Appel in the 1970s (Gerald, 1999) and made famous by the success of a group of traders named the "turtles" (Faith, 2007). The MACD strategy uses the difference between a pair of cascaded low-pass filters in parallel to remove noise while estimating the true mean of the rate of change of price (Satchell and Acar, 2002). The reasons for the momentum effect existing are less than clear, despite extensive academic research on the subject. Financial data consists of deterministic and stochastic components, and both of these components can exhibit trends. Significant trends commonly occur even in data which is generated by a random process, such as geometric Brownian motion (Wilmott, 2006), and can be explained by the effect of summing random disturbances (Lo and MacKinlay, 2001). Attempting to model such stochastic trends can lead to spurious results. Deterministic reasons for trends existing are thought to include herding behaviour (Shiller, 2005), supply-and-demand arguments (Johnson, 2002) and delayed over-reactions that are eventually reversed (Jegadeesh and Titman, 1999). While there is debate in the academic literature between those who believe the momentum effect is still viable post-transaction costs, for example (Jegadeesh and Titman, 1999), and those who believe the effect has been arbitraged away, for example (Lesmond et al., 2004), the continued profitability of large momentum trading hedge funds is testament to the enduring nature of the momentum effect.
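The cascaded low-pass filter construction described above can be sketched as the difference of two exponential moving averages. This is a minimal illustration; the 12/26-period spans are the conventional MACD defaults, not values taken from this paper.

```python
import numpy as np

def ema(x, span):
    """Exponential moving average with smoothing factor 2/(span + 1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(x))
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

def macd(prices, fast=12, slow=26):
    """MACD line: difference of a fast and a slow low-pass filter."""
    prices = np.asarray(prices, dtype=float)
    return ema(prices, fast) - ema(prices, slow)

# A steadily rising price series gives a positive MACD (uptrend detected).
trend = np.linspace(100, 110, 200)
print(macd(trend)[-1] > 0)  # True
```

The delayed frequency response criticized in the abstract is visible here: both EMAs lag the price, so the MACD sign flips only some time after a trend reversal.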

The motivation for this paper is to apply HMMs to produce a trading algorithm that exploits the momentum effect, and that can be applied to the financial markets in real-time by industry practitioners. The core aim of the paper is to give the algorithm the best predictive performance possible, irrespective of methodology. Application of such work to the financial markets has obvious economic benefits.

The two main innovations presented in this paper are both novel applications of existing statistical techniques to an applied problem. No new methodologies are introduced in the paper. Firstly, the price discovery process of a security is described by a trend term in the presence of noise. This process is fitted into an HMM framework and various means of parameter estimation are inspected. Secondly, momentum traders often want to incorporate other information into their momentum-based forecast (the signal combination problem), and an IOHMM framework is established to allow this. For both innovations, realistic experiments are conducted (including transaction costs and slippage), results presented and conclusions drawn.

This paper is structured as follows. In Section 2 HMMs in finance and economics are reviewed and the HMM framework is introduced. In Section 3 the three learning methodologies are presented. In Section 4 two extrinsic predictors are developed and tested, and then in Section 5 learning is carried out using this side information. In Section 6 our inference algorithm is presented. In Section 7 we present the historical futures data, simulate the performance of the models on that data and present results. Finally, in Section 8 conclusions are presented, along with suggestions for further work.

2. Hidden Markov Models

An HMM is a Bayesian state space model that assumes discrete time unobserved (hidden or latent) states (Gales and Young, 2008). The basic assumptions of a Markov state space model are firstly that each state is conditionally independent of all other states given the previous state, and secondly that each observation is conditionally independent of all other observations given the state that generated it.

2.1. Literature Review of HMMs in Finance and Economics

In the 1970s Leonard Baum was one of the first researchers to work with what is now known as an HMM. He applied the methodology to securities trading for the hedge fund Renaissance Technologies (Baum et al., 1970; Teitelbaum, 2008). Since then HMMs have been used extensively in finance and economics (Bhar and Hamori, 2004; Mamon and Elliott, 2007). The first widely attributed public application of HMMs to finance and economics was by James Hamilton in 1989 (Hamilton, 1989). In his seminal paper, Hamilton views the parameters of an autoregression as the outcome of a discrete Markov process, where the observed variable is GNP and the latent variable is the business cycle. By observing GNP, the position in the business cycle can be estimated and future activity predicted.

Following Hamilton's paper there has been much Bayesian work discussing estimation of these models and providing financial and economic applications, most of which focus on Markov chain Monte Carlo


(MCMC). MCMC is a means of providing a numerical approximation to the posterior distribution using a set of samples, allowing approximate posterior probabilities and marginal likelihoods to be found. Two excellent reviews of the field of Bayesian estimation using MCMC are given by Chib (Chib, 2001) and Scott (Scott, 2002). Noteworthy papers applying Bayesian estimation techniques include the following. Frühwirth-Schnatter applies MCMC to a clustering problem from a panel data set of US bank lending data, where model parameters are time-varying (Frühwirth-Schnatter, 2001). Shephard applies the Metropolis algorithm to a non-Gaussian state space time series model and illustrates the technique by seasonally adjusting a money supply time series (Shephard, 1994). McCulloch and Tsay apply a Gibbs sampler for parameter estimation in their Markov switching model and illustrate their technique using the growth rates of GNP (McCulloch and Tsay, 1994). Meligkotsidou and Dellaportas tackle interest rate forecasting with a non-constant transition matrix using an MCMC reversible jump algorithm for predictive inference (Meligkotsidou and Dellaportas, 2011). Less commonly applied in the economic and financial literature is the technique of variational Bayes (VB) (Attias, 1999). VB provides a parametric approximation to the posterior, often using independence assumptions, in a computationally efficient manner. McGrory and Titterington apply VB to estimate the number of unknown states along with the model parameters from the daily returns of the S&P500 (McGrory and Titterington, 2009). Finally, the debate between learning in HMMs using frequentist methods such as expectation maximization (EM), versus Bayesian methods such as MCMC, is reviewed by Rydén, who highlights poor mixing and long computation times as potential computational disadvantages of MCMC (Rydén, 2008).

Many different extensions and modifications to the "vanilla" HMM have been proposed and applied to economics and finance. Input output HMMs (IOHMMs) include inputs and outputs and can be viewed as a directed version of a Hidden Random Field (Bengio and Frasconi, 1995; Kakade et al., 2002). Unlike HMMs, the output and transition distributions are not only conditioned on the current state, but are also conditioned on an observed input value. Bengio and Frasconi carry out learning in an IOHMM using a feed-forward neural network. Kim et al use an IOHMM to model stock order flows in the presence of two hidden market states (Kim et al., 2002). HMMs are a generalization of a mixture model where latent variables control the mixture component to be selected for each observation. In a mixture model, the latents are assumed to be i.i.d. random variables, as opposed to being related through a Markov process as in an HMM (Bishop, 2006). Liesenfeld applies a bivariate mixture model to stock price and trading volume (Liesenfeld, 2001). In this model, the behavior of volatility and volume results from the simultaneous interaction of the number of information arrivals and traders' sensitivity to new information, both of which are treated as latent variables. In a hierarchical HMM (HHMM), each state is itself an HHMM, allowing modelling of "the complex multi-scale structure which appears in many natural sequences" (Fine et al., 1998). Wisebourt generates a measure of the limit order book imbalance and uses it to drive latent market regimes inside an HHMM (Wisebourt, 2011). Poisson HMMs (PHMMs) are a special case of HMMs where a Poisson process has a rate which varies in association with changes between the different states of a Markov model (Scott, 2002). Branger et al apply a PHMM to model jumps in asset price in order to help inform contagion risk and portfolio choice (Branger et al., 2012). Hidden semi-Markov models (HSMMs) have the same structure as an HMM except that the unobservable process is semi-Markov rather than Markov. Here the probability of a change in the hidden state depends on the amount of time that has elapsed since entry into the current state (Yu, 2010). Bulla and Bulla apply an HSMM to daily security returns in order to capture the slow decay in the autocorrelation function of the squared returns, which HMMs fail to capture (Bulla and Bulla, 2006). Finally, factorial HMMs (FHMMs) distribute the latent state into multiple state variables in a distributed manner, allowing a single observation to be conditioned on the corresponding latent variables of a set of independent Markov chains, rather than a single Markov chain (Ghahramani and Jordan, 1997). Charlot applies an FHMM to design a new multivariate GARCH model with time-varying conditional correlation (Charlot, 2012).

Applications of HMMs in finance and economics range extensively, with latent variables including the business cycle (Grégoir and Lenglart, 2000), inter-equity market correlation (Bhar and Hamori, 2003), bond-credit risk spreads (Thomas et al., 2002), inflation (Kim, 1993; Chopin and Pelgrin, 2004), credit risk (Giampieri et al., 2005), options pricing (Buffington and Elliott, 2002), portfolio allocation (Elliott and Hinz, 2002; Roman et al., 2010), volatility (Rossi and Gallo, 2006; Dueker, 1997), interest rates (Elliott and Wilson, 2007; Ang and Bekaert, 2002), trend states (Dai et al., 2010; Pidan and El-Yaniv, 2011) and future asset returns (Shi and Weigend, 1997; Hassan and Nath, 2005; Dueker and Neely, 2007).

This paper relates to the broader field of research into the prediction of security returns by exploiting the momentum effect. To our knowledge, no other authors have considered momentum as a latent variable in an HMM setting. However, Christensen et al have considered a latent momentum formulation in a Bayesian filtering setting (Christensen et al., 2012). In that paper the authors track a continuous latent momentum state of a time series using a Rao-Blackwellized particle filter. The paper finds that the predictions are statistically significant when applied to a portfolio of futures in the presence of transaction costs. In general terms it is expected that an HMM would be able to outperform the Rao-Blackwellized particle filtering formulation. This is because an HMM with many states can model arbitrary transitions between trend states, e.g. a sudden reversal of trend at the top of the market, whereas a linear Gaussian model is limited to linear changes.

2.2. An HMM for Trading Momentum

Our model is based on the concept of a noisy trend, where the trend is a latent state and the price series is Brownian with a stochastic drift. In order to forecast the next time step in the HMM we begin with a distribution over the current hidden state and use the transition function to propagate this distribution forward in time. At the next time step we are able to infer the most likely hidden state and generate a predictive distribution over observables. This is done by taking a weighted average of the conditional distribution of the observations, where the weights come from the distribution over the hidden state. We are not interested in predicting price, an arbitrary value, but rather the change in price, or the return. Let yt be the price, such that Y = y1, . . . , yT, and ∆yt = log(yt/yt−1) be the return, such that ∆Y = ∆y2, . . . , ∆yT. In our model ∆yt (the observation) is influenced by a hidden, unobserved state mt (the trend, where d∆y/dt is a noisy estimate of mt), such that M = m1, . . . , mT. In order to find E(∆yt | ∆y1:t−1), a two step process of learning followed by inference is carried out. This model is shown in Figure 1.

The intuition behind the model is that security returns can be modelled as a noisy trend process and that while return can be observed, the trend state cannot be and must be inferred. While the MACD algorithm of Section 1 attempts to find the true value of this hidden state empirically by use of a digital filter, in this paper we model the observations and the hidden trend state explicitly, therefore allowing interpretation of all the parameters in a meaningful way. An additional advantage of an HMM formulation over MACD is that HMMs are able to track trends in a much more flexible way, by encoding non-linear relationships between the states. This

Figure 1: A state space model for a discretized continuous observed state ∆y (the change in price) and a discrete hidden state m (the trend). The relationship between the latent variables and the system parameters is shown.

allows for sudden changes to a new trend, whereas digital filters inevitably have some delay in response depending upon their frequency response. Finally, we note that digital smoothing filters can often be written equivalently as the stationary solution of particular linear state-space models (Harvey, 1991).

Time series, such as security returns, can be synchronous or asynchronous. A synchronous time series is one where the time stamps lie on a regular grid. The grid spacing is referred to as the sampling frequency. An asynchronous time series is one where the time stamps do not lie on a regular grid. Raw security returns are generally asynchronous, but are often sampled to make them synchronous. Security returns are not continuous in value, but lie on a discrete price grid, with a grid spacing defined on a security-specific basis by the exchange. This grid spacing is called the tick size, α. The state space of the latent states consists of a total of K possible values of trend. An upper limit to K can be found by knowing the grid size and calculating Ω = max(|∆Y|). The latent variables are indexed on this grid as,

mt ∈ {1, . . . , k, . . . , K}    (1)

where k refers to the kth latent state. By fixing K, the set of time-dependent trend terms can be specified by M. The observations depend on the latent states according to: the return at time t is equal to the trend term plus Gaussian noise,

∆yt = µmt + εt,   εt ∼ Norm(0, σ²mt)    (2)

Given the indexed grid of Equation (1), one mode of initialization would be µ1:K = −Ω, −(Ω − α), −(Ω − 2α), . . . , 0, . . . , (Ω − 2α), (Ω − α), Ω. The Gaussian assumption of Equation (2) could be replaced with any other parametric distribution (for example, fat-tailed Cauchy) or a non-parametric approach


(for example, kernel density estimation). The resulting conditional distribution from Equation (2) is,

p(∆yt | mt) = Norm(∆yt; µmt, σ²mt)

As the latents lie on a discrete grid, yet the noise model is continuous, the implementation is required to discretize the Gaussian noise variable to ensure the results lie on the grid. The joint distribution of this state space model is therefore given by,

p(∆Y, M) = p(m1) ∏t=2..T p(mt | mt−1) ∏t=1..T p(∆yt | mt)

Given the HMM and the observations ∆Y, one can deduce information about the states occupied by the underlying Markov chain m1:T. What we are interested in finding in this model is the probability of a trend given all our observations of price up to now, p(mt | ∆y1:t), also known as the filtering distribution.
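This filtering distribution can be computed recursively with the forward algorithm. A minimal sketch follows; the 2-state parameter values are hypothetical, not taken from the paper.

```python
import numpy as np

def forward_filter(returns, A, mu, sigma, pi0):
    """Recursively compute the filtering distribution p(m_t | dy_{1:t})."""
    alpha = np.zeros((len(returns), len(pi0)))
    for t, dy in enumerate(returns):
        # Gaussian emission likelihood of dy under each state (Equation 2)
        lik = np.exp(-0.5 * ((dy - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        pred = pi0 if t == 0 else alpha[t - 1] @ A   # propagate through A
        alpha[t] = lik * pred
        alpha[t] /= alpha[t].sum()                   # normalize to a distribution
    return alpha

# Hypothetical 2-state model: a down-trend state and an up-trend state.
A = np.array([[0.95, 0.05], [0.05, 0.95]])
mu, sigma = np.array([-0.01, 0.01]), np.array([0.02, 0.02])
filt = forward_filter([0.010, 0.012, 0.009], A, mu, sigma, np.array([0.5, 0.5]))
print(filt[-1])  # probability mass shifts toward the up-trend state
```

The one-step-ahead expected return of Section 2.2 is then the weighted average `(filt[-1] @ A) @ mu`.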

2.3. Model Parameters

Our model requires that the transition matrix A, emission matrix φ and latent node initial value π1 are known a priori. Together these form the parameters of our model, Θ = {A, π, φ}, as shown in Figure 1. Finding Θ constitutes the learning phase of the HMM. This batch approach to learning suits the structure of the financial markets, as parameter estimation can be done using the previous H days of market data, when the market is shut. Before discussing learning, the connection between the hidden states M and the model parameters Θ is explained. A specifies the probability of transitions between the latent states, π is the probability of the initial latent state and φ is the probability of the observed return occurring. The connection between parameter φ and Equation (2) is that φ(∆yt) follows a Gaussian distribution. In this paper four different off-line learning approaches are considered:

1. Θ is learnt using piecewise linear regression (PLR).
2. Θ is learnt using the Baum-Welch algorithm.
3. Θ is learnt using Markov chain Monte Carlo (MCMC).
4. Θ is learnt using the Baum-Welch algorithm in the presence of side information.

PLR and Baum-Welch are both frequentist methods, while MCMC is a Bayesian method. The inference phase of this paper is purely Bayesian. At this point we consider the correctness of combining frequentist and Bayesian methods in the same model. The core aim of

this paper is to produce the best predictive performance possible, irrespective of methodology used, and so from a philosophical viewpoint we are agnostic. From a practical point of view, frequentist methods are more commonly found in the trading industry. It is reasoned this is due to the relative simplicity of the methods, the parsimony of the models and the associated low computational loads. In particular, trading practitioners tend to dislike complex models, because the risks associated with model failure are low-probability but high-impact. These risks are easier to understand and monitor in simple models.

The major issue when learning is the transient nature of the latent state and how stable its estimated means are. In order to ensure the most accurate estimation possible, the means of the K Gaussian distributions (the trends) are efficiently estimated using short windows of data. This is implemented using a rolling window of data that consists of 23 trading days (one month). This window size approximately agrees with the lowest frequency information we are trying to exploit in our system.

The mixing of frequentist learning with Bayesian inference is a well established approach in the literature; for example, Andrieu and Doucet estimate static parameters in non-linear non-Gaussian state space models using EM-type algorithms (Andrieu and Doucet, 2003). Other examples of merging frequentist and Bayesian methodologies are given by Gelman (Gelman, 2011) and Jordan (Jordan, 2009), who suggest using Bayesian inference coupled with frequentist model-checking techniques. Such an approach gives the performance benefits of using a Bayesian prior, while allowing for the easily checkable assumptions given by frequentist confidence intervals. Completely integrated Bayesian techniques for our problem do exist, such as particle MCMC, which allows for fully Bayesian learning and inference; however, such techniques suffer from impractically high computational complexity (Andrieu et al., 2010).

2.3.1. State Transition Matrix

A conditional distribution for the latent variables p(mt | mt−1) is specified. Because the latent variables can take one of K values, this distribution is the transition matrix A, of size (K × K). The transition probabilities are given by,

A = [ a1,1  a1,2  . . .  a1,K
      a2,1  a2,2  . . .  a2,K
       .     .    . .     .
      aK,1  aK,2  . . .  aK,K ]

where

ak′,k = p(mt = Mk | mt−1 = Mk′),   k′, k = 1, . . . , K

i.e. the probability of making a particular transition from state k′ to state k in one time step is given by ak′,k. The diagonal corresponds to the probability of the system staying in its current state, while the lower diagonal corresponds to the system moving to a negative price trend and the upper diagonal corresponds to moving to a positive price trend. A has K(K − 1) independent parameters and each row of A is a probability distribution function such that ∑k ak′,k = 1.

2.3.2. Emission Matrix

The probability of an observation given the hidden state is given by the emission matrix φ. This matrix is a set of parameters governing the conditional distribution of the observed variables, p(∆Y | M, φ) = φk(∆y). In a discrete HMM, the emission matrix has size equal to the number of hidden states by the number of possible output states. For our continuous model, each one of the K states has an associated output distribution of a single univariate discretized Gaussian, with mean µk and variance σ²k, so that φk(∆y) ∝ Norm(∆y; µk, σ²k), with the normalized form given by Equation (3),

φk(∆y) = Norm(∆y; µk, σ²k) / ∑∆y′∈Y Norm(∆y′; µk, σ²k)    (3)

where Y denotes the set of all possible ∆y.
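The discretization in Equation (3) amounts to evaluating the Gaussian on the tick grid and renormalizing so that each φk sums to one. A minimal sketch, in which the tick size and grid bound are hypothetical:

```python
import numpy as np

def discretized_gaussian_emission(grid, mu, sigma2):
    """Normalize a Gaussian over the discrete return grid Y (Equation 3)."""
    dens = np.exp(-0.5 * (grid - mu) ** 2 / sigma2)
    return dens / dens.sum()   # phi_k now sums to exactly 1 over the grid

alpha = 0.25                   # hypothetical tick size
omega = 5 * alpha              # hypothetical max |return| (Omega)
grid = np.arange(-omega, omega + alpha / 2, alpha)   # tick-spaced grid
phi_k = discretized_gaussian_emission(grid, mu=0.0, sigma2=alpha ** 2)
print(abs(phi_k.sum() - 1.0) < 1e-12)  # True
```

The normalizing sum in the denominator is exactly the ∑∆y′∈Y term of Equation (3).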

2.3.3. Initial Latent Node

The initial latent node m1 is special in that it does not have a parent node, and so it has a marginal distribution p(m1) represented by a vector of probabilities π with elements πk ≡ p(m1 = k).

2.3.4. Number of Unknown States

As the number of latent momentum states K is unknown, estimating K is a model selection problem. There are various methodologies for determining K, both heuristic and formal, frequentist and Bayesian. We summarize some of these techniques here:

• Cross validation (Kohavi et al., 1995). Segment the data set into training and test portions. Select the K which gives the best predictive performance on the training data set and then apply it to the test data set.

• Generalized likelihood ratio tests (Vuong, 1989). The ratio of two models' likelihoods is used to compute a p-value, which allows the null hypothesis to be accepted or rejected.

• Penalized likelihood criteria, such as the Bayesian information criterion (BIC) (Schwarz, 1978) and the Akaike information criterion (AIC) (Akaike, 1974). These criteria penalize the maximized likelihood function by the number of model parameters. The disadvantage is that they do not provide any measure of confidence in the selected model.

• Approximate Bayesian computation (Toni et al., 2009). Simulation based model selection.

• Bayesian model comparison (Kass and Raftery, 1995). Theoretically powerful, but difficult to apply in practice. This approach is often approximated by MCMC (Gilks et al., 1996).

The posterior probability p(Mk | ∆Y, Θ) of a model Mk given data ∆Y is given by Bayes' theorem,

p(Mk | ∆Y, Θ) = p(∆Y | Mk, Θ) p(Mk) / p(∆Y)

For two different models M1, M2 with parameters Θ1, Θ2, the Bayes factor B can be used to carry out model selection,

B = p(∆Y | M1) / p(∆Y | M2) = [∫ p(Θ1 | M1) p(∆Y | Θ1, M1) dΘ1] / [∫ p(Θ2 | M2) p(∆Y | Θ2, M2) dΘ2]

The chosen model is simply the model with the highest integrated likelihood p(Mk | ∆Y, Θ). However, at times the prior p(Θ | Mk) is unknown and so the logarithm of the integrated likelihood can be approximated by the BIC. More accurate selection of K requires evaluating the marginal likelihood ∫ p(Θ | Mk) p(∆Y | Θ, Mk) dΘ, however this is an extremely difficult integral to calculate. Dealing with this integral is covered in Section 3.3 once MCMC has been introduced. In the following section, each method of learning uses one of the above techniques to estimate K.
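The penalized likelihood criteria are simple to compute from a fitted model's maximized log-likelihood. A sketch, assuming the free-parameter count K(K − 1) + (K − 1) + 2K (transition matrix, initial distribution, Gaussian emissions) and entirely hypothetical log-likelihood values:

```python
import numpy as np

def hmm_param_count(K):
    """Free parameters: K(K-1) for A, K-1 for pi, 2K for Gaussian emissions."""
    return K * (K - 1) + (K - 1) + 2 * K

def aic(loglik, K):
    return -2 * loglik + 2 * hmm_param_count(K)

def bic(loglik, K, T):
    return -2 * loglik + hmm_param_count(K) * np.log(T)

# Hypothetical maximized log-likelihoods for K = 1..4 on T observations.
T = 100_000
logliks = {1: -91_000.0, 2: -89_200.0, 3: -89_000.0, 4: -88_990.0}
best_K = min(logliks, key=lambda K: bic(logliks[K], K, T))
print(best_K)  # -> 3: BIC's penalty rejects the marginal gain of a 4th state
```

Lower criterion values are better; BIC's log(T) penalty grows with the sample size, so it selects smaller models than AIC on long return series.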

3. Learning Phase

The three independent methods of learning Θ are now presented. Results are shown for applying the methods to one year's worth of data at one minute sampling frequency from the ES future, a traded security. Full details of the dataset and its processing are described in Section 7.

3.1. Piecewise Linear Regression

The "default case" is presented as a baseline against which other methods of learning can be compared. A is initialized as,

ak′,k = β,                k′ = k
ak′,k = (1 − β)/(K − 1),  k′ ≠ k

where β is the probability of the state staying in its current state and is set as β = 0.5. The (1 − β)/(K − 1) term reflects the fact that in the absence of conditioning information, no state is more likely than any other state, though the system is most likely to stay in its current state and thus is described as "sticky".
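This sticky initialization can be written directly; a minimal sketch:

```python
import numpy as np

def sticky_transition_matrix(K, beta=0.5):
    """Initial A: stay with probability beta, move to any other state uniformly."""
    A = np.full((K, K), (1.0 - beta) / (K - 1))  # off-diagonal mass, shared evenly
    np.fill_diagonal(A, beta)                    # "sticky" self-transitions
    return A

A = sticky_transition_matrix(K=3)
print(A)  # each row is a probability distribution summing to 1
```

For K > 2 the diagonal dominates each off-diagonal entry, encoding the prior belief that trends persist from one minute to the next.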

Change points P in the price Y represent breaks between latent momentum states (i.e. trends). Using piecewise linear regression (PLR) on the training data set, P is found (Oh, 2011). PLR gives two things: firstly, a state sequence which can be used in learning later, and secondly, the model mean µk and variance σ²k. PLR is simply ordinary least squares carried out over segmented data, with change points tested for by t-statistics. For each segment of data that contains a trend, µk is the gradient of the regression and σ²k the variance, found from the maximum likelihood estimate for a Gaussian noise model,

σ = √( ∑t=1..T ε²t / T )

where ε are the regression residuals. The presence of autocorrelation would suggest that PLR was not working correctly, and so is checked for using the Durbin-Watson test (Durbin and Watson, 1971). Finally, it is noted that many other approaches exist for change point detection, for example (Adams and MacKay, 2007; Punskaya et al., 2002).

For the default case, the number of hidden states is found using cross validation and is set to K = 2.
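The per-segment OLS step above can be sketched as follows, assuming the change points have already been found; the series and change point below are illustrative only:

```python
import numpy as np

def segment_params(prices, changepoints):
    """OLS per segment: slope -> mu_k, ML residual variance -> sigma2_k."""
    params = []
    bounds = [0, *changepoints, len(prices)]
    for a, b in zip(bounds[:-1], bounds[1:]):
        t = np.arange(a, b)
        slope, intercept = np.polyfit(t, prices[a:b], 1)  # least-squares line
        resid = prices[a:b] - (slope * t + intercept)
        params.append((slope, np.mean(resid ** 2)))       # (mu_k, sigma2_k)
    return params

# Illustrative series: up-trend then down-trend, change point at t = 50.
t = np.arange(100)
prices = (np.where(t < 50, 1.0 * t, 50.0 - 1.0 * (t - 50))
          + np.random.default_rng(0).normal(0, 0.1, 100))
for mu_k, s2_k in segment_params(prices, changepoints=[50]):
    print(round(mu_k, 2))  # slopes close to +1.0, then -1.0
```

In the full method the change points themselves come from the t-statistic tests described above, and the residuals of each fit are checked with the Durbin-Watson statistic.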

3.2. Baum-Welch

The Baum-Welch algorithm is a special case of the EM algorithm which can be used to determine parameter estimates in an HMM when the state sequence of the latents is unknown (Baum et al., 1970). The algorithm attempts to find the parameters Θ which maximize the likelihood of having generated ∆Y,

Θ̂ = argmaxΘ p(∆Y | Θ)

Finding this global maximum is intractable, as it requires enumerating over all parameter values Θk and then calculating p(∆Y | Θk) for each k. Baum-Welch avoids this global search and instead settles for a local maximum. As with other members of the EM class, this is achieved by computing the expected log-likelihood of the data under the current parameters and then using this to iteratively re-estimate the parameters until convergence.

In the first step of the algorithm (the E-step), Baum-Welch uses the forward-backward algorithm, which finds the smoothing distribution p(k | ∆y1:T). The forward algorithm gives αt(k), which is the probability that the model is in state k at time t, based on the current parameters. The backward algorithm gives βt+1(k), which is the probability of emitting the rest of the sequence if we are in state k at time t + 1, based on the current parameters. For large numbers of observations, numerical under-flow can occur, hence in implementation log-probabilities are used (Kingsbury and Rayner, 1971).

The second step of the algorithm (the M-step) sees successive iterations of the algorithm update Θ, improving the likelihood up to some local maximum. This is done by calculating the occupation probabilities γt(k), the probability of the model occupying state k at time t. These probabilities are then used to find the maximum likelihood estimates of A and φ (Juang et al., 1986).
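One E-step/M-step iteration for the Gaussian-emission case can be sketched compactly with the scaled forward-backward recursions. The synthetic data and starting parameters below are hypothetical, and the variance floor used in the paper's implementation is omitted:

```python
import numpy as np

def baum_welch_step(y, A, mu, sigma2, pi0):
    """One EM iteration: scaled forward-backward (E-step), then M-step."""
    T, K = len(y), len(pi0)
    # Gaussian emission likelihoods B[t, k] = p(y_t | m_t = k)
    B = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
    alpha[0] = pi0 * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                        # forward pass (scaled)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):               # backward pass (scaled)
        beta[t] = A @ (B[t + 1] * beta[t + 1]) / c[t + 1]

    gamma = alpha * beta                         # occupation probabilities
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None] *        # pairwise state probabilities
          (B[1:] * beta[1:])[:, None, :]) / c[1:, None, None]

    # M-step: maximum likelihood re-estimates of A, mu, sigma2, pi
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    mu_new = (gamma * y[:, None]).sum(axis=0) / gamma.sum(axis=0)
    s2_new = (gamma * (y[:, None] - mu_new) ** 2).sum(axis=0) / gamma.sum(axis=0)
    return A_new, mu_new, s2_new, gamma[0], np.log(c).sum()

# Synthetic up-trend then down-trend returns (all values hypothetical).
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.01, 0.02, 200), rng.normal(-0.01, 0.02, 200)])
A = np.array([[0.9, 0.1], [0.1, 0.9]]); pi0 = np.array([0.5, 0.5])
mu = np.array([0.02, -0.02]); s2 = np.array([4e-4, 4e-4])
ll_prev = -np.inf
for _ in range(20):
    A, mu, s2, pi0, ll = baum_welch_step(y, A, mu, s2, pi0)
    assert ll >= ll_prev - 1e-9   # EM never decreases the likelihood
    ll_prev = ll
print(np.sort(mu))  # means move toward the two true trend values
```

The scaling constants c also give the log-likelihood for free, which is what the penalized likelihood criteria of Section 2.3.4 consume.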

Baum-Welch is used to find K by maximizing the log-likelihood of k = 1, . . . , 50 models. Penalized likelihood criteria are calculated for each model and the maximum value K = 3 selected. The results are shown in Figure 2.

Baum-Welch requires estimates for the initial values of the

Figure 2: Penalized likelihood criteria (AIC and BIC against the number of hidden states K). Finding the number of hidden states using Baum-Welch. The optimal model of K = 3 is shown by a red dot.

emission and transition matrices. A “flat start model”is defined by setting all the values of A to be equal andφ to the global mean/variance of the data. The problemwith this approach is that, depending on how the ini-tial HMM parameters are chosen, the local maximum to

7

which Baum-Welch converges to may not be the globalmaximum. Convergence is deemed to have occurredwhen either a certain number of iterations have passedor a certain log-likelihood tolerance has been met. Inorder to hit the global maximum, good initialization iscrucial. To avoid local minima, a prior is set over Θusing training data Z. Applying Baum-Welch to Z it isnoted that the square root of the model variances σ2

k isof the same order of magnitude as the tick-size α forthe ES contract. This is as expected as the algorithm isunable to predict with an accuracy smaller than the gridsize. Learning the covariance structure (untied) can re-sult in a implementation issue that for one or more statesthe local maxima might settle on a small number of datapoints, giving σ2

k → 0, preventing the log-likelihood in-creasing at each iteration of the M-Step. This is dealtwith in our implementation by never allowing the vari-ance to decrease below a fraction of the tick-size, α2/2.The Baum-Welch algorithm is shown in Algorithm 1.The notation of k to refers to a particular state and notthe indicator variable mt, as that is path-dependent.
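The penalized likelihood selection over candidate K can be illustrated as follows. The log-likelihood values here are invented for the example, the parameter count assumes untied univariate Gaussian emissions, and n_obs ≈ 258 days × 856 observations is taken from the text; none of this reproduces the paper's actual numbers.

```python
import numpy as np

def gaussian_hmm_n_params(k):
    """Free parameters of a K-state HMM with untied 1-d Gaussian emissions:
    (k-1) initial probs + k(k-1) free transition probs + k means + k variances."""
    return (k - 1) + k * (k - 1) + 2 * k

def aic_bic(log_lik, k, n_obs):
    """AIC = -2LL + 2p, BIC = -2LL + p*ln(T)."""
    p = gaussian_hmm_n_params(k)
    return -2 * log_lik + 2 * p, -2 * log_lik + p * np.log(n_obs)

# Illustrative log-likelihoods from Baum-Welch fits for k = 1..5 (made up):
lls = {1: -92000.0, 2: -90500.0, 3: -89800.0, 4: -89750.0, 5: -89720.0}
scores = {k: aic_bic(ll, k, n_obs=220_000) for k, ll in lls.items()}
best_bic = min(scores, key=lambda k: scores[k][1])  # K minimizing BIC
```

With these toy numbers the BIC penalty outweighs the small likelihood gains beyond three states, so `best_bic` is 3.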

3.3. Markov Chain Monte Carlo

MCMC methods are a class of algorithms for sampling from probability distributions, based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. By constructing Markov chains for sampling specific densities, marginal densities, posterior expectations and evidence can be calculated. The Metropolis-Hastings algorithm (MHA) is a simple and widely applicable MCMC algorithm that uses proposal distributions to explore the target distribution (Metropolis et al., 1953). MHA constructs a Markov chain by proposing a value for Θ from the proposal distribution, and then either accepting or rejecting this value (with a certain probability). Given the well established literature on MHA in the financial field, the reader is referred to the review in (Chib, 2001).
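A minimal random-walk MHA sketch on a toy univariate target (illustrative only; the paper's target is the HMM posterior, and all names here are ours):

```python
import numpy as np

def random_walk_mh(log_target, theta0, n_draws, step, rng):
    """Random-walk Metropolis-Hastings: propose theta' ~ N(theta, step^2),
    accept with probability min(1, target(theta') / target(theta))."""
    draws = np.empty(n_draws)
    theta, lp = theta0, log_target(theta0)
    for i in range(n_draws):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject in log space
            theta, lp = prop, lp_prop
        draws[i] = theta                          # on rejection, repeat theta
    return draws

# Toy target: unnormalized log-density of N(2, 1)
rng = np.random.default_rng(1)
draws = random_walk_mh(lambda th: -0.5 * (th - 2.0) ** 2, 0.0, 20_000, 1.0, rng)
post = draws[2_000:]   # discard a burn-in period, as done in the paper
```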

Alg. 1 HMM Baum-Welch. Θ = BW(Z, K)

1: Initialize
2: Θ = {A, φ} = extract(Z)  ▷ Extract initial parameters from the estimate
3: while q < maxIterations do
4:   ▷ Go around loop until parameters converge or tol is met
5:   Forward Pass
6:   α1(k) = p(m1)p(z1|m1)  ▷ Initialization
7:   for t = 2 to T do
8:     αt(k) = ∑mt−1 p(zt|mt)p(mt|mt−1)αt−1(k)  ▷ Generate a forwards factor by eliminating mt−1
9:   end for
10:  Backward Pass
11:  βT(k) = 1  ▷ Initialization
12:  for t = T − 1 to 1 do
13:    βt(k) = ∑mt+1 p(zt+1|mt+1)p(mt+1|mt)βt+1(k)  ▷ Generate a backwards factor by eliminating mt+1
14:  end for
15:  Occupation Probabilities
16:  γt(k) = αt(k)βt(k)/p(zt)
17:  Parameter Estimation
18:  µ(k) = ∑t γt(k)zt / ∑t γt(k)
19:  σ²(k) = ∑t γt(k)(zt − µk)(zt − µk)ᵀ / ∑t γt(k)  ▷ Marginalizing over k gives "tied" σ²
20:  φ ∼ Norm(Z; µk, σ²k)
21:  Ak,k′ = (1/p(z)) ∑t αt(k)ak,k′φk′(zt+1)βt+1(k′) / ∑t γt(k)
22:  score = p(Z|A, φ)
23:  Terminate
24:  if score < tol then
25:    Θ = {A, φ}  ▷ Maximum likelihood estimates
26:    return(Θ)
27:  end if
28: end while

In order to find the unknown number of states in a Bayesian framework, a prior distribution is placed on model Mk and the posterior distribution of Mk is then estimated given the data ∆Y,

p(Mk|∆Y) ∝ p(∆Y|Mk) × p(Mk)

where p(Mk) is the prior, p(Mk|∆Y) is the posterior, and the quantity we wish to estimate is the marginal likelihood p(∆Y|Mk). However, as the marginal likelihood integration is intractable, simulation based approaches must be used. There are many ways to approximate this marginal likelihood using MCMC draws, typically using MHA for each Mk separately. However, all known estimators have been shown to be biased (Robert and Marin, 2008). Another technique from the literature is reversible-jump MCMC (RJMCMC); however, this is highly computationally intensive (Green, 1995). Based on the lower run-time, K is estimated using marginal likelihoods. To avoid a biased estimator, this is done using a simulation based approximation of the marginal likelihood called bridge sampling (Frühwirth-Schnatter, 2006). Bridge sampling takes an i.i.d. sample from an importance density and combines it with the MCMC draws from the posterior density in an appropriate way. With bridge sampling, p(∆Y|Mk) is approximated by,

p(∆Y|Mk) ≈ [ L⁻¹ ∑_{l=1}^{L} κ(θ[l;k]) p*(θ[l;k]|∆Y,Mk) ] / [ N⁻¹ ∑_{n=1}^{N} κ(θ[n;k]) q(θ[n;k]) ]

where p*(θ|∆Y,Mk) = p(∆Y|θ,Mk) × p(θ|Mk) is the unnormalised posterior density of θ on Θk, κ is an arbitrary function on Θk, q is an arbitrary probability density on Θk, θ[n;k] are samples from the posterior p(θ|∆Y,Mk) obtained using MHA, and θ[l;k] are i.i.d. samples from q (Rydén, 2008). A drawback to the bridge-sampling approach is that if the number of hidden states is suspected to be larger than about six, then empirically the technique becomes inaccurate and a trans-dimensional approach such as RJMCMC has to be used. This is because it is essential that all modes of the posterior density are covered by the importance density q(θ), to avoid any instability in the estimators (Frühwirth-Schnatter, 2006).
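As a sanity check on the estimator above: choosing κ(θ) = 1/q(θ) makes the denominator equal to one and reduces bridge sampling to plain importance sampling. That special case can be verified on a toy conjugate model whose marginal likelihood is known in closed form; everything below (model, names, numbers) is ours, not the paper's.

```python
import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy model: y ~ N(theta, 1) with prior theta ~ N(0, 1),
# so the marginal likelihood is available exactly: p(y) = N(y; 0, 2).
y = 1.0
true_evidence = norm_pdf(y, 0.0, 2.0)

# Importance density q, roughly matched to the posterior N(y/2, 1/2)
rng = np.random.default_rng(0)
theta = rng.normal(0.5, 1.0, 100_000)                 # i.i.d. draws from q
q = norm_pdf(theta, 0.5, 1.0)
p_star = norm_pdf(y, theta, 1.0) * norm_pdf(theta, 0.0, 1.0)  # unnormalised posterior
evidence_hat = np.mean(p_star / q)                    # kappa = 1/q special case
```

Because q here is wider than the posterior, the importance weights are bounded and the estimator is stable; the covering requirement quoted above is exactly what fails when q misses posterior modes.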

A literature review was conducted on the estimation of the number of hidden states in S&P500 daily return data. Assorted techniques, including VB, RJMCMC, EM, penalized likelihood criteria and bridge-sampling, all estimated the data to contain between 2 and 3 hidden states (McGrory and Titterington, 2009; Robert et al., 2000; Rydén et al., 1998; Frühwirth-Schnatter, 2008; Rydén, 2008). As a result of this we believe that K ≤ 10, while noting our data sampling frequency is significantly different from that used in the literature (one minute versus daily). In order to find K, a series of mixture distributions of a univariate normal are specified. For each of the k = 1, . . . , 10 models the log of the bridge sampling estimator of the marginal likelihood p(∆Y|Mk) is found. The results are shown in Figure 3.

[Figure 3 (plot omitted; x-axis: number of hidden states K, left y-axis: log marginal likelihood, right y-axis: standard error): Log of the bridge sampling estimator of the marginal likelihood p(∆Y|Mk) under the default prior for K = 1, . . . , 10. The maximum is at K = 3. On the right-hand axis the standard error is shown for each model.]

It can be seen that the largest marginal likelihood is given by a mixture of three normal distributions, meaning K = 3 is the number of hidden states suggested by MCMC. Using this number of hidden states, Θ is found. The choice of prior is a critical step in the MCMC process and can lead to significant variations in the posterior probabilities. A proper prior is defined based on the methodology suggested by Frühwirth-Schnatter (Frühwirth-Schnatter, 2008). The prior combines the hierarchical prior for state specific variances σ²k with an informative prior on the transition matrix A, by assuming that each row (ai1, . . . , aiK), i = 1, . . . , K, follows a Dirichlet Dir(ei1, . . . , eiK) prior, where eii = 4 and eij = 1/(d − 1) for i ≠ j. By choosing eii > eij the HMM is bounded away from a finite mixture model (Frühwirth-Schnatter, 2008). The vector π = (π1, . . . , πK) of the initial states is drawn from the ergodic distribution of the hidden Markov chain.

As a point estimate is required for Θ, we must move from the distributional estimate to a point estimate. This is done by approximating the posterior mode, the value of Θ which maximizes the non-normalized mixture posterior density log p*(Θ|∆Y) = log p(∆Y|Θ) + log p(Θ). The posterior mode estimator is the optimal estimator with respect to the 0/1 loss function. The estimator is approximated by the MCMC draw with the largest value of log p*(Θ|∆Y).

As samples from the beginning of the chain may not accurately represent the desired distribution, a "burn-in" period of 2,000 draws was used. The run length was set to 10,000 draws. The implementation used the Bayesf toolbox, with full details of the approach followed found in subsection 11.3.3 of (Frühwirth-Schnatter, 2006). A selection of the MCMC outputs are shown in Figure 4.


[Figure 4 (plots omitted): Markov Chain Monte Carlo by the Metropolis-Hastings algorithm. Subplot one, histogram of the data in comparison to the fitted 3 component Gaussian mixture distribution. Subplot two, a point process representation (µ against σ²) for K = 3. Subplot three, MCMC posterior draws for µk. Subplot four, MCMC posterior draws for σ²k, for k = 1, 2, 3.]

3.4. Learning Summary

In this subsection the major differences between the three methods of learning are considered and the results compared. The three techniques for estimating the number of hidden states all gave very similar results. Cross validation for PLR gave K = 2, penalized likelihood criteria for Baum-Welch gave K = 3 and bridge-sampling for MCMC gave K = 3. For a momentum model both K = 2 and K = 3 make sense: K = 2 could correspond to upward/downward-trending momentum states, with K = 3 adding a no-trend momentum state. Any higher values of K may just be considered noise. Subplot three of Figure 4 supports this hypothesis by showing that the gradient of the trend is either positive, negative or zero, corresponding to upward/downward/no-trend states. These observations translate into different conditional means for the two/three normal distributions and are reported in the results, Section 7. The framework of two or three states is appealing, as experiments with MACD momentum models have shown that only the sign of the predictive signal has traction against the sign of future returns. The magnitude of the signal does not seem able to predict the magnitude of future returns.

The inclusion of the PLR learning allows a "naive" estimate of the system parameters to be compared to the formal EM and MCMC techniques. Both deterministic Baum-Welch and stochastic MCMC use statistical inference to find the number of hidden states and system parameters in an HMM. Baum-Welch can be used for maximum likelihood inference or for maximum a-posteriori (MAP) estimates. A Bayesian approach retains distributional information about the unknown parameters, which MCMC can be used to approximate. Baum-Welch computes point estimates (modes) of the posterior distribution of parameters, while MCMC generates distributional outputs. Both learning approaches have their advantages and disadvantages. One pass of the EM algorithm is computationally similar to one sweep of MCMC; however, typically many more MCMC sweeps are run than EM iterations, meaning the computational cost for MCMC is much higher. Baum-Welch does not always converge on the global maximum, whereas MCMC suffers from the difficulty of choosing a good prior and potentially poor mixing. For MCMC, estimating the number of latent states by a mixture likelihood may be a fragile process: it will obviously depend upon the distributions chosen. If a non-Gaussian distribution were selected, the mixture might be of lower order. This point also applies to the other learning approaches. In summary, EM is found to be the simplest and quickest solution (Rydén, 2008). The relative predictive performance of the three sets of Θ is presented in Section 7.

So far, we have considered the relatively simple specification of two- and three-state Markov regime switching between Gaussian distributions. This approach is well known to be able to capture some aspects of the nonlinearity of price formation; however, it does suffer from overfitting and unobservability of the underlying states. Chen et al. provide an interesting critique of other such approaches applied to forecasting electricity prices (Chen and Bunn, 2014). The authors conclude that a finite mixture approach to regime switching performs best in out-of-sample testing, a methodology that we may look to in future work. In the following section, the sophistication of the model is increased by the inclusion of exogenous information.

4. Side Information

In this Section, the case where the probability of any given state in A is affected by side information from outside the model is considered. This is important as A governs the dynamics of Y. In "classical" trading models, the t + 1 return of a security is forecast by a "signal", a univariate time series, typically synchronous and continuous between ±1. When this signal is > 0 the trader will go "long" and when the signal is < 0 the trader will go "short". In the simple case of a portfolio consisting of only one security, the number of lots of security to be traded is directly proportional to the product of the signal magnitude and available capital. Historically such predictive trading signals are generated from either "technicals" or "fundamentals". Technicals are signals based on the prior behaviour of the security (Schwager, 1995b). Fundamentals are signals based upon extrinsic factors (such as economic data) (Schwager, 1995a). In this section two predictive signals are generated and shown to have statistical traction against security returns. In Section 5, the information held in these signals is used when learning A. This methodology is quite general and as such could be applied to any technical or fundamental predictor.

Momentum traders often want to combine their momentum signal with one or more extrinsic predictive signals to give a single forecast. This is called the signal combination problem, for which a variety of different solutions exist, for example, Bayesian model averaging (Hoeting et al., 1999), frequentist model averaging (Wang et al., 2009), expert tracking (Cesa-Bianchi and Lugosi, 2006) and filtering (Genasay et al., 2001). It is noted that our approach of biasing the transition dynamics of an HMM momentum trading system using external predictors seems to be another possible solution to this problem.

4.1. Forecasting with Splines

Splines are now introduced as the methodology by which we condition learning of the transition matrix. Splines are a way of estimating a noisy relationship between dependent and independent variables, while allowing for subsequent interpolation and evaluation (Reinsch, 1967). Splines have been used extensively in the financial trading literature, in areas as diverse as volatility estimation (Audrino and Bühlmann, 2009), yield curve modelling (Bowsher and Meeks, 2008) and returns forecasting (Dablemont, 2010). We use a B-spline as a way of capturing a stationary, non-linear relationship between predictor and security return. Splines are implemented in MATLAB using the shape modelling language toolbox (D'Errico, 2011) and the curve fitting toolbox (MATLAB, 2009). In our experience, fitting splines seems to be as much an art as a science, with sources of variability including how to treat end-points and the number of knots, depending on the degree of "belief" in the underlying economic argument for the relationship. Each spline is forced to be zero mean by setting the integral of the spline to be zero, as the mean value of a function is the integral of that function divided by the length of its support. A zero mean spline ensures that no persistent bias is allowed over the interval of estimation. For a predictor X = x1, x2, . . . , xT the learning and subsequent forecasting procedure is shown in Algorithm 2.

Alg. 2 Learning and Forecasting With Splines. ∆Y = LAFWS(Y, X)

1: for n = 1 to N do
2:   for t = 1 to T do
3:     ∆y = log(yⁿt / yⁿt−1)  ▷ Take returns
4:   end for
5:   ∆y = (∆y − µ∆y)/σ∆y  ▷ Normalize the return, ∆y ∼ Norm(0, 1²)
6:   Gn−n′:n = spline(xn−n′:n, ∆yn−n′:n)  ▷ Generate spline G
7:   if n > n′′ then
8:     for t = 1 to T do
9:       ∆yt = Gn−n′′:n(xt)  ▷ Evaluate the spline
10:    end for
11:  end if
12: end for

where t = 1, . . . , T is intra-day time and n = 1, . . . , n′, n′′, . . . , N is inter-day time. Spline evaluation is intra-day, while spline learning is inter-day: the spline is "grown" over time, allowing it to capture new information and forget old information. The normalization step for price is carried out using an exponentially weighted moving average process for both the mean µ∆y and the volatility σ∆y (Pesaran et al., 2009). In our code n′′ = 66 days, with N = 258 days and T = 856 observations per day. In this way the spline is estimated using the previous 66 trading days' worth of data, on a rolling basis.
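An analogous sketch of the fit-and-centre step in Python (the paper's implementation is in MATLAB; here scipy's UnivariateSpline is assumed as a stand-in, and the zero-mean constraint is imposed by subtracting the spline's integral over its support divided by the support length, as described above):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline  # assumed available

rng = np.random.default_rng(0)
x = np.linspace(0.5, 1.0, 400)                      # e.g. predictor values
y = 0.1 * np.sin(6 * np.pi * x) + 0.02 * rng.standard_normal(400)  # noisy returns

spl = UnivariateSpline(x, y, k=3, s=0.05)           # smoothing cubic spline

# Force the forecast function to be zero mean over its support, so that
# random evaluations of the spline carry no persistent long/short bias.
offset = spl.integral(x[0], x[-1]) / (x[-1] - x[0])
forecast = lambda xq: spl(xq) - offset
```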

In the next two sections we implement two popular "off the shelf" predictors from the literature, which exploit intraday effects, and use them to generate X.

4.2. Predictor I: Volatility Ratio

An extensive body of empirical research exists showing that realized volatility has predictive power against security returns (Christoffersen and Diebold, 2003; Hibbert et al., 2008; Giot, 2005; Burghardt and Liu, 2008). These observations can be explained by showing that the sign dynamics of security returns are driven by volatility dynamics (Kinlay, 2006). Modelling the returns process ∆yt as Gaussian with mean µ and conditional volatility σt gives a probability density function f and a cumulative distribution function F. The probability of a positive return, p(∆yt+1 > 0), is given by 1 − F(0). This shows the probability of a positive return is a function of the conditional volatility σt+1|t, and so as σt+1|t increases, the probability of a positive return falls. In order to be able to benefit from this relationship, a forecast for σt+1|t is required.

Much literature exists on the subject of volatility forecasting; a summary is beyond the scope of this paper, so the reader is instead directed to three excellent reviews (Poon and Granger, 2003; Pesaran et al., 2009; Zaffaroni, 2008). The main finding of these reviews is that sophisticated volatility models cannot outperform the simplest models with any statistical significance, and for that reason we use the IGARCH(1,1), otherwise known as the J.P. Morgan RiskMetrics EWMA model (JPM, 1996; Pafka and Kondor, 2001). A drawback to this approach is the recent finding in the literature that volatility estimation with data above ∼20-minute frequency can lead to artifacts in the estimate (Andersen et al., 2011).

The EWMA methodology exponentially weights the observations, representing the finite memory of the market, as per Equation (4),

σt+1|t = √( (1 − λ) ∑_{τ=0}^{ψ} λ^τ ∆y²t−τ )    (4)

The model has two parameters, ψ (window size) and λ (variance decay factor, where 0 < λ < 1), which are fixed a-priori; there is a trade-off between λ and ψ, with a small λ yielding similar results to a small ψ. The original J.P. Morgan documentation suggests using λ = 0.94 with daily frequency data, though we increase the reactivity of the term to fit our one-minute frequency data and set λ = 0.79 (Pesaran and Pesaran, 2007; Patton, 2010). This leaves the only remaining parameter of the model as the number of historical observations ψ to include in the estimate.

There are many technical indicators based on volatility in the popular trading literature, including Bollinger bands, the ratio of implied to realized volatility, and the ratio of current volatility to historical volatility. We choose to implement the latter, termed the volatility ratio, as designed by Chande in 1992 (Chande, 1992), which requires estimating conditional volatilities for "now" and in the "past" (Colby, 2002; investopedia.com, 2016; quantshare.com, 2016). We parameter sweep the ratio and select values ψfast = 50 and ψslow = 100 based on stability and predictive performance. The input to Algorithm 2 is given by X = σt+1|t(ψfast)/σt+1|t(ψslow).
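Equation (4) and the resulting volatility-ratio input can be sketched as follows (parameter values λ = 0.79, ψfast = 50, ψslow = 100 are from the text; the function names are ours):

```python
import numpy as np

def ewma_vol(dy, lam, psi):
    """Equation (4): sigma_{t+1|t} = sqrt((1-lam) * sum_{tau=0..psi} lam^tau * dy_{t-tau}^2).
    Assumes len(dy) >= psi + 1."""
    w = (1 - lam) * lam ** np.arange(psi + 1)
    window = dy[-(psi + 1):][::-1]        # dy_t, dy_{t-1}, ..., dy_{t-psi}
    return np.sqrt(np.sum(w * window ** 2))

def vol_ratio(dy, lam=0.79, psi_fast=50, psi_slow=100):
    """Chande-style volatility ratio: 'recent' over 'older' conditional volatility."""
    return ewma_vol(dy, lam, psi_fast) / ewma_vol(dy, lam, psi_slow)

rng = np.random.default_rng(0)
dy = 0.05 * rng.standard_normal(1_000)    # toy one-minute returns
x = vol_ratio(dy)                         # the input X to Algorithm 2
```

For a constant-magnitude return series the two EWMA estimates coincide (up to the truncated tail of the weights), so the ratio is approximately one.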

4.3. Predictor II: Seasonality

Seasonality is an extremely well documented effect in the financial markets and is defined by returns from securities varying predictably according to a cyclical pattern across a time-scale (Bernstein, 1998). The time-scale of the variation in question varies from multi-year (Booth and Booth, 2003) to yearly (Lakonishok and Smidt, 1988), monthly (Ariel, 1987), weekly (Franses and Paap, 2000), daily (Peiro, 1994) and intraday (Taylor, 2010). The fact that the periodicity (i.e. frequency) is fixed and known a-priori distinguishes the effect from other cyclical patterns in security returns (Taylor, 2007).

Intra-daily seasonality is where returns vary conditionally on the location within the trading day. Hirsch observed in 1987 that, in the case of the Dow Jones Industrial Average, the market spends most of the trading day going down and a very small amount of time going up (i.e. the rises are large and fast and the falls are gradual and slow), with the rises happening post-open and post-lunch (Hirsch, 1987).

A wide range of methodologies for extracting seasonality signals from financial data exist in the literature, including the FFT (Murphy, 1987), seasonal GARCH (Baillie and Bollerslev, 1991), the flexible Fourier form (Andersen and Bollerslev, 1997), wavelets (Gencay et al., 2002), Bayesian auto-regression (Canova, 1993), linear regression (Lovell, 1963) and splines (Huptas, 2009). As we know the size of the cycle a-priori, believe the effect to be non-linear and prefer to work in the time-domain, splines are chosen to estimate the relationship between time of day and security return. The use of splines seems to be a well accepted way of capturing seasonality, for example (Martin-Rodriguez and Caceres-Hernandez, 2005; Robb, 1980; Cáceres-Hernández and Martín-Rodríguez, 2007; Taylor, 2010; Martín Rodríguez and Cáceres Hernández, 2010).

Following the approach of Martín Rodríguez and Cáceres Hernández, a seasonal index is used to quantify the cycle (Martín Rodríguez and Cáceres Hernández, 2010). The authors construct an index by defining the period of time under consideration, partitioning it into a periodic grid between one and T, and assigning observations to buckets on this grid. They then capture the seasonal variation by fitting a spline to the seasonal index and the bucketed data. In the case of our one-minute frequency data the size of the period is T = 856. The input to Algorithm 2 is given by X = [1, . . . , T].
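The bucketing step of the seasonal index can be sketched as follows (toy data and names are ours; a spline would then be fitted to the resulting index, as above):

```python
import numpy as np

def seasonal_index(minute_of_day, dy, T=856):
    """Bucket intraday returns by minute-of-day (1..T) and average each bucket,
    giving the raw seasonal index to which a spline is fitted."""
    idx = np.zeros(T)
    for t in range(1, T + 1):
        mask = minute_of_day == t
        if mask.any():
            idx[t - 1] = dy[mask].mean()
    return idx

# Toy data: 20 synthetic days sharing a deterministic intraday pattern
T, days = 856, 20
minute = np.tile(np.arange(1, T + 1), days)
pattern = 0.01 * np.sin(2 * np.pi * np.arange(1, T + 1) / T)
dy = np.tile(pattern, days)
idx = seasonal_index(minute, dy, T)
```

With noise-free inputs the recovered index reproduces the underlying pattern exactly; with real returns each bucket averages many noisy observations instead.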

4.4. Simulation and Results

For our data set, consisting of one year's worth of ES data at one-minute frequency, Algorithm 2 is applied to the two predictors and the results presented. Firstly, the two splines generated from the training data set are shown in Figure 5. It can be seen that the relationship is non-linear. It is also clear that the integral of the splines is zero, meaning that a series of random evaluations of the spline will lead to a zero mean signal, as required. From the degree of local structure of the splines it is clear that these are empirical relationships; however, this does not invalidate them as predictors, but merely requires a stronger belief in the underlying economic hypotheses behind them. The economic interpretation of Figure 5 for the volatility ratio predictor is that a small (0.6) ratio of recent to old volatility means that risk is falling, and so the spline suggests buying. A large (0.8) ratio of recent to old volatility means that risk is rising, and so the spline suggests selling. For the seasonality predictor, the spline suggests buying in the early morning and selling in the afternoon.

The choice of the number of knots for the spline is important. Too many knots means the spline will be very tightly fitted to the data, while too few knots may fail to capture the relationship of interest. The problem with over-fitting the relationship is that the in-sample performance will be great, but the out-of-sample performance will be poor. Hence it is a matter of balance, decided upon by intuition about the variability of the underlying economic relationship. 6 knots are chosen for the volatility predictor and 10 knots for the seasonality predictor. Increasing the number of knots on the volatility predictor to 40 doubles the predictive performance in-sample, but is probably just fitting to noise.

The performance of the two strategies can now be simulated against a benchmark of a long-only strategy. Special care is taken to ensure that the simulation is a truly out-of-sample simulation. Specifically, each one of the data points used to evaluate a trade had not been used in any of the previous stages of model identification, learning and estimation. The annualized Sharpe ratio is a popular measure of risk-adjusted return and is defined as √N(µ − r)/σ, where µ is the mean strategy return, σ is the standard deviation of the strategy return, r is the risk free rate and N is the number of trading periods in the year. The ratio is computed by calculating a vector of daily returns, generated by finding the total intraday strategy return each day and setting N = 258. This aggregation approach is preferable to scaling by √N for intraday N, as the output is more stable. As our final signal is zero mean and interest is earned at rate r on short futures positions, we set r = 0.
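The Sharpe computation described above, as a sketch (the daily return vector is invented for the example):

```python
import numpy as np

def annualized_sharpe(daily_returns, n_periods=258, r=0.0):
    """sqrt(N) * (mean - r) / std over a vector of daily strategy returns."""
    mu = daily_returns.mean()
    sigma = daily_returns.std(ddof=1)
    return np.sqrt(n_periods) * (mu - r) / sigma

daily = np.array([0.002, -0.001, 0.003, 0.000, 0.001])  # toy daily returns
sr = annualized_sharpe(daily)
```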

The results of the simulation are shown in Figure 6. Subplot one shows the annual returns for the two strategies against a long-only portfolio for the 258 trading days of 2011. It can be seen that the returns profile is different for the two strategies: while the volatility strategy return is higher, it is also more volatile, which results in the two strategies having similar risk-adjusted return profiles. Subplot two shows the mean (pre-cost) annualized Sharpe ratios for the strategies. The Sharpe ratio of both strategies is around 2.0, as commonly required for an intraday trading strategy to be successful. Subplot three shows the correlation coefficients between the strategies, which are either small and positive or negative, as required for a diverse portfolio.

In summary, both predictors seem to have traction against forecasting the returns of ES and thus contain information of predictive use. For that reason we try to incorporate them into our HMM momentum model. The "classical" way of doing this would be to combine the final three signals, for example by taking a weighted mean. Rather than combine the signals outright, the information held in the splines is used in the learning phase.

5. Learning With Side Information

5.1. Introduction

The HMM of Figure 1 states that the probability of transitioning between momentum states is only dependent on the last momentum state, p(mt|mt−1). From Section 4 we have two splines that we know contain useful information when it comes to predicting security returns. In this Section the HMM is re-specified by incorporating the side information held in the splines, such that the transition distribution is given by p(mt|mt−1, xt). The belief behind this new model is that the extrinsic data is of value in predicting the change in price of the security. Essentially we are saying that not all of the securities' variance can be explained by the momentum effect, even though we believe it to be the dominant factor.

5.2. Input Output Hidden Markov Models

In Input Output Hidden Markov Models (IOHMMs) the observed distributions are referred to as inputs and the emission distributions as outputs (Bengio and Frasconi, 1995). Like regular HMMs, IOHMMs have a fixed number of hidden states; however, the output and transition distributions are not only conditioned on the current state, but are also conditioned on an observed discrete input value X. In the HMM of Section 2.2, Θ was chosen to maximize the likelihood of the observations given the correct classification sequence p(∆Y|M,Θ).


[Figure 5 (plots omitted). Subplot one: volatility ratio spline on the e-mini S&P500 future (258 trading days of 2011); x-axis volatility ratio 0.5–1.0, y-axis returns. Subplot two: intraday seasonality spline; x-axis time of day (Chicago local time, 0100–1515), y-axis returns. Caption: Forecasting splines. Subplot one shows the spline generated by the volatility ratio predictor. Subplot two shows the spline generated by the seasonality predictor. This approach could be generalized when using N predictors, by generating an N-dimensional spline.]

[Figure 6 (plots omitted). Subplot one: cumulative percentage returns, 2011–2012, for the volatility ratio, seasonality and long-only strategies. Subplot two: annualized strategy Sharpe ratios. Subplot three: inter-strategy correlation coefficients (volatility ratio/long-only 0.17, seasonality/long-only −0.48, volatility ratio/seasonality −0.15). Caption: Forecasting splines results. Subplot one shows the annual returns for the two strategies against a long-only portfolio for the 258 trading days of 2011. Subplot two shows the mean (pre-cost) annualized Sharpe ratios for the strategies. Subplot three shows the correlation coefficients between the strategies.]

IOHMMs are trained to maximize the likelihood of the conditional distribution p(∆Y|X,Θ), where the latent variable M is stochastically relayed to the output variable ∆Y. A schematic of the model is shown in Figure 7.

We consider the simplifying case where the input and output sequences are synchronous (Bengio et al., 1999). Such a system can be represented with discrete state space distributions for emission p(∆yt|mt, xt) and transition p(mt|mt−1, xt). When the extrinsic predictor and the HMM momentum predictor have different time stamps, or are of different sampling frequencies, an asynchronous setup is required, adding computational complexity to the forward-backward recursion (Bengio and Bengio, 1996). It is noted such a technique could allow signals of a lower frequency to be used in a high-frequency inference problem; for example, low-frequency macro-economic data could be used to bias intraday trading.

[Figure 7 (diagram omitted). Caption: Bayesian network showing the conditional independence assumption of a synchronous IOHMM. ∆Y is an observable discrete output, X is an observable discrete input and M is an unobservable discrete variable. The model at time t is described by the latent state conditional on the observed state and some external information p(mt|∆y1:t, xt).]

The literature suggests three main approaches to learning in IOHMMs: artificial neural networks (Bengio and Frasconi, 1995), partially observable Markov decision processes (Bäuerle and Rieder, 2011) and EM (Bengio et al., 1999). As Baum-Welch (an EM variant) was used for learning in the HMM case, in order to be consistent we opt to learn by EM for the IOHMM case too. In terms of Algorithm 1, the only changes required to deal with the IOHMM case are to lines 8 and 13,

αt(k) = ∑mt−1 p(zt|mt, xt) p(mt|mt−1, xt−1) αt−1(k)

βt(k) = ∑mt+1 p(zt+1|mt+1, xt+1) p(mt+1|mt, xt) βt+1(k)
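The modified forward recursion can be sketched by indexing a per-bucket transition matrix with the discretized side information. This is a hypothetical illustration (names are ours) with per-step normalization for numerical stability:

```python
import numpy as np

def iohmm_forward(B, A_buckets, bucket_seq, pi):
    """Forward pass where the transition matrix applied at time t is the one
    trained for the spline bucket of the side information x_{t-1}.
    B is the T x K matrix of emission probabilities."""
    T, K = B.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        A = A_buckets[bucket_seq[t - 1]]      # p(m_t | m_{t-1}, x_{t-1})
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()            # normalize to avoid underflow
    return alpha

# Toy usage: two states, two buckets of side information
rng = np.random.default_rng(0)
B = rng.uniform(0.1, 1.0, (50, 2))
A_buckets = [np.array([[0.9, 0.1], [0.2, 0.8]]),
             np.array([[0.5, 0.5], [0.5, 0.5]])]
bucket_seq = rng.integers(0, 2, 50)
alpha = iohmm_forward(B, A_buckets, bucket_seq, np.array([0.5, 0.5]))
```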

To implement this methodology a different A is trained for every unique value of X. Such an approach has the drawbacks of over-parameterization and requiring large amounts of data. This is solved by discretizing the spline according to its roots, with R − 1 roots giving R "buckets" of spline. xt is then aligned with ∆yt, and ∆yt assigned to one of the R buckets, the contents of each bucket being concatenated to give a data vector. Baum-Welch learning with Algorithm 1 is then carried out on each of these vectors, as before. As the transition distribution p(mt|mt−1, xt) is time sequential, concatenating the bucketed data is strictly incorrect, as occasionally p(mt|mt−τ, xt) occurs, where τ > 1. In the case of the two splines in Figure 5, the discretization gives R = 5 and R = 2 for the volatility and seasonality predictors respectively. Given the smoothness of the splines, concatenation is rare and so the resulting small loss of Markovian structure can be ignored. The obvious advantage of discretizing by roots is that the parameters A1, A2, . . . , AR map to signed returns. The learning algorithm for IOHMM is shown in Algorithm 3.

Alg. 3 IOHMM Learning. Θ = iohmmLearning(∆Y, X)

1: R = NewtonRaphson(G)    Find the roots of spline G
2: Z1:R = map(∆Y, X, R)    Map ∆Y to buckets corresponding to the roots of G
3: for r = 1 to R do
4:    Θr = BW(Zr)    Baum-Welch on the R buckets, as per Algorithm 1
5: end for
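As a concrete illustration of the bucketing step (line 2 of Algorithm 3), the following Python sketch assigns each aligned observation pair to one of the R buckets delimited by the spline's roots. The function name and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bucket_by_spline_roots(x, dy, roots):
    """Assign each aligned pair (x_t, dy_t) to one of R buckets delimited
    by the R-1 real roots of the fitted spline G. Returns a list of R
    arrays of returns, one per bucket."""
    edges = np.sort(np.asarray(roots))
    idx = np.digitize(x, edges)          # bucket index 0..R-1 for each x_t
    return [dy[idx == r] for r in range(len(edges) + 1)]

# Toy example: a spline with roots at -1 and 1 gives R = 3 buckets.
x = np.array([-2.0, -0.5, 0.5, 2.0, 3.0])
dy = np.array([0.1, -0.2, 0.3, -0.1, 0.4])
buckets = bucket_by_spline_roots(x, dy, roots=[-1.0, 1.0])
```

Each bucket's concatenated return vector would then be passed to Baum-Welch (line 4) to learn a separate parameter set Θr.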

Using the methodology described above, two independent predictions are generated for the two IOHMM models, one for the volatility ratio and one for seasonality. However, it may be the case that we wish to combine the two predictors into a single prediction. In this case of more than one predictor, X is treated as multivariate and a multi-dimensional spline is generated. Subject to some appropriate discretization of the spline, Algorithm 3 can then be applied to solve p(mt |mt−1, x̄t), where x̄t is a vector.

6. Inference Phase

We first present inference for the default HMM case and then consider the IOHMM case. The aim of the inference phase is to find the marginal predictive distribution p(∆yt |∆y1:t−1, Θ). This is found using the forward algorithm (Bishop, 2006).

The likelihood vector, of size K × 1, corresponds to the observation probabilities and together with the transition probabilities fully describes the model. It is defined as,

p(∆yt |mt = k, Θ) ∝ Norm(∆yt; µk, σ²k)    (5)

= (1 / (σk √(2π))) exp( −(1 / (2σ²k)) [∆yt − µk]² )

If the Gaussian assumption of Equation (2) were dropped, then Equation (5) would take a different form. In the case of a non-parametric approach, the density of p(∆Y|M, Θ) would instead be evaluated at this step.


The first step of the prediction is different to the subsequent steps, as it does not yet lie in the recursive chain. The first step starts with a prior over the hidden states,

p(m1 = k |∆y1) ∝ p(m1 = k) p(∆y1 |m1 = k)
              ∝ πk × Norm(∆y1; µk, σ²k)
              = πk × Norm(∆y1; µk, σ²k) / Σk′ πk′ × Norm(∆y1; µk′, σ²k′)

Once initialization has been dealt with, the rest of the process can be decomposed into a recursive formulation. The recursions update the posterior filtering distribution in two steps: firstly, a prediction step propagates the posterior distribution at the previous time step through the target dynamics to form the one-step-ahead prediction distribution; secondly, an update step incorporates the new data through Bayes' rule to form the new filtering distribution. The filtering distribution ωt|t,k is given by,

ωt|t,k ≜ p(mt = k |∆y1:t)
       ∝ p(mt = k |∆y1:t−1) p(∆yt |mt = k)
       ∝ ωt|t−1,k p(∆yt |mt = k)
       = ωt|t−1,k × Norm(∆yt; µk, σ²k) / Σk′ ωt|t−1,k′ × Norm(∆yt; µk′, σ²k′)

The predictive distribution ωt|t−1,k is found by multiplying the filtering distribution by the state transition matrix,

ωt|t−1,k = Σk′ akk′ p(mt−1 = k′ |∆y1:t−1)
         = Σk′ akk′ ωt−1|t−1,k′

The prediction ∆yt is then found by taking the expectation of the marginal predictive density p(∆yt |∆y1:t−1),

∆yt = Σ∆yt ∆yt × p(∆yt |∆y1:t−1)
    = Σ∆yt ∆yt Σk p(mt = k |∆y1:t−1) p(∆yt |mt = k)
    = Σk ωt|t−1,k × µ∗k

where µ∗k is the mean of the discretized Gaussian p(∆yt |mt = k). The full approach is summarized in Algorithm 4.

Inference in the IOHMM case is very similar to the HMM case, though here Θ is conditional on xt. The

Alg. 4 HMM Prediction. Signal = HMM(∆Y, Θ)

1: Update for first step
2: ω1|1,k = πk × Norm(∆y1; µk, σ²k) / Σk′ πk′ × Norm(∆y1; µk′, σ²k′)
3: for t = 2 to T do
4:    Predict
5:    ωt|t−1,k = Σk′ akk′ ωt−1|t−1,k′
6:    ∆yt = Σk ωt|t−1,k × µ∗k
7:
8:    Update
9:    ωt|t,k = ωt|t−1,k × Norm(∆yt; µk, σ²k) / Σk′ ωt|t−1,k′ × Norm(∆yt; µk′, σ²k′)
10:
11:   Output
12:   Signalt = TF(∆yt)    Apply a transfer function
13: end for

IOHMM version of the prediction algorithm is summarized in Algorithm 5.

Alg. 5 IOHMM Prediction. Signal = IOHMM(∆Y, X, Θ)

1: Update for first step
2: ω1|1,k = πk × Norm(∆y1; µk, σ²k) / Σk′ πk′ × Norm(∆y1; µk′, σ²k′)
3: for t = 2 to T do
4:    Θ = F(Θ, xt)    Parameter lookup table
5:    Predict
6:    ωt|t−1,k = Σk′ akk′ ωt−1|t−1,k′
7:    ∆yt = Σk ωt|t−1,k × µ∗k
8:
9:    Update
10:   ωt|t,k = ωt|t−1,k × Norm(∆yt; µk, σ²k) / Σk′ ωt|t−1,k′ × Norm(∆yt; µk′, σ²k′)
11:
12:   Output
13:   Signalt = TF(∆yt)    Apply a transfer function
14: end for
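The forward-filter prediction loop common to Algorithms 4 and 5 can be sketched in Python as follows (HMM case with Gaussian emissions; the transfer-function step and the IOHMM parameter lookup are omitted, and all names are illustrative, not the paper's code):

```python
import numpy as np

def hmm_predict(dy, pi, A, mu, sigma):
    """One-step-ahead return prediction via the forward filter.
    dy: observed returns; pi: K-vector prior; A[i, j] = p(m_t = j | m_{t-1} = i),
    rows summing to 1; mu, sigma: per-state Gaussian emission mean and std."""
    def lik(y):  # emission likelihood vector Norm(y; mu_k, sigma_k^2)
        return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    w = pi * lik(dy[0])
    w /= w.sum()                     # update for first step: omega_{1|1}
    dy_hat = np.zeros_like(dy)
    for t in range(1, len(dy)):
        w_pred = A.T @ w             # predict: omega_{t|t-1}
        dy_hat[t] = w_pred @ mu      # expectation of the predictive density
        w = w_pred * lik(dy[t])
        w /= w.sum()                 # update: omega_{t|t}
    return dy_hat
```

In the IOHMM case, A and the emission parameters would be looked up from the bucket matching xt before each predict step, as in line 4 of Algorithm 5.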

6.0.1. Asynchronous Price Data

In the above form, Algorithm 4 supports data which lies on a discrete time grid. The popularity of such synchronous methodologies in dealing with financial data arises from the computational challenge of dealing with the huge amounts of data generated by the markets. In reality, financial data is asynchronous due to trades clustering together (Dufour and Engle, 2000). Aggregation is the process of moving from asynchronous to synchronous data, and it acts as a zero-one filter. Such a


rough down-sampling procedure means potentially useful high-frequency information is thrown away. The Bayesian approach to this problem is to keep as much information as possible and then let the model decide which parts are and are not needed.

Our model can be altered to deal with asynchronous data by modifying the observation equation in Equation (2), scaling by the observation inter-arrival times,

∆yti = µmti ∆ti + εti,    εti ∼ N(0, σ²mti ∆ti)

where ∆ti = ti − ti−1 is the time between asynchronous observations. Such a representation suffers the drawback that µmti does not change evenly over time, but changes asynchronously according to observation time. The HMM could be further modified to incorporate smooth changes in µmti, for example by using continuous-time HMMs, but this is beyond the scope of this paper.
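Under this modified observation equation, the per-state emission likelihood simply scales its mean and variance by the inter-arrival time. A minimal sketch, with illustrative names (not the paper's implementation):

```python
import numpy as np

def async_loglik(dy, dt, mu_k, var_k):
    """Log-likelihood of asynchronous returns under the scaled model
    dy_i ~ N(mu_k * dt_i, var_k * dt_i) for a single hidden state k.
    dy: returns between observations; dt: inter-arrival times t_i - t_{i-1}."""
    var = var_k * dt
    return -0.5 * (np.log(2.0 * np.pi * var) + (dy - mu_k * dt) ** 2 / var)
```

With ∆ti = 1 everywhere, this reduces to the synchronous Gaussian emission of Equation (2).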

7. Data and Simulation

Data from the CME GLOBEX e-mini S&P500 (ES) future is used, one of the most liquid securities in the world. Tick data is used for the period 01/01/2011 to 31/12/2011, giving 258 days of data. The synchronous form of the algorithm is implemented and the tick data pre-processed by aggregating to periodic spacing on a one minute grid, giving 856 observations per day. Only 0100-1515 Chicago time is considered, Monday-Friday, corresponding to the most liquid trading hours. 1515 Chicago time is when the GLOBEX server closes down for its maintenance break and when the exchange officially defines the end of the trading day. Only the front month contract is used, with contract rolling carried out 12 days before expiry. Only GLOBEX (electronic) trades are considered, with pit (human) trades being excluded. No additional cleaning is carried out beyond that done by the data provider.

As synchronous prices are generated on a close-to-close basis, in simulation the forecast signal is lagged by one period so that look-ahead is not incurred. The strategy return is then equal to the security return multiplied by the lagged signal. Learning is carried out on data from the second half of 2010. For all five momentum strategies, the mean and variance were specified for each state (i.e. the system was not tied). Evaluation of trading strategies is an extensive field, e.g. (Aronson, 2006), and so just the key metrics of Sharpe ratio and returns are presented. The HMM strategies are benchmarked against the long-only case. The pre-cost results of the simulation using ES for the year 2011 are shown in Figure 8.
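The lagged-signal return computation described above can be sketched as follows (illustrative only; transaction costs are ignored, matching the pre-cost results):

```python
import numpy as np

def strategy_returns(signal, sec_returns):
    """Pre-cost strategy returns: lag the forecast signal by one period so the
    position held over period t uses only information available at t-1."""
    lagged = np.roll(signal, 1)
    lagged[0] = 0.0              # flat before the first forecast exists
    return lagged * sec_returns
```

The cumulative percentage return of a strategy is then the cumulative sum of these per-period returns.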

The performance of the default HMM is the worst of the group of models. This is as expected and reflects the fact that A contains no information about the market, as all states are equally likely. The poor PLR performance can also be explained by the pair of negative trend terms (µPLR = [−8.99, −0.0207]), in what was a rising market over the simulation period. While Baum-Welch was able to beat both the default HMM and the long-only case, MCMC was not able to beat the long-only case. There is no reason why Baum-Welch should be able to outperform MCMC; we believe this may reflect the difficulty in using MCMC correctly. Reasons for the poor MCMC performance are now discussed, along with suggestions for improvement.

• Just as EM can fail to find the true global maximum, MCMC can fail to converge to the stationary distribution of the posterior probabilities (Gilks et al., 1996). Common causes for convergence failure are too few draws and poor proposal densities (Kalos and Whitlock, 2008). Ergodic averages of MCMC draws generated by random permutation sampling are used to check convergence. Convergence can be seen to occur in Figure 4, as the entire MCMC chain is roughly stationary for first and second moment parameters. Cowles and Carlin recommend checking for convergence by a combination of strategies, including applying diagnostic procedures to a small number of parallel chains and monitoring auto-correlations and cross-correlations (Cowles and Carlin, 1996). However, we do not believe convergence has failed in this case.

• The mean emission parameters are, for Baum-Welch, µ1:3 = [−0.0198, −0.00573, 0.0183] and, for MCMC, µ1:3 = [−0.122, −0.0117, 0.121]. It can be seen that both have negative/zero/positive trend terms, but that the numerical values for the first and third states are quite different. It may be the case that MCMC has failed to visit all the highly probable regions of the parameter space because of local maxima in the posterior distribution.

• The step of moving from the distributional estimate to the point estimate presents an opportunity for selecting a sub-optimal Θ. Our implementation of MCMC approximates the posterior mode by keeping the sample with the highest posterior probability. It is possible, however, that this approach could end up selecting a local maximum, as opposed to the global maximum, leading to a sub-optimal estimate of Θ. In future work this step could be done by estimating the likelihood of each sample and then taking the maximum.


[Figure 8 comprises three panels: cumulative percentage returns (2011-2012) for each strategy, annualized strategy Sharpe ratios, and the inter-strategy correlation coefficients, reproduced below.]

Inter-Strategy Correlation Coefficients:

                         Default  Baum-Welch  Vol. IOHMM  Seas. IOHMM  MCMC    Long-only
Default HMM               1       −0.68       −0.54       −0.78        −0.27    0.096
Baum-Welch HMM           −0.68     1           0.71        0.91         0.41   −0.017
Volatility Ratio IOHMM   −0.54     0.71        1           0.75         0.25    0.078
Seasonality IOHMM        −0.78     0.91        0.75        1            0.43   −0.12
MCMC HMM                 −0.27     0.41        0.25        0.43         1      −0.039
Long-only                 0.096   −0.017       0.078      −0.12        −0.039   1

Figure 8: Simulation results for the five variations of the HMM intraday momentum trading strategy, with K = 2 or 3, plus the long-only case.

• The choice of prior may be more influential than might be expected (Frühwirth-Schnatter, 2008). While we have followed the recommendations of the literature, it might be that using a more diffuse prior on A would give better results, as it would allow the parameter space to be more thoroughly searched. Failure to search correctly could happen if the existing prior were too strong and overwhelmed the data, but this would be unusual given the amount of data used for learning. In particular, we believe the use of a uniform prior should cause the results of MCMC and EM to converge. Another approach would be to initialize MCMC with Baum-Welch: if the search then moves away from the initial search space, it might be the case that the chains are taking a very long time to mix.

• The proposal density may be poorly chosen, leading to acceptance rates which are too high or too low. In future work we suggest modifying the proposal density to incorporate work on optimal proposal scalings (Neal and Roberts, 2006) and adaptive algorithms (Levine and Casella, 2006) to attempt to find good proposals automatically.

Interestingly, the MCMC and Baum-Welch strategy returns have a reasonably high correlation of 0.41, suggesting that they may be picking up the same market moves, but with MCMC doing so in a less timely (optimal) fashion. Over the trading times considered, ES rose, resulting in a Sharpe ratio of 0.4, approximately equal to its long-run average. The failure of MCMC to beat the long-only case, while Baum-Welch does, again points to the fact that parameter selection has failed for MCMC. All of the HMMs have a low correlation to the long-only case, which is as expected given all the HMMs have a zero-mean signal. Post-cost results reduce the Sharpe ratio of the HMM strategies by approximately 15%.

The IOHMM models are both able to beat the Baum-Welch HMM model's Sharpe ratio by more than 10%, reflecting the fact that the model is able to use the information X, which as known from Section 4 has predictive value. The Sharpe ratio from the IOHMM models is smaller than that from the individual side-information X predictors, because the covariance between the individual predictors and the momentum signal is greater than zero. Even though the Sharpe ratio of the IOHMM signal is less than the Sharpe ratio of the individual X predictors, this is not necessarily a bad thing, as the correlation of the IOHMM returns has decreased relative to the benchmark returns. Institutional investors tend to run "portfolios of strategies" in order to diversify strategy risk. Here, any strategy with a positive expectation and a low correlation to the existing return stream may be worthy of inclusion in that portfolio, even if the performance of the new strategy does not beat the benchmark. The simulation results shown in Figure 8 suggest that this strategy could be worthy of inclusion in such a portfolio.


8. Conclusions and Further Work

8.1. Conclusions

This paper has presented a viable framework for intraday trading of the momentum effect, using both Bayesian sampling and maximum likelihood for parameters, and Bayesian inference for state. The framework is intended to be a practical proposition to augment momentum trading systems based on low-pass filters, which have been in use since the 1970s. A key advantage of our state space formulation is that it does not suffer from the delayed frequency response that digital filters do. It is this time lag which is the biggest cause of predictive failure in digital filter based momentum systems, due to their poor ability to detect reversals in trend at market change points.

As the number of latent momentum states in the market data is never known, it has to be estimated. Three estimation techniques are used: cross-validation, penalized likelihood criteria and MCMC bridge sampling. All three techniques give very similar results, namely that the system consists of 2 or 3 hidden states.

Learning of the system parameters is principally carried out by two methods, namely frequentist Baum-Welch and Bayesian MCMC. Theoretically, MCMC should probably be able to outperform Baum-Welch; however, when carrying out simulations on out-of-sample data, it is found that Baum-Welch gives the best predictive performance. The reasons for this are unclear, but it may be because selecting a good prior is hard for our system, or that the single point estimate of Baum-Welch may be close to the "correct" value, giving superior performance over the Bayesian marginalization of the parameters by MCMC.

Often a trend-following system will want to incorporate external information in addition to the momentum signal, leading to the signal combination problem. An IOHMM is formulated as a possible solution to this problem. In an IOHMM, the transition distribution is conditioned not only on the current state, but also on an observed external signal. Two such external signals are generated, seasonality and the volatility ratio, both with positive Sharpe ratios, and are incorporated into the IOHMM. The performance of the IOHMM can be seen to improve over the HMM, suggesting the IOHMM methodology used is a possible solution to the signal combination problem.

In addition to presenting novel applications of HMMs, this paper provides additional support for the momentum effect being profitable, pre- and post-cost, and adds to the substantial body of evidence on the effect. While much of the existing literature shows that the momentum effect is strongest at the 1-3 month period, we have shown the effect is viable at higher trading frequencies too.

Finally, it is noted that this work is an instance of unsupervised learning under a single basic generative model. As such, it can be linked to other work in the field by noting that when the state variables presented in this model become continuous and Gaussian, the problem can be solved by a Kalman filter, and when continuous and non-Gaussian, the problem can be solved by a particle filter, for example (Christensen et al., 2012).

8.2. Further Work

In future work we would like to explore in detail why learning Θ by MCMC results in poorer performance than by Baum-Welch. In particular, the selection of the prior and the proposal density seem worthy of further investigation, as discussed in Section 7.

In this paper just the best sample of Θ was retained. An improved prediction might be possible by retaining all the samples and averaging their predictions. Fully Bayesian inference uses the distributional estimate of Θ output from MCMC. Denoting the training data as Z and the out-of-sample data as ∆Y, MCMC gives i = 1, . . . , I samples from the posterior distribution, such that Θi ∼ p(Θ|Z, Mk). The predictive density can then be determined by,

p(∆Y|Z, Mk) = ∫ p(∆Y|Z, Θ, Mk) p(Θ|Z, Mk) dΘ
            ≈ (1/I) Σ_{i=1}^{I} p(∆Y|Z, Θi, Mk)
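This Monte Carlo average over posterior samples could be sketched as follows; the likelihood evaluator loglik_fn is a hypothetical stand-in for a run of the forward algorithm under parameters Θi:

```python
import numpy as np

def mc_predictive_density(loglik_fn, theta_samples, dY, Z):
    """Approximate p(dY | Z, M_k) by averaging the per-sample likelihoods
    p(dY | Z, theta_i, M_k) over I posterior draws theta_i."""
    liks = np.array([np.exp(loglik_fn(dY, Z, th)) for th in theta_samples])
    return liks.mean()
```

In practice the likelihoods would be computed in log space throughout to avoid underflow on long return series.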

A closely related approach that could also be investigated is Bayesian Model Averaging (BMA) (Hoeting et al., 1999). While the Bayesian inference just described performs averaging over the distribution of parameters Θ, BMA performs averaging at the level of the model Mk. BMA might be a sensible approach given the similarity of the MCMC marginal likelihoods used for model selection.

Predictive performance may also be improved by removing the model's parametric assumption and changing to use asynchronous data. By using a more natural description of emission noise, the fit of the model could be improved. In the current downsampling of the data, it may be that useful high-frequency information is being thrown away. Using asynchronous data would be the most Bayesian approach, allowing the model to decide what to do with that high-frequency information.

Finally, an interesting area of future research could be to compare the IOHMM methodology with other approaches to signal combination, such as a weighted mean of the Baum-Welch HMM and the individual predictor signals.

9. Acknowledgements

We acknowledge use of the following MATLAB toolboxes: Kevin Murphy's "Probabilistic Modeling Toolkit" (https://github.com/probml/pmtk) and Sylvia Frühwirth-Schnatter's "Bayesf" (www.wu.ac.at/statmath/en/faculty_staff/faculty/sfruehwirthschnatter).

References

Adams, R. P., MacKay, D. J., 2007. Bayesian online changepoint detection. Tech. rep., Cambridge University.

Akaike, H., 1974. A new look at the statistical model identification. Automatic Control, IEEE Transactions on 19 (6), 716–723.

Andersen, T. G., Bollerslev, T., 1997. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance 4 (2-3), 115–158. High Frequency Data in Finance, Part 1.

Andersen, T. G., Bollerslev, T., Meddahi, N., 2011. Realized volatility forecasting and market microstructure noise. Journal of Econometrics 160 (1), 220–234.

Andrieu, C., Doucet, A., 2003. Online expectation-maximization type algorithms for parameter estimation in general state space models. In: Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP'03). 2003 IEEE International Conference on. Vol. 6. IEEE, pp. VI-69.

Andrieu, C., Doucet, A., Holenstein, R., 2010. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (3), 269–342.

Ang, A., Bekaert, G., 2002. Regime switches in interest rates. Journal of Business & Economic Statistics 20 (2), 163–182.

Anon, 6th Jan 2011. Momentum in financial markets: Why Newton was wrong. The Economist.

Ariel, R., 1987. A monthly effect in stock returns. Journal of Financial Economics 18 (1), 161–174.

Aronson, D., 2006. Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals. Wiley Trading.

Attias, H., 1999. Inferring parameters and structure of latent variable models by variational Bayes. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp. 21–30.

Audrino, F., Bühlmann, P., 2009. Splines for financial volatility. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3), 655–670.

Baillie, R., Bollerslev, T., 1991. Intra-day and inter-market volatility in foreign exchange rates. The Review of Economic Studies 58 (3), 565–585.

Bäuerle, N., Rieder, U., 2011. Markov Decision Processes with applications to finance. Springer.

Baum, L. E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics 41 (1), 164–171.

Bengio, S., Bengio, Y., 1996. An EM algorithm for asynchronous input/output hidden Markov models. In: International Conference on Neural Information Processing. Citeseer, pp. 328–334.

Bengio, Y., Frasconi, P., 1995. An input output HMM architecture. Advances in Neural Information Processing Systems, 427–434.

Bengio, Y., et al., 1999. Markovian models for sequential data. Neural Computing Surveys 2, 129–162.

Bernstein, J., 1998. Seasonality: Systems, Strategies and Signals. John Wiley & Sons.

Bhar, R., Hamori, S., 2003. New evidence of linkages among G7 stock markets. Finance Letters 1 (1).

Bhar, R., Hamori, S., 2004. Hidden Markov models: applications to financial economics. Kluwer Academic Pub.

Bishop, C., 2006. Pattern Recognition and Machine Learning. Springer.

Booth, J., Booth, L., 2003. Is presidential cycle in security returns merely a reflection of business conditions? Review of Financial Economics 12 (2), 131–159.

Bowsher, C., Meeks, R., 2008. The dynamics of economic functions: modeling and forecasting the yield curve. Journal of the American Statistical Association 103 (484), 1419–1437.

Branger, N., Kraft, H., Meinerding, C., 2012. Partial information about contagion risk and portfolio choice. Tech. rep., Department of Finance, Goethe University.

Buffington, J., Elliott, R. J., 2002. American options with regime switching. International Journal of Theoretical and Applied Finance 5 (05), 497–514.

Bulla, J., Bulla, I., 2006. Stylized facts of financial time series and hidden semi-Markov models. Computational Statistics & Data Analysis 51 (4), 2192–2209.

Burghardt, G., Liu, L., 2008. How stock price volatility affects stock returns and CTA returns. Tech. rep., Newedge Brokerage.

Cáceres-Hernández, J., Martín-Rodríguez, G., 2007. Heterogeneous seasonal patterns in agricultural data and evolving splines. The IUP Journal of Agricultural Economics 4 (3), 48–65.

Canova, F., 1993. Forecasting time series with common seasonal patterns. Journal of Econometrics 55 (1-2), 173–200.

Cesa-Bianchi, N., Lugosi, G., 2006. Prediction, Learning, and Games. Cambridge University Press.

Chande, T. S., March 1992. Adapting moving averages to market volatility. Technical Analysis of Stocks & Commodities magazine 10 (3), 108–114.

Charlot, P., 2012. Modelling volatility and correlations with a hidden Markov decision tree. Tech. rep., Aix-Marseille University.

Chen, D., Bunn, D., 2014. The forecasting performance of a finite mixture regime-switching model for daily electricity prices. Journal of Forecasting.

Chib, S., 2001. Markov chain Monte Carlo methods: computation and inference. Handbook of Econometrics 5, 3569–3649.

Chopin, N., Pelgrin, F., 2004. Bayesian inference and state number determination for hidden Markov models: An application to the information content of the yield curve about inflation. Journal of Econometrics 123 (2), 327–344.

Christensen, H. L., Murphy, J., Godsill, S. J., 2012. Forecasting high-frequency futures returns using online Langevin dynamics. Selected Topics in Signal Processing, IEEE Journal of 6 (4), 366–380.

Christoffersen, P., Diebold, F., 2003. Financial asset returns, direction-of-change forecasting, and volatility dynamics. Tech. rep., NBER.

Colby, R. W., 2002. The Encyclopedia Of Technical Market Indicators. McGraw-Hill.

Cowles, M. K., Carlin, B. P., 1996. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91 (434), 883–904.

Dablemont, S., 2010. Forecasting of High Frequency Financial Time Series: Concepts, Methods, Algorithms. Lambert Academic Publishing.

Dai, M., Zhang, Q., Zhu, Q. J., 2010. Trend following trading under a regime switching model. SIAM Journal on Financial Mathematics 1 (1), 780–810.

D'Errico, J., 2011. Shape language modelling. www.mathworks.com/matlabcentral/fileexchange/24443-slm-shape-language-modeling.

Dueker, M., Neely, C. J., 2007. Can Markov switching models predict excess foreign exchange returns? Journal of Banking & Finance 31 (2), 279–296.

Dueker, M. J., 1997. Markov switching in GARCH processes and mean-reverting stock-market volatility. Journal of Business & Economic Statistics 15 (1), 26–34.

Dufour, A., Engle, R., 2000. Time and the price impact of a trade. The Journal of Finance 55 (6), 2467–2498.

Durbin, J., Watson, G., 1971. Testing for serial correlation in least squares regression. III. Biometrika 58 (1), 1–19.

Elliott, R., Hinz, J., 2002. Portfolio optimization, hidden Markov models, and technical analysis of P&F charts. International Journal of Theoretical and Applied Finance 5 (04), 385–399.

Elliott, R. J., Wilson, C. A., 2007. The term structure of interest rates in a hidden Markov setting. In: Hidden Markov Models in Finance. Springer, pp. 15–30.

Faith, C., 2007. Way of the Turtle. McGraw-Hill Professional.

Fine, S., Singer, Y., Tishby, N., 1998. The hierarchical hidden Markov model: Analysis and applications. Machine Learning 32 (1), 41–62.

Franses, P., Paap, R., 2000. Modelling day-of-the-week seasonality in the S&P 500 index. Applied Financial Economics 10 (5), 483–488.

Frühwirth-Schnatter, S., 2001. Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association 96 (453), 194–209.

Frühwirth-Schnatter, S., 2006. Finite mixture and Markov switching models. Springer Science+Business Media.

Frühwirth-Schnatter, S., 2008. Comment on article by Rydén. Bayesian Analysis 3 (4), 689–698.

Gales, M., Young, S., 2008. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing (Now Publishers).

Gelman, A., 2011. Induction and deduction in Bayesian data analysis. Rationality, Markets and Morals (RMM) 2, 67–78.

Genasay, R., Dacorogna, M., Muller, U. A., Pictet, O., 2001. An Introduction to High-Frequency Finance. Academic Press.

Gencay, R., Selcuk, F., Whitcher, B., 2002. An Introduction to Wavelets and Other Filtering Methods in Finance and Economics. Elsevier.

Gerald, A., 1999. Technical analysis power tools for active investors. Financial Times Prentice Hall.

Ghahramani, Z., Jordan, M. I., 1997. Factorial hidden Markov models. Machine Learning 29 (2-3), 245–273.

Giampieri, G., Davis, M., Crowder, M., 2005. A hidden Markov model of default interaction. Quantitative Finance 5 (1), 27–34.

Gilks, W. R., Richardson, S., Spiegelhalter, D. J., 1996. Markov chain Monte Carlo in practice. Vol. 2. Chapman & Hall/CRC.

Giot, P., 2005. Relationships between implied volatility indexes and stock index returns. The Journal of Portfolio Management 31 (3), 92–100.

Green, P. J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (4), 711–732.

Grégoir, S., Lenglart, F., 2000. Measuring the probability of a business cycle turning point by using a multivariate qualitative hidden Markov model. Journal of Forecasting 19 (2), 81–102.

Hamilton, J. D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the Econometric Society, 357–384.

Harvey, A., 1991. Forecasting, structural time series models and the Kalman filter. Cambridge University Press.

Hassan, M. R., Nath, B., 2005. Stock market forecasting using hidden Markov model: a new approach. In: Intelligent Systems Design and Applications, 2005. ISDA'05. Proceedings. 5th International Conference on. IEEE, pp. 192–196.

Hibbert, A. M., Daigler, R. T., Dupoyet, B., 2008. A behavioral explanation for the negative asymmetric return-volatility relation. Journal of Banking & Finance 32 (10), 2254–2266.

Hirsch, Y., 1987. Don't Sell Stocks on Monday: An Almanac for Traders, Brokers and Stock Market Investors. Penguin.

Hoeting, J., Madigan, D., Raftery, A., Volinsky, C., 1999. Bayesian model averaging: A tutorial. Statistical Science 14 (4), 382–401.

Hong, H., Stein, J. C., 1999. A unified theory of underreaction, momentum trading, and overreaction in asset markets. The Journal of Finance 54 (6), 2143–2184.

Huptas, R., 2009. Intraday seasonality in analysis of UHF financial data: Models and their empirical verification. Dynamic Econometric Models 9, 1–10.

investopedia.com, 2016. The volatility ratio. www.investopedia.com/terms/v/volatility-ratio.asp.

Jegadeesh, N., Titman, S., 1999. Profitability of momentum strategies: An evaluation of alternative explanations. Tech. rep., National Bureau of Economic Research.

Johnson, T., 2002. Rational momentum effects. Journal of Finance, 585–608.

Jordan, M. I., 2009. Are you a Bayesian or a frequentist? Summer School Lecture, Cambridge.

JPM, 1996. JPM RiskMetrics - technical document. Tech. rep., JPM. URL www.riskmetrics.com/system/files/private/td4e.pdf

Juang, B., Levinson, S., Sondhi, M., 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains (corresp.). Information Theory, IEEE Transactions on 32 (2), 307–309.

Kakade, S., Teh, Y. W., Roweis, S. T., 2002. An alternate objective function for Markovian fields. In: Machine Learning International Workshop. pp. 275–282.

Kalos, M. H., Whitlock, P. A., 2008. Monte Carlo methods. Wiley-VCH.

Kass, R. E., Raftery, A. E., 1995. Bayes factors. Journal of the American Statistical Association 90 (430), 773–795.

Kim, A., Shelton, C., Poggio, T., 2002. Modeling stock order flows and learning market-making from data. Tech. rep., Massachusetts Institute of Technology.

Kim, C.-J., 1993. Unobserved-component time series models with Markov-switching heteroscedasticity: Changes in regime and the link between inflation rates and inflation uncertainty. Journal of Business & Economic Statistics 11 (3), 341–349.

Kingsbury, N., Rayner, P., 1971. Digital filtering using logarithmic arithmetic. Electronics Letters 7 (2), 56–58.

Kinlay, J., 2006. Predicting market direction. Tech. rep., Investment Analytics LLP.

Kohavi, R., et al., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International Joint Conference on Artificial Intelligence. Vol. 14. Lawrence Erlbaum Associates Ltd, pp. 1137–1145.

Lakonishok, J., Smidt, S., 1988. Are seasonal anomalies real? A ninety-year perspective. Review of Financial Studies 1 (4), 403–425.

Lesmond, D., Schill, M., Zhou, C., 2004. The illusory nature of momentum profits. Journal of Financial Economics 71 (2), 349–380.

Levine, R. A., Casella, G., 2006. Optimizing random scan Gibbs samplers. Journal of Multivariate Analysis 97 (10), 2071–2100.

Liesenfeld, R., 2001. A generalized bivariate mixture model for stock price volatility and trading volume. Journal of Econometrics 104 (1), 141–178.

Lo, A., MacKinlay, A., 2001. A non-random walk down Wall Street. Princeton University Press.

Lo, A., Mamaysky, H., Wang, J., 2000. Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. Journal of Finance, 1705–1765.

Lovell, M. C., 1963. Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association 58 (304), 993–1010.

Mamon, R., Elliott, R., 2007. Hidden Markov models in finance. Vol. 104. Springer Verlag.

Martin-Rodriguez, G., Caceres-Hernandez, J., 2005. Modelling the hourly Spanish electricity demand. Economic Modelling 22 (3), 551–569.

Martín Rodríguez, G., Cáceres Hernández, J., 2010. Splines and the proportion of the seasonal period as a season index. Economic Modelling 27 (1), 83–88.

MATLAB, 2009. Spline Toolbox User's Guide 3.

McCulloch, R. E., Tsay, R. S., 1994. Statistical analysis of economic time series via Markov switching models. Journal of Time Series Analysis 15 (5), 523–539.

McGrory, C. A., Titterington, D., 2009. Variational Bayesian analysis for hidden Markov models. Australian & New Zealand Journal of Statistics 51 (2), 227–244.

Meligkotsidou, L., Dellaportas, P., 2011. Forecasting with non-homogeneous hidden Markov models. Statistics and Computing 21 (3), 439–449.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E., 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 1087.

Murphy, J., 1987. The seasonality of risk and return on agricultural futures positions. American Journal of Agricultural Economics 69 (3), 639–646.

Neal, P., Roberts, G., 2006. Optimal scaling for partially updating MCMC algorithms. The Annals of Applied Probability 16 (2), 475–515.

Oh, E. S., May 2011. Bayesian particle filtering for prediction of financial time series. Master's thesis, Cambridge University.

Pafka, S., Kondor, I., 2001. Evaluating the RiskMetrics methodology in measuring volatility and value-at-risk in financial markets. Physica A: Statistical Mechanics and its Applications 299, 305–310.

Patton, A. J., 2010. Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics 160 (1), 246–256.

Peiro, A., 1994. Daily seasonality in stock returns: Further international evidence. Economics Letters 45 (2), 227–232.

Pesaran, B., Pesaran, M. H., 2007. Volatilities and conditional correlations in futures markets with a multivariate t distribution. CESifo Working Paper Series 2056, CESifo Group Munich.

Pesaran, M., Schleicher, C., Zaffaroni, P., 2009. Model averaging in risk management with an application to futures markets. Journal of Empirical Finance 16 (2), 280–305.

Pidan, D., El-yaniv, R., 2011. Selective prediction of financial trends with hidden Markov models. In: Advances in Neural Information Processing Systems. pp. 855–863.

Poon, S. H., Granger, C. W., 2003. Forecasting volatility in financial markets: A review. Journal of Economic Literature XLI, 478–539.

Punskaya, E., Andrieu, C., Doucet, A., Fitzgerald, W., 2002. Bayesian curve fitting using MCMC with applications to signal segmentation. Signal Processing, IEEE Transactions on 50 (3), 747–758.

quantshare.com, 2016. The standard deviation ratio. URL www.quantshare.com/item-1039-standard-deviation-ratio

Reinsch, C., 1967. Smoothing by spline functions. Numerische Mathematik 10, 177–183.

Robb, A. L., 1980. Accounting for seasonality with spline functions. The Review of Economics and Statistics 62 (2), 321–323.

Robert, C. P., Marin, J.-M., 2008. On some difficulties with a posterior probability approximation technique. Bayesian Analysis 3 (2), 427–441.

Robert, C. P., Ryden, T., Titterington, D. M., 2000. Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (1), 57–75.

Roman, D., Mitra, G., Spagnolo, N., 2010. Hidden Markov models for financial optimization problems. IMA Journal of Management Mathematics 21 (2), 111–129.

Rossi, A., Gallo, G. M., 2006. Volatility estimation via hidden Markov models. Journal of Empirical Finance 13 (2), 203–230.

Rydén, T., 2008. EM versus Markov chain Monte Carlo for estimation of hidden Markov models: A computational perspective. Bayesian Analysis 3 (4), 659–688.

Rydén, T., Teräsvirta, T., Åsbrink, S., 1998. Stylized facts of daily return series and the hidden Markov model. Journal of Applied Econometrics 13 (3), 217–244.

Satchell, S., Acar, E., 2002. Advanced Trading Rules. Butterworth-Heinemann.

Schwager, J. D., 1995a. Fundamental Analysis (Schwager on Futures). John Wiley & Sons.

Schwager, J. D., 1995b. Technical Analysis (Schwager on Futures). John Wiley & Sons.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.

Scott, S. L., 2002. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association 97 (457), 337–351.

Shephard, N., 1994. Partial non-Gaussian state space. Biometrika 81 (1), 115–131.

Shi, S., Weigend, A. S., 1997. Taking time seriously: Hidden Markov experts applied to financial engineering. In: Computational Intelligence for Financial Engineering (CIFEr), Proceedings of the IEEE/IAFE 1997. IEEE, pp. 244–252.

Shiller, R., 2005. Irrational exuberance. Princeton University Press.

Taylor, J., 2010. Exponentially weighted methods for forecasting intraday time series with multiple seasonal cycles. International Journal of Forecasting 26 (4), 627–646.

Taylor, S. J., 2007. Asset Price Dynamics, Volatility, and Prediction. Princeton University Press.

Teitelbaum, R., January 2008. The code breaker. Bloomberg Magazine, 32–48. An interview with the CEO of Renaissance Technologies.

Thomas, L. C., Allen, D. E., Morkel-Kingsbury, N., 2002. A hidden Markov chain model for the term structure of bond credit risk spreads. International Review of Financial Analysis 11 (3), 311–329.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M. P., 2009. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6 (31), 187–202.

Vuong, Q. H., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, 307–333.

Wang, H., Zhang, X., Zou, G., 2009. Frequentist model averaging estimation: a review. Journal of Systems Science and Complexity 22, 732–748.

Wilmott, P., 2006. Paul Wilmott on Quantitative Finance. Wiley.

Wisebourt, S., 2011. Hierarchical hidden Markov model of high-frequency market regimes using trade price and limit order book information. Master's thesis, University of Waterloo.

Yu, S.-Z., 2010. Hidden semi-Markov models. Artificial Intelligence 174 (2), 215–243.

Zaffaroni, P., 2008. Large-scale volatility models: theoretical properties of professionals' practice. Journal of Time Series Analysis 29 (3), 581–599.
