
© 2011 Royal Statistical Society 1369–7412/11/73037

J. R. Statist. Soc. B (2011) 73, Part 1, pp. 37–57

Bayesian non-parametric hidden Markov models with applications in genomics

C. Yau,

University of Oxford, UK

O. Papaspiliopoulos,

Universitat Pompeu Fabra, Barcelona, Spain

G. O. Roberts

Warwick University, Coventry, UK

and C. Holmes

University of Oxford, UK

[Received October 2008. Final revision May 2010]

Summary. We propose a flexible non-parametric specification of the emission distribution in hidden Markov models and we introduce a novel methodology for carrying out the computations. Whereas current approaches use a finite mixture model, we argue in favour of an infinite mixture model given by a mixture of Dirichlet processes. The computational framework is based on auxiliary variable representations of the Dirichlet process and consists of a forward–backward Gibbs sampling algorithm of similar complexity to that used in the analysis of parametric hidden Markov models. The algorithm involves analytic marginalizations of latent variables to improve the mixing, facilitated by exchangeability properties of the Dirichlet process that we uncover in the paper. A by-product of this work is an efficient Gibbs sampler for learning Dirichlet process hierarchical models. We test the Monte Carlo algorithm proposed against a wide variety of alternatives and find significant advantages. We also investigate by simulations the sensitivity of the proposed model to prior specification and data-generating mechanisms. We apply our methodology to the analysis of genomic copy number variation. Analysing various real data sets we find significantly more accurate inference compared with state of the art hidden Markov models which use finite mixture emission distributions.

Keywords: Block Gibbs sampler; Copy number variation; Local and global clustering; Partial exchangeability; Partition models; Retrospective sampling

1. Introduction

Hidden Markov models (HMMs) are arguably the most popular statistical tool for extracting information from sequential data throughout applied science; indicative application areas include signal processing and speech recognition (Rabiner, 1989; Fox et al., 2009), natural language processing (Manning and Schuetze, 1999), information retrieval (Teh et al., 2006), economics (Hamilton, 1989; Kim, 1994), molecular dynamics (Horenko and Schütte, 2008) and biochemistry (McKinney et al., 2006; Gopich and Szabo, 2009).

Address for correspondence: O. Papaspiliopoulos, Department of Economics, Universitat Pompeu Fabra, Ramon Trias Fargas 25–27, Barcelona 8005, Spain. E-mail: [email protected]


Directly related to the content of this paper is the recent application of HMMs in genomics and the analysis of copy number variation in mammalian genomes; see Sections 1.1 and 5 for details and references.

The basic HMM for a sequence of data y_1, ..., y_T introduces a hidden Markov chain s_1, ..., s_T with discrete state space S = {1, ..., n} and assumes that the observed data are conditionally independent given the hidden states. The conditional distribution of y_t given s_t, which is often called the emission distribution, is given by some parametric distribution whose parameters depend on s_t. A determining factor for the popularity of HMMs is the accompanying computational machinery for carrying out the statistical inference efficiently. This computational methodology originates from Baum (1966); it is intrinsically related to dynamic programming and it is broadly known as the forward–backward algorithm (see Rabiner (1989) and Cappé et al. (2005) for details). Bayesian inference can be performed by using the Gibbs sampler. The forward–backward algorithm is used to simulate exactly and in a single block the hidden Markov chain according to its conditional distribution at O(Tn²) computational cost; see for example Scott (2002).
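As an illustration of this forward–backward Gibbs step, the following is a minimal sketch (not the authors' code) of forward filtering–backward sampling for a basic HMM with Gaussian emissions; the two-state example, parameter values and function names are illustrative assumptions.

```python
import numpy as np

def ffbs(y, P, pi0, means, sd, rng):
    """Forward filtering-backward sampling for a basic HMM with Gaussian
    emissions; the cost is O(T n^2) for T observations and n states."""
    T, n = len(y), len(means)
    # Emission likelihoods f(y_t | s_t = i); the 1/sqrt(2*pi) constant is dropped
    # because the filter is renormalized at every step.
    lik = np.exp(-0.5 * ((y[:, None] - means[None, :]) / sd) ** 2) / sd
    # Forward filter: alpha[t, i] proportional to p(s_t = i | y_1..t)
    alpha = np.empty((T, n))
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * lik[t]
        alpha[t] /= alpha[t].sum()
    # Backward sampling of the whole path s_1..T in a single block
    s = np.empty(T, dtype=int)
    s[T - 1] = rng.choice(n, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        probs = alpha[t] * P[:, s[t + 1]]
        s[t] = rng.choice(n, p=probs / probs.sum())
    return s

rng = np.random.default_rng(1)
P = np.array([[0.95, 0.05], [0.05, 0.95]])
y = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 200)])
s = ffbs(y, P, np.array([0.5, 0.5]), np.array([0.0, 1.0]), 1.0, rng)
```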

The basic HMM has well-known limitations, whose relative importance depends on the application at hand. The Markov switching dynamic linear models in discrete (e.g. Kim (1994)) and continuous time (e.g. Horenko and Schütte (2008)) introduce serial correlation in the data, and the hierarchical Dirichlet process HMM (Teh et al., 2006) and the sticky hierarchical Dirichlet process HMM (Fox et al., 2009) remove the necessity to decide a priori (an upper bound for) the number of states. Another point which has been widely discussed in the literature (see for example Section 1.1 below, sections IV.D and 6 of Rabiner (1989) and Fox et al. (2009) and references therein) is the sensitivity of the inference to the specification of the emission distribution. The emission distribution often involves heavy tails, skewness or multiple modes. Its misspecification leads to errors both in the segmentation of the data into states and in the out-of-sample prediction. We give evidence of this in the analysis of copy number variation, where the misspecification of the emission distribution leads to a large number of false positive copy number variants being flagged. Existing approaches for resolving this issue specify a finite mixture model for the emission distribution.

This paper proposes a novel methodology for flexible modelling of the emission distribution within the HMM framework. We specify the emission distribution as an infinite mixture model where the mixing is induced by the Dirichlet process; such mixtures are known as mixtures of Dirichlet processes (MDPs) (Antoniak, 1974; Lo, 1984; Escobar, 1988). (The term 'non-parametric' is understood to mean that the model involves an infinite number of parameters.) We call the model proposed the MDPHMM. MDP models have experienced tremendous success over the last 20 years in a variety of statistical applications; see for example the collection of articles in Hjort et al. (2010) for a recent overview, references and current trends in the field. MDP models are a popular alternative to finite mixture models because an upper bound for the number of components does not need to be specified a priori. Additionally, inference for this number can be accomplished via Gibbs sampling methods, whereas this task typically requires carefully tuned reversible jump algorithms in finite mixture models; see for example Green and Richardson (2001) for a discussion.

We introduce a complementary computational methodology for learning the MDPHMM, which has similar complexity and efficiency to that of the Gibbs sampler for the basic HMM. As typically done in mixture modelling, we introduce a further latent stochastic process, k_t, t = 1, ..., T, which determines which component of the emission mixture distribution y_t is allocated to. We design an algorithm which

(a) updates jointly in a single block the hidden Markov chain and


(b) does so without conditioning on a particular assignment of the data into the mixture components of the emission distribution, i.e. k_1, ..., k_T are integrated out when updating s_1, ..., s_T.

We show that the computational cost of the step which updates the hidden Markov chain is O{T log(T) n²}, a minor overhead compared with the simple HMM. A by-product of our methodology is a very efficient Markov chain Monte Carlo algorithm for learning general MDP models. This was originally described in Papaspiliopoulos (2008), and it is detailed in Section 3. The algorithm is already being successfully used in different applications; see for example Dunson (2009) and Pati and Dunson (2009).

We test our model and our algorithms by using a thorough simulation study. We show that the algorithm proposed outperforms all natural alternative strategies. We find that inference with the model proposed is robust to the prior specification of the parameters controlling the Markov chain dynamics, especially when T is large (which is the scenario at which our methods are particularly targeted). We also demonstrate robustness to the form of the true emission distribution.

Our work is driven by the analysis of genomic copy number variation in mammalian genomes. This problem is detailed in Section 1.1 below and it is revisited in Section 5, where our methods are shown to outperform the state of the art finite mixture HMMs that are currently used in the literature. The rest of the paper is organized as follows. The MDPHMM is defined in Section 2. Section 3 develops the computational methodology and discusses alternative related work. Prior sensitivity and algorithmic performance are scrutinized in Section 4. The genomic copy number variation analysis is presented in Section 5. Section 6 concludes, discussing various extensions and connections of our work to previous and concurrent methods that combine Dirichlet processes and HMMs to infer the number of states.

1.1. Motivating application
Copy number variants are regions of the genome where stretches of deoxyribonucleic acid (DNA) are found in duplication or deletion. In diploid organisms, such as humans, somatic cells normally contain two copies of each gene, one inherited from each parent. However, abnormalities during the process of DNA replication and synthesis can lead to the loss or gain of DNA fragments, leading to variable gene copy numbers that may initiate or promote disease conditions. For example, the loss or gain of a number of tumour suppressor genes and oncogenes is known to promote the initiation and growth of cancers.

Such studies have been enabled by microarray technology, which allows copy number variation across the genome to be routinely profiled by using array comparative genomic hybridization (CGH) methods. These technologies allow the DNA copy number to be measured at millions of genomic locations simultaneously, allowing copy number variants to be mapped with high resolution. Copy number variation discovery, as a statistical problem, essentially amounts to detecting segmental changes in the mean levels of the DNA hybridization intensity along the genome (Fig. 1). However, these measurements are extremely sensitive to variations in DNA quality, DNA quantity and instrumental noise, and this has led to the development of various statistical methods for data analysis.

One popular approach for tackling this problem utilizes HMMs where the hidden states correspond to the unobserved copy number states at each probe location, and the observed data are the hybridization intensity measurements from the microarrays (see Shah et al. (2006), Marioni et al. (2006), Colella et al. (2007), Stjernqvist et al. (2007) and Andersson et al. (2008)). Typically the distributions of the observations are assumed to be Gaussian or, to add robustness, a mixture of two Gaussian distributions or a Gaussian and uniform distribution, where the second mixture component acts to capture outliers, such as in Shah et al. (2006) and Colella et al. (2007). However, many data sets contain non-Gaussian noise distributions on the measurements, as pointed out in Hu et al. (2007), particularly if the experimental conditions are not ideal. As a consequence, existing methods can be extremely sensitive to outliers, skewness or heavy tails in the actual noise process that might lead to large numbers of false copy number variants being detected. As genomic technologies evolve from being pure research tools to diagnostic devices, more robust techniques are required. Bayesian non-parametrics offer an attractive solution to these problems and lead us to investigate the models that we describe here.

Fig. 1. Example array CGH data set (log-ratio against probe number): this data set shows a copy number gain (duplication) and a copy number loss (deletion), which are characterized by relative upward and downward shifts in the log-intensity ratio respectively; the probe number here indicates the chromosomal location

2. Mixture of Dirichlet processes hidden Markov model formulation

The observed data will be a realization of a stochastic process {y_t}_{t=1}^T. The marginal distribution and the dependence structure in the process are specified hierarchically and semiparametrically. Let f(y | m, z) be a density with parameters m and z; let {s_t}_{t=1}^T be a Markov chain with discrete state space S = {1, ..., n}, transition matrix Π = [π_{i,j}]_{i,j∈S} and initial distribution π_0, and let H_θ be a distribution that is indexed by some parameters θ, and α > 0. Then, the model is specified hierarchically as follows:

\[
\begin{aligned}
y_t \mid s_t, k_t, m, z &\sim f(y_t \mid m_{s_t}, z_{k_t}), \qquad t = 1, \ldots, T,\\
P(s_t = j \mid s_{t-1} = i) &= \pi_{i,j}, \qquad i, j \in S,\\
p(k_t, u_t \mid w) &\propto \sum_{j : w_j > u_t} \delta_j(k_t) = \sum_{j=1}^{\infty} 1[u_t < w_j]\, \delta_j(k_t),\\
z_j \mid \theta &\sim H_\theta, \qquad j \geq 1,\\
w_1 = v_1, \qquad w_j &= v_j \prod_{i=1}^{j-1} (1 - v_i), \qquad j \geq 2,\\
v_j &\sim \mathrm{Be}(1, \alpha), \qquad j \geq 1,
\end{aligned}
\tag{1}
\]

where m = {m_j, j ∈ S}, s = (s_1, ..., s_T), y = (y_1, ..., y_T), u = (u_1, ..., u_T), k = (k_1, ..., k_T), w = (w_1, w_2, ...), v = (v_1, v_2, ...), z = (z_1, z_2, ...) and δ_x(·) denotes the Dirac delta measure centred at x. The model involves structural changes in time that are induced by the HMM, {m_{s_t}}_{t=1}^T. It also uses a flexible emission distribution specified as a mixture model in which f(y | m, z) is mixed with respect to P(dz). The last four lines in the hierarchy identify P with the Dirichlet process prior with base measure H_θ and concentration parameter α. Such mixture models are known as MDPs. We complete the specification of the model with priors for the unknown hyperparameters: the HMM labels m_i, i = 1, ..., n, the transition matrix Π and the Dirichlet concentration parameter α. The parameters of the base measure are typically given data-driven fixed values. Details are provided in Sections 4.2 and 5.

The representation of the Dirichlet process prior in terms of only k, v and z (with u marginalized out) is well known and has been used in hierarchical modelling inter alia by Ishwaran and James (2001) and Papaspiliopoulos and Roberts (2008). According to this specification,

\[
p(k_t \mid w) = \sum_{j=1}^{\infty} w_j\, \delta_j(k_t). \tag{2}
\]

Following Walker (2007) we augment the parameter space with further auxiliary variables u and specify a joint distribution of (k_t, u_t) in model (1). Note that conditionally on w the pairs (k_t, u_t) are independent over t and model (2) is a marginal of model (1). Model (1) corresponds to a standard representation of an arbitrary random variable k with density p as a marginal of a pair (k, u) uniformly distributed under the curve p. When p is unimodal the representation coincides with Khinchin's theorem (see section 6.2 of Devroye (1986)). We choose to represent the mixture distribution by (k, v, z, u) for computational reasons that are described in Section 3.

The model involves two levels of clustering for y: a temporally persisting (local) clustering that is induced by the HMM and represented by the labels of s, and a global clustering that is induced by the Dirichlet process and represented by the labels of k. A specific instance of the model is obtained when y_t ∈ ℝ, f is the Gaussian density with mean m + μ and variance σ², z = (μ, σ²) ∈ ℝ × ℝ⁺ and H_θ is an N(0, γ) × IG(a, b) product measure with hyperparameters θ = (γ, a, b). Then, according to this model, E(y_t | s, m) = m_{s_t} is a slowly varying random function driven by the HMM and the distribution of the residuals y_t − m_{s_t} is a Gaussian MDP.
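As a concrete illustration of hierarchy (1) in its Gaussian instance, here is a minimal sketch (not the authors' code) that simulates data from the MDPHMM by using a finite truncation of the stick-breaking construction; the truncation level J, the renormalization of the truncated weights and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mdphmm(T=1000, n=2, alpha=1.0, gamma=1.0, a=2.0, b=0.5,
                    m=(0.0, 1.0), rho=0.05, J=100):
    """Draw one realization from the Gaussian instance of model (1), using a
    finite truncation (J components) of the stick-breaking prior."""
    # Stick-breaking weights w and component parameters z_j = (mu_j, sigma2_j)
    v = rng.beta(1.0, alpha, size=J)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    mu = rng.normal(0.0, np.sqrt(gamma), size=J)
    sigma2 = 1.0 / rng.gamma(a, 1.0 / b, size=J)   # inverse-gamma variances
    # Two-state hidden Markov chain with symmetric switching probability rho
    P = np.array([[1 - rho, rho], [rho, 1 - rho]])
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(n)
    for t in range(1, T):
        s[t] = rng.choice(n, p=P[s[t - 1]])
    # Global allocations k_t and observations y_t (weights renormalized after truncation)
    k = rng.choice(J, size=T, p=w / w.sum())
    y = np.array(m)[s] + mu[k] + rng.normal(0.0, np.sqrt(sigma2[k]))
    return y, s, k

y, s, k = simulate_mdphmm()
```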

3. Simulation methodology: block Gibbs sampling for mixture of Dirichlet process hidden Markov model

Our target is the exploration of the posterior distribution of (s, m, u, v, z, k, Π, α) by Markov chain Monte Carlo sampling. w is simply a function of v; hence it can be recovered from the algorithmic output. We want the computational methodology for the MDPHMM to meet three principal requirements. First, the algorithmic time should scale well with T (i.e. better than O(T²)). Second, the algorithm should not become trapped around minor modes which correspond to confounding of local with global clusters. Informally, we would like to make moves in the high probability region of HMM configurations and then use the residuals to fit the MDP component. And, third, we would like the algorithm to require as little human intervention as possible (hence avoid having to tune algorithmic parameters). Such a simulation method would enable the analysis of massive data sets from array CGH and single-nucleotide polymorphism genotyping platforms where it is now routine to perform microarray experiments that can generate millions of observations per sample with populations involving many thousands of individuals. The following algorithm meets the three requirements.

We propose the following block Gibbs sampler which iteratively simulates from the conditional distributions, where variables are removed from the conditioning set either by explicit integration or by conditional independence. The steps which involve integration are steps 1 and 7 and are discussed in what follows:

step 1, [s|y, m, u, v, z, Π];


step 2, [k|y, s, m, u, v, z];
step 3, [m|y, s, k, z];
step 4, [Π|s];
step 5, [v, u|k, α];
step 6, [z|y, k, s, m];
step 7, [α|k].

Steps 1, 3 and 4 can be seen as an update of the HMM component of the model, whereas steps 2 and 5–7 constitute an update of the MDP component. There are two key ideas in this algorithm. First, steps 1 and 2 correspond to a joint update of s and k, by first drawing s from [s|y, m, u, v, z, Π] and subsequently k from [k|y, s, m, u, v, z]. Hence, the global allocation variables k are integrated out in the update of the local allocation variables s. Second, step 5 updates jointly v and u. These two innovations result in a highly efficient algorithm both in terms of decoupling the dependence between local and global allocation variables and in terms of simulating MDP models. We detail these steps below.

3.1. Hidden Markov model update
We simulate exactly from [s|y, m, u, v, z, Π] by using a standard forward filtering–backward sampling algorithm (see for example Cappé et al. (2005)). This is facilitated by the following key result, which is proved in Appendix A. The result is a general property of MDP models (and in fact of more general mixtures of stick breaking processes).

Proposition 1. The y_t are conditionally independent given s, m, u, v, z, with conditional density

\[
p_t(y_t \mid m, s_t, u_t, w) = \sum_{j : w_j > u_t} f(y_t \mid m_{s_t}, z_j). \tag{3}
\]

Therefore the conditional distribution [s|y, m, u, v, z, Π] is the posterior distribution of a hidden Markov chain s_t, 1 ≤ t ≤ T, with state space S, transition matrix Π, initial distribution π_0 and emission distributions (3).

The number of terms that are involved in the sum above is finite almost surely, since there will be a finite number of mixture components with weights w_j > u*(T) := inf_{1≤t≤T}(u_t). In particular, Walker (2007) observed that j > j*(T) is a sufficient condition which ensures that w_j < u_t, where j*(T) := max_{1≤t≤T}{j*_t}, and j*_t is the smallest l such that Σ_{j=1}^{l} w_j > 1 − u_t. To see this, note that Σ_{k≥j} w_k < u implies that w_k < u for all k ≥ j. Hence, the number of terms that are used in the likelihood evaluations is bounded above by j*(T). Additionally, note that we only need partial information about the random measure (z, v) to carry out this step: the values of (v_j, z_j), j ≤ j*(T), are sufficient to carry out the forward–backward algorithm.

However, j*(T) will typically grow with T. Under the prior distribution, u*(T) ↓ 0 almost surely as T → ∞. Standard properties of the Dirichlet process prior imply that j*(T) = O{log(T)} (see for example Muliere and Tardella (1998)). This relates to the fact that the number of new components that are generated by the Dirichlet process grows logarithmically with the size of the data (Antoniak, 1974). In contrast, it is well known that the computational cost of the forward filtering–backward sampling algorithm, when the computational cost of evaluating the likelihood is fixed, is O(Tn²). Hence, (a priori) we expect an overall computational cost O{T log(T) n²} for the exact simulation of the hidden Markov chain in this non-parametric set-up. In fact, typical values of the number of components that are involved in density (3) are reported for the analysis of genomic data in Section 5. Steps 3 and 4 are carried out as in the standard Gibbs sampler for basic HMMs.
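The following sketch (illustrative, not the authors' code) shows how the emission density (3) can be evaluated for all t and all HMM states: for each t only the components with w_j > u_t contribute, and only the finitely many components with w_j > u*(T) ever need to be touched. The Gaussian kernel and all inputs are assumptions for the example.

```python
import numpy as np

def emission_matrix(y, u, w, mu, sigma2, m):
    """Evaluate p_t(y_t | m, s_t = i, u_t, w) of equation (3) for every t and
    every HMM state i, using a Gaussian kernel f(y | m_i + mu_j, sigma2_j)."""
    T, n = len(y), len(m)
    lik = np.zeros((T, n))
    needed = np.nonzero(w > u.min())[0]        # only these components can appear
    for j in needed:
        active = (w[j] > u)[:, None]           # which observations include component j
        dens = (np.exp(-0.5 * (y[:, None] - m[None, :] - mu[j]) ** 2 / sigma2[j])
                / np.sqrt(2 * np.pi * sigma2[j]))
        lik += active * dens
    return lik                                  # rows feed the forward-backward recursions

rng = np.random.default_rng(2)
w = np.array([0.5, 0.3, 0.2])
u = rng.uniform(0.0, w[rng.choice(3, size=50, p=w)])
lik = emission_matrix(rng.normal(size=50), u, w, mu=np.array([-1.0, 0.0, 1.0]),
                      sigma2=np.ones(3), m=np.array([0.0, 1.0]))
```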


3.2. Mixture of Dirichlet processes update
Conditionally on a realization of s and m, we have an MDP model. Therefore, the algorithm that is described in steps 2 and 5–7 can be seen more generally as a block Gibbs sampler for posterior simulation in an MDP model. The nomenclature in the literature classifies it as a conditional Gibbs sampler (Ishwaran and James, 2001; Papaspiliopoulos and Roberts, 2008), since the random measure (z, v) is imputed and explicitly updated. The algorithm that we propose is a synthesis of the retrospective Markov chain Monte Carlo algorithm of Papaspiliopoulos and Roberts (2008) and the slice Gibbs sampler of Walker (2007), and it was initially described in Papaspiliopoulos (2008). The synthesis yields an algorithm which has advantages over both and it is particularly appropriate in the context of the MDPHMM.

The retrospective algorithm works with the parameterization of the MDP model in terms of (k, v, z) (see the discussion in Section 2). Then, it proceeds by Gibbs sampling of k, v and z according to their full conditional distributions. Simulation from the conditional distributions of v and z is particularly easy. Specifically, v consists of conditionally independent elements with

\[
v_j \mid k, \alpha \sim \mathrm{Be}\Bigl(n_j + 1,\; T - \sum_{l=1}^{j} n_l + \alpha\Bigr) \qquad \text{for all } j = 1, 2, \ldots, \tag{4}
\]

where n_j = #{t : k_t = j}. Similarly, z consists of conditionally independent elements with

\[
z_j \mid y, s, m, k \sim
\begin{cases}
\displaystyle \prod_{t : k_t = j} f(y_t \mid m_{s_t}, z_j)\, \pi(z_j \mid \theta) & \text{for all } j : n_j > 0,\\[1ex]
H_\theta & \text{otherwise.}
\end{cases}
\tag{5}
\]

In this expression π(z | θ) denotes the Lebesgue density of H_θ. In contrast, simulation from the conditional distribution of k is more involved. It follows directly from model (2) that, conditionally on the rest, k consists of conditionally independent elements with

\[
p(k_t \mid y, m, s, v, z) \propto \sum_{j=1}^{\infty} w_j\, f(y_t \mid m_{s_t}, z_j)\, \delta_j(k_t), \tag{6}
\]

which has an intractable normalizing constant Σ_{j=1}^{∞} w_j f(y_t | m_{s_t}, z_j). Therefore, direct simulation from this distribution is difficult. Papaspiliopoulos and Roberts (2008) devised a Metropolis–Hastings scheme which resembles an independence sampler and accepts with probability 1 most of the moves proposed.

The slice Gibbs sampler of Walker (2007) parameterizes in terms of (k, u, v, z). Hence, the posterior distribution that is sampled by the retrospective algorithm is a marginal of the distribution that is sampled by the slice Gibbs sampler, and the retrospective Gibbs sampler is a collapsed version of the slice Gibbs sampler (bar the Metropolis–Hastings step in the update of k). The slice Gibbs sampler proceeds by sampling k, u, v and z according to their full conditional distributions. The augmentation of u greatly simplifies the structure of expression (6), which now becomes

\[
p(k_t \mid y, m, s, u, v, z) \propto \sum_{j : w_j > u_t} f(y_t \mid m_{s_t}, z_j)\, \delta_j(k_t). \tag{7}
\]

The distribution now has finite support and the normalizing constant can be computed. Hence this distribution can be simulated by the inverse cumulative distribution function method by computing at most j*(t) terms for each t. u consists of conditionally independent elements with u_t ∼ Uni(0, w_{k_t}). However, the conditioning on u creates global dependence on the v_j s, whose distribution is given by expression (4) under the constraint w_j > u_t for all t = 1, ..., T. The easiest way to simulate from this constrained distribution is by single-site Gibbs sampling of the v_j s. This single-site Gibbs sampling tends to be slowly mixing, deteriorating with T.


Our method updates u and v in a single block, by first updating v from its marginal (with respect to u) according to expression (4) and subsequently u conditionally on v as described above. This scheme is feasible owing to the nested structure of the parameterizations of the retrospective and the slice Gibbs algorithms. The update of k is done as in the slice Gibbs sampler, and the update of z as described earlier. When a gamma prior is used for α, its conditional distribution given k and marginal with respect to the rest is a mixture of gamma distributions and can be simulated as described in Escobar and West (1995). The algorithm can easily incorporate the label switching moves that were discussed in section 3.4 of Papaspiliopoulos and Roberts (2008) (where the problem of multimodality for conditional Gibbs sampling methods is discussed in detail). Fortran 77 and MATLAB code are available on request from the authors.
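A minimal sketch (under illustrative names, a Gaussian kernel and a stick truncated at J components, none of which come from the paper) of the block update of (v, u) via expression (4) and of the slice update of k via expression (7):

```python
import numpy as np

def update_v_u(k, alpha, J, rng):
    """Step 5: draw v_j | k, alpha from (4), then u_t | v, k uniformly on (0, w_{k_t})."""
    counts = np.bincount(k, minlength=J)              # n_j = #{t: k_t = j}
    tail = counts.sum() - np.cumsum(counts)           # T - sum_{l <= j} n_l
    v = rng.beta(counts + 1.0, tail + alpha)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    u = rng.uniform(0.0, w[k])
    return v, w, u

def update_k(y, resid_mean, u, w, mu, sigma2, rng):
    """Step 2: sample k_t from (7); the support {j: w_j > u_t} is finite."""
    T = len(y)
    k = np.empty(T, dtype=int)
    for t in range(T):
        idx = np.nonzero(w > u[t])[0]
        # relative Gaussian kernel f(y_t | m_{s_t}, z_j), constants cancel on normalization
        p = np.exp(-0.5 * (y[t] - resid_mean[t] - mu[idx]) ** 2 / sigma2[idx])
        p /= np.sqrt(sigma2[idx])
        k[t] = rng.choice(idx, p=p / p.sum())
    return k

rng = np.random.default_rng(3)
k0 = rng.integers(0, 3, size=200)
v, w, u = update_v_u(k0, alpha=1.0, J=20, rng=rng)
k1 = update_k(rng.normal(size=200), np.zeros(200), u, w,
              mu=rng.normal(size=20), sigma2=np.ones(20), rng=rng)
```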

4. Simulation experiments

We design a thorough simulation study to assess the performance of the proposed algorithms in comparison with a wide variety of alternative Gibbs sampling schemes, and to investigate the robustness of the MDPHMM under various true emission distributions and prior specifications for the unknown parameters. We simulated data according to the following scheme:

\[
\begin{aligned}
y_t &\sim N(m_{0,s_t} + \mu_{k_t},\, 1/\lambda_{k_t}),\\
p(s_t = i \mid s_{t-1} = j) &= \pi_{i,j},\\
p(s_1 = s) &= \pi_0(s),\\
k_t &\sim \sum_{j=1}^{K} w_j\, \delta_j(\cdot).
\end{aligned}
\]

We take n = 2, π_0 = (1/2, 1/2) and the transition matrix Π is of the form
\[
\Pi = \begin{pmatrix} 1-\rho & \rho \\ \rho & 1-\rho \end{pmatrix}
\]
where the transition probability ρ = 0.05 and T = 1000. We consider various emission distributions specified as finite mixtures given in Table 1. They are based on previous simulation studies that were considered in Green and Richardson (2001), Papaspiliopoulos and Roberts (2008) and Walker (2007).

We also consider a bimod 1000 data set which involves no HMM component, i.e. m_0 = (0, 0), to test algorithms that are focused on posterior simulation for MDP models. In all cases a common Dirichlet process prior is assumed which generates pairs (μ_j, λ_j) according to a base measure N(0, 1) × Ga(1, 1) and concentration parameter α = 1.

Table 1. Simulation parameters

Simulation parameter   lepto 1000         bimod 1000           trimod 1000
K                      2                  2                    3
m0                     (0, 0.3)           (0, 1)               (0, 1)
w                      (0.67, 0.33)       (0.5, 0.5)           (1/3, 1/3, 1/3)
μ                      (0, 0.3)           (−1, 1)              (−4, 0, 8)
λ                      (1, 1/0.25²)       (1/0.5², 1/0.5²)     (1, 1, 1)


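A sketch (illustrative, not the authors' code) of this data-generating scheme with the bimod 1000 setting of Table 1; note that λ is a precision, so the component standard deviation is 1/√λ.

```python
import numpy as np

def simulate_test_data(T=1000, rho=0.05, m0=(0.0, 1.0),
                       w=(0.5, 0.5), mu=(-1.0, 1.0), lam=(4.0, 4.0), rng=None):
    """Simulate the two-state HMM with a finite-mixture emission; the defaults
    correspond to the 'bimod 1000' setting (lambda = 1/0.5^2 = 4 for both components)."""
    if rng is None:
        rng = np.random.default_rng(4)
    P = np.array([[1 - rho, rho], [rho, 1 - rho]])
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(2)
    for t in range(1, T):
        s[t] = rng.choice(2, p=P[s[t - 1]])
    k = rng.choice(len(w), size=T, p=np.asarray(w))
    y = (np.asarray(m0)[s] + np.asarray(mu)[k]
         + rng.normal(size=T) / np.sqrt(np.asarray(lam)[k]))
    return y, s, k

y, s, k = simulate_test_data()
```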

4.1. Mixture of Dirichlet processes posterior sampling schemes
We begin with a comparison of various Gibbs sampling algorithms for MDP models, i.e. we eliminate the HMM component from the model and focus on learning the underlying clustering mechanism. Since extensive comparisons between marginal and conditional algorithms have been carried out in Papaspiliopoulos and Roberts (2008), here we focus on the three conditional Gibbs sampling schemes that were considered in Section 3: the retrospective Markov chain Monte Carlo method of Papaspiliopoulos and Roberts (2008) with label switching moves (method R), the slice sampler of Walker (2007) (method SL) and the block Gibbs algorithm (method BGS) that was introduced in this paper. Fig. 2 investigates the mixing time of the various algorithms via the auto-correlations of two functionals of the parameters learned by the algorithms. We monitor the number of clusters, i.e. the number of components in the infinite mixture with at least one data point allocated to them, and a measure of model fit

\[
D = -2 \sum_{t=1}^{T} \log\Bigl\{ \sum_{j : n_j \neq 0} \frac{n_j}{T}\, f(y_t \mid z_j) \Bigr\},
\]

which is a meaningful function of several parameters of interest. More details on the choice of these functionals can be found in section 4 of Papaspiliopoulos and Roberts (2008). The bimod 1000 data set is used.
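For concreteness, a small sketch (illustrative, not the authors' code) of the fit measure D for a Gaussian kernel f(y | z_j) with z_j = (μ_j, σ²_j); the inputs are made up for the example.

```python
import numpy as np

def deviance_D(y, k, mu, sigma2):
    """D = -2 sum_t log{ sum_{j: n_j != 0} (n_j / T) f(y_t | z_j) } with a Gaussian kernel."""
    T = len(y)
    counts = np.bincount(k, minlength=len(mu))
    occ = counts > 0                       # only occupied components enter the sum
    mu_o, s2_o = mu[occ], sigma2[occ]
    dens = (np.exp(-0.5 * (y[:, None] - mu_o[None, :]) ** 2 / s2_o)
            / np.sqrt(2 * np.pi * s2_o))
    return -2.0 * np.log(dens @ (counts[occ] / T)).sum()

rng = np.random.default_rng(5)
print(deviance_D(rng.normal(size=500), rng.integers(0, 3, size=500),
                 mu=np.array([-1.0, 0.0, 1.0]), sigma2=np.ones(3)))
```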

Fig. 2. Simulation study of retrospective Markov chain Monte Carlo (broken curve), the slice sampler (dotted curve) and the new block Gibbs algorithm (full curve): auto-correlation plots correspond to (a) the number of clusters and (b) D = −2 Σ_{t=1}^{T} log{Σ_{j: n_j ≠ 0} (n_j/T) f(y_t | z_j)}


The experiment that is reported is representative of several others that we have carried out and do not include here. The computational times per iteration (in 'stationarity') of algorithms R and SL are similar, and about 50–60% higher than those of algorithm BGS. Additionally, the computational times for all the algorithms grow linearly with T, the size of the data. The retrospective Markov chain Monte Carlo algorithm mixes faster than the other algorithms, and the block Gibbs sampler is more efficient than the slice Gibbs sampler. However, the main advantage of imputing the auxiliary variables u and working with the block Gibbs sampler is fully appreciated in the MDPHMM context, as shown in what follows.

4.2. Mixture of Dirichlet processes hidden Markov model posterior sampling schemes
We consider inference for the full MDPHMM by using the simulated data sets that were outlined above. The prior distribution that was used for m is a bivariate Gaussian distribution with mean m_0 and covariance matrix ω⁻¹I, where I is the identity matrix and ω = 100. The transition probability ρ is fixed to the true value 0.05. Note that the prior for m is quite informative, but Section 4.3 considers a range of values for ω.

We applied six different Gibbs sampling approaches. The first is a marginal method that is based on algorithm 5 from Neal (2000) that updates (s_t, k_t) from its conditional distribution π(s_t, k_t | ·) according to the following scheme.

Step 1: draw a candidate k*_t from the conditional prior for k_t, where the conditional prior is given by
\[
p(k^*_t = j \mid k_{-t}) \propto
\begin{cases}
\dfrac{n_{-t,j}}{n - 1 + \alpha}, & \text{if } k_l = j \text{ for some } l \neq t,\\[1.5ex]
\dfrac{\alpha}{n - 1 + \alpha}, & \text{if } k_l \neq j \text{ for all } l \neq t,
\end{cases}
\]

where n_{-t,j} is the number of data points that are allocated to the jth component, not including the tth data point.
Step 2: draw a candidate state s*_t from the conditional prior distribution p(s_t | s_{t-1}, s_{t+1}).
Step 3: accept (s*_t, k*_t) with probability α{(s*_t, k*_t), (s_t, k_t)}, where
\[
\alpha\{(s^*_t, k^*_t), (s_t, k_t)\} = \min\Bigl\{1,\; \frac{\pi_{s_{t-1}, s^*_t}\, \pi_{s^*_t, s_{t+1}}\, f(y_t \mid m_{s^*_t}, z_{k^*_t})}{\pi_{s_{t-1}, s_t}\, \pi_{s_t, s_{t+1}}\, f(y_t \mid m_{s_t}, z_{k_t})} \Bigr\};
\]
otherwise leave (s_t, k_t) unchanged.
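A sketch (illustrative, not the authors' code) of the conditional-prior draw for k*_t in step 1: an existing component is proposed with probability proportional to n_{-t,j} and a new component with probability proportional to α, the normalization making the denominator implicit.

```python
import numpy as np

def propose_k(t, k, alpha, rng):
    """Draw k*_t from its conditional prior given k_{-t}: an existing component j
    is proposed with probability proportional to n_{-t,j}, and a new one with
    probability proportional to alpha (Neal, 2000, algorithm 5 style)."""
    counts = np.bincount(np.delete(k, t))            # n_{-t,j} for existing labels
    weights = np.append(counts, alpha).astype(float)
    j = rng.choice(len(weights), p=weights / weights.sum())
    return j                                         # j == len(counts) means "new component"

rng = np.random.default_rng(6)
k = rng.integers(0, 4, size=100)
k_star = propose_k(10, k, alpha=1.0, rng=rng)
```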

We also analysed the data sets by using variations of both the slice and the block Gibbs sampling approaches (blocking here refers to the MDP update). In the first approach, we sample from the conditional distributions π(s_t, k_t | ·).

Step 1: sample s_t from p(s_t | s_{t-1}, s_{t+1}, u, z, y), t = 1, ..., T.
Step 2: sample k_t from p(k_t | s, u, z, y), t = 1, ..., T.

We denote these slice samplers and block Gibbs samplers with local updates. The second approach uses forward–backward sampling to simulate from π(s | ·).

Step 1: sample s from p(s | u, z, y) by using the forward filtering–backward sampling method.
Step 2: sample k_t from p(k_t | s, u, z, y), t = 1, ..., T.

We denote these slice samplers and block Gibbs samplers with forward–backward updates.


Therefore, the slice and block Gibbs samplers differ in step 5 of the algorithm that was outlined in Section 3: the slice algorithm simulates from the joint distribution by Gibbs sampling whereas the block Gibbs sampler simulates directly. The algorithm of Section 3 is precisely the block Gibbs sampler with forward–backward updates. Finally, a third approach consists of a block conditional sampler with forward–backward updates, which does not integrate out k when updating s; therefore it uses the forward–backward algorithm to simulate from p(s | k, u, z, y).

For all the sampling methods, we generated 20000 sweeps (one sweep being equivalent to an update of all T allocation and state variables) and discarded the first 10000 as burn-in. We employed the following Gibbs updates for the mixture component parameters, for j = 1, ..., k*:

\[
\begin{aligned}
\mu_j &\sim N\Bigl(\frac{\xi_j \lambda_j}{n_j \lambda_j + 1},\; \frac{1}{n_j \lambda_j + 1}\Bigr),\\
\lambda_j &\sim \mathrm{Ga}\bigl(1 + n_j/2,\; 1 + d_j/2\bigr),
\end{aligned}
\]
where k* = max_t{k_t}, ξ_j = Σ_{t: k_t=j} (y_t − m_{s_t}), n_j = Σ_{t: k_t=j} 1 and d_j = Σ_{t: k_t=j} (y_t − m_{s_t})². The mean levels for each hidden state are updated by using
\[
m_i \sim N\Bigl(\frac{S_{\lambda,y} + \omega m_{i,0}}{S_\lambda + \omega},\; \frac{1}{S_\lambda + \omega}\Bigr), \qquad \text{where } S_\lambda = \sum_{t: s_t=i} \lambda_{k_t} \text{ and } S_{\lambda,y} = \sum_{t: s_t=i} \lambda_{k_t}(y_t - \mu_{k_t}).
\]

Fig. 3. Auto-correlation of m_{s_t} at various time instances for (a) the lepto 1000, (b) the bimod 1000 and (c) the trimod 1000 data sets (the auto-correlation times are significantly larger when updating s_t one at a time by using local Gibbs updates compared with updating the entire sequence s by using forward–backward sampling); methods compared: marginal Gibbs sampler; slice sampler using local updates; block Gibbs sampler using local updates; slice sampler using forward–backward updates; block Gibbs sampler using forward–backward updates; block conditional algorithm with forward–backward updates

Fig. 3 gives auto-correlation times for the three Gibbs samplers on the simulated data sets. In terms of updating the hidden states s, the use of forward–backward sampling gives a distinct advantage over the local updates. This replicates previous findings by Scott (2002), who showed that forward–backward Gibbs sampling for HMMs mixes faster than using local updates, as it is difficult to move from one configuration of s to another configuration of entirely different structure by using local updates only. This result motivates the use of the conditional augmentation structure that is adopted here, as it would otherwise be impossible to perform efficient forward–backward sampling of the hidden states s.

In Fig. 4 we plot the simulation output of (v_1, v_2) for the slice sampler and the block Gibbs sampler (using forward–backward updates). The mixing of the block Gibbs sampler is considerably better than that of the slice sampler. The block Gibbs sampler appears able to explore different modes in the posterior distribution of v for each of the three data sets whereas the slice sampler tends to be attached to one mode. The same problem is encountered in the block Gibbs sampler which does not integrate out the global allocation variables k when updating the HMM.

Fig. 4. Gibbs sampler output for (v_1, v_2) (the combination of the block Gibbs sampler with forward–backward updating of the hidden states can explore the posterior distribution of v most efficiently): (a)–(e) lepto 1000 data set; (f)–(j) bimod 1000 data set; (k)–(o) trimod 1000 data set; (a), (f), (k) slice sampler using local updates; (b), (g), (l) slice sampler using forward–backward updates; (c), (h), (m) block Gibbs sampler using local updates; (d), (i), (n) block conditional algorithm with forward–backward updates; (e), (j), (o) block Gibbs sampler using forward–backward updates


Table 2. Sensitivity analysis: posterior estimates of the mean level m_2 and transition probability ρ (±1 standard deviation) for the simulated data sets

Data set  ω     Parameter   T = 1000           T = 5000           T = 10000
lepto     1     m_2         0.3026 ± 0.0179    0.3080 ± 0.0048    0.2985 ± 0.0030
                ρ           0.0242 ± 0.0101    0.0401 ± 0.0055    0.0452 ± 0.0041
bimod     1     m_2         0.9848 ± 0.0186    0.9890 ± 0.0073    0.9983 ± 0.0052
                ρ           0.0534 ± 0.0078    0.0451 ± 0.0032    0.0464 ± 0.0023
trimod    1     m_2         1.0288 ± 0.1085    1.0120 ± 0.0510    1.0581 ± 0.0330
                ρ           0.0649 ± 0.0363    0.0438 ± 0.0113    0.0500 ± 0.0070
lepto     0.1   m_2         0.3049 ± 0.0178    0.3084 ± 0.0050    0.2983 ± 0.0030
                ρ           0.0265 ± 0.0120    0.0402 ± 0.0054    0.0456 ± 0.0042
bimod     0.1   m_2         0.9851 ± 0.0186    0.9891 ± 0.0072    0.9981 ± 0.0050
                ρ           0.0538 ± 0.0078    0.0455 ± 0.0032    0.0462 ± 0.0024
trimod    0.1   m_2         1.0445 ± 0.1140    1.0173 ± 0.0486    1.0588 ± 0.0337
                ρ           0.0738 ± 0.0564    0.0445 ± 0.0117    0.0505 ± 0.0075
lepto     0.01  m_2         0.3040 ± 0.0181    0.3084 ± 0.0049    0.2985 ± 0.0031
                ρ           0.0248 ± 0.0091    0.0402 ± 0.0053    0.0455 ± 0.0041
bimod     0.01  m_2         0.9850 ± 0.0185    0.9891 ± 0.0070    0.9981 ± 0.0052
                ρ           0.0541 ± 0.0079    0.0453 ± 0.0032    0.0401 ± 0.0023
trimod    0.01  m_2         1.0489 ± 0.1155    1.0122 ± 0.0472    1.0593 ± 0.0317
                ρ           0.0819 ± 0.074     0.0435 ± 0.0106    0.0507 ± 0.0074

4.3. Sensitivity analysis
To assess the sensitivity of the block Gibbs sampler to uncertainty in the HMM parameters, we reanalysed the data without informative priors or fixed parameters. We fixed m_1 = 0 for the first hidden state but used a Gaussian prior with unit variance for the mean level of the alternative state, m_2 ∼ N(0, 1), and a beta prior on the transition probabilities, ρ ∼ Be(1, 1). We used Gibbs updates for m_2 and ρ, using their respective conditional distributions. We also updated the concentration parameter α by using the mixtures of gamma distributions method in Escobar and West (1995), using a gamma prior distribution α ∼ Ga(1, 1). In addition, we also simulated data sets of length T = 1000, 5000, 10000 to assess the effect on larger data sets that are more representative of the data sequences that will typically be encountered in real applications.

Table 2 shows posterior estimates for the HMM parameters obtained from 2000 Markov chain Monte Carlo samples (10000 Markov chain Monte Carlo samples were obtained and thinned by taking every fifth sample) after a burn-in of 10000 iterations. Although the prior information that is specified is quite vague, the estimates are nonetheless concordant with the true values, and increases in data size improve accuracy and reduce uncertainty as desired. As expected, the trimod specification of the emission distribution, which involves three well-separated modes, is the most difficult to identify, and this is reflected in the larger confidence intervals.

5. Array comparative genomic hybridization data analysis

We analysed the mouse representational oligonucleotide microarray analysis (ROMA) data set from Lakshmi et al. (2006) that consists of approximately 84000 probe measurements from a DNA sample derived from a tumour generated in a mouse model of liver cancer compared with normal (non-tumour) DNA derived from the parent mouse. We examined chromosomes 3, 5, 9 and 19 for the ROMA data set as these contained validated copy number alterations.


For correspondence between the experimental set-up and the model, we think of y_t as representing the log-hybridization intensity ratio obtained from measurements from the microarray experiment; t denotes the genome order (an index after which the probes are sorted by genomic position); s_t denotes the unobserved copy number state in the case subject (e.g. 0, 1, 2, 3, etc.); m_j is the corresponding mean level for the jth copy number state.

We also studied a Nimblegen data set from Cahan et al. (2008) that consists of approximately 385000 probe measurements from the comparison of two inbred mouse strains. We examined chromosome 1 of this data set after inserting 11 deletions and 11 duplications (by using signal shifts of −0.5 and 0.5) of varying sizes to test the ability of the three methods to detect these alterations.

5.1. Models
We analysed the data sets by using the MDPHMM and two additional HMM-based models. The first is a standard HMM with Gaussian-distributed observations (which we shall denote as the G-HMM) and the second model uses Student t-distributions for the observations (which we shall denote as the robust HMM or R-HMM). These two models are representative of currently available HMM-based methods for analysing array CGH data sets.

Fig. 5. ROMA data analysis (markers indicate the positions of validated copy number aberrations (Lakshmi et al., 2006); the G-HMM identifies the known copy number variants but also many false positive results; the R-HMM and the MDPHMM reduce the number of false positive results and identify the known copy number variants only): (a)–(d) data; (e)–(h) marginal probability plots for the presence of a copy number aberration, G-HMM; (i)–(l) marginal probability plots for the R-HMM; (m)–(p) marginal probability plots for the MDPHMM; (a), (e), (i), (m) chromosome 3; (b), (f), (j), (n) chromosome 5; (c), (g), (k), (o) chromosome 9; (d), (h), (l), (p) chromosome 19

5.2. Prior specification
For this analysis, the models differ only in the observation density that was used; otherwise an identical HMM and prior structure is employed for all three models. We used a five-state HMM corresponding to copy numbers 0–4. The level of the copy neutral state (2) is fixed to 0 but we place Gaussian priors on the mean levels that are associated with each non-copy-neutral state, m_s ∼ N(m̄_s, 1), where m̄_0 = −1, m̄_1 = −0.58, m̄_3 = 0.5 and m̄_4 = 1. In terms of the specification of the m_j s, in this analysis, we have specified the prior means following an approximate visual inspection of the data and set the prior variances fairly large to estimate the levels from the data. This follows standard practice in array CGH analysis where prior information about the mean signal levels that are associated with different copy number states is often unavailable and must be inferred from the sample data themselves. This is due to various factors including variations in the exact physical–chemical processes underlying different microarray technologies, non-linearity in expression measurements (i.e. one copy does not produce half the expression of two copies) and the application of various preprocessing methods that can introduce unknown effects on the data. Similarly to Shah et al. (2006) and Guha et al. (2008), we also impose an order constraint such that m_0 < m_1 < m_2 < m_3 < m_4 to maintain model identifiability and to prevent label switching.

Fig. 6. Nimblegen analysis (markers indicate the positions of copy number aberrations that were inserted into the data set; the G-HMM identifies some of the copy number variants but not all, and many false positive results; the R-HMM reduces the number of false positive results and finds the majority of the variants inserted but assigns lower probability than the MDPHMM): (a) data; (b) marginal probability plot for the presence of a copy number aberration, G-HMM; (c) marginal probability plot, R-HMM; (d) marginal probability plot, MDPHMM

We use two transition probabilities, ρ_normal and ρ_cnv, to denote the different transition rates out of the copy neutral and copy number aberration states. These transition probabilities are divided equally, such that the probability of moving from the normal state to any given aberrant state is ρ_normal/4. Similarly, the transition probability of moving from a given aberrant state to another state is ρ_cnv/4. We used beta priors on both parameters, ρ_normal ∼ Be(1, 1) and ρ_cnv ∼ Be(1, 1). We used normal priors for the mixture centres, μ_k ∼ N(0, 1), and gamma-distributed priors for the precisions, λ_k ∼ Ga(1, 1). For the R-HMM, we also adopted a flat prior on the degrees of freedom ν given by p(ν) ∝ ν⁻², and for the MDPHMM we used a gamma prior on the Dirichlet process concentration parameter, α ∼ Ga(1, 1).
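A sketch (illustrative, not the authors' code) of the five-state transition matrix implied by this specification, with state 2 copy neutral and the off-diagonal mass split equally over the other four states:

```python
import numpy as np

def copy_number_transition_matrix(rho_normal, rho_cnv, n_states=5, neutral=2):
    """Build the 5x5 transition matrix described above: the copy neutral state
    leaves with total probability rho_normal, an aberrant state with rho_cnv,
    and in both cases the exit probability is split equally over the other states."""
    P = np.zeros((n_states, n_states))
    for i in range(n_states):
        rho = rho_normal if i == neutral else rho_cnv
        P[i, :] = rho / (n_states - 1)
        P[i, i] = 1.0 - rho
    return P

P = copy_number_transition_matrix(0.0002, 0.1)
assert np.allclose(P.sum(axis=1), 1.0)
```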

5.3. Posterior inference
For the MDPHMM, we used the block Gibbs sampler with forward–backward sampling as described previously, whereas for the G-HMM and R-HMM we employed standard forward–backward Gibbs sampling methods for HMMs. For the R-HMM, we utilized the scale mixture representation of the Student t-distribution to facilitate posterior inference and we used a Metropolis–Hastings update for ν. We used the mixtures of gamma distributions method in Escobar and West (1995) to update the concentration parameter α for the MDPHMM. We simulated 15000 samples and discarded the first 5000 samples as burn-in and, in the calculation of summary statistics, thinned by taking every fifth sample. On a MATLAB implementation, computational times for the ROMA data set were 256, 410 and 1851 s for the G-HMM, R-HMM and MDPHMM respectively, and 487, 775 and 4079 s for the Nimblegen data set. The number of components that were involved in the emission distribution (3) ranged from 5 to 14 (with mode at 7) for the Nimblegen data set, and from 4 to 16 (with mode at 6) for the ROMA data set.

Fig. 7. QQ-plots of predictive distributions (model against data) for (a) the ROMA and (b) the Nimblegen data (the empirical distribution of the data appears to be heavy tailed and contains a slight asymmetry; although this is problematic for the Student t-distribution that is used in the R-HMM, the MDPHMM has no such difficulties); both the R-HMM and the MDPHMM predictive distributions are shown

5.4. Results
Fig. 5 shows the analysis of the ROMA data set. The G-HMM, R-HMM and MDPHMM can identify a deletion that was found previously in Lakshmi et al. (2006); however, the G-HMM also identifies many other putative copy number variants. Although mouse tumours are likely to contain many copy number alteration events, the numbers that were predicted by the G-HMM are far too high. The R-HMM and MDPHMM provide more conservative and realistic estimates of the number of putative copy number variants in the tumour and identify only the validated copy number variants.

Fig. 8. Parameter estimation: posterior samples for various model parameters plotted against Markov chain Monte Carlo iteration (note that m_2 is fixed to 0 by assumption): (a) G-HMM (ρ_cnv, ρ_normal); (b) R-HMM (ρ_cnv, ρ_normal); (c) MDPHMM (ρ_cnv, ρ_normal); (d) G-HMM (m_0, m_1, m_2, m_3, m_4); (e) R-HMM (m_0, m_1, m_2, m_3, m_4); (f) MDPHMM (m_0, m_1, m_2, m_3, m_4)


Table 3. Posterior summary statistics for model parameters estimated from the ROMA data set

Parameter    G-HMM mean   G-HMM s.d.   R-HMM mean   R-HMM s.d.   MDPHMM mean   MDPHMM s.d.
ρ_normal     0.0204       0.0017       0.0002       0.0001       0.0002        0.1706
ρ_cnv        0.8682       0.0187       0.2157       0.2407       0.1112        0.1112
m_0          −1.8063      0.0377       −1.3392      0.7626       −0.9671       0.5313
m_1          −1.0017      0.0362       −0.5579      0.3598       −0.5141       0.1516
m_3          1.1449       0.0247       0.8872       0.4136       1.0639        0.1656
m_4          1.9562       0.0252       1.6705       0.4687       1.9964        0.2883

The slightly improved classification power of the MDPHMM arises because of the heavy-tailed and asymmetric distribution of the data. A comparison of histograms from 10000 drawsfrom the predictive distributions of the R-HMM and MDPHMM and the empirical distributionof the data is shown in Fig. 7. The extra flexibility of the MDPHMM is better able to capturethe empirical distribution of the data compared with the R-HMM.

The increased flexibility of the MDPHMM also leads to more stable estimates of HMMparameters such as the transition rates ρ and the mean signal levels ms that are associated witheach copy number state. Fig. 8 shows that the inability of the G-HMM to capture the complextail behaviour of the data leads to gross overestimates of the transition rates. In particular, theG-HMM estimates give an estimate for ρcnv ≈ 0:87 which arises because data lying in the tailstrigger artificial state transitions. Table 3 shows that the posterior standard deviation which isassociated with the model parameters is significantly greater with the R-HMM than with theMDPHMM. This suggests that, although the R-HMM provides an improved density modelfor array CGH data over the G-HMM, small departures from the Student t-distribution stillremain that lead to increased uncertainty due to model misspecification.

6. Discussion

This paper has introduced a new methodology for Bayesian semiparametric time series analysis. The approach, equipped with powerful computational machinery, is particularly suited to uncovering signal in long time series, and it provides a natural alternative to current approaches which model the emission distribution as a finite mixture model. The results in our genomic example are very promising, and we are already investigating further genetic applications of this work.


We have considered an a priori known number of states n. This assumption is very reasonable in the application that was considered here, as it is in various other contexts where HMM models are employed. There are of course contexts where this is not appropriate: information retrieval (where n might be the number of topics) and speaker diarization (where n might be the number of speakers) are typical such examples. Following the seminal paper by Teh et al. (2006), active research has pursued the interweaving of MDPs and the HMM to learn about n and allow it to grow with T. Therefore, in this line of work the MDP model is used to define more complex Markov dynamics. This turns out to be quite subtle from both the modelling and the computational side. We refer to Fox et al. (2009) for a detailed discussion, references on this line of work and various contributions. Fox et al. (2009) also considered flexible emission distributions in terms of Dirichlet processes, but operationally they approximated them by a finite mixture model. They also reported problems with existing algorithms for learning these classes of model, including with the so-called beam sampler of Van Gael et al. (2008), which is based on forward–backward recursions to block-update the latent process. Fox et al. (2009) advocated an approach which weakly approximates the Dirichlet processes by finite mixtures. It is interesting to investigate whether our approach of integrating out the global allocation variables when updating the hidden Markov chain, which is based on proposition 1, can be extended to this framework. This would remove the approximation error in the approach of Fox et al. (2009) while potentially alleviating mixing problems that are found in alternative approaches.

Acknowledgements

The second author acknowledges financial support by the Spanish Government through a 'Ramon y Cajal' fellowship. Christopher Yau is funded by a UK Medical Research Council Specialist Training Fellowship in Biomedical Informatics (reference G0701810).

Appendix A: Proof of proposition 1

Proposition 1 follows directly from the following result, which shows that the data y conditionally on (s, z, v, u) are independent even when the allocation variables k are integrated out:

\begin{align*}
p(y \mid s, z, v, u) &= \sum_{k} p(y \mid s, k, z)\, p(k \mid w, u) \\
&= \sum_{k} \prod_{t=1}^{T} f(y_t \mid m_{s_t}, z_{k_t})\, p(k_t \mid u_t, w) \\
&= \prod_{t=1}^{T} \sum_{j=1}^{\infty} 1[u_t < w_j]\, f(y_t \mid m_{s_t}, z_j) \\
&= \prod_{t=1}^{T} \sum_{j:\, u_t < w_j} f(y_t \mid m_{s_t}, z_j).
\end{align*}

The first equality follows by standard marginalization, where we have used the conditional independence to simplify each of the densities. The second equality follows from the conditional independence of the y_t and the k_t given the conditioning variables. We exploit the product structure to exchange the order of the summation and the product to obtain the third equality. The last equality is a re-expression of the previous equality.
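
In practice this identity is what makes the forward–backward step tractable: given the slice variables u and the instantiated stick-breaking weights w, each factor is a finite sum over the components with w_j > u_t, which can be evaluated exactly and used as the emission term in the standard forward recursion for s. The sketch below illustrates the structure of this computation under the simplifying assumption of Gaussian kernels; the names `w`, `mu`, `sigma`, `m`, `u`, `P` and `pi0` are hypothetical, and the code is a guide to the calculation rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def emission_matrix(y, m, mu, sigma, w, u):
    """B[t, s] = sum over {j : u_t < w_j} of f(y_t | m_s + mu_j, sigma_j).

    y : (T,) observations            m : (n,) state mean levels
    mu, sigma, w : (J,) instantiated component parameters and weights
    u : (T,) slice variables
    """
    T, n = len(y), len(m)
    B = np.zeros((T, n))
    for t in range(T):
        active = np.where(u[t] < w)[0]           # finitely many active components
        for s in range(n):
            B[t, s] = norm.pdf(y[t], loc=m[s] + mu[active],
                               scale=sigma[active]).sum()
    return B

def forward_filter(B, P, pi0):
    """Forward recursion for p(s_t | y_{1:t}) with emission matrix B,
    transition matrix P and initial distribution pi0."""
    T, n = B.shape
    alpha = np.zeros((T, n))
    alpha[0] = pi0 * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ P)
        alpha[t] /= alpha[t].sum()
    return alpha
```

The backward simulation of s given these filtered distributions then proceeds exactly as in a parametric HMM, since the allocation variables k no longer appear.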

References

Andersson, R., Bruder, C. E. G., Piotrowski, A., Menzel, U., Nord, H., Sandgren, J., Hvidsten, T. R., de Ståhl, T. D., Dumanski, J. P. and Komorowski, J. (2008) A segmental maximum a posteriori approach to genome-wide copy number profiling. Bioinformatics, 24, 751–758.
Antoniak, C. E. (1974) Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist., 2, 1152–1174.
Baum, L. E. (1966) Statistical inference for probabilistic functions of finite state space Markov chains. Ann. Math. Statist., 37, 1554–1563.
Cahan, P., Godfrey, L. E., Eis, P. S., Richmond, T. A., Selzer, R. R., Brent, M., McLeod, H. L., Ley, T. J. and Graubert, T. A. (2008) wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Res., 36, article e41.
Cappé, O., Moulines, E. and Rydén, T. (2005) Inference in Hidden Markov Models. New York: Springer.
Colella, S., Yau, C., Taylor, J. M., Mirza, G., Butler, H., Clouston, P., Bassett, A. S., Seller, A., Holmes, C. C. and Ragoussis, J. (2007) QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res., 35, 2013–2025.
Devroye, L. (1986) Non-uniform Random Variate Generation. New York: Springer.
Dunson, D. (2009) Multivariate kernel partition process mixtures. Statist. Sin., to be published.
Escobar, M. (1988) Estimating the means of several normal populations by nonparametric estimation of the distribution of the means. PhD Dissertation. Department of Statistics, Yale University, New Haven.
Escobar, M. D. and West, M. (1995) Bayesian density estimation and inference using mixtures. J. Am. Statist. Ass., 90, 577–588.
Fox, E., Sudderth, E., Jordan, M. and Willsky, A. (2009) The sticky HDP-HMM: Bayesian nonparametric hidden Markov models with persistent states. (Available from http://arxiv.org/abs/0905.2592.)
Gopich, I. V. and Szabo, A. (2009) Decoding the pattern of photon colors in single-molecule FRET. J. Phys. Chem. B, 113, 10965–10973.
Green, P. and Richardson, S. (2001) Modelling heterogeneity with and without the Dirichlet process. Scand. J. Statist., 28, 355–375.
Guha, S., Li, Y. and Neuberg, D. (2008) Bayesian hidden Markov modeling of array CGH data. J. Am. Statist. Ass., 103, 485–497.
Hamilton, J. D. (1989) A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384.
Hjort, N., Holmes, C., Müller, P. and Walker, S. (eds) (2010) Bayesian Nonparametrics: Principles and Practice. Cambridge: Cambridge University Press.
Horenko, I. and Schütte, C. (2008) Likelihood-based estimation of multidimensional Langevin models and its application to biomolecular dynamics. Multiscale Modlng Simuln, 7, 731–773.
Hu, J., Gao, J.-B., Cao, Y., Bottinger, E. and Zhang, W. (2007) Exploiting noise in array CGH data to improve detection of DNA copy number change. Nucleic Acids Res., 35, article e35.
Ishwaran, H. and James, L. (2001) Gibbs sampling methods for stick-breaking priors. J. Am. Statist. Ass., 96, 161–173.
Kim, C.-J. (1994) Dynamic linear models with Markov-switching. J. Econmetr., 60, 1–22.
Lakshmi, B., Hall, I. M., Egan, C., Alexander, J., Leotta, A., Healy, J., Zender, L., Spector, M. S., Xue, W., Lowe, S. W., Wigler, M. and Lucito, R. (2006) Mouse genomic representational oligonucleotide microarray analysis: detection of copy number variations in normal and tumor specimens. Proc. Natn. Acad. Sci. USA, 103, 11234–11239.
Lo, A. Y. (1984) On a class of Bayesian nonparametric estimates: I, Density estimates. Ann. Statist., 12, 351–357.
Manning, C. D. and Schuetze, H. (1999) Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
Marioni, J. C., Thorne, N. P. and Tavaré, S. (2006) BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22, 1144–1146.
McKinney, S. A., Joo, C. and Ha, T. (2006) Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J., 91, 1941–1951.
Muliere, P. and Tardella, L. (1998) Approximating distributions of random functionals of Ferguson-Dirichlet priors. Can. J. Statist., 26, 283–297.
Neal, R. (2000) Markov chain sampling methods for Dirichlet process mixture models. J. Computnl Graph. Statist., 9, 249–265.
Papaspiliopoulos, O. (2008) A note on posterior sampling from Dirichlet mixture models. Technical Report. Centre for Research in Statistical Methodology, University of Warwick, Coventry. (Available from http://www2.warwick.ac.uk/fac/sci/statistics/crism/research/2008/paper08-20.)
Papaspiliopoulos, O. and Roberts, G. O. (2008) Retrospective Markov chain Monte Carlo for Dirichlet process hierarchical models. Biometrika, 95, 169–186.
Pati, D. and Dunson, D. (2009) Bayesian nonparametric regression with varying residual density. Discussion Paper 2009-25. Department of Statistical Science, Duke University, Durham.
Rabiner, L. (1989) A tutorial on HMM and selected applications in speech recognition. Proc. IEEE, 77, 257–286.
Scott, S. (2002) Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Statist. Ass., 97, 337–351.
Shah, S. P., Xuan, X., DeLeeuw, R. J., Khojasteh, M., Lam, W. L., Ng, R. and Murphy, K. P. (2006) Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics, 22, e431–e439.
Stjernqvist, S., Rydén, T., Sköld, M. and Staaf, J. (2007) Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics, 23, 1006–1014.
Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006) Hierarchical Dirichlet processes. J. Am. Statist. Ass., 101, 1566–1581.
Van Gael, J., Saatci, Y., Teh, Y. W. and Ghahramani, Z. (2008) Beam sampling for the infinite hidden Markov model. In ICML '08: Proc. 25th Int. Conf. Machine Learning, pp. 1088–1095. New York: Association for Computing Machinery.
Walker, S. (2007) Sampling the Dirichlet mixture model with slices. Communs Statist. Simuln Computn, 36, 45–54.

