Stat Comput (2008) 18: 343–373
DOI 10.1007/s11222-008-9110-y

    A tutorial on adaptive MCMC

    Christophe Andrieu

    Johannes Thoms

Received: 23 January 2008 / Accepted: 19 November 2008 / Published online: 3 December 2008
© Springer Science+Business Media, LLC 2008

Abstract We review adaptive Markov chain Monte Carlo algorithms (MCMC) as a means to optimise their performance. Using simple toy examples we review their theoretical underpinnings, and in particular show why adaptive MCMC algorithms might fail when some fundamental properties are not satisfied. This leads to guidelines concerning the design of correct algorithms. We then review criteria and the useful framework of stochastic approximation, which allows one to systematically optimise generally used criteria, but also to analyse the properties of adaptive MCMC algorithms. We then propose a series of novel adaptive algorithms which prove to be robust and reliable in practice. These algorithms are applied to artificial and high dimensional scenarios, but also to the classic mine disaster dataset inference problem.

Keywords MCMC · Adaptive MCMC · Controlled Markov chain · Stochastic approximation

C. Andrieu (✉)
School of Mathematics, University of Bristol, Bristol BS8 1TW, UK
e-mail: [email protected]
url: http://www.stats.bris.ac.uk/~maxca

J. Thoms
Chairs of Statistics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland

1 Introduction

Markov chain Monte Carlo (MCMC) is a general strategy for generating samples {X_i, i = 0, 1, ...} from complex high-dimensional distributions, say π defined on a space X ⊂ R^{n_x} (assumed for simplicity to have a density with respect to the Lebesgue measure, also denoted π), from which integrals of the type

I(f) := ∫_X f(x) π(x) dx,

for some π-integrable functions f : X → R^{n_f}, can be approximated using the estimator

Î_N(f) := (1/N) ∑_{i=1}^N f(X_i),  (1)

provided that the Markov chain generated with, say, transition P is ergodic, i.e. it is guaranteed to eventually produce samples {X_i} distributed according to π. Throughout this review we will refer, in broad terms, to the consistency of such estimates and the convergence of the distribution of X_i to π as π-ergodicity. The main building block of this class of algorithms is the Metropolis–Hastings (MH) algorithm. It requires the definition of a family of proposal distributions {q(x, ·), x ∈ X} whose role is to generate possible transitions for the Markov chain, say from X to Y, which are then accepted or rejected according to the probability

α(X, Y) = min{ 1, [π(Y) q(Y, X)] / [π(X) q(X, Y)] }.

The simplicity and universality of this algorithm are both its strength and its weakness. Indeed, the choice of the proposal distribution is crucial: the statistical properties of the Markov chain heavily depend upon this choice, an inadequate choice resulting in possibly poor performance of the Monte Carlo estimators. For example, in the toy case where n_x = 1 and the normal symmetric random walk Metropolis algorithm (N-SRWM) is used to produce transitions, the


density of the proposal distribution is of the form

q_θ(x, y) = (1/√(2πθ²)) exp( −(y − x)² / (2θ²) ),

where θ² is the variance of the proposed increments, hence defining a Markov transition probability P_θ. The variance of the corresponding estimator Î_N(f), which we wish to be as small as possible for the purpose of efficiency, is well known to be typically unsatisfactory for values of θ² that are either too small or too large in comparison to optimal or suboptimal value(s). In more realistic scenarios, MCMC algorithms are in general combinations of several MH updates {P_{k,θ}, k = 1, ..., n, θ ∈ Θ} for some set Θ, with each having its own parametrised proposal distribution q_{k,θ} for k = 1, ..., n and sharing π as common invariant distribution. These transition probabilities are usually designed in order to capture various features of the target distribution π and in general chosen to complement one another. Such a combination can for example take the form of a mixture of different strategies, i.e.

P_θ(x, dy) = ∑_{k=1}^n w_k(θ) P_{k,θ}(x, dy),  (2)

where for any θ ∈ Θ, ∑_{k=1}^n w_k(θ) = 1, w_k(θ) ≥ 0, but can also, for example, take the form of combinations (i.e. products of transition matrices in the discrete case) such as

P_θ(x, dy) = P_{1,θ} P_{2,θ} ··· P_{n,θ}(x, dy).
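As a concrete illustration of the mixture form (2), the sketch below (not from the paper) combines two random walk MH kernels with different increment scales; the standard normal target, the scales and the weights are illustrative assumptions. Since each component kernel leaves π invariant, so does the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    # Illustrative target: standard normal density, up to a constant.
    return -0.5 * x ** 2

def mh_step(x, sigma):
    # One symmetric random walk Metropolis step with proposal N(x, sigma^2).
    y = x + sigma * rng.normal()
    return y if np.log(rng.uniform()) < log_pi(y) - log_pi(x) else x

def mixture_step(x, sigmas=(0.1, 10.0), weights=(0.5, 0.5)):
    # Mixture of strategies as in (2): pick component k with probability
    # w_k, then move according to P_k; pi is invariant for the mixture.
    k = rng.choice(len(sigmas), p=weights)
    return mh_step(x, sigmas[k])

x, xs = 0.0, []
for _ in range(20000):
    x = mixture_step(x)
    xs.append(x)
print(np.mean(xs), np.var(xs))  # both close to the target's 0 and 1
```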

Both examples are particular cases of the class of Markov transition probabilities P_θ on which we shall focus in this paper: they are characterised by the fact that (a) they belong to a family of parametrised transition probabilities {P_θ, θ ∈ Θ} (for some problem dependent set Θ; Θ = (0, +∞) in the toy example above); (b) for all θ ∈ Θ, π is an invariant distribution for P_θ, which is assumed to be ergodic; (c) the performance of P_θ, for example the variance of Î_N(f) above, is sensitive to the choice of θ.

Our aim in this paper is to review the theoretical underpinnings and recent methodological advances in the area of computer algorithms that aim to optimise such parametrised MCMC transition probabilities in order to lead to computationally efficient and reliable procedures. As we shall see we also suggest new algorithms. One should note at this point that in some situations of interest, such as tempering type algorithms (Geyer and Thompson 1995), property (b) above might be violated and instead the invariant distribution of P_θ might depend on θ (although only a non-θ-dependent feature of this distribution might be of interest to us for practical purposes). We will not consider this case in depth here, but simply note that most of the arguments and ideas presented hereafter generally carry over to this slightly more complex scenario, e.g. (Benveniste et al. 1990; Atchadé and Rosenthal 2005).

The choice of a criterion to optimise is clearly the first decision that needs to be made in practice. We discuss this issue in Sect. 4.1, where we point out that most sensible optimality or suboptimality criteria can be expressed in terms of expectations with respect to the steady-state distributions of Markov chains generated by P_θ for θ ∈ Θ fixed, and make new suggestions in Sect. 5 which are subsequently illustrated on examples in Sect. 6. We will denote by θ* a generic optimal value for our criteria, which is always assumed to exist hereafter.

In order to optimise such criteria, or even simply find suboptimal values for θ, one could suggest sequentially running a standard MCMC algorithm with transition P_θ for a set of values of θ (either predefined or defined sequentially) and computing the criterion of interest (or its derivative etc.) once we have evidence that equilibrium has been reached. This can naturally be wasteful and we will rather focus here on a technique which belongs to the well known class of processes called controlled Markov chains (Borkar 1990) in the engineering literature, which we will refer to as controlled MCMC (Andrieu and Robert 2001), due to their natural filiation. More precisely we will assume that the algorithm proceeds as follows. Given a family of transition probabilities {P_θ, θ ∈ Θ} defined on X such that for any θ ∈ Θ, πP_θ = π (meaning that if X_i ∼ π, then X_{i+1} ∼ π, X_{i+2} ∼ π, ...) and given a family of (possibly random) mappings {θ̄_i : Θ × X^{i+1} → Θ, i = 1, ...}, which encodes what is meant by optimality by the user, the most general form of a controlled MCMC proceeds as follows:

Algorithm 1 Controlled Markov chain Monte Carlo
Sample initial values θ_0, X_0 ∈ Θ × X.
Iteration i + 1 (i ≥ 0), given θ_i = θ̄_i(θ_0, X_0, ..., X_i) from iteration i:
1. Sample X_{i+1} | (θ_0, X_0, ..., X_i) ∼ P_{θ_i}(X_i, ·).
2. Compute θ_{i+1} = θ̄_{i+1}(θ_0, X_0, ..., X_{i+1}).
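In code, Algorithm 1 is a plain loop; the following Python sketch is generic, with sample_kernel standing for the family {P_θ} and update_theta for the mappings {θ̄_i} (both names are hypothetical placeholders to be supplied by the user).

```python
import numpy as np

def controlled_mcmc(theta0, x0, sample_kernel, update_theta, n_iter, seed=0):
    # Generic controlled MCMC loop (Algorithm 1):
    #   sample_kernel(theta, x, rng) draws X_{i+1} ~ P_theta(x, .)
    #   update_theta(theta, xs)      computes theta_{i+1} from the past states
    rng = np.random.default_rng(seed)
    theta, xs, thetas = theta0, [x0], [theta0]
    for _ in range(n_iter):
        xs.append(sample_kernel(theta, xs[-1], rng))  # step 1
        theta = update_theta(theta, xs)               # step 2
        thetas.append(theta)
    return np.array(xs), np.array(thetas)
```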

In Sect. 4.2 we will restrict our attention to particular mappings well suited to our purpose of computationally efficient sequential updating of {θ_i} for MCMC algorithms, which rely on the Robbins–Monro update and more generally on the stochastic approximation framework (Benveniste et al. 1990). However, before embarking on the description of practical procedures to optimise MCMC transition probabilities, we will first investigate, using mostly elementary undergraduate level tools, some of the theoretical ergodicity properties of controlled MCMC algorithms.

Indeed, as we shall see, despite the assumption that for any θ ∈ Θ, πP_θ = π, adaptation in the context of MCMC


using the controlled approach leads to complications. In fact, this type of adaptation can easily perturb the ergodicity properties of MCMC algorithms. In particular algorithms of this type will in most cases lead to the loss of π as an invariant distribution of the process {X_i}, which intuitively should be the minimum requirement to produce samples from π and lead to consistent estimators. Note also that when not carefully designed such controlled MCMC can lead to transient processes or processes such that Î_N(f) is not consistent. Studying the convergence properties of such processes naturally raises the question of the relevance of such developments in the present context. Indeed it is often argued that one might simply stop adaptation once we have enough evidence that {θ_i} has reached a satisfactory optimal or suboptimal value of θ and then simply use samples produced by a standard MCMC algorithm using such a fixed good value. No new theory should then be required. While apparently valid, this remark ignores the fact that most criteria of interest depend explicitly on features of π, which can only be evaluated with... MCMC algorithms. For example, as mentioned above, most known and useful criteria can be formulated as expectations with respect to distributions which usually explicitly involve π.

Optimising such criteria, or finding suboptimal values of θ, thus requires one to be able to sample, perhaps approximately or asymptotically, from π, which in the context of controlled MCMC requires one to ensure that the process described above can, in principle, achieve this aim. This, in our opinion, motivates and justifies the need for such theoretical developments as they establish whether or not controlled MCMC can, again in principle, optimise such π-dependent criteria. Note that convergence of {θ_i} should itself not be overlooked since, in light of our earlier discussion of the univariate N-SRWM, optimisation of {P_θ} is our primary goal and should be part of our theoretical developments. Note that users wary of the perturbation to ergodicity brought by adaptation might naturally choose to freeze {θ_i} to a value θ_τ beyond an iteration τ and consider only samples produced by the induced Markov chain for their inference problem. A stopping rule is described in Sect. 4.2.2. In fact, as we shall see, it is possible to run the two procedures simultaneously.

Finally, whereas optimising an MCMC algorithm seems a legitimate thing to do, one might wonder if it is computationally worth adapting. This is a very difficult question for which there is probably no straight answer. The view we adopt here is that such optimisation schemes are very useful tools to design or help the design of efficient MCMC algorithms which, while leading to some additional computation, have the potential to spare the MCMC user significant implementation time.

The paper is organised as follows. In Sect. 2 we provide toy examples that illustrate the difficulties introduced by the adaptation of MCMC algorithms. In Sect. 3 we discuss why one might expect vanishing adaptation to lead to processes such that {X_i} can be used in order to estimate expectations with respect to π. This section might be skipped on a first reading. In Sect. 4 we first discuss various natural criteria which are motivated by theory, but to some extent simplified in order to lead to useful and implementable algorithms. We then go on to describe how the standard framework of stochastic approximation, of which the Robbins–Monro recursion is the cornerstone, provides us with a systematic framework to design families of mappings {θ̄_i} in a recursive manner and understand their properties. In Sect. 5 we present a series of novel adaptive algorithms which circumvent some of the caveats of existing procedures. These algorithms are applied to various examples in Sect. 6.

    2 The trouble with adaptation

In this section we first illustrate the loss of π-ergodicity of controlled MCMC with the help of two simple toy examples. The level of technicality required for these two examples is that of a basic undergraduate course on Markov chains. Despite their simplicity, these examples suggest that vanishing adaptation (a term made more precise later) might preserve asymptotic π-ergodicity. We then finish this section by formulating more precisely the fundamental difference between standard MCMC algorithms and their controlled counterparts which affects the invariant distribution of the algorithm. This requires the introduction of some additional notation used in Sect. 4 and a basic understanding of expectations to justify vanishing adaptation, but does not significantly raise the level of technicality.

Consider the following toy example, suggested in Andrieu and Moulines (2006), where X = {1, 2} and π = (1/2, 1/2) (it is understood here that for such a case we will abuse notation and use π for the vector of values of π and P_θ for the transition matrix) and where the family of transition probabilities under consideration is of the form, for any θ ∈ Θ := (0, 1),

P_θ = ( P_θ(X_i = 1, X_{i+1} = 1)  P_θ(X_i = 1, X_{i+1} = 2) )
      ( P_θ(X_i = 2, X_{i+1} = 1)  P_θ(X_i = 2, X_{i+1} = 2) )

    = ( θ      1 - θ )
      ( 1 - θ  θ     ).  (3)

It is clear that for any θ ∈ Θ, π is a left eigenvector of P_θ with eigenvalue 1,

πP_θ = π,

i.e. π is an invariant distribution of P_θ. For any θ ∈ Θ the Markov chain is obviously irreducible and aperiodic, and by standard theory is therefore ergodic, i.e. for any starting


probability distribution μ,

lim_{i→∞} μP_θ^i = π

(with P_θ^i the i-th power of P_θ), and for any finite real-valued function f,

lim_{N→∞} (1/N) ∑_{i=1}^N f(X_i) = E_π(f(X)),

almost surely, where for any probability distribution λ, E_λ represents the expectation operator with respect to λ. Now assume that θ is adapted to the current state in order to sample the next state of the chain, and assume for now that this adaptation is a time invariant function of the previous state of the MC. More precisely assume that for any i ≥ 1 the transition from X_i to X_{i+1} is parametrised by θ(X_i), where θ : X → Θ. The remarkable property, specific to this purely pedagogical example, is that {X_i} is still in this case a time homogeneous Markov chain with transition probability

P̄(X_i = a, X_{i+1} = b) := P_{θ(a)}(X_i = a, X_{i+1} = b)

for a, b ∈ X, resulting in the time homogeneous transition matrix

P̄ := ( θ(1)      1 - θ(1) )
      ( 1 - θ(2)  θ(2)     ).  (4)

Naturally the symmetry of P_θ above is lost and one can check that the invariant distribution of P̄ is

π̄ = ( (1 - θ(2)) / (2 - θ(1) - θ(2)),  (1 - θ(1)) / (2 - θ(1) - θ(2)) ) ≠ π,

in general. For θ(1), θ(2) ∈ Θ the time homogeneous Markov chain will be ergodic, but will fail to converge to π as soon as θ(1) ≠ θ(2), that is, as soon as there is dependence on the current state. As we shall see, the principle of vanishing adaptation consists, in the present toy example, of making both θ(1) and θ(2) time dependent (deterministically for simplicity here), denoted θ_i(1) and θ_i(2) at iteration i, and ensuring that as i → ∞, |θ_i(1) - θ_i(2)| vanishes. Indeed, while {θ_i(1)} and {θ_i(2)} are allowed to evolve forever (and maybe not converge), the corresponding transition probabilities {P_i := P̄_{θ_i}} have invariant distributions {π_i} convergent to π. We might hence expect to recover π-ergodicity. In fact in the present case standard theory for non-homogeneous Markov chains can be used in order to find conditions on {θ_i} that ensure ergodicity, but we do not pursue this in depth here.
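The loss of invariance is easy to check numerically. The sketch below (with illustrative values θ(1) = 0.9 and θ(2) = 0.5) builds P̄ of (4), extracts its invariant distribution as the left eigenvector associated with eigenvalue 1, and compares it with the closed form above.

```python
import numpy as np

theta = {1: 0.9, 2: 0.5}                       # illustrative values
P_bar = np.array([[theta[1], 1 - theta[1]],
                  [1 - theta[2], theta[2]]])   # matrix (4)

# Invariant distribution: left eigenvector of P_bar for eigenvalue 1.
w, v = np.linalg.eig(P_bar.T)
pi_bar = np.real(v[:, np.argmax(np.real(w))])
pi_bar /= pi_bar.sum()
print(pi_bar)                                  # [0.8333..., 0.1666...]

# Closed form: (1 - theta(2), 1 - theta(1)) / (2 - theta(1) - theta(2)).
den = 2 - theta[1] - theta[2]
print((1 - theta[2]) / den, (1 - theta[1]) / den)  # agrees; far from 1/2
```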

It could be argued, and this is sometimes suggested, that the problem with the example above is that in order to preserve π as a marginal distribution, θ should not depend on X_i for the transition to X_{i+1}, but on X_0, ..., X_{i-1} only. For simplicity assume that the dependence is on X_{i-1} only. Then it is sometimes argued that since

(π(X_i = 1), π(X_i = 2)) ( P_{θ(X_{i-1})}(X_i = 1, X_{i+1} = 1)  P_{θ(X_{i-1})}(X_i = 1, X_{i+1} = 2) )
                         ( P_{θ(X_{i-1})}(X_i = 2, X_{i+1} = 1)  P_{θ(X_{i-1})}(X_i = 2, X_{i+1} = 2) )

  = (π(X_i = 1), π(X_i = 2)) ( θ(X_{i-1})      1 - θ(X_{i-1}) )
                             ( 1 - θ(X_{i-1})  θ(X_{i-1})     )

  = (π(X_{i+1} = 1), π(X_{i+1} = 2)),

then X_{i+1}, X_{i+2}, ... are all marginally distributed according to π. Although this calculation is correct, the underlying reasoning is naturally incorrect in general. This can be checked in two ways. First through a counterexample which only requires elementary arguments. Indeed in the situation

just outlined, the law of X_{i+1} given θ_0, X_0, ..., X_{i-1}, X_i is P_{θ(X_{i-1})}(X_i, ·), from which we deduce that Z_i = (Z_i(1), Z_i(2)) := (X_i, X_{i-1}) is a time homogeneous Markov chain with transition

P_{θ(Z_i(2))}(Z_i(1), Z_{i+1}(1)) I{Z_{i+1}(2) = Z_i(1)},

where for a set A, I_A denotes its indicator function. Denoting the states 1 := (1, 1), 2 := (1, 2), 3 := (2, 1) and 4 := (2, 2), the transition matrix of the time homogeneous Markov chain is

P̄ = ( θ(1)  0         1 - θ(1)  0    )
     ( θ(2)  0         1 - θ(2)  0    )
     ( 0     1 - θ(1)  0         θ(1) )
     ( 0     1 - θ(2)  0         θ(2) )

and it can be directly checked that the marginal invariant distribution of Z_i(1) is

π̄ = ( 2 + θ(2)/(1 - θ(1)) + θ(1)/(1 - θ(2)) )^{-1} ( 1 + θ(2)/(1 - θ(1)),  1 + θ(1)/(1 - θ(2)) ) ≠ (1/2, 1/2)
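This too can be verified numerically; the sketch below assembles the 4 × 4 transition matrix of {Z_i}, computes its invariant distribution and compares the implied marginal of Z_i(1) with the closed form above (θ(1) = 0.9 and θ(2) = 0.5 again illustrative).

```python
import numpy as np

a, b = 0.9, 0.5    # theta(1), theta(2); illustrative values
# Transition matrix of Z_i = (X_i, X_{i-1}) over (1,1), (1,2), (2,1), (2,2).
P = np.array([[a,     0, 1 - a, 0],
              [b,     0, 1 - b, 0],
              [0, 1 - a,     0, a],
              [0, 1 - b,     0, b]])
w, v = np.linalg.eig(P.T)
p = np.real(v[:, np.argmin(np.abs(w - 1))])
p /= p.sum()
print(p[0] + p[1], p[2] + p[3])     # marginal law of Z_i(1) = X_i

# Closed form given in the text.
z = 2 + b / (1 - a) + a / (1 - b)
print((1 + b / (1 - a)) / z, (1 + a / (1 - b)) / z)  # agrees; not (1/2, 1/2)
```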

in general. The second and more informative approach consists of considering the actual distribution of the process generated by a controlled MCMC. Let us denote E_{θ,x} the expectation for the process started at some arbitrary θ ∈ Θ, x ∈ X. This operator is particularly useful to describe the expectation of φ(X_i, X_{i+1}, ...) for any i ≥ 1 and any function φ : X^k → R, E_{θ,x}(φ(X_i, X_{i+1}, ..., X_{i+k-1})). More precisely it allows one to clearly express the dependence of θ̄_i(θ_0, X_0, ..., X_i) on the past θ_0, X_0, ..., X_i of the process. Indeed for any f : X → R, using the tower property of expectations and the definition of controlled MCMC given


in the introduction, we find that

E_{θ,x}(f(X_{i+1})) = E_{θ,x}[ E(f(X_{i+1}) | θ_0, X_0, ..., X_i) ]
                    = E_{θ,x}[ ∫_X P_{θ̄_i(θ_0,X_0,...,X_i)}(X_i, dx′) f(x′) ],  (5)

which is another way of saying that the distribution of X_{i+1} is that of a random variable sampled, conditional upon θ_0, X_0, ..., X_i, according to the random transition P_{θ̄_i(θ_0,X_0,...,X_i)}(X_i, ·), where the pair (θ̄_i(θ_0, X_0, ..., X_i), X_i) is randomly drawn from a distribution completely determined by the possible histories θ_0, X_0, ..., X_i. In the case where X is a finite discrete set, writing this relation concisely as the familiar product of a row vector and a transition matrix as above would require one to determine the (possibly very large) set of values for the pair (θ̄_i(θ_0, X_0, ..., X_i), X_i) (say W_i), the vector representing the probability distribution of all these pairs, as well as the transition matrix from W_i to X. The introduction of the expectation allows one to bypass these conceptual and notational difficulties. We will hereafter denote

Φ(θ_0, X_0, ..., X_i) := ∫_X P_{θ̄_i(θ_0,X_0,...,X_i)}(X_i, dx) f(x),

and whenever possible will drop unnecessary arguments, i.e. arguments of Φ which do not affect its values.

The possibly complex dependence of the transition of the process to X_{i+1} on (θ̄_i(θ_0, X_0, ..., X_i), X_i) needs to be contrasted with the case of standard MCMC algorithms. Indeed, in this situation the randomness of the transition probability only stems from X_i. This turns out to be a major advantage when it comes to invariant distributions. Let us assume that for some i ≥ 1, E_{θ,x}(g(X_i)) = E_π(g(X)) for all π-integrable functions g. Then according to the identity in (5), for any given θ ∈ Θ and θ_i = θ for all i ≥ 0, a standard MCMC algorithm has the well known and fundamental property

E_{θ,x}(f(X_{i+1})) = E_{θ,x}(Φ(θ, X_i)) = E_π(Φ(θ, X))
                    = ∫_{X×X} π(dx) P_θ(x, dy) f(y) = E_π(f(X)),

where the second equality stems from the assumption E_{θ,x}(g(X_i)) = E_π(g(X)) and the last equality is obtained by the assumed invariance of π for P_θ for any θ ∈ Θ. Now we turn to the controlled MCMC process and focus for simplicity on the case θ̄_i(θ_0, X_0, ..., X_i) = θ(X_{i-1}), corresponding to our counterexample. Assume that for some i ≥ 1, X_i is marginally distributed according to π, i.e. for any g : X → R, E_{θ,x}(g(X_i)) = E_π(g(X)); then we would like to check whether E_{θ,x}(g(X_j)) = E_π(g(X)) for all j ≥ i. However,

using the tower property of expectations in order to exploit the property E_{θ,x}(g(X_i)) = E_π(g(X)),

E_{θ,x}(f(X_{i+1})) = E_{θ,x}(Φ(X_{i-1}, X_i)) = E_{θ,x}[ E(Φ(X_{i-1}, X_i) | X_i) ]
                    = E_π[ E(Φ(X_{i-1}, X) | X) ].

Now it would be tempting to use the stationarity assumption in the last expression,

E_π(Φ(X_{i-1}, X)) = ∫_{X×X} π(dx) P_{θ(X_{i-1})}(x, dy) f(y) = E_π(f(X)).

This is however not possible due to the presence of the conditional expectation E(· | X) (which crucially depends on X), and we conclude that in general

E_{θ,x}(f(X_{i+1})) ≠ E_π[ E( ∫_X P_{θ(X_{i-1})}(X, dx_{i+1}) f(x_{i+1}) ) ].

    The misconception that this inequality might be an equalityis at the root of the incorrect reasoning outlined earlier. Thisproblem naturally extends to more general situations.

Vanishing adaptation seems, intuitively, to offer the possibility to circumvent the problem of the loss of π as invariant distribution. However, as illustrated by the following toy example, vanishing adaptation might come with its own shortcomings. Consider a (deterministic) sequence {θ_i} ∈ (0, 1)^N and for simplicity first consider the non-homogeneous, and non-adaptive, Markov chain {X_i} with transition P_{θ_i} at iteration i ≥ 1, where P_θ is given by (3), and initial distribution (μ, 1 - μ) for μ ∈ [0, 1]. One can easily check that for any n ≥ 1 the product of matrices P_{θ_1} ··· P_{θ_n} has the simple expression

P_{θ_1} ··· P_{θ_n} = (1/2) ( 1 + ∏_{i=1}^n (2θ_i - 1)   1 - ∏_{i=1}^n (2θ_i - 1) )
                            ( 1 - ∏_{i=1}^n (2θ_i - 1)   1 + ∏_{i=1}^n (2θ_i - 1) ).

As a result one deduces that the distribution of X_n is

(1/2) ( 1 + (2μ - 1) ∏_{i=1}^n (2θ_i - 1),  1 - (2μ - 1) ∏_{i=1}^n (2θ_i - 1) ).

Now if θ_i → 0 (resp. θ_i → 1) and ∑_{i=1}^∞ θ_i < +∞ (resp. ∑_{i=1}^∞ (1 - θ_i) < +∞), that is, if convergence of {θ_i} to either 0 or 1 is too fast, then lim_{n→∞} ∏_{i=1}^n (2θ_i - 1) ≠ 0 and as a consequence, whenever μ ≠ 1/2, the distribution of X_n does not converge to π = (1/2, 1/2).
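The effect of adapting too fast can be read off the product formula directly. The sketch below evaluates the law of X_n for a summable sequence {θ_i} (the product stays bounded away from zero, so the bias persists) and for a slowly decaying one (the product vanishes); the particular sequences are illustrative choices.

```python
import numpy as np

def law_of_X_n(thetas, mu=1.0):
    # Distribution of X_n for the non-homogeneous chain with transitions
    # P_{theta_i} of (3) and initial distribution (mu, 1 - mu), using the
    # product formula above.
    prod = np.prod(2.0 * np.asarray(thetas) - 1.0)
    p1 = 0.5 * (1.0 + (2.0 * mu - 1.0) * prod)
    return np.array([p1, 1.0 - p1])

n = 10000
fast = 0.5 ** np.arange(2, n + 2)          # sum(theta_i) < infinity
slow = 0.5 / np.sqrt(np.arange(2, n + 2))  # sum(theta_i) = infinity
print(law_of_X_n(fast))   # stays away from (1/2, 1/2)
print(law_of_X_n(slow))   # approximately (1/2, 1/2)
```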


Similar developments are possible for the toy adaptive MCMC algorithm given by the transition matrix P̄ in (4), at the expense of extra technical complications, and lead to the same conclusions. This toy example points to potential difficulties encountered by controlled MCMC algorithms that exploit vanishing adaptation: whereas π-ergodicity of P_θ is ensured for any θ ∈ Θ, this property might be lost if the sequence {θ_i} wanders towards bad values of θ for which convergence to equilibrium of the corresponding fixed parameter Markov chains P_θ might take an arbitrarily long time.

This point is detailed in the next section, but we first turn to a discussion concerning the possibility of using vanishing adaptation in order to circumvent the loss of invariance of π by controlled MCMC.

    3 Vanishing adaptation and convergence

As suggested in the previous section, vanishing adaptation, that is, ensuring that θ_i depends less and less on recently visited states of the chain {X_i}, might be a way of designing controlled MCMC algorithms which produce samples asymptotically distributed according to π. In this section we provide the basic arguments and principles that underpin the validity of controlled MCMC with vanishing adaptation. We however do not provide directly applicable technical conditions here that ensure the validity of such algorithms; more details can be found in Holden (1998), Atchadé and Rosenthal (2005), Andrieu and Moulines (2006), Roberts and Rosenthal (2006), Bai et al. (2008) and Atchadé and Fort (2008). The interest of dedicating some attention to this point here is twofold. First it provides useful guidelines as to what the desirable properties of a valid controlled MCMC algorithm should be, and hence helps design efficient algorithms. Secondly it points to some difficulties with the existing theory, which is not able to fully explain the observed stability properties of numerous controlled algorithms, a fact sometimes overlooked.

    3.1 Principle of the analysis

Existing approaches to prove that ergodicity might be preserved under vanishing adaptation all rely on the same principle, which we detail in this section. The differences between the various existing contributions lie primarily in the assumptions, which are discussed in the text. With the notation introduced in the previous section, we are interested in the behaviour of the difference

|E_{θ,x}(f(X_i)) - E_π(f(X))|

as i → ∞ for any f : X → R. Although general functions can be considered (Atchadé and Rosenthal 2005; Andrieu and Moulines 2006; Atchadé and Fort 2008), we will here assume for simplicity of exposition that |f| ≤ 1. The

study of this term is carried out by comparing the process of interest to a process which coincides with {X_k} up to some time k_i < i but becomes a time homogeneous Markov chain with frozen transition probability P_{θ_{k_i}} from this time instant onwards. (We hereafter use the following standard notation: P^k[f](x) = P^k f(x) for any f : X → R^{n_f} and x ∈ X, defined recursively as P^0 f(x) = f(x), Pf(x) := ∫_X P(x, dy) f(y) and P^{k+1} f(x) = P[P^k f](x) for k ≥ 1. In the finite discrete case this corresponds to considering powers P^k of the transition matrix P and right multiplying with a vector f.) Denoting P_{θ_{k_i}}^{i-k_i} f(X_{k_i}) the expectation of f after i - k_i iterations of the frozen time homogeneous Markov transition probability P_{θ_{k_i}} initialised with X_{k_i} at time k_i and conditional upon θ_0, X_0, X_1, ..., X_{k_i}, this translates into the fundamental decomposition

E_{θ,x}(f(X_i)) - E_π(f(X)) = E_{θ,x}[ P_{θ_{k_i}}^{i-k_i} f(X_{k_i}) - π(f) ]
                            + E_{θ,x}[ f(X_i) - P_{θ_{k_i}}^{i-k_i} f(X_{k_i}) ],  (6)

where the second term corresponds to the aforementioned comparison and the first term is a simple remainder term.

Perhaps not surprisingly the convergence to zero of the first term, provided that i - k_i → ∞ as i → ∞, depends on the ergodicity of the non-adaptive MCMC chain with fixed parameter θ, i.e. requires at least that for any θ ∈ Θ, x ∈ X, lim_{k→∞} |P_θ^k f(x) - π(f)| = 0. However since both θ_{k_i} and X_{k_i} are random and possibly time dependent, this type of simple convergence is not sufficient to ensure convergence of this term. One could suggest the following uniform convergence condition

lim_{k→∞} sup_{θ∈Θ, x∈X} |P_θ^k f(x) - E_π(f(X))| = 0,  (7)

which although mathematically convenient is unrealistic in most scenarios of interest. The first toy example of Sect. 2 provides us with such a simple counterexample. Indeed, at least intuitively, convergence of this Markov chain to equilibrium can be made arbitrarily slow for values of θ ∈ (0, 1) arbitrarily close to either 0 or 1. This negative property unfortunately carries over to more realistic scenarios. For example the normal symmetric random walk Metropolis algorithm described in Sect. 1 can in most situations of interest be made arbitrarily slow as the variance θ² is made arbitrarily small or large. This turns out to be a fundamental difficulty of the chicken-and-egg type in the study of the stability of such processes, which is sometimes overlooked. Indeed in order to ensure ergodicity, {θ_i} should stay away from poor values of the parameter θ, but proving the stability of {θ_i} might often require establishing the ergodicity of the chain {X_i}; see Andrieu and Moulines (2006) and Andrieu and Tadić (2007) where alternative conditions


are also suggested. We will come back to this point after examining the second term of the decomposition above. Note that locally uniform versions of such conditions (i.e. where Θ in (7) is replaced by some subsets K ⊂ Θ and the rate of convergence might be slower) are however satisfied by many algorithms; this property is exploited in Andrieu and Moulines (2006), Andrieu and Tadić (2007), Atchadé and Fort (2008) and Bai et al. (2008), although this is not explicit in the latter.

The second term in the decomposition can be analysed by interpolating the true process and its Markovian approximation using the following telescoping sum

E_{θ,x}(f(X_i)) - E_{θ,x}[ P_{θ_{k_i}}^{i-k_i} f(X_{k_i}) ]
  = ∑_{j=k_i}^{i-1} ( E_{θ,x}[ P_{θ_{k_i}}^{i-j-1} f(X_{j+1}) ] - E_{θ,x}[ P_{θ_{k_i}}^{i-j} f(X_j) ] ),

which can be easily understood as follows. Each term of the sum is the difference of the expectations of (a) a process that adapts up to time j + 1 > k_i and then freezes and becomes Markovian with transition probability P_{θ_{k_i}} given the history θ_0, X_0, X_1, ..., X_{j+1} (and hence θ_{k_i} = θ̄_{k_i}(θ_0, X_0, X_1, ..., X_{k_i})) between time j + 1 and time i, and (b) likewise for the second term, albeit between time j and i. Hence the two terms involved only differ in that at time j the first term updates the chain with θ_j while the second term uses θ_{k_i}, which can be concisely expressed as follows (thinking about the difference between two products of matrices in the discrete case might be helpful),

E_{θ,x}[ P_{θ_{k_i}}^{i-j-1} f(X_{j+1}) ] - E_{θ,x}[ P_{θ_{k_i}}^{i-j} f(X_j) ]
  = E_{θ,x}[ P_{θ_j} P_{θ_{k_i}}^{i-j-1} f(X_j) ] - E_{θ,x}[ P_{θ_{k_i}}^{i-j} f(X_j) ]
  = E_{θ,x}[ (P_{θ_j} - P_{θ_{k_i}}) P_{θ_{k_i}}^{i-j-1} f(X_j) ].

The role of vanishing adaptation should now be apparent. Provided that the transition probability P_θ is sufficiently smooth in θ and that the variations of {θ_i} vanish as i → +∞ (in some unspecified sense at this point), then we might expect

∑_{j=k_i}^{i-1} E_{θ,x}[ (P_{θ_j} - P_{θ_{k_i}}) P_{θ_{k_i}}^{i-j-1} f(X_j) ]

to vanish if the number of terms in this sum does not grow too rapidly. However as noticed when analysing the first term of the fundamental decomposition above, simple continuity cannot be expected to be sufficient in general since θ_{k_i}, θ_{k_i+1}, ..., θ_{i-1} are random and time dependent. By analogy with the analysis above one could assume some

form of uniform continuity in order to eliminate the variability of θ_j and θ_{k_i} in the expression above. More precisely, denoting for any δ > 0

Δ(δ) := sup_{|g|≤1} sup_{x∈X, {θ,θ′∈Θ: |θ-θ′|≤δ}} |P_θ g(x) - P_{θ′} g(x)|,

one could assume

lim_{δ→0} Δ(δ) = 0.

Provided that the sequence {θ_i} is such that its increments are bounded, i.e. such that |θ_i - θ_{i-1}| ≤ γ_i for a deterministic sequence {γ_i} ∈ [0, ∞)^N (which is possible since the updates of {θ_i} are chosen by the user), then the second term of (6) can be bounded by

∑_{j=k_i}^{i-1} Δ( ∑_{k=k_i+1}^{j} γ_k ),

which can usually be easily dealt with; e.g. when some uniform Lipschitz continuity is assumed and Δ(δ) = Cδ for some constant C > 0 (Andrieu and Moulines 2006). Unfortunately, although mathematically convenient, this condition is not satisfied in numerous situations of interest, due in particular to the required uniformity in θ, θ′ ∈ Θ. Other apparently weaker conditions have been suggested, but share the same practical underlying difficulties, such as

lim_{i→∞} sup_{|g|≤1} E_{θ,x}| P_{θ_i} g(X_i) - P_{θ_{i-1}} g(X_i) | = 0,

suggested in Benveniste et al. (1990, p. 236), and the slightly stronger condition of the type

lim_{i→∞} sup_{x∈X, |g|≤1} E_{θ,x}| P_{θ_i} g(x) - P_{θ_{i-1}} g(x) | = 0,

in Roberts and Rosenthal (2006).

    3.2 Discussion

The simple discussion above is interesting in two respects. On the one hand, using basic arguments, it points to the primary conditions under which one might expect controlled MCMC algorithms to be π-ergodic: expected ergodicity and continuity of the transition probabilities. On the other hand it also points to the difficulty of proposing verifiable conditions to ensure that the aforementioned primary conditions are satisfied. The problem stems from the fact that it is required to prove that the algorithm is not unlucky enough to tune θ to poor values, leading to possibly unstable algorithms. The uniform conditions suggested earlier circumvent this problem, since they suggest that there are no such arbitrarily bad values. This is unfortunately not the case for


numerous algorithms of practical interest. However it is often the case that such uniformity holds for subsets K ⊂ Θ. For example in the case of the first toy example of Sect. 2, the sets defined as K_ε := [ε, 1 - ε] for any ε ∈ (0, 1) are such that the Markov chains with parameters θ ∈ K_ε are geometrically convergent with a rate of at least |1 - 2ε|, independent of θ ∈ K_ε. A simple practical solution thus consists of identifying such subsets K and constraining {θ_i} to these sets by design. This naturally requires some understanding of the problem at hand, which might be difficult in practice, and does not reflect the fact that stability is observed in practice without the need to resort to such fixed truncation strategies. A general approach for adaptive truncation is developed and analysed in Andrieu et al. (2005) and Andrieu and Moulines (2006). It takes advantage of the fact that uniform ergodicity and continuity can be shown for families of subsets {K_i ⊂ Θ}, such that {K_i} is a covering of Θ with K_i ⊂ K_{i+1} for i ≥ 0. The strategy then consists of adapting the truncation of the algorithm on the fly in order to ensure that {θ_i} does not wander too fast towards inappropriate values in Θ or at its boundary, hence ensuring that ergodicity can kick in and stabilise the trajectories of {θ_i}, i.e. ensure that there is a random, but finite, k such that {θ_i} ⊂ K_k with probability 1. The procedure has the advantage of not requiring much knowledge about what constitutes good or bad values for θ, while allowing for the incorporation of prior information and ultimately ensuring stability. It can in fact be shown, under conditions satisfied by some classes of algorithms, that the number k of reprojections needed for stabilisation is a random variable with a probability distribution whose tails decay superexponentially. While this approach is general and comes with a general and applicable theory, it might be computationally wasteful in some situations and, more crucially, does not reflect the fact that numerous algorithms naturally present stability properties. Another possibility consists of considering mixtures of adaptive and non-adaptive MCMC proposal distributions, the non-adaptive components ensuring stability, e.g. (Roberts and Rosenthal 2007): again while this type of strategy generally ensures that the theory works, it poses the problem of the practical choice of the non-adaptive component, and might not always result in efficient strategies. In addition, as for the strategies discussed earlier, this type of approach fails to explain the observed behaviour of some adaptive algorithms.

    that the parameter stays away from its forbidden valueswith probability one (Andrieu and Tadic 2007). The ap-proach establishes a form of recurrence via the existence ofcomposite drift functions for the joint chain {i , Xi}, whichin turn ensures an infinite number of visits of {i} to somesets K for which the uniform properties above hold. Thiscan be shown to result in stability under fairly general con-ditions. In addition the approach provides one with some

    insight into what makes an algorithm stable or not, and sug-gests numerous ways of designing updates for {i} whichwill ensure stability and sometimes even accelerate conver-gence. Some examples are discussed in Sect. 4.2.2. In Saks-man and Vihola (2008) the authors address the same prob-lem by proving that provided that {i} is constrained not todrift too fast to bad values, then -ergodicity of{Xi} is pre-served. The underlying ideas are related to a general strategyof stabilisation developed for the stochastic approximationprocedure, see Andradttir (1995).

Finally we end this section with a practical implication of the developments above, related to the rate of convergence of controlled MCMC algorithms. Assume for example the existence of K ⊂ Θ, C ∈ (0, ∞), ρ ∈ (0, 1) and {γ_i} ∈ [0, ∞)^N such that for all i ≥ 1, θ, x ∈ K × X and any f : X → [-1, 1],

|P_θ^i f(x) - E_π(f)| ≤ Cρ^i,  (8)

and for any θ, θ′ ∈ K, x ∈ X and any f : X → [-1, 1],

|P_θ f(x) - P_{θ′} f(x)| ≤ C|θ - θ′|,  (9)

and such that for all i ≥ 1, |θ_i - θ_{i-1}| ≤ γ_i, where {γ_i} satisfies a realistic assumption of slow decay (Andrieu and Moulines 2006) (satisfied for example for γ_i = 1/i^α, α > 0). These conditions are far from restrictive and can be shown to hold for the symmetric random walk Metropolis (SRWM) for some distributions π, the independent Metropolis–Hastings (IMH) algorithm, mixtures of such transitions etc. (Andrieu and Moulines 2006). Less restrictive conditions are possible, but lead to slower rates of convergence. Then,

using a more precise form of the decomposition in (6) (Andrieu and Moulines 2006, Proposition 4), one can show that there exists a constant C ∈ (0, ∞) such that for all i ≥ 1 and |f| ≤ 1,

| E_{θ,x}[ (f(X_i) - E_π(f)) I{τ ≥ i} ] | ≤ Cγ_i,  (10)

where τ is the first time at which {θ_i} leaves K (which can be infinity). The result simply tells us that while the adapted parameter does not leave K, convergence towards π occurs at a rate of at least {γ_i}, and as pointed out in Andrieu and Moulines (2006) does not require convergence of {θ_i}. This might appear to be a negative result. However it can be proved (Andrieu 2004; Andrieu and Moulines 2006) that there exist constants A(ρ, K) and B(ρ, K) such that for any N ≥ 1,

E_{θ,x}[ ( (1/N) ∑_{i=1}^N f(X_i) - E_π(f) )² I{τ ≥ N} ] ≤ A(ρ,K)/N + (B(ρ,K)/N) ∑_{k=1}^N γ_k.  (11)


The first term corresponds to the Monte Carlo fluctuations while the second term is the price to pay for adaptation. Assuming that γ_i = i^{-α} for α ∈ (0, 1), then

(1/N) ∑_{k=1}^N γ_k ≤ (1/(1 - α)) N^{-α},

which suggests no loss in terms of rate of convergence for α ≥ 1/2. More general and precise results can be found in Andrieu and Moulines (2006, Proposition 6), including a central limit theorem (Theorem 9) which shows the asymptotic optimality of adaptive MCMC algorithms when convergence of {θ_i} is ensured. Weaker rates of convergence than (8) lead to a significant loss of rate of convergence, which is also observed in practice.
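The last bound is a simple numerical fact, easily checked (the values of α and N below are arbitrary):

```python
import numpy as np

N = 100000
for alpha in (0.25, 0.5, 0.75):
    gammas = np.arange(1, N + 1, dtype=float) ** -alpha  # gamma_i = i^(-alpha)
    lhs = gammas.sum() / N                               # (1/N) sum gamma_k
    rhs = N ** -alpha / (1 - alpha)                      # the stated bound
    print(alpha, lhs, rhs, lhs <= rhs)                   # bound holds
```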

4 Vanishing adaptation: a framework for consistent adaptive MCMC algorithms

In the previous section we have given arguments that suggest that vanishing adaptation for MCMC algorithms might lead to algorithms from which expectations with respect to a distribution of interest π can be consistently estimated. However neither criteria nor ways of updating the parameter θ were described. The main aim of this section is to point out the central role played by stochastic approximation and the Robbins–Monro recursion (Robbins and Monro 1951) in the context of vanishing or non-vanishing adaptation. While a complete treatment of the theoretical aspects of such controlled MCMC algorithms is far beyond the scope of this review, our main goal is to describe the principles underpinning this approach that have a practical impact and to show the intricate link between criteria and algorithms. Indeed, as we shall see, while the stochastic approximation framework can be used in order to optimise a given criterion, it can also help understand the expected behaviour of an updating algorithm proposed without resorting to grand theory, but by simply resorting to common sense.

4.1 Criteria to optimise MCMC algorithms and a general form

Since our main aim is that of optimising MCMC transition probabilities, the first step towards the implementation of such a procedure naturally consists of defining what is meant by optimality, or suboptimality. This can be achieved through the definition of a cost function, which could for example express some measure of the statistical performance of the Markov chain in its stationary regime, e.g. favour negative correlation between X_i and X_{i+l} for some lag l and i = 0, 1, .... In what follows we will use the convention that an optimum value θ* corresponds to a root of the equation h(θ) = 0 for some function h(θ) closely related to the aforementioned cost function.

Since the main use of MCMC algorithms is to compute averages of the form Î_N(f) given in (1) in order to estimate E_π(f(X)), in situations where a central limit theorem holds, i.e. in scenarios such that for any θ ∈ Θ,

√N ( Î_N(f) - E_π(f(X)) ) →_D N(0, σ_θ²(f)),

it might seem natural to attempt to optimise the constant σ_θ²(f). This however poses several problems. The first problem is computational. Indeed, for a given θ ∈ Θ and f : X → [-1, 1] (for simplicity) and when it exists, σ_θ²(f) can be shown to have the following expression

σ_θ²(f) = E_π(f̄²(X_0)) + 2 ∑_{k=1}^{+∞} E_{π,θ}(f̄(X_0) f̄(X_k))
        = E_π(f̄²(X_0)) + 2 E_{π,θ}( ∑_{k=1}^{+∞} f̄(X_0) f̄(X_k) ),  (12)

with f̄(x) := f(x) - E_π(f(X)) and E_{π,θ} the expectation associated to the Markov chain with transition probability P_θ and such that X_0 ∼ π. This quantity is difficult to estimate and optimise (since for all θ ∈ Θ it is the expectation of a non-trivial function with respect to an infinite set of random variables), although some solutions exist (Vladislav Tadić, personal communication; see also Richard Everitt's Ph.D. thesis) and truncation of the infinite sum is also possible (Andrieu and Robert 2001; Pasarica and Gelman 2003), allowing for example for the recursive estimation of the gradient of σ_θ²(f) with respect to θ. In Pasarica and Gelman (2003), maximising the expected mean square jump distance is suggested, i.e. here in the scalar case and with X̄_i = X_i - E_π(X) for i = 0, 1,

E_{π,θ}( (X_0 - X_1)² ) = E_{π,θ}( (X̄_0 - X̄_1)² ) = 2( E_π(X̄²) - E_{π,θ}(X̄_0 X̄_1) ),  (13)

which amounts to minimising the term corresponding to k = 1 in (12) for the function f(x) = x for all x ∈ X. Another difficulty is that the criterion depends on a specific function f, and optimality for a function f might not result in optimality for another function g. Finally it can be argued that although optimising this quantity is an asymptotically desirable criterion, at least for a given function, this criterion can in some scenarios lead to MCMC samplers that are slow to reach equilibrium (Besag and Green 1993).
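Despite these caveats, the expected squared jump distance (13) is straightforward to estimate from chain output; a minimal sketch:

```python
import numpy as np

def esjd(xs):
    # Monte Carlo estimate of the expected squared jump distance (13)
    # from a realised scalar chain x_0, ..., x_N; rejected moves count
    # as jumps of size zero.
    xs = np.asarray(xs)
    return np.mean((xs[1:] - xs[:-1]) ** 2)
```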

Despite the difficulties pointed out earlier, the criterion above should not be totally discarded; instead of trying to optimise it directly and perfectly, suboptimal optimisation through proxies that are amenable to simple computation and efficient estimation might be preferable.


Such a simple criterion, which is at least completely supported by theory in some scenarios (Roberts et al. 1997; Sherlock and Roberts 2006; Roberts and Rosenthal 1998; Bédard 2006) and proves to be more universal in practice, is the expected acceptance probability of the MH algorithm for random walk Metropolis algorithms or Langevin based MH updates. The expected acceptance probability is more formally defined as the jump rate of an MH update in the stationary regime

ᾱ(θ) := ∫_{X²} min{ 1, [π(y) q_θ(y, x)] / [π(x) q_θ(x, y)] } π(x) q_θ(x, y) dx dy
      = E_{π⊗q_θ}( min{ 1, [π(Y) q_θ(Y, X)] / [π(X) q_θ(X, Y)] } ).  (14)

This criterion has several advantages. The first one is computational, since it is an expectation of a much simpler function than σ_θ²(f). In such cases it has the double advantage of being independent of any function f and of providing a good compromise for σ_θ²(f) for all functions f. A less obvious advantage of this criterion, which we illustrate later on in Sect. 5, is that where some form of smoothness of the target density is present it can be beneficial in the initial stages of the algorithm in order to ensure that the adaptive algorithm actually starts exploring the target distribution in order to learn some of its features.

The aforementioned theoretical results tell us that optimality of σ_θ²(f) (in terms of θ) or proxy quantities related to this quantity (truncation, asymptotics in the dimension) is reached for a specific value of the expected acceptance probability ᾱ(θ), denoted ᾱ* hereafter: 0.234 for the random walk Metropolis algorithm for some specific target distributions and likewise 0.574 for Langevin diffusion based MH updates (Roberts and Rosenthal 1998).

In some situations, Gelman et al. (1995) have shown that the optimal covariance matrix for a multivariate random walk Metropolis algorithm with proposal N(0, Σ) is Σ := (2.38²/n_x) Σ_π, where Σ_π is the covariance matrix of the target distribution

Σ_π = E_π(XX^T) - E_π(X) E_π^T(X).

The covariance Σ_π is unknown in general situations and requires the numerical computation of the pair

( E_π(X), E_π(XX^T) ) = E_π(X, XX^T).  (15)
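In practice the pair (15) can be estimated on-line with running averages, from which the scaled proposal covariance of Gelman et al. (1995) follows; a minimal sketch (the stochastic approximation form of the update and the stepsize anticipate Sect. 4.2 and are our choice):

```python
import numpy as np

def update_moments(mu, M, x, gamma):
    # One stochastic-approximation update of the running estimates of
    # E_pi(X) and E_pi(X X^T) in (15), with stepsize gamma.
    mu = mu + gamma * (x - mu)
    M = M + gamma * (np.outer(x, x) - M)
    return mu, M

def proposal_cov(mu, M, n_x, eps=1e-6):
    # Scaled covariance (2.38^2 / n_x) * Sigma_pi, with a small ridge
    # (our addition) to keep the matrix positive definite early on.
    sigma = M - np.outer(mu, mu)
    return (2.38 ** 2 / n_x) * (sigma + eps * np.eye(n_x))
```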

As pointed out in Andrieu and Moulines (2006, Sect. 7), this moment matching can also be interpreted as minimising the Kullback–Leibler divergence

∫_X π(x) log[ π(x) / N(x; μ, Σ) ] dx = E_π( log[ π(X) / N(X; μ, Σ) ] ),

which suggests generalisations consisting of minimising

E_π( log[ π(X) / q_θ(X) ] ),  (16)

in general, for some parametric family of probability distributions {q_θ, θ ∈ Θ}. Section 7 of Andrieu and Moulines (2006) is dedicated to the development of an on-line EM algorithm and a theoretical analysis of an adaptive independent MH algorithm where q_θ is a general mixture of distributions belonging to the exponential family. We will come back to this strategy in Sect. 5.2.2 where we show that this procedure can also be used in order to cluster the state-space X and hence define locally adaptive algorithms.

Before turning to ways of optimising criteria of the type described above, we first detail a fundamental fact shared by all the criteria described above and others, which will allow us to describe a general procedure for the control of MCMC algorithms. The shared characteristic is naturally that all the criteria developed here take the form of an expectation with respect to some probability distribution dependent on θ ∈ Θ. In fact, as we shall see, optimality can often be formulated as the problem of finding the root(s) of an equation of the type

h(θ) := E_{π,θ}( H(θ, X_0, Y_1, X_1, ...) ) = 0  (17)

(remember that {Y_i} is the sequence of proposed samples) for some function H : Θ × X^N → R^{n_h}, for some n_h ∈ N, with in many situations n_h = n_θ (but not always). The case of the coerced acceptance probability corresponds to

H(θ, X_0, Y_1, X_1, ...) = min{ 1, [π(Y_1) q_θ(Y_1, X_0)] / [π(X_0) q_θ(X_0, Y_1)] } - ᾱ*,

which according to (14) results in the problem of finding the zero(s) of h(θ) = ᾱ(θ) - ᾱ*. The moment matching situation, with θ = (μ, Σ) collecting the pair of moments in (15), corresponds to

H(θ, X) = (X, XX^T) - (μ, Σ),

for which it is sought to find the zeros of h(θ) = (E_π(X), E_π(XX^T)) - (μ, Σ), i.e. simply θ* = (E_π(X), E_π(XX^T)) (naturally assuming that the two quantities exist). It might not be clear at this point how optimising the remaining criteria above might amount to finding the zeros of a function of the form (17). However, under smoothness assumptions, it is possible to consider the gradients of those criteria (note however that one might consider other methods than gradient based approaches in order to perform optimisation). In the case of the Kullback–Leibler divergence, and assuming that differentiation and integration can be swapped, the criterion can be expressed as

∇_θ E_π( log[ π(X) / q_θ(X) ] ) = 0,  (18)


that is

H(θ, X) = ∇_θ log[ π(X) / q_θ(X) ] = -∇_θ log q_θ(X),

and in the more subtle case of the first order autocovariance minimisation one can invoke a standard score function argument and find the zeros of (in the scalar case for simplicity)

∇_θ E_{π,θ}(X̄_0 X̄_1) = E_{π,θ}( [∇_θ P_θ(X_0, X_1) / P_θ(X_0, X_1)] X̄_0 X̄_1 ) = 0.

Similarly, under smoothness assumptions, one can differentiate σ_θ²(f) and obtain a theoretical expression for ∇_θ σ_θ²(f) of the form (17). Note that when q_θ is a mixture of distributions belonging to the exponential family, then it is possible to find the zeros (assumed here to exist) of (18) using an on-line EM algorithm (Andrieu and Moulines 2006).
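As a concrete instance of (18), the sketch below writes out H(θ, X) = -∇_θ log q_θ(X) for a univariate Gaussian family q_θ = N(μ, e^{2s}) with θ = (μ, s); this parametrisation is our illustrative choice, not the mixture family of Andrieu and Moulines (2006).

```python
import numpy as np

def H_kl(theta, x):
    # H(theta, x) = -grad_theta log q_theta(x) for q_theta = N(mu, exp(2s)),
    # theta = (mu, s): the gradient of the Kullback-Leibler criterion (16)
    # evaluated at a single sample x ~ pi.
    mu, s = theta
    var = np.exp(2.0 * s)
    d_mu = -(x - mu) / var              # -d/dmu log q
    d_s = 1.0 - (x - mu) ** 2 / var     # -d/ds  log q
    return np.array([d_mu, d_s])

# A Robbins-Monro recursion then reads
# theta_{i+1} = theta_i - gamma_{i+1} * H_kl(theta_i, X_{i+1}),
# with X_{i+1} produced by the MCMC chain targeting pi.
```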

Note that all the criteria described above are steady-state criteria and explicitly involve π, but that other criteria, such as the minimisation of return times to a given set C ⊂ X (Andrieu and Doucet 2003), namely of

E_{ν,θ}( ∑_{i=1}^{τ_C} I{X_i ∉ C} ),

with ν a probability measure concentrated on C and τ_C the first return time of {X_i} to C, do not enter this category. Such criteria seem however difficult to optimise in practice and we do not pursue this.

    4.2 The stochastic approximation framework

We dedicate here a section to the Robbins–Monro update which, although not the only possibility to optimise criteria of the type (17), appears naturally in most known adaptive algorithms and provides us with a nice framework naturally connected to the literature on controlled Markov chains (Borkar 1990) in the engineering literature. The reason for its ubiquity stems from the trivial identity: θ_{i+1} = θ_i + (θ_{i+1} - θ_i). This turns out to be a particularly fruitful point of view in the present context. More precisely, it is well suited to sequential updating of {θ_i} and makes explicit the central role played by the updating rule defining the increments {θ_{i+1} - θ_i}. In light of our earlier discussion, {θ_{i+1} - θ_i} should be vanishing, and when convergence is of interest their cumulative sums should also vanish (in some probabilistic sense) in the vicinity of optimal values θ*. Naturally, although convenient, this general framework should not prevent us from thinking outside of the box.

    4.2.1 Motivating example

Consider the case where X = R and a symmetric random walk Metropolis (SRWM) algorithm with normal increment distribution N(z; 0, exp(θ)), resulting in a transition probability P_θ^{NSRW}. We know that in some situations (Roberts et al. 1997) the expected acceptance probability should be in a range close to ᾱ* = 0.44. We will assume for simplicity that ᾱ(θ) in (14) is a non-increasing function of θ (which is often observed to be true, but difficult to check rigorously in practice, and can furthermore be shown not to hold in some situations, Hastie 2005). In such situations one can suggest the following intuitive algorithm. For an estimate θ_i obtained after i × L iterations of the controlled MCMC algorithm, one can simulate L iterations of the transition probability P_{θ_i}^{NSRW} and estimate the expected acceptance probability for such a value of the parameter from the i-th block of samples {X_{iL+k}, Y_{iL+k}, k = 1, ..., L} (initialised with X_{iL}),

α̂_i = (1/L) ∑_{k=1}^L min{ 1, π(Y_{iL+k}) / π(X_{iL+k-1}) },

and update θ_i according to the following rule, motivated by our monotonicity assumption on ᾱ(θ): if α̂_i > ᾱ* then θ_i is probably (α̂_i is only an estimator) too small and should be increased, while if α̂_i < ᾱ* then θ_i should be decreased. There is some flexibility concerning the amount by which θ_i should be altered, which depends either on the criterion one wishes to optimise or on more heuristic considerations. However, as detailed later, this choice will have a direct influence on the criterion effectively optimised and, in light of the discussion of Sect. 3 concerning diminishing adaptation, this amount of change should diminish as i → ∞ in order to ensure either that π-ergodicity of {X_i} holds or that approximate convergence of {θ_i} is ensured. The intuitive description given above suggests the following updating rules (see also Gilks et al. 1998, Andrieu and Robert 2001, Atchadé and Rosenthal 2005 for similar rules)

θ_{i+1} = θ_i + γ_{i+1} [ I{α̂_i - ᾱ* > 0} - I{α̂_i - ᾱ* ≤ 0} ]  (19)

or

θ_{i+1} = θ_i + γ_{i+1} (α̂_i - ᾱ*),  (20)

where {γ_i} ∈ (0, +∞)^N is a sequence of possibly stochastic stepsizes which ensures that the variations of {θ_i} vanish. The standard approach consists of choosing the sequence {γ_i} deterministic and non-increasing, but it is also possible to choose {γ_i} random, e.g. such that it takes values in {δ, 0} for some δ > 0 and such that P(γ_i = δ) = p_i, where {p_i} ∈ [0, 1]^N is a deterministic and non-increasing sequence (Roberts and Rosenthal 2007), although it is not always clear what the advantage of introducing such an additional level of randomness is. A more interesting choice in practice consists of choosing {γ_i} adaptively, see Sect. 4.2.2,


but for simplicity of exposition we focus here on the deterministic case.
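For concreteness, here is a minimal sketch of the N-SRWM with acceptance rate coerced to ᾱ* = 0.44 via rule (20) with L = 1; the standard normal target, the stepsize γ_i = i^{-0.6} and the run length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(x):
    # Illustrative target: standard normal, up to a constant.
    return -0.5 * x ** 2

def adaptive_srwm(n_iter, alpha_star=0.44, x=0.0, theta=0.0):
    # N-SRWM with increment variance exp(theta), adapted by rule (20)
    # with L = 1: each iteration uses the realised acceptance probability.
    xs, thetas = [], []
    for i in range(1, n_iter + 1):
        y = x + np.exp(0.5 * theta) * rng.normal()       # proposal
        alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))  # MH ratio
        if rng.uniform() < alpha:
            x = y
        theta += i ** -0.6 * (alpha - alpha_star)        # rule (20)
        xs.append(x)
        thetas.append(theta)
    return np.array(xs), np.array(thetas)

xs, thetas = adaptive_srwm(50000)
print(np.exp(thetas[-1]))  # adapted proposal variance
```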

We will come back to the first updating rule later on, and now discuss the second rule which, as we shall see, aims to set ᾱ(θ) in (14) equal to ᾱ*. Notice first that if L → ∞ and the underlying Markov chain is ergodic, then α̂_i → ᾱ(θ_i) and the recursion becomes deterministic,

θ_{i+1} = θ_i + γ_{i+1} ( ᾱ(θ_i) - ᾱ* ),  (21)

and is akin to a standard gradient algorithm, which will converge under standard conditions. Motivated by this asymptotic result, one can rewrite the finite-L recursion (20) as follows

θ_{i+1} = θ_i + γ_{i+1} ( ᾱ(θ_i) - ᾱ* ) + γ_{i+1} ( α̂_i - ᾱ(θ_i) ).  (22)

Assume for simplicity that there exists θ* ∈ int(Θ), the interior of Θ, such that ᾱ(θ*) = ᾱ*, and that α̂_i is unbiased, at least as i → ∞. Then, since |θ_{i+1} - θ_i| ≤ γ_{i+1} → 0 as i → ∞, and provided that ᾱ(θ) is smooth in terms of θ, the sequence of noise terms {α̂_i - ᾱ(θ_i)} is expected to average out to zero (i.e. statistically, positive increments are compensated by negative increments) and we expect the trajectory of (22) to oscillate about the trajectory of (21), with the oscillations vanishing as i → ∞. This is the main idea at the core of the systematic analysis of such recursions which, as illustrated below, has an interest even for practitioners. Indeed, by identifying the underlying deterministic recursion which is approximated in practice, it allows one to understand and predict the behaviour of algorithms, even in situations where the recursion is heuristically designed and the underlying criterion not explicit. Equation (20) suggests that stationary points of the recursion should be such that ᾱ(θ) = ᾱ*. The stationary points of the alternative recursion (19) are given in the next subsection.

In general most of the recursions of interest can be recast as follows,

θ_{i+1} = θ_i + γ_{i+1} H_{i+1}(θ_i, X_0, ..., Y_i, X_i, Y_{i+1}, X_{i+1}),  (23)

where H_{i+1}(θ, X_0, ..., Y_i, X_i, Y_{i+1}, X_{i+1}) takes its values in R^{n_θ}. Typically in practice {H_{i+1}} is a time invariant sequence of mappings which in effect only depends on a fixed and finite number of arguments through time invariant subsets of {Y_i, X_i} (e.g. the last L of them at iteration i, as above). For simplicity we will denote this mapping H and include all the variables θ_i, X_0, ..., Y_i, X_i, Y_{i+1}, X_{i+1} as arguments, although the dependence will effectively be on a subgroup. Considering sequences {H_i(θ, X_0, ..., Y_i, X_i, Y_{i+1}, X_{i+1})} with a varying number of arguments is possible (and needed when trying to optimise (12) directly), but at the expense of additional notation and assumptions.

    4.2.2 Why bother with stochastic approximation?

In this subsection we point to numerous reasons why the standard framework of stochastic approximation can be useful in order to think about controlled MCMC algorithms: as we shall see, motivations range from theoretical to practical or implementational, and might help shed some light on possibly heuristically developed strategies. Again, although this framework is very useful and allows for a systematic approach to the development and understanding of controlled MCMC algorithms, and despite the fact that this framework encompasses most known procedures, it should however not prevent us from thinking differently.

    A standardized framework for programming and analysis

    Apart from the fact that the standard form (23) allows forsystematic ways of coding the recursions, in particular thecreation of objects, the approach allows for an understand-ing of the expected behaviour of the recursion using sim-ple mathematical arguments as well as the development of awealth of very useful variations, made possible by the under-standing of the fundamental underlying nature of the recur-sions. As suggested above with a simple example (21)(22)the recursion (23) can always be rewritten as

    i+1 = i + i+1h (i ) + i+1i+1, (24)

where h(θ) is the expectation in steady state, for a fixed θ, of H(θ, X_0, …, Y_i, X_i, Y_{i+1}, X_{i+1}), i.e.

h(θ) := E_θ(H(θ, X_0, …, Y_i, X_i, Y_{i+1}, X_{i+1}))

and ξ_{i+1} := H(θ_i, X_0, …, Y_i, X_i, Y_{i+1}, X_{i+1}) − h(θ_i) is usually referred to as the noise. The recursion (24) can therefore be thought of as being a noisy gradient algorithm. Intuitively, if we rearrange the terms in (24)

(θ_{i+1} − θ_i) / γ_{i+1} = h(θ_i) + ξ_{i+1},

we understand that provided that the noise increments ξ_i cancel out on average, then a properly rescaled continuous interpolation of the recursion θ_0, θ_1, … should behave more or less like the solutions θ(t) of the ordinary differential equation

θ̇(t) = h(θ(t)),   (25)

whose stationary points are precisely such that h(θ) = 0. The general theory of stochastic approximation consists of establishing that the stationary points of (24) are related to the stationary points of (25) and that convergence occurs provided that some conditions concerning {γ_i}, h(θ) and {ξ_i} are satisfied. While this general theory is rather involved, it nevertheless provides us with a useful recipe to


try to predict and understand some heuristically developed algorithms. For example it is not clear what criterion is actually optimised when using the updating rule (19). However the mean field approach described above can be used to compute

h(θ) = E_θ(H(θ, X_0, …, Y_i, X_i, Y_{i+1}, X_{i+1}))
     = E_θ( I{α̂ − τ* > 0} − I{α̂ − τ* ≤ 0} )
     = P_θ(α̂ − τ* > 0) − P_θ(α̂ − τ* ≤ 0).

Its zeros (the possible stationary points of the recursion) are such that P_θ(α̂ − τ* > 0) = P_θ(α̂ − τ* ≤ 0) = 1/2, i.e. the stationary points are such that τ* is the median of the distribution of α̂ in steady-state, which seems reasonable when this median is not too different from the mean α(θ) given our initial objective. In addition this straightforward analysis also tells us that the algorithm will have the desired gradient like behaviour when P_θ(α̂ − τ* > 0) is a non-increasing function of θ. Other examples of the usefulness of the framework to design and understand such recursions are given later in Sect. 5, in particular Sect. 5.2.2.

In addition to allowing for an easy characterisation of possible stationary points of the recursion (and hence of the ideal optimal values θ*), the decomposition (24) points to the role played by the deterministic quantity h(θ) to ensure that the sequence {θ_i} actually drifts towards optimal values θ*, which is the least one can ask from such a recursion, and to the fact that the noise sequence {ξ_i} should also average out to zero for convergence purposes. This latter point is in general very much related to the ergodicity properties of {X_i}, which justifies the study of ergodicity even in situations where it is only planned to use the optimised MCMC algorithm with a fixed and suboptimal parameter obtained after optimisation. This in turn points to the intrinsic difficulty of ensuring and proving such ergodicity properties before {θ_i} wanders towards bad values, as explained in Sect. 2. Recent progress in Andrieu and Tadic (2007), relying on precise estimates of the dependence in θ of standard drift functions for the analysis of Markov chains, allows one to establish that {θ_i} stays away from such bad values, ensuring in turn ergodicity and a drift of {θ_i} towards the set of values of interest (the zeroes of h). Similar results are obtained in Saksman and Vihola (2008), albeit using totally different techniques.

Finally note that the developments above stay valid in the situation where {γ_i} is set to a constant, say γ. In such situations it is possible to study the distribution of θ_i around a deterministic trajectory underlying the ordinary differential equation, but it should be pointed out that in such situations {X_i} is not π-stationary, and one can at most hope for π_γ-stationarity for a probability distribution π_γ such that π_γ → π in a certain sense as γ → 0.

The connection between stochastic approximation and the work of Haario et al. (2001) and the underlying generality was realised in Andrieu and Robert (2001), although it is mentioned in particular cases in Geyer and Thompson (1995) and Ramponi (1998), the latter reference being probably the first rigorous analysis of the stability and convergence properties of a particular implementation of controlled MCMC for tempering type algorithms.

A principled stopping rule As pointed out earlier, and although ergodicity is intrinsically related to the sequence {θ_i} approaching the zeroes of h(θ) and hence taking good values, one might be more confident in using samples produced by a standard MCMC algorithm that would use an optimal or suboptimal value of θ. This naturally raises the question of the stopping rule to be used. In the ubiquitous case of the Robbins-Monro updating rule, and given the clear interpretation in terms of the root finding of h(θ), one can suggest monitoring the average of the field

(1/n) ∑_{i=1}^n H(θ_i, X_{i+1})

and stop, for example, when its magnitude is less than a preset threshold ε for a number m of consecutive iterations. More principled statistical rules relying on the CLT can also be suggested, but we do not expand on this here.
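A possible implementation of such a rule (the helper name and array layout are ours; ε and m are the threshold and window mentioned above) is sketched below in Python; it simply tracks the running average of the observed fields.

    import numpy as np

    def should_stop(fields, eps, m):
        # fields[i] holds the observed field H(theta_i, X_{i+1}); rows are iterations.
        H = np.asarray(fields, dtype=float)
        if H.ndim == 1:
            H = H[:, None]
        n = H.shape[0]
        if n < m:
            return False
        means = np.cumsum(H, axis=0) / np.arange(1, n + 1)[:, None]
        # stop once |(1/n) sum_i H(theta_i, X_{i+1})| < eps for m consecutive iterations
        return bool((np.linalg.norm(means[-m:], axis=1) < eps).all())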

Boundedness and convergence The dependence on θ of the ergodicity properties of P_θ can lead to some difficulties in practice. Indeed these ergodicity properties are rarely uniform in θ ∈ Θ and tend to degrade substantially for some values of θ, typically on the boundary of Θ. For example for the toy example of Sect. 2, both values θ ∈ {0, 1} are problematic: for θ = 0 aperiodicity is lost whereas for θ = 1 irreducibility is lost. This can result in important problems in practice since π-ergodicity can be lost, as pointed out in Sect. 2 through the aforementioned toy example, when the sequence {θ_i} converges to such values too quickly. In fact, as pointed out to us by Y.F. Atchadé, an example in Winkler (2003) shows that even in the situation where θ_i(1) = θ_i(2) = 1 − 1/i, the sequence {n^{−1} ∑_{i=1}^n X_i − 3/2} does not vanish (in the mean square sense) as n → ∞. This problem of possible loss of ergodicity of P_θ and its implications for controlled Markov chains has long been identified, but is often ignored in the current MCMC related literature. For example a normal symmetric random walk Metropolis (N-SRWM) algorithm loses ergodicity as its variance (or covariance matrix) becomes either too large or too small, and an algorithm with poor ergodicity properties does not learn features of the target distribution π. In the case of a random scan MH within Gibbs algorithm as given in (2), it is possible to progressively lose irreducibility whenever a weight drifts towards 0. Several cures are possible. The first and obvious one consists of truncating Θ in order to ensure the existence of some uniform ergodicity properties of the family


of transitions {P_θ}. While this presumes that one knows by how much one can truncate Θ without affecting the ergodicity properties of {P_θ} significantly, this is not a completely satisfactory solution since stability is actually observed in numerous situations.

In Andrieu and Tadic (2007), using explicit dependence of the parameters of well known drift conditions for MCMC algorithms on the tuning parameter θ, general conditions on the transition probability P_θ and the updating function H(θ, x) that ensure boundedness of {θ_i} are derived. As a result π-ergodicity of {X_i} and convergence to optimal or suboptimal values of θ are automatically satisfied, without the need to resort to fixed or adaptive truncations for example. One aspect of interest of the results is that they suggest some ways of designing fully adaptive and stable algorithms.

For example, by noting that the zeroes of h(θ) are also the zeroes of h(θ)/(1 + |θ|), one can modify the standard recursion in order to stabilise the update, resulting in the alternative updating rule θ_{i+1} = θ_i + γ_{i+1} H(θ_i, X_{i+1})/(1 + |θ_i|).

One can also add regularisation terms to the recursion. For example, assuming that we learn optimal weights for a mixture of transition probabilities as in (2), the recursion

w^k_{i+1} = w^k_i + γ_{i+1} H_k(w_i, X_{i+1})

(with w_i = (w^1_i, w^2_i, …, w^n_i)) can be for example modified to

w^k_{i+1} = w^k_i + γ_{i+1} H_k(w_i, X_{i+1}) + γ^{1+β}_{i+1} ( (κ + w^k_i) / ∑_{j=1}^n (κ + w^j_i) − w^k_i )

for some κ, β > 0. Note that since the sum over k of the fields is 0, the weights still sum to 1 after the update, and also that due to the boundedness of the additional term it vanishes as i → ∞.
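The short Python sketch below illustrates this update (the constant names κ and β follow our reconstruction of the display above, and the numerical values are arbitrary): both increments sum to zero, so the weights remain on the simplex while being nudged away from 0.

    import numpy as np

    def regularised_weight_step(w, H, gamma, beta=0.5, kappa=0.01):
        # w sums to 1 and the fields H sum to 0, so both increments below
        # sum to zero and the updated weights still sum to 1; the extra
        # regularisation term is bounded and vanishes as gamma -> 0 since
        # it is scaled by gamma**(1 + beta).
        w = w + gamma * np.asarray(H)
        w = w + gamma ** (1 + beta) * ((kappa + w) / np.sum(kappa + w) - w)
        return w

    w = np.array([0.5, 0.3, 0.2])
    print(regularised_weight_step(w, H=[0.02, -0.01, -0.01], gamma=0.1).sum())  # 1.0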

Finally in Andrieu et al. (2005) and Andrieu and Moulines (2006), following Chen et al. (1988), an algorithm with adaptive truncation boundaries is suggested and a general theory developed ensuring both boundedness and convergence of {θ_i}. Although requiring an intricate theory, the conditions under which boundedness and convergence hold cover a vast number of situations, beyond those treated in Andrieu and Tadic (2007). In Saksman and Vihola (2008) a different approach to prove stability is used, which consists of proving that provided that {θ_i} does not drift too fast to bad values, then the algorithm preserves ergodicity. In fact the analysis performed by the authors can be directly used to study the general stabilisation strategy of Andradóttir (1995) (see also references therein) for stochastic approximation.

Finally, under more restrictive conditions, detailed in Benveniste et al. (1990) and Andrieu and Atchadé (2007, Theorem 3.1), which include the uniqueness of θ* such that h(θ*) = 0 and conditions (8)–(9) for K ⊂ Θ, it is possible to show that for a deterministic sequence {γ_i}, there exists a finite constant C such that for all i ≥ 1,

E( |θ_i − θ*|² I{τ ≥ i} ) ≤ C γ_i,

where τ is the first exit time from K, meaning that while θ_i remains in K (where locally uniform conditions of the type (8)–(9) hold), the rate of convergence towards θ* is given by {γ_i}.

Automatic choice of the stepsizes The stochastic approximation procedure requires the choice of a stepsize sequence {γ_i}. A standard choice consists of choosing a deterministic sequence satisfying

∑_{i=1}^∞ γ_i = ∞ and ∑_{i=1}^∞ γ_i^{1+λ} < ∞

for some λ > 0. The former condition somehow ensures that any point of Θ can eventually be reached, while the second condition ensures that the noise is contained and does not prevent convergence. Such conditions are satisfied by sequences of the type γ_i = C/i^α for α ∈ ((1 + λ)^{−1}, 1]. We tend in practice to favour values of α closer to the lower bound in order to increase convergence of the algorithm towards a neighbourhood of θ*. This is at the expense of an increased variance of {θ_i} around θ*, however.

A very attractive approach which can be useful in practice, and for which some theory is available, consists of adapting {γ_i} in light of the current realisation of the algorithm; this proves very useful in some situations, see Andrieu and Jasra (2008). The technique was first described in Kesten (1958) and relies on the remark that, for example, an alternating sign for {α̂_i − τ*} in (22) is an indication that {θ_i} is oscillating around (a) solution(s), whereas a constant sign suggests that {θ_i} is, roughly speaking, still far from the solution(s). In the former case the stepsize should be decreased, whereas in the latter it should, at least, be kept constant. More precisely consider a function γ: [0, +∞) → [0, +∞). The standard scenario corresponding to a predetermined deterministic schedule consists of

taking {γ_i = γ(i)}. The strategy suggested by Kesten (1958), and further generalised to the multivariate case in Delyon and Juditsky (1993), suggests to consider for i ≥ 2 the following sequence of stepsizes

γ_i = γ( ∑_{k=1}^{i−1} I{ ⟨H(θ_{k−1}, X_k), H(θ_k, X_{k+1})⟩ ≤ 0 } )

where ⟨u, v⟩ is the inner product between vectors u and v. Numerous generalisations are possible in order to take into


account the magnitudes of {H(θ_i, X_{i+1})} in the choice of {γ_i} (Plakhov and Cruz 2004) (and references therein),

γ_i = γ( ∑_{k=1}^{i−1} φ( ⟨H(θ_{k−1}, X_k), H(θ_k, X_{k+1})⟩ ) )

for some function φ: ℝ → [0, +∞). Numerous generalisations of these ideas are naturally possible and we have found that in numerous situations a componentwise choice of stepsize can lead to major acceleration (Andrieu and Jasra 2008), i.e. consider for example for j = 1, …, n_θ

γ^j_i = γ( ∑_{k=1}^{i−1} I{ ⟨H_j(θ_{k−1}, X_k), H_j(θ_k, X_{k+1})⟩ ≤ 0 } )

where H_j(θ, X) is the j-th component of H(θ, X), but care must be taken to ensure that important properties of θ (such as positivity if it is a covariance matrix) are preserved. Finally note that this idea needs to be handled with care in the unlikely situations where (here in the scalar case for simplicity) h(θ) ≥ 0 as well as H(θ, x) ≥ 0 for all θ ∈ Θ, x ∈ X, and the solution to our problem is on the boundary of Θ.
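A componentwise version of Kesten's rule can be sketched in Python as follows (the function name and the particular choice of γ(·) are ours); the stepsize of a component only decreases once its field starts changing sign, i.e. once that component of {θ_i} starts oscillating around a solution.

    import numpy as np

    def kesten_stepsizes(field_history, gamma=lambda s: 1.0 / (1.0 + s)):
        # field_history[k] holds H(theta_k, X_{k+1}); for each component j we
        # count how often successive fields have opposite signs and feed that
        # count to a non-increasing function gamma(.).
        H = np.asarray(field_history, dtype=float)
        if H.ndim == 1:
            H = H[:, None]
        crossings = (H[:-1] * H[1:] <= 0).sum(axis=0)
        return gamma(crossings)  # one stepsize per component of theta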

    4.2.3 Some variations

The class of algorithms considered earlier essentially relies on an underlying time homogeneous Markov chain Monte Carlo algorithm with target distribution π. It is however possible to consider non-homogeneous versions of the algorithms developed above. More precisely one can suggest defining a sequence {π_i, i ≥ 1} of probability distributions on X such that π_i → π in some sense, e.g. total variation distance, and select associated MCMC transition probabilities {P_{i,θ}} such that for any i ≥ 1 and θ ∈ Θ, π_i P_{i,θ} = π_i. Then the controlled MCMC algorithm defined earlier can use P_{i+1,θ_i} at iteration i + 1 instead of P_{θ_i}. This opens up the possibility for example to use tempering ideas, i.e. choose π_i(x) ∝ π^{β_i}(x) for β_i ∈ (0, 1), allowing for the accumulation of useful information concerning the distribution of interest π, while exploring simpler distributions. This type of strategy can be useful in order to explore multimodal distributions.

Another possibility, particularly suitable to two stage strategies where adaptation is stopped, consists of removing the vanishing character of adaptation. In the context of stochastic approximation this means for example that the sequence {γ_i} can be set to a constant small value γ. As a result, in light of the examples of the first section, one expects that under some stability assumptions the chain {X_i} will produce samples asymptotically distributed according to an approximation π_γ of π (such that π_γ → π in some sense as γ → 0) and optimise an approximate criterion corresponding to the standard criterion where π is replaced by π_γ. This strategy can offer some robustness properties.

    5 Some adaptive MCMC procedures

In this section we present combinations of strategies, some of them original,¹ which build on the principles developed in previous sections. Note that in order to keep notation simple and ensure readability we present here the simplest versions of the algorithms, but additional features described in Sect. 4.2.2, such as the modification of the mean field to favour stability, the automatic choice of the stepsize (componentwise or not) or Rao-Blackwellisation etc., can easily be incorporated.

    5.1 Compound criteria, transient and starting to learn

As pointed out earlier, desirable asymptotic criteria and associated optimisation procedures can easily be defined. However it can be observed in practice that the algorithm can be slow to adapt, in particular in situations where the initial guess of the parameter θ is particularly bad, resulting for example in a large rejection probability. More generally the MH algorithm has this particular rather negative characteristic that if not well tuned it will not explore the target distribution, and hence will be unable to gather information about it, resulting in a poor learning of the target distribution, and hence algorithms that adapt and behave badly. We describe in this section some strategies that circumvent this problem in practice.

We focus here on the symmetric increments random-walk MH algorithm (hereafter SRWM), in which q(x, y) = q(x − y) for some symmetric probability density q on ℝ^{n_x}, referred to as the increment distribution. The transition probability of the Metropolis algorithm is then given for (x, A) ∈ X × B(X) by

P^{SRWM}_q(x, A) = ∫_{A−x} α(x, x + z) q(z) dz + I(x ∈ A) ∫_{X−x} (1 − α(x, x + z)) q(z) dz,
x ∈ X, A ∈ B(X),   (26)

where α(x, y) := 1 ∧ π(y)/π(x). A classical choice for the proposal distribution is q(z) = N(z; 0, Σ), where N(z; μ, Σ) is the density of a multivariate Gaussian with mean μ and covariance matrix Σ. We will later on refer to this algorithm as the N-SRWM. It is well known that either too small or too large a covariance matrix will result in highly positively correlated Markov chains, and therefore estimators I_n(f) with a large variance.

¹ First presented at the workshop Adapski08, 6–8 January 2008, Bormio, Italy.


In Gelman et al. (1995) it is shown that the optimal covariance matrix (under restrictive technical conditions not given here) for the N-SRWM is (2.38²/n_x) Σ_π, where Σ_π is the true covariance matrix of the target distribution. In Haario et al. (2001) (see also Haario et al. 1999) the authors have proposed to learn Σ_π on the fly, whenever this quantity exists. It should be pointed out here that in situations where this quantity is not well defined, one should resort to robust type estimates in order to capture the dependence structure of the target distribution; we do not consider this here. Denote by P^{SRWM}_{λ,Σ} the transition probability of the N-SRWM with proposal distribution N(0, λΣ) for some λ > 0. With λ = 2.38²/n_x, the algorithm in Haario et al. (2001) can be summarised as follows,

Algorithm 2 AM algorithm

• Initialise X_0, μ_0 and Σ_0.
• At iteration i + 1, given X_i, μ_i and Σ_i:

1. Sample X_{i+1} ∼ P^{SRWM}_{λ,Σ_i}(X_i, ·).
2. Update

μ_{i+1} = μ_i + γ_{i+1}(X_{i+1} − μ_i),
Σ_{i+1} = Σ_i + γ_{i+1}((X_{i+1} − μ_i)(X_{i+1} − μ_i)^T − Σ_i).   (27)
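To fix ideas, here is a compact Python transcription of Algorithm 2 (the two-dimensional Gaussian target, stepsize cap and run length are our own choices, not the authors'):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative target: a correlated zero-mean Gaussian.
    C = np.array([[1.0, 0.9], [0.9, 1.0]])
    C_inv = np.linalg.inv(C)

    def log_pi(x):
        return -0.5 * x @ C_inv @ x

    n_x = 2
    lam = 2.38**2 / n_x  # fixed scaling, as suggested by theory
    x, mu, Sigma = np.zeros(n_x), np.zeros(n_x), np.eye(n_x)

    for i in range(1, 50001):
        # N-SRWM proposal with the current covariance estimate
        y = x + rng.multivariate_normal(np.zeros(n_x), lam * Sigma)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y
        gamma = min(0.05, 1.0 / i)  # capped stepsize keeps Sigma well conditioned early on
        d = x - mu                  # recursion (27) uses the current mean mu_i
        mu = mu + gamma * d
        Sigma = Sigma + gamma * (np.outer(d, d) - Sigma)

    print(Sigma)  # approaches the target covariance C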

This algorithm has been extensively studied in Andrieu and Moulines (2006), Atchadé and Fort (2008), Bai et al. (2008) and Andrieu and Tadic (2007). We now detail some simple improvements on this algorithm.

    5.1.1 Rao-Blackwellisation and square root algorithms

Following Ceperley et al. (1977) and Frenkel (2006), we note that, conditional upon the previous state X_i of the chain and the proposed transition Y_{i+1}, the vector f(X_{i+1}) (for any function f: X → ℝ^{n_f}) can be expressed as

f(X_{i+1}) := I{U_{i+1} ≤ α(X_i, Y_{i+1})} f(Y_{i+1}) + I{U_{i+1} > α(X_i, Y_{i+1})} f(X_i),   (28)

where U_{i+1} ∼ U(0, 1). The expectation of f(X_{i+1}) with respect to U_{i+1}, conditional upon X_i and Y_{i+1}, leads to

\overline{f(X_{i+1})} := α(X_i, Y_{i+1}) f(Y_{i+1}) + (1 − α(X_i, Y_{i+1})) f(X_i).   (29)

For example \overline{X_{i+1}} := α(X_i, Y_{i+1}) Y_{i+1} + (1 − α(X_i, Y_{i+1})) X_i is the average location of the state X_{i+1} which follows X_i, given Y_{i+1}. This can be incorporated in the following Rao-Blackwellised AM recursions

μ_{i+1} = μ_i + γ_{i+1} [ α(X_i, Y_{i+1}) (Y_{i+1} − μ_i) + (1 − α(X_i, Y_{i+1})) (X_i − μ_i) ],

Σ_{i+1} = Σ_i + γ_{i+1} [ α(X_i, Y_{i+1}) (Y_{i+1} − μ_i)(Y_{i+1} − μ_i)^T + (1 − α(X_i, Y_{i+1})) (X_i − μ_i)(X_i − μ_i)^T − Σ_i ].

Using, for simplicity, the short notation (29), a Rao-Blackwellised AM algorithm can be described as follows:

Algorithm 3 Rao-Blackwellised AM algorithm

• Initialise X_0, μ_0 and Σ_0.
• At iteration i + 1, given X_i, μ_i and Σ_i:

1. Sample Y_{i+1} ∼ N(X_i, λΣ_i) and set X_{i+1} = Y_{i+1} with probability α(X_i, Y_{i+1}), otherwise X_{i+1} = X_i.
2. Update

μ_{i+1} = μ_i + γ_{i+1}(\overline{X_{i+1}} − μ_i),
Σ_{i+1} = Σ_i + γ_{i+1}[ \overline{(X_{i+1} − μ_i)(X_{i+1} − μ_i)^T} − Σ_i ].   (30)

Note that it is not clear that this scheme is always advantageous in terms of asymptotic variance of the estimators, as shown in Delmas and Jourdain (2007), but this modification of the algorithm might naturally be beneficial during its transient, whenever the acceptance probability is not too low.

It is worth pointing out that for computational efficiency and stability one can directly update the Choleski decomposition of Σ_i, using the classical rank 1 update formula

Σ^{1/2}_{i+1} = (1 − γ_{i+1})^{1/2} Σ^{1/2}_i
  + (1 − γ_{i+1})^{1/2} [ ( 1 + γ_{i+1}(1 − γ_{i+1})^{−1} ‖Σ^{−1/2}_i (X_{i+1} − μ_i)‖² )^{1/2} − 1 ] / ‖Σ^{−1/2}_i (X_{i+1} − μ_i)‖²
  × (X_{i+1} − μ_i)(X_{i+1} − μ_i)^T Σ^{−T/2}_i,

where A^{T/2} is a shorthand notation for (A^{1/2})^T whenever this quantity is well defined. This expression can be simplified through an expansion (requiring γ_{i+1} ≪ 1) and modified to enforce a lower triangular form as follows

Σ^{1/2}_{i+1} = Σ^{1/2}_i + γ_{i+1} Σ^{1/2}_i L( Σ^{−1/2}_i (X_{i+1} − μ_i)(X_{i+1} − μ_i)^T Σ^{−T/2}_i − I ),

where L(A) is the lower triangular part of matrix A. Note again the familiar stochastic approximation form of the recursion, whose mean field is


L( Σ^{−1/2} ( Σ_π + (μ_π − μ)(μ_π − μ)^T ) Σ^{−T/2} − I ),

and whose zeros (together with those of the recursion on the mean) are precisely any square root of Σ_π. The operator L(·) ensures that the recursion is constrained to lower triangular matrices. Note that this is only required if one wishes to save memory. Rank r updates can also be used when the covariance matrix is updated every r iterations only. In what follows, whenever covariance matrices are updated, recursions of this type can be used although we will not make this explicit for notational simplicity.
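The Python sketch below implements the lower-triangular square-root recursion and checks numerically that S S^T tracks the target covariance (np.tril is our reading of the operator L(·); the damped stepsize and the test target are our choices):

    import numpy as np

    def sqrt_update(S, x, mu, gamma):
        # One step of S <- S + gamma * S * L(S^{-1} d d^T S^{-T} - I), d = x - mu.
        d = x - mu
        w = np.linalg.solve(S, d)            # S^{-1} d (S is lower triangular)
        A = np.outer(w, w) - np.eye(len(d))  # S^{-1} d d^T S^{-T} - I
        return S + gamma * S @ np.tril(A)    # np.tril keeps the result lower triangular

    rng = np.random.default_rng(0)
    target_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    S, mu = np.eye(2), np.zeros(2)
    for i in range(1, 20001):
        x = rng.multivariate_normal(np.zeros(2), target_cov)
        gamma = 1.0 / (10 + i)               # damped stepsize, our choice
        S = sqrt_update(S, x, mu, gamma)
        mu = mu + gamma * (x - mu)
    print(S @ S.T)                           # close to target_cov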

    5.1.2 Compound criterion: global approach

As pointed out earlier, in the case of the N-SRWM algorithm the scaling of the proposal distribution is well understood in specific scenarios and intuitively meaningful for a larger class of target distributions. A good rule of thumb is to choose Σ = (2.38²/n_x) Σ_π, where Σ_π is the covariance matrix of π. We have shown above that following Haario et al. (2001) one can in principle estimate Σ_π from the past of the chain. However the difficulties that lead to the desire to develop adaptive algorithms in the first place, including the very poor exploration of the target distribution π, also hinder learning about the target distribution in the initial stages of an adaptive MCMC algorithm when our initial value for the estimator of Σ_π is a poor guess. Again if λΣ_i is either too large in some directions or too small in all directions, the algorithm has either a very small or a very large acceptance probability, which results in a very slow learning of Σ_π since the exploration of the target's support is too localised. This is a fundamental problem in practice, which has motivated the use of delayed rejection for example (Haario et al. 2003), and for which we present here an alternative solution which relies on the notion of composite criterion.

While theory suggests a scaling of λ = 2.38²/n_x, we propose here to adapt this parameter in order to coerce the acceptance probability to a preset and sensible value (e.g. 0.234), at least in the initial stages of the algorithm. Indeed, while this adaptation is likely not to be useful in the long run, it proves very useful in the early stages of the algorithm (we provide a detailed illustration in Sect. 6.3), where the pathological behaviour described above can be detected through monitoring of the acceptance probability, and corrected.

As a consequence, in what follows the proposal distribution of the adaptive N-SRWM algorithm we consider is q_θ(z) = N(z; 0, λΣ) where here θ := (λ, μ, Σ). Assuming that for any fixed covariance matrix Σ the corresponding expected acceptance probability (see (14)) is a non-increasing function of λ, one can naturally suggest the recursion log λ_{i+1} = log λ_i + γ_{i+1}[α(X_i, Y_{i+1}) − τ*], which following the discussion of Sect. 4 is nothing but a standard Robbins-Monro recursion. Now when the covariance matrix Σ_π needs to be estimated, one can suggest the following compound criterion or multicriteria algorithm:

Algorithm 4 AM algorithm with global adaptive scaling

• Initialise X_0, λ_0, μ_0 and Σ_0.
• At iteration i + 1, given X_i, λ_i, μ_i and Σ_i:

1. Sample Y_{i+1} ∼ N(X_i, λ_i Σ_i) and set X_{i+1} = Y_{i+1} with probability α(X_i, Y_{i+1}), otherwise X_{i+1} = X_i.
2. Update

log(λ_{i+1}) = log(λ_i) + γ_{i+1}[α(X_i, Y_{i+1}) − τ*],
μ_{i+1} = μ_i + γ_{i+1}(X_{i+1} − μ_i),   (31)
Σ_{i+1} = Σ_i + γ_{i+1}[(X_{i+1} − μ_i)(X_{i+1} − μ_i)^T − Σ_i].

Again the interest of the algorithm is as follows: whenever our initial guess Σ_0 is either too large or too small, this will be reflected in either a large or small acceptance probability, meaning that learning of Σ_π is likely to be slow for a fixed scaling parameter. However this measure of performance of the algorithm can be exploited as illustrated above: if α(X_i, Y_{i+1}) − τ* < 0 for most transition attempts then λ_i should be decreased, while if on the other hand α(X_i, Y_{i+1}) − τ* ≥ 0 for most transition attempts, then λ_i should be increased. As a result one might expect a more rapid exploration of the target distribution following a poor initialisation. Although this strategy can improve the performance of the standard AM algorithm in practice, we show in the next section that it is perfectible.
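As a minimal illustration, the following Python function (its name and packaging are ours) performs step 2 of Algorithm 4, i.e. recursion (31); it can replace the fixed-λ update in the AM sketch given after Algorithm 2, with the next proposal drawn from N(X_{i+1}, λ_{i+1} Σ_{i+1}).

    import numpy as np

    def update_theta(lam, mu, Sigma, x_next, alpha, gamma, tau_star=0.234):
        # x_next is X_{i+1} (the state retained after accept/reject) and
        # alpha the acceptance probability alpha(X_i, Y_{i+1}) of the attempt.
        lam = lam * np.exp(gamma * (alpha - tau_star))  # log-scale Robbins-Monro step
        d = x_next - mu
        mu = mu + gamma * d
        Sigma = Sigma + gamma * (np.outer(d, d) - Sigma)
        return lam, mu, Sigma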

    5.1.3 Compound criterion: local approach

As we shall now see, the global approach described in the previous subsection might be improved further. There are two reasons for this. First it should be clear that adjusting the global scaling factor ignores the fact that the scaling of λ_i Σ_i might be correct in some directions, but incorrect in others. In addition, in order to be efficient, such bold updates require in general some good understanding of the dependence structure of the target distribution, in the form of a reasonable estimate of Σ_π, which is not available in the initial stages of the algorithm. These problems tend to be amplified in scenarios involving a large dimension n_x of the space X, since innocuous approximations in low dimensions tend to accumulate in larger cases. Inspired by Haario et al. (2005), we suggest the following componentwise update


strategy which consists of a mixture of timid moves whose role is to attempt simpler transitions better able to initiate the exploration of π. Note, however, that in contrast with Haario et al. (2005) our algorithm uses the notion of compound criterion, which in our experience significantly improves performance. With e_k the vector with zeroes everywhere but for a 1 on its k-th row, and a sensible τ* ∈ (0, 1), e.g. 0.44:

Algorithm 5 Componentwise AM with componentwise adaptive scaling

• Initialise X_0, μ_0, Σ_0 and λ^1_0, …, λ^{n_x}_0.
• At iteration i + 1, given μ_i, Σ_i and λ^1_i, …, λ^{n_x}_i:

1. Choose a component k ∼ U{1, …, n_x}.
2. Sample Y_{i+1} ∼ X_i + e_k N(0, λ^k_i [Σ_i]_{k,k}) and set X_{i+1} = Y_{i+1} with probability α(X_i, Y_{i+1}), otherwise X_{i+1} = X_i.
3. Update

log(λ^k_{i+1}) = log(λ^k_i) + γ_{i+1}[α(X_i, Y_{i+1}) − τ*],
μ_{i+1} = μ_i + γ_{i+1}(X_{i+1} − μ_i),   (32)
Σ_{i+1} = Σ_i + γ_{i+1}[(X_{i+1} − μ_i)(X_{i+1} − μ_i)^T − Σ_i]

and λ^j_{i+1} = λ^j_i for j ≠ k.

One might question the apparently redundant use of both a scaling λ^k_i and the marginal variance [Σ_i]_{k,k} in the proposal distributions above, and one might choose to combine both quantities into a single scaling factor. However the present formulation allows for a natural combination (i.e. a mixture or composition) of the recursion above and variations of the standard AM algorithm (Algorithm 2) such as Algorithm 4. Such combinations allow one to circumvent the shortcomings of bold moves, which require extensive understanding of the structure of π, in the early iterations of the algorithm. The timid moves allow the procedure to start gathering information about π which might then be used by more sophisticated and more global updates.

We now turn to yet another version of the AM algorithm (Algorithm 2) which can be understood as being a version of Algorithm 4 which exploits the local scalings computed by Algorithm 5 instead of a single global scaling factor. It consists of replacing the proposal distribution N(X_i, λ_i Σ_i) in Algorithm 4 with N(X_i, Λ^{1/2}_i Σ_i Λ^{1/2}_i), where

Λ_i := diag( λ^1_i, …, λ^{n_x}_i ).

As we now show, such an update can be combined with Algorithm 5 into a single update. For a vector V we will denote V(k) its k-th component and e_k the vector with zeroes everywhere but for a 1 on its k-th row. We have,

Algorithm 6 Global AM with componentwise adaptive scaling

• Initialise X_0, μ_0, Σ_0 and λ^1_0, …, λ^{n_x}_0.
• Iteration i + 1:

1. Given μ_i, Σ_i and λ^1_i, …, λ^{n_x}_i, sample Z_{i+1} ∼ N(0, Λ^{1/2}_i Σ_i Λ^{1/2}_i) and set X_{i+1} = X_i + Z_{i+1} with probability α(X_i, X_i + Z_{i+1}), otherwise X_{i+1} = X_i.
2. Update for k = 1, …, n_x

log(λ^k_{i+1}) = log(λ^k_i) + γ_{i+1}[α(X_i, X_i + Z_{i+1}(k) e_k) − τ*],   (33)
μ_{i+1} = μ_i + γ_{i+1}(X_{i+1} − μ_i),
Σ_{i+1} = Σ_i + γ_{i+1}[(X_{i+1} − μ_i)(X_{i+1} − μ_i)^T − Σ_i].

It is naturally possible to include an update for a global scaling parameter, but we do not pursue this here. This algorithm exploits the fact that a proposed sample X_i + Z_{i+1} provides us with information about scalings in various directions through the virtual componentwise updates with increments {Z_{i+1}(k) e_k} and their corresponding directional acceptance probabilities. This strategy naturally requires n_x + 1 evaluations of π, which is equivalent to one update according to Algorithm 4 and n_x updates according to Algorithm 5.
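A Python transcription of Algorithm 6 might look as follows (the anisotropic Gaussian target, stepsize cap and numerical jitter are our own choices). Note how the single proposed vector Z_{i+1} is recycled into n_x virtual one-dimensional moves whose directional acceptance probabilities drive the componentwise scalings, at the cost of the extra target evaluations mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)

    scales2 = np.array([1.0, 100.0])  # anisotropic Gaussian target, for illustration

    def log_pi(x):
        return -0.5 * np.sum(x**2 / scales2)

    n_x, tau_star = 2, 0.234
    x, mu, Sigma = np.zeros(n_x), np.zeros(n_x), np.eye(n_x)
    log_lam = np.zeros(n_x)           # componentwise log-scalings log(lambda^k)

    for i in range(1, 20001):
        gamma = min(0.05, 1.0 / i)
        D = np.diag(np.exp(0.5 * log_lam))  # Lambda_i^{1/2}
        L = np.linalg.cholesky(D @ Sigma @ D + 1e-12 * np.eye(n_x))
        z = L @ rng.standard_normal(n_x)    # Z_{i+1} ~ N(0, Lambda^{1/2} Sigma Lambda^{1/2})
        lp_x = log_pi(x)
        # virtual componentwise moves: directional acceptance probabilities, recursion (33)
        for k in range(n_x):
            zk = np.zeros(n_x)
            zk[k] = z[k]
            alpha_k = np.exp(min(0.0, log_pi(x + zk) - lp_x))
            log_lam[k] += gamma * (alpha_k - tau_star)
        # global move, accepted with probability alpha(X_i, X_i + Z_{i+1})
        if np.log(rng.uniform()) < log_pi(x + z) - lp_x:
            x = x + z
        d = x - mu
        mu = mu + gamma * d
        Sigma = Sigma + gamma * (np.outer(d, d) - Sigma)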

    5.2 Fitting mixtures, clustering and localisation

As pointed out in Andrieu and Moulines (2006, Sect. 7) the moment matching criterion corresponding to the recursion (27) can be understood as minimising the Kullback-Leibler divergence

KL(π, q_θ) := E_π( log( π(X) / q_θ(X) ) )   (34)

where q_θ(x) = N(x; μ, Σ) (but using q_θ(z) = N(z; 0, λΣ) as a proposal distribution for the increments of a N-SRWM update). This remark leads to the following considerations, of varying importance.

The first remark is that q_θ could be used as the proposal distribution of an independent MH (IMH) update, as in Andrieu and Moulines (2006) or Giordani and Kohn (2006). Although this might be a sensible choice when q_θ(x) is a good approximation of π, this might fail when it is not close to π (in the transient for example) or simply because the chosen parametric form is not sufficiently rich. In addition such a bad behaviour is generally exacerbated by large dimensions, as illustrated by the following toy example.

Example 1 The target distribution is π(x) = N(x; 0, I) with x ∈ ℝ^{n_x} and the proposal distribution is q(x) = N(x; μe, I) for some μ > 0, with e = (1, 1, 1, …)^T. The importance sampling weight entering the acceptance ratio of an IMH algorithm is


π(x)/q(x) = exp( μ²n_x/2 − μ e^T x ) = exp( −μ²n_x/2 − μ n_x^{1/2} · n_x^{−1/2} ∑_{i=1}^{n_x} (x(i) − μ) ),

which is not bounded, hence preventing geometric ergodicity. The distribution of n_x^{−1/2} ∑_{i=1}^{n_x} (x(i) − μ) under q is precisely N(0, 1), which results in a variance for the weights of

exp(μ²n_x) − 1.

This is known to result in poorly performing importance sampling algorithms, but will also have an impact on the convergence of IMH algorithms, which will get stuck in states x with arbitrarily large weights as n_x increases, with non negligible probability.
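A quick Monte Carlo check of this variance formula (our own sketch: we sample s = n_x^{−1/2} ∑_i (x(i) − μ) directly, which is exactly N(0, 1) under q):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, n_x = 0.1, 100  # so that mu**2 * n_x = 1

    # Under q the weight is w = exp(-mu^2 n_x / 2 - mu sqrt(n_x) s) with s ~ N(0, 1).
    s = rng.standard_normal(10**6)
    w = np.exp(-0.5 * mu**2 * n_x - mu * np.sqrt(n_x) * s)
    print(w.mean(), w.var(), np.exp(mu**2 * n_x) - 1)
    # mean is ~1, and both variance estimates are ~exp(1) - 1 = 1.72

Even a modest offset μ = 0.1 per coordinate thus yields weights whose variance grows as exp(μ²n_x), which is the curse of dimensionality alluded to above.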

IMH updates hence fall in the category of very bold updates which require significant knowledge of the structure of π and do not usually form the base for reliable adaptive MCMC algorithms.

The second remark, which turns out to be of more interest, is that one can consider other parametric forms for q_θ, and use such approximations of π to design proposal distributions for random walk type algorithms, which are likely to perform better given their robustness. It is suggested in Andrieu and Moulines (2006, Sect. 7) to consider mixtures, finite or infinite, of distributions belonging to the exponential family (see also Cappé et al. 2007 for a similar idea in the context of importance sampling).

