Markov chain Monte Carlo algorithms with sequential proposals

    Joonha Park∗, Yves Atchadé

    Boston University

    Abstract

We explore a general framework in Markov chain Monte Carlo (MCMC) sampling where sequential proposals are tried as a candidate for the next state of the Markov chain. This sequential-proposal framework can be applied to various existing MCMC methods, including Metropolis-Hastings algorithms using random proposals and methods that use deterministic proposals such as Hamiltonian Monte Carlo (HMC) or the bouncy particle sampler. Sequential-proposal MCMC methods construct the same Markov chains as those constructed by the delayed rejection method under certain circumstances. In the context of HMC, the sequential-proposal approach has been proposed as extra chance generalized hybrid Monte Carlo (XCGHMC). We develop two novel methods in which the trajectories leading to proposals in HMC are automatically tuned to avoid doubling back, as in the No-U-Turn sampler (NUTS). The numerical efficiency of these new methods compares favorably to that of the NUTS. We additionally show that the sequential-proposal bouncy particle sampler enables the constructed Markov chain to pass through regions of low target density and thus facilitates better mixing of the chain when the target density is multimodal.

    1 Introduction

Markov chain Monte Carlo (MCMC) methods are widely used to sample from distributions with analytically tractable unnormalized densities. In this paper, we explore an MCMC framework in which proposals for the next state of the Markov chain are drawn sequentially. We consider the objective of obtaining samples from a target distribution on a measurable space (X, 𝒳) with density

π̄(x) := π(x)/Z

with respect to a reference measure denoted by dx, where π(x) denotes an unnormalized density, and Z denotes the corresponding normalizing constant. MCMC methods construct Markov chains such that, given the current state of the Markov chain X^(i), the next state X^(i+1) is drawn from a kernel which has the target distribution π̄ as its invariant distribution. The widely used Metropolis-Hastings (MH) strategy constructs a kernel with a specified invariant distribution in the following two steps (Metropolis et al., 1953; Hastings, 1970). First, a proposal Y is drawn from a proposal kernel, and second, the proposal is accepted as X^(i+1) with a certain probability. When the proposal is not accepted, the next state of the chain is set equal to the current state X^(i). The acceptance probability depends on the target density and the proposal kernel density at X^(i) and Y in a way that ensures that π̄ is a stationary density of the constructed Markov chain.

    ∗Email: [email protected]

arXiv:1907.06544v3 [stat.CO] 20 Aug 2019

The typical size of proposal increments and the mean acceptance probability affect the rate of mixing of the constructed Markov chain and thus the numerical efficiency of the algorithm. There is often a balance to be made between the size of proposal increments and the mean acceptance probability. Theoretical studies on this trade-off have been carried out for several widely used algorithms, such as random walk Metropolis (Roberts et al., 1997), the Metropolis adjusted Langevin algorithm (MALA) (Roberts and Rosenthal, 1998), or Hamiltonian Monte Carlo (HMC) (Beskos et al., 2013), in an asymptotic scenario where the target density is given by the product of d identical copies of a one dimensional density and where d tends to infinity. These results suggest that the optimal balance can be made by aiming at a certain value of the mean acceptance probability which depends on the algorithm but not on the target density, provided that the marginal density satisfies some regularity conditions.

Alternative methods to the basic Metropolis-Hastings strategy have been proposed to improve the numerical efficiency beyond the optimal balance between the proposal increment size and the acceptance probability. The multiple-try Metropolis method by Liu et al. (2000) makes multiple proposals given the current state of the Markov chain and selects one of them as a candidate for the next state of the Markov chain. Calderhead (2014) proposed a different algorithm that makes multiple proposals and allows more than one of them to be taken as samples in the Markov chain. Since multiple proposals can be made independently in these methods, parallelization can increase computational efficiency. These methods make a preset number of proposals conditional on the current state of the Markov chain at each iteration.

Developments in various other directions have been made to improve the numerical efficiency of MCMC sampling. Adaptive MCMC methods use transition kernels that adapt over time using the information about the target distribution provided by the past history of the constructed chain (Haario et al., 2001; Andrieu and Thoms, 2008). The update scheme for the transition kernel is designed to induce a sequence of transition kernels that converges to one that is efficient for the target distribution. The convergence of the law of the constructed chain and the rate of convergence have been studied under certain sets of conditions (Haario et al., 2001; Atchadé and Rosenthal, 2005; Andrieu and Moulines, 2006; Andrieu and Atchadé, 2007; Roberts and Rosenthal, 2007; Atchadé and Fort, 2010; Atchadé and Fort, 2012). Note however that the performance of an adaptive MCMC algorithm is limited by the efficiencies of the candidate transition kernels. In a different approach, Goodman and Weare (2010) proposed using ensemble samplers that construct Markov chains that are equally efficient for all target distributions that are affine transformations of each other. These methods draw information about the shape of the target distribution from parallel chains which jointly target the product distribution given by identical copies of the target density.

There also exists a class of methods that address difficulties in sampling from multimodal distributions using local proposals. Methods in this class include parallel tempering (Geyer, 1991; Hukushima and Nemoto, 1996), simulated tempering (Marinari and Parisi, 1992), and the equi-energy sampler (Kou et al., 2006). In these methods, the mixing of the constructed Markov chain is aided by a set of other Markov chains that target alternative distributions for which the moves between separated modes happen more frequently. The equi-energy sampler bears a similarity with the approach of slice sampling, where a new sample is obtained within a randomly chosen level set of the target density (Roberts and Rosenthal, 1999; Mira et al., 2001a; Neal, 2003).

In this paper, we explore a novel approach where proposals are drawn sequentially, conditional on the previous proposal, in each iteration. The proposal draws continue until a desired number of "acceptable" proposals are made, so the total number of proposals is variable. A key element in this approach is that the decisions of acceptance or rejection of proposals are coupled via a single uniform(0, 1) random variable drawn at the start of each iteration. This feature yields a straightforward generalization of the Metropolis-Hastings acceptance-rejection strategy. The approach is applicable to a wide range of commonly used MCMC algorithms, including ones that use proposal kernels with well defined densities and others that use deterministic proposal maps, such as Hamiltonian Monte Carlo (Duane et al., 1987) or the recently proposed bouncy particle sampler (Peters et al., 2012; Bouchard-Côté et al., 2018). We will demonstrate that the sequential-proposal approach is flexible; it is possible to make various modifications in order to develop methods that possess specific strengths.

The advantage of the sequential-proposal approach can be explained using the Peskun-Tierney ordering (Peskun, 1973; Tierney, 1998; Andrieu and Livingstone, 2019). Suppose two transition kernels P_1 and P_2 defined on (X, 𝒳) are reversible with respect to π̄:

∫ 1_{A×B}(x, y) P_j(x, dy) π(x) dx = ∫ 1_{B×A}(x, y) P_j(x, dy) π(x) dx,   ∀A, B ∈ 𝒳, j = 1, 2.

The transition kernel P_1 is said to dominate P_2 off the diagonal if

P_1(x, A \ {x}) ≥ P_2(x, A \ {x}),   ∀x ∈ X, ∀A ∈ 𝒳.

For an 𝒳-measurable function f such that ∫ f²(x) π̄(x) dx < ∞, the Peskun-Tierney ordering states that if P_1 dominates P_2 off the diagonal, the asymptotic variance of the empirical averages of f along the chain is no larger under P_1 than under P_2 (Tierney, 1998).

Algorithm 1: A sequential-proposal Metropolis algorithm

Input: Maximum number of proposals, N; Symmetric proposal kernel, q(y | x); Number of iterations, M
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Initialize: Set X^(0) arbitrarily
for i ← 0 : M−1 do
    Draw Λ ∼ unif(0, 1)
    Set X^(i+1) ← X^(i)
    Set Y_0 ← X^(i)
    for n ← 1 : N do
        Draw Y_n ∼ q(· | Y_{n−1})
        if Λ < π(Y_n)/π(Y_0) then
            Set X^(i+1) ← Y_n
            break
        end
    end
end

    2 Sequential-proposal Metropolis-Hastings algorithms

    2.1 Sequential-proposal Metropolis algorithm

We will first explain the sequential-proposal approach when the proposal kernel has a well defined density with respect to the reference measure of the target density π̄. For a simpler presentation, we will first describe a sequential-proposal Metropolis algorithm, which uses a proposal kernel with symmetric density. Various generalizations will be introduced in Section 2.2. In standard Metropolis algorithms, given the current state X^(i) = x at the i-th iteration of the algorithm, the proposal Y is drawn from a probability kernel with conditional density q(y | x) that is symmetric in the sense that q(y | x) = q(x | y) for all x, y ∈ X. The proposal Y = y is accepted with probability

min(1, π(y)/π(x)).

This is often implemented by drawing a uniform random variable Λ ∼ unif(0, 1) and accepting the proposal by setting X^(i+1) ← Y if and only if Λ < π(y)/π(x). If Y is not accepted, the algorithm sets X^(i+1) ← X^(i).

We will call Y_1 the first proposal, drawn from q(· | X^(i)). The proposal Y_1 is rejected if and only if a uniform random number Λ ∼ unif(0, 1) is greater than or equal to π(Y_1)/π(X^(i)). If rejected, a second proposal Y_2 is drawn from q(· | Y_1). The second proposal is accepted if and only if Λ < π(Y_2)/π(X^(i)), using the same value of Λ used previously. If accepted, the algorithm sets X^(i+1) ← Y_2. In the case where Y_2 is rejected, a third proposal is drawn from q(· | Y_2) and checked for acceptability using the same type of criterion, Λ < π(Y_3)/π(X^(i)). This procedure is repeated until an acceptable proposal is found or until a preset number N of proposals are all rejected, whichever is reached sooner. In the case where all N proposals are rejected, the algorithm sets X^(i+1) ← X^(i). A pseudocode for a sequential-proposal Metropolis algorithm is given in Algorithm 1. The algorithm reduces to a standard Metropolis algorithm if we set N = 1.
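To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 for a one dimensional target, assuming a Gaussian random walk proposal with standard deviation step; log_pi is a placeholder for any user-supplied function returning the log unnormalized target density.

import numpy as np

def sp_metropolis(log_pi, x0, n_iter, N=5, step=1.0, seed=1):
    # Sequential-proposal Metropolis (Algorithm 1) with a Gaussian random walk kernel.
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter + 1)
    chain[0] = x0
    for i in range(n_iter):
        log_lam = np.log(rng.uniform())           # single Lambda couples all N acceptance checks
        log_pi_y0 = log_pi(chain[i])              # Y_0 is the current state
        y = chain[i]
        chain[i + 1] = chain[i]                   # default: stay put if all N proposals fail
        for _ in range(N):
            y = y + step * rng.standard_normal()  # Y_n ~ q(. | Y_{n-1}), symmetric
            if log_lam < log_pi(y) - log_pi_y0:   # Lambda < pi(Y_n) / pi(Y_0)
                chain[i + 1] = y                  # first acceptable proposal is taken
                break
    return chain

# Example usage: sample from a standard normal target.
samples = sp_metropolis(lambda x: -0.5 * x**2, x0=0.0, n_iter=5000)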

We will now show that the sequential-proposal Metropolis algorithm just described constructs a reversible Markov chain with respect to the target distribution with density π̄. Throughout this paper, for two integers n and m, we will denote by n:m the sequence (n, n+1, . . . , m) if n ≤ m and the sequence (n, n−1, . . . , m) if n > m. Also, given a sequence (a_n)_{n≥0} = (a_0, a_1, a_2, . . .), we will denote by a_{n:m} the subsequence (a_j)_{n≤j≤m}.

Proposition 1. Algorithm 1 constructs a reversible Markov chain (X^(i)) with respect to the target density π̄.

Proof. We will show that the detailed balance equation

P[X^(i) ∈ A, X^(i+1) ∈ B] = P[X^(i) ∈ B, X^(i+1) ∈ A]

holds for every pair of measurable subsets A and B of X, provided that X^(i) is distributed according to π̄. We will write Y_0 := X^(i), and the subsequent proposals as Y_1, Y_2, . . . , Y_N. The case where the n-th proposal Y_n is taken for X^(i+1) will be considered; then the claim of detailed balance will follow by combining the cases for n in 1:N and the case where all proposals are rejected. Under the assumption that X^(i) is distributed according to π̄, the probability that X^(i) is in A and the n-th proposal is in B and taken as X^(i+1) is given by

P[X^(i) ∈ A, X^(i+1) ∈ B, the n-th proposal is taken as X^(i+1)]
    = ∫ 1_A(y_0) 1_B(y_n) π̄(y_0) q(y_1 | y_0) · · · q(y_n | y_{n−1})
        · 1[Λ ≥ π(y_1)/π(y_0)] · · · 1[Λ ≥ π(y_{n−1})/π(y_0)] · 1[Λ < π(y_n)/π(y_0)] · 1[0 < Λ < 1] dΛ dy_0 dy_1 · · · dy_n,   (1)

where 1_A denotes the indicator function for the set A and 1[·] denotes the indicator function of the event specified between the brackets. The quantity

1[Λ ≥ π(y_1)/π(y_0)] · · · 1[Λ ≥ π(y_{n−1})/π(y_0)] · 1[Λ < π(y_n)/π(y_0)] · 1[0 < Λ < 1]   (2)

is equal to unity if and only if

Λ ≥ max_{k∈1:n−1} π(y_k)/π(y_0)   and   Λ < min(1, π(y_n)/π(y_0)).   (3)

It can be readily observed that for real numbers x, a, and b, the conditions x ≥ a and x < b are satisfied if and only if x ∈ [min{a, b}, b), where the interval length is given by b − min(a, b). Thus the interval length corresponding to the conditions (3) is given by

min(1, π(y_n)/π(y_0)) − min(1, π(y_n)/π(y_0), max_{k∈1:n−1} π(y_k)/π(y_0)),

which gives the integral of (2) over Λ. It follows that (1) is equal to

∫ 1_A(y_0) 1_B(y_n) π̄(y_0) ∏_{k=1}^{n} q(y_k | y_{k−1}) · [ min(1, π(y_n)/π(y_0)) − min(1, π(y_n)/π(y_0), max_{k∈1:n−1} π(y_k)/π(y_0)) ] dy_{0:n}
    = (1/Z) ∫ 1_A(y_0) 1_B(y_n) ∏_{k=1}^{n} q(y_k | y_{k−1}) · [ min{π(y_0), π(y_n)} − min{π(y_0), π(y_n), max_{k∈1:n−1} π(y_k)} ] dy_{0:n}.   (4)

If we change the notation of the dummy variables by writing y_0 ← y_n, y_1 ← y_{n−1}, . . . , y_n ← y_0, then (4) is given by

(1/Z) ∫ 1_A(y_n) 1_B(y_0) ∏_{k=1}^{n} q(y_k | y_{k−1}) [ min{π(y_n), π(y_0)} − min{π(y_n), π(y_0), max_{k∈1:n−1} π(y_k)} ] dy_{0:n},   (5)

Algorithm 2: A sequential-proposal Metropolis-Hastings algorithm

Input: Distribution for the maximum number of proposals and the number of accepted proposals, ν(N, L); Possibly asymmetric proposal kernel, q(y_n | y_{n−1}); Number of iterations, M
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Initialize: Set X^(0) arbitrarily
for i ← 0 : M−1 do
    Draw (N, L) ∼ ν(·, ·)
    Draw Λ ∼ unif(0, 1)
    Set X^(i+1) ← X^(i)
    Set Y_0 ← X^(i) and n_a ← 0
    for n ← 1 : N do
        Draw Y_n ∼ q(· | Y_{n−1})
        if Λ < [π(Y_n) ∏_{j=1}^{n} q(Y_{j−1} | Y_j)] / [π(Y_0) ∏_{j=1}^{n} q(Y_j | Y_{j−1})] then n_a ← n_a + 1
        if n_a = L then
            Set X^(i+1) ← Y_n
            break
        end
    end
end

where we have used the fact that the kernel density q is symmetric, that is, q(y_{k−1} | y_k) = q(y_k | y_{k−1}) for k ∈ 1:n. It is now obvious that (5) is equal to the quantity obtained by swapping the positions of A and B in (1). Thus we see that

P[X^(i) ∈ A, X^(i+1) ∈ B, the n-th proposal is taken as X^(i+1)]
    = P[X^(i) ∈ B, X^(i+1) ∈ A, the n-th proposal is taken as X^(i+1)].   (6)

In the case where all N proposals are rejected, the algorithm sets X^(i+1) ← X^(i). Thus,

P[X^(i) ∈ A, X^(i+1) ∈ B, all N proposals are rejected]
    = P[X^(i) ∈ A, X^(i) ∈ B, all N proposals are rejected],   (7)

which is obviously unchanged under the swap of A and B. Thus summing (6) over all n ∈ 1:N and adding (7) gives

P[X^(i) ∈ A, X^(i+1) ∈ B] = P[X^(i) ∈ B, X^(i+1) ∈ A].

    2.2 Algorithm generalizations

The sequential-proposal Metropolis algorithm described in the previous subsection can be generalized in various ways. Firstly, the algorithm may use proposal kernels with asymmetric density. The n-th proposal Y_n is drawn from a probability kernel with density q which may not satisfy q(y | x) = q(x | y) for all x, y ∈ X. A proposed value Y_n is deemed acceptable if and only if

Λ < [π(Y_n) ∏_{j=1}^{n} q(Y_{j−1} | Y_j)] / [π(Y_0) ∏_{j=1}^{n} q(Y_j | Y_{j−1})].   (8)

Here Y_0 denotes the current state of the Markov chain. Clearly, if the proposal density q is symmetric, (8) reduces to the acceptance criterion Λ < π(Y_n)/π(Y_0) used in Algorithm 1. We call a sequential-proposal MCMC algorithm that uses a proposal kernel with a possibly asymmetric density a sequential-proposal Metropolis-Hastings algorithm.

A sequential-proposal Metropolis-Hastings algorithm can be further generalized by taking the L-th acceptable proposal as the next state of the Markov chain for a general L ≥ 1. The algorithms previously described correspond to the case where L = 1. A pseudocode for this generalized Metropolis-Hastings algorithm is given in Algorithm 2. The algorithmic parameters N and L may be randomly selected at each iteration, provided that they are independent of the proposals {Y_n ; n ≥ 1} and Λ. If there are fewer than L acceptable proposals among the first N proposals, the Markov chain stays at its current position. The proof that Algorithm 2 constructs a reversible Markov chain with respect to the target density π̄ is given in Appendix A.
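For illustration, here is a minimal Python sketch of one iteration of Algorithm 2 on the log scale; log_pi, log_q, and draw_q are user-supplied placeholders (log unnormalized target density, log proposal density of a move, and proposal sampler), not functions defined in the paper.

import numpy as np

def sp_mh_step(x, log_pi, log_q, draw_q, N, L, rng):
    # One iteration of sequential-proposal Metropolis-Hastings (Algorithm 2).
    log_lam = np.log(rng.uniform())  # single coupling variable for all acceptability checks
    log_fwd = 0.0                    # log prod_j q(Y_j | Y_{j-1})
    log_bwd = 0.0                    # log prod_j q(Y_{j-1} | Y_j)
    y_prev, n_accepted = x, 0
    for _ in range(N):
        y = draw_q(y_prev)
        log_fwd += log_q(y, y_prev)  # density of the move y_prev -> y
        log_bwd += log_q(y_prev, y)  # density of the reversed move y -> y_prev
        if log_lam < log_pi(y) - log_pi(x) + log_bwd - log_fwd:  # condition (8)
            n_accepted += 1
            if n_accepted == L:
                return y             # the L-th acceptable proposal becomes X^(i+1)
        y_prev = y
    return x                         # fewer than L acceptable proposals: chain stays put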

A sequential-proposal Metropolis-Hastings algorithm can also employ proposal kernels that depend on the sequence of previous proposals. Suppose that proposals are sequentially drawn in such a way that the k-th candidate Y_k is drawn from a proposal kernel with density q_k(· | Y_{k−1}, . . . , Y_0), where Y_{k−1}, . . . , Y_1 denote the previous proposals and Y_0 denotes the current state X^(i) of the Markov chain at the i-th iteration. The candidate Y_k is deemed acceptable if

Λ < [π(Y_k) ∏_{j=1}^{k} q_j(Y_{k−j} | Y_{k−j+1:k})] / [π(Y_0) ∏_{j=1}^{k} q_j(Y_j | Y_{j−1:0})].   (9)

Proposals are sequentially drawn until L acceptable proposals are found. If there are fewer than L acceptable proposals among the first N proposals, the next state in the Markov chain is set to the current state, X^(i+1) ← X^(i). Suppose now that the L-th acceptable state is obtained by the n-th proposal Y_n for some n ≤ N. In the case where the proposal kernel depends on the sequence of previous proposals, in order to take Y_n as the next state of the Markov chain, an additional condition needs to be checked, namely that there are exactly L−1 numbers k ∈ 1:n−1 that satisfy

Λ < [π(Y_k) ∏_{j=1}^{n−k} q_j(Y_{k+j} | Y_{k+j−1:k}) ∏_{j=n−k+1}^{n} q_j(Y_{n−j} | Y_{n−j+1:n})] / [π(Y_0) ∏_{j=1}^{n} q_j(Y_j | Y_{j−1:0})].   (10)

If this additional condition is satisfied, Y_n is taken as the next state of the Markov chain, that is, X^(i+1) ← Y_n. Otherwise, the next state is set to the current state of the Markov chain, X^(i+1) ← X^(i). A pseudocode for sequential-proposal Metropolis-Hastings algorithms that employ kernels dependent on the sequence of previous proposals is given in Appendix B. The role of the additional condition (10) is to establish detailed balance between X^(i) and X^(i+1) by creating a symmetry between the sequence of proposals Y_0 → Y_1 → · · · → Y_n and the reversed sequence Y_n → Y_{n−1} → · · · → Y_0. To see this, we note that the candidate Y_n can be taken as the next state of the Markov chain only when there are exactly L−1 acceptable proposals among Y_1, . . . , Y_{n−1}. The additional symmetry condition accounts for a mirror case where there are L−1 acceptable proposals among Y_{n−1}, . . . , Y_1, assuming that these proposals are sequentially drawn in the reverse order starting from Y_n. A proof of detailed balance for this algorithm is also given in Appendix B. This algorithm reduces to Algorithm 2 in the case where the proposal kernel depends only on the most recent proposal.

We note that sequential-proposal Metropolis-Hastings algorithms in the case where L = 1 construct the same Markov chains as those constructed by delayed rejection methods (Tierney and Mira, 1999; Mira et al., 2001b; Green and Mira, 2001) when the proposal kernel depends only on the most recent proposal. A brief description of the delayed rejection method, following Mira et al. (2001b), is given as follows. Given the current state of the Markov chain y_0, the first candidate value y_1 is drawn from q(· | y_0) and accepted with probability

α_1(y_0, y_1) = 1 ∧ [π(y_1) q(y_0 | y_1)] / [π(y_0) q(y_1 | y_0)],

where a ∧ b := min(a, b). If y_1 is rejected, a next candidate value y_2 is drawn from q(· | y_1). The acceptance probability for y_2 is given by

α_2(y_0, y_1, y_2) = 1 ∧ [π(y_2) q(y_1 | y_2) q(y_0 | y_1) {1 − α_1(y_2, y_1)}] / [π(y_0) q(y_1 | y_0) q(y_2 | y_1) {1 − α_1(y_0, y_1)}].

If y_1, . . . , y_{n−1} are rejected, y_n is drawn from q(· | y_{n−1}) and accepted with probability

α_n(y_{0:n}) = 1 ∧ [π(y_n) ∏_{j=1}^{n} q(y_{j−1} | y_j) ∏_{j=1}^{n−1} {1 − α_j(y_{n:n−j})}] / [π(y_0) ∏_{j=1}^{n} q(y_j | y_{j−1}) ∏_{j=1}^{n−1} {1 − α_j(y_{0:j})}].

If all proposals are rejected up to a certain number N, the next state of the Markov chain is set to the current state y_0. We show in Appendix C the equivalence between the delayed rejection method and the sequential-proposal Metropolis-Hastings algorithm with L = 1 in the case where each proposal is made depending only on the most recent proposal. The delayed rejection method can also use proposal kernels dependent on the sequence of previous proposals to construct a reversible Markov chain with respect to the target distribution. In this case, however, the law of the constructed Markov chain will be different from that obtained by a sequential-proposal Metropolis-Hastings algorithm.
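To illustrate the quadratic cost discussed in the list below, here is a minimal Python sketch of the recursion for α_n in the special case of a symmetric proposal density, where the q-ratios cancel; pi is a placeholder for the unnormalized target density, and memoization is deliberately omitted.

def alpha(pi, y, n=None):
    # Delayed-rejection acceptance probability alpha_n(y_0:n) for a symmetric
    # proposal kernel. Evaluating alpha_n requires the reversed-path terms
    # alpha_j(y_{n:n-j}), so a sweep up to n proposals touches O(n^2) such terms.
    if n is None:
        n = len(y) - 1
    num, den = pi(y[n]), pi(y[0])
    for j in range(1, n):
        num *= 1.0 - alpha(pi, y[n::-1], j)  # alpha_j(y_{n:n-j}): reversed path
        den *= 1.0 - alpha(pi, y, j)         # alpha_j(y_{0:j}): forward path
    return min(1.0, num / den)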

In our view, there are several advantages sequential-proposal Metropolis-Hastings algorithms have over the delayed rejection method:

1. Sequential-proposal Metropolis-Hastings algorithms are more straightforward to implement than the delayed rejection method. The evaluation of α_n(y_{0:n}) in delayed rejection involves the evaluation of a sequence of reversed acceptance probabilities {α_j(y_{n:n−j}) ; j ∈ 1:n−1}. This involves the computation of a total of O(n²) acceptance probabilities. In comparison, sequential-proposal Metropolis-Hastings algorithms only compare the ratio in (8) to a uniform random number Λ for the same task of checking the acceptability of y_n. The algorithmic simplicity of sequential-proposal Metropolis-Hastings facilitates the use of a large number of proposals in each iteration. Moreover, one may choose to take the L-th acceptable proposal for the next state of the Markov chain for a large L > 1.

2. The sequential-proposal MCMC framework can be readily applied to MCMC algorithms using deterministic maps for proposals, as explained in Section 3. In particular, the sequential-proposal MCMC framework applies to Hamiltonian Monte Carlo and the bouncy particle sampler methods, leading to improved numerical efficiency. Applications to these algorithms are discussed in Section 4 and Appendix F. We note that the delayed rejection method has been generalized to algorithms using deterministic maps in Green and Mira (2001), although only the case of a second proposal was discussed.

3. The conceptual simplicity of the sequential-proposal MCMC framework allows for various generalizations and modifications. For example, in Section 4.2, we develop sequential-proposal No-U-Turn sampler algorithms (Algorithms 6 and 7) that automatically adjust the lengths of trajectories leading to proposals in HMC, similarly to the No-U-Turn sampler algorithm proposed by Hoffman and Gelman (2014). The proofs of detailed balance for these algorithms can be obtained by making minor modifications to the proof for the sequential-proposal Metropolis-Hastings algorithms.

3 Sequential-proposal MCMC algorithms using deterministic kernels

The sequential-proposal MCMC framework can be applied to algorithms that use deterministic proposal kernels. MCMC algorithms that employ deterministic proposal kernels often target a distribution on an extended space X × V whose marginal distribution on X is equal to the original target distribution π̄.

Algorithm 3: A sequential-proposal MCMC algorithm using a deterministic kernel

Input: Distribution of the maximum number of proposals and the number of accepted proposals, ν(N, L); Time step length distribution, µ(dτ); Velocity distribution density, ψ(v ; x); Time evolution operators, {S_τ}; Velocity reflection operators, {R_x}; Velocity refreshment probability, p_ref(x); Number of iterations, M
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Initialize: Set X^(0) arbitrarily and draw V^(0) ∼ ψ(· ; X^(0))
for i ← 0 : M−1 do
    Draw (N, L) ∼ ν(·, ·)
    Draw τ ∼ µ(·)
    Draw Λ ∼ unif(0, 1)
    Set X^(i+1) ← X^(i) and V^(i+1) ← R_{X^(i)} V^(i)
    Set n_a ← 0
    Set (Y_0, W_0) ← (X^(i), V^(i))
    for n ← 1 : N do
        Set (Y_n, W_n) ← S_τ(Y_{n−1}, W_{n−1})
        if Λ < [π(Y_n) ψ(W_n ; Y_n)] / [π(Y_0) ψ(W_0 ; Y_0)] · |det DS_τ^n(Y_0, W_0)| then n_a ← n_a + 1
        if n_a = L then
            Set (X^(i+1), V^(i+1)) ← (Y_n, W_n)
            break
        end
    end
    With probability p_ref(X^(i+1)), refresh V^(i+1) ∼ ψ(· ; X^(i+1))
end

An additional variable V drawn from a distribution on V serves as a parameter for the deterministic proposal kernel. In this section, we will explain a general class of MCMC algorithms using deterministic proposal kernels and show how the sequential-proposal scheme can be applied to these algorithms. Applications to specific algorithms, such as HMC or the bouncy particle sampler (BPS), are discussed in subsequent sections (Section 4 and Appendix F).

We suppose that the extended target distribution on X × V has density Π(x, v) with respect to a reference measure denoted by dx dv. We further assume that the original target density π̄ equals the marginal density of Π, such that Π(x, v) = π̄(x) ψ(v ; x) for some ψ(v ; x), the conditional density of v given x. We define a collection of deterministic maps S_τ : X × V → X × V for possibly various values of τ. In HMC and the BPS, S_τ has an analogy with the evolution of a particle in a physical system for a time duration τ. In this analogy, the variable x ∈ X is considered as the position of a particle in the system and the variable v ∈ V as the velocity of the particle. The point S_τ(x, v) then represents the final position-velocity pair of a particle that moves with initial position x and initial velocity v for time τ. We suppose that the map S_τ for each τ satisfies the following condition:

Reversibility condition. There exists a velocity reflection operator R_x : V → V defined for every point x ∈ X such that

R_x ∘ R_x = I   (11)

holds for every x ∈ X, and

[ψ(R_x v ; x) / ψ(v ; x)] · |∂(R_x v)/∂v| = 1   (12)

holds for almost every (x, v) ∈ X × V with respect to the reference measure dx dv. Furthermore, if we define a map T : X × V → X × V as T(x, v) := (x, R_x v), we have

T ∘ S_τ ∘ T ∘ S_τ = I.   (13)

Similar sets of conditions appear routinely in the literature on MCMC (Fang et al., 2014; Vanetti et al., 2017) and on Hamiltonian dynamics (Leimkuhler and Reich, 2004, Section 4.3). In (11) and (13), I denotes the identity map on the corresponding space V or X × V, and the symbol ∘ denotes function composition. In (12), |∂(R_x v)/∂v| denotes the absolute value of the Jacobian determinant of the map R_x at v. The condition (12) is equivalent to the condition that

∫_A Π(x, v) dv = ∫_{R_x(A)} Π(x, v) dv   (14)

for every measurable subset A of V and for almost every x ∈ X, due to the change of variables formula. The condition (13) can be understood as an abstraction of a property of Hamiltonian dynamics: if we reverse the velocity of a particle and advance in time, the particle traces back its past trajectory.

Given X^(i) = x and V^(i) = v at the start of the i-th iteration, an MCMC algorithm can make a deterministic proposal S_τ(x, v), which is accepted with probability

min(1, [Π(S_τ(x, v)) / Π(x, v)] · |det DS_τ(x, v)|),

where D denotes the differential operator (i.e., DS_τ(x, v) = ∂S_τ(x, v)/∂(x, v)). In algorithms such as HMC or the BPS, the extended target density Π(x, v) is often taken as a product of independent densities, π̄(x) ψ(v), where a common choice for ψ(v) is a multivariate normal density. The map S_τ is often taken to preserve the reference measure, such that it has unit Jacobian determinant (i.e., |det DS_τ(x, v)| = 1 for all (x, v)).

The sequential-proposal framework can be used to generalize MCMC algorithms using deterministic kernels in a way similar to how it is applied to Metropolis-Hastings algorithms. A pseudocode of a sequential-proposal MCMC algorithm using a deterministic kernel is shown in Algorithm 3. Proposals are obtained sequentially as (Y_n, W_n) ← S_τ(Y_{n−1}, W_{n−1}), where we write (Y_0, W_0) := (X^(i), V^(i)). The pair (Y_n, W_n) is deemed acceptable if

Λ < [Π(Y_n, W_n) / Π(Y_0, W_0)] · |det DS_τ^n(Y_0, W_0)|,

where S_τ^n = S_τ ∘ · · · ∘ S_τ denotes the map obtained by composing S_τ n times. If there are fewer than L acceptable proposals among the first N proposals, the next state of the Markov chain is set to (X^(i+1), V^(i+1)) ← (X^(i), R_{X^(i)} V^(i)). The velocity V^(i+1) may be refreshed at the end of the iteration by drawing from ψ(· ; X^(i+1)) with a certain probability p_ref(X^(i+1)) that may depend on X^(i+1). The parameter τ for the evolution map S_τ can be drawn randomly. The pseudocode in Algorithm 3 shows the case where τ is drawn once per iteration and the same value is used for all n ∈ 1:N, but τ can also be drawn separately for each n, provided that the draws are independent of each other and of all other random draws in the algorithm.
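As an illustration, here is a minimal Python sketch of one iteration of Algorithm 3 with the velocity refreshment step omitted; log_Pi, S_tau, log_abs_det_DS, and R are user-supplied placeholders for the extended log density, the deterministic map, its log Jacobian determinant, and the velocity reflection operator.

import numpy as np

def sp_deterministic_step(x, v, log_Pi, S_tau, log_abs_det_DS, R, N, L, rng):
    # One iteration of Algorithm 3 (without the optional velocity refreshment).
    log_lam = np.log(rng.uniform())
    y, w = x, v
    log_jac = 0.0                        # accumulates log |det DS_tau^n(Y_0, W_0)|
    n_accepted = 0
    for _ in range(N):
        log_jac += log_abs_det_DS(y, w)  # Jacobians of composed maps multiply
        y, w = S_tau(y, w)
        if log_lam < log_Pi(y, w) - log_Pi(x, v) + log_jac:
            n_accepted += 1
            if n_accepted == L:
                return y, w              # the L-th acceptable proposal
    return x, R(x, v)                    # all rejected: keep x, reflect the velocity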

    We state the following result for Algorithm 3. The proof is given in Appendix D.


Proposition 2. The extended target distribution with density Π(x, v) is a stationary distribution for the Markov chain (X^(i), V^(i))_{i∈1:M} constructed by Algorithm 3. Furthermore, the Markov chain (X^(i))_{i∈1:M} constructed by Algorithm 3, marginally for the x-component, is reversible with respect to the target distribution π̄(x).

    4 Connection to Hamiltonian Monte Carlo methods

    4.1 Sequential-proposal Hamiltonian Monte Carlo

In this section, we consider applications of the sequential-proposal approach described in Section 3 to Hamiltonian Monte Carlo algorithms and discuss their numerical efficiency. We first briefly summarize basic features of HMC algorithms. A function on X × V, called the Hamiltonian, is defined as the negative log density of the extended target density:

H(x, v) := − log Π(x, v) = − log π̄(x) − log ψ(v ; x).   (15)

We assume both X and V are equal to the d dimensional Euclidean space R^d. The velocity distribution ψ(v ; x) is often taken as a multivariate normal density independent of x,

ψ(v ; x) ≡ ψ_C(v) := (2π)^{−d/2} |det C|^{−1/2} exp(−v^T C^{−1} v / 2).

An analogy with a physical Hamiltonian system is drawn by interpreting the first term − log π̄(x) as the static potential energy of a particle and the second term − log ψ(v) as the kinetic energy. In this analogy, the covariance matrix C can be interpreted as the inverse of the mass of the particle. Hamiltonian dynamics is defined as a solution to the Hamiltonian equation of motion (HEM):

dx/dt = C ∂H/∂v,    dv/dt = −C ∂H/∂x.   (16)

If we denote the solution to the HEM as (x(t), v(t)), the exact Hamiltonian flow S*_τ defined by S*_τ(x(0), v(0)) := (x(τ), v(τ)) satisfies the reversibility conditions (12) and (13) when the velocity reflection operator is given by R_x(v) = −v for all x ∈ X and v ∈ V. The map S*_τ preserves the Hamiltonian, that is, H(x, v) = H(S*_τ(x, v)) for all x ∈ X, v ∈ V, and τ ≥ 0. The map S*_τ also preserves the reference measure dx dv, that is, |det DS*_τ(x, v)| = 1 for all x ∈ X, v ∈ V, and τ ≥ 0, which is known as Liouville's theorem (Liouville, 1838). A commonly used numerical approximation method for solving the HEM is called the leapfrog method (Duane et al., 1987; Leimkuhler and Reich, 2004). One iteration of the leapfrog method approximates the time evolution of a Hamiltonian system for duration ε by alternately updating the velocity and position (x, v) as follows:

v ← v + (ε/2) · C · ∇ log π(x)
x ← x + ε v                            (17)
v ← v + (ε/2) · C · ∇ log π(x).

We call the time increment ε the leapfrog step size.

A standard Hamiltonian Monte Carlo algorithm is a specific instance of the MCMC algorithms using deterministic kernels described in Section 3, where the extended target density Π(x, v) is given by π̄(x) ψ_C(v) and the proposal map S_τ is given by l leapfrog jumps with step size ε, such that the time duration parameter τ can be understood as the pair (ε, l). The reversibility conditions (11)–(13) are satisfied by this S_τ with R_x = −I for all x ∈ X.

Algorithm 4: Sequential-proposal HMC and the leapfrog jump function

Input: Leapfrog step size, ε; Number of leapfrog jumps, l; Covariance of the velocity distribution, C
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Run Algorithm 3 with Π(x, v) = π̄(x) ψ_C(v), p_ref(x) = 1, τ := (ε, l), S_τ(x, v) = Leapfrog(x, v, ε, l, C), and R_x = −I.

Function Leapfrog(x, v, ε, l, C)
    v ← v + (ε/2) · C · ∇ log π(x)
    x ← x + ε v
    Set j ← 1
    while j < l do
        v ← v + ε · C · ∇ log π(x)
        x ← x + ε v
        Set j ← j + 1
    end
    v ← v + (ε/2) · C · ∇ log π(x)
end

Each step of the leapfrog method (17) preserves the reference measure dx dv, so we have |det DS_τ| ≡ 1. It is common to refresh the velocity at every iteration (i.e., p_ref(x) ≡ 1).
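For concreteness, a minimal Python version of the Leapfrog function in Algorithm 4, assuming grad_log_pi returns ∇ log π(x) and C is a d × d covariance matrix supplied as a NumPy array.

import numpy as np

def leapfrog(x, v, grad_log_pi, eps, l, C):
    # l leapfrog jumps of step size eps, as in Algorithm 4. The half-step /
    # full-step structure makes the map reversible up to a velocity flip
    # (R_x = -I) and volume preserving (unit Jacobian determinant).
    x, v = x.copy(), v.copy()
    v += 0.5 * eps * C @ grad_log_pi(x)  # initial half step for the velocity
    x += eps * v                         # full step for the position
    for _ in range(l - 1):
        v += eps * C @ grad_log_pi(x)    # interior full steps for the velocity
        x += eps * v
    v += 0.5 * eps * C @ grad_log_pi(x)  # final half step for the velocity
    return x, v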

Sequential-proposal HMC (Algorithm 4) is obtained as a specific case of sequential-proposal MCMC algorithms using deterministic kernels (Algorithm 3) under the same setting, Π(x, v) = π̄(x) ψ_C(v) and S_τ = S_(ε,l). In other words, a proposal (Y_1, W_1) is made by making l leapfrog jumps of size ε starting from (Y_0, W_0), and if the proposal is rejected, a new proposal (Y_2, W_2) is made by making l leapfrog jumps from (Y_1, W_1). The procedure is repeated until L acceptable proposals are found, or until N proposals have been tried, whichever comes sooner. The leapfrog jump size ε and the unit number of jumps l may be re-drawn at every iteration or for every new proposal. As mentioned earlier, Campos and Sanz-Serna (2015) proposed extra chance generalized hybrid Monte Carlo (XCGHMC), which is identical to the sequential-proposal approach, except possibly in the way the velocity is refreshed at the end of each iteration. In generalized HMC (Horowitz, 1991), the velocity is partially refreshed by setting

V^(i+1) = sin θ · V + cos θ · U,

where V is the velocity before refreshment, U is an independent draw from N(0, C), and θ is an arbitrary real number. It was shown in Campos and Sanz-Serna (2015) that Markov chains constructed by XCGHMC have the same law as those constructed by Look Ahead Hamiltonian Monte Carlo (LAHMC) developed by Sohl-Dickstein et al. (2014).
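A minimal sketch of this partial refreshment: since sin²θ + cos²θ = 1 and V and U are independent N(0, C) draws, the refreshed velocity is again N(0, C). Sampling U via a Cholesky factor of C is our illustrative choice.

import numpy as np

def partial_refresh(v, theta, C_chol, rng):
    # Partial velocity refreshment of generalized HMC (Horowitz, 1991):
    # V' = sin(theta) V + cos(theta) U with U ~ N(0, C).
    u = C_chol @ rng.standard_normal(v.shape)
    return np.sin(theta) * v + np.cos(theta) * u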

A major advantage of HMC algorithms over random walk based algorithms such as random walk Metropolis or Metropolis adjusted Langevin algorithms is that HMC can make a global jump in one iteration (Neal, 2011). The leapfrog method is able to build long trajectories that are numerically stable, provided that the target distribution satisfies some regularity conditions and the leapfrog step size is less than a certain upper bound (Leimkuhler and Reich, 2004). Since the solution to the HEM preserves the Hamiltonian, proposals obtained by a numerical approximation to the solution can be accepted with reasonably high probabilities. Given a fixed length of leapfrog trajectory, the number of leapfrog jumps is inversely proportional to the leapfrog step size. Thus an increase in the leapfrog step size leads to a reduced number of evaluations of the gradient of the target density. On the other hand, decreasing the leapfrog step size tends to increase the mean acceptance probability. As ε → 0, the average increment in the Hamiltonian at the end of the leapfrog trajectory scales as ε⁴ (Leimkuhler and Reich, 2004). In an asymptotic scenario where the target distribution is given by a product of d independent, identical low dimensional distributions and d tends to infinity, the increment in the Hamiltonian converges in distribution to a normal distribution with mean µε⁴d and variance 2µε⁴d for some constant µ > 0 dependent on the target density π̄ (Gupta et al., 1990; Neal, 2011). Beskos et al. (2013) showed under some mild regularity conditions on the target density that as ε = ε_0 d^{−1/4} and d → ∞, the mean acceptance probability tends to a(ε_0) := 2Φ(−ε_0² √(µ/2)), where Φ(·) denotes the cdf of the standard normal distribution. The computational cost for obtaining an accepted proposal that is a fixed distance away from the current state in HMC is approximately given by

1 / (ε_0 a(ε_0)),

which is minimized when a(ε_0) = 0.651 to three decimal places (Beskos et al., 2013; Neal, 2011). Empirical results also support targeting a mean acceptance probability of around 0.65 (Sexton and Weingarten, 1992; Neal, 1994). HMC using sequential proposals can improve on the numerical efficiency by increasing the probability that the constructed Markov chain makes a nonzero move at each iteration. A numerical study in Section 4.4 shows that HMC with sequential proposals leads to higher effective sample sizes per computation time compared to the standard HMC on a toy model.
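As a quick numerical check of this optimum, one can minimize the cost 1/(ε_0 a(ε_0)) directly; a short SciPy-based sketch follows (the optimal value of a(ε_0) does not depend on µ, so µ = 1 is used).

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def cost(eps0, mu=1.0):
    # cost per accepted proposal: 1 / (eps0 * a(eps0)), with a(eps0) = 2 Phi(-eps0^2 sqrt(mu/2))
    a = 2 * norm.cdf(-eps0**2 * np.sqrt(mu / 2))
    return 1.0 / (eps0 * a)

res = minimize_scalar(cost, bounds=(0.1, 3.0), method="bounded")
print(2 * norm.cdf(-res.x**2 / np.sqrt(2)))  # prints approximately 0.651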

    4.2 Sequential-proposal No-U-Turn sampler algorithms

As previously mentioned, a key advantage of HMC over random walk based methods comes from its ability to make long moves. If the number of leapfrog jumps is too small, the Markov chain from HMC may essentially behave like a random walk, because the velocity is randomly refreshed before a long leapfrog trajectory is built. Conversely, if the number of leapfrog jumps is too large, the trajectory may double back on itself, since the solution to the Hamiltonian equation of motion is confined to a level set of the Hamiltonian. However, simply stopping the leapfrog jumps when the trajectory starts doubling back on itself generally destroys the detailed balance of the Markov chain with respect to the target distribution. In order to solve this issue, Hoffman and Gelman (2014) proposed the No-U-Turn sampler (NUTS). In this section, we will briefly explain the NUTS algorithm and discuss its connection with the sequential-proposal framework. In addition, we will propose two new algorithms that address the same issue of trajectory doubling.

In the No-U-Turn sampler, leapfrog trajectories are repeatedly doubled in size in either the forward or backward direction in the form of binary trees, until a "U-turn" is observed (see Figure 1). The binary tree starts from the initial node (X^(i), V), where V is the velocity drawn from the standard multivariate normal distribution at the beginning of the i-th iteration. The direction of binary tree expansion is determined by a sequence of unif({−1, 1}) variables denoted by (σ_j)_{j≥0}. The expansion of the binary tree stops if a U-turn is observed between the two leaf nodes on opposite sides of any of the sub-binary trees of the current tree. A position-velocity pair (x, v) and another pair (x′, v′) that is ahead of (x, v) on a leapfrog trajectory are said to satisfy the U-turn condition if either

(x′ − x) · v′ ≤ 0   or   (x′ − x) · v ≤ 0,   (18)

where · denotes the inner product in Euclidean spaces. If there is a U-turn within the last added half of the current binary tree, the other half without a U-turn is taken as the final binary tree. On the other hand, if a U-turn is only observed between the two opposite leaf nodes of the current binary tree but not within any of its sub-binary trees, the current binary tree is taken as the final binary tree. The next state of the Markov chain X^(i+1) is set to one of the acceptable leaf nodes in the final binary tree. A leaf node (x, v) is deemed acceptable if Π(x, v)/Π(X^(i), V) > Λ. Here Π(x, v) := π̄(x) ψ_{I_d}(v), where ψ_{I_d} denotes the density of the d dimensional standard normal distribution. Hoffman and Gelman (2014) give two versions of the NUTS algorithm. The naive version selects the next state of the Markov chain uniformly at random among the acceptable leaf nodes in the final binary tree.
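A small Python helper for the U-turn condition (18); the function name is ours.

import numpy as np

def u_turn(x, v, x_prime, v_prime):
    # (x', v') lies ahead of (x, v) on the leapfrog trajectory; a U-turn occurs
    # if the displacement has non-positive inner product with the velocity at
    # either end, as in condition (18).
    dx = x_prime - x
    return np.dot(dx, v_prime) <= 0.0 or np.dot(dx, v) <= 0.0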


Algorithm 5: The No-U-Turn samplers by Hoffman and Gelman (2014)

Input: Leapfrog step size, ε
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Initialize: Set X^(0) arbitrarily
for i ← 0 : M−1 do
    Draw Λ ∼ unif(0, 1) and V ∼ N(0, I_d)
    Start with an initial tree T^0 := {(X^(i), V)} having a single leaf
    for j ≥ 1 do
        Draw σ_j ∼ unif({−1, 1})
        Make 2^{j−1} leapfrog jumps either forward or backward depending on σ_j, forming a new binary tree T′ of the same size as T^{j−1}
        if every sub-binary tree of T′ is such that the two leaves on its opposite sides do not satisfy the U-turn condition (18) then
            Set T^j ← T^{j−1} ∪ T′
        else
            break
        end
        if the two opposite leaves of T^j satisfy the U-turn condition then
            break
        end
    end
    Let T^{j_0} be the final binary tree constructed
    Naive NUTS (Algorithm 2 in Hoffman and Gelman (2014)): take for X^(i+1), uniformly at random, one of the leaf nodes (x, v) of T^{j_0} that are acceptable, i.e., Π(x, v)/Π(X^(i), V) > Λ
    Efficient NUTS (Algorithm 3 in Hoffman and Gelman (2014)): denote by n_a(T) the number of acceptable leaf nodes in a binary tree T, and
    for j ← j_0 : 0 do
        With probability 1 ∧ n_a(T^j \ T^{j−1}) / n_a(T^{j−1}), take for X^(i+1) one of the acceptable leaf nodes of T^j \ T^{j−1} uniformly at random, and break out from the for loop
    end
end

The efficient version preferentially selects a random leaf node in sub-binary trees that are added later. A pseudocode for these NUTS algorithms is given in Algorithm 5. By construction, for every leaf node in the final binary tree, it is possible to build the same final binary tree starting from that leaf node using a unique sequence of directions. Since each direction is drawn from unif({−1, 1}), the probability of constructing the final binary tree is the same when started from any of its leaf nodes. This symmetric relationship ensures that the constructed Markov chain is reversible with respect to the target distribution.

The NUTS algorithm shares with the sequential-proposal MCMC framework the key feature that the decisions of acceptance or rejection of proposals are mutually coupled via a single uniform(0, 1) random variable drawn at the start of each iteration. Furthermore, the naive version of the algorithm (Algorithm 2 in Hoffman and Gelman (2014)) can be viewed as a specific case of the sequential-proposal MCMC algorithm as follows. At each iteration a binary tree starting from (X^(i), V) is expanded until a U-turn is observed, as described above. Proposals are made sequentially by selecting one of the leaf nodes of the final binary tree uniformly at random. The first proposal that is acceptable is taken as the next state of the Markov chain. Since the next state of the Markov chain is then selected uniformly at random among the acceptable leaf nodes in the final binary tree, this sequential-proposal approach is equivalent to the naive NUTS.

There are two features of the NUTS algorithm that may, unfortunately, compromise the numerical efficiency. First, the point chosen for the next state of the Markov chain is generally not the farthest point on the leapfrog trajectory from the initial point. The NUTS typically constructs a leapfrog trajectory that is longer than the distance between the initial point and the point selected for the next state of the Markov chain, due to the requirement of detailed balance. Second, the NUTS evaluates the log target density at every point on the constructed leapfrog trajectory to determine acceptability. This can result in a substantial overhead if the computational cost of evaluating the log target density is at least comparable to that of evaluating the gradient of the log target density. We propose two alternative No-U-Turn sampling algorithms, which we call spNUTS1 and spNUTS2, addressing these two issues.

Figure 1: An example diagram of a final binary tree constructed in an iteration of the NUTS algorithm by Hoffman and Gelman (2014). The numbered circles indicate the points along a leapfrog trajectory in the order they are added. The binary tree stops expanding at T^3 because there is a U-turn between leaf nodes 4 and 8. The next state of the Markov chain is selected randomly among the acceptable states, colored in yellow.

In spNUTS1, leapfrog trajectories are extended in one direction according to a given length schedule until a U-turn is observed, and only the endpoint of the trajectory is checked for acceptability. If the endpoint is not acceptable, a new trajectory is started from that point with a refreshed velocity. A pseudocode of spNUTS1 is given in Algorithm 6. At the start of each iteration, a velocity vector is drawn from a multivariate normal distribution N(0, C), where C is a d × d positive definite matrix. A leapfrog trajectory started from the current state of the Markov chain and the drawn velocity vector, denoted by (x_0, v_0), is repeatedly extended in units of l jumps. The position-velocity pair after lk leapfrog jumps is denoted by (x_k, v_k). We note that the leapfrog updates (17) should also use the same matrix C as the covariance of the velocity distribution. At preset checkpoints determined by a finite increasing sequence (b_j)_{j∈1:j_max}, the algorithm calculates the angles between the displacement x_{b_j} − x_0 and the velocities v_0 and v_{b_j}. In order to take into account the given covariance structure C, we define the C-norm of a vector x ∈ R^d as

‖x‖_C := √(x^T C^{−1} x),

and the cosine of the angle between two vectors x and x′ as

cosAngle(x, x′ ; C) := (x^T C^{−1} x′) / (‖x‖_C · ‖x′‖_C).   (19)

The leapfrog trajectory stops at (x_{b_j}, v_{b_j}) if either of the following inequalities holds for a given c:

cosAngle(x_{b_j} − x_0, v_0 ; C) ≤ c   or   cosAngle(x_{b_j} − x_0, v_{b_j} ; C) ≤ c.   (20)

The value of c can be fixed at a constant value or randomly drawn for each trajectory. Algorithm 6 describes the case where c is randomly drawn from a distribution denoted by ζ. If the stopping condition (20) is not satisfied up to j = j_max, the trajectory stops at (x_{b_{j_max}}, v_{b_{j_max}}).
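A small Python helper for the C-norm and the cosine angle (19), together with the stopping check (20); the function names are ours, and C_inv is assumed to be a precomputed C⁻¹.

import numpy as np

def cos_angle(x, x_prime, C_inv):
    # Cosine of the angle between x and x' in the metric induced by C (eq. (19)).
    inner = x @ C_inv @ x_prime
    norms = np.sqrt(x @ C_inv @ x) * np.sqrt(x_prime @ C_inv @ x_prime)
    return inner / norms

def stop_trajectory(x0, v0, xb, vb, C_inv, c):
    # Stopping condition (20): stop at (x_b, v_b) if the displacement from x0
    # makes too wide an angle with either the initial or the current velocity.
    dx = xb - x0
    return cos_angle(dx, v0, C_inv) <= c or cos_angle(dx, vb, C_inv) <= c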

Algorithm 6: Sequential-proposal No-U-Turn sampler, Type 1 (spNUTS1)

Input: Leapfrog step size, ε; Unit number of leapfrog jumps, l; Covariance of the velocity distribution, C; Scheduled checkpoints for a U-turn, (b_j)_{j∈1:j_max}; Distribution for the stopping value of the cosine angle, ζ; Maximum number of proposals tried, N
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Initialize: Set X^(0) arbitrarily
for i ← 0 : M−1 do
    Set Y_0 ← X^(i) and draw W_0 ∼ N(0, C)
    Set X^(i+1) ← X^(i)
    Draw Λ ∼ unif(0, 1) and set H_max ← − log π(Y_0) + (1/2)‖W_0‖²_C − log Λ
    for n ← 1 : N do
        Draw c ∼ ζ(·)
        (Y_n, W′_n) ← spNUTS1Kernel(Y_{n−1}, W_{n−1}, c)
        if − log π(Y_n) + (1/2)‖W′_n‖²_C < H_max then
            Set X^(i+1) ← Y_n
            break
        end
        Draw U ∼ N(0, C) and set W_n ← U · ‖W′_n‖_C / ‖U‖_C
    end
end

Function spNUTS1Kernel(x_0, v_0, c)
    for k ← 1 : b_1 do
        (x_k, v_k) ← Leapfrog(x_{k−1}, v_{k−1}, ε, l, C)
    end
    Set j ← 1
    while cosAngle(x_{b_j} − x_0, v_0 ; C) > c and cosAngle(x_{b_j} − x_0, v_{b_j} ; C) > c and j < j_max do
        Set j ← j + 1
        for k ← b_{j−1}+1 : b_j do
            (x_k, v_k) ← Leapfrog(x_{k−1}, v_{k−1}, ε, l, C)
        end
    end
    if cosAngle(x_{b_j} − x_{b_j − b_{j′}}, v_{b_j} ; C) > c and cosAngle(x_{b_j} − x_{b_j − b_{j′}}, v_{b_j − b_{j′}} ; C) > c for all j′ ∈ 1 : j−1 then
        return (x_{b_j}, v_{b_j})
    else
        return (x_0, v_0)
    end
end

The final state of the stopped trajectory, (x_{b_j}, v_{b_j}), makes the first proposal (Y_1, W′_1). It is taken as the next state in the Markov chain if the following two conditions are met. First, the state (x_{b_j}, v_{b_j}) has to be acceptable by satisfying

log Λ < log π(x_{b_j}) + log ψ_C(v_{b_j}) − log π(x_0) − log ψ_C(v_0).   (21)

Since the Hamiltonian of the state (x_{b_j}, v_{b_j}) is given by − log π̄(x_{b_j}) − log ψ_C(v_{b_j}), the acceptability criterion (21) can be interpreted as requiring that the increase in the Hamiltonian compared to the initial state (x_0, v_0) is at most − log Λ.

Figure 2: An example diagram for an iteration in spNUTS1 where b_j = 2^{j−1}. The first proposal (Y_1, W′_1) = (x_16, v_16) was rejected, and the second trajectory was started with a refreshed velocity W_1. The pairs of points for which the U-turn condition is checked are connected by dashed line segments.

The second required condition is that

cosAngle(x_{b_j} − x_{b_j − b_{j′}}, v_{b_j} ; C) > c  and  cosAngle(x_{b_j} − x_{b_j − b_{j′}}, v_{b_j − b_{j′}} ; C) > c  for all 1 ≤ j′ ≤ j−1.   (22)

Since the trajectory has been extended to (x_{b_j}, v_{b_j}), the stopping condition (20) was not satisfied between the initial state (x_0, v_0) and any of the previously visited states {(x_{b_{j′}}, v_{b_{j′}}) ; 1 ≤ j′ ≤ j−1}.

Algorithm 7: Sequential-proposal No-U-Turn sampler, Type 2 (spNUTS2)

Input: Leapfrog step size, ε; Unit number of leapfrog jumps, l; Covariance of the velocity distribution, C; Maximum number of proposals, N; Scheduled checkpoints for a U-turn, (b_j)_{j∈1:j_max}; Distribution for the stopping value of the cosine angle, ζ
Output: A draw of the Markov chain, (X^(i))_{i∈1:M}

Initialize: Set X^(0) arbitrarily
for i ← 0 : M−1 do
    Draw V ∼ N(0, C)
    Draw c ∼ ζ(·)
    Draw Λ ∼ unif(0, 1) and set ∆ ← − log Λ
    Set (X^(i+1), V^(i+1)) ← spNUTS2Kernel(X^(i), V, ∆, ε, C, c)
end

Function spNUTS2Kernel(x_0, v_0, ∆, ε, C, c)
    Set H_max ← − log π(x_0) + (1/2)‖v_0‖²_C + ∆
    for k ← 1 : b_1 do
        (x_k, v_k, f) ← FindNextAcceptable(x_{k−1}, v_{k−1}, ε, H_max, C)
        if f = 0 then return (x_0, v_0)   // the case where no acceptable states were found
    end
    Set j ← 1
    while cosAngle(x_{b_j} − x_0, v_0 ; C) > c and cosAngle(x_{b_j} − x_0, v_{b_j} ; C) > c and j < j_max do
        Set j ← j + 1
        for k ← b_{j−1}+1 : b_j do
            (x_k, v_k, f) ← FindNextAcceptable(x_{k−1}, v_{k−1}, ε, H_max, C)
            if f = 0 then return (x_0, v_0)
        end
    end
    if cosAngle(x_{b_j} − x_{b_j − b_{j′}}, v_{b_j} ; C) > c and cosAngle(x_{b_j} − x_{b_j − b_{j′}}, v_{b_j − b_{j′}} ; C) > c for all j′ ∈ 1 : j−1 then
        return (x_{b_j}, v_{b_j})
    else
        return (x_0, v_0)
    end
end

Function FindNextAcceptable(x, v, ε, H_max, C)
    Set (x_try, v_try) ← (x, v)
    for n ← 1 : N do
        (x_try, v_try) ← Leapfrog(x_try, v_try, ε, l, C)
        if − log π(x_try) + (1/2)‖v_try‖²_C < H_max then return (x_try, v_try, 1)
    end
    return (x, v, 0)
end

the log target density ξ times and checks the U-turn condition ξ times on average. Thus the average computational cost for one iteration of the NUTS algorithm is given by (l·c_O + c_π + c_U)·ξ. In comparison, spNUTS1 evaluates the log target density once and checks the U-turn condition 2 log₂ ξ + 1 times if b_j = 2^{j−1} for j ∈ 1:j_max. The average computational cost of obtaining a proposal in spNUTS1 is given by l·c_O·ξ + c_π + c_U·(2 log₂ ξ + 1), and the average cost of finding a new state for the Markov chain different from the current state is roughly given by (1/(a·ã))·(l·c_O·ξ + c_π + c_U·(2 log₂ ξ + 1)),

Figure 3: An example diagram for an iteration in spNUTS2 where b_j = 2^{j−1}. Acceptable states are marked by filled circles and unacceptable ones by empty circles. The pairs of states for which the U-turn condition is checked are indicated by dashed line segments. The eighth acceptable state x_8 is taken as the next state of the Markov chain.

where a denotes the mean acceptance probability of a proposal and ã denotes the average probability that the symmetry condition is satisfied. Both a and ã can be made close to unity in practice, so there is a computational gain in using spNUTS1 over the original NUTS if ξ is large and c_π is at least comparable to c_O. The number l can be chosen to be one unless there is an issue of numerical instability of leapfrog trajectories. We note that the increase in the cost by a factor of 1/a can be partially negated, in terms of the overall numerical efficiency, due to the fact that if a proposal is deemed unacceptable, the next proposal can be farther away from the initial state Y_0. The average distance between two consecutive states in the constructed Markov chain is a measure widely used to evaluate the numerical efficiency of an MCMC algorithm (Sherlock et al., 2010).

The proof of the following proposition is given in Appendix E.

Proposition 3. The Markov chain (X^(i))_{i∈1:M} constructed by the sequential-proposal No-U-Turn sampler of type 1 (spNUTS1, Algorithm 6) is reversible with respect to the target density π̄.

    Another algorithm that automatically tunes the lengths of leapfrog trajectories, called sp-NUTS2, is given in Algorithm 7. Unlike spNUTS1, spNUTS2 applies the sequential-proposalscheme within one trajectory. The spNUTS2 algorithm takes the endpoint of the constructedleapfrog trajectory as a candidate for the next state of the Markov chain, as in spNUTS1. How-ever, it evaluates the log target density at every point on the trajectory like the original NUTS.Starting from the current state of the Markov chain X(i) =x0 and a velocity vector v0 randomlydrawn from ψC , the algorithm extends a leapfrog trajectory in units of l leapfrog jumps. We willdenote by (x1, v1) the first acceptable state along the trajectory that is a multiple of l leapfrogjumps away from the initial state. Here, (x1, v1) is acceptable if

    Λ <π(x1)ψC(v1)

    π(x0)ψC(v0).

    In order to avoid indefinitely extending the trajectory when the leapfrog approximation is numer-ically unstable, the algorithm ends the attempt to find the next acceptable state if N consecutivestates at intervals of l leapfrog jumps are all unacceptable. In this case, the next state of the Markovchain is set to (x0, v0). For k≥ 2, the state (xk, vk) is likewise found as the first acceptable statealong the leapfrog trajectory that is a multiple of l jumps from (xk−1, vk−1). If for any k≥ 1 thenext acceptable state is not found in N consecutive states visited after (xk−1, vk−1), the next statein the Markov chain is also set to (x0, v0). In practice, however, this situation can be avoided bytaking the leapfrog step size � reasonably small to ensure numerical stability and N large enough.The algorithm takes a preset increasing sequence of integers (bj)j∈1 : jmax and checks if the anglesbetween the displacement vector xbj −x0 and the initial and the last velocity vectors v0 and vbjare below a certain level c. The trajectory is stopped at (xbj , vbj ) if either

$$\mathrm{cosAngle}(x_{b_j} - x_0,\, v_0\,; C) \le c \quad\text{or}\quad \mathrm{cosAngle}(x_{b_j} - x_0,\, v_{b_j}\,; C) \le c. \tag{23}$$


Upon reaching (x_{b_{j_max}}, v_{b_{j_max}}), however, the trajectory stops regardless of whether (23) is satisfied for j = j_max. As in spNUTS1, a symmetry condition is checked to ensure detailed balance. That is, the state (x_{b_j}, v_{b_j}) is taken as the next state in the Markov chain if and only if

$$\mathrm{cosAngle}(x_{b_j} - x_{b_j - b_{j'}},\, v_{b_j}\,; C) > c \quad\text{and}\quad \mathrm{cosAngle}(x_{b_j} - x_{b_j - b_{j'}},\, v_{b_j - b_{j'}}\,; C) > c \quad\text{for all } 1 \le j' \le j-1. \tag{24}$$

If the symmetry condition is not satisfied, the next state of the Markov chain is set to (x_0, v_0). As in spNUTS1, the choice of b_j = 2^{j−1}, j ∈ 1:j_max, allows the symmetry condition in spNUTS2 to be satisfied with high probability and makes the checkpoints for the symmetry condition, {b_j − b_{j'} ; j' ≤ j−1}, readily predictable. A sketch of these checks in code is given below.
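The following R sketch illustrates how the U-turn check (23) and the symmetry check (24) can be coded. It assumes cosAngle(a, b; C) is the cosine of the angle between a and b in the inner product induced by C⁻¹ (consistent with C being the velocity covariance; the precise definition is given earlier in the paper), and that positions and velocities along the trajectory are stored column-wise with index 0 in column 1. All names are illustrative.

```r
## Cosine of the angle between a and b under the C^{-1} inner product
## (an assumed reading of cosAngle(a, b; C); see the definition in the text).
cos_angle <- function(a, b, Cinv) {
  num <- drop(t(a) %*% Cinv %*% b)
  num / sqrt(drop(t(a) %*% Cinv %*% a) * drop(t(b) %*% Cinv %*% b))
}

## U-turn condition (23) at checkpoint b_j: stop if either cosine is <= c.
stops_at <- function(xs, vs, bj, Cinv, c) {
  dx <- xs[, bj + 1] - xs[, 1]       # x_{b_j} - x_0; column 1 holds index 0
  cos_angle(dx, vs[, 1], Cinv) <= c || cos_angle(dx, vs[, bj + 1], Cinv) <= c
}

## Symmetry condition (24) at b_j, checked at b_j - b_{j'} for j' = 1, ..., j-1;
## bs is the vector (b_1, ..., b_jmax), so with b_j = 2^{j-1} the checkpoints
## are readily available.
symmetric_at <- function(xs, vs, bs, j, Cinv, c) {
  bj <- bs[j]
  all(vapply(seq_len(j - 1), function(jp) {
    k <- bj - bs[jp]
    dx <- xs[, bj + 1] - xs[, k + 1]
    cos_angle(dx, vs[, bj + 1], Cinv) > c && cos_angle(dx, vs[, k + 1], Cinv) > c
  }, logical(1)))
}
```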

When c_π, c_O, and c_U denote the same average computational costs as before and b_j = 2^{j−1}, the average computational cost of finding a distinct sample point for the Markov chain using spNUTS2 is roughly given by
$$\frac{1}{\tilde{a}} \bigl\{ (l c_O + c_\pi)\,\xi + c_U \bigl( 2 \log_2(a\xi) + 1 \bigr) \bigr\},$$
where ξ denotes the average length of stopped trajectories in units of l leapfrog jumps, ã the average probability that the symmetry condition is satisfied, and a the mean acceptance probability. Since ã can be close to unity and c_U is often smaller than c_π or c_O in practice, the computational cost of spNUTS2 per distinct sample is comparable to that of the NUTS. However, the overall numerical efficiency of spNUTS2 can be higher because the average distance between the current and the next state of the Markov chain can be larger.

    The proof of the following proposition is also given in Appendix E.

Proposition 4. The Markov chain (X^{(i)})_{i∈1:M} constructed by the sequential-proposal No-U-Turn sampler of type 2 (spNUTS2, Algorithm 7) is reversible with respect to the target density π̄.

    4.3 Adaptive tuning of parameters in HMC

Adaptively tuning parameters in MCMC algorithms using the history of the Markov chain can often lead to enhanced numerical efficiency (Haario et al., 2001; Andrieu and Thoms, 2008). Here we discuss adaptive tuning of some parameters in HMC algorithms. As discussed in Section 4.1, tuning the leapfrog step size ε is one of the critical decisions to make in running HMC. Numerical efficiency of HMC algorithms can be increased by targeting an average acceptance probability that is away from both zero and one (Beskos et al., 2013). Since the mean acceptance probability tends to increase with decreasing step size, we use the following recursive formula to update the step size,
$$\log \epsilon_{i+1} \leftarrow \log \epsilon_i + \frac{\lambda}{i^{\alpha}} (a_i - a^*), \tag{25}$$
where ε_i and a_i denote the leapfrog step size and the acceptance probability of a proposal at the i-th iteration, and a^* the target mean acceptance probability. We follow a standard approach for the sequence of adaptation sizes by taking α ∈ (0, 1] and λ > 0 (Andrieu and Thoms, 2008).
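As a concrete illustration, the update (25) is a single stochastic-approximation step per iteration. The R sketch below uses illustrative default values for λ, α, and a^*; only the recursion itself is taken from the text.

```r
## Recursive step-size update (25): move log(eps) up or down depending on
## whether the observed acceptance probability a_i exceeds the target a_star.
adapt_log_eps <- function(log_eps, a_i, i, a_star = 0.75, lambda = 1, alpha = 0.7) {
  log_eps + (lambda / i^alpha) * (a_i - a_star)
}

## Usage inside an MCMC loop (illustrative):
## log_eps <- adapt_log_eps(log_eps, a_i, i)
## eps_i   <- exp(log_eps)
```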

Tuning the covariance C of the velocity distribution can also increase the numerical efficiency of HMC algorithms. If the marginal distributions of the target density π̄ along different directions have standard deviations that differ by orders of magnitude, the size of leapfrog jumps should typically be on the order of the smallest standard deviation in order to avoid numerical instability (Neal, 2011). In this case a large number of leapfrog jumps are needed to make a global move in the direction having the largest standard deviation. Figure 4a shows an example of a leapfrog trajectory when the target distribution is N(0, Σ), where the standard deviation of one principal component of Σ is twenty times larger than the other. In this diagram, the leapfrog trajectory takes about 120 jumps to move across the less constrained direction from one end of a level set to the other. Choosing a covariance C for the velocity distribution close to the covariance of the target distribution can substantially reduce the number of leapfrog jumps needed to explore the sample space in every direction (Neal, 2011, Section 5.4.1). Figure 4b shows that a leapfrog trajectory can loop around the level set with only fifteen jumps when the covariance of the velocity distribution C is equal to Σ.


Figure 4: Two leapfrog trajectories for an ill-conditioned Gaussian distribution with covariance Σ for two different choices of C: (a) 120 leapfrog jumps with C = I; (b) 15 leapfrog jumps with C = Σ. In both cases, the leapfrog jump size ε was 0.5. A level set of the target density is shown as a dashed ellipsoid.

We note that the covariance C affects not only the velocity distribution and the leapfrog updates, but also the U-turn condition (23) for NUTS-type algorithms (the original NUTS, spNUTS1, and spNUTS2) via the cosAngle function.

For adaptive tuning, the covariance C_i used at the i-th iteration can be set equal to the sample covariance of the Markov chain sampled up to the previous iteration. During initial iterations, a fixed covariance C_0 can be used to avoid numerical instability (Haario et al., 2001):
$$C_i \leftarrow \begin{cases} C_0 & i \le i_0, \\ \text{sample covariance of } \{X^{(j)} \,;\, j \le i-1\} & i > i_0. \end{cases}$$

It is possible to take C_i as a diagonal matrix whose diagonal entries are given by the sample marginal variances of each component of the Markov chain (Haario et al., 2005). This approach is effective when the components have different scales. The computational cost can be substantially reduced by using a diagonal covariance matrix when the target distribution is high dimensional, because operations such as the Cholesky decomposition of C_i can be avoided. Marginal sample variances can be updated with little overhead at each iteration using a recursive formula, as sketched below.
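One such recursive formula is Welford's online update, shown below as a minimal R sketch; the running-state representation and the function name are illustrative.

```r
## Welford's recursive update of marginal means and variances after seeing the
## n-th sample x (a d-vector); mean and m2 summarize the first n - 1 samples,
## with m2 holding componentwise sums of squared deviations.
update_marginal_var <- function(mean, m2, x, n) {
  delta <- x - mean
  mean  <- mean + delta / n
  m2    <- m2 + delta * (x - mean)      # O(d) work per iteration
  list(mean = mean, m2 = m2,
       var = if (n > 1) m2 / (n - 1) else rep(NA_real_, length(x)))
}
## The diagonal entries of C_i can then be set to `var` once i > i_0.
```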

    4.4 Numerical examples

    4.4.1 Multivariate normal distribution

We used two examples to study the numerical efficiency of various algorithms discussed in this paper. We first considered a one hundred dimensional normal distribution N(0, Σ) where the covariance matrix Σ is diagonal and the marginal standard deviations form a uniformly increasing sequence from 0.01 to 1.00.

We compared the numerical efficiency of the following five algorithms, first without adaptively tuning the covariance of the velocity distribution C: the standard HMC, HMC with sequential proposals (abbreviated as spHMC, which is equivalent to XCGHMC in Algorithm 4), the NUTS algorithm by Hoffman and Gelman (2014) (the efficient version), spNUTS1 (Algorithm 6), and spNUTS2 (Algorithm 7). All experiments were carried out using the implementation of the algorithms in R (R Core Team, 2018). The source codes are available at https://github.com/joonhap/spMCMC. The covariance matrix C of the velocity distribution was set equal to the one hundred dimensional identity matrix. The leapfrog step size ε was adaptively tuned using (25) with α = 0.7 and target acceptance probabilities a^* varying from 0.45 to 0.95. The adaptation started from the one hundredth iteration.



[Figure 5 plots: six columns of panels, one per algorithm (HMC, spHMC, NUTS, spNUTS1 (N=1), spNUTS1 (N=5), spNUTS2); rows show minimum ESS per second, average ESS per second, and runtime against the target acceptance probability.]

Figure 5: The minimum and average effective sample sizes of constructed Markov chains across d = 100 variables per second of runtime for the target distribution N(0, Σ) when the covariance C of the velocity distribution was fixed. The target acceptance probabilities are shown on the x-axis. The runtimes in seconds are shown in the bottom row of plots. All y-axes are on logarithmic scales.

The acceptance probability at the i-th iteration, a_i, was computed using the state that was one leapfrog jump away from the current state of the Markov chain, to ensure that the leapfrog jump size ε converges to the same value for the same target acceptance probability across the various algorithms. When running HMC, spHMC, spNUTS1, and spNUTS2, the leapfrog step size was randomly perturbed at each iteration by multiplying ε_i by a uniform random number between 0.8 and 1.2. Randomly perturbing the leapfrog step size can improve the mixing of the Markov chain constructed by HMC algorithms (Neal, 2011). For the NUTS, we found perturbing the leapfrog step size did not improve numerical efficiency and thus used ε_i. In HMC and spHMC, each proposal was obtained by making fifty leapfrog jumps. In spHMC, a maximum of N = 10 proposals were tried in each iteration and the first acceptable proposal was taken as the next state of the Markov chain (i.e., L = 1). In spNUTS1 and spNUTS2, the stopping condition was checked according to the schedule b_j = 2^{j−1} for j ∈ 1:j_max with j_max = 15, and the unit number of leapfrog trajectories was set to one (i.e., l = 1 in Algorithms 6 and 7). The value c in the stopping condition (23) was randomly drawn from a uniform(0, 1) distribution for each trajectory, as we found randomizing c yielded better numerical results than fixing it at zero. For the NUTS algorithm, randomizing c did not improve the numerical efficiency, so each trajectory was stopped when the cosine angle fell below zero (i.e., c = 0), as in Hoffman and Gelman (2014). In spNUTS1, the maximum number of proposals N in each iteration was set to either one or five. In spNUTS2, a maximum of N = 20 consecutive states on a leapfrog trajectory were tried in each attempt to find an acceptable state. Every algorithm ran for M = 20,200 iterations. As a measure of numerical efficiency, the effective sample size (ESS) of each component of the Markov chain was computed using an estimate of the spectral density at frequency zero via the effectiveSize function in the R package coda (Plummer et al., 2006). The first two hundred states of the Markov chains were discarded when computing the effective sample sizes. Each experiment was independently repeated ten times. All computations were carried out using the Boston University Shared Computing Cluster.


Figure 5 shows both the minimum and the average effective sample size for the one hundred variables divided by the runtime in seconds when the covariance C of the velocity distribution was fixed at the identity matrix. We observed large variations in the effective sample sizes among the d = 100 variables for the Markov chains constructed by HMC and spHMC, resulting in minimum ESSs much smaller than average ESSs. This happened because for some variables the leapfrog trajectories with fifty jumps consistently tended to return to states close to the initial positions. The Markov chains mixed slowly in these variables. On the other hand, the leapfrog trajectories tended to reach the opposite side of the level set of the Hamiltonian for some variables, for which the autocorrelation at lag one was close to −1. For these variables, the effective sample size was greater than the length of the Markov chain M. There was also much variation in the effective sample size among the variables for the Markov chains constructed by spNUTS1 and spNUTS2 when the stopping cosine angle c was fixed at zero, but the variation diminished when c was varied uniformly in the interval (0, 1).

The highest value of the minimum ESS per second achieved by spHMC among the various values of the target acceptance probability was about fifty percent higher than that achieved by the standard HMC. For this multivariate normal distribution, the number of leapfrog jumps l = 50 for HMC and spHMC was within the range of the average number of jumps in the leapfrog trajectories constructed by the NUTS, spNUTS1, and spNUTS2 algorithms. Thus the effective sample sizes by HMC and spHMC were comparable to those by the other three algorithms, but the runtimes tended to be shorter. The highest minimum ESS per second by spNUTS1 with N = 5 was 7.6 times higher than that by the NUTS and 6.9 times higher than that by spNUTS2. The runtimes of the NUTS were more than ten times longer than those of spNUTS1 and twice as long as those of spNUTS2. This happened because the evaluation of the gradient of the log target density took much less computation time than the evaluation of the log target density for this example. The highest minimum ESS per second by spNUTS1 when up to five sequential proposals were made (i.e., N = 5) was twenty percent higher than when only one proposal was made (N = 1).

Next we ran the NUTS, spNUTS1, and spNUTS2 algorithms for the same target distribution N(0, Σ), but with adaptive tuning of the covariance C of the velocity distribution. The covariance C_i at the i-th iteration for i ≥ 100 was set to a diagonal matrix whose diagonal entries were given by the marginal sample variances of the Markov chain constructed up to that point. We did not test HMC or spHMC when C was adaptively tuned, because the leapfrog step size, and thus the total length of the leapfrog trajectory with a fixed number of jumps, varied depending on the tuned values of C. Figure 6 shows the minimum and average ESS among d = 100 variables divided by the runtime in seconds. For the NUTS, the highest minimum ESS per second improved more than fifty-fold compared to when the covariance C was fixed. There was more than a five-fold improvement for spNUTS1, and more than a 25-fold improvement for spNUTS2. The highest minimum ESS per second by the NUTS was 19% higher than that by spNUTS1 (N = 5) and 86% higher than that by spNUTS2. The NUTS was relatively more efficient when C was adaptively tuned because the trajectories were built using fewer leapfrog jumps. The computational advantage of spNUTS1, namely that the log target density is not evaluated at every leapfrog jump, is relatively small when there are only a few jumps per trajectory. When C is close to Σ, the sampling task is essentially equivalent to sampling from the standard normal distribution, in which case larger leapfrog step sizes may be used. The trajectories were made of five to eight leapfrog jumps at the most efficient target acceptance probability when C was adaptively tuned. In comparison, the number of leapfrog jumps in a trajectory was between 80 and 250 when C was not adaptively tuned.

    4.4.2 Bayesian logistic regression model

We also examined the numerical efficiency of the NUTS, spNUTS1, and spNUTS2 using the posterior distribution of a Bayesian logistic regression model. The Bayesian logistic regression model and the data we used are identical to those considered by Hoffman and Gelman (2014).


[Figure 6 plots: four columns of panels, one per algorithm (NUTS, spNUTS1 (N=1), spNUTS1 (N=5), spNUTS2); rows show minimum ESS per second, average ESS per second, and runtime against the target acceptance probability.]

Figure 6: The minimum and average effective sample sizes per second of runtime for the target distribution N(0, Σ) when the covariance C of the velocity distribution was adaptively tuned. The target acceptance probabilities are shown on the x-axis.

The German credit dataset from the UCI repository (Dua and Graff, 2017) consists of twenty four attributes of individuals and one binary variable classifying those individuals' credit. The posterior density is proportional to
$$\pi(\alpha, \beta \mid x, y) \propto \exp\left\{ -\sum_{i=1}^{1000} \log\bigl(1 + \exp\{-y_i(\alpha + x_i \cdot \beta)\}\bigr) - \frac{\alpha^2}{200} - \frac{\|\beta\|_2^2}{200} \right\},$$

where x_i denotes the twenty four dimensional covariate vector for the i-th individual and y_i denotes the classification result taking a value in ±1. We did not normalize the covariates to zero mean and unit variance as in Hoffman and Gelman (2014), because we let C be adaptively tuned. The covariance C was set to a diagonal matrix having as its diagonal entries the marginal sample variances of the constructed Markov chain up to the previous iteration. All algorithms were run under the same settings as those used for the multivariate normal distribution example.
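For reference, the unnormalized log posterior above and its gradient (needed for the leapfrog updates) can be coded directly. In the R sketch below, X is the 1000 × 24 covariate matrix and y the vector of ±1 labels; the function names are illustrative.

```r
## Unnormalized log posterior of the Bayesian logistic regression model.
log_post <- function(alpha, beta, X, y) {
  eta <- y * (alpha + drop(X %*% beta))
  -sum(log1p(exp(-eta))) - alpha^2 / 200 - sum(beta^2) / 200
}

## Gradient with respect to (alpha, beta); the derivative of
## -log(1 + exp(-eta)) with respect to eta is 1 / (1 + exp(eta)).
grad_log_post <- function(alpha, beta, X, y) {
  w <- y / (1 + exp(y * (alpha + drop(X %*% beta))))
  c(sum(w) - alpha / 100, drop(crossprod(X, w)) - beta / 100)
}
```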

Figure 7 shows the minimum and average ESS across d = 25 variables per second of runtime. The minimum ESS per second by spNUTS1 at the most efficient target acceptance probability was 2.6 times higher than that by the NUTS and 1.7 times higher than that by spNUTS2. The differences in numerical efficiency were driven mostly by the differences in runtime. The numbers of leapfrog jumps in stopped trajectories tended to be larger than those for the normal distribution example due to the correlations between the variables in this Bayesian logistic regression model; the numbers of leapfrog jumps were about fifty for the NUTS, twenty seven for spNUTS1, and twenty two for spNUTS2.


[Figure 7 plots: four columns of panels, one per algorithm (NUTS, spNUTS1 (N=1), spNUTS1 (N=5), spNUTS2); rows show minimum ESS per second, average ESS per second, and runtime against the target acceptance probability.]

Figure 7: The minimum and average effective sample sizes per second of runtime across d = 25 variables for the posterior distribution of the Bayesian logistic regression model in Section 4.4.2.

    5 Conclusion

The sequential-proposal MCMC framework is readily applicable to a wide range of MCMC algorithms. The flexibility and simplicity of the framework allow for various adjustments to the algorithms and offer possibilities of developing new ones. In this paper, we showed that the numerical efficiency of MCMC algorithms can be improved by using sequential proposals. In particular, we developed two novel NUTS-type algorithms, which showed higher numerical efficiency than the original NUTS by Hoffman and Gelman (2014) on the two examples we examined. In Appendix F, we apply the sequential-proposal framework to the bouncy particle sampler (BPS) and demonstrate an advantageous property: the sequential-proposal BPS can readily make jumps between multiple modes. Other applications of the sequential-proposal MCMC framework can be explored in future research.

Acknowledgement This work was supported by National Science Foundation grants DMS-1513040 and DMS-1308918. The authors thank Edward Ionides, Aaron King, and Stilian Stoev for comments on an earlier draft of this manuscript. The authors also thank Jesús María Sanz-Serna for informing us about related references.

A Proof of detailed balance for Algorithm 2 (sequential-proposal Metropolis-Hastings algorithm)

Here we give a proof that Algorithm 2 constructs a reversible Markov chain with respect to the target density π̄. In what follows, we denote the l-th rank of a given finite sequence a_{n:m} by r_l(a_{n:m});


that is, if we reorder the sequence a_{n:m} as a_{(1)} ≥ a_{(2)} ≥ · · · ≥ a_{(m−n+1)}, then r_l(a_{n:m}) = a_{(l)}. If l is greater than the length of the sequence a_{n:m}, we define r_l(a_{n:m}) := 0. We also define r_0(a_{n:m}) := ∞.

Proposition 5. The Markov chain (X^{(i)})_{i∈1:M} constructed by Algorithm 2 is reversible with respect to the target density π̄.

Proof. It suffices to show the claim for fixed N and L. The general case immediately follows by considering a mixture over N and L according to ν(N, L).

We will show that, for a given n ∈ 1:N, the probability density of taking y_n as the next state of the Markov chain starting from the current state y_0 after rejecting a sequence of proposals y_{1:n−1} is the same as the probability density of taking y_0 starting from y_n after going through a reversed sequence of proposals y_{n−1:1}. The case n = 1 coincides with a standard Metropolis-Hastings algorithm. We now fix n ≥ 2. Denoting the uniform(0, 1) random variable drawn at the beginning of the iteration by Λ, the k-th proposal y_k is considered acceptable if and only if

$$\Lambda < \frac{\pi(y_k) \prod_{j=1}^{k} q(y_{j-1} \mid y_j)}{\pi(y_0) \prod_{j=1}^{k} q(y_j \mid y_{j-1})}.$$

Multiplying both the numerator and the denominator by ∏_{j=k+1}^{n} q(y_j | y_{j−1}), we see that the above condition is equivalent to
$$\Lambda < \frac{\pi(y_k) \prod_{j=1}^{k} q(y_{j-1} \mid y_j) \prod_{j=k+1}^{n} q(y_j \mid y_{j-1})}{\pi(y_0) \prod_{j=1}^{n} q(y_j \mid y_{j-1})}. \tag{26}$$

For k ∈ 0:n, we define the quantities
$$p_k(y_0, y_1, \dots, y_n) := \pi(y_k) \prod_{j=1}^{k} q(y_{j-1} \mid y_j) \prod_{j=k+1}^{n} q(y_j \mid y_{j-1}),$$

such that the condition (26) can be concisely written as
$$\Lambda < \frac{p_k(y_0, y_1, \dots, y_n)}{p_0(y_0, y_1, \dots, y_n)}.$$

In what follows, p_k(y_0, y_1, . . . , y_n) will be denoted by p_k for brevity. The proposal y_n is taken as the next state of the Markov chain if and only if it is the L-th acceptable proposal among the sequence of proposals y_{1:n}. This happens if and only if Λ < p_n/p_0 and there are exactly L−1 proposals among y_{1:n−1} such that Λ < p_k/p_0. The latter condition is satisfied if and only if Λ is less than the (L−1)-th largest number among p_{1:n−1}/p_0 but greater than or equal to the L-th largest number among the same sequence, that is,
$$r_L\Bigl(\frac{p_{1:n-1}}{p_0}\Bigr) \le \Lambda < r_{L-1}\Bigl(\frac{p_{1:n-1}}{p_0}\Bigr).$$

Under the assumption that X^{(i)} is distributed according to the target density π̄, the probability that the current state of the Markov chain is in a set A ∈ X and the n-th proposal, which is in a set B ∈ X, is taken as the next state of the Markov chain is given by
$$\begin{aligned}
&\int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n)\,\bar\pi(y_0) \prod_{j=1}^{n} q(y_j \mid y_{j-1})\; \mathbb{1}\Bigl[\Lambda \ge r_L\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr)\Bigr]\, \mathbb{1}\Bigl[\Lambda < r_{L-1}\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr)\Bigr]\, \mathbb{1}\Bigl[\Lambda < \tfrac{p_n}{p_0}\Bigr]\, \mathbb{1}[0 < \Lambda < 1] \, d\Lambda\, dy_{0:n} \\
&\quad = \int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n) \cdot \frac{p_0}{Z} \cdot \mathbb{1}\Bigl[\Lambda \ge r_L\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr)\Bigr] \cdot \mathbb{1}\Bigl[\Lambda < \min\Bigl\{r_{L-1}\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr), \tfrac{p_n}{p_0}, 1\Bigr\}\Bigr] \, d\Lambda\, dy_{0:n} \\
&\quad = \int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n) \cdot \frac{p_0}{Z} \cdot \Bigl( \min\Bigl\{\tfrac{r_{L-1}(p_{1:n-1})}{p_0}, \tfrac{p_n}{p_0}, 1\Bigr\} - \min\Bigl\{\tfrac{r_L(p_{1:n-1})}{p_0}, \tfrac{p_n}{p_0}, 1\Bigr\} \Bigr) \, dy_{0:n} \\
&\quad = \frac{1}{Z} \int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n) \cdot \bigl( \min\{r_{L-1}(p_{1:n-1}), p_n, p_0\} - \min\{r_L(p_{1:n-1}), p_n, p_0\} \bigr) \, dy_{0:n}.
\end{aligned} \tag{27}$$


We will change the notation of dummy variables by writing y_0 ← y_n, y_1 ← y_{n−1}, . . . , y_n ← y_0, and note that p_k(y_n, y_{n−1}, . . . , y_0) can be expressed as

$$\pi(y_{n-k}) \prod_{j=1}^{k} q(y_{n-j+1} \mid y_{n-j}) \prod_{j=k+1}^{n} q(y_{n-j} \mid y_{n-j+1}) = \pi(y_{n-k}) \prod_{j=n-k+1}^{n} q(y_j \mid y_{j-1}) \prod_{j=1}^{n-k} q(y_{j-1} \mid y_j),$$

which is the same as the expression for p_{n−k}(y_0, y_1, . . . , y_n). Thus, under the change of notation, (27) can be rewritten as

$$\frac{1}{Z} \int \mathbb{1}_A(y_n)\,\mathbb{1}_B(y_0) \cdot \bigl( \min\{r_{L-1}(p_{n-1:1}), p_0, p_n\} - \min\{r_L(p_{n-1:1}), p_0, p_n\} \bigr) \, dy_{n:0},
$$

where p_k denotes p_k(y_0, y_1, . . . , y_n) for k ∈ 0:n. The above integral is equal to (27) with the sets A and B interchanged. Thus we have proved that the probability that the current state of the Markov chain is in A and the n-th proposal, which is in B, is taken as the next state of the Markov chain is equal to the probability that the current state is in B and the n-th proposal, which is in A, is taken as the next state. Summing the established equality over all n ∈ 1:N, and finally noting that the next state of the Markov chain is set equal to the current state in the case where fewer than L acceptable proposals are found among the first N proposals, we reach the conclusion that, under the assumption that X^{(i)} is distributed according to π̄, the probability that the current state of the Markov chain is in A and the next state is in B is the same as the probability that the current state is in B and the next state is in A. This finishes the proof of detailed balance for Algorithm 2.
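To make the acceptance rule just analyzed concrete, here is a minimal R sketch of one iteration of Algorithm 2, specialized to a symmetric Gaussian random-walk kernel so that the proposal-density products in the acceptability condition cancel; all names and default values are illustrative.

```r
## One iteration of sequential-proposal Metropolis-Hastings with a symmetric
## kernel: proposal y_k is acceptable iff Lambda < pi(y_k) / pi(y_0), and the
## L-th acceptable proposal among the first N becomes the next state.
sp_mh_step <- function(x, log_pi, N = 5, L = 1, sigma = 1) {
  log_lambda <- log(runif(1))  # one uniform draw shared by the whole iteration
  y <- x; n_a <- 0
  for (n in 1:N) {
    y <- rnorm(length(x), mean = y, sd = sigma)  # propose from the last proposal
    if (log_lambda < log_pi(y) - log_pi(x)) n_a <- n_a + 1
    if (n_a == L) return(y)
  }
  x  # fewer than L acceptable proposals: the chain stays at x
}
```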

B Sequential-proposal Metropolis-Hastings algorithms with proposal kernels dependent on previous proposals

In Section 2.2 we presented a generalization of Algorithm 2 in which the proposal kernel can depend on previous proposals made in the same iteration. A pseudocode for this generalized version is given in Algorithm 8. A proof that Algorithm 8 constructs a reversible Markov chain with respect to the target density π̄ is given below.

Proposition 6. Algorithm 8 constructs a reversible Markov chain with respect to the target density π̄.

Proof. Again we consider fixed N and L, because the general case easily follows by considering a mixture over N and L. Let Λ denote the uniform(0, 1) number drawn at the start of the iteration. We will denote the value of the current state of the Markov chain by y_0, and the values of a sequence of proposals up to the n-th proposal by y_1, . . . , y_n. The n-th proposal y_n is taken as the next state of the Markov chain if and only if

$$\Lambda < \frac{\pi(y_n) \prod_{j=1}^{n} q_j(y_{n-j} \mid y_{n-j+1:n})}{\pi(y_0) \prod_{j=1}^{n} q_j(y_j \mid y_{j-1:0})},$$

and there are exactly L−1 numbers k among 1:n−1 that satisfy

$$\Lambda < \frac{\pi(y_k) \prod_{j=1}^{k} q_j(y_{k-j} \mid y_{k-j+1:k})}{\pi(y_0) \prod_{j=1}^{k} q_j(y_j \mid y_{j-1:0})}, \tag{28}$$

and also there exist exactly L−1 numbers k′ among 1:n−1 that satisfy

$$\Lambda < \frac{\pi(y_{k'}) \prod_{j=1}^{n-k'} q_j(y_{k'+j} \mid y_{k'+j-1:k'}) \prod_{j=n-k'+1}^{n} q_j(y_{n-j} \mid y_{n-j+1:n})}{\pi(y_0) \prod_{j=1}^{n} q_j(y_j \mid y_{j-1:0})}. \tag{29}$$


Algorithm 8: A sequential-proposal Metropolis-Hastings algorithm using a path-dependent proposal kernel

Input: Distribution for the maximum number of proposals and the number of accepted proposals, ν(N, L); path-dependent proposal kernels, {q_j(· | x_{j−1}, . . . , x_0) ; j ≥ 1}; number of iterations, M
Output: A draw of the Markov chain, (X^{(i)})_{i∈1:M}

Initialize: set X^{(0)} arbitrarily
for i ← 0:M−1 do
    Draw (N, L) ∼ ν(·, ·)
    Draw Λ ∼ unif(0, 1)
    Set X^{(i+1)} ← X^{(i)}
    Set Y_0 ← X^{(i)} and n_a ← 0
    for n ← 1:N do
        Draw Y_n ∼ q_n(· | Y_{n−1:0})
        if Λ < [π(Y_n) ∏_{j=1}^{n} q_j(Y_{n−j} | Y_{n−j+1:n})] / [π(Y_0) ∏_{j=1}^{n} q_j(Y_j | Y_{j−1:0})] then n_a ← n_a + 1
        if n_a = L then
            if there exist exactly L−1 cases among k ∈ 1:n−1 such that
                Λ < [π(Y_k) ∏_{j=1}^{n−k} q_j(Y_{k+j} | Y_{k+j−1:k}) ∏_{j=n−k+1}^{n} q_j(Y_{n−j} | Y_{n−j+1:n})] / [π(Y_0) ∏_{j=1}^{n} q_j(Y_j | Y_{j−1:0})]
            then set X^{(i+1)} ← Y_n
            break
        end
    end
end

The inequality (28) can be expressed as
$$\Lambda < \frac{\pi(y_k) \prod_{j=1}^{k} q_j(y_{k-j} \mid y_{k-j+1:k}) \prod_{j=k+1}^{n} q_j(y_j \mid y_{j-1:0})}{\pi(y_0) \prod_{j=1}^{n} q_j(y_j \mid y_{j-1:0})}.$$

We note that the numerator in the expression above is the probability density of drawing a sequence of proposals in the order y_k → y_{k−1} → · · · → y_0 → y_{k+1} → y_{k+2} → · · · → y_n, where the value y_j for j ≥ k+1 is drawn from the proposal density q_j(· | y_{j−1}, y_{j−2}, . . . , y_0). We denote this probability density by
$$p_k(y_0, y_1, \dots, y_n) := \pi(y_k) \prod_{j=1}^{k} q_j(y_{k-j} \mid y_{k-j+1:k}) \prod_{j=k+1}^{n} q_j(y_j \mid y_{j-1:0}).$$

We also denote the numerator in (29) by
$$\bar{p}_k(y_0, y_1, \dots, y_n) := \pi(y_k) \prod_{j=1}^{n-k} q_j(y_{k+j} \mid y_{k+j-1:k}) \prod_{j=n-k+1}^{n} q_j(y_{n-j} \mid y_{n-j+1:n}),$$

which gives the probability density of drawing proposals in the order y_k → y_{k+1} → · · · → y_n → y_{k−1} → · · · → y_0, where y_j for j ≤ k−1 is drawn from q_{n−j}(· | y_{j+1}, . . . , y_n). One can easily check the following relations:

$$p_n(y_{0:n}) = p_0(y_{n:0}), \qquad p_0(y_{0:n}) = p_n(y_{n:0}), \qquad\text{and}\qquad \bar{p}_k(y_{0:n}) = p_{n-k}(y_{n:0}), \quad \bar{p}_k(y_{n:0}) = p_{n-k}(y_{0:n}) \quad\text{for } k \in 0:n, \tag{30}$$


where we remind the reader of our notation y_{0:n} := (y_0, y_1, . . . , y_n) and y_{n:0} := (y_n, y_{n−1}, . . . , y_0). Now (28) and (29) can be concisely expressed as

$$\Lambda < \frac{p_k(y_{0:n})}{p_0(y_{0:n})} \qquad\text{and}\qquad \Lambda < \frac{\bar{p}_{k'}(y_{0:n})}{p_0(y_{0:n})}$$

respectively. The conditions required for taking y_n as the next state of the Markov chain can be summarized by the following inequalities:

$$\Lambda \ge r_L\Bigl(\frac{p_{1:n-1}}{p_0}(y_{0:n})\Bigr), \qquad \Lambda < r_{L-1}\Bigl(\frac{p_{1:n-1}}{p_0}(y_{0:n})\Bigr),$$
$$\Lambda \ge r_L\Bigl(\frac{\bar{p}_{1:n-1}}{p_0}(y_{0:n})\Bigr), \qquad \Lambda < r_{L-1}\Bigl(\frac{\bar{p}_{1:n-1}}{p_0}(y_{0:n})\Bigr),$$
$$\text{and}\qquad \Lambda < \frac{p_n}{p_0},$$

where r_L denotes the function returning the L-th rank as defined in Appendix A, and (p_{1:n−1}/p_0)(y_{0:n}) denotes the sequence of values (p_1(y_{0:n})/p_0(y_{0:n}), . . . , p_{n−1}(y_{0:n})/p_0(y_{0:n})). In what follows, p_k(y_{0:n}) and p̄_k(y_{0:n}) will be written as p_k and p̄_k for brevity. Under the assumption that at the current iteration the state of the Markov chain is distributed according to π̄, the probability that the current state is in A and the n-th proposal, which is in B, is taken as the next state of the Markov chain is given by

$$\begin{aligned}
&\int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n)\,\bar\pi(y_0) \prod_{j=1}^{n} q_j(y_j \mid y_{j-1:0})\; \mathbb{1}\Bigl[\Lambda < \min\Bigl\{1, \tfrac{p_n}{p_0}, r_{L-1}\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr), r_{L-1}\Bigl(\tfrac{\bar{p}_{1:n-1}}{p_0}\Bigr)\Bigr\}\Bigr] \cdot \mathbb{1}\Bigl[\Lambda \ge \max\Bigl\{r_L\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr), r_L\Bigl(\tfrac{\bar{p}_{1:n-1}}{p_0}\Bigr)\Bigr\}\Bigr] \, d\Lambda\, dy_{0:n} \\
&= \int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n)\, \frac{p_0}{Z} \Bigl[ \min\Bigl\{1, \tfrac{p_n}{p_0}, r_{L-1}\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr), r_{L-1}\Bigl(\tfrac{\bar{p}_{1:n-1}}{p_0}\Bigr)\Bigr\} - \min\Bigl\{1, \tfrac{p_n}{p_0}, r_{L-1}\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr), r_{L-1}\Bigl(\tfrac{\bar{p}_{1:n-1}}{p_0}\Bigr), \max\Bigl\{r_L\Bigl(\tfrac{p_{1:n-1}}{p_0}\Bigr), r_L\Bigl(\tfrac{\bar{p}_{1:n-1}}{p_0}\Bigr)\Bigr\}\Bigr\} \Bigr] dy_{0:n} \\
&= \frac{1}{Z} \int \mathbb{1}_A(y_0)\,\mathbb{1}_B(y_n) \Bigl[ \min\{p_0, p_n, r_{L-1}(p_{1:n-1}), r_{L-1}(\bar{p}_{1:n-1})\} - \min\bigl\{p_0, p_n, r_{L-1}(p_{1:n-1}), r_{L-1}(\bar{p}_{1:n-1}), \max\{r_L(p_{1:n-1}), r_L(\bar{p}_{1:n-1})\}\bigr\} \Bigr] dy_{0:n}
\end{aligned} \tag{31}$$

We now change the notation of dummy variables by writing y_0 ← y_n, y_1 ← y_{n−1}, . . . , y_n ← y_0. Noting the relations (30), we may rewrite (31) as

$$\frac{1}{Z} \int \mathbb{1}_A(y_n)\,\mathbb{1}_B(y_0) \Bigl[ \min\{p_n, p_0, r_{L-1}(\bar{p}_{n-1:1}), r_{L-1}(p_{n-1:1})\} - \min\bigl\{p_n, p_0, r_{L-1}(\bar{p}_{n-1:1}), r_{L-1}(p_{n-1:1}), \max\{r_L(\bar{p}_{n-1:1}), r_L(p_{n-1:1})\}\bigr\} \Bigr] dy_{n:0}.$$

But the above display is equal to what is obtained when the sets A and B are interchanged in (31). Thus we have proved that, denoting the current state of the Markov chain by X^{(i)} and the next state by X^{(i+1)}, and assuming that X^{(i)} is distributed according to π̄,

$$\mathbb{P}\bigl[X^{(i)} \in A,\; X^{(i+1)} \in B,\; \text{the } n\text{-th proposal is taken as } X^{(i+1)}\bigr] = \mathbb{P}\bigl[X^{(i)} \in B,\; X^{(i+1)} \in A,\; \text{the } n\text{-th proposal is taken as } X^{(i+1)}\bigr].$$

Summing the above equation over n ∈ 1:N and considering that X^{(i+1)} is set equal to X^{(i)} in all scenarios except when a proposal among y_1, . . . , y_N is taken as the next state


Recommended