arXiv:1603.02485v3 [stat.ME] 10 Nov 2016


Block-Wise Pseudo-Marginal Metropolis-Hastings

M.-N. Tran∗ R. Kohn † M. Quiroz ‡§ M. Villani‡

February 22, 2019

Abstract

The pseudo-marginal Metropolis-Hastings (PMMH) approach is increasingly used for Bayesian inference in statistical models where the likelihood is intractable but can be estimated unbiasedly, such as random effects models and state-space models, or for data subsampling in big-data settings. Dahlin et al. (2015) and Deligiannidis et al. (2015) show how the PMMH approach can be made much more efficient by correlating the underlying pseudo random numbers used to form the estimate of the likelihood at the current and proposed values of the unknown parameters. Their approach greatly speeds up the standard PMMH algorithm, as it requires a much smaller number of particles to form the optimal likelihood estimate. We present an alternative PMMH approach that divides the underlying random numbers into blocks so that the likelihood estimates for the proposed and current values of the parameters only differ by the random numbers in one block. Our approach has several advantages. First, it provides a direct way to control the correlation between the logarithms of the estimates of the likelihood at the current and proposed values of the parameters. Second, under some conditions, the mathematical properties of the method are simplified and made more transparent compared to the treatment in Deligiannidis et al. (2015). Third, blocking is shown to be a natural way to carry out PMMH in many problems, especially when the likelihood is estimated using randomised quasi numbers. We show that, if quasi numbers are used for estimating the likelihood, the optimal number of particles required in the block-wise PMMH is much smaller than that in Deligiannidis et al. (2015). We obtain theory and guidelines for selecting the optimal number of particles, and document large speed-ups in a wide range of applications.

Keywords. Intractable likelihood; Unbiasedness; Panel data; Data subsampling; Quasi Monte Carlo

∗Discipline of Business Analytics, University of Sydney
†School of Economics, UNSW School of Business
‡Department of Computer and Information Science, Linkoping University
§Research Division, Sveriges Riksbank


http://arxiv.org/abs/1603.02485v3

1 Introduction

In many statistical applications the likelihood is analytically or computationally intractable,

making it difficult to carry out Bayesian inference. One class of models where the

likelihood is often intractable is generalised linear mixed models (GLMM) for longitudinal

data, where random effects are used to account for the dependence between the observations

measured on the same individual (Fitzmaurice et al., 2011; Bartolucci et al., 2012). The

likelihood is intractable because it is an integral over the random effects, but it can be

easily estimated unbiasedly using importance sampling. A second example, which uses

a variant of the unbiasedness idea, is that of unbiasedly estimating the log likelihood by

subsampling, as in Quiroz et al. (2016). Subsampling is useful when the log likelihood

is a sum of terms, with each term in the log likelihood expensive to evaluate, or when

there is a very large number of such terms. Quiroz et al. (2016) estimate the log-likelihood

unbiasedly in this way and then bias correct the resulting likelihood estimator to use within

a PMMH algorithm. They show analytically that the resulting posterior distribution of the

parameters is a perturbation of the true target distribution, with the perturbation error

being very small. State space models are a third class of models where the likelihood is

often intractable but can be unbiasedly estimated using an importance sampling estimator

(Shephard and Pitt, 1997; Durbin and Koopman, 1997) or by a particle filter estimator

(Del Moral, 2004; Andrieu et al., 2010).

It is now well known in the literature that a direct way to overcome the problem of

working with an intractable likelihood is to estimate the likelihood unbiasedly and use this

estimate within a Markov chain Monte Carlo (MCMC) simulation on an expanded space

that includes the random numbers used to construct the likelihood estimator. This was

first considered by Lin et al. (2000) in the Physics literature and Beaumont (2003) in the

Statistics literature; it was formally studied in Andrieu and Roberts (2009), who called the

method pseudo-marginal Metropolis-Hastings and gave conditions for the chain to converge.

Andrieu et al. (2010) use MCMC for inference in state space models where the likelihood

is estimated unbiasedly by the particle filter. Flury and Shephard (2011) give an excellent discussion with illustrative examples of PMMH. Pitt et al. (2012) and Doucet et al. (2015) analyse the effect of estimating the likelihood and show that the variance of the log-likelihood estimator should be around 1 to obtain an optimal tradeoff between the efficiency of the Markov chain and the computational cost. See also Sherlock et al. (2015), who consider random walk proposals for the parameters, and show that the optimal variance of the log of the likelihood estimator can be somewhat higher in this case.

A key issue in estimating models by standard PMMH is that the variance of the log of the estimated likelihood grows linearly with the number of observations T. Hence, to keep the variance of the log of the estimated likelihood small and around 1, it is necessary for the number of particles N used in constructing the likelihood estimator to increase in proportion to T, which means that PMMH requires O(T^2) operations at every MCMC iteration. Deligiannidis et al. (2015) and Dahlin et al. (2015) propose correlating the pseudo random numbers used in constructing the estimators of the likelihood at the current and proposed values of the parameters; see also Stramer and Bognar (2011), Murray and Graham (2016) and Lee and Holmes (2010). Deligiannidis et al. (2015) show that by inducing a high correlation between these ensembles of pseudo random numbers it is only necessary to increase the number of particles N in proportion to T^{1/2}, reducing the cost of the PMMH algorithm to O(T^{3/2}) per iteration. This is a breakthrough in the ability of PMMH to be competitive with more traditional MCMC methods.

Our article proposes an alternative approach to Deligiannidis et al. (2015) and Dahlin et al. (2015), called the block-wise PMMH, that requires a much smaller number of particles for optimal performance. The block-wise PMMH divides the set of underlying random numbers into blocks and updates one block at a time, thus reducing the variation in the Metropolis-Hastings acceptance probability. This helps the chain to mix well even if highly variable estimates of the likelihood are used. As a result, only a small number of particles is needed at every iteration. We obtain theory and guidelines for selecting an optimal number of particles in the block-wise PMMH. The block-wise PMMH has a number of advantages over the correlated PMMH method for models where the likelihood can be divided into independent blocks. First, the correlation between the proposed and current values of the log likelihood estimators is controlled directly rather than indirectly and nonlinearly through the correlated ensembles of random numbers. Second, we show that blocking is a natural way to carry out a dependent PMMH in many problems such as subsampling, panel data, diffusion processes and Approximate Bayesian Computation. Third, under some conditions, the theoretical justification and optimality properties of the block-wise PMMH are more transparent and easier to prove than the derivations in Deligiannidis et al. (2015). In particular, similarly to Deligiannidis et al. (2015), we show that the optimal number of particles required at each iteration of the block-wise PMMH is O(T^{3/2}). Fourth, the block-wise PMMH takes less CPU time than the standard and correlated PMMH as we do not need to generate the full set of random numbers in each iteration. Fifth, the block-wise PMMH offers a natural way to use randomised quasi numbers for estimating the likelihood. Numerical integration using randomised quasi Monte Carlo numbers achieves, in many cases, a better convergence rate than using Monte Carlo pseudo random numbers. A thorough treatment of quasi Monte Carlo can be found in the monographs by Niederreiter (1992) and Dick and Pillichshammer (2010). Using quasi numbers has proven successful in the intractable likelihood literature recently; see, e.g., Gerber and Chopin (2015); Tran et al. (2015). We show that, if quasi numbers are used for estimating the likelihood, the optimal number of particles required at each iteration of the PMMH is approximately O(T^{7/6}), compared to O(T^{3/2}) in the correlated PMMH (Deligiannidis et al., 2015). Correlating quasi numbers for use in the correlated PMMH is not trivial as this might break down the desired uniformity property of quasi numbers, an issue recently studied in Gunawan et al. (2016).

We obtain a large-sample property showing that, under some conditions, slowly updating

the underlying random numbers in likelihood estimation does not have any negative

effect on the mixing of the chain on the parameters of interest. We apply the proposed

method to a wide range of applications and document large speed-ups compared to the

other approaches.


2 Block-wise pseudo-marginal Metropolis-Hastings algorithm

2.1 The PMMH algorithm

Let y be a set of observations with density p(y|θ), where θ ∈ Θ is the vector of unknown

parameters. We are interested in sampling from the posterior π(θ) ∝ p(θ)p(y|θ) in models

where the likelihood p(y|θ) is analytically or computationally intractable. We assume that

p(y|θ) can be unbiasedly estimated by pN(y|θ) = pN(y|θ, u), with u the set of pseudo

random numbers or randomised quasi numbers used to compute pN(y|θ); N is the number

of importance samples or number of particles used, and the dimension of u is proportional

to N. We will call N the number of particles throughout. Denote the density function

of u by pN (u) and define a joint density of θ and u as

πN (θ, u) := p(θ)pN (y|θ, u)pN(u)/p(y). (1)

Then, πN(θ,u) admits π(θ) as its marginal density because ∫ pN(y|θ,u) pN(u) du = p(y|θ) by unbiasedness. Therefore, we obtain samples from the posterior π(θ) by sampling from πN(θ,u).
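The marginalisation step can be written out in one line. The following short derivation is a sketch that uses only definition (1) and the unbiasedness of the likelihood estimator:

```latex
\int \pi_N(\theta,u)\,du
  = \frac{p(\theta)}{p(y)} \int p_N(y|\theta,u)\,p_N(u)\,du
  = \frac{p(\theta)\,p(y|\theta)}{p(y)}
  = \pi(\theta).
```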

Let q(θ|θ′) be a proposal density for θ, conditional on θ′, with θ′ the current state. Let

u′ be the corresponding current set of random numbers used to compute pN(y|θ′,u′). The

standard PMMH algorithm generates samples from π(θ) by generating a Markov chain with

invariant density based on πN (θ,u) using the Metropolis-Hastings algorithm with proposal

density q(θ,u|θ′,u′)=q(θ|θ′)pN(u). The proposal (θ,u) is accepted with probability

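\[
\alpha(\theta',u';\theta,u) := \min\left(1, \frac{\pi_N(\theta,u)}{\pi_N(\theta',u')}\,\frac{q(\theta',u'|\theta,u)}{q(\theta,u|\theta',u')}\right) = \min\left(1, \frac{p(\theta)\,p_N(y|\theta,u)}{p(\theta')\,p_N(y|\theta',u')}\,\frac{q(\theta'|\theta)}{q(\theta|\theta')}\right), \tag{2}
\]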

which is computable. In the standard PMMH scheme, a new independent set of pseudo-random numbers u is generated each time the likelihood estimate is computed, so it is usually unnecessary to store u and u′.
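The standard scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy random-effects model (y_i = θ + a_i + e_i with a_i, e_i standard normal), the flat prior, the random-walk proposal and the names `log_lik_hat` and `standard_pmmh` are all hypothetical choices made for the example.

```python
import math
import random

def log_lik_hat(theta, y, n_particles, rng):
    """Log of an unbiased importance-sampling estimate of p(y|theta).

    Assumed toy model: y_i = theta + a_i + e_i with a_i, e_i ~ N(0, 1);
    the random effect a_i is integrated out by simulation from its prior.
    """
    total = 0.0
    for yi in y:
        w = [math.exp(-0.5 * (yi - theta - rng.gauss(0.0, 1.0)) ** 2)
             / math.sqrt(2.0 * math.pi) for _ in range(n_particles)]
        total += math.log(sum(w) / n_particles)
    return total

def standard_pmmh(y, theta0, n_iter, n_particles, step, rng):
    """Random-walk PMMH with a flat prior; fresh random numbers at every proposal."""
    theta, ll = theta0, log_lik_hat(theta0, y, n_particles, rng)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        ll_prop = log_lik_hat(prop, y, n_particles, rng)  # a new u each time
        # Symmetric proposal and flat prior: the ratio in (2) reduces to the
        # ratio of the two likelihood estimates.
        if rng.random() < math.exp(min(0.0, ll_prop - ll)):
            theta, ll = prop, ll_prop
            accepted += 1
        chain.append(theta)
    return chain, accepted / n_iter
```

Note that the current estimate `ll` is stored and reused rather than recomputed, which is what makes the chain target the extended density (1).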

For the standard PMMH algorithm, Pitt et al. (2012) and Doucet et al. (2015) show


that the variance of log pN (y|θ,u) should be around 1 in order to obtain an optimal tradeoff

between the computational cost and efficiency of the Markov chain in θ and u. However,

in some problems it may be prohibitively expensive to take an N large enough to ensure

that V(log pN(y|θ,u))≈1.

2.2 The block-wise PMMH algorithm

In our block-wise PMMH algorithm, instead of generating a completely new set u when

estimating the likelihood as previously done in the literature, we update u in blocks. Let

us divide the set of variables u into G sets u(1),...,u(G). The extended target (1) can be

re-written as

πN (θ, u(1), ..., u(G)) = p(θ)pN(y|θ, u(1), ..., u(G))pN(u(1), ..., u(G))/p(y). (3)

Instead of updating the full set of (θ,u) as in the standard PMMH, we propose to update

θ and a block u(K) at a time. The block index K is randomly selected from 1,...,G with

P(K=k)>0 for every k=1,...,G. Typically, P(K=k)=1/G. This is similar to component-wise

MCMC, whose convergence is well established in the literature; see, e.g., Johnson et al.

(2013). The acceptance probability (2) becomes

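\[
\min\left(1, \frac{p(\theta)\,p_N(y|\theta,u'_{(1)},\ldots,u_{(k)},\ldots,u'_{(G)})}{p(\theta')\,p_N(y|\theta',u'_{(1)},\ldots,u'_{(k)},\ldots,u'_{(G)})}\,\frac{q(\theta'|\theta)}{q(\theta|\theta')}\right). \tag{4}
\]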

Intuitively, by fixing all the u(j) except u(k), the variation in the ratio of the likelihood estimates is reduced, which helps the chain to mix well.
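The block update can be sketched as follows. This is a minimal illustration, assuming the same hypothetical toy model as a stand-in (y_i = θ + a_i + e_i with standard normal a_i and e_i), a flat prior and a symmetric random-walk proposal, so that (4) reduces to a ratio of likelihood estimates; the function names are invented for the example.

```python
import math
import random

def draw_block(n_obs, n_particles, rng):
    """Fresh standard normal numbers u_(k) for one block of observations."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_particles)] for _ in range(n_obs)]

def block_log_lik(theta, y_block, u_block):
    """Log of the unbiased estimate of p(y_(k)|theta) for fixed u_(k)."""
    total = 0.0
    for yi, ui in zip(y_block, u_block):
        w = [math.exp(-0.5 * (yi - theta - a) ** 2) / math.sqrt(2.0 * math.pi)
             for a in ui]
        total += math.log(sum(w) / len(w))
    return total

def blockwise_pmmh(y_blocks, theta0, n_iter, n_particles, step, rng):
    """Update theta together with one randomly chosen block u_(k) per iteration."""
    G = len(y_blocks)
    u = [draw_block(len(yb), n_particles, rng) for yb in y_blocks]
    theta = theta0
    ll = sum(block_log_lik(theta, yb, ub) for yb, ub in zip(y_blocks, u))
    chain = []
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        k = rng.randrange(G)                               # P(K = k) = 1/G
        u_k = draw_block(len(y_blocks[k]), n_particles, rng)
        # All blocks except k keep their current random numbers.
        ll_prop = sum(block_log_lik(prop, yb, u_k if j == k else ub)
                      for j, (yb, ub) in enumerate(zip(y_blocks, u)))
        if rng.random() < math.exp(min(0.0, ll_prop - ll)):
            theta, ll, u[k] = prop, ll_prop, u_k
        chain.append(theta)
    return chain
```

Unlike the standard scheme, the random numbers must be stored here, since G−1 blocks are reused at the proposed parameter value.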

Instead of updating u in blocks, Deligiannidis et al. (2015) and Dahlin et al. (2015)

move u slowly by making the proposed u and the current u′ correlated. Suppose that the underlying pseudo random numbers u are standard univariate normal variables and ρ>0 is a number close to 1. They propose u = ρu′ + √(1−ρ^2) ε, with ε a set of standard normal variables of the same size as u′. We note that it is not trivial to extend this framework to the case where u are randomised quasi numbers, as u is then a set of uniform rather than standard normal random variables, and correlating u in this way will break down the desired uniformity property of quasi numbers.

3 Analysis of the block-wise PMMH

We now suppose that the likelihood can be written as a product of G independent terms,

\[
p(y|\theta)=\prod_{k=1}^{G} p(y_{(k)}|\theta). \tag{5}
\]

Example: panel data models. Consider a panel-data model with T panels, which we

divide into G groups y(1),...,y(G) with approximately T/G panels in each.

Example: big data. Consider a big data set with T independent observations, which we

divide into G groups y(1),...,y(G) with T/G observations in each.

Example: time series data. Consider time series data y={y1,...,yT}, which we divide into G groups y(1)=y_{1:t_1}, y(2)=y_{t_1+1:t_2}, ... Then p(y|θ) can be written in the form of (5) with

\[
p(y_{(k)}|\theta) = p(y_{t_{k-1}+1}|y_{t_{k-1}},\theta)\cdots p(y_{t_k}|y_{t_k-1},\theta), \quad k=1,\ldots,G.
\]

Here, t0=0, tG=T and p(y1|y0,θ)=p(y1|θ).

We assume that the kth likelihood term p(y(k)|θ) is estimated unbiasedly by pN(k)(y(k)|θ,u(k)), where the u(k) are independent with u(k)∼pN(k)(·), and N(k) is the number of particles or importance samples, or the size of u(k). Let N := N(1)+···+N(G). An unbiased estimator of the likelihood is

\[
p_N(y|\theta) := \prod_{k=1}^{G} p_{N_{(k)}}(y_{(k)}|\theta,u_{(k)}).
\]

Our analysis of the block-wise PMMH is built on the framework of Pitt et al. (2012). We

follow Pitt et al. (2012) and define z(k) = z(k)(θ,u(k)) := log pN(k)(y(k)|θ,u(k)) − log p(y(k)|θ) as the error in the log-likelihood estimate of the kth block. Then, z = z(θ,u) := log pN(y|θ,u) − log p(y|θ) = ∑_{k=1}^{G} z(k) is the error in the log-likelihood estimate.

Suppose that u(1),...,u(G) are independent and generated from pN(k)(u(k)) and suppose that gN(k)(z(k)|θ) is the density of the corresponding z(k). Then, the joint density of θ,z(1),...,z(G) is

\[
\pi_N(\theta,z_{(1)},\ldots,z_{(G)}) = \pi(\theta)\prod_{k=1}^{G}\exp(z_{(k)})\,g_{N_{(k)}}(z_{(k)}|\theta), \tag{6}
\]

which means that, at stationarity, πN(k)(z(k)|θ) = exp(z(k)) gN(k)(z(k)|θ) and the z(k) are independent conditional on θ, i.e., in the posterior πN(·|θ).

We note that it is straightforward to show that the acceptance probability (2) can be

rewritten as

\[
\min\left\{1, \exp(z-z')\,\frac{\pi(\theta)}{\pi(\theta')}\,\frac{q(\theta'|\theta)}{q(\theta|\theta')}\right\}, \tag{7}
\]

where z′ = z(θ′,u′). This is also the acceptance probability if we consider a Metropolis-

Hastings scheme that samples from

πN(θ,z)=π(θ)ezgN(z|θ), (8)

with gN(z|θ) the density of z. Thus, as noted in Pitt et al. (2012), the properties of the

Metropolis-Hastings scheme based on πN (θ,u) are the same as those based on πN(θ,z). As

we show below, it will be easier to work with πN (θ,z) than with πN(θ,u) because z is a

scalar.
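The step from (2) to (7) follows by substituting the definition of z. As a short worked derivation, using log pN(y|θ,u) = log p(y|θ) + z(θ,u) and π(θ) = p(θ)p(y|θ)/p(y):

```latex
p(\theta)\,p_N(y|\theta,u) = p(\theta)\,p(y|\theta)\,e^{z(\theta,u)} = p(y)\,\pi(\theta)\,e^{z},
\qquad\text{so}\qquad
\frac{p(\theta)\,p_N(y|\theta,u)}{p(\theta')\,p_N(y|\theta',u')} = e^{z-z'}\,\frac{\pi(\theta)}{\pi(\theta')}.
```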

Instead of updating θ and u as in the standard PMMH, the block-wise PMMH updates

θ and a single block, u(k). The terms z and z′ in the acceptance probability (7) are

\[
z = \sum_{j=1,\,j\neq k}^{G} z_{(j)}(\theta,u_{(j)}=u'_{(j)}) + z_{(k)}(\theta,u_{(k)}), \qquad z' = \sum_{j=1}^{G} z_{(j)}(\theta',u'_{(j)}).
\]

We use the following notation: w∼N(a,b^2) means that w has a normal distribution with mean a and variance b^2, and we denote the density of w as N(w;a,b^2).

We make the following assumptions.

Assumption 1. Suppose that u(1),...,u(G) are independent and generated from pN(k)(u(k)) and,

(i) For each group k, there is a γ^2_(k)(θ)>0 and an Nk such that

\[
z_{(k)}(\theta,u_{(k)}) \sim \mathcal{N}\left(-\frac{\gamma^2_{(k)}(\theta)}{2N_k^{2\varpi}}, \frac{\gamma^2_{(k)}(\theta)}{N_k^{2\varpi}}\right),
\]

for some ϖ>0.

(ii) For a given σ^2>0, let Nk be a function of θ, σ^2 and G such that V(z(k)(θ,u(k))) = σ^2/G, i.e. Nk = Nk(θ,σ^2,G) = [Gγ^2_(k)(θ)/σ^2]^{1/(2ϖ)}. This means that z(k)(θ,u(k)) ∼ N(−σ^2/(2G), σ^2/G) for each k.

As will be clear in Lemma 5, the exponent ϖ in Assumption 1(i) equals 1/2 if the likelihood is estimated using pseudo random

numbers, and 3/2−ε for any arbitrarily small ε>0 if the likelihood is estimated using

quasi numbers. We note that N(k) is the total number of particles used for the kth group,

and will usually be different from Nk. In panel data models, N(k)=(T/G)Nk. Assumption 1

is a stylized set of assumptions that will hold approximately when the Nk are sufficiently

large. Part (i) is similar to the assumption made in Pitt et al. (2012) and ensures that

if z(k) is normally distributed then the expected value E(exp(z(k))) = 1 for each k; this is

consistent with the estimator pN(k)(y(k)|θ) being unbiased. For panel data models, part (i) is

justified by Lemma 5 in this article. Assumption 1(ii) can be enforced for most panel data

models and subsampling approaches because it is straightforward to estimate the variance

of z(k) accurately for each k and θ.
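The remark that part (i) is consistent with unbiasedness can be checked with the moment generating function of a normal variable. Writing s^2 for the variance of z(k) in Assumption 1(i) (a symbol introduced here only for the derivation), the mean is −s^2/2 and

```latex
E\left[e^{z_{(k)}}\right] = \exp\left(-\frac{s^2}{2} + \frac{s^2}{2}\right) = 1,
\qquad\text{that is,}\qquad
E\left[p_{N_{(k)}}(y_{(k)}|\theta,u_{(k)})\right] = p(y_{(k)}|\theta).
```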

We can now obtain the following lemma whose proof is straightforward and omitted.

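Lemma 1. Suppose that Assumption 1 holds and define z′ and z as above. Let ρ=1−1/G. Then,

(i)
\[
z' \sim \mathcal{N}\left(\frac{\sigma^2}{2},\sigma^2\right) \quad\text{and}\quad z \sim \mathcal{N}\left(\frac{\sigma^2(2\rho-1)}{2},\sigma^2\right).
\]

(ii) Corr(z,z′)=ρ.

(iii)
\[
z|z' \sim \mathcal{N}\left(-\frac{\sigma^2}{2}(1-\rho)+\rho z', \sigma^2(1-\rho^2)\right).
\]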
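Lemma 1 can be checked by simulation. The sketch below is an illustration only, with hypothetical function names: it draws the current blocks from their stationary density exp(z)g(z), which is N(σ^2/(2G), σ^2/G), replaces one block by a fresh draw from g, and then estimates the means and the correlation of (z, z′).

```python
import math
import random

def simulate_pair(sigma2, G, rng):
    """Return (z, z') after refreshing one of G blocks, with z' at stationarity."""
    s2 = sigma2 / G
    sd = math.sqrt(s2)
    current = [rng.gauss(0.5 * s2, sd) for _ in range(G)]   # stationary blocks
    k = rng.randrange(G)
    z_cur = sum(current)
    z_new = z_cur - current[k] + rng.gauss(-0.5 * s2, sd)   # new block from g
    return z_new, z_cur

def sample_corr(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)
```

With σ^2=4 and G=10, for instance, the sample correlation should be close to ρ=0.9 and the sample means close to σ^2(2ρ−1)/2=1.6 and σ^2/2=2.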

We follow Pitt et al. (2012) and assume that the proposal for θ is perfect. That is,

Assumption 2. q(θ|θ′)=π(θ).

This assumption helps identify the effect of estimating the likelihood, as we assume

a perfect proposal, and helps to simplify the derivation of the guidelines for the optimal

number of particles. Under Assumption 2, the acceptance probability (7) of the Metropolis-

Hastings scheme becomes

\[
\alpha(z',z;\rho,\sigma) = \min(1, e^{z-z'}). \tag{9}
\]

The next lemma gives the conditional and unconditional acceptance probabilities of the Metropolis-Hastings scheme for z and θ and is proved in Appendix A.

Lemma 2. Suppose Assumptions 1 and 2 hold and ρ=1−1/G.

(i) The acceptance probability of the Metropolis-Hastings scheme conditional on z′ = z(θ′,u′) is

\[
P(\text{accept}|z',\rho,\sigma) = \exp(-x+\tau^2/2)\,\Phi\left(\frac{x}{\tau}-\tau\right) + \Phi\left(-\frac{x}{\tau}\right)
\]

with x := (z′+σ^2/2)(1−ρ) and τ = σ√(1−ρ^2).

(ii) The unconditional acceptance probability of the Metropolis-Hastings scheme is

\[
P(\text{accept}|\rho,\sigma) = 2\left(1-\Phi\left(\frac{\sigma\sqrt{1-\rho}}{\sqrt{2}}\right)\right). \tag{10}
\]
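Equation (10) can be compared against a direct Monte Carlo average of the acceptance probability (9). The following sketch is illustrative, with hypothetical function names; it samples z′ and z|z′ from the normal laws in Lemma 1.

```python
import math
import random
from statistics import NormalDist

def accept_prob_formula(sigma, rho):
    """Unconditional acceptance probability, equation (10)."""
    arg = sigma * math.sqrt(1.0 - rho) / math.sqrt(2.0)
    return 2.0 * (1.0 - NormalDist().cdf(arg))

def accept_prob_mc(sigma, rho, n_draws, rng):
    """Monte Carlo average of min(1, exp(z - z')) under Lemma 1."""
    s2 = sigma ** 2
    tau = sigma * math.sqrt(1.0 - rho ** 2)
    total = 0.0
    for _ in range(n_draws):
        z_cur = rng.gauss(0.5 * s2, sigma)                              # z' ~ N(s2/2, s2)
        z_new = rng.gauss(-0.5 * s2 * (1.0 - rho) + rho * z_cur, tau)   # z | z'
        total += math.exp(min(0.0, z_new - z_cur))
    return total / n_draws
```

For σ^2≈234 and ρ=0.99 both functions should return a value near 0.28.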

Suppose that we are interested in estimating π(ϕ)=∫ϕ(θ)π(θ)dθ for some scalar function ϕ(θ) of θ. Let {θ[j],z[j], j=1,...,M} be the draws obtained from the PMMH sampler after it has converged, and let the estimator of π(ϕ) be π̂(ϕ) := (1/M)∑_{j=1}^{M} ϕ(θ[j]). Then, we define the inefficiency of the estimator π̂(ϕ) relative to an estimator based on an i.i.d. sample from π(θ) (as in Assumption 2) as

\[
IF(\varphi,\sigma,\rho) := \lim_{M\to\infty} M\,V_{PMMH}(\hat\pi(\varphi))/V(\varphi|y), \tag{11}
\]

where V_PMMH(π̂(ϕ)) is the variance of the estimator π̂(ϕ) and V(ϕ|y) := Eπ(ϕ(θ)^2) − [Eπ(ϕ(θ))]^2 is the posterior variance of ϕ, so that V(ϕ|y)/M is the variance of the ideal estimator when the θ[j] are i.i.d. draws from π(θ). We obtain the following result which shows that under our assumptions the inefficiency IF(ϕ,σ,ρ) is independent of ϕ and is a function only of σ and ρ=1−1/G. The proof is in Appendix A. We call IF(σ,ρ) the inefficiency of the PMMH algorithm, for given ρ and σ. From (8) and Lemma 1, we note that the posterior density of z conditional on θ is πN(z|θ) = e^z gN(z|θ) = N(z;σ^2/2,σ^2), which does not depend on θ and is denoted π(z|σ).

Lemma 3. The inefficiency is given by

\[
IF(\sigma,\rho) = 1 + 2E_{z'\sim\pi(z'|\sigma)}\left(\frac{1-k(z'|\rho,\sigma)}{k(z'|\rho,\sigma)}\right), \tag{12}
\]

where k(z′|ρ,σ)=Pr(accept|z′,ρ,σ) is the acceptance probability of the MCMC scheme conditional on the previous iterate z′ and is given by part (i) of Lemma 2.

Similarly to Pitt et al. (2012), we obtain in Appendix B the computing time of the sampler as

\[
CT(\sigma,\rho) := \frac{IF(\sigma,\rho)}{\sigma^{1/\varpi}}, \tag{13}
\]

which takes into account the total number of particles needed to obtain a given precision and the mixing rate of the PMMH chain.

To simplify the notation in this section we often do not show dependence on ρ as it is assumed constant. In Section 5 we show that if we take G=O(T^{1/2}), then ρ=1−O(T^{−1/2}) and Nk=O(T^{1/(4ϖ)}) are optimal.

The next lemma shows the optimal σ under our assumptions as well as the corresponding

acceptance rate. Its proof is in Appendix A.


Lemma 4. Given that ρ=1−1/G is fixed and close to 1, the optimal σopt that minimizes CT(σ,ρ) is σopt ≈ 2.16/√(1−ρ^2) if ϖ=1/2, and σopt ≈ 0.82/√(1−ρ^2) if ϖ=3/2−ε. The unconditional acceptance rate (10) under this optimal choice of the tuning parameters is approximately 0.28 and 0.68, respectively.
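The optimum in Lemma 4 can be explored numerically by combining Lemma 2(i), equation (12) and equation (13). The sketch below is an assumption-laden illustration, not the authors' procedure: it evaluates the expectation in (12) with a simple midpoint rule over a standardised grid for z′, and the function names, grid size and integration width are all hypothetical choices.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf

def cond_accept(z_cur, sigma, rho):
    """k(z'|rho, sigma): conditional acceptance probability, Lemma 2(i)."""
    x = (z_cur + 0.5 * sigma ** 2) * (1.0 - rho)
    tau = sigma * math.sqrt(1.0 - rho ** 2)
    return math.exp(-x + 0.5 * tau ** 2) * PHI(x / tau - tau) + PHI(-x / tau)

def inefficiency(sigma, rho, n_grid=2000, width=8.0):
    """IF(sigma, rho) from (12), midpoint rule over z' ~ N(sigma^2/2, sigma^2)."""
    h = 2.0 * width / n_grid
    total = 0.0
    for i in range(n_grid):
        t = -width + (i + 0.5) * h                  # standardised value of z'
        k = cond_accept(0.5 * sigma ** 2 + sigma * t, sigma, rho)
        density = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        total += (1.0 - k) / k * density * h
    return 1.0 + 2.0 * total

def computing_time(sigma2, rho, exponent=0.5):
    """CT from (13); `exponent` is the rate in Assumption 1(i), 1/2 for pseudo numbers."""
    sigma = math.sqrt(sigma2)
    return inefficiency(sigma, rho) / sigma ** (1.0 / exponent)
```

Evaluating `computing_time` over a coarse grid of σ^2 values with ρ=0.99 (G=100) should place the minimum near σ^2=2.16^2/(1−ρ^2)≈234.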

Lemma 4 provides a theory for selecting the optimal σopt given that G is fixed. Because the optimal variance σ^2_opt is much bigger than 1, the optimal variance in the standard PMMH, Lemma 4 shows that the block-wise PMMH is much more robust than the standard PMMH in the sense that the former can tolerate highly noisy estimates of the likelihood.

Let M be the length of the Markov chain to be generated. The average number of times

that a block u(k) is updated is M/G. In general, G should be selected such that M/G is

not too small so that the space of z is well explored. In the examples in this paper, if not

otherwise stated, we set G=100, as we found that the efficiency is relatively insensitive to

larger values of G.

A toy example. Suppose that we wish to sample from π(θ,z)=π(θ)e^z g(z|σ) in which θ is the parameter of interest, with π(θ)=N(0,1) and g(z|σ)=N(−σ^2/2,σ^2). Suppose further that z is divided into blocks z=∑_{k=1}^{G} z(k) with the z(k) i.i.d. N(−σ^2_G/2, σ^2_G), σ^2_G=σ^2/G and G=100.

First, we consider two sampling schemes, the standard PMMH and the block-wise

PMMH, to sample from π(θ,z) with σ2G = 2.34, i.e. σ2 = 234. Suppose that (θ′,z′) is

the current state. The proposal (θ,z) in the standard PMMH is generated by θ ∼ π(θ)

and z ∼ g(z|σ). The proposal (θ,z) in the block-wise PMMH scheme is generated as

follows. Let z′=∑_{k=1}^{G} z′(k) be the current z-state and let k be an index uniformly generated from the set {1,...,G}. Sample z(k) ∼ N(−σ^2_G/2, σ^2_G) and let z=∑_{j≠k} z′(j)+z(k) be the proposal. Both schemes accept (θ,z) with probability min(1,e^{z−z′}). In this setting,

the blockwise PMMH scheme is exactly the correlated PMMH algorithm of Dahlin et al.

(2015) and Deligiannidis et al. (2015) where the proposal density q(z|z′) is normal with

mean −(σ2/2)(1−ρ)+ρz′ and variance σ2(1−ρ2), ρ=1−1/G.
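The two schemes of the toy example can be simulated directly. The sketch below is one possible implementation under the stated setup (perfect proposal for θ, blocks drawn from N(−σ^2_G/2, σ^2_G)); the function name and the choice to initialise the blocks from g are assumptions made for the illustration.

```python
import math
import random

def toy_sampler(sigma2, G, n_iter, blockwise, rng):
    """Sample from pi(theta, z) = N(theta; 0, 1) exp(z) g(z|sigma)."""
    s2 = sigma2 / G
    sd = math.sqrt(s2)

    def draw():
        return rng.gauss(-0.5 * s2, sd)            # one block z_(k)

    blocks = [draw() for _ in range(G)]
    z_cur, theta = sum(blocks), 3.0                # chain initialised at theta = 3
    thetas, accepted = [], 0
    for _ in range(n_iter):
        theta_prop = rng.gauss(0.0, 1.0)           # perfect proposal for theta
        if blockwise:
            k = rng.randrange(G)
            new_block = draw()
            z_prop = z_cur - blocks[k] + new_block  # only block k changes
        else:
            new_blocks = [draw() for _ in range(G)]
            z_prop = sum(new_blocks)               # all blocks are refreshed
        if rng.random() < math.exp(min(0.0, z_prop - z_cur)):
            theta, z_cur, accepted = theta_prop, z_prop, accepted + 1
            if blockwise:
                blocks[k] = new_block
            else:
                blocks = new_blocks
        thetas.append(theta)
    return thetas, accepted / n_iter
```

With σ^2=234 and G=100, the block-wise chain should accept a substantial fraction of proposals, while the standard chain should accept almost none.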

Figure 1 plots the θ-samples generated by the standard PMMH scheme (left panel) and

by the block-wise PMMH scheme (right panel). As expected, the standard PMMH chain

is sticky because of the big variance of z, σ2=234.

Figure 1: The samples of θ generated by the standard PMMH scheme (left) and the block-wise PMMH scheme (right). Both chains are initialised at 3 and run for 500,000 iterations.

Now, in order to study the effect of σ^2 on the acceptance rate and computing time CT(σ) of the block-wise sampler, Figure 2 shows the CT(σ) and acceptance rates for various values of σ^2. The figure shows that CT(σ) has a minimum value of 0.0263 at σ^2=234, where the acceptance rate is 0.279, which agrees with the theory. Among all standard PMMH schemes with different σ^2, Pitt et al. (2012) show that the optimal scheme is the one with σ^2≈1. We also run this optimal standard PMMH scheme and obtain a computing time value of CT(σ=1)=5.32. Hence, the optimal block-wise PMMH is 5.32/0.0263 ≈ 202 times more efficient than the optimal standard PMMH.

3.1 Guidelines for selecting the optimal number of particles for panel data

Figure 2: Blockwise PMMH: The left panel shows the computing time CT(σ)=IF(σ)/σ^2 and the right panel shows the acceptance rate vs. the variance σ^2. The dashed lines indicate the values w.r.t. the optimal variance σ^2_opt=234.

We now consider the case with the likelihood factorised as in (5). From Lemma 4, for pseudo numbers, where the exponent in Assumption 1(i) is 1/2, the optimal variance of the log-likelihood estimator based on each group is σ^2_opt/G ≈ 2.16^2/(1+ρ), which is approximately 2.34 when G is large (e.g. G≈100). For quasi numbers, the optimal variance of the log-likelihood estimator based on each group is σ^2_opt/G ≈ 0.82^2/(1+ρ) ≈ 0.34, given that G is large. Hence, for each group k, we propose tuning the number of particles N(k)=N(k)(θ) such that V(z(k)|θ,N(k)) is approximately 2.34 and 0.34 if the likelihood is estimated using pseudo numbers and quasi numbers, respectively. In many cases, it is more convenient to tune N(k)=N(k)(θ) at some central value of θ and then fix N(k) across all MCMC iterations.
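One simple way to carry out the tuning is to double the number of particles until the replicated variance of the block log-likelihood estimate falls below the target. This is only a sketch: the doubling rule, the number of replications and the name `tune_particles` are assumptions for illustration, and `log_lik_hat` stands for any user-supplied function that returns one log-likelihood estimate for a given N.

```python
import math
import random
from statistics import variance

def tune_particles(log_lik_hat, target_var, rng, n_start=1, n_reps=500, n_max=2 ** 20):
    """Double N until the sample variance of log_lik_hat(N, rng) is at most target_var."""
    n = n_start
    while n < n_max:
        reps = [log_lik_hat(n, rng) for _ in range(n_reps)]
        if variance(reps) <= target_var:
            return n
        n *= 2
    return n_max

def synthetic_log_lik_hat(gamma2):
    """Stand-in estimator whose log error is N(-gamma2/(2N), gamma2/N)."""
    def estimator(n, rng):
        return rng.gauss(-gamma2 / (2.0 * n), math.sqrt(gamma2 / n))
    return estimator
```

For example, `tune_particles(synthetic_log_lik_hat(100.0), 2.34, random.Random(5))` searches over N=1,2,4,... for the pseudo-number target of 2.34; in a real application the stand-in estimator would be replaced by the importance sampling or particle filter estimate evaluated at a central value of θ.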

4 Applications

4.1 Panel data example

A clinical trial is conducted to test the effectiveness of beta-carotene in preventing non-

melanoma skin cancer (Greenberg et al., 1989). Patients were randomly assigned to a

control or treatment group and biopsied once a year to ascertain the number of new skin

cancers since the last examination. The response yij is a count of the number of new skin

cancers in year j for the ith subject. Covariates include age, skin (1 if skin has burns and

0 otherwise), gender, exposure (a count of the number of previous skin cancers), year of

follow-up and treatment (1 if the subject is in the treatment group and 0 otherwise). There

are T =1683 subjects with complete covariate information.

Following Donohue et al. (2011) we consider the mixed Poisson model with a random

intercept

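\[
p(y_{ij}|\beta,\alpha_i) = \mathrm{Poisson}(\exp(\eta_{ij})),
\]
\[
\eta_{ij} = \beta_0+\beta_1\mathrm{Age}_i+\beta_2\mathrm{Skin}_i+\beta_3\mathrm{Gender}_i+\beta_4\mathrm{Exposure}_{ij}+\alpha_i,
\]

where αi∼N(0,ϱ^2), i=1,...,T=1683, j=1,...,ni=5. The likelihood is

\[
p(y|\theta)=\prod_{i=1}^{T}p(y_i|\theta), \qquad p(y_i|\theta)=\int\left(\prod_{j=1}^{n_i}p(y_{ij}|\beta,\alpha_i)\right)p(\alpha_i|\varrho^2)\,d\alpha_i,
\]

with θ=(β,ϱ^2) the vector of model parameters to be estimated.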

We run both the optimal standard PMMH and the optimal block-wise PMMH for 50,000 iterations with the first 10,000 discarded as burn-in. For simplicity, each likelihood p(yi|θ) is estimated by importance sampling, giving an estimate p̂(yi|θ) based on Ni i.i.d. samples from the natural importance sampler, the random-effect density p(αi|·). For the standard PMMH, for each θ, the number of particles Ni=Ni(θ) is tuned so that the variance of the log-likelihood estimator V(log p̂(yi|θ)) is not bigger than 1/T. This is done as follows. We start from some small Ni and increase Ni if this variance is bigger than 1/T. Note that the variance V(log p̂(yi|θ)) can be approximated in closed form. The CPU time spent on tuning Ni is taken into account in the comparison. In the block-wise PMMH, we divide the data into G=99 groups, so that each group has 17 panels, and the variance of the log-likelihood estimator in each group is tuned to not be bigger than 2.34. Figure 3 plots the samples generated by the two algorithms.
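The importance sampling estimate of a single subject's likelihood can be sketched as follows. This is an illustration under assumptions: the fixed-effects part of the linear predictor is passed in precomputed as `eta_i`, the random intercept is drawn from a normal with standard deviation `sd_alpha`, and the function names are hypothetical. Weights are averaged on the log scale for numerical stability.

```python
import math
import random

def log_poisson_pmf(y, log_mu):
    """Log of the Poisson pmf with mean exp(log_mu)."""
    return y * log_mu - math.exp(log_mu) - math.lgamma(y + 1)

def log_lik_hat_subject(y_i, eta_i, sd_alpha, n_samples, rng):
    """Log of the unbiased estimate of p(y_i|theta) for one subject.

    y_i: counts for the subject's visits; eta_i: fixed-effects linear
    predictor for each visit (without the random intercept).
    """
    log_w = []
    for _ in range(n_samples):
        alpha = rng.gauss(0.0, sd_alpha)           # draw from the random-effect density
        log_w.append(sum(log_poisson_pmf(y, eta + alpha)
                         for y, eta in zip(y_i, eta_i)))
    m = max(log_w)                                  # log-sum-exp for stability
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / n_samples)
```

Summing `log_lik_hat_subject` over the 17 subjects in a group gives the log of the group estimate, whose variance is the quantity tuned to be at most 2.34.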

As performance measures, we report the acceptance rate, the integrated autocorrelation

time (IACT), the CPU times, and the time normalised variance (TNV). For a univariate

Markov chain, the IACT is estimated by

\[
\mathrm{IACT} = 1 + 2\sum_{t=1}^{1000}\hat\rho_t,
\]

where ρ̂t are the sample autocorrelations. The IACT for a multivariate chain is averaged

over the IACT values for the parameters. The time normalised variance is the product of

the IACT and the CPU time. The TNV is proportional to the computing time defined in

(13) if the CPU time to generate N particles is linearly proportional to N .
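The IACT estimate above is straightforward to compute from a stored chain. The following sketch uses the truncated sum of sample autocorrelations with a default cut-off of 1000 lags; the function name and the `max_lag` parameter are assumptions for the illustration.

```python
import random

def iact(chain, max_lag=1000):
    """Integrated autocorrelation time: 1 + 2 * sum of sample autocorrelations."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    var = sum(c * c for c in centred) / n
    total = 0.0
    for t in range(1, min(max_lag, n - 1) + 1):
        cov = sum(centred[i] * centred[i + t] for i in range(n - t)) / n
        total += cov / var
    return 1.0 + 2.0 * total
```

As a sanity check, an AR(1) chain with autoregressive coefficient 0.5 has a theoretical IACT of (1+0.5)/(1−0.5)=3. The time normalised variance is then the product of this value and the CPU time.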

Table 1 summarises the acceptance rates, the IACT ratio, the CPU ratio, and the

TNV ratio, with the blockwise PMMH as the baseline. As shown, the block-wise PMMH

outperforms the standard PMMH. In particular, the block-wise PMMH is around 25 times

more efficient than the standard PMMH in terms of the time normalised variance.

Figure 3: Plots of the standard PMMH chain (black) and block-wise PMMH chain (green) for the 6 parameters.

Methods            Acceptance   IACT ratio   CPU ratio   TNV ratio
Standard PMMH      0.222        1.080        23.095      24.938
Block-wise PMMH    0.243        1            1           1

Table 1: Skin cancer example using the block-wise PMMH as the baseline

Choice of the numbers of samples N

We demonstrate the usefulness of Lemma 4 as a guideline for selecting suitable numbers of samples Ni in each estimation of p(yi|θ). Here we use the static strategy where the Ni are fixed across θ and are tuned at a central θ obtained by a short pilot run. Because there are 17 panels in each group, given a target group-variance σ^2_G, Ni=Ni(θ) is selected such that V(log p̂(yi|θ)) ≈ σ^2_G/17, so that V(z(k)) ≈ σ^2_G. It is easier in terms of implementation to keep Ni fixed; furthermore, tuning Ni for each proposal θ adds computational cost.

Figure 4 shows the average N=∑Ni/T, IACT and computing time CT for various group-variances σ^2_G, when the likelihood is estimated using pseudo numbers. Here, CT is computed as the product of N and IACT. CT has a minimum at σ^2_G≈2.3, which requires 40 samples on average to estimate each p(yi|θ). In this example we found that CT does not change much when σ^2_G is between 2 and 2.4. CT increases slowly when we choose Ni such that σ^2_G is smaller than 2.3, whereas it increases quite dramatically when σ^2_G is higher than the optimal value. To be on the safe side, we therefore advocate a conservative choice of Ni in practice.

Figure 4: Pseudo numbers: Average N, IACT and CT = N × IACT for various σ^2_G.

We now consider using quasi numbers, generated using the scrambled net algorithm of Matousek (1998), to estimate the likelihood. Because scrambled quasi numbers are dependent, it is necessary to generate a different set of scrambled quasi numbers to compute each unbiased estimate p̂(yi|θ). Then, the p̂(yi|θ) are independent of each other, so p̂(y|θ)=∏_i p̂(yi|θ) is an unbiased estimate of p(y|θ)=∏_i p(yi|θ). We compute V(log p̂(yi|θ)) by replications as its closed-form approximation is no longer available. Figure 5 shows that CT has a minimum at σ^2_G≈0.3, which agrees with the theory in Lemma 4. CT increases slowly when σ^2_G is smaller than 0.3, and it increases quickly when σ^2_G is higher than this value.
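The idea of using a fresh randomisation for each estimate can be illustrated with a much simpler construction than the scrambled nets used above. The sketch below swaps in a randomly shifted base-2 van der Corput sequence (a Cranley-Patterson rotation), chosen only because it needs no external library; it is not the Matousek (1998) algorithm, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, variance

def van_der_corput(i):
    """i-th point of the base-2 van der Corput low-discrepancy sequence."""
    x, denom = 0.0, 1.0
    while i > 0:
        denom *= 2.0
        x += (i & 1) / denom
        i >>= 1
    return x

def normal_rqmc(n, rng):
    """n standard normal variates from a randomly shifted van der Corput sequence."""
    shift = rng.random()                            # fresh shift per estimate
    us = [(van_der_corput(i) + shift) % 1.0 for i in range(n)]
    # Guard the endpoints so that the inverse normal CDF is defined.
    return [NormalDist().inv_cdf(min(max(u, 1e-15), 1.0 - 1e-15)) for u in us]

def normal_mc(n, rng):
    """n standard normal variates from pseudo random numbers."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def estimate(draws, f):
    """Sample average of f over the draws."""
    return sum(f(a) for a in draws) / len(draws)
```

Each call to `normal_rqmc` uses its own shift, so repeated estimates are independent of each other, and replicating them gives the variance estimate by replications mentioned above. For a smooth integrand such as f(α)=exp(−exp(α)), the replicated variance of the shifted-sequence estimate is typically far below that of the pseudo-random estimate at the same n.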

Figure 5: Quasi numbers: Average N, IACT and CT = N × IACT for various σ^2_G.

Pseudo numbers vs quasi numbers

We now compare the performance of the standard PMMH using pseudo random numbers,

the blockwise PMMH using pseudo random numbers, and the blockwise PMMH using

randomised quasi numbers. Here, we fix the number of samples Ni=50 across all θ and i.

Table 2 summarises the performance measures with the quasi blockwise PMMH as the

baseline. The two blockwise PMMH frameworks outperform the standard PMMH. The

quasi blockwise PMMH outperforms the pseudo blockwise PMMH in terms of acceptance

rates, IACT and TNV, but not the CPU time. We note that the CPU times depend heavily on the programming language and the implementation. Our implementation of the quasi blockwise PMMH is suboptimal because we first generate scrambled quasi numbers and then transform them into the standard normal variates needed for estimating the likelihood. For the pseudo blockwise PMMH, we generate standard normal variates directly using the optimised built-in function in Matlab.

Methods                  Acceptance   IACT ratio   CPU ratio   TNV ratio
Standard PMMH            0.002        12.005       0.927       11.133
pseudo blockwise PMMH    0.179        4.121        0.742       3.057
quasi blockwise PMMH     0.225        1            1           1

Table 2: Skin cancer example using the quasi blockwise PMMH as the baseline

4.2 Subsampling MCMC

Quiroz et al. (2016) propose a data subsampling approach, based on the correlated PMMH algorithm in Deligiannidis et al. (2015), to speed up MCMC. As in Quiroz et al. (2016), we demonstrate the PMMH methods using two AR(1) models with iid Student-t errors $\epsilon_t\sim t(\nu)$ with known degrees of freedom $\nu$. The data generating processes are

$$
y_t=\begin{cases}
\beta_0+\beta_1 y_{t-1}+\epsilon_t, & [M_1,\ \theta=(\beta_0=0.3,\ \beta_1=0.6)]\\
\mu+\rho(y_{t-1}-\mu)+\epsilon_t, & [M_2,\ \theta=(\mu=0.3,\ \rho=0.99)],
\end{cases}
\tag{14}
$$

for $t=1,\dots,T$, where $p(\epsilon_t)\propto(1+\epsilon_t^2/\nu)^{-(\nu+1)/2}$ with $\nu=5$, and the priors are uniform: $p(\beta_0,\beta_1)=U(-5,5)\cdot U(0,1)$ and $p(\mu,\rho)=U(-5,5)\cdot U(0,1)$.

Define $\ell_t(\theta):=\log p(y_t|y_{t-1},\theta)$ and rewrite the log-likelihood $\ell(\theta)$ as

$$
\ell(\theta)=q(\theta)+d(\theta),\qquad q(\theta)=\sum_{t=1}^{T}q_t(\theta),\qquad d(\theta)=\sum_{t=1}^{T}d_t(\theta),\qquad\text{with } d_t(\theta)=\ell_t(\theta)-q_t(\theta),
$$

where $q_t(\theta)\approx\ell_t(\theta)$ is a control variate. We set $q_t(\theta)$ to a Taylor series approximation of $\ell_t(\theta)$ evaluated at the nearest centroid from a clustering in data space. This has the effect of reducing the complexity of computing $q(\theta)$ from $O(T)$ to $O(C)$, where $C$ is the number of centroids. The reader is referred to Quiroz et al. (2016) for the details. Now, an unbiased estimate of $\ell(\theta)$ based on a simple random sample with replacement is

$$
\hat\ell(\theta)=\frac{T}{N}\sum_{i=1}^{N}d_{u_i}(\theta)+q(\theta),\qquad\text{with } u_i\in\{1,\dots,T\},\quad \Pr(u_i=t)=\frac{1}{T},\quad t=1,\dots,T.
\tag{15}
$$

Here $N$ is the subsample size and $u=(u_1,\dots,u_N)$ represents a vector of observation indices. The resulting likelihood estimate is $\hat p_N(y|\theta,u)=\exp\big(\hat\ell(\theta)-\hat\sigma^2(\theta)/2\big)$, where $\hat\sigma^2(\theta)$ is an unbiased estimate of the variance of $\hat\ell(\theta)$ in (15). Quiroz et al. (2016) show that carrying out MCMC with this slightly biased likelihood estimator samples from a perturbed posterior that is very close to the correct posterior if $N$ is sufficiently large in relation to $\sigma^2(\theta)$.

Block-wise PMMH creates G blocks, each with N/G indices. For large G, a high

correlation in the MH log-ratio is naturally induced, since u and u′ differ only at a small

number of positions.
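The estimator in (15) and the block-wise refresh of u can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names, the array `d` (standing in for the values d_t(θ), which in practice would be evaluated only at the sampled indices) and the scalar `q` (standing in for q(θ)) are hypothetical and not taken from the paper, and θ is held fixed so that only the effect of refreshing one block is visible.

```python
import numpy as np

def loglik_estimate(d, q, u):
    """Estimator (15): (T/N) * sum_i d[u_i] + q, plus an unbiased variance estimate."""
    T, N = d.shape[0], u.shape[0]
    terms = T * d[u]                    # each T * d_{u_i} is unbiased for d(theta)
    l_hat = terms.mean() + q
    var_hat = terms.var(ddof=1) / N     # unbiased estimate of Var(l_hat)
    return l_hat, var_hat

def refresh_one_block(u, G, T, rng):
    """Copy u and redraw the indices of one randomly chosen block (about N/G of them)."""
    blocks = np.array_split(np.arange(u.shape[0]), G)
    k = rng.integers(G)
    u_new = u.copy()
    u_new[blocks[k]] = rng.integers(0, T, size=blocks[k].size)
    return u_new, k

rng = np.random.default_rng(0)
T, N, G = 100_000, 1300, 100
d = 1e-3 * rng.standard_normal(T)       # hypothetical stand-in for d_t(theta)
q = -1.5e5                              # hypothetical stand-in for q(theta)
u = rng.integers(0, T, size=N)          # current index vector (0-based indices)
u_prop, k = refresh_one_block(u, G, T, rng)

l_cur, v_cur = loglik_estimate(d, q, u)
l_prop, v_prop = loglik_estimate(d, q, u_prop)
log_p_cur = l_cur - v_cur / 2           # log of the bias-corrected likelihood estimate
log_p_prop = l_prop - v_prop / 2
```

Because `u_prop` differs from `u` in at most one block of 13 positions, the two log-likelihood estimates share most of their terms, which is the source of the high correlation in the MH log-ratio.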

We generate T = 100,000 observations from the models in (14) and run both the correlated PMMH and the block-wise PMMH for 55,000 iterations, from which we discard the first 5,000 draws as burn-in. Using the same σ²(θ) as in Quiroz et al. (2016) results in sample sizes of approximately N = 1300 and N = 2600 for models M1 and M2, respectively. For the block-wise PMMH we use G = 100, so that each block has ≈ 13 observations for M1 and ≈ 26 observations for M2. Also, following Quiroz et al. (2016), the persistence parameter in the correlated PMMH is set to φ = 0.9999, and we use a random walk proposal which is adapted during the burn-in phase to target α ≈ 0.15 (Sherlock et al., 2015).

Table 3 summarises the performance measures introduced in Section 4.1. It is evident

that the block-wise PMMH dramatically outperforms the correlated PMMH on the speed

measures. This is because, as discussed above, the correlated PMMH requires N operations

for generating the vector u. The block-wise PMMH moves only one block at a time, so

that the update of the vector u requires N/G operations.


                    M1      M2      M1      M2      M1       M2       M1       M2
Correlated PMMH     0.149   0.140   1.110   1.124   62.893   38.610   69.444   43.478
Block-wise PMMH     0.160   0.151   1       1       1        1        1        1

Table 3: Data subsampling example using block-wise PMMH as a baseline.

4.3 Diffusion process example

This section applies the PMMH approaches to estimating diffusion processes. Consider a

diffusion process X={Xt,t≥0} governed by the stochastic differential equation (SDE)

dXt = µ(Xt, θ)dt+ σ(Xt, θ)dWt, (16)

with $W_t$ a Wiener process. We assume that regularity conditions on $\mu(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ are met so that the solution to the SDE in (16) exists and is unique. We are interested in estimating the vector of parameters $\theta$ based on discrete-time observations $x=\{x_0,x_1,\dots,x_n\}$, where $x_i$ is the observation of $X$ at time $t_i=i\Delta$ with $\Delta$ some time interval. The likelihood is

$$
p(x|\theta)=\prod_{i=0}^{n-1}p_\Delta(x_{i+1}|x_i,\theta),
$$

where the $\Delta$-interval Markov transition density $p_\Delta(x_{i+1}|x_i,\theta)$ is typically intractable. In order to make the discrete approximation of the continuous-time process $X$ sufficiently accurate, it is common to represent

$$
p_\Delta(x_{i+1}|x_i,\theta)=\int p_h(x_{i+1}|z_{i,M-1},\theta)\,p_h(z_{i,M-1}|z_{i,M-2},\theta)\cdots p_h(z_{i,1}|x_i,\theta)\,dz_{i,1}\cdots dz_{i,M-1},
\tag{17}
$$

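where $p_h(\cdot|\cdot,\theta)$ is the Markov transition density of $X$ after a time step $h=\Delta/M$. Provided that $h$ is small enough, $p_h(u|v,\theta)$ is well approximated by the Euler approximation

$$
p_h^{\text{euler}}(u|v,\theta)=N\big(u;\,v+h\mu(v,\theta),\,h\Sigma(v,\theta)\big),
$$

where $N(\cdot;a,S)$ denotes the normal density with mean $a$ and covariance $S$, and $\Sigma(v,\theta)=\sigma(v,\theta)\sigma'(v,\theta)$. The transition density in (17) is then approximated by

$$
p_\Delta^{\text{euler}}(x_{i+1}|x_i,\theta)=\int p_h^{\text{euler}}(x_{i+1}|z_{i,M-1},\theta)\,p_h^{\text{euler}}(z_{i,M-1}|z_{i,M-2},\theta)\cdots p_h^{\text{euler}}(z_{i,1}|x_i,\theta)\,dz_{i,1}\cdots dz_{i,M-1}.
$$

Finally, we are interested in the posterior

$$
p(\theta|x)\propto p(\theta)\,p^{\text{euler}}(x|\theta),\qquad p^{\text{euler}}(x|\theta)=\prod_{i=0}^{n-1}p_\Delta^{\text{euler}}(x_{i+1}|x_i,\theta).
$$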

The likelihood $p^{\text{euler}}(x|\theta)$ is intractable but can be estimated unbiasedly. We follow Stramer and Bognar (2011) and estimate $p_\Delta^{\text{euler}}(x_{i+1}|x_i,\theta)$ by importance sampling using the importance sampler of Durham and Gallant (2002),

$$
z_{i,m+1}\sim N\Big(z_{i,m}+\frac{x_{i+1}-z_{i,m}}{M-m},\ h\,\frac{M-m-1}{M-m}\,\Sigma(z_{i,m},\theta)\Big),\qquad m=0,\dots,M-2,
$$

where $z_{i,0}=x_i$. The density of this sampler is

$$
g(z_i)=g(z_{i,1},\dots,z_{i,M-1})=\prod_{m=0}^{M-2}N\Big(z_{i,m+1};\ z_{i,m}+\frac{x_{i+1}-z_{i,m}}{M-m},\ h\,\frac{M-m-1}{M-m}\,\Sigma(z_{i,m},\theta)\Big).
$$

We sample $N$ such trajectories $z_i^{(j)}=(z_{i,1}^{(j)},\dots,z_{i,M-1}^{(j)})$, $j=1,\dots,N$, and denote by $u_i$ the set of all required random variates. Then, the unbiased estimator of $p_\Delta^{\text{euler}}(x_{i+1}|x_i,\theta)$ is

$$
\hat p_\Delta^{\text{euler}}(x_{i+1}|u_i,x_i,\theta)=\frac{1}{N}\sum_{j=1}^{N}\frac{p_h^{\text{euler}}(x_{i+1}|z_{i,M-1}^{(j)},\theta)\,p_h^{\text{euler}}(z_{i,M-1}^{(j)}|z_{i,M-2}^{(j)},\theta)\cdots p_h^{\text{euler}}(z_{i,1}^{(j)}|x_i,\theta)}{g(z_i^{(j)})}.
$$

The likelihood $p^{\text{euler}}(x|\theta)$ is estimated unbiasedly and in the form of factorisation (5), so all the theory developed is applicable.
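The importance sampling estimator above can be sketched in Python for a scalar diffusion. This is a minimal sketch under stated assumptions, not the implementation used for the results: the function names are hypothetical, the drift and squared diffusion are passed in as the callables `mu` and `Sigma`, and the demonstration uses an Ornstein-Uhlenbeck process (constant diffusion) rather than the CIR model, purely because its exact transition density is available for comparison.

```python
import numpy as np

def norm_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x (scalar case)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def transition_density_estimate(x0, x1, mu, Sigma, delta, M, N, rng):
    """Importance-sampling estimate of the Euler transition density from x0 to x1,
    using the bridge proposal z_{m+1} ~ N(z_m + (x1 - z_m)/(M-m), h (M-m-1)/(M-m) Sigma(z_m))."""
    h = delta / M
    z = np.full(N, float(x0))            # z_{i,0} = x_i for all N trajectories
    log_w = np.zeros(N)                  # log of (product of Euler densities) / g
    for m in range(M - 1):               # m = 0, ..., M-2
        S = Sigma(z)
        prop_mean = z + (x1 - z) / (M - m)
        prop_var = h * (M - m - 1) / (M - m) * S
        z_new = prop_mean + np.sqrt(prop_var) * rng.standard_normal(N)
        log_w += norm_logpdf(z_new, z + h * mu(z), h * S)   # Euler density of this step
        log_w -= norm_logpdf(z_new, prop_mean, prop_var)    # proposal density of this step
        z = z_new
    # Final Euler step from z_{i,M-1} to the observed end point x_{i+1}.
    log_w += norm_logpdf(x1, z + h * mu(z), h * Sigma(z))
    return np.exp(log_w).mean()

# Demonstration on an Ornstein-Uhlenbeck process dX = beta (alpha - X) dt + sigma dW
# (an assumption made here so that the exact transition density can be computed).
rng = np.random.default_rng(1)
alpha, beta, sigma = 0.0, 0.5, 1.0
mu = lambda z: beta * (alpha - z)
Sigma = lambda z: sigma ** 2 * np.ones_like(z)
x0, x1, delta = 1.0, 1.1, 1.0 / 12.0

est = transition_density_estimate(x0, x1, mu, Sigma, delta, M=20, N=500, rng=rng)
exact_mean = alpha + (x0 - alpha) * np.exp(-beta * delta)
exact_var = sigma ** 2 * (1.0 - np.exp(-2.0 * beta * delta)) / (2.0 * beta)
exact = np.exp(norm_logpdf(x1, exact_mean, exact_var))
```

For a model such as CIR, where the diffusion coefficient depends on the state, the Gaussian proposal can generate non-positive states, so an implementation would additionally need a rule for such trajectories; that detail is outside this sketch.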


We apply the proposed method to fit the Cox-Ingersoll-Ross (CIR) model

$$
dX_t=\beta(\alpha-X_t)\,dt+\sigma\sqrt{X_t}\,dW_t
$$

to the FedFunds dataset. The FedFunds dataset we use consists of 745 monthly federal funds rates in the US from July 1954 to August 2016.

Following Stramer and Bognar (2011), we use the prior $p(\alpha,\beta,\sigma)=I_{(0,1)}(\alpha)\,I_{(0,\infty)}(\beta)\,\sigma^{-1}I_{(0,\infty)}(\sigma)$. As in Stramer and Bognar (2011), we set Δ = 1/12 and N = 5. We set M = 300 to make the Euler approximation highly accurate, whereas Stramer and Bognar (2011) use M = 20. We use G = 184 groups. Table 4 summarises the results, which show that the blockwise PMMH works better than the standard PMMH. However, we note that the improvement is not as large as the one achieved in the previous examples.

Our findings contradict the results in Stramer and Bognar (2011), who report that the blocking strategy does not work better than the standard PMMH. However, the setting in Stramer and Bognar (2011) differs from ours in three respects. First, their updating scheme alternates between θ and the blocks u(1),...,u(G), rather than updating θ together with one of the blocks at a time as we do. Their scheme therefore updates the main parameter θ only once every G iterations, which is very inefficient when G is large. Second, Stramer and Bognar's dataset consists of 432 monthly rates from January 1963 to December 1998, which is a small portion of our dataset. Third, they set M = 20 while we set M = 300. Our setting is much more challenging for the standard PMMH because the estimates of the likelihood are highly variable. The setting in Stramer and Bognar (2011) is manageable for the standard PMMH, and therefore the blockwise strategy is not necessary there.


Standard PMMH     0.246   1.657   1.146   1.899
Blockwise PMMH    0.383   1       1       1

Table 4: FedFunds data example using the pseudo blockwise PMMH as the baseline

[Figure 6, six panels: sample autocorrelation functions, each plotting sample autocorrelation against lags 0 to 100.]

Figure 6: Autocorrelations for the standard PMMH chains (first row) and the blockwise PMMH chains (second row)

4.4 ABC example: Bayesian inference for α-stable models

α-stable distributions (Nolan, 2007) are a class of heavy-tailed distributions used in many statistical applications. The main difficulty when working with α-stable distributions is that they do not have closed-form densities, which makes inference difficult. However, as it is easy to sample from an α-stable distribution, one can use Approximate Bayesian Computation (ABC) techniques for Bayesian inference (Tavare et al., 1997; Peters et al., 2012). ABC approximates the likelihood by the so-called likelihood-free approximation

$$
p_{LF}(y|\theta)=\int K_\epsilon\big(S(y'),S(y)\big)\,p(y'|\theta)\,dy',
\tag{18}
$$

where $K_\epsilon(\cdot,\cdot)$ is a kernel with bandwidth $\epsilon$ and $S(\cdot)$ is a vector of summary statistics. Inferences are then based on the approximate posterior $p_{ABC}(\theta|y)\propto p(\theta)\,p_{LF}(y|\theta)$, where $p_{LF}(y|\theta)$ is unbiasedly estimated by an average of $K_\epsilon(S(y^{[i]}),S(y))$ with $y^{[i]}\overset{iid}{\sim}p(\cdot|\theta)$. We illustrate in this example that the block-wise PMMH scheme is still applicable, although the likelihood cannot be factorised as in (5).

We use the example in Peters et al. (2012) and generate a data set $y=\{y_1,\dots,y_n\}$ with $n=200$ from a univariate α-stable distribution with parameters α = 1.7, β = 0.9, γ = 10 and δ = 10. The summary statistics of a pseudo-dataset $y'=\{y'_1,\dots,y'_n\}$ are $S(y')=(v_\alpha(y'),v_\beta(y'),v_\gamma(y'),v_\delta(y'))$; see Peters et al. (2012) for details. The likelihood-free approximation $p_{LF}(y|\theta)$ in (18) is estimated by $\hat p_{LF}(y|\theta)=K_\epsilon(S(y'),S(y))$, with $K_\epsilon$ the Gaussian kernel with covariance matrix $\epsilon I_4$. We use only one pseudo-dataset to estimate $p_{LF}(y|\theta)$, as suggested in Peters et al. (2012).

Both the standard PMMH and the block-wise PMMH are run for 50,000 iterations, with the first 10,000 discarded as burn-in. The block-wise PMMH scheme is carried out as follows. Given a vector of parameters θ, write the pseudo-data point $y'_i$ as $f(\theta,u_i)$, with $u_i$ the set of random numbers used to generate $y'_i$. We divide the set $u=\{u_i,\ i=1,\dots,200\}$ into G = 100 blocks, with block $k$ consisting of $u_{2k-1}$ and $u_{2k}$, $k=1,\dots,G$. The block-wise PMMH then updates θ together with one block of u at a time.

Table 5 summarises the performance measures for different values of ε, averaged over 10 replications. Here, the mean squared error (MSE) is the l2-norm of the difference between the estimated posterior mean and the true parameters. See Pasarica and Gelman (2010) for a definition of the averaged squared distance (ASD) as a performance measure in MCMC; the larger the ASD, the better. The results show that the block-wise PMMH works better than the standard PMMH in this example.

ε     Methods      Acc. rate   MSE    IACT     ASD
10    Standard     0.31        1.41   359.75   3.14
      Block-wise   0.37        1.29   174.11   3.43



Table 5: α-stable distribution example

4.5 A time series example

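We consider a time series $\{y_t,\ t=1,\dots,T=1000\}$ generated from the state space model

$$
y_t|x_t\sim\text{Poisson}(\lambda_t),\qquad \lambda_t=e^{\beta+x_t},
\tag{19}
$$

$$
x_{t+1}=\phi x_t+\eta_t,\qquad \eta_t\sim N(0,\sigma^2),\qquad x_1\sim N\big(0,\sigma^2/(1-\phi^2)\big).
$$

The model parameters are $\theta=(\beta,\phi,\sigma^2)$. We generate the data using the parameter values β = 1, φ = 0.5 and σ² = 2(1 − φ²).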

We employ the high-dimensional importance sampling method of Shephard and Pitt (1997) and Durbin and Koopman (1997) to obtain an unbiased likelihood estimator $\hat p(y|\theta,u)$. The simulation smoothing step requires 2T independent univariate normal variates to generate each sample path of the states, so the set of random variates u needed is a matrix of size N × (2T), with N the number of samples. We divide u into G = 100 blocks, where $u_1$ consists of the first 2T/G columns of u, $u_2$ consists of the next 2T/G columns of u, and so on.

We use the static strategy in this example, i.e. the number of sample paths N is fixed. Let $\bar\theta$ be some central value of θ, e.g. the maximum likelihood estimate obtained using the simulated maximum likelihood method (Gourieroux and Monfort, 1995). For simplicity, we set $\bar\theta$ to the true value. For the standard PMMH, the value of N is chosen such that $V(\log\hat p_N(y|\bar\theta,u))\approx 1$, where the variance is estimated by replications. For the two block-wise PMMH schemes, one with pseudo numbers and the other with quasi numbers, we select N such that $V(\log\hat p_N(y|\bar\theta,u))\approx 2.16^2/(1-\rho^2)$ and $\approx 0.82^2/(1-\rho^2)$ respectively, with the correlation ρ estimated as follows. Let $N_0$ be some large value of N. Let $z=\log\hat p_{N_0}(y|\bar\theta,u)$ and $z'=\log\hat p_{N_0}(y|\bar\theta,u')$, with u′ obtained from u by generating a new set for a randomly selected block $u_k$, with the other blocks kept fixed. We generate J = 1000 realisations $\{(z_j,z'_j)\}_{j=1}^{J}$ of (z, z′) and estimate ρ by the sample correlation $\hat\rho$. For the correlated PMMH of Deligiannidis et al. (2015), we set the correlation ρ = 0.99 and use the same N as in the pseudo block-wise PMMH. Each MCMC scheme is run for 25,000 iterations with 5,000 burn-in iterations.
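The replication procedure for estimating ρ can be sketched in Python as follows. The sketch is illustrative only: instead of the importance sampler for the state space model, it uses a hypothetical toy estimator in which each column of u contributes an independent factor with expectation one, so that z is the sum of the column-wise log factors. All function names and the toy estimator are assumptions made for illustration. For such a toy estimator with independent, equally variable blocks, the correlation between z and z′ should be close to 1 − 1/G.

```python
import numpy as np

def log_factor_by_column(u, s=1.0):
    """Toy stand-in for a log-likelihood estimate: column c contributes
    log mean_n exp(s * u[n, c] - s^2 / 2), the log of an unbiased estimate of 1."""
    return np.log(np.mean(np.exp(s * u - 0.5 * s ** 2), axis=0))

def estimate_rho(N, P, G, J, rng):
    """Sample correlation of (z, z') over J replications, where z' is obtained
    from z by regenerating one randomly selected block of columns of u."""
    blocks = np.array_split(np.arange(P), G)
    z = np.empty(J)
    z_prime = np.empty(J)
    for j in range(J):
        u = rng.standard_normal((N, P))
        contrib = log_factor_by_column(u)
        z[j] = contrib.sum()
        cols = blocks[rng.integers(G)]            # randomly selected block u_k
        new_contrib = contrib.copy()              # other blocks are kept fixed
        new_contrib[cols] = log_factor_by_column(rng.standard_normal((N, cols.size)))
        z_prime[j] = new_contrib.sum()
    return np.corrcoef(z, z_prime)[0, 1]

rho_hat = estimate_rho(N=20, P=200, G=100, J=1000, rng=np.random.default_rng(1))
```

In an actual application the toy function would be replaced by the log of the likelihood estimator evaluated at the central parameter value with a large number of samples.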

Methods                      N      Acc. rate   IACT ratio   CPU ratio   CT ratio
standard PMMH                3500   -           -            -           -
correlated PMMH (ρ = .99)    56     0.18        1.613        1.336       2.155
pseudo block-wise PMMH       56     0.23        1.154        1.257       1.451
quasi block-wise PMMH        16     0.23        1            1           1

Table 6: State space example: performance measure ratios with the quasi block-wise PMMH as the baseline

Table 6 summarises the results. The standard PMMH requires around N = 3500 samples and is therefore very computationally demanding, so we do not run it in this case. The quasi block-wise PMMH outperforms the others in this example. Although both the correlated PMMH and the pseudo block-wise PMMH use the same number of particles N, the latter uses less CPU time because it generates only one block of u in each iteration.

5 Large sample in T analysis for panel data

In this section we discuss properties of the block-wise PMMH for large T for the panel data model discussed in Sections 3 and 4.1, and show that the total number of particles required per MCMC iteration is $O(T^{3/2})$ if pseudo random numbers are used and roughly $O(T^{7/6})$ if quasi random numbers are used, rather than $O(T^2)$ as in the standard PMMH. We also justify part (i) of Assumption 1 and show a weak posterior correlation between θ and z.

Consider the panel data model, with the panels in the kth group denoted by $G_k$, and suppose that we use the same $N_i=N_k$ particles for all panels $i\in G_k$. Let $p(y_i|\theta)$ be the likelihood of the ith panel, and let $\hat p_{N_i}(y_i|\theta,u_i)$ be the unbiased estimate of $p(y_i|\theta)$. We make the following assumption.

Assumption 3. For each $i\in G_k$ and parameter value θ, there exists an $A_i(\theta)^2$ such that, as $N_k\to\infty$,

$$
N_k^{\varpi}\big(\hat p_{N_i}(y_i;\theta,u_i)-p(y_i|\theta)\big)\overset{d}{\Rightarrow}N\big(0,A_i(\theta)^2\big),
\tag{20}
$$

for some $\varpi>0$.

The central limit theorem (20) holds for most importance sampling estimates of the likelihood, where $\varpi=1/2$ if using pseudo random numbers and $\varpi=3/2-\epsilon$ with an arbitrarily small $\epsilon>0$ if randomised quasi numbers are used; see, e.g., Loh (2003); Owen (1997).

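Let $\gamma_i(\theta)^2=A_i(\theta)^2/p(y_i|\theta)^2$ and define $z_i(\theta,u_i):=\log\big(\hat p_{N_i}(y_i;\theta,u_i)/p(y_i|\theta)\big)$. The following result holds and is similar to Lemma 2(ii) in Pitt et al. (2012). Its proof is straightforward and is omitted.

Lemma 5. Suppose that Assumption 3 holds, with rate exponent $\varpi$. Then, for $i\in G_k$, as $N_k\to\infty$,

$$
N_k^{\varpi}\Big(z_i(\theta,u_i)+\frac{\gamma_i^2(\theta)}{2N_k^{2\varpi}}\Big)\overset{d}{\Rightarrow}N\big(0,\gamma_i^2(\theta)\big).
$$

So, informally, $z_i(\theta,u_i)$ is approximately distributed as $N\big(-\gamma_i^2(\theta)/(2N_k^{2\varpi}),\ \gamma_i^2(\theta)/N_k^{2\varpi}\big)$. The next corollary formalises the results of this section. Its proof follows from the discussion immediately above and Lemma 5.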

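Corollary 1. Define $\gamma^2_{(k)}(\theta):=\sum_{i\in G_k}\gamma_i^2(\theta)$. Suppose that we take $G_T=O(T^{1/2})$, where the subscript T, here and below, indicates dependence on T, and let $\varpi$ be the rate exponent in Assumption 3. Then,

(i) The number of elements in each group is $|G_k|=O(T/G_T)=O(T^{1/2})$, $\gamma^2_{(k)}(\theta)=O(T/G_T)=O(T^{1/2})$, and

$$
\rho_T=1-O(T^{-1/2}),\qquad \sigma^2_{\text{opt},T}=O(T^{1/2})\qquad\text{and}\qquad N_{k,T}=N_k(\theta,\sigma^2_{\text{opt},T},G_T)=O(T^{1/(4\varpi)}).
$$

(ii) The total number of particles used to estimate the likelihood is

$$
\sum_{k=1}^{G}N_{k,T}(\theta)\times(T/G)=O(T^{1+1/(4\varpi)})=
\begin{cases}
O(T^{3/2}), & \text{if } \varpi=1/2\\
O\big(T^{\frac{7-4\epsilon}{6-4\epsilon}}\big), & \text{if } \varpi=3/2-\epsilon.
\end{cases}
$$

(iii) Part (i) of Assumption 1 holds.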
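As a worked check of the orders in parts (i) and (ii), the following short derivation may help. It writes $\varpi$ for the rate exponent in Assumption 3 and assumes the relation $N_k=G^{1/(2\varpi)}\gamma_{(k)}^{1/\varpi}(\theta)/\sigma^{1/\varpi}$ used in Appendix B; it is a sketch of the arithmetic only, not a replacement for the proof.

```latex
% Order of the number of particles per panel in group k
N_{k,T} = \Big(\frac{G_T\,\gamma^2_{(k)}(\theta)}{\sigma^2_{\text{opt},T}}\Big)^{1/(2\varpi)}
        = \Big(\frac{O(T^{1/2})\,O(T^{1/2})}{O(T^{1/2})}\Big)^{1/(2\varpi)}
        = O\big(T^{1/(4\varpi)}\big).

% Total number of particles: G_T groups, each with T/G_T panels
\sum_{k=1}^{G_T} N_{k,T}\,\frac{T}{G_T}
        = G_T \cdot O\big(T^{1/(4\varpi)}\big)\cdot \frac{T}{G_T}
        = O\big(T^{1+1/(4\varpi)}\big).

% Pseudo random numbers, \varpi = 1/2:
1 + \frac{1}{4\cdot\tfrac12} = \frac{3}{2}.

% Randomised quasi numbers, \varpi = 3/2 - \epsilon:
1 + \frac{1}{4(3/2-\epsilon)} = 1 + \frac{1}{6-4\epsilon} = \frac{7-4\epsilon}{6-4\epsilon}
\;\longrightarrow\; \frac{7}{6} \quad\text{as } \epsilon \to 0.
```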

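We now present an important result that justifies that slowly moving z does not have any undesired effect on the mixing of the chain on θ. Suppose that the same number $N_T=O(T^{1/(4\varpi)})$ of particles is used for all panels, with $\varpi$ the rate exponent in Assumption 3.

Lemma 6 (Posterior orthogonality of θ and z). Assume that

(i) $z=\sum_{i=1}^{T}z_i(\theta,u_i)\sim N\big(-\zeta_T(\theta)/2,\ \zeta_T(\theta)\big)$ with $\zeta_T(\theta)=(T/N_T^{2\varpi})\,\eta_T(\theta)$, $\eta_T(\theta)=\frac{1}{T}\sum_{i=1}^{T}\gamma_i^2(\theta)$, and $\mu_\eta=E_\pi(\eta_T(\theta))$ is bounded (in T).

(ii) For any bounded and differentiable function $h(\theta)$,

$$
T\times E_\pi\Big[(\theta-\bar\theta)'\,\frac{\partial h(\bar\theta)}{\partial\theta}\,\frac{\partial\eta_T(\bar\theta)}{\partial\theta'}\,(\theta-\bar\theta)\Big]\xrightarrow{T\to\infty}v
$$

for some constant $v>0$. Here $\bar\theta$ denotes the posterior mean $E_\pi(\theta)$ and $\theta'$ denotes the transpose of the vector θ. Furthermore,

$$
T\times V_\pi(h(\theta))\xrightarrow{T\to\infty}v_h,\qquad\text{and}\qquad T\times V_\pi(\eta_T(\theta))\xrightarrow{T\to\infty}v_\eta
$$

for some constants $v_h>0$, $v_\eta>0$.

Then the posterior correlation of $h(\theta)$ and the log-likelihood estimation error z is approximately zero, i.e., $\text{Corr}(h(\theta),z|y)\to 0$ as $T\to\infty$.

Assumption (i) is justified by Lemma 5 and (ii) is justified by the Bernstein-von Mises theorem.

Proof. See the Appendix.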

6 Conclusion

We have presented an efficient block-wise PMMH algorithm for Bayesian inference for

models where the likelihood is a product as in panel data models and can be estimated

unbiasedly or for big data problems where the log likelihood is a sum with many terms

which can be estimated unbiasedly. The proposed algorithm divides the set of random

numbers used to estimate the likelihood into blocks, and then updates the parameters of

interest and one block at a time. The block-wise PMMH approach requires a much smaller

number of particles in the likelihood estimation than the standard PMMH. Applications

strongly confirm the usefulness of the proposed methodology. The blocking approach can

in principle be generalised to any model that is estimated by PMMH, including intractable

time series problems. The blocking approach can also be combined with the autocorrelation

approach in Dahlin et al. (2015) and Deligiannidis et al. (2015).

Using a random truncation to obtain unbiased estimates (Rhee and Glynn, 2015) is currently attracting significant attention. The resulting debiased estimates are highly variable, which makes it challenging to use the standard PMMH. The block-wise PMMH based on quasi numbers can offer an efficient method in this situation.

References

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo

methods. Journal of the Royal Statistical Society, Series B, 72:1–33.


Andrieu, C. and Roberts, G. (2009). The pseudo-marginal approach for efficient Monte

Carlo computations. The Annals of Statistics, 37:697–725.

Bartolucci, F., Farcomeni, A., and Pennoni, F. (2012). Latent Markov Models for Longitu-

dinal Data. Chapman and Hall/CRC press.

Beaumont, M. A. (2003). Estimation of population growth or decline in genetically moni-

tored populations. Genetics, 164:1139–1160.

Dahlin, J., Lindsten, F., Kronander, J., and Schon, T. B. (2015). Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables. Technical report. arXiv:1511.05483v1.

Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Sys-

tems with Applications. Springer, New York.

Deligiannidis, G., Doucet, A., and Pitt, M. (2015). The correlated pseudo-marginal method.

Technical report.

Dick, J. and Pillichshammer, F. (2010). Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, Cambridge.

Donohue, M. C., Overholser, R., Xu, R., and Vaida, F. (2011). Conditional Akaike infor-

mation under generalized linear and proportional hazards mixed models. Biometrika,

98:685–700.

Doucet, A., Pitt, M., Deligiannidis, G., and Kohn, R. (2015). Efficient implementation of

Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika,

102(2):295–313.

Durbin, J. and Koopman, S. J. (1997). Monte Carlo maximum likelihood estimation for

non-Gaussian state space models. Biometrika, 84:669–684.

Durham, G. B. and Gallant, A. R. (2002). Numerical techniques for maximum likeli-

hood estimation of continuous-time diffusion processes. Journal of Business & Economic

Statistics, 20(3):335–338.


Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis.

John Wiley & Sons, Ltd, New Jersey, 2nd edition.

Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood:

Particle filter analysis of dynamic economic models. Econometric Theory, 1:1–24.

Gerber, M. and Chopin, N. (2015). Sequential quasi Monte Carlo. J. R. Stat. Soc. Ser. B.

Stat. Methodol., 77(3):509–579.

Gourieroux, C. and Monfort, A. (1995). Statistics and Econometric Models, volume 2.

Cambridge University Press, Melbourne.

Greenberg, E. R., Baron, J. A., Stevens, M. M., Stukel, T. A., Mandel, J. S., Spencer,

S. K., Elias, P. M., Lowe, N., Nierenberg, D. N., G., B., and Vance, J. C. (1989). The

skin cancer prevention study: design of a clinical trial of beta-carotene among persons

at high risk for nonmelanoma skin cancer. Controlled Clinical Trials, 10:153–166.

Gunawan, D., Tran, M.-N., Suzuki, K., Dick, J., and Kohn, R. (2016). Computationally efficient Bayesian estimation of high dimensional copulas with discrete and mixed margins. Technical report.

Johnson, A. A., Jones, G. L., and Neath, R. C. (2013). Component-wise Markov Chain

Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statis-

tical Science, 28(3):360–375.

Lee, A. and Holmes, C. (2010). Discussion on particle Markov chain Monte Carlo methods.

Journal of the Royal Statistical Society, Series B, 72:1–33.

Lin, L., Liu, K., and Sloan, J. (2000). A noisy Monte Carlo algorithm. Physical Review D,

61.

Loh, W.-L. (2003). On the asymptotic distribution of scrambled net quadrature. Ann.

Statist., 31(4):1282–1324.

Matousek, J. (1998). On the l2-discrepancy for anchored boxes. Journal of Complexity,

14:527–556.


Murray, I. and Graham, M. M. (2016). Pseudo-marginal slice sampling. In Proceedings of

the 19th International Conference on Artificial Intelligence and Statistics, JMLR W&CP,

volume 51, pages 911–19.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods.

Society for Industrial and Applied Mathematics, Philadelphia.

Nolan, J. (2007). Stable Distributions: Models for Heavy-Tailed Data. Birkhauser, Boston.

Owen, A. B. (1997). Scrambled net variance for integrals of smooth functions. The Annals

of Statistics, 25(4):1541–1562.

Pasarica, C. and Gelman, A. (2010). Adaptively scaling the Metropolis algorithm using

expected squared jumped distance. Statistica Sinica, 20:343–364.

Peters, G., Sisson, S., and Fan, Y. (2012). Likelihood-free Bayesian inference for α-stable

models. Computational Statistics & Data Analysis, 56(11):3743 – 3756.

Pitt, M. K., Silva, R. S., Giordani, P., and Kohn, R. (2012). On some properties of

Markov chain Monte Carlo simulation methods based on the particle filter. Journal of

Econometrics, 171(2):134–151.

Quiroz, M., Villani, M., and Kohn, R. (2016). Speeding up MCMC by efficient data

subsampling. Technical report.

Rhee, C.-H. and Glynn, P. W. (2015). Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043.

Shephard, N. and Pitt, M. K. (1997). Likelihood analysis of non-Gaussian measurement

time series. Biometrika, 84:653–667.

Sherlock, C., Thiery, A., Roberts, G., and Rosenthal, J. (2015). On the efficiency of the

pseudo marginal random walk Metropolis algorithm. The Annals of Statistics, 43(1):238–

275.

Stramer, O. and Bognar, M. (2011). Bayesian inference for irreducible diffusion processes

using the pseudo-marginal approach. Bayesian Analysis, 6(2):231–258.


Tavare, S., Balding, D. J., Griffiths, R. C., and Donnelly, P. (1997). Inferring coalescence

times from DNA sequence data. Genetics, 145(2):505–518.

Tran, M.-N., Nott, D. J., and Kohn, R. (2015). Variational Bayes with intractable likeli-

hood. Technical report. arXiv:1503.08621.

Appendix A Proofs

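Proof of Lemma 2. To obtain the conditional acceptance probability, we use the following results:

$$
\int_{-\infty}^{A}\exp(z)\,N(z;a,b^2)\,dz=\exp(a+b^2/2)\,\Phi\Big(\frac{A-a-b^2}{b}\Big),
\tag{21}
$$

$$
\int_{A}^{\infty}N(z;a,b^2)\,dz=\Phi\Big(\frac{a-A}{b}\Big).
\tag{22}
$$

Now, as in Lemma 1, let $a(z')=E(z|z')=-\sigma^2/(2G)+\rho z'$ and $\tau^2=V(z|z')=\sigma^2(1-\rho^2)$, so that the conditional density of z given z′ is $N(z;a(z'),\tau^2)$. Using (21) and (22), the conditional probability of acceptance is

$$
\begin{aligned}
\int\min\big(1,\exp(z-z')\big)N(z;a(z'),\tau^2)\,dz
&=\int_{-\infty}^{z'}\exp(z-z')\,N(z;a(z'),\tau^2)\,dz+\int_{z'}^{\infty}N(z;a(z'),\tau^2)\,dz\\
&=\exp\big(a(z')-z'+\tau^2/2\big)\,\Phi\Big(\frac{z'-a(z')-\tau^2}{\tau}\Big)+\Phi\Big(\frac{a(z')-z'}{\tau}\Big)\\
&=\exp\big(-y+\tau^2/2\big)\,\Phi\Big(\frac{y-\tau^2}{\tau}\Big)+\Phi\Big(-\frac{y}{\tau}\Big),
\end{aligned}
$$

where $y:=z'-a(z')=(1-\rho)(z'+\sigma^2/2)$. To obtain the unconditional acceptance probability, we need to take the expectation of the conditional probability with respect to z′. Since $z'\sim N(\sigma^2/2,\sigma^2)$ and $1-\rho=1/G$, we have $y\sim N(\sigma^2/G,\ \sigma^2/G^2)$. Hence, the unconditional probability of acceptance is

$$
\int\exp\big(-y+\tau^2/2\big)\,\Phi\Big(\frac{y-\tau^2}{\tau}\Big)N(y;\sigma^2/G,\sigma^2/G^2)\,dy+\int\Phi\Big(-\frac{y}{\tau}\Big)N(y;\sigma^2/G,\sigma^2/G^2)\,dy.
\tag{23}
$$

To proceed further we use the following elementary results:

$$
\exp(-y)\,N(y;a,b^2)=\exp(b^2/2-a)\,N(y;a-b^2,b^2),
\tag{24}
$$

$$
\Phi\Big(\frac{-a-c}{\sqrt{b^2+d^2}}\Big)=\int\Phi\Big(\frac{-y-a}{b}\Big)N(y;c,d^2)\,dy,
\tag{25}
$$

$$
\Phi\Big(\frac{-a+c}{\sqrt{b^2+d^2}}\Big)=\int\Phi\Big(\frac{y-a}{b}\Big)N(y;c,d^2)\,dy.
\tag{26}
$$

Hence,

$$
\begin{aligned}
\exp\big(-y+\tau^2/2\big)N(y;\sigma^2/G,\sigma^2/G^2)
&=\exp\big(\tau^2/2+\sigma^2/(2G^2)-\sigma^2/G\big)\,N(y;\sigma^2/G-\sigma^2/G^2,\sigma^2/G^2)\\
&=N(y;\rho\sigma^2/G,\sigma^2/G^2),
\end{aligned}
$$

$$
\int\Phi\Big(\frac{y-\tau^2}{\tau}\Big)N(y;\rho\sigma^2/G,\sigma^2/G^2)\,dy
=\Phi\Big(\frac{\rho\sigma^2/G-\sigma^2(1-\rho^2)}{\sqrt{\sigma^2/G^2+\sigma^2(1-\rho^2)}}\Big)
=\Phi\Big(-\frac{\sigma\sqrt{1-\rho}}{\sqrt{2}}\Big),
$$

$$
\int\Phi\Big(-\frac{y}{\tau}\Big)N(y;\sigma^2/G,\sigma^2/G^2)\,dy=\Phi\Big(-\frac{\sigma\sqrt{1-\rho}}{\sqrt{2}}\Big).
$$

Summing the two terms gives the unconditional acceptance probability $2\Phi\big(-\sigma\sqrt{1-\rho}/\sqrt{2}\big)$.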

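Proof of Lemma 3. For notational simplicity, we write the proposal density $q(z|z';\rho,\sigma)$ as $q(z|z')$, the acceptance probability $\alpha(z',z;\rho,\sigma)$ in (9) as $\alpha(z',z)$, and the acceptance probability $k(z'|\sigma,\rho)$, conditional on the previous iterate, as $k(z')$. Let $\{(\theta_j,z_j),\ j=1,\dots,M\}$ be iterates, after convergence, of the Markov chain produced by the PMMH sampling scheme. Then, the Markov transition distribution from $(\theta',z')$ to $(\theta,z)$ is

$$
\begin{aligned}
p(\theta',z';d\theta,dz)
&=\alpha(z',z)\pi(\theta)q(z|z')\,d\theta\,dz+\Big(1-\int\alpha(z',z^*)\pi(\theta^*)q(z^*|z')\,d\theta^*dz^*\Big)\delta_{(\theta',z')}(d\theta,dz)\\
&=\alpha(z',z)\pi(\theta)q(z|z')\,d\theta\,dz+\big(1-k(z'|\sigma,\rho)\big)\,\delta_{(\theta',z')}(d\theta,dz),
\end{aligned}
$$

where $\delta_{(\theta',z')}(d\theta,dz)$ is the probability measure concentrated at $(\theta',z')$.

Consider now the space of functions

$$
\mathcal{F}=\Big\{\tilde\varphi:\tilde\Theta=\Theta\otimes\mathbb{R}\mapsto\mathbb{R},\ \tilde\varphi=\varphi(\theta)\psi(z),\ \pi(\varphi):=E_{\theta\sim\pi(\theta)}(\varphi)=0,\ \pi(\varphi^2):=E_{\theta\sim\pi(\theta)}(\varphi^2)<\infty,\ \pi(\psi^2):=E_{z\sim\pi(z)}(\psi^2)<\infty\Big\}.
$$

We define the operator $P:\mathcal{F}\mapsto\mathcal{F}$ as

$$
\begin{aligned}
(P\tilde\varphi)(\theta,z)&:=\int\tilde\varphi(\theta^*,z^*)\,p(\theta,z;d\theta^*,dz^*)\\
&=\pi(\varphi)\int\psi(z^*)\alpha(z,z^*)q(z^*|z)\,dz^*+\varphi(\theta)\psi(z)\big(1-k(z)\big)\\
&=\varphi(\theta)\psi(z)\big(1-k(z)\big),
\end{aligned}
$$

as $\pi(\varphi)=0$ by assumption. It is straightforward to check that $(P^j\tilde\varphi)(\theta,z)=\varphi(\theta)\psi(z)(1-k(z))^j$ and that $(P\tilde\varphi)(\theta_{j-1},z_{j-1})=E\big(\tilde\varphi(\theta_j,z_j)|\theta_{j-1},z_{j-1}\big)$. Hence, $(P^j\tilde\varphi)(\theta_0,z_0)=\tilde\varphi(\theta_0,z_0)(1-k(z_0))^j$.

We now consider $\tilde\varphi(\theta,z)=\varphi(\theta)\psi(z)$ with $\psi(z)\equiv 1$, so that $\tilde\varphi\in\mathcal{F}$; suppose also that $(\theta_0,z_0)\sim\pi_N$. Define $c_j:=\text{Cov}\big(\tilde\varphi(\theta_j,z_j),\tilde\varphi(\theta_0,z_0)\big)=\text{Cov}\big(\varphi(\theta_j),\varphi(\theta_0)\big)$. Then,

$$
\begin{aligned}
c_j&=E\big(\tilde\varphi(\theta_j,z_j)\,\tilde\varphi(\theta_0,z_0)\big)\\
&=E_{(\theta_0,z_0)\sim\pi_N}\big(E(\tilde\varphi(\theta_j,z_j)|\theta_0,z_0)\,\tilde\varphi(\theta_0,z_0)\big)\\
&=E_{(\theta_0,z_0)\sim\pi_N}\big((1-k(z_0))^j\,\tilde\varphi(\theta_0,z_0)^2\big)\\
&=E_{z_0\sim\pi_N(z)}\big((1-k(z_0))^j\big)\,E_{\theta_0\sim\pi}\big(\varphi(\theta_0)^2\big)\quad\text{because } z_0 \text{ only depends on } \sigma \text{ by construction}\\
&=E_{z_0\sim\pi_N(z)}\big((1-k(z_0))^j\big)\,c_0.
\end{aligned}
$$

The inefficiency IF is defined as

$$
\text{IF}=\Big(c_0+2\sum_{j=1}^{\infty}c_j\Big)\Big/c_0=1+2\sum_{j=1}^{\infty}E_{z\sim\pi_N(z)}\big((1-k(z))^j\big)=1+2\,E_{z\sim\pi_N(z)}\Big(\frac{1-k(z)}{k(z)}\Big),
$$

as required.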

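Proof of Lemma 4. From Lemma 1, $\pi(z'|\sigma)=N(z';\sigma^2/2,\sigma^2)$. Let $\omega:=[(1-\rho)(z'+\sigma^2/2)-\tau^2]/\tau$ with $\tau=\sigma\sqrt{1-\rho^2}$. Then,

$$
\omega\sim N\Big(-\frac{\rho\tau}{1+\rho},\ \frac{1-\rho}{1+\rho}\Big),
$$

and we note that the variance of ω depends only on ρ. For ρ close to 1, the variance of ω is approximately 1/(2G), which is very small. Thus, ω will be concentrated close to its mean $\omega^*:=-\rho\tau/(1+\rho)$. Define $p^*(\omega|\tau):=1-k(z'|\rho,\sigma)=\Phi(\omega+\tau)-\exp(-\omega\tau-\tau^2/2)\,\Phi(\omega)$. Then,

$$
\text{IF}(\sigma,\rho)=\int\frac{1+p^*(\omega|\tau)}{1-p^*(\omega|\tau)}\,N\Big(\omega;-\frac{\rho\tau}{1+\rho},\frac{1-\rho}{1+\rho}\Big)\,d\omega.
$$

It is convenient to write $\text{IF}(\sigma,\rho)$ as $\text{IF}(\tau|\rho)$, as we will optimise the computing time over τ keeping ρ fixed. Let

$$
f(\omega;\tau):=\frac{1+p^*(\omega|\tau)}{1-p^*(\omega|\tau)}.
$$

Using the 4th order Taylor series expansion of $f(\omega;\tau)$ at $\omega=\omega^*$, the inefficiency factor can be approximated by

$$
\text{IF}_{\text{approx}}(\tau|\rho)=f(\omega^*|\tau)+\frac{1}{2}\,\frac{1-\rho}{1+\rho}\,f^{(2)}(\omega^*|\tau)+\frac{1}{8}\Big(\frac{1-\rho}{1+\rho}\Big)^2f^{(4)}(\omega^*|\tau),
$$

which is considered as a function of τ with ρ fixed. This approximation is very accurate because, as noted, the variance of ω is very small for large G. So the computing time $\text{CT}(\sigma,\rho)=\text{IF}(\sigma,\rho)/\sigma^{1/\varpi}$, with $\varpi$ the rate exponent in Assumption 3, is approximated by

$$
\text{CT}_{\text{approx}}(\tau|\rho)=(1-\rho^2)^{\frac{1}{2\varpi}}\,\frac{\text{IF}_{\text{approx}}(\tau|\rho)}{\tau^{1/\varpi}}\propto\frac{\text{IF}_{\text{approx}}(\tau|\rho)}{\tau^{1/\varpi}}.
$$

Minimising this term over τ, for ρ close to 1, we find that $\text{CT}_{\text{approx}}(\tau|\rho)$ is minimised at $\tau\approx 2.16$ for $\varpi=1/2$, and at $\tau\approx 0.82$ for $\varpi\approx 3/2$. So the optimal $\sigma_{\text{opt}}\approx 2.16/\sqrt{1-\rho^2}$ for $\varpi=1/2$ and $\sigma_{\text{opt}}\approx 0.82/\sqrt{1-\rho^2}$ for $\varpi=3/2-\epsilon$ with any arbitrarily small ε.

Then the unconditional acceptance rate (10), for $\varpi=1/2$, is

$$
P(\text{accept}|\rho,\sigma_{\text{opt}})=2\Big(1-\Phi\Big(\frac{\sigma_{\text{opt}}\sqrt{1-\rho}}{\sqrt{2}}\Big)\Big)=2\Big(1-\Phi\Big(\frac{\sigma_{\text{opt}}\sqrt{1-\rho^2}}{\sqrt{2(1+\rho)}}\Big)\Big)\approx 2\Big(1-\Phi\Big(\frac{2.16}{2}\Big)\Big)\approx 0.28.
$$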
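The location of the optimal τ can be checked numerically with a short Python sketch. The sketch makes a simplifying assumption that is not part of the proof: it keeps only the leading term f(ω*|τ) of the Taylor expansion (dropping the second and fourth derivative corrections, which are small when the variance of ω is small), and it uses the conditional acceptance probability from the proof of Lemma 2 rewritten in terms of ω, namely k = exp(−ωτ − τ²/2)Φ(ω) + Φ(−ω − τ). The function names and the name `varpi` for the rate exponent are hypothetical.

```python
import math

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def accept_prob(omega, tau):
    """Conditional acceptance probability k expressed in terms of omega and tau."""
    return math.exp(-omega * tau - 0.5 * tau ** 2) * Phi(omega) + Phi(-omega - tau)

def ct_leading(tau, rho, varpi):
    """Leading-order computing time f(omega*; tau) / tau^(1/varpi), up to a constant."""
    omega_star = -rho * tau / (1.0 + rho)      # mean of omega
    k = accept_prob(omega_star, tau)
    inefficiency = (2.0 - k) / k               # (1 + p*) / (1 - p*) with p* = 1 - k
    return inefficiency / tau ** (1.0 / varpi)

def optimal_tau(rho, varpi, lo=0.1, hi=5.0, step=0.001):
    """Grid search for the tau that minimises the leading-order computing time."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return min(grid, key=lambda t: ct_leading(t, rho, varpi))

tau_pseudo = optimal_tau(rho=0.99, varpi=0.5)   # pseudo random numbers
tau_quasi = optimal_tau(rho=0.99, varpi=1.5)    # randomised quasi numbers, epsilon -> 0
accept_rate = 2.0 * (1.0 - Phi(tau_pseudo / math.sqrt(2.0 * (1.0 + 0.99))))
```

Under these assumptions the grid search should return values close to the τ ≈ 2.16 and τ ≈ 0.82 quoted above, and an acceptance rate close to 0.28 for the pseudo random number case.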

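Proof of Lemma 6. The joint posterior density of θ and z in (8) can be written as $\pi(\theta,z)=\pi(\theta)\pi(z|\theta)$ with $\pi(z|\theta)=N\big(z;\zeta_T(\theta)/2,\zeta_T(\theta)\big)$. Writing $\bar\theta$ for the posterior mean and $\varpi$ for the rate exponent in Assumption 3, we have

$$
\begin{aligned}
\text{Cov}(h(\theta),z|y)&=E[h(\theta)z|y]-E[h(\theta)|y]\,E[z|y]\\
&=\tfrac{1}{2}E_\pi[h(\theta)\zeta_T(\theta)]-\tfrac{1}{2}E_\pi[h(\theta)]\,E_\pi[\zeta_T(\theta)]\\
&=\tfrac{1}{2}E_\pi[(h(\theta)-\mu_h)\,\zeta_T(\theta)],\quad\text{with } \mu_h=E_\pi[h(\theta)]\\
&\approx\tfrac{1}{2}E_\pi\Big[(h(\theta)-\mu_h)\Big(\zeta_T(\bar\theta)+\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)\Big)\Big]\\
&=\tfrac{1}{2}E_\pi\Big[(h(\theta)-\mu_h)\,\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)\Big]\\
&\approx\tfrac{1}{2}E_\pi\Big[\Big(h(\bar\theta)+\frac{\partial h(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)-\mu_h\Big)\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)\Big]\\
&=\tfrac{1}{2}E_\pi\Big[(\theta-\bar\theta)'\,\frac{\partial h(\bar\theta)}{\partial\theta}\,\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}\,(\theta-\bar\theta)\Big]\\
&=\frac{1}{2N_T^{2\varpi}}\,T\times E_\pi\Big[(\theta-\bar\theta)'\,\frac{\partial h(\bar\theta)}{\partial\theta}\,\frac{\partial\eta_T(\bar\theta)}{\partial\theta'}\,(\theta-\bar\theta)\Big]\\
&\approx\frac{v}{2N_T^{2\varpi}}.
\end{aligned}
$$

$$
\begin{aligned}
V(z|y)&=E[z^2|y]-(E[z|y])^2=E_\pi\Big[\frac{\zeta_T(\theta)^2}{4}+\zeta_T(\theta)\Big]-\frac{1}{4}\big(E_\pi[\zeta_T(\theta)]\big)^2\\
&=\frac{1}{4}V_\pi(\zeta_T(\theta))+E_\pi(\zeta_T(\theta))\approx\frac{T}{4N_T^{4\varpi}}\,v_\eta+\frac{T}{N_T^{2\varpi}}\,\mu_\eta.
\end{aligned}
$$

$$
V(h(\theta)|y)=V_\pi(h(\theta))\approx\frac{v_h}{T}.
$$

Hence,

$$
\text{Corr}(h(\theta),z|y)=\frac{\text{Cov}(h(\theta),z|y)}{\sqrt{V(h(\theta)|y)\,V(z|y)}}\approx\frac{\dfrac{v}{2N_T^{2\varpi}}}{\sqrt{\dfrac{v_h}{T}\Big(\dfrac{T}{4N_T^{4\varpi}}v_\eta+\dfrac{T}{N_T^{2\varpi}}\mu_\eta\Big)}}=\frac{v}{\sqrt{v_h\big(v_\eta+4N_T^{2\varpi}\mu_\eta\big)}}\to 0
$$

as $T\to\infty$.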

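Appendix B Derivation of the expression (13) for Computing Time

The average number of particles required in each MCMC iteration to give the same accuracy in terms of variance as M iid iterates $\theta_1,\dots,\theta_M$ from $\pi(\theta)$ is proportional to

$$
\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{G}N_k(\theta_i)\,\text{IF}(\sigma,\rho)=\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{G}\frac{G^{\frac{1}{2\varpi}}\gamma_{(k)}^{1/\varpi}(\theta_i)}{\sigma^{1/\varpi}}\,\text{IF}(\sigma,\rho)\to\Big(G^{\frac{1}{2\varpi}}\sum_{k=1}^{G}\bar\gamma_{(k)}^{1/\varpi}\Big)\frac{\text{IF}(\sigma,\rho)}{\sigma^{1/\varpi}}
$$

as $M\to\infty$, where $\varpi$ is the rate exponent in Assumption 3 and $\bar\gamma_{(k)}^{1/\varpi}=E_{\theta\sim\pi}\big(\gamma_{(k)}^{1/\varpi}(\theta)\big)$. The factor in the brackets is independent of $\sigma^2$, hence the computing time is proportional to $\text{CT}=\text{IF}(\sigma,\rho)/\sigma^{1/\varpi}$.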
