Page 1: Chapter 4 Markov Chain Monte Carlo

Introduction to Monte Carlo Methods: Lecture Notes

Chapter 4

Markov Chain Monte Carlo

Qing Zhou∗

Contents

1 The Basic Idea .................................................. 2
  1.1 Markov chain Monte Carlo ................................... 2
  1.2 Transition kernel and stationary distribution ............... 2
  1.3 Simulating a Markov chain ................................... 2

2 The Metropolis-Hastings Algorithm ............................... 4
  2.1 Algorithm ................................................... 4
  2.2 Examples .................................................... 5
  2.3 Detail balance .............................................. 9
  2.4 Autocorrelation and efficiency .............................. 12

3 Simulated Annealing ............................................. 14

4 Some Special Designs ............................................ 16
  4.1 Random-walk Metropolis ...................................... 16
  4.2 Metropolized independence sampler ........................... 17
  4.3 Single-coordinate updating .................................. 19

5 The Gibbs Sampler ............................................... 20
  5.1 Algorithms .................................................. 20
  5.2 Stationary distribution and detail balance .................. 24

6 Examples of the Gibbs Sampler ................................... 26
  6.1 The slice sampler ........................................... 26
  6.2 Blocked Gibbs sampler ....................................... 27

∗UCLA Department of Statistics (email: [email protected]).


Zhou, Qing/Monte Carlo Methods: Chapter 4 2

1. The Basic Idea

We want to simulate a d-dimensional random vector X ∼ π (joint distribution) and compute

µ = Eπ(h(X)) = ∫_{Rd} h(x)π(x)dx.

1.1. Markov chain Monte Carlo

Generate a Markov chain x1, x2, · · · , xn by simulating xt ∼ p(·|xt−1), where xt = (xt1, · · · , xtd), such that as n → ∞,

1. µ̂ = (1/n) ∑_{t=1}^{n} h(xt) ≈ µ,
2. xn ∼ π.

Note that x1, x2, · · · , xn are correlated.

1.2. Transition kernel and stationary distribution

Denote the one-step transition kernel of a Markov chain (M.C.) on a general state space (Rd) by K(x, y) := p_{Xt|Xt−1}(y|x). This generalizes the one-step transition probabilities pij = P(Xt = j | Xt−1 = i) for discrete-state Markov chains. If a probability density π satisfies

∫ π(x)K(x, y)dx = π(y) for all y, (1)

then π(x) is a stationary distribution of the Markov chain:

Xt ∼ π =⇒ Xt+1 ∼ π.

The definition in (1) is a natural generalization of the definition for the discrete case:

∑_i πi · pij = πj for all j.

1.3. Simulating a Markov chain

Given an initial state x0 and a transition kernel K(x, y), it is straightforward to simulate an M.C. with this transition kernel for t = 1, 2, · · · , n by the following algorithm.


For t = 1, 2, · · · , n,

    Draw xt ∼ K(xt−1, •).

This draws from the conditional distribution [xt | xt−1]. Recall that in the discrete case, we draw xt from a discrete distribution with probabilities P[xt−1, •], one row of the transition matrix P.
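As a concrete illustration, the following Python sketch (the notes use R; Python is chosen here for brevity) simulates a discrete-state chain by repeatedly drawing from the row P[x_{t−1}, ·] of a transition matrix. The two-state matrix P is an illustrative assumption, not from the notes:

```python
import random

def simulate_chain(P, x0, n, rng=random):
    """Simulate n steps of a discrete-state Markov chain.

    P is the transition matrix (rows sum to 1); each step draws
    x_t from the row P[x_{t-1}], i.e. from [x_t | x_{t-1}].
    """
    states = list(range(len(P)))
    x, path = x0, [x0]
    for _ in range(n):
        x = rng.choices(states, weights=P[x])[0]
        path.append(x)
    return path

# Illustrative two-state chain; solving pi P = pi gives pi = (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]
random.seed(0)
path = simulate_chain(P, x0=0, n=50000)
```

For this P, the empirical frequency of state 0 approaches the stationary probability 2/3 as n grows, matching the discrete stationarity condition above.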


2. The Metropolis-Hastings Algorithm

Given a target distribution with density π(x), the Metropolis-Hastings (MH) algorithm simulates a Markov chain with π as its stationary distribution. Let S denote the support of π(x), i.e.,

S = {x : π(x) > 0}, (2)

which defines the state space for the Markov chain simulated by the MH algorithm.

2.1. Algorithm

Algorithm 1 (The MH algorithm). Pick a random initial state x(0) ∈ S. Design a proposal distribution q(x, y), which draws a random variable y given the value of x, i.e. it defines a conditional distribution [y | x]. The proposal must satisfy q(x, y) = 0 for any y ∉ S, i.e. the proposal only generates y such that π(y) > 0.

For t = 1, 2, · · · , n,

1. Draw y from the proposal distribution q(x(t−1), y);
2. Compute the MH ratio

   r(x(t−1), y) = min[ 1, π(y)q(y, x(t−1)) / (π(x(t−1))q(x(t−1), y)) ];

3. Draw u ∼ Unif(0, 1) and update

   x(t) = y, if u ≤ r(x(t−1), y);  x(t) = x(t−1), otherwise.

First development: Metropolis et al. (1953) with q(x, y) = q(y, x) (symmetric proposal), in which case the MH ratio simplifies to

r(x, y) = min[ 1, π(y)/π(x) ] =
    1,           if π(y) ≥ π(x);
    π(y)/π(x),   if π(y) < π(x).

As an example, consider a simple proposal q(x, y) that draws

y | x ∼ Unif(x− δ, x+ δ).

Therefore, the proposal (conditional density) is

q(x, y) = p(y | x) = 1/(2δ), if |y − x| < δ.


As a bivariate function, q(x, y) is symmetric in x and y, i.e. q(y, x) = q(x, y):

q(y, x) = 1/(2δ), if |x − y| < δ.

Therefore, this is a symmetric proposal. Using this proposal, the main steps of the MH algorithm are illustrated in the following figure, in which a, b ∈ (x(t) − δ, x(t) + δ) and π(a) > π(x(t)) > π(b).

[Figure: the density π(x), with the current state x(t) and the proposal window (x(t) − δ, x(t) + δ) containing the points a and b.]

y | x(t) ∼ Unif(x(t) − δ, x(t) + δ);  r(x(t), y) = min[ 1, π(y)/π(x(t)) ].

• If y = a, then r(x(t), y) = 1 and x(t+1) = y.
• If y = b, then r(x(t), y) = π(b)/π(x(t)) < 1: x(t+1) = y with probability π(b)/π(x(t)) and x(t+1) = x(t) with probability 1 − π(b)/π(x(t)).

2.2. Examples

Example 1. Draw N(0, 1) by an MH algorithm using Unif(x − δ, x + δ) with δ = 1 as the proposal.

# R code for this example
n=10000;
d=1;
X=numeric(n);
X[1]=0;
a=0;
for(t in 2:n)
{
  Y=runif(1,X[t-1]-d,X[t-1]+d);
  r=min(1,exp(-0.5*Y^2)/exp(-0.5*X[t-1]^2));
  u=runif(1,0,1);
  if(u<r){X[t]=Y;a=a+1}else{X[t]=X[t-1]};
}


a/n # acceptance rate
[1] 0.805

# use the last 5000 iterations (X[5001:n]) as our samples from N(0,1)
mean(X[5001:n])
[1] -0.04334007
sd(X[5001:n])
[1] 0.9988046
hist(X[5001:n])

[Figure: histogram of X[5001:n], roughly bell-shaped over (−3, 3).]

Some remarks:

• If an MH algorithm is irreducible and aperiodic, then the M.C. {x(t)} converges to the stationary distribution π(x) and sample averages approximate expectations:

  (1/n) ∑_t h(x(t)) → Eπ h(x) almost surely. (3)

• Burn-in period. Run this example with two different initial values, x(0) = 5 (red) vs x(0) = −5 (blue). The plot shows that the M.C. converges (the two curves mix) after about 30 iterations (the burn-in period). We usually use the average over x(t) after the burn-in period for estimation in (3).


[Figure: trace plots over the first 200 iterations of two chains started from x(0) = 5 and x(0) = −5; the two curves mix after roughly 30 iterations.]

Next, we demonstrate how to design a proposal q(x, y) such that y always stays in S.

Example 2 (Poisson distribution).

π(x) = e^{−λ}λ^x / x! ∝ λ^x / x!, x = 0, 1, 2, · · · .

For this example, the state space is S = {0, 1, · · · } (the nonnegative integers). Therefore, the proposal q(x, y) should only move within S. One possible design is:

If x ≥ 1, then y = x + 1 with probability 1/2 and y = x − 1 with probability 1/2;
If x = 0, then y = 1 with probability 1.

The state transition diagram of q(x, y):

0 ↔ 1 ↔ 2 ↔ 3 ↔ · · ·

q(0, 1) = 1 and q(x, y) = 1/2 if x ≥ 1 and y ∈ {x − 1, x + 1}.

The ratio between the target densities is

π(y)/π(x) = (λ^y / λ^x) · (x!/y!). (π(x) can be unnormalized.)

If x, y ≥ 1, then q(y, x)/q(x, y) = 1: symmetric.


If x = 0, y = 1: q(y, x)/q(x, y) = q(1, 0)/q(0, 1) = (1/2)/1 = 1/2.

If x = 1, y = 0: q(y, x)/q(x, y) = q(0, 1)/q(1, 0) = 1/(1/2) = 2.
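Putting the proposal and these ratios together, a minimal Python sketch of this sampler for the Poisson target might look as follows (the function name and the choice λ = 4 are illustrative assumptions):

```python
import random

def poisson_mh(lam, n_iter, x0=0, rng=random):
    """MH chain targeting Poisson(lam) with the +/-1 proposal above."""
    x, chain = x0, []
    for _ in range(n_iter):
        if x == 0:
            y, q_ratio = 1, 0.5               # q(1,0)/q(0,1) = (1/2)/1
        else:
            y = x + rng.choice([-1, 1])
            q_ratio = 2.0 if y == 0 else 1.0  # q(0,1)/q(1,0) = 1/(1/2)
        # pi(y)/pi(x): lam/y for y = x+1, and x/lam for y = x-1
        pi_ratio = lam / y if y == x + 1 else x / lam
        if rng.random() <= min(1.0, pi_ratio * q_ratio):
            x = y
        chain.append(x)
    return chain

random.seed(0)
chain = poisson_mh(lam=4.0, n_iter=100000)
```

After a burn-in, the sample mean should approach λ, the mean of the Poisson target.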

We use the 1-D Ising model to demonstrate the MH algorithm for simulating from a joint distribution.

Example 3 (1-D Ising Model). Consider a random vector x = (x1, · · · , xd) ∈ {1, −1}^d, i.e. every xj ∈ {1, −1}. Define an energy function

U(x) = −∑_{i=1}^{d−1} xi xi+1.

At a given temperature T > 0, the Boltzmann distribution is specified by the probability mass function

π(x) ∝ exp[ −U(x)/T ] = exp[ µ ∑_{i=1}^{d−1} xi xi+1 ], (4)

where µ = 1/T > 0, up to a normalizing constant. Note that π(x) = π(x1, . . . , xd) is a joint distribution over d binary discrete random variables xi, i = 1, . . . , d. There are a total of 2^d possible combinations of the xi's. We call each combination a configuration. This is a simple model for a physical system consisting of d particles. The Boltzmann distribution assigns a probability π(x) to each configuration x.

We can use a graph to represent the joint distribution π(x). Each node in the graph corresponds to a random variable, and an edge exists between xi and xi+1 if there is a product term xi xi+1 in U(x):

x1 — x2 — x3 — · · · — xi−1 — xi — xi+1 — · · · — xd

Given a current configuration x(t) = (x1(t), · · · , xd(t)), one iteration of the MH algorithm consists of:

1. Proposal: randomly choose j from {1, . . . , d} and flip xj to its opposite:

   y = (x1(t), · · · , −xj(t), · · · , xd(t)).

   This is a symmetric proposal: q(x(t), y) = q(y, x(t)).

2. Thus, the MH ratio is

   r(x(t), y) = min[ 1, π(y)/π(x(t)) ], with

   π(y)/π(x(t)) = exp{ −2µ xj(t) (xj−1(t) + xj+1(t)) },

where x0(t) = xd+1(t) ≡ 0.

The following R code implements this MH algorithm to simulate from π with T = 1 and estimate Eπ g(x) = Eπ(∑_i xi). You may change the value of T (temperature) to see its effect on the distribution and the expectation.

n=6000;
d=20;
X=matrix(0,n,d);
X[1,]=sample(c(-1,1),size=d,replace=TRUE);
g=numeric(n);
g[1]=sum(X[1,]);
T=1;
for(t in 2:n)
{
  y=X[t-1,];
  j=sample(1:d,size=1);
  y[j]=-X[t-1,j];
  if(j==1){
    r=exp(-2*X[t-1,1]*X[t-1,2]/T);
  }else if(j==d){
    r=exp(-2*X[t-1,d-1]*X[t-1,d]/T);
  }else{
    r=exp(-2*X[t-1,j]*(X[t-1,j-1]+X[t-1,j+1])/T);
  }
  U=runif(1,0,1);
  if(U<=min(r,1)){X[t,]=y}else{X[t,]=X[t-1,]};
  g[t]=sum(X[t,]);
}
mean(g[1000:n])

2.3. Detail balance

In this section, we verify that π is indeed a stationary distribution of the Markov chain simulated by the MH algorithm. That is, for any y ∈ S,

∫_S π(x)K(x, y)dx = π(y), or ∑_{x∈S} π(x)K(x, y) = π(y)

for a discrete state space, where K(x, y) is the one-step transition kernel of the MH algorithm.


We consider a sufficient condition that is easy to check, called the detail balance condition:

π(x)K(x, y) = π(y)K(y, x), for all x, y ∈ S. (5)

The key intuition behind the detail balance condition may be understood using water flow between two tanks x and y as an analogy, illustrated in the following figure. The volumes of water in the two tanks are π(x) and π(y), respectively. Two pipes connect the tanks, one allowing water to flow from x to y and the other from y to x. The flow rates are K(x, y) and K(y, x) per unit volume of water. Thus, the flow rate from tank x to y is π(x)K(x, y), and π(y)K(y, x) in the other direction. If the detail balance condition holds, then the amount of water flowing from x to y exactly matches that from y to x, and as a result, the volumes π(x) and π(y) stay constant over time.

[Figure: two tanks holding volumes π(x) and π(y), connected by pipes with flow rates K(x, y) and K(y, x) per unit volume.]

Lemma 1. If the detail balance condition (5) holds, then π is a stationary distribution of the Markov chain with K(x, y) as the one-step transition kernel.

Proof. Integrating both sides of the detail balance condition over x, we have, for any y,

∫ π(x)K(x, y)dx = ∫ π(y)K(y, x)dx = π(y) ∫ K(y, x)dx = π(y),

where the last equality holds because K(y, x) is a conditional density for [x | y].


Theorem 1. The MH algorithm simulates a Markov chain for which π(x) is a stationary distribution.

1. It is trivially true for x = y;2. Suppose y 6= x. Then the MH algorithm must propose y and accept it.

Therefore, the transition kernel

K(x, y) = q(x, y) min

[1,π(y)q(y, x)

π(x)q(x, y)

].

Now we have

π(x)K(x, y) = π(x)q(x, y) min

[1,π(y)q(y, x)

π(x)q(x, y)

]= min[π(x)q(x, y), π(y)q(y, x)]

= min

[π(x)q(x, y)

π(y)q(y, x), 1

]· π(y)q(y, x)

= π(y)K(y, x).


2.4. Autocorrelation and efficiency

Consider the efficiency of MCMC for estimating

µh = ∫ h(x)π(x)dx = Eπ[h(X)].

Suppose x(1), x(2), · · · , x(m) is a Markov chain with π as its stationary and also limiting distribution. Let h̄m = (1/m) ∑_{i=1}^{m} h(x(i)).

Assume that x(0) ∼ π(x). If m is large, then

Var(h̄m) = (σ²/m)[ 1 + 2 ∑_{j=1}^{m−1} (1 − j/m) ρj ] ≈ (σ²/m)[ 1 + 2 ∑_{j=1}^{∞} ρj ], (6)

where σ² = Varπ[h(x)] and

ρj = cor(h(x(1)), h(x(1+j))) = cor(h(x(t)), h(x(t+j))), for any t = 1, 2, . . . ,

is the j-step autocorrelation.

Comparing (6) to an independent sample, x(i) ∼ π independently for i = 1, . . . , m, for which

Var(h̄m) = σ²/m,

we define the effective sample size of this Markov chain as

m / (1 + 2 ∑_{j=1}^{∞} ρj).

Thus, the faster the autocorrelation ρj decays to zero, the more efficient the estimation of µh by the MCMC algorithm.
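The denominator in (6) can be estimated directly from a simulated chain. The Python sketch below computes an empirical effective sample size, truncating the sum at the first non-positive estimated ρ̂j (a simple heuristic that is an assumption here, not prescribed by the notes):

```python
import numpy as np

def effective_sample_size(h):
    """Estimate m / (1 + 2 * sum_j rho_j) from one chain of h-values.

    The infinite sum is truncated at the first non-positive estimated
    autocorrelation (a simple heuristic, not prescribed by the notes).
    """
    h = np.asarray(h, dtype=float)
    m = len(h)
    hc = h - h.mean()
    var = hc.var()
    denom = 1.0
    for j in range(1, m // 2):
        rho = np.dot(hc[:-j], hc[j:]) / ((m - j) * var)
        if rho <= 0:
            break
        denom += 2.0 * rho
    return m / denom

# Demo on an AR(1) chain h_t = 0.9 h_{t-1} + noise, whose theoretical
# ESS is about m * (1 - 0.9) / (1 + 0.9) = m / 19.
rng = np.random.default_rng(0)
m = 20000
h = np.empty(m)
h[0] = 0.0
noise = rng.standard_normal(m)
for t in range(1, m):
    h[t] = 0.9 * h[t - 1] + noise[t]
ess = effective_sample_size(h)
```

For an i.i.d. sample the estimate is close to m itself, while the AR(1) chain above loses roughly a factor of 19, illustrating how slowly decaying ρj shrinks the effective sample size.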


In Example 1, we may change the value of δ in the proposal Unif(x − δ, x + δ) to see the change in autocorrelations, demonstrating the different efficiency of different proposals. The figures below show the autocorrelation plots, ρj for j = 0, . . . , 40, generated by

acf(X)

for δ = 1 and δ = 5, together with the corresponding acceptance rates Pa. The autocorrelation plots suggest that the choice δ = 5 gives more efficient estimates. See Section 4.1 for related discussion.

[Figure: ACF plots of the series X for lags 0 to 40. Left: δ = 1, Pa = 0.805. Right: δ = 5, Pa = 0.312.]


3. Simulated Annealing

For any π(x), let h(x) = − log(π(x)). For T > 0, define

π(x; T) ∝ exp[ −h(x)/T ],

where T is the temperature in the Boltzmann distribution (4), regarding h(x) as the energy function. In particular, π(x) = π(x; T = 1). Denote the global minimizer of h by

x∗ = argmin h(x) = argmax π(x).

Varying the temperature T ∈ (0, ∞) changes the shape of the distribution π(x; T):

• T → ∞: for any x,

  π(x; T)/π(x∗; T) = exp[ (h(x∗) − h(x))/T ] → 1.

  Thus, π(x; T) ∝ 1, close to a uniform distribution.

• T → 0: for any x with h(x) > h(x∗),

  π(x; T)/π(x∗; T) = exp[ (h(x∗) − h(x))/T ] → exp(−∞) = 0.

  Thus, π(x; T) is concentrated at x∗, i.e. a point mass at x∗.

The goal of simulated annealing is to find x∗, the global minimizer of h(x). This method uses the MH algorithm to simulate from π(x; T) with a non-increasing sequence of T. It starts with a high temperature (large T) and gradually decreases T to zero. At a high temperature, since π(x; T) is pretty flat, the MH algorithm has a decent chance to explore different local modes of the density. Later on, as T decreases to 0, the samples will have a high probability of converging to x∗.


Algorithm 2 (Simulated annealing). Choose T1 ≥ T2 ≥ · · · ≥ Tn → 0 and pick x(0).

For t = 1, . . . , n:

• Set T = Tt.
• Draw x(t) given x(t−1) via one step of an MH algorithm targeting π(x; T). That is, in step 2 of Algorithm 1, we replace π(x) and π(y) by π(x; T) and π(y; T), respectively.
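Algorithm 2 can be sketched in a few lines of Python. The double-well objective h, the uniform random-walk proposal, and the geometric cooling schedule below are all illustrative assumptions, since the notes leave these choices open:

```python
import math
import random

def simulated_annealing(h, x0, temps, step=0.5, rng=random):
    """One MH step per temperature, targeting pi(x; T) ~ exp(-h(x)/T),
    with a symmetric uniform random-walk proposal (an assumption)."""
    x = x0
    for T in temps:
        y = x + rng.uniform(-step, step)
        # Symmetric proposal: accept with prob min(1, exp(-(h(y)-h(x))/T))
        if rng.random() < math.exp(min(0.0, -(h(y) - h(x)) / T)):
            x = y
    return x

# Illustrative double-well objective: global minimum near x = 1.04,
# local minimum near x = -0.96.
h = lambda x: (x * x - 1.0) ** 2 - 0.3 * x
temps = [1.0 * 0.999 ** t for t in range(5000)]  # geometric cooling
random.seed(1)
x_star = simulated_annealing(h, x0=0.0, temps=temps)
```

At the early, high temperatures the chain hops between both wells; as T shrinks it tends to settle into the deeper well near x = 1.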


4. Some Special Designs

4.1. Random-walk Metropolis

Consider π(x) defined on Rd (d-dimensional Euclidean space). Use the addition of a random perturbation (an error vector) as the proposal in the MH algorithm.

Given the current sample x(t), the proposal q(x(t), y) draws

y = x(t) + εt, εt ∼ gσ(ε), (7)

where gσ is a spherically symmetric distribution, i.e., gσ(a) = gσ(b) if ‖a‖ = ‖b‖ (Euclidean norm).

[Figure: two points a and b at the same distance from the origin in the (x1, x2)-plane, so that gσ(a) = gσ(b).]

Examples of gσ(ε) include the multivariate Gaussian Nd(0, σ²Id) and Unif(B(0, σ)), where B(0, σ) is the ball centered at 0 with radius σ, i.e.

B(0, σ) := {x ∈ Rd : ‖x‖ ≤ σ}.

The proposal in (7) is symmetric, q(x(t), y) = q(y, x(t)), since gσ(ε) = gσ(−ε).

The random-walk Metropolis algorithm: given x(t),

1. Draw εt ∼ gσ(ε) (spherically symmetric; σ can be controlled by the user), set y = x(t) + εt, and compute

   r(x(t), y) = min[ 1, π(y)/π(x(t)) ];

2. Draw u ∼ Unif(0, 1) and update

   x(t+1) = y, if u ≤ r(x(t), y);  x(t+1) = x(t), otherwise.

How to choose σ: maintain an acceptance rate in [0.25, 0.35]. See the autocorrelation plots in Section 2.4.
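The two steps above can be sketched in Python with a spherical Gaussian perturbation, working with log π to avoid numerical underflow (the bivariate normal target and the value σ = 2.4 are illustrative assumptions):

```python
import math
import random

def rw_metropolis(log_pi, x0, n_iter, sigma=1.0, rng=random):
    """Random-walk Metropolis with a spherical Gaussian perturbation.

    log_pi may be unnormalized; returns the chain and the acceptance rate.
    """
    x, lp = list(x0), log_pi(x0)
    chain, accepted = [], 0
    for _ in range(n_iter):
        y = [xi + rng.gauss(0.0, sigma) for xi in x]   # y = x + eps
        lp_y = log_pi(y)
        # Symmetric proposal: r = min(1, pi(y)/pi(x))
        if rng.random() < math.exp(min(0.0, lp_y - lp)):
            x, lp = y, lp_y
            accepted += 1
        chain.append(list(x))
    return chain, accepted / n_iter

# Illustrative target: standard bivariate normal (unnormalized log-density).
log_pi = lambda x: -0.5 * (x[0] ** 2 + x[1] ** 2)
random.seed(0)
chain, acc = rw_metropolis(log_pi, [5.0, 5.0], 20000, sigma=2.4)
```

Varying sigma and watching the acceptance rate and autocorrelations reproduces the trade-off discussed in Section 2.4.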


4.2. Metropolized independence sampler

In some problems, we may have ways to approximate the target distribution π by a trial distribution g that we can simulate from. In these cases, we may choose q(x, y) = g(y), which defines a proposal that is independent of x. An MH algorithm with such an independent proposal is called a Metropolized independence sampler:

Given x(t),

1. Draw y ∼ g(y) and compute

   r(x(t), y) = min[ 1, (π(y)/π(x(t))) · (g(x(t))/g(y)) ] = min[ 1, w(y)/w(x(t)) ],

   where w(x) = π(x)/g(x) is the importance weight;

2. Draw u ∼ Unif(0, 1) and update

   x(t+1) = y, if u ≤ r(x(t), y);  x(t+1) = x(t), otherwise.

Some remarks:

(a) This method is closely related to importance sampling: it uses the ratio of importance weights w(y)/w(x(t)) to calculate the MH ratio. As with importance sampling, the efficiency of this MH algorithm depends on how close g(y) is to π(y). One way to measure the closeness is the variance of the importance weights, Varg[w(x)] := Vw. A small Vw suggests that g is close to π and usually leads to a higher acceptance rate. If Varg(w(x)) = 0, then g = π and r(x, y) = 1 for all x, y. Therefore, for a Metropolized independence sampler, the higher the acceptance rate, the more efficient the algorithm.

(b) To get robust performance and reduce the variance Vw, the trial distribution g should have a heavier tail than π. For example, if π is a normal distribution, then g could be a t-distribution.

[Figure: a target density π(x) and a heavier-tailed trial density g(x).]


Example 4 (Gamma distribution). Design a Metropolized independence sampler to draw from Gamma(α, β), α > 1, β > 0,

π(x) = (β^α / Γ(α)) x^{α−1} e^{−βx}, x > 0,

using Exp(λ) as the trial distribution. Let

w(x) = x^{α−1} e^{−βx} / (λ e^{−λx}).

Choose λ to minimize Varg(w(x)).

Since Eg(w(x)) = ∫ x^{α−1} e^{−βx} dx = Γ(α)/β^α is a constant independent of λ, it is equivalent to minimize Eg[w(x)²] = ∫ w(x)² g(x) dx. Some calculation shows that

Eg[w(x)²] = (1/λ) ∫₀^∞ x^{2α−2} e^{−(2β−λ)x} dx.

Note that Eg[w(x)²] < ∞ if and only if 2β − λ > 0, so we must choose

λ < 2β. (8)

Under this condition, the integrand is an unnormalized Gamma(2α − 1, 2β − λ) density, and thus

Eg[w(x)²] = (1/λ) · Γ(2α − 1) / (2β − λ)^{2α−1}.

Therefore, to minimize Eg[w(x)²] we just need to maximize

f(λ) = λ(2β − λ)^{2α−1}

over λ. Since the objective f(λ) > 0, we can equivalently solve

max_λ [ log f(λ) = log λ + (2α − 1) log(2β − λ) ],

whose only maximizer, obtained by setting the derivative to zero, is

λ∗ = β/α.

Since α > 1, we have λ∗ < β, satisfying the constraint (8). This also shows that the tail of g is heavier than that of π (Remark (b)):

lim_{x→∞} π(x)/g(x) = C lim_{x→∞} x^{α−1} / e^{(β−λ∗)x} = 0,

where C > 0 is a constant.

In fact, with λ∗ = β/α, g and π have the same mean (1/λ∗ = α/β). That is, this optimal choice matches the expectations of the two distributions.
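The resulting Metropolized independence sampler for Gamma(α, β) with the optimal trial Exp(λ*) can be sketched in Python as follows (the parameter values α = 3, β = 2 are illustrative):

```python
import math
import random

def mis_gamma(alpha, beta, n_iter, rng=random):
    """Metropolized independence sampler for Gamma(alpha, beta)
    with the Exp(lambda*) trial, lambda* = beta/alpha."""
    lam = beta / alpha
    def log_w(x):  # log importance weight, up to an additive constant
        return (alpha - 1.0) * math.log(x) - beta * x + lam * x
    x, chain = alpha / beta, []   # start at the common mean
    for _ in range(n_iter):
        y = rng.expovariate(lam)  # y ~ g, independent of the current x
        # r = min(1, w(y)/w(x))
        if rng.random() < math.exp(min(0.0, log_w(y) - log_w(x))):
            x = y
        chain.append(x)
    return chain

random.seed(0)
chain = mis_gamma(alpha=3.0, beta=2.0, n_iter=50000)  # E(X) = alpha/beta = 1.5
```

Because the trial and target share the same mean and g has the heavier tail, the weights stay well behaved and the sample mean settles near α/β.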


4.3. Single-coordinate updating

This design is for multivariate distributions. For

x = (x1, · · · , xi−1, xi, xi+1, · · · , xd) ∈ Rd,

define

xi(y) := (x1, · · · , xi−1, y, xi+1, · · · , xd): x with y replacing xi;
x[−i] := (x1, · · · , xi−1, xi+1, · · · , xd): x with xi omitted.

Our target distribution is π(x).

To do a single-coordinate update in the MH algorithm, the proposal q(x, y) has two steps:

(a) Select a coordinate i, either cycling through 1 to d deterministically or choosing i at random from {1, . . . , d}.
(b) Given i, draw y ∼ qi(xi, y), which proposes a scalar y from some univariate distribution [y | xi], e.g. y ∼ N(xi, 1). Then put y = xi(y). That is, the proposal only changes the ith coordinate of x.

The MH ratio is determined by

π(y)/π(x) · q(y, x)/q(x, y) = π(xi(y))/π(x) · qi(y, xi)/qi(xi, y). (9)

Let π(· | x[−i]) be the conditional density of [xi | x[−i]]. Then we have

π(x) = π(xi | x[−i]) · π(x[−i]),
π(xi(y)) = π(y | x[−i]) · π(x[−i]),

and consequently, the ratio in (9) simplifies to

π(y)/π(x) · q(y, x)/q(x, y) = π(y | x[−i])/π(xi | x[−i]) · qi(y, xi)/qi(xi, y). (10)

This is the same as an MH algorithm with the conditional distribution π(· | x[−i]) as the target and qi(xi, y) as the proposal.

An important special case is to choose qi(xi, y) = π(y | x[−i]), i.e., we propose y by sampling from the conditional distribution y ∼ π(· | x[−i]). Accordingly, qi(y, xi) = π(xi | x[−i]). Then by (10) the MH ratio is

r(x, y) = min[ 1, (π(y | x[−i]) / π(xi | x[−i])) · (π(xi | x[−i]) / π(y | x[−i])) ] ≡ 1,

so y = xi(y) is always accepted. In other words, we just iteratively sample from the conditional distribution π(· | x[−i]) for a chosen coordinate i ∈ {1, . . . , d}. This is the Gibbs sampler.


5. The Gibbs Sampler

The target distribution is π(x) = π(x1, x2, · · · , xd), x ∈ Rd. Following the notation in Section 4.3, define

x(t) = (x1(t), x2(t), · · · , xd(t)),
xi(t)(y) = (x1(t), · · · , xi−1(t), y, xi+1(t), · · · , xd(t)),
x[−i](t) = (x1(t), · · · , xi−1(t), xi+1(t), · · · , xd(t)).

5.1. Algorithms

The Gibbs sampler iteratively samples from the conditional distribution π(· | x[−i]) for a chosen coordinate i ∈ {1, . . . , d}. There are two ways to pick a coordinate, corresponding to the random-scan and the systematic-scan Gibbs samplers.

Algorithm 3 (Random-scan Gibbs sampler). Pick an initial value x(1).

For t = 1, . . . , n:

1. Randomly select a coordinate i from {1, 2, · · · , d};
2. Draw y from the conditional distribution π(xi | x[−i](t)). Let x(t+1) = xi(t)(y) (i.e. xi(t+1) = y and x[−i](t+1) = x[−i](t)).

Algorithm 4 (Systematic-scan Gibbs sampler). Pick an initial value x(1).

For t = 1, . . . , n: given the current sample x(t) = (x1(t), · · · , xd(t)), for i = 1, 2, · · · , d, draw

xi(t+1) ∼ π(xi | x1(t+1), · · · , xi−1(t+1), xi+1(t), · · · , xd(t)).

By default, we use the systematic scan (Algorithm 4) unless noted otherwise. Given samples {x(t) : t = 1, . . . , n} generated by the Gibbs sampler, we estimate Eπh(x), the expectation of h(x) with respect to π, by the sample average

h̄ = (1/n) ∑_{t=1}^{n} h(x(t)). (11)

Similar to the MH algorithm, we often discard samples generated during the burn-in period, say the first 1000 iterations, and calculate h̄ from the post burn-in samples.

To design a Gibbs sampler for a joint distribution π(x), the key is to derive the conditional distributions [xi | x[−i]] for all i. We demonstrate how to find such conditional distributions in a few examples.


Example 5. Design a Gibbs sampler to simulate from a bivariate normal distribution:

X = (X1, X2) ∼ N2( (0, 0), [ 1, ρ; ρ, 1 ] ),

i.e. the pdf of the target distribution is

π(x1, x2) = 1/(2π√(1 − ρ²)) · exp{ −(x1² − 2ρx1x2 + x2²) / (2(1 − ρ²)) }.

Use the samples to estimate E(X1X2) and the correlation coefficient cor(X1, X2).

Find the conditional distribution [x1 | x2] as follows. Regarding x2 as a constant,

π(x1 | x2) ∝ π(x1, x2) ∝ exp[ −(x1² − 2ρx2x1) / (2(1 − ρ²)) ], (12)

where any multiplicative factor that depends only on x2 is regarded as a constant and absorbed into the proportionality sign. Now complete the square:

x1² − 2ρx2x1 = (x1 − ρx2)² − (ρx2)²,

and plug it into (12):

π(x1 | x2) ∝ exp[ −(x1 − ρx2)² / (2(1 − ρ²)) ],

which is an unnormalized density for N(ρx2, 1 − ρ²). Thus,

x1 | x2 ∼ N(ρx2, 1 − ρ²).

Similarly, x2 | x1 ∼ N(ρx1, 1 − ρ²).

Gibbs sampler (one iteration): given x(t) = (x1(t), x2(t)),

x1(t+1) | x2(t) ∼ N(ρ x2(t), 1 − ρ²), (13)
x2(t+1) | x1(t+1) ∼ N(ρ x1(t+1), 1 − ρ²). (14)

[Figure: one Gibbs iteration in the (x1, x2)-plane, moving from (x1(t), x2(t)) to (x1(t+1), x2(t)) and then to (x1(t+1), x2(t+1)).]


#R code: Gibbs sampler for Example 5 (bivariate normal)
rho=0.8;
n=6000;
X=matrix(0,n,2);
X[1,]=c(10,10);
for(t in 2:n)
{
  X[t,1]=rnorm(1,rho*X[t-1,2],sqrt(1-rho^2));
  X[t,2]=rnorm(1,rho*X[t,1],sqrt(1-rho^2));
}
#estimate E(X1X2)
B=1001; #post burn-in
h=X[,1]*X[,2];
acf(h)
h_hat=mean(h[B:n])
#estimate cor(X1,X2)
r=cor(X[B:n,1],X[B:n,2])

Using the post burn-in samples (t ≥ B), the estimates of E(X1X2) and cor(X1, X2) were:

> h_hat
[1] 0.7448093
> r
[1] 0.7851272

The samples generated in the first 100 iterations and the autocorrelation plot for h(t) = x1(t) x2(t) are shown below:


[Figure: left, scatter plot of the first 100 samples X[1:100,]; right, ACF of the series h for lags 0 to 30.]

For this Gibbs sampler, assuming we initialize the algorithm at (x1(0), x2(0)), we can use induction to work out the distribution of x(t) for any t ≥ 1:

(x1(t), x2(t)) ∼ N2( (ρ^{2t−1} x2(0), ρ^{2t} x2(0)), [ 1 − ρ^{4t−2}, ρ − ρ^{4t−1}; ρ − ρ^{4t−1}, 1 − ρ^{4t} ] ) → N2( (0, 0), [ 1, ρ; ρ, 1 ] ) as t → ∞. (15)

In particular, (15) shows that the limiting distribution is indeed π(x).

Example 6. Consider a joint distribution of a discrete and a continuous random variable:

π(x, y) ∝ (n choose x) y^{x+α−1} (1 − y)^{n−x+β−1}

for x = 0, 1, · · · , n and y ∈ [0, 1]. The two conditional distributions are derived as follows:

π(x | y) ∝ (n choose x) y^x (1 − y)^{n−x}  ⇒  x | y ∼ Bin(n, y);
π(y | x) ∝ y^{x+α−1} (1 − y)^{n−x+β−1}  ⇒  y | x ∼ Beta(x + α, n − x + β).

The pdf of the Beta(α, β) distribution (α > 0, β > 0) is

f(y | α, β) = Γ(α + β)/(Γ(α)Γ(β)) · y^{α−1} (1 − y)^{β−1}, y ∈ [0, 1].

If Y ∼ Beta(α, β), then E(Y) = α/(α + β).

If X1 ∼ Gamma(α, 1) and X2 ∼ Gamma(β, 1) are independent, then X1/(X1 + X2) ∼ Beta(α, β).
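A two-step Gibbs sampler alternating between these conditionals can be sketched in Python; drawing x | y as a sum of Bernoulli(y) variables keeps the sketch standard-library-only, and the parameter values n = 10, α = 2, β = 3 are illustrative assumptions:

```python
import random

def beta_binomial_gibbs(n, alpha, beta, n_iter, rng=random):
    """Systematic-scan Gibbs sampler for
    pi(x, y) ~ C(n, x) y^(x+alpha-1) (1-y)^(n-x+beta-1)."""
    y, xs, ys = 0.5, [], []
    for _ in range(n_iter):
        # x | y ~ Bin(n, y): sum of n Bernoulli(y) draws (stdlib-only)
        x = sum(rng.random() < y for _ in range(n))
        # y | x ~ Beta(x + alpha, n - x + beta)
        y = rng.betavariate(x + alpha, n - x + beta)
        xs.append(x)
        ys.append(y)
    return xs, ys

random.seed(0)
xs, ys = beta_binomial_gibbs(n=10, alpha=2.0, beta=3.0, n_iter=20000)
```

Summing the binomial term over x shows that the marginal of y is Beta(α, β), so the sample average of y should approach α/(α + β) and that of x should approach n · α/(α + β).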


Example 7 (Gibbs sampler for the 1-D Ising model). As defined in Example 3, the joint distribution for the 1-D Ising model with temperature T > 0 is given by

π(x) ∝ exp( (1/T) ∑_{i=1}^{d−1} xi xi+1 ), xi ∈ {1, −1}.

To develop a Gibbs sampler for this problem, we find the conditional distribution [xi | x[−i]] for each i = 1, . . . , d:

π(xi | x[−i]) ∝ π(x1, . . . , xi, . . . , xd)
             ∝ exp{ (1/T)(x1x2 + . . . + xi−1xi + xixi+1 + . . . + xd−1xd) }
             ∝ exp{ (xi/T)(xi−1 + xi+1) }, xi ∈ {1, −1}. (16)

Since xi ∈ {1, −1}, put

Zi = exp{ (1/T)(xi−1 + xi+1) } + exp{ −(1/T)(xi−1 + xi+1) }.

We have

π(xi | x[−i]) = (1/Zi) exp{ (xi/T)(xi−1 + xi+1) }

for xi ∈ {1, −1}.

For i = 1 or d, plug in x0 = xd+1 = 0.

Note that π(xi | x[−i]) = P(Xi = xi | x[−i]), xi ∈ {1, −1}, is simply a binary discrete distribution. Let θ1 = π(xi = 1 | x[−i]) and θ2 = π(xi = −1 | x[−i]), and put theta = c(θ1, θ2). To sample from [xi | x[−i]]:

x[i]=sample(c(1,-1),size=1,prob=theta);

where the vector x stores the current sample. In fact, we do not need to normalize π(xi | x[−i]) in the above code. Instead, we may set θ1 and θ2 directly by (16):

θ1 = exp{ (1/T)(xi−1 + xi+1) }, θ2 = exp{ −(1/T)(xi−1 + xi+1) },

since the sample function normalizes theta anyway.


5.2. Stationary distribution and detail balance

As a special case of the MH algorithm, the Gibbs sampler satisfies the detail balance condition, which implies that π is a stationary distribution.

It is also easy to verify the detail balance condition directly. To do this, we regard each conditional sampling step as a one-step transition of the underlying Markov chain. Let x = (x1, · · · , xd) and y = xi(y). Then the one-step transition kernel is K(x, y) = π(y | x[−i]). Our goal is to show that π(x)K(x, y) = π(y)K(y, x).

Proof.

π(x)K(x, y) = π(x) · π(y | x[−i]) = π(x) · π(y) / π(x[−i]);
π(y)K(y, x) = π(y) · π(x | x[−i]) = π(x) · π(y) / π(x[−i]).

The two right-hand sides are equal, so detail balance holds.


6. Examples of the Gibbs Sampler

6.1. The slice sampler

Suppose we want to simulate from π(x) ∝ q(x), where x ∈ Rd. The slice sampler simulates from the uniform distribution over the region under the surface of q(x) by a Gibbs sampler, based on the following result:

Lemma 2. Suppose a pdf π(x) ∝ q(x), x ∈ Rd. Denote the region under the surface of q(x) by

S = {(x, y) ∈ Rd+1 : 0 ≤ y ≤ q(x)}.

If (X, Y) ∼ Unif(S), then the marginal distribution of X is π, i.e. X ∼ π.

[Figure: the region S under the curve q(x).]

Proof. Let |S| denote the volume of S:

|S| = ∫ q(x)dx. (17)

Since (X, Y) ∼ Unif(S), their joint pdf is

f_{X,Y}(x, y) = 1/|S|, (x, y) ∈ S.

Given X = x, the range of Y is (0, q(x)). Then the marginal density at x is

p_X(x) = ∫₀^{q(x)} f_{X,Y}(x, y) dy = ∫₀^{q(x)} (1/|S|) dy = q(x)/|S| = π(x).

The last equality holds because |S| is the normalizing constant for q(x), as in (17).

The slice sampler uses a Gibbs sampler to simulate from Unif(S) by iteratingbetween [Y | X] and [X | Y ]. Then, according to Lemma 2, the marginaldistribution of X is the target distribution π(x). It is easy to see that

Y | X = x ∼ Unif(0, q(x)).


Let X(y) = {x ∈ Rd : q(x) ≥ y} be the set of x with q(x) ≥ y, i.e. a super-level set of q(x). Then, as shown in the following figure,

X | Y = y ∼ Unif(X(y)).

Consequently, one iteration of the slice sampler consists of two conditional sampling steps: given x(t),

1. Draw y(t+1) ∼ Unif[0, q(x(t))] (vertical blue dashed line);
2. Draw x(t+1) uniformly from the region X(t+1) = {x ∈ Rd : q(x) ≥ y(t+1)} (horizontal red dashed line).

Then, when t is large, (x(t), y(t)) ∼ Unif(S) and x(t) ∼ π, achieving the goal of sampling from π.

[Figure: the region S under q(x), with the vertical slice at x(t) up to q(x(t)) and the horizontal slice at level y(t+1).]

Example 8 (td-distribution). Use the slice sampler to simulate from the t-distribution with d degrees of freedom:

π(x) ∝ (1 + x²/d)^{−(d+1)/2} := q(x), x ∈ R.

Suppose the sample at iteration t is xt. The two steps to generate xt+1 are:

1. Draw yt+1 ∼ Unif[0, q(xt)], where q(xt) = (1 + xt²/d)^{−(d+1)/2}.
2. Draw xt+1 uniformly from the interval

   Xt+1 = {x ∈ R : q(x) ≥ yt+1} = [−b(yt+1), b(yt+1)],

   where b(y) = √(d(y^{−2/(d+1)} − 1)). Note that ±b(y) are the two roots of the equation q(x) = y.
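These two steps translate directly into a Python sketch of the slice sampler for the t-distribution (d = 5 is an illustrative choice; the tiny floor on y is a numerical safeguard, not part of the algorithm):

```python
import math
import random

def slice_sampler_t(d, n_iter, x0=0.0, rng=random):
    """Slice sampler for the t-distribution with d degrees of freedom."""
    q = lambda x: (1.0 + x * x / d) ** (-(d + 1) / 2.0)
    x, chain = x0, []
    for _ in range(n_iter):
        # Step 1: y | x ~ Unif(0, q(x)); tiny floor guards against y == 0
        y = max(rng.uniform(0.0, q(x)), 1e-300)
        # Step 2: x | y ~ Unif(-b(y), b(y)), b(y) = sqrt(d(y^(-2/(d+1)) - 1))
        b = math.sqrt(d * (y ** (-2.0 / (d + 1)) - 1.0))
        x = rng.uniform(-b, b)
        chain.append(x)
    return chain

random.seed(0)
chain = slice_sampler_t(d=5, n_iter=50000)
```

For d = 5 the target has mean 0 and variance d/(d − 2) = 5/3, which the post burn-in samples should reproduce.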

6.2. Blocked Gibbs sampler

Partition {1, . . . , d} into two blocks A and B: A ∪ B = {1, . . . , d} and A ∩ B = ∅. Iteratively sample from [xA | xB] and [xB | xA].


Example 9. Ising model on a graph G = (V, E). V = {1, . . . , d} is the vertex set and E ⊂ V × V is the edge set: there is an edge between two vertices i, j iff (i, j) ∈ E. Given G, define a Boltzmann distribution for (X1, . . . , Xd) at temperature T > 0:

π(x1, . . . , xd) ∝ exp( (1/T) ∑_{(i,j)∈E} xi xj ), xi ∈ {1, −1}. (18)

This joint distribution implies the following conditional independence statements among X1, . . . , Xd:

Theorem 2. Let Ni = {j : (i, j) ∈ E} be the set of neighbors of vertex i in the graph G. If k ∉ Ni and k ≠ i, then

Xi ⊥⊥ Xk | {Xj : j ∈ Ni}.

Proof. It follows from (18) that the conditional density of Xi given X[−i] is

π(xi | x[−i]) ∝ exp( (xi/T) ∑_{j∈Ni} xj ),

which depends only on xj, j ∈ Ni.

A few common examples of graphs G:

• Chain, E = {(1, 2), (2, 3), . . . , (d − 1, d)}: the 1-D Ising model (Example 3).

• Complete graph, E = {(i, j) : i < j}, i.e. there is an edge between every pair of nodes i, j. [Figure: a complete graph over four nodes X1, X2, X3, X4 (d = 4).]

• Star topology, E = {(1, i) : i = 2, . . . , d}: X1 is the hub node (vertex) and is the only neighbor of all the other nodes X2, . . . , Xd, so that

  Xi ⊥⊥ Xj | X1 for all i ≠ j ∈ {2, . . . , d}. (19)

  [Figure: a star graph with hub X1 connected to X2, . . . , X6.]


If G has a star topology, we can develop a two-block Gibbs sampler to sample from (18) by letting A = {1} and B = {2, . . . , d}. The two conditional sampling steps in one iteration of the Gibbs sampler are:

1. Sample from [xA | xB] = [x1 | x[−1]]: since

   π(x1 | x2, . . . , xd) ∝ exp[ (1/T)(x2 + . . . + xd)x1 ]

   for x1 ∈ {1, −1}, after normalization we have

   π(x1 | x2, . . . , xd) = exp[ (1/T)(x2 + . . . + xd)x1 ] / ( exp[ (1/T)(x2 + . . . + xd) ] + exp[ −(1/T)(x2 + . . . + xd) ] ), x1 ∈ {1, −1}.

2. Sample from [xB | xA] = [x[−1] | x1]: since X2, . . . , Xd are independent given X1 = x1 by (19), we have

   π(x2, . . . , xd | x1) = ∏_{j=2}^{d} π(xj | x1) ∝ ∏_{j=2}^{d} exp(x1 xj / T), xj ∈ {1, −1}.

   Draw xj from [xj | x1] for each j = 2, . . . , d independently according to

   π(xj | x1) = exp(x1 xj / T) / ( exp(x1/T) + exp(−x1/T) ), xj ∈ {1, −1}.
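The two-block scheme above can be sketched in Python as follows (the choices d = 6 and T = 1 are illustrative assumptions):

```python
import math
import random

def star_ising_gibbs(d, T, n_iter, rng=random):
    """Two-block Gibbs sampler for the Ising model on a star graph:
    block A = {x1} (the hub), block B = {x2, ..., xd} (the leaves)."""
    x = [rng.choice([-1, 1]) for _ in range(d)]
    chain = []
    for _ in range(n_iter):
        # Step 1: x1 | x2, ..., xd (normalized binary probability above)
        s = sum(x[1:])
        p_hub = math.exp(s / T) / (math.exp(s / T) + math.exp(-s / T))
        x[0] = 1 if rng.random() < p_hub else -1
        # Step 2: x2, ..., xd | x1 are conditionally independent
        p_leaf = math.exp(x[0] / T) / (math.exp(x[0] / T) + math.exp(-x[0] / T))
        for j in range(1, d):
            x[j] = 1 if rng.random() < p_leaf else -1
        chain.append(list(x))
    return chain

random.seed(0)
chain = star_ising_gibbs(d=6, T=1.0, n_iter=20000)
```

By symmetry each E(Xi) = 0, and for a leaf j the conditional in step 2 gives E(X1Xj) = tanh(1/T), which provides a quick sanity check on the output.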
