MTH 707A: Markov chain Monte Carlo
Instructor: Dootika Vats
Spring 2020
Contents
1 Course overview and motivation
2 Markov chains basics
3 Stochastic Stability
4 Constructing MCMC algorithms
4.1 Metropolis-Hastings
4.1.1 Types of proposal distributions
4.2 General Accept-Reject MCMC
4.3 Combining Kernels
4.4 Component-wise Updates
4.4.1 Gibbs sampler
4.4.2 Metropolis-within-Gibbs
4.5 Linchpin variable sampler
5 Convergence
5.1 F-irreducibility and aperiodicity
5.2 Harris recurrence
6 Minorization and coupling
6.1 Minorization
6.2 The Split Chain
6.3 Coupling
7 Rate of Convergence and CLT
7.1 Uniform Ergodicity
7.2 Random Scan Gibbs Sampler
7.3 Geometric ergodicity
7.3.1 Drift and minorization conditions
8 Estimating the asymptotic variance
8.1 Spectral variance estimators
8.2 Batch means estimator
9 Terminating MCMC simulation
9.1 Effective sample size
10 Miscellaneous topics
10.1 Multiple chains
10.2 Thinning
10.3 Multivariate analysis
11 Errata
1 Course overview and motivation
This is a graduate level (PhD level) course on Markov chain Monte Carlo. Unlike
other MCMC courses, this will focus specifically on the theoretical analysis of MCMC
samplers; their rates of convergence and theoretical tools for their analysis.
MCMC is a computational simulation technique for drawing samples from complicated
distributions. Throughout this course, I will denote f as a target density and
F as the associated target distribution (measure). We will only discuss
probability measures, so I will refer to them as distributions. Formally, let F be a
probability measure defined on a measurable space (X, B(X)).
Given a complicated integration problem, MCMC is a potential solution.
Example 1 (Estimating integrals). Suppose we want to integrate
µ = ∫_0^∞ 1/((1 + x)^2.3 (log(x + 3))^2) dx .
We can rewrite this as
µ = ∫_0^∞ 1/((1 + x)^2.3 (log(x + 3))^2) dx
  = ∫_0^∞ [ e^x/((1 + x)^2.3 (log(x + 3))^2) ] e^{−x} dx
  = E_F[ e^X/((1 + X)^2.3 (log(X + 3))^2) ] ,
where F is the Exp(1) distribution.
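As a quick illustration (a Python sketch, not part of the notes), the last expectation can be estimated by averaging the integrand over iid Exp(1) draws:

```python
import math
import random

# Sketch: estimate mu = E_F[ e^X / ((1+X)^2.3 (log(X+3))^2) ] with F = Exp(1)
# by a plain Monte Carlo average over iid Exp(1) draws.
random.seed(1)

def h(x):
    return math.exp(x) / ((1 + x) ** 2.3 * math.log(x + 3) ** 2)

n = 100_000
mu_hat = sum(h(random.expovariate(1.0)) for _ in range(n)) / n
```

The weight e^X grows with X, so this estimator has heavy tails; the course will return to the question of quantifying such Monte Carlo error.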
Example 2 (Expectations). Suppose for a function h : X → R, we are interested in
E_F[h(X)] = ∫_X h(x) F(dx) .
Example 3 (Quantiles). Let V = h(X) and 0 < q < 1, where X ∼ F. Define the q-quantile of V as
ξ_q = F_V^{−1}(q) = inf{ν : F_V(ν) ≥ q} .
Example 4 (Densities). Let V = h(X) and X ∼ F. How do we estimate the density of V?
In order to solve all these problems, we would want to obtain a sample from the
distribution F . That is, we want
X1, X2, . . . , Xn ∼ F
This is a non-trivial problem.
• How do we even draw samples from known distributions? MTH511 done
– Inversion method (Exponential example)
– Accept-reject sampling
– Ratio of uniforms
• How do we draw samples from nice unknown distributions? MTH511 done
– Accept-reject sampling
– Ratio of uniforms
• How do we draw samples from bad unknown distributions? What are bad distributions?
– Suppose X = R^1000. Our distribution is defined on a high-dimensional space.
– F is such that f(x) is complicated
Example 5 (Bayesian one-way random effects model). Suppose we have for i =
1, . . . , k and j = 1, . . . ,mi
Y_ij | θ, λ_e ∼ N(θ_i, λ_e^{−1})
θ_i | µ, λ_θ ∼ N(µ, λ_θ^{−1})
µ ∼ N(m_0, s_0^{−1}) , λ_e ∼ Gamma(a_1, b_1) , λ_θ ∼ Gamma(a_2, b_2) .
We are interested in the posterior distribution
q(Θ) = q(θ, µ, λe, λθ | y) ∝ f(y|θ, λe)f(θ|µ, λθ)f(µ)f(λθ)f(λe) .
We are interested in the Bayes estimator E[Θ|y].
So our goal will be: given any proper distribution F , can we draw an (approximate)
sample
X1, . . . , Xn ≈ F
so that we can estimate Θ with a sample estimate Θ̂, along with a measure of
uncertainty for the error Θ̂ − Θ.
How does sampling work? Suppose we want to draw samples from a standard normal
distribution N(0, 1). If you have taken a simulation class, you know how to do this.
Samples will look like Make plot.
From these samples we know how to use sample statistics to estimate almost any
population feature. Strong law, CLT, mean consistency, estimating variance etc.
Now suppose we have no idea how to draw samples from N(0, 1), but given one point
in R, we can decide how to obtain the next point so that eventually my samples behave like a N(0, 1) sample. Make plot. Since these are not iid samples, I have many questions:
• If my first point is somehow N(0, 1), are all subsequent points N(0, 1)?
• If my first point is not N(0, 1), then what happens to all the later points? Are they eventually N(0, 1)?
• Does a strong law hold?
• Does a CLT hold?
• How do we estimate variance in the CLT?
• How many samples do we need?
By the end of the class, you should be able to answer all of these questions.
2 Markov chains basics
Note: Throughout this document, we will mainly consider densities with respect
to the Lebesgue measure. For this reason, we will often ignore the measure theoretic
representation of densities. That is, for a distribution F, we will write F(dx) = f(x) dx
as opposed to F(dx) = f(x)µ(dx). This is purely for notational convenience, and to
avoid confusion for a non-measure theoretic audience.
- - -
Let (X, B(X)) be a measurable space. We will usually work with subsets of R^d and the Borel σ-algebras. We will also almost exclusively be discussing discrete-time continuous (general state) space Markov chains.
An X-valued sequence of random variables Φ = {X_0, X_1, X_2, . . . } is a time-homogeneous Markov chain if for all A ∈ B(X) and for all n
Pr(Xn+1 ∈ A | Xn, . . . , X1, X0) = Pr(Xn+1 ∈ A | Xn) .
There are two properties embedded within this:
• The Markov property: the distribution of X_{n+1} given X_n, X_{n−1}, . . . , X_0 is the same as the distribution of X_{n+1} given X_n.
• Stationary transitions: the distribution of X_{n+1} given X_n is the same for all n.
The key operating object for a discrete-time general state space Markov chain is its
Markov transition kernel.
Definition 1 (Markov transition kernel). A Markov transition kernel is a map P :
X × B(X )→ [0, 1] such that
1. for all A ∈ B(X ), P (., A) is a measurable function on X .
2. for all x ∈ X , P (x, .) is a probability measure on B(X ).
Informally, this is just like a conditional probability. For x ∈ X and A ∈ B(X )
P (x,A) = Pr(X1 ∈ A | X0 = x) .
Definition 2 (Markov transition density). Denote k : X × X → [0,∞) as the Markov
transition density, defined through
P(x, dy) = k(x, y) dy .
Remark 1. When X is discrete, B(X) is the set of all subsets of X and P is a matrix of transition probabilities having elements P(x, y) s.t. P(x, y) ≥ 0 and Σ_y P(x, y) = 1.
We will almost never discuss this case.
Example 6. Let X = (0, 1). Draw U ∼ U(0, 1). If U ≤ 1/2, then X_{n+1} ∼ U(0, X_n), and if U > 1/2, then X_{n+1} ∼ U(X_n, 1).
Then for X_n = x,
P(x, A) = ∫_A [ (1/2)(1/x) I(y ∈ (0, x)) + (1/2)(1/(1 − x)) I(y ∈ (x, 1)) ] dy .
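This kernel is easy to simulate (an illustrative Python sketch, not from the notes): by construction, each move lands below the current point with probability exactly 1/2.

```python
import random

# Simulate the kernel of Example 6: from x, with probability 1/2 draw
# uniformly on (0, x), otherwise draw uniformly on (x, 1).
random.seed(2)

def step(x):
    if random.random() <= 0.5:
        return random.uniform(0.0, x)
    return random.uniform(x, 1.0)

x = 0.5
path = [x]
for _ in range(10_000):
    x = step(x)
    path.append(x)

# fraction of moves that landed below the current point (should be near 1/2)
frac_down = sum(b < a for a, b in zip(path, path[1:])) / (len(path) - 1)
```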
Let F be a probability measure (distribution) on B(X). Define
FP(A) = ∫_X F(dx) P(x, A) .
We will use the shorthand notation FP to denote FP(A) for any generic A.
If X_n ∼ F and X_{n+1} | (X_n = x) ∼ P(x, ·), then FP is the marginal distribution of X_{n+1}.
Similarly, if X_1 | (X_0 = x) ∼ P(x, ·), then we can define the marginal distribution of X_2 | X_0 as
P^2(x, A) = Pr(X_2 ∈ A | X_0 = x) = ∫_X P(x, dy) P(y, A) .
In this way, we get the n-step Markov transition kernel P^n:
P^n(x, A) = Pr(X_n ∈ A | X_0 = x) = ∫_X P(x, dx_1) P(x_1, dx_2) . . . P(x_{n−1}, A) . (1)
(Chapman–Kolmogorov equation) In general, for 0 ≤ m ≤ n ∈ N, P^n = P^{n−m} P^m,
and thus,
P^n(x, A) = ∫_X P^m(x, dy) P^{n−m}(y, A) .
Given a Markov chain transition kernel P , and a starting value X0 = x, we generate
the Markov chain Φ = {X0, X1, X2, . . . }
Definition 3. Let F be the initial distribution of X_0. The distribution F is the invariant
or stationary distribution if FP = F.
Intuition: If we start from F , and use P to get another sample, then the next sample
is also from F .
Theorem 1. If F is the invariant distribution of P , then for all n ≥ 1, FP n = F .
Proof. F is invariant for P means that FP = F. So,
FP^n(A) = ∫_X F(dx) P^n(x, A) .
Using the Chapman–Kolmogorov equation for m = 1,
FP^n(A) = ∫_X F(dx) ∫_X P(x, dy) P^{n−1}(y, A)
= ∫_X ( ∫_X F(dx) P(x, dy) ) P^{n−1}(y, A)
= ∫_X F(dy) P^{n−1}(y, A)
= . . .
= ∫_X F(dy) P(y, A)
= F(A) .
Intuition: If we start from F , and use P repeatedly, we will keep getting samples from
F . This is exactly what we want, so that is great! Now there are still two concerns:
we can’t really start from F in most problems, and how do we construct P?
Definition 4. The kernel P is F -symmetric if
F (dx)P (x, dy) = F (dy)P (y, dx) .
This equation is also often called detailed balance, and the phenomenon is also called
F -reversibility.
Theorem 2. F -symmetry implies F is invariant for P .
Proof. Need to show that FP = F .
FP(A) = ∫_X F(dx) P(x, A)
= ∫_X ∫_A F(dx) P(x, dy)
= ∫_A ∫_X F(dy) P(y, dx)   (by F-symmetry)
= ∫_A F(dy) ∫_X P(y, dx)
= ∫_A F(dy)
= F(A) ,
since P(y, ·) is a probability measure.
Intuition: Recall that our goal is to construct an F -invariant transition P . So, if we
construct P such that it is F -symmetric (reversible), then we are good to go.
3 Stochastic Stability
Unfortunately, having a stationary distribution is not sufficient to get representative
samples. In other words, we also want the Markov chain to “explore the space X”. We will require some additional structure on Φ.
Example 7. For intuition, a finite state space works better. Let
P =
1/2 1/2 0
1/2 1/2 0
0 0 1
For this transition, no matter which invariant distribution F we pick, we won't get a
representative sample: a chain started in state 3 stays there forever, while a chain
started in states 1 or 2 never reaches state 3.
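In this finite case, the failure to communicate can be checked directly by powering up the transition matrix (an illustrative Python sketch, not from the notes):

```python
# Reducible 3-state chain from Example 7: states {1, 2} and {3} never
# communicate, and this persists in every power of P.
P = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pn = P
for _ in range(99):   # compute the 100-step kernel P^100
    Pn = matmul(Pn, P)

# P^100(1, {3}) is still exactly 0: starting from state 1, the chain
# never reaches state 3, so P is reducible.
```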
Let φ be any non-trivial positive measure on B(X ).
Definition 5. A set A ∈ B(X) is F-communicating if ∀ B ⊂ A with F(B) > 0 and B ∈ B(X), and ∀ x ∈ A, ∃ n such that
P^n(x, B) > 0 .
Intuitively, the chain will come back to anywhere in A eventually. This is a weak
property, since it does not ask the chain to move out of A.
Figure 1: Pictorial depiction of F -communicating. Here we can go from x to B in 3steps.
Definition 6. P is F-irreducible if ∀ x ∈ X and A ∈ B(X) such that F(A) > 0, ∃ n such that
P^n(x, A) > 0 .
Otherwise P is reducible.
Figure 2: Pictorial depiction of F -irreducible. Here we can go from x to A in 6 steps.
Example 8 (Broken support). Consider the following density:
f(x) = (1/4) I(0 < x < 2) + (1/4) I(3 < x < 5) ,
and let F be the associated distribution.
Figure 3: Given x, we cannot jump support in this case.
Now consider the Markov chain that, given X_n, draws uniformly from (X_n − 1, X_n + 1). That is,
P(x, A) = ∫_A (1/2) I(x − 1 < y < x + 1) dy .
(F is not an invariant distribution of P.)
But, P is not F-irreducible, since if X_0 = x ∈ (0, 2), the set A = (3, 5) can never be
reached. The kernel P would be F-irreducible if the kernel made jumps in (x − 1.2, x + 1.2), for example.
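If this uniform step is used as a Metropolis-Hastings proposal for f (as in Section 4), proposals landing where f = 0 are rejected, and a chain started in (0, 2) indeed never visits (3, 5). A Python sketch under that assumption (not the notes' code):

```python
import random

# MH chain with uniform(x-1, x+1) proposal for the broken-support target of
# Example 8: moves into regions where f = 0 are rejected, so a chain started
# at x = 1 stays trapped in (0, 2).
random.seed(3)

def f(x):
    # unnormalised broken-support target (constants cancel in the MH ratio)
    return 1.0 if (0 < x < 2) or (3 < x < 5) else 0.0

x = 1.0
samples = []
for _ in range(50_000):
    y = x + random.uniform(-1.0, 1.0)
    if f(y) > 0 and random.random() < min(1.0, f(y) / f(x)):
        x = y
    samples.append(x)
```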
Remark 2. Positivity condition: a trivial way of getting F-irreducibility is to have
the positivity condition, that is, construct P so that P(x, A) > 0 for all x and all A
with F(A) > 0! We will use this often.
Example 9 (Limited jumps). Consider the density
f(x) = (1/√(2π)) exp{ −x^2/2 } .
Figure 4: Given x, we can explore the whole support.
Let F be the associated distribution. Consider the MTK
P(x, A) = ∫_A (1/2) I(x − 1 < y < x + 1) dy .
(Again, here F is not stationary for P.) It is clear that P is F-irreducible, since it is
possible to go from anywhere to anywhere in some finite number of steps. But showing
this mathematically is hard.
Theorem 3 (Geyer (1998), Theorem 4.1). Suppose
(a) the state space X is a connected separable metric space;
(b) every non-empty open set A ∈ B(X) has F(A) > 0;
(c) every point has an F-communicating neighborhood;
then the chain is F-irreducible.
Proof. The proof is complicated, but the intuition can be seen from Example 9. First,
a connected separable metric space is second countable: there exists a countable
collection of open sets U such that every open subset of X is a union of sets from U.
Condition (b), for our purpose, means that the distribution F has full support X, and
(c) means that given a point x, there is a neighborhood A of x that is F-communicating.
These three conditions eventually allow movement from anywhere to anywhere in
finitely many steps.
Definition 7. For d ≥ 2, consider disjoint sets A_1, . . . , A_d such that, with N = (A_1 ∪ · · · ∪ A_d)^c, F(N) = 0. Further, let the sets satisfy F(A_i) > 0 and P(x, A_{i+1}) = 1 for all x ∈ A_i, 1 ≤ i ≤ d − 1, and P(x, A_1) = 1 for all x ∈ A_d. If such sets exist, then Φ is periodic.
If no such sets exist, Φ is aperiodic.
Figure 5: The Markov chain jumps from one set to the other deterministically. Theabove is a periodic Markov chain.
4 Constructing MCMC algorithms
Before we start studying more theoretical properties of Markov chains, let's get
introduced to some important MCMC algorithms. This will help build intuition and
give some relevance to the kinds of questions we want to ask and answer.
4.1 Metropolis-Hastings
Perhaps the most common MCMC algorithm is the Metropolis-Hastings (MH) algo-
rithm.
Let Q be a Markov kernel (conditional distribution) Q(x, ·) with density q(x, y). (This is effectively q(y | x).)
Let X_n = x. Then:
1. Draw Y ∼ Q(x, ·) and, independently, U ∼ Unif(0, 1).
2. If
U < min{ 1, f(y) q(y, x) / ( f(x) q(x, y) ) } ,
then set X_{n+1} = y.
3. Else set X_{n+1} = x.
Here r(x, y) is the Hastings ratio, where
r(x, y) = f(y) q(y, x) / ( f(x) q(x, y) ) ,
and α(x, y) = min{1, r(x, y)} is called the acceptance probability.
Remark 3. The Metropolis-Hastings algorithm is characterized by its acceptance
probability α(x, y). There are other choices of acceptance probability that yield other
algorithms. We will discuss these later.
Note, for a given x, δ_x(A) = 1 if and only if x ∈ A. This is the Dirac measure.
Figure 6: Metropolis-Hastings algorithm: some intuition. Here, the proposed point (y) is in a lower probability region, so there is a good chance of rejecting it.
Theorem 4. The MH algorithm defines a valid Markov kernel:
P(x, dy) = Q(x, dy) α(x, y) + δ_x(dy) ∫_X [1 − α(x, u)] Q(x, du) .
Proof.
Pr(X_{n+1} ∈ dy | X_n = x)
= Pr(X_{n+1} ∈ dy, U ≤ α(x, Y) | X_n = x)   [term I]
+ Pr(X_{n+1} ∈ dy, U > α(x, Y) | X_n = x)   [term II] .
For the first term, the chain moves to the proposal Y:
I = Pr(X_{n+1} ∈ dy, U ≤ α(x, Y) | X_n = x)
= E[ I(Y ∈ dy) I(U ≤ α(x, Y)) | X_n = x ]
= E[ I(Y ∈ dy) E[ I(U ≤ α(x, Y)) | Y ] | X_n = x ]
= E[ I(Y ∈ dy) α(x, Y) | X_n = x ]
= α(x, y) Q(x, dy) .
For the second term, the chain stays at x:
II = Pr(X_{n+1} ∈ dy, U > α(x, Y) | X_n = x)
= E[ E[ I(x ∈ dy) I(U > α(x, Y)) | Y ] | X_n = x ]
= E[ I(x ∈ dy)(1 − α(x, Y)) | X_n = x ]
= δ_x(dy) ∫_X (1 − α(x, u)) Q(x, du) .
Now that we know the form of the kernel, we would first like to see whether it is
F -invariant. We know one tool: F -symmetric.
Theorem 5. The Metropolis Hastings Kernel is F -symmetric.
Proof. To show F-symmetry, we need to show that
F(dx) P(x, dy) = F(dy) P(y, dx) .
This is trivially true if x = y. So it suffices to consider the case when x ≠ y, which is the case when the MH algorithm moves.
F(dx) P(x, dy) = f(x) q(x, y) α(x, y) dx dy
= f(x) q(x, y) min{ 1, f(y) q(y, x) / ( f(x) q(x, y) ) } dx dy
= min{ f(x) q(x, y), f(y) q(y, x) } dx dy
= f(y) q(y, x) min{ 1, f(x) q(x, y) / ( f(y) q(y, x) ) } dx dy
= F(dy) P(y, dx) .
Theorem 6. If q(x, y) > 0 for all x, y ∈ X, then P is F-irreducible.
Proof. Homework.
If q(x, y) > 0 for all x, y, then P(x, A) > 0 for every A with F(A) > 0; thus P is
F-irreducible. In cases when this is not true, Geyer's theorem can often be used to
establish irreducibility.
Example 10 (χ2 target). Consider drawing from a χ2-distribution. Of course, this is
not a complicated distribution, we already know how to draw from it. But for the sake
of demonstration:
f(x) = 1/( 2^{k/2} Γ(k/2) ) · x^{k/2−1} e^{−x/2} I(x > 0) .
To implement a MH algorithm, we need to choose a proposal distribution. Let Q(x, ·) = N(x, h), where h is the proposal variance. The density is
q(x, y) = (1/√(2πh)) exp{ −(y − x)^2 / (2h) } = (1/√(2πh)) exp{ −(x − y)^2 / (2h) } = q(y, x) .
Here, the acceptance probability is
α(x, y) = min{ 1, [f(y)/f(x)] · [q(y, x)/q(x, y)] } = min{ 1, ( y^{k/2−1} e^{−y/2} ) / ( x^{k/2−1} e^{−x/2} ) } .
We now have all the tools to implement the MH algorithm. Note that we don't have
to worry about f(x) = 0 in this case as long as we start from within the support of the
target distribution. This is because x will only be a new accepted value if f(x) > 0.
Run R code here.
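The notes run R code here; for illustration, a minimal Python sketch of the same sampler (k and h below are illustrative tuning choices, not from the notes):

```python
import math
import random

# Random-walk MH for a chi-squared(k) target with N(x, h) proposal.
random.seed(4)
k = 4      # degrees of freedom (illustrative choice); E[chi^2_k] = k
h = 1.0    # proposal variance (tuning assumption)

def log_f(x):
    # unnormalised log target: (k/2 - 1) log x - x/2 on x > 0
    return (k / 2 - 1) * math.log(x) - x / 2 if x > 0 else float("-inf")

x = 3.0    # start inside the support
chain = []
for _ in range(100_000):
    y = random.gauss(x, math.sqrt(h))
    # symmetric proposal, so the acceptance ratio is f(y)/f(x)
    if random.random() < math.exp(min(0.0, log_f(y) - log_f(x))):
        x = y
    chain.append(x)

mean_hat = sum(chain) / len(chain)
```

Proposals with y ≤ 0 have f(y) = 0 and are always rejected, which is why starting inside the support is enough.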
4.1.1 Types of proposal distributions
Independence MH
Here Q(x, ·) = Q(·). That is, the proposal distribution does not depend on the current value. Then q(x, y) = q(y), so the MH ratio is
r(x, y) = [f(y)/f(x)] · [q(x)/q(y)] = w(y)/w(x) ,
where w(y) = f(y)/q(y) is a weight function. Later, we will see that bounded weight
functions are important here.
Symmetric MH
Here q(x, y) = q(y, x). Then, just like in the example,
r(x, y) = f(y)/f(x) .
The symmetric proposal is the most common choice, since it makes evaluating the
ratio easier. The two most common proposals are N(x, h) and U(x − h, x + h).
Random Walk MH
Suppose Z ∼ G(·), where G does not depend on the current step X_n. Set Y = Z + X_n. Then q(x, y) = g(y − x) and
r(x, y) = [f(y)/f(x)] · [g(x − y)/g(y − x)] .
Often G is Normal or Uniform as well, but we can also use a G = t_d proposal
distribution.
Langevin MH (MALA)
Here, we use some information about the target distribution in our proposal. We shift
the mean of the proposal distribution towards a higher probability area under F:
Q(x, ·) = N( x + (h/2) ∇ log f(x), h ) .
Figure 7: MALA proposal moves the proposal towards a higher target mass area.
Consider
∇ log f(x) = ∇f(x) / f(x) .
Thus, when f(x) is small relative to its gradient, there will be a large displacement of
the mean, and when f(x) is large and the gradient is small (as near a local maximum),
the displacement will be small.
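A minimal MALA sketch for a standard normal target, where ∇ log f(x) = −x (illustrative Python, with h an assumed step-size):

```python
import math
import random

# MALA for the N(0, 1) target: proposal N(x + (h/2) grad log f(x), h),
# accepted with the usual MH ratio (the proposal is not symmetric).
random.seed(5)
h = 0.5    # step size (tuning assumption)

def log_f(x):
    return -0.5 * x * x          # log f up to a constant

def grad_log_f(x):
    return -x

x = 0.0
chain = []
for _ in range(100_000):
    mean_x = x + 0.5 * h * grad_log_f(x)
    y = random.gauss(mean_x, math.sqrt(h))
    mean_y = y + 0.5 * h * grad_log_f(y)
    # Gaussian proposal log-densities (constants cancel in the ratio)
    log_q_xy = -((y - mean_x) ** 2) / (2 * h)
    log_q_yx = -((x - mean_y) ** 2) / (2 * h)
    log_r = log_f(y) - log_f(x) + log_q_yx - log_q_xy
    if random.random() < math.exp(min(0.0, log_r)):
        x = y
    chain.append(x)

m = sum(chain) / len(chain)
v = sum(c * c for c in chain) / len(chain)
```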
Some unanswered questions:
• How do we choose the starting value?
• How do we choose the size of the proposal, h?
• How long do we run the Markov chain for?
4.2 General Accept-Reject MCMC
Metropolis-Hastings is only one particular type of accept-reject style MCMC algorithm.
The reason α(x, y) in MH works is because it yields F -reversibility. So the question
is, are there other such acceptance probabilities? The following is an argument from
Billera and Diaconis (2001).
Let
r(x, y) = [f(y)/f(x)] · [q(y, x)/q(x, y)] .
Note that the MTK remains the same; only the choice of α will change. For a generic
α(x, y), F-reversibility requires (for x ≠ y)
f(x) k(x, y) = f(y) k(y, x)
f(x) α(x, y) q(x, y) = f(y) α(y, x) q(y, x)
α(x, y) = α(y, x) r(x, y) .
We want 0 ≤ α(x, y) ≤ 1, and since α(y, x) is also a probability, we want α(y, x) ≤ 1 as well. But α(y, x) = α(x, y)/r(x, y) ≤ 1, so α(x, y) ≤ r(x, y). Thus,
0 ≤ α(x, y) ≤ min{1, r(x, y)} . (2)
Thus, choosing α(x, y) to satisfy (2) and setting
α(y, x) = α(x, y) / r(x, y)
yields an F-reversible Markov chain.
In addition to F-reversibility, we also want the acceptance probability to be useful,
so it must depend on f only through the ratio f(y)/f(x); otherwise unknown normalizing
constants will not cancel out. Thus it is natural to consider functions α(x, y) = g(r(x, y)).
Then g must satisfy
g(x) = x g(1/x) , 0 ≤ x ≤ 1 .
Another popular acceptance probability (especially among chemists) is Barker's
acceptance probability (Barker, 1965):
α_B(x, y) = f(y) q(y, x) / ( f(x) q(x, y) + f(y) q(y, x) ) = r(x, y) / (1 + r(x, y)) .
Barker’s acceptance probability is not used often, and we will learn later as to why.
4.3 Combining Kernels
Suppose P_1, . . . , P_d are d transition kernels such that FP_i = F for all i. These could
come from d different proposals, or d different acceptance probabilities.
The composition kernel is
P_c(x, ·) = (P_1 · · · P_d)(x, ·) .
Recall that a product of kernels denotes composition: first move with P_1, then with
P_2, and so on. So
P_1 P_2(x, A) = ∫_X P_1(x, dy) P_2(y, A) .
Let r_i > 0 with Σ_{i=1}^d r_i = 1. Then the mixing kernel is
P_m(x, ·) = r_1 P_1(x, ·) + · · · + r_d P_d(x, ·) .
The mixing and composition kernels are both F -invariant. That is, FPc = F and
FPm = F . These two kernels are very useful in constructing component-wise updates.
Figure 8: Left: Composition kernel. Right: Mixing kernel.
4.4 Component-wise Updates
Example 11 (Multivariate normal). We will first motivate with a multivariate example. Let F = N_2(0, I_2), so that the target is a bivariate normal. Consider Q(x, ·) = N_2(x, h I_2) (trivial, but just for demonstration).
In this case, the algorithm is still the same, except the proposal draw is multivariate:
Y ∼ N_2(x, h I_2) gives (y_1, y_2), and we accept or reject the full vector as a whole. Since the proposal is symmetric, the MH acceptance probability is
α(x, y) = min{ 1, f(y)/f(x) } = min{ 1, ( f_1(y_1) f_2(y_2) ) / ( f_1(x_1) f_2(x_2) ) } .
Naturally, as the dimension increases, h will then need to decrease to ensure the
same acceptance probabilities. Sometimes, it can then be difficult to tune h, so here
component-wise updates are sometimes preferred (not always though).
Let X = X_1 × · · · × X_d, where each X_i ⊆ R^{b_i}. If x = (x_1, x_2, . . . , x_d) ∈ X, set x_(i) = x \ x_i. If f(x) is the joint density associated with F, let f(x_i | x_(i)) be the full conditional density of x_i.
Let p_i(x, y_i) be an MH MTD with invariant density f(x_i | x_(i)) and proposal q((x_i, x_(i)), y_i). Then
p_i(x, y_i) = q((x_i, x_(i)), y_i) α(x_i, y_i) + δ_{x_i}(y_i) ∫ [1 − α(x_i, u_i)] q((x_i, x_(i)), u_i) du_i .
So p_i updates the ith component according to an MH step. Each of the other components is kept at its previous value, so the overall Markov kernel is defined as
P_i(x, A) = ∫_A p_i(x, y_i) δ_{x_(i)}(y_(i)) dy_i .
Theorem 7. Pi is F -invariant for all i.
Proof. By construction, p_i is reversible with respect to the density f(x_i | x_(i)), so that
p_i((x_i, x_(i)), y_i) f(x_i | x_(i)) = p_i((y_i, x_(i)), x_i) f(y_i | x_(i)) .
So,
p_i((x_i, x_(i)), y_i) f(x_i, x_(i)) = p_i((x_i, x_(i)), y_i) f(x_i | x_(i)) f(x_(i))
= p_i((y_i, x_(i)), x_i) f(y_i | x_(i)) f(x_(i))
= p_i((y_i, x_(i)), x_i) f(y_i, x_(i)) ,
which is detailed balance for P_i with respect to F; hence P_i is F-invariant.
So maybe we could use each of the Pis as our final kernel since it is F -invariant.
However, this won’t work, since Pi only updates the ith component, so it is naturally
reducible. We can then combine the d kernels:
• Random scan:
P_RS(x, A) = Σ_{i=1}^d r_i P_i(x, A) , with MTD k_RS(x, y) = Σ_{i=1}^d r_i p_i((x_i, x_(i)), y_i) δ_{x_(i)}(y_(i)) .
We can show that P_RS is F-symmetric.
• Deterministic scan:
P_DS(x, A) = (P_1 · · · P_d)(x, A) ,
with
k_DS(x, y) = p_1(x, y_1) p_2((y_1, x_2, . . . , x_d), y_2) p_3((y_1, y_2, x_3, . . . , x_d), y_3) · · · p_d((y_1, . . . , y_{d−1}, x_d), y_d) .
We can show that P_DS is not generally F-symmetric.
• Random sequence scan:
There are d! orders for the composition. Let p ≤ d! and r = (r_1, . . . , r_p) with r_i > 0 and Σ_{i=1}^p r_i = 1. Define
P_RQ(x, A) = Σ_{j=1}^p r_j P_{c,j}(x, A) , with MTD k_RQ(x, y) = Σ_{j=1}^p r_j k_{c,j}(x, y) .
We can show that P_RQ is F-symmetric.
4.4.1 Gibbs sampler
An important case of component-wise updates is the Gibbs sampler. Here the proposal
distribution for updating the ith component, q_i((x_i, x_(i)), y_i), is the full conditional
distribution itself. That is, q_i(y_i | x) = f(y_i | x_(i)). Then
α((x_i, x_(i)), y_i) = min{ 1, [f(y_i | x_(i)) / f(x_i | x_(i))] · [q((y_i, x_(i)), x_i) / q((x_i, x_(i)), y_i)] }
= min{ 1, [f(y_i | x_(i)) f(x_i | x_(i))] / [f(x_i | x_(i)) f(y_i | x_(i))] } = 1 .
This is not surprising since we are proposing from the target distribution! However, in
this case the target distribution for MH is not the real “target distribution” F , rather,
it is the full conditional distribution. So if we can sample from the full conditional
distribution directly, then the component update happens with probability 1. Then,
the MTK for the ith component is
P_i(x, A) = ∫_A f(y_i | x_(i)) δ_{x_(i)}(y_(i)) dy_i .
We will use the notation PRSGS for random scan, PDSGS for deterministic scan Gibbs
sampler.
Example 12 (Multivariate normal distribution). We have already sampled from a
multivariate normal using Metropolis-Hastings algorithm. We will now implement a
Gibbs sampler.
F = N( ( µ_1, µ_2 )^T , ( Σ_11 Σ_12 ; Σ_21 Σ_22 ) ) ,
where µ_1 ∈ R^{p_1} and µ_2 ∈ R^{p_2}. It is then known that if X = (X_1, X_2)^T, then
X_1 | X_2 = x_2 ∼ N( µ_1 + Σ_12 Σ_22^{−1}(x_2 − µ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_21 ) ,
X_2 | X_1 = x_1 ∼ N( µ_2 + Σ_21 Σ_11^{−1}(x_1 − µ_1), Σ_22 − Σ_21 Σ_11^{−1} Σ_12 ) .
Since the full conditional distributions of X1 and X2 are known, Gibbs sampler can be
easily implemented.
Figure 9: Deterministic scan Gibbs sampler for two different Normal covariance struc-tures.
DSGS: The deterministic scan Gibbs sampler can then do the following update:
1. Draw X1,n+1 ∼ X1 | X2,n
2. Draw X2,n+1 ∼ X2 | X1,n+1
The Markov transition density updating from step (x_1, x_2) to (x'_1, x'_2) is
k_DSGS((x_1, x_2), (x'_1, x'_2)) = f(x'_1 | x_2) f(x'_2 | x'_1) .
RSGS: The random scan Gibbs sampler will update as follows:
1. Pick index i with probability ri
2. Draw Xi,n+1 ∼ Xi | X−i,n
3. Set X−i,n+1 = X−i,n
The MTD updating (x_1, x_2) to (x'_1, x'_2) is
k_RSGS((x_1, x_2), (x'_1, x'_2)) = r_1 f(x'_1 | x_2) δ_{x_2}(x'_2) + r_2 f(x'_2 | x_1) δ_{x_1}(x'_1) .
Example 13 (Gibbs sampler irreducibility). Consider a joint distribution F(x, y) with
joint density
f(x, y) = (1/2) f_1(x) g_1(y) + (1/2) f_2(x) g_2(y) .
We will study a Gibbs sampler that targets this distribution. Let us first set up the
system (conditional distributions). See that
f(y) = ∫ f(x, y) dx = ∫ ( (1/2) f_1(x) g_1(y) + (1/2) f_2(x) g_2(y) ) dx = (1/2) g_1(y) + (1/2) g_2(y) .
Similarly,
f(x) = (1/2) f_1(x) + (1/2) f_2(x) .
So,
f(x | y) = ( f_1(x) g_1(y) + f_2(x) g_2(y) ) / ( g_1(y) + g_2(y) ) and f(y | x) = ( f_1(x) g_1(y) + f_2(x) g_2(y) ) / ( f_1(x) + f_2(x) ) .
A Gibbs sampler moves from a state (x, y) to state (x′, y′) by updating x and y from
the conditional distributions.
1. Update X ′ ∼ X|Y
2. Update Y ′ ∼ Y |X ′.
The Markov transition kernel for this Gibbs sampler is
P((x, y), A) = ∫_A f(x' | y) f(y' | x') dx' dy' .
Ok, now that this is set up, let us consider two different sets of target distributions.
Figure 10: Deterministic scan Gibbs sampler for the two targets. Left is irreducible,right is reducible
A: Suppose the target density is
f(x, y) = (1/2) I(0 < x < 1) I(0 < y < 1) + (1/2) I(0 < x < 1) I(2 < y < 3) .
The conditional densities satisfy
f(x | y) ∝ I(0 < x < 1) [I(0 < y < 1) + I(2 < y < 3)] ,
f(y | x) ∝ [I(0 < y < 1) + I(2 < y < 3)] I(0 < x < 1) .
So, if we are at x ∈ (0, 1) and y ∈ (0, 1), then
f(x | y) = I(0 < x < 1) , f(y | x) = (1/2) [I(0 < y < 1) + I(2 < y < 3)] .
Here, the sampler is free to jump between the two portions of the support.
B: Suppose the target density is
f(x, y) = (1/2) I(0 < x < 1) I(0 < y < 1) + (1/2) I(2 < x < 3) I(2 < y < 3) .
The conditional distributions are
f(x | y) = ( I(0 < x < 1) I(0 < y < 1) + I(2 < x < 3) I(2 < y < 3) ) / ( I(0 < y < 1) + I(2 < y < 3) ) ,
f(y | x) = ( I(0 < x < 1) I(0 < y < 1) + I(2 < x < 3) I(2 < y < 3) ) / ( I(0 < x < 1) + I(2 < x < 3) ) .
For x ∈ (0, 1) and y ∈ (0, 1), both f(x|y) = I(0 < x < 1) and f(y|x) = I(0 < y < 1).So we will not be able to jump to the other part of the support of the distribution.
Thus, this Gibbs sampler will not be irreducible.
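The reducibility of target B is easy to see in simulation (an illustrative Python sketch, not from the notes):

```python
import random

# Gibbs sampler for target B: given y in (0, 1) the conditional of x is
# uniform on (0, 1), and vice versa, so a chain started in the square
# (0,1) x (0,1) can never reach (2,3) x (2,3).
random.seed(7)

def cond(other):
    # full conditional given the other coordinate, for target B
    if 0 < other < 1:
        return random.uniform(0.0, 1.0)
    return random.uniform(2.0, 3.0)

x, y = 0.5, 0.5
visited_upper = False
for _ in range(100_000):
    x = cond(y)
    y = cond(x)
    if x > 1 or y > 1:
        visited_upper = True
```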
4.4.2 Metropolis-within-Gibbs
(Also known as conditional Metropolis-Hastings)
When at least one of the full conditionals f(x_i | x_(i)) is not available to sample from, we use a general proposal q_i as described before.
Example 14 (Bayesian reliability model). For i = 1, . . . , m, let t_i denote the observed
failure time for lamp i (where data on m lamps are collected). Suppose
T_i | λ, β ∼ Weibull(λ, β) ,
where λ > 0 is the scale parameter and β is a shape parameter. In a Bayesian paradigm,
we further assume prior distributions on this. So
λ ∼ Gamma(a0, b0) and β ∼ Gamma(a1, b1) .
The resulting posterior distribution is complicated, and its normalizing constant
is not known:
f(λ, β | T) ∝ λ^{m+a_0−1} β^{m+a_1−1} ( ∏_{i=1}^m t_i )^{β−1} exp{ −λ Σ_{i=1}^m t_i^β } exp{−b_1 β} exp{−b_0 λ} .
It can also be shown that
λ | β, T ∼ Gamma( m + a_0, b_0 + Σ_{i=1}^m t_i^β ) .
However, β | λ, T does not have a closed-form expression:
f(β | λ, T) ∝ β^{m+a_1−1} ( ∏_{i=1}^m t_i )^{β−1} exp{ −λ Σ_{i=1}^m t_i^β } exp{−b_1 β} .
In this case, we can implement the following (deterministic scan) Metropolis-within-
Gibbs sampler:
1. Draw λ_{n+1} ∼ λ | β_n, T.
2. Propose Y ∼ Q((λ_{n+1}, β_n), ·) and draw U ∼ U[0, 1].
3. If U ≤ α((λ_{n+1}, β_n), y), where
α((λ_{n+1}, β_n), y) = min{ 1, [f(y | λ_{n+1}, T) / f(β_n | λ_{n+1}, T)] · [q((y, λ_{n+1}), β_n) / q((β_n, λ_{n+1}), y)] } ,
then set β_{n+1} = Y.
4. Else set β_{n+1} = β_n.
The MTD for this is
k((λ, β), (λ', β')) = f(λ' | β, T) · p((λ', β), (λ', β')) ,
where p is the MH kernel for the β-update.
We can also flip the updates so that β is updated first then λ.
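A Python sketch of this Metropolis-within-Gibbs sampler (the data, hyperparameters, and proposal standard deviation below are illustrative assumptions, not from the notes):

```python
import math
import random

# Metropolis-within-Gibbs for the Weibull reliability model:
# a Gibbs step for lambda | beta, T and a random-walk MH step for beta.
random.seed(8)

t = [0.8, 1.2, 0.5, 2.0, 1.1, 0.9, 1.6, 0.7, 1.3, 1.0]  # illustrative failure times
m = len(t)
a0, b0, a1, b1 = 2.0, 2.0, 2.0, 2.0                      # illustrative hyperparameters
sum_log_t = sum(math.log(ti) for ti in t)

def log_cond_beta(beta, lam):
    # log f(beta | lambda, T) up to a constant, for beta > 0
    if beta <= 0:
        return float("-inf")
    return ((m + a1 - 1) * math.log(beta) + (beta - 1) * sum_log_t
            - lam * sum(ti ** beta for ti in t) - b1 * beta)

lam, beta = 1.0, 1.0
h = 0.3                 # random-walk proposal sd for beta (tuning assumption)
lams, betas = [], []
for _ in range(20_000):
    # Gibbs step: lambda | beta, T ~ Gamma(m + a0, rate = b0 + sum t_i^beta)
    rate = b0 + sum(ti ** beta for ti in t)
    lam = random.gammavariate(m + a0, 1.0 / rate)   # gammavariate uses scale
    # MH step for beta with a symmetric normal proposal
    prop = random.gauss(beta, h)
    if random.random() < math.exp(min(0.0, log_cond_beta(prop, lam)
                                           - log_cond_beta(beta, lam))):
        beta = prop
    lams.append(lam)
    betas.append(beta)
```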
4.5 Linchpin variable sampler
Suppose one of the full conditionals, f_i(x_i | x_(i)), is available to sample from, but the other(s) are not. This is similar to Metropolis-within-Gibbs; however, instead of
running a Markov chain on the joint distribution F, we can run a Markov chain for
the marginal distribution of x_(i). This situation is easiest to see with two variables.
Consider the joint target density to be
f(x1, x2) = f(x1 | x2)f(x2) .
Suppose f(x_1 | x_2) is a known and nice enough distribution, in the sense that it is possible to generate iid samples from it, but it is not possible to draw iid samples from
f(x_2). Then X_2 is called the linchpin variable.
Instead of running a component-wise algorithm, consider running a Markov chain with
f(x_2) as the target density. That is, construct a Markov transition density k_L(x_2, x'_2)
such that
F(dx'_2) = ∫_X F(dx_2) k_L(x_2, x'_2) . (3)
Having obtained samples of X_2, we can “plug them” into the conditional distribution
X_1 | X_2 to get samples of X_1, yielding joint samples. That is, the final Markov
transition density is
k((x_1, x_2), (x'_1, x'_2)) = f(x'_1 | x'_2) k_L(x_2, x'_2) .
There are multiple possible advantages of linchpin variable samplers over Metropolis-
within-Gibbs, two of which are:
1. The target distribution lives in a smaller dimension, since X_1 is no longer part
of the target distribution of the Markov chain.
2. Sometimes Markov chains are not well behaved due to the interaction between
X_1 and X_2. Since the target distribution is the marginal distribution of X_2, any
annoyance due to this interaction is avoided.
Example 15 (Bayesian reliability model). Recall the joint posterior distribution
f(λ, β | T) ∝ λ^{m+a_0−1} β^{m+a_1−1} ( ∏_{i=1}^m t_i )^{β−1} exp{ −λ Σ_{i=1}^m t_i^β } exp{−b_1 β} exp{−b_0 λ} .
We know that
λ | β, T ∼ Gamma( m + a_0, b_0 + Σ_{i=1}^m t_i^β ) .
We can write
f(λ, β | T) = f(λ | β,T) f(β | T) .
Homework: find the marginal posterior density f(β | T ) up to proportionality.
In this case, we can implement the following algorithm:
1. Propose Y ∼ Q(β_n, ·) and draw U ∼ U[0, 1].
2. If U ≤ α(β_n, y), where
α(β_n, y) = min{ 1, [f(y | T) / f(β_n | T)] · [q(y, β_n) / q(β_n, y)] } ,
then set β_{n+1} = Y.
3. Else set β_{n+1} = β_n.
4. Draw λ_{n+1} ∼ λ | β_{n+1}, T.
4. λn+1 ∼ λ | βn+1, T
The MTD for this is
k((λ, β), (λ', β')) = f(λ' | β', T) · k_L(β, β') ,
where k_L is the linchpin kernel.
5 Convergence
5.1 F -irreducibility and aperiodicity
Definition 8. Let V_1 and V_2 be probability measures. Then the total variation distance
between V_1 and V_2 is defined as
‖V_1(·) − V_2(·)‖ = sup_{A ∈ B(X)} |V_1(A) − V_2(A)| .
We want to answer the following questions:
• Does ‖P^n(x, ·) − F(·)‖ → 0 as n → ∞, for all x?
• For what n is ‖P^n(x, ·) − F(·)‖ ≤ ε? That is, when can we say we have converged?
• How fast does ‖P^n(x, ·) − F(·)‖ → 0?
Notationally, ν_1 f = ∫_X f(x) ν_1(dx), and in particular P^n f(x) = ∫_X f(y) P^n(x, dy).
Proposition 1. For measures ν_1 and ν_2, the following holds:
‖ν_1 − ν_2‖ = (1/(b − a)) sup_{f : X → [a,b]} |ν_1 f − ν_2 f| .
Proposition 2. The following properties of TV hold:
(a) ‖P^n(x, ·) − F(·)‖ is non-increasing in n if P is F-invariant.
(b) ‖ν_1 P − ν_2 P‖ ≤ ‖ν_1 − ν_2‖ .
In order to get convergence, we will naturally have to require irreducibility and aperiodicity of the Markov chain (since otherwise the Markov chain does not explore the
support, or gets stuck in a cycle).
Theorem 8. If FP = F, and P is F-irreducible and aperiodic, then for F-a.e. x,
lim_{n→∞} ‖P^n(x, ·) − F(·)‖ = 0 .
Proof. The proof uses coupling methods and will be done later. For now, assume this
to be true. Here, F-a.e. x ∈ X means that the set of starting values where convergence
does not hold has F-measure zero.
When are the algorithms we have learned so far aperiodic?
Theorem 9. Suppose Markov chain P is F -irreducible and there exists S such that
F (S) > 0 and P (x, {x}) > 0 for all x ∈ S, then P is aperiodic.
Proof. By contradiction. This will be homework.
Thus, most reasonable MH algorithms should then be aperiodic. If there is always a
positive probability of staying where we are, then certainly we have aperiodicity.
Theorem 10. An F -irreducible Gibbs sampler is aperiodic.
Proof. We prove this for RSGS; suppose by way of contradiction that RSGS is periodic with period d. Then there exist A_1, …, A_d ⊆ X such that P(x, A_{i+1}) = 1 for all x ∈ A_i, 1 ≤ i ≤ d − 1, and P(x, A_1) = 1 for all x ∈ A_d. Also, F(A_i) > 0 for each i.

Let the current state X_n ∈ A_k. RSGS chooses one component at random to update. Let I_n be this random choice and let h_{I_n}(X_n) be the part of X_n that is not updated. Then X_n and X_{n+1} are conditionally independent given h_{I_n}(X_n), i.e.

Pr(X_{n+1} ∈ A_k | X_n ∈ A_k, h_{I_n}(X_n)) = Pr(X_{n+1} ∈ A_k | h_{I_n}(X_n)).

By periodicity, Pr(X_{n+1} ∈ A_k | X_n ∈ A_k) = 0, but

Pr(X_{n+1} ∈ A_k | X_n ∈ A_k) = E[Pr(X_{n+1} ∈ A_k | X_n ∈ A_k, h_{I_n}(X_n)) | X_n ∈ A_k]
= E[Pr(X_{n+1} ∈ A_k | h_{I_n}(X_n)) | X_n ∈ A_k].

Therefore, Pr(X_{n+1} ∈ A_k | h_{I_n}(X_n)) is F-almost surely 0, which in turn implies Pr(X_{n+1} ∈ A_k) = 0. This is a contradiction, since marginally Pr(X_{n+1} ∈ A_k) should be 1/d. Thus, RSGS is aperiodic. A similar argument can be made for DSGS.
So, nearly all standard MCMC samplers will satisfy, for F-a.e. x ∈ X,

‖P^n(x, ·) − F(·)‖ → 0 as n → ∞.

But there is still this annoying null set.
Example 16. Let X = [0, 1] and let U denote the U[0, 1] measure. If x = 1/m for some m ∈ Z_+, m ≥ 2, define the kernel

P(x, ·) = x² U(·) + (1 − x²) δ_{1/(m+1)}(·).

Otherwise, P(x, ·) = U(·). That is, if x = 1/m, with probability x² the kernel draws from a uniform, and with probability 1 − x² it sets the next value to be exactly 1/(m + 1); otherwise, the next update is just a draw from the uniform.

Then for F = U[0, 1], FP = F, and P is F-irreducible and aperiodic. But if x_0 = 1/m, m ≥ 2, then

Pr(X_n = 1/(m + n) for all n | X_0 = x_0) = ∏_{j=m}^∞ (1 − 1/j²) > 0.

Thus, ‖P^n(1/m, ·) − F(·)‖ does not converge to 0. Here the set {1/m : m ≥ 2} is of measure zero under F, and so convergence still holds for F-a.e. x ∈ X.
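The behavior on the null set is easy to see by simulation. The sketch below runs the kernel from x_0 = 1/2 and estimates the probability that the chain follows the deterministic path 1/2 → 1/3 → ⋯ for a fixed number of steps; the partial products telescope to (J + 1)/(2J).

```python
import numpy as np
from math import prod

def follows_path(nsteps, rng):
    """Run the kernel of Example 16 from x0 = 1/2 and report whether the
    chain stays on the deterministic path 1/m -> 1/(m+1) for nsteps steps."""
    m = 2
    for _ in range(nsteps):
        x = 1.0 / m
        if rng.uniform() < x * x:   # w.p. x^2 the kernel draws from U[0, 1]
            return False            # the chain leaves the null set
        m += 1                      # w.p. 1 - x^2 it moves to 1/(m + 1)
    return True

rng = np.random.default_rng(0)
nsteps, reps = 30, 20_000
est = sum(follows_path(nsteps, rng) for _ in range(reps)) / reps
exact = prod(1.0 - 1.0 / (j * j) for j in range(2, 2 + nsteps))
```

About half the chains started at 1/2 are still on the deterministic path after 30 steps, so the n-step distribution keeps an atom that the uniform target does not have.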
To get rid of this annoying set, we will require the additional property of Harris recur-
rence.
5.2 Harris recurrence
Definition 9. Let A ∈ B(X) and define

τ_A = inf{n ≥ 1 : X_n ∈ A}.

τ_A is called the first return time to A. If X_n ∉ A for all n ≥ 1, then τ_A = ∞.
Definition 10. If FP = F and P is F-irreducible, then P is Harris recurrent if for all A ∈ B(X) with F(A) > 0 and all x ∈ X,

Pr(τ_A < ∞ | X_0 = x) = 1.

Suppose a chain is not Harris recurrent: there exist A with F(A) > 0, some x ∈ X and some N such that

Pr(X_n ∉ A, ∀ n ≥ N | X_0 = x) > 0.

But the Markov chain is time homogeneous. Thus, this implies that there exists y ∈ X such that

Pr(τ_A = ∞ | X_0 = y) > 0.

Theorem 12. Suppose FP = F and P is F-irreducible. The following are equivalent:

1. P is Harris recurrent.
2. ∀ x ∈ X and A ∈ B(X) with F(A) = 0, Pr(X_n ∈ A, ∀ n | X_0 = x) = 0.
3. ∀ x ∈ X and A ∈ B(X) with F(A) = 0, Pr(X_n ∈ A, ∀ n | X_0 = x) < 1.

Proof. 1 ⇒ 2 ⇔ 3 are straightforward. See the proof of 3 ⇒ 1 in Roberts and Rosenthal (2006).
Theorem 13. Every F -irreducible M-H sampler is Harris recurrent.
Proof. Let s(x) = ∫ q(x, y)[1 − α(x, y)] µ(dy). Then s(x) is the probability of staying at x. We can write the M-H kernel as

P(x, A) = Pr(X_1 ∈ A | X_0 = x)
= Pr(X_1 ∈ A | X_0 = x, X_1 = X_0) Pr(X_1 = X_0) + Pr(X_1 ∈ A | X_0 = x, X_1 ≠ X_0) Pr(X_1 ≠ X_0)
= δ_x(A) s(x) + (1 − s(x)) Pr(X_1 ∈ A | X_0 = x, X_1 ≠ X_0)
:= δ_x(A) s(x) + (1 − s(x)) M(x, A).

Here M(x, ·) is the kernel conditional on X_{n+1} ≠ X_n. We assume that M(x, ·) is absolutely continuous with respect to the Lebesgue measure µ for all x ∈ X.

Since P is F-irreducible and F({x}) < 1, it follows that s(x) < 1 for all x ∈ X, since otherwise the chain would never move from x and hence could not be irreducible.

Of course, our target distribution is also absolutely continuous, so that

F(A) = ∫_A f(x) µ(dx).

Suppose for a set A, F(A) = 1. Then F(A^c) = 0, and since f(x) > 0 for x ∈ X, µ(A^c) = 0. Thus, by absolute continuity, M(x, A^c) = 0, so M(x, A) = 1.

In conclusion, if the current state is x, the chain will eventually move according to M(x, ·), at which point it will necessarily move into A. Thus Pr(τ_A < ∞ | X_0 = x) = 1. The result follows from Theorem 12.
So we know that full M-H algorithms that are F-irreducible are aperiodic and Harris recurrent. We next move on to component-wise algorithms.

Lemma 1 (without proof). For A such that F(A) = 0,

Pr(X_n ∈ A | X_0 = x) ≤ Pr(D_n | X_0 = x),

where D_n is the event that by time n, the chain has not yet moved in each coordinate direction.

Theorem 14 (Component-wise algorithms). Suppose P is a component-wise M-H Markov transition kernel. If FP = F, P is F-irreducible, and for all x ∈ X, with probability 1 there will eventually be at least one move in every coordinate direction, then P is Harris recurrent.

Proof. Our conditions imply

lim_{n→∞} Pr(D_n | X_0 = x) = 0.

Suppose F(A) = 0. Since f(x) > 0 for all x ∈ X, it follows that µ(A) = 0. By the lemma,

Pr(X_n ∈ A, ∀ n | X_0 = x) ≤ lim_{n→∞} Pr(X_n ∈ A | X_0 = x) ≤ lim_{n→∞} Pr(D_n | X_0 = x) = 0.
Theorem 15 (Ergodicity). If FP = F , P is F -irreducible, aperiodic, and Harris
recurrent, then for every initial distribution λ
limn→∞
||λP n − F || = 0 .
Consequently, for all x ∈ X
limn→∞
||P n(x, ·)− F (·)|| = 0 .
Moreover, for any two initial distributions λ1 and λ2
limn→∞
||λ1P n − λ2P n|| = 0 .
The Markov chain is then said to be ergodic.
Since virtually all MCMC algorithms satisfy the above conditions, theoretical conver-
gence of the Markov chain is guaranteed. Before we can give a proof of the above
theorem, we will need to understand coupling for which we need the minorization
condition.
6 Minorization and coupling
In this section, we will prove Theorem 15. For that we need the concept of “coupling”,
for which we need the “split chain”, for which we need “minorization”.
6.1 Minorization
Definition 11. A minorization condition holds if there exist C ⊆ X, a positive integer m, a constant ε > 0 and a probability measure Q on X such that for all x ∈ C and for all A ∈ B(X),

P^m(x, A) ≥ ε Q(A).

C is said to be small.

Intuitively, this means that all m-step transitions from within C have an ε-overlap; that is, they have an ε component that is common to them. We will generally only care about m = 1. A minorization condition always exists (we will not prove this).
Example 18 (Independent M-H sampler). This example illustrates the minorization condition for the independent M-H sampler (the proposal density does not depend on the current state of the Markov chain). Let q(y) be the proposal density. Then

P(x, A) ≥ ∫_A q(y) min{1, [f(y) q(x)] / [f(x) q(y)]} µ(dy)
= ∫_A f(y) min{q(y)/f(y), q(x)/f(x)} µ(dy).

Let ε = inf_x q(x)/f(x) = [sup_x f(x)/q(x)]^{−1} > 0. Then

P(x, A) ≥ ε ∫_A f(y) µ(dy) = ε F(A).

Since x ∈ X was arbitrary, X is small as long as sup_{x∈X} f(x)/q(x) < ∞. Thus, we require a bounded weight function.
Example 19 (Two-variable deterministic scan Gibbs). We will bound the MTD of the DSGS:

k((x, y), (x′, y′)) = f_{X|Y}(x′ | y) f_{Y|X}(y′ | x′).

Let C be such that inf_{y∈C} f_{X|Y}(x′ | y) = h(x′) > 0. Then for y ∈ C,

k((x, y), (x′, y′)) ≥ h(x′) f_{Y|X}(y′ | x′) = ε · [h(x′) f_{Y|X}(y′ | x′)]/ε = ε Q(x′, y′),

where ε = ∫_X h(x′) dx′ (the integral of f_{Y|X}(y′ | x′) over y′ is 1) and Q is the probability density proportional to h(x′) f_{Y|X}(y′ | x′).
Example 20 (Random scan Gibbs sampler). We will connect RSGS to the two-variable DSGS by using a minorization condition with m = 2. Using the MTDs, the two-step density satisfies, before integrating over the intermediate state (u, v),

k((x, y), (u, v)) k((u, v), (x′, y′))
= (p f_{X|Y}(u | y) δ_y(v) + (1 − p) f_{Y|X}(v | x) δ_x(u)) (p f_{X|Y}(x′ | v) δ_v(y′) + (1 − p) f_{Y|X}(y′ | u) δ_u(x′))
≥ p(1 − p) f_{X|Y}(u | y) δ_y(v) f_{Y|X}(y′ | u) δ_u(x′).

Integrating over (u, v) yields

k²((x, y), (x′, y′)) ≥ p(1 − p) f_{X|Y}(x′ | y) f_{Y|X}(y′ | x′) = p(1 − p) k_DUGS(x′, y′ | x, y) ≥ p(1 − p) ε Q(x′, y′),

where ε and C are defined in the previous example.
6.2 The Split Chain
This is a really nice idea that allows us to "forget" the starting value. Suppose the minorization condition holds, i.e., there exist C, ε > 0 and a probability measure Q such that

P(x, A) ≥ ε Q(A)

for all x ∈ C and A ∈ B(X). When x ∈ C,

P(x, A) = ε Q(A) + (1 − ε) [P(x, A) − ε Q(A)]/(1 − ε) = ε Q(A) + (1 − ε) R(x, A),

where R(x, A) := [P(x, A) − ε Q(A)]/(1 − ε) is itself a Markov kernel. This suggests a recipe for simulating the split chain, living on X × {0, 1}, whose first coordinate has marginal kernel P(x, ·):

1. If x ∉ C, generate X_{n+1} ∼ P(x, ·).
2. If x ∈ C, draw δ_n ∼ Bernoulli(ε):
• if δ_n = 1, then X_{n+1} ∼ Q(·);
• if δ_n = 0, then X_{n+1} ∼ R(x, ·).

Notice that every time X_{n+1} ∼ Q, the Markov chain regenerates: its future no longer depends on the past.
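The recipe above is easy to verify on a small finite-state kernel (an assumed toy matrix), taking C to be the whole space: ε is the sum of column minima, Q the normalized column minima, and R the residual kernel. Simulating the split chain reproduces the stationary distribution of P.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
col_min = P.min(axis=0)
eps = col_min.sum()              # minorization constant (here C = whole space)
Q = col_min / eps                # regeneration measure
R = (P - eps * Q) / (1 - eps)    # residual kernel, a valid kernel row by row
recon = eps * Q + (1 - eps) * R  # should recover P exactly

# Simulate the split chain: regenerate from Q w.p. eps, else move by R.
rng = np.random.default_rng(2)
x, visits = 0, np.zeros(3)
for _ in range(100_000):
    if rng.uniform() < eps:
        x = rng.choice(3, p=Q)       # delta_n = 1: the chain regenerates
    else:
        x = rng.choice(3, p=R[x])    # delta_n = 0: residual move
    visits[x] += 1
pi_hat = visits / visits.sum()

# Stationary distribution of P, via the left eigenvector for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()
```

The reconstruction check confirms that the split-chain mixture is exactly P, so the simulated first coordinate has the right marginal dynamics.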
6.3 Coupling
The idea here is to simulate two chains and, using the split chain concept, run them until both can simultaneously be generated from Q. At that time, the two chains have coupled, and each has forgotten its starting value.

That is, we will run Markov chains X_n and X′_n, so that each has stationary distribution F, but at some point they "meet". Once they meet, they become one. Like soulmates.

Let X_0 ∼ λ, X′_0 ∼ F, n = 0, and let C be a small set. Given X_n, X′_n:

1. If X_n = X′_n, then X_{n+1} = X′_{n+1} ∼ P(x_n, ·).
2. Otherwise, if (X_n, X′_n) ∈ C × C:
• w.p. ε, X_{n+1} = X′_{n+1} ∼ Q;
• w.p. 1 − ε, X_{n+1} ∼ R(x_n, ·) and X′_{n+1} ∼ R(x′_n, ·) independently.
3. Otherwise, if (X_n, X′_n) ∉ C × C, then X_{n+1} ∼ P(x_n, ·) and X′_{n+1} ∼ P(x′_n, ·) independently.
Let T be the (random) coupling time, so T = inf{n ≥ 1 : X_n = X′_n}.

Proposition 3. If P is aperiodic and Harris recurrent, then

1. T < ∞ with probability 1, and
2. Pr(T > n) → 0 as n → ∞.

Proof. The two statements about T are equivalent, more or less, but need to be written out explicitly. The proposition follows from the definition of Harris recurrence.
Theorem 16. For all initial distributions λ, ‖λP^n − F‖ ≤ Pr(T > n).

Proof.

|λP^n(A) − F(A)|
= |Pr(X_n ∈ A) − Pr(X′_n ∈ A)|
= |Pr(X_n ∈ A, X_n = X′_n) + Pr(X_n ∈ A, X_n ≠ X′_n) − Pr(X′_n ∈ A, X_n = X′_n) − Pr(X′_n ∈ A, X_n ≠ X′_n)|
= |Pr(X_n ∈ A, X_n ≠ X′_n) − Pr(X′_n ∈ A, X_n ≠ X′_n)|
≤ max{Pr(X_n ∈ A, X_n ≠ X′_n), Pr(X′_n ∈ A, X_n ≠ X′_n)}
≤ Pr(X_n ≠ X′_n)
≤ Pr(T > n)
→ 0.
Since Pr(T > n)→ 0, this basically proves that for all initial distributions λ
||λP n − F || → 0 as n→∞.
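The coupling inequality can be checked numerically on a finite-state chain (an assumed toy kernel), using the simplest coupling: run the two chains independently until they first meet, then move them together. The exact total variation distance comes from matrix powers.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()                       # stationary distribution F

def coupling_time(x0, rng):
    """T = first time the two chains meet; X_0 = x0 and X'_0 ~ F."""
    x, xp = x0, rng.choice(3, p=pi)
    n = 0
    while x != xp:                       # move independently until they meet
        x = rng.choice(3, p=P[x])
        xp = rng.choice(3, p=P[xp])
        n += 1
    return n

rng = np.random.default_rng(3)
times = np.array([coupling_time(0, rng) for _ in range(20_000)])

tv, tail = [], []
for n in (1, 2, 3):
    Pn = np.linalg.matrix_power(P, n)
    tv.append(0.5 * np.abs(Pn[0] - pi).sum())   # exact ||P^n(0, .) - F||
    tail.append((times > n).mean())             # estimate of Pr(T > n)
```

The exact TV distance sits well below the coupling tail probability at every n, as Theorem 16 requires; a cleverer coupling would tighten the gap.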
7 Rate of Convergence and CLT
We have been studying Markov chains in class so far, but this is a "Markov chain Monte Carlo" course; now we will learn about the "Monte Carlo" part. That is, we now know how to obtain samples from a target distribution so that eventually they are from the right distribution. But the samples are correlated, so are they useful? Suppose F is the target distribution and, for a function g, µ_g = E_F[g]. Having obtained X_1, X_2, …, X_n, we estimate µ_g with

µ_n = (1/n) Σ_{t=1}^n g(X_t).

In general we want the following in the MCMC context:

1. A representative sample X_1, …, X_n from F.
2. µ_n → µ_g with probability 1 as n → ∞.
3. √n (µ_n − µ_g) →^d N(0, σ_g²) as n → ∞.
4. An estimator of σ_g².
So far we have shown that if P is F -invariant, F -irreducible, aperiodic and Harris
recurrent, then we can do 1. In addition we will see that 2 holds as well.
Theorem 17 (Birkhoff ergodic theorem). Suppose that P is F-invariant, F-irreducible, aperiodic and Harris recurrent. If µ_g = ∫_X g(x) F(dx) exists, then

µ_n = (1/n) Σ_{i=1}^n g(X_i) → µ_g with probability 1, as n → ∞.

Moreover, let F_V be the distribution function of g(X) and consider estimating the quantile ξ_q = inf{v : F_V(v) ≥ q} for 0 < q < 1. Let Y_{n(j)} denote the jth order statistic of g(X_1), …, g(X_n). Then for j − 1 < nq < j,

ξ_n = Y_{n(j)} → ξ_q with probability 1.

Proof. Without proof.
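A quick numerical illustration of these ergodic averages, using a random-walk Metropolis chain targeting N(0, 1) (the target and step size are our choices for illustration): both the running mean and an empirical quantile settle at their stationary values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, x = 200_000, 0.0
chain = np.empty(n)
for i in range(n):
    y = x + 2.4 * rng.standard_normal()                 # symmetric proposal
    if np.log(rng.uniform()) <= 0.5 * (x * x - y * y):  # log f(y) - log f(x)
        x = y
    chain[i] = x

mu_n = chain.mean()               # estimates E_F[X] = 0
q90 = np.quantile(chain, 0.9)     # estimates the N(0,1) 0.9-quantile, ~1.2816
```

Despite the serial correlation, both averages converge; the CLT discussed next quantifies how fast.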
The Markov Chain CLT
We know that strong convergence of µ_n holds under the Harris ergodicity of the Markov chain. Thus, those conditions are necessary for a Markov chain CLT to hold. In addition, a CLT is said to exist if there is a σ_g² < ∞ such that, as n → ∞,

√n (µ_n − µ_g) →^d N(0, σ_g²).

Moreover, we will show that

σ_g² = lim_{n→∞} (1/n) E_F[(Σ_{i=1}^n g(X_i) − n µ_g)²] = Var_F(g(X_1)) + 2 Σ_{k=1}^∞ Cov_F(g(X_1), g(X_{1+k})).
Remark 4. Harris ergodicity and E_F[g(X)²] < ∞ are not by themselves enough for a CLT, as the following result illustrates.
Theorem 19. Suppose X = R and set A(x) = 1 − r(x), where r(x) is the probability that the chain remains at x. If A(x) is monotonically decreasing for sufficiently large x, and

lim_{x→∞} |f(x)/A′(x)| = ∞,

then lim_{n→∞} n E[(g(x) − µ_g)² r^n(x)] = ∞.

Proof. See Roberts (1999).
Example 21. Suppose F = Exp(1) and consider an independent M-H sampler with proposal Q = Exp(θ), θ > 2. Then

A(x) = ∫_X q(y) α(x, y) dy = e^{−θx}[θe^x + (1 − θ)],

and

|f(x)/A′(x)| = e^{x(θ−2)} / [θ(θ − 1)(1 − e^{−x})] → ∞ as x → ∞,

since θ > 2.
So, when does a CLT hold? In addition to Harris ergodicity and a moment condition, we need the Markov chain to converge rapidly. An intuitive way to understand this: in

√n (µ_n − µ_g) →^d N(0, σ_g²),

where

σ_g² = Var_F(g(X_1)) + 2 Σ_{k=1}^∞ Cov_F(g(X_1), g(X_{1+k})) = Σ_{k=−∞}^∞ Cov_F(g(X_1), g(X_{1+k})),

for a CLT we want σ_g² < ∞, so the lag covariances must decay quickly, and fast-converging chains have quickly decaying covariances.
Convergence Rates
Let M : X → R_+ and ψ : N → [0, 1] be such that

‖P^n(x, ·) − F(·)‖ ≤ M(x) ψ(n) for all x, n.

1. Polynomial ergodicity of order k: ψ(n) = n^{−k} for some k > 0.
2. Geometric ergodicity: ψ(n) = t^n for some 0 ≤ t < 1.
3. Uniform ergodicity: ψ(n) = t^n for some 0 ≤ t < 1 and sup_x M(x) < ∞.
4. Gaussian processes
Remark 5. Consider {X_n, n ≥ 0} and {g(X_n), n ≥ 0}. Let {W_n = g(X_n), n ≥ 0} and M_k^m = σ(W_k, …, W_m). Then M_k^m ⊆ F_k^m, where F_k^m = σ(X_k, …, X_m). If α_W and α are the mixing coefficients for {W_n} and {X_n}, then α_W(n) ≤ α(n).
Theorem 20. Suppose X_0 ∼ F and the Markov chain is Harris ergodic. Then the Markov chain is strongly mixing.

Proof. Since the Markov chain is time homogeneous, consider arbitrary A, B ∈ B(X). By the coupling inequality,

∫_B P_x(T > n) F(dx) ≥ ∫_B |P^n(x, A) − F(A)| F(dx)
≥ |∫_B (P^n(x, A) − F(A)) F(dx)|
= |Pr(X_0 ∈ B, X_n ∈ A) − F(A)F(B)|.

Therefore, the mixing coefficient α(n) = sup_{A,B ∈ B(X)} |Pr(X_0 ∈ B, X_n ∈ A) − F(A)F(B)| satisfies α(n) ≤ E_F[P_x(T > n)]. Recall that P_x(T > n) → 0 for all x ∈ X as n → ∞, and a dominated convergence argument shows that

E_F[P_x(T > n)] → 0 as n → ∞.

Hence α(n) → 0 as n → ∞.
Theorem 21. Suppose P is Harris ergodic and X_0 ∼ F. If

‖P^n(x, ·) − F(·)‖ ≤ M(x) ψ(n)

and E_F[M(X)] < ∞, then α(n) ≤ ψ(n) E_F[M(X)]. This lets us verify summability conditions on the mixing coefficients, of the form Σ_{n=1}^∞ n^a α(n)^b < ∞, through the convergence rate ψ.

To see where the asymptotic variance comes from, suppose X_0 ∼ F, so the chain is stationary. Then

n Var_F(µ_n) = n Var_F((1/n) Σ_{t=1}^n g(X_t))
= (1/n) Σ_{t=1}^n Σ_{l=1}^n Cov_F(g(X_t), g(X_l))
= (1/n) Σ_{t=1}^n Var_F(g(X_t)) + (2/n) Σ_{t<l} Cov_F(g(X_t), g(X_l)).

Let Y_t = g(X_t) − µ_g. A CLT for stationary, strongly mixing sequences then says: if

σ² = E_F(Y_1²) + 2 Σ_{j=2}^∞ E_F(Y_1 Y_j)

converges and σ² > 0, then as n → ∞,

√n ((1/n) Σ_{t=1}^n Y_t) →^d N(0, σ²).
The following theorem is a corollary to the above result, and its proof follows directly from it.

Theorem 24. Suppose P is F-Harris ergodic. In addition, let X_0 ∼ F, ‖P^n(x, ·) − F(·)‖ ≤ M(x) ψ(n) and E_F[M(X)] < ∞. Let g : X → R and suppose at least one of the following holds:

1. sup_x |g(x)| < ∞ and Σ_{n=1}^∞ ψ(n) < ∞; or
2. E_F[|g(X)|^{2+δ}] < ∞ for some δ > 0 and Σ_{n=1}^∞ ψ(n)^{δ/(2+δ)} < ∞.

Then a CLT holds for µ_n. The proof of the more general version of this result (without the stationary start) uses the theory of harmonic functions, so it is not provided here.
Remark 6. The same result holds for the law of large numbers as well.

Now, geometric and polynomial ergodicity give forms of ψ(n) for which sums such as Σ_{n=1}^∞ ψ(n)^{δ/(2+δ)} are finite.

Theorem 27. Suppose P is geometrically ergodic and g : X → R is such that E_F[|g(X)|^{2+δ}] < ∞ for some δ > 0. Then a CLT holds for µ_n.

7.1 Uniform Ergodicity

Theorem 28. Suppose there exist ε > 0 and a probability measure Q such that

P(x, A) ≥ ε Q(A) for all x ∈ X and A ∈ B(X).

Then P is uniformly ergodic and

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^n.
Proof. By the coupling inequality, we know that

‖P^n(x, ·) − F(·)‖ ≤ Pr(T > n).

But since the whole space X is small, the chains couple with probability ε at every step; that is, T ∼ Geometric(ε). Thus, Pr(T > n) = (1 − ε)^n.
Remark 8. Notice that the theorem yields a quantitative upper bound, so we can easily see that to be within δ of stationarity in total variation we need

n ≥ log δ / log(1 − ε).

Unfortunately, these bounds are often extremely conservative because ε is tiny.
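Solving (1 − ε)^n ≤ δ for n makes the conservativeness concrete:

```python
import math

def n_for_tv(eps, delta):
    """Smallest n with (1 - eps)^n <= delta, from the uniform ergodicity bound."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

n_modest = n_for_tv(0.1, 0.01)    # a modest eps = 0.1 needs 44 steps
n_tiny = n_for_tv(1e-4, 0.01)     # a tiny eps, as is typical, needs ~46,000 steps
```

The required n grows like 1/ε, so even moderately small minorization constants translate into very pessimistic run lengths.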
Remark 9. Suppose we can only establish

P^{n_0}(x, A) ≥ ε Q(A) for all x.

Then P is still uniformly ergodic and

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^{⌊n/n_0⌋}.
Example 23. If X is finite, then every F-irreducible, F-invariant, recurrent Markov chain is uniformly ergodic.

Proof. We will construct a minorization condition. Let X = {y_1, …, y_t}. Since the Markov chain is F-irreducible, for each x_i ∈ X there exists n_i such that P^{n_i}(x_i, dy) > 0. Let n be the least common multiple of n_1, …, n_t. Then for all x ∈ X, P^n(x, dy) > 0. We then have:

P^n(x, dy) ≥ inf_{x∈X} P^n(x, dy) = [Σ_{i=1}^t inf_{x∈X} P^n(x, dy_i)] × [inf_{x∈X} P^n(x, dy) / Σ_{i=1}^t inf_{x∈X} P^n(x, dy_i)] = ε Q(dy),

with ε = Σ_{i=1}^t inf_{x∈X} P^n(x, dy_i) > 0 and Q the corresponding probability measure. Thus, X is small, and the theorem applies.
Example 24 (Independent M-H). Recall that if ε = inf_{x∈X} q(x)/f(x) > 0, then X is small. So as long as ε > 0, the independent M-H sampler is uniformly ergodic. Mengersen et al. (1996) show that if ε = 0, then the chain is not even geometrically ergodic (proof not included here).
Example 25 (Exp(1) example). Let F = Exp(1) and Q = Exp(θ). Then

ε = inf_{x∈X} q(x)/f(x) = inf_{x∈X} θe^{−xθ}/e^{−x} = θ inf_{x∈X} e^{−x(θ−1)}.

So ε = θ if θ < 1, and ε = 0 if θ > 1. When θ = 1, that is, when the proposal is the target, ε = 1, which yields the best bound.
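A sketch of this sampler in Python: with θ = 1/2 the weight f/q is bounded, and the log acceptance ratio simplifies to −(y − x)(1 − θ).

```python
import numpy as np

def indep_mh_exp(theta, n, rng):
    """Independent M-H targeting Exp(1) with an Exp(theta) proposal, theta < 1."""
    x, chain = 1.0, np.empty(n)
    for i in range(n):
        y = rng.exponential(1.0 / theta)   # proposal; numpy uses scale = 1/rate
        # ratio f(y) q(x) / (f(x) q(y)) = exp(-(y - x) (1 - theta))
        if np.log(rng.uniform()) <= -(y - x) * (1.0 - theta):
            x = y
        chain[i] = x
    return chain

rng = np.random.default_rng(5)
chain = indep_mh_exp(0.5, 200_000, rng)
```

Ergodic averages of the first two moments recover the Exp(1) values 1 and 2, consistent with the uniform ergodicity established above.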
Example 26 (M-H on compact X). Suppose X is compact and let P be an M-H kernel with proposal q which is continuous in both arguments. If f(x) ≤ k on X, then P is uniformly ergodic:

P(x, A) ≥ ∫_A q(x, y) α(x, y) µ(dy)
= ∫_A q(x, y) min{1, [f(y) q(y, x)] / [f(x) q(x, y)]} µ(dy)
= ∫_A min{q(x, y), f(y) q(y, x)/f(x)} µ(dy)
≥ inf_{x,y∈X} q(x, y) ∫_A min{1, f(y)/f(x)} µ(dy)
= δ ∫_A min{1, f(y)/f(x)} µ(dy)    [δ := inf_{x,y∈X} q(x, y)]
≥ δ ∫_A min{1, f(y)/k} µ(dy)
= δ ∫_A h(y) µ(dy).

Normalizing h gives a minorization over all of X. Thus, if f and q are both continuous on a compact space, then P is uniformly ergodic.
Example 27 (Two-variable Gibbs sampler). Consider the two-variable Gibbs sampler with invariant density f(x, y) on X × Y. Then k((x, y), (x′, y′)) = f_{X|Y}(x′ | y) f_{Y|X}(y′ | x′). If inf_{y∈Y} f_{X|Y}(x′ | y) = h(x′) > 0 and we let ε = ∫ h(x′) dx′, then:

P((x, y), A) = ∫_A k((x, y), (x′, y′)) dx′ dy′ ≥ ε (1/ε) ∫_A h(x′) f_{Y|X}(y′ | x′) dx′ dy′.

But this is rarely useful, unless Y is compact and f_{X|Y}(x | y) is continuous in y.
We will now look at an example of the two-variable Gibbs sampler which is not defined on a compact space, but which will nevertheless turn out to be uniformly ergodic.

Example 28. Suppose i = 1, …, m and a, c, d > 0 are known, with

Y_i ∼ind Poisson(λ_i)
λ_i ∼iid Gamma(a, β)
β ∼ Gamma(c, d).

Then f(λ, β | y) ∝ (∏_{i=1}^m λ_i^{a+y_i−1} e^{−(β+1)λ_i}) β^{c−1} e^{−dβ}. We can show that

λ | β, y ∼ ∏_{i=1}^m Gamma(a + y_i, β + 1)   (independently across i)
β | λ, y ∼ Gamma(c, d + Σ_{i=1}^m λ_i).

Consider the Gibbs sampler (β, λ) → (β′, λ) → (β′, λ′), so that

k((β, λ), (β′, λ′)) = f_{β|λ}(β′ | λ, y) f_{λ|β}(λ′ | β′, y).

Now,

f_{β|λ}(β′ | λ, y) = [(d + Σ_{i=1}^m λ_i)^c / Γ(c)] (β′)^{c−1} e^{−(d + Σ_{i=1}^m λ_i)β′}
≥ [d^c / Γ(c)] (β′)^{c−1} e^{−(d + Σ_{i=1}^m λ_i)β′},

which is not bounded below as a function of λ. Thus it would seem that in this particular case we have created for ourselves a chain that is not uniformly ergodic. We will, however, see that this is not the case.
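The sampler itself is easy to implement from the full conditionals displayed above (data here are simulated; note that numpy's gamma takes shape and scale = 1/rate). At stationarity, E[(β + 1)λ_i] = a + y_i exactly, which gives a simple sanity check.

```python
import numpy as np

rng = np.random.default_rng(6)
m, a, c, d = 20, 2.0, 3.0, 1.0
y = rng.poisson(2.0, size=m)          # simulated counts

n, beta = 20_000, 1.0
betas, used_beta = np.empty(n), np.empty(n)
lams = np.empty((n, m))
for t in range(n):
    used_beta[t] = beta
    lam = rng.gamma(a + y, 1.0 / (beta + 1.0))       # lambda_i | beta, y
    beta = rng.gamma(c, 1.0 / (d + lam.sum()))       # beta | lambda, y
    lams[t], betas[t] = lam, beta

# E[(beta + 1) lambda_1 | beta] = a + y_1 for every beta, so this time
# average should match a + y_1 regardless of burn-in.
check = ((used_beta + 1.0) * lams[:, 0]).mean()
```

Since each λ update is an exact conditional draw, the moment identity holds conditionally on β, not just at stationarity, making it a cheap correctness check for the implementation.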
Two Variable Gibbs: A special algorithm
The example above is a good motivation to illustrate how the two-variable Gibbs is a
special Markov chain. Continuing notation from the previous example, the marginal sequence {β_n, n ≥ 1} (or, equally, the λ sequence) is a Markov chain with kernel

P_β(β, A) = ∫_A [∫ f_{β|λ}(β′ | λ′) f_{λ|β}(λ′ | β) dλ′] dβ′ = ∫_A k(β, β′) dβ′.

We can show that k(β, β′) is a density and that the marginal chain is reversible.

k(β, β′) is a density:

∫ k(β, β′) dβ′ = ∫∫ f_{β|λ}(β′ | λ′) f_{λ|β}(λ′ | β) dλ′ dβ′
= ∫ f_{λ|β}(λ′ | β) [∫ f_{β|λ}(β′ | λ′) dβ′] dλ′
= 1.

Moreover, {β_n} converges in total variation norm at the same rate as the joint Markov chain {(β_n, λ_n)}. Thus if {β_n} is uniformly ergodic, then so is {(β_n, λ_n)}. We need to understand de-initializing to see this.
Definition 12. Let {X_n, n ≥ 0} be a Markov chain and {Y_n, n ≥ 0} be a stochastic process. We say {Y_n} is de-initializing for {X_n} if for all n ≥ 1,

L(X_n | X_0, Y_n) = L(X_n | Y_n),

where L denotes the conditional law (essentially, the conditional distribution).
Theorem 29. Suppose µ and µ′ are probability measures. If {Y_n} is de-initializing for {X_n}, then

‖L(X_n | X_0 ∼ µ) − L(X_n | X_0 ∼ µ′)‖ ≤ ‖L(Y_n | X_0 ∼ µ) − L(Y_n | X_0 ∼ µ′)‖.

Proof. Recall ‖ν − ν′‖ = sup_S |ν(S) − ν′(S)| = sup_{0 ≤ f ≤ 1} |∫ f dν − ∫ f dν′|. Let S ∈ B(X). Then

|Pr(X_n ∈ S | X_0 ∼ µ) − Pr(X_n ∈ S | X_0 ∼ µ′)|
= |∫ Pr(X_n ∈ S | X_0 = x) µ(dx) − ∫ Pr(X_n ∈ S | X_0 = x) µ′(dx)|
= |∫∫ Pr(X_n ∈ S | X_0 = x, Y_n = y) Pr(Y_n ∈ dy | X_0 = x) µ(dx) − ∫∫ Pr(X_n ∈ S | X_0 = x, Y_n = y) Pr(Y_n ∈ dy | X_0 = x) µ′(dx)|.

By de-initializing, this equals

|∫∫ Pr(X_n ∈ S | Y_n = y) Pr(Y_n ∈ dy | X_0 = x) µ(dx) − ∫∫ Pr(X_n ∈ S | Y_n = y) Pr(Y_n ∈ dy | X_0 = x) µ′(dx)|
= |∫∫ f_S(y) Pr(Y_n ∈ dy | X_0 = x) µ(dx) − ∫∫ f_S(y) Pr(Y_n ∈ dy | X_0 = x) µ′(dx)|,

where f_S(y) = Pr(X_n ∈ S | Y_n = y) ∈ [0, 1]. So, by the alternate definition of the total variation norm,

|Pr(X_n ∈ S | X_0 ∼ µ) − Pr(X_n ∈ S | X_0 ∼ µ′)| ≤ ‖L(Y_n | X_0 ∼ µ) − L(Y_n | X_0 ∼ µ′)‖,

which holds for all S ∈ B(X), and thus the claim follows.
Example 29. Consider a two-variable Gibbs sampler (x, y) → (x, y′) → (x′, y′):

Y′ ∼ f_{Y|X}(y′ | x), then X′ ∼ f_{X|Y}(x′ | y′).

Claim: The rate of convergence of {X_t} is the same as the rate of convergence of {(X_t, Y_t)}. Proof: later, since I am still figuring it out.

Thus, to study convergence rates of a two-variable Gibbs sampler, it is sufficient to study the convergence rate of either marginal Markov chain:

P_X(x, A) = ∫_A ∫_Y f_{X|Y}(x′ | y′) f_{Y|X}(y′ | x) dy′ dx′
P_Y(y, A) = ∫_A ∫_X f_{Y|X}(y′ | x′) f_{X|Y}(x′ | y) dx′ dy′.
Example 30 (Two-variable Gibbs continued). Recall that for i = 1, …, m and known a, c, d > 0,

Y_i ∼ind Poisson(λ_i), λ_i ∼iid Gamma(a, β), β ∼ Gamma(c, d),

and

k((β, λ), (β′, λ′)) = f_{β|λ}(β′ | λ, y) f_{λ|β}(λ′ | β′, y).

Then the marginal chain on just β has transition density

k(β, β′) = ∫ f_{β|λ}(β′ | λ, y) f_{λ|β}(λ | β, y) dλ
= ∫ [(d + Σ_{i=1}^m λ_i)^c / Γ(c)] (β′)^{c−1} e^{−(d + Σ_{i=1}^m λ_i)β′} ∏_{i=1}^m [(β + 1)^{a+y_i}/Γ(a + y_i)] λ_i^{a+y_i−1} e^{−(β+1)λ_i} dλ
≥ [d^c / Γ(c)] (β′)^{c−1} e^{−dβ′} ∏_{i=1}^m ∫_0^∞ [(β + 1)^{a+y_i}/Γ(a + y_i)] λ_i^{a+y_i−1} e^{−(β′+β+1)λ_i} dλ_i
= [d^c / Γ(c)] (β′)^{c−1} e^{−dβ′} ∏_{i=1}^m [(β + 1)/(β′ + β + 1)]^{a+y_i}
≥ [d^c / Γ(c)] (β′)^{c−1} e^{−dβ′} ∏_{i=1}^m [1/(β′ + 1)]^{a+y_i} := h(β′),

where the first inequality uses (d + Σλ_i)^c ≥ d^c and folds e^{−β′Σλ_i} into the product, and the last uses the fact that (β + 1)/(β′ + β + 1) is increasing in β with infimum 1/(β′ + 1) at β = 0.

Setting ε = ∫ h(β′) dβ′ ∈ (0, ∞), we get

k(β, β′) ≥ ε h(β′)/ε for all β.

Hence, the marginal Markov chain {β_t} is uniformly ergodic, and thus {(β_t, λ_t)} is uniformly ergodic.
Theorem 30 (The first comparison theorem). Suppose P and Q are Markov kernels and there exists δ > 0 such that

P(x, A) ≥ δ Q(x, A) for all x ∈ X, A ∈ B(X).

If P and Q have invariant distribution F and Q is uniformly ergodic, then so is P.

Proof. Since Q is uniformly ergodic, X is small w.r.t. Q. That is, there exist m ≥ 1, ε > 0 and a probability measure ν such that

Q^m(x, A) ≥ ε ν(A) for all x ∈ X, A ∈ B(X).

Due to the minorization condition, we have that

P^m(x, A) ≥ δ^m Q^m(x, A) ≥ δ^m ε ν(A).
7.2 Random Scan Gibbs Sampler
Recall the MTD of a two-variable RSGS with selection probability r:

k_RSGS(x′, y′ | x, y) = r f_{X|Y}(x′ | y) δ(y′ − y) + (1 − r) f_{Y|X}(y′ | x) δ(x′ − x),

with kernel

P_RSGS((x, y), A) = r P_X(y, A) + (1 − r) P_Y(x, A),

where

P_X(y, A) = ∫_{{x′ : (x′,y) ∈ A}} f_{X|Y}(x′ | y) µ_X(dx′) and P_Y(x, A) = ∫_{{y′ : (x,y′) ∈ A}} f_{Y|X}(y′ | x) µ_Y(dy′).
Theorem 31. If PRSGS is uniformly ergodic for some selection probability r∗, then it
is uniformly ergodic for all selection probabilities r ∈ (0, 1)
Proof.

P_RSGS((x, y), A) = r P_X(y, A) + (1 − r) P_Y(x, A)
= (r/r*) r* P_X(y, A) + [(1 − r)/(1 − r*)] (1 − r*) P_Y(x, A)
≥ min{r/r*, (1 − r)/(1 − r*)} P_RSGS,r*((x, y), A),

and thus by the first comparison theorem the claim follows, since P_RSGS,r* is uniformly ergodic.
Theorem 32. If P_DUGS is uniformly ergodic, then so is P_RSGS. This is true even outside of the two-variable case.
Proof. We have established before that

P_RSGS^{2n}((x, y), A) ≥ [r(1 − r)]^n P_DUGS^n((x, y), A),

and so the result follows immediately from the first comparison theorem.
Two-variable Conditional Metropolis-Hastings

1. Draw Y′ ∼ F_{Y|X}(· | x).
2. Draw V ∼ q((x, y′), ·) and independently U ∼ Uniform(0, 1). If

U ≤ [f_{X|Y}(v | y′) q((v, y′), x)] / [f_{X|Y}(x | y′) q((x, y′), v)],

set x′ = v; otherwise x′ = x.

This simulates a Markov chain having the MTD

k_CMH((x, y), (x′, y′)) = f_{Y|X}(y′ | x) h((x, y′), x′),

where h((x, y′), x′) = q((x, y′), x′) α((x, y′), x′) + δ(x′ − x) r(x, y′) and r(x, y′) is the rejection probability. Similarly, we can use a random scan:

k_RCMH((x, y), (x′, y′)) = r f_{Y|X}(y′ | x) δ(x′ − x) + (1 − r) h((x, y), x′) δ(y′ − y).

Just as we did with the Gibbs sampler, we can write the kernels

P_CMH = P_{Y|X} P_{MH:X} and P_RCMH = r P_{Y|X} + (1 − r) P_{MH:X}.

Consider P_CMH. Notice that if we choose the proposal density so that

q((v, y), x) ≈ f_{X|Y}(x | y),

then the acceptance probability will satisfy α ≈ 1, so the conditional M-H should behave a lot like DUGS. Notice that, because of the update order, the marginal sequence {X_n, n ≥ 1} is a Markov chain having MTD

h_X(x, x′) = ∫_Y h((x, y′), x′) f_{Y|X}(y′ | x) µ(dy′),

and the marginal kernel is F_X-symmetric.
In-class teaching stopped here due to COVID-19. What follows are typed-up notes. There are discussion questions that go along with them.
7.3 Geometric ergodicity
Lecture 1:
Recall that P is geometrically ergodic if there exist M : X → (0, ∞) and 0 < t < 1 such that

‖P^n(x, ·) − F(·)‖ ≤ M(x) t^n.

If P is geometrically ergodic and E_F|g|^{2+δ} < ∞ for some δ > 0, then as n → ∞,

√n (µ_n − µ_g) →^d N(0, σ_g²).
One of the most common ways of showing that a CLT exists is to show that the Markov chain is geometrically ergodic. This is often done by establishing a drift condition and an associated minorization condition.
7.3.1 Drift and minorization conditions

A drift condition holds if there exists a function V : X → (0, ∞) such that for some 0 < λ < 1 and b < ∞,

PV(x) ≤ λV(x) + b for all x ∈ X,

where PV(x) = ∫ V(y) P(x, dy). The associated minorization condition is required on the level set

C = {x : V(x) ≤ d} for some d > 2b/(1 − λ).
Recall that a small set is a set C for which there exist ε > 0 and a measure Q such that for all x ∈ C, P(x, ·) ≥ ε Q(·). Whenever the Markov chain is in the small set, there is a chance to couple, and as soon as the Markov chain couples, it forgets its starting value (and thus starts sampling from F). Markov chains that converge fast will couple faster, which can happen if the Markov chain is in the small set often.
Together, the drift and minorization conditions make that happen.
To see this, consider Figure 11, where the target density, a drift function V and the small set C are presented. The drift condition says that, on average (not all the time), the next move of the Markov chain will be such that it goes to a smaller value of the drift function; that is, it drifts downwards. Since the small set is a level set of V, there is a good chance that the next state of the Markov chain lands in the small set. Thus, if a drift and minorization hold, they guarantee a reasonably fast coupling time, implying a fast, or geometric, rate of convergence.

Figure 11: A target density, a drift function V, and the small set C = {x : V(x) ≤ d}.
Theorem 33. P is geometrically ergodic if a drift condition holds and the set C is
small.
Example 31. Consider the following model for i = 1, …, m, with m ≥ 3:

Y_i ∼iid N(µ, θ) and ν(µ, θ) ∝ 1/√θ.

This seems like a weird prior, but let's just go with it. The posterior distribution is

f(µ, θ | y) ∝ θ^{−(m+1)/2} exp{−(1/(2θ)) Σ_{j=1}^m (y_j − µ)²}.

Let ȳ = m^{−1} Σ y_j and s² = Σ_j (y_j − ȳ)². The full conditional distributions are

µ | θ, y ∼ N(ȳ, θ/m)
θ | µ, y ∼ IG((m − 1)/2, (s² + m(ȳ − µ)²)/2).
We will show that the deterministic scan Gibbs sampler is geometrically ergodic. The MTD is

k(µ′, θ′ | µ, θ) = f_{θ|µ}(θ′ | µ, y) f_{µ|θ}(µ′ | θ′, y).

Let V(µ, θ) = (µ − ȳ)². Then

PV(µ, θ) = ∫∫ V(µ′, θ′) k(µ′, θ′ | µ, θ) dµ′ dθ′
= ∫∫ (µ′ − ȳ)² f_{θ|µ}(θ′ | µ, y) f_{µ|θ}(µ′ | θ′, y) dµ′ dθ′
= ∫ [∫ (µ′ − ȳ)² f_{µ|θ}(µ′ | θ′, y) dµ′] f_{θ|µ}(θ′ | µ, y) dθ′    (the inner integral is Var(µ′ | θ′, y) = θ′/m)
= (1/m) ∫ θ′ f_{θ|µ}(θ′ | µ, y) dθ′    (this is E[θ′ | µ, y]/m)
= (1/m) (s² + m(µ − ȳ)²)/(m − 3)
= (µ − ȳ)²/(m − 3) + s²/(m(m − 3)).

Thus,

PV(µ, θ) ≤ λ V(µ, θ) + b,

where λ > 1/(m − 3) (with λ < 1 possible when m > 4) and b = s²/(m(m − 3)). Thus, a drift condition holds. Next, we need to show that a minorization condition holds.
Consider the set

C = {(µ, θ) : (µ − ȳ)² ≤ d}.

We need to show that there exist an ε > 0 and a density q(µ′, θ′) such that if (µ, θ) ∈ C,

k(µ′, θ′ | µ, θ) = f_{θ|µ}(θ′ | µ, y) f_{µ|θ}(µ′ | θ′, y) ≥ ε q(µ′, θ′).

First note that

k(µ′, θ′ | µ, θ) ≥ f_{µ|θ}(µ′ | θ′, y) inf_{µ : (µ−ȳ)² ≤ d} f_{θ|µ}(θ′ | µ, y).

Let g(θ) = inf_{µ : (µ−ȳ)² ≤ d} f_{θ|µ}(θ | µ, y) and

θ* = md / [(m − 1) log(1 + md/s²)].

Then, it can be shown that (do this yourself)

g(θ) = IG((m − 1)/2, (s² + md)/2; θ) for θ < θ*,
g(θ) = IG((m − 1)/2, s²/2; θ) for θ ≥ θ*.

Setting ε = ∫ g(θ′) dθ′, we then have

k(µ′, θ′ | µ, θ) ≥ ε f_{µ|θ}(µ′ | θ′, y) g(θ′)/ε,

so C is small.
It is often quite tricky to establish drift and minorization; one has to cook up a function
V and then demonstrate these two conditions. You can use some intuition for con-
structing V : since we want the Markov chain to be in the small set often, it is natural
then to have the small set be an area of high probability under the target. This would
then mean that V should take small values in this area so the Markov chain can “drift”
down to it.
The following is a useful result that helps avoid checking the minorization condition separately:

Theorem 34. If V is unbounded off compact sets, that is, if for every r ∈ R_+ the level set {x : V(x) ≤ r} is compact, then a minorization condition holds for C = {x : V(x) ≤ d}.

Proof. Since V is unbounded off compact sets, by definition, the set C is compact. Then

P(x, dy) ≥ inf_{x∈C} P(x, dy) = ε [inf_{x∈C} P(x, dy)]/ε,

where ε = ∫_X inf_{x∈C} P(x, dy) normalizes inf_{x∈C} P(x, dy) into a probability measure.
End of Lecture 1
Lecture 2
Now that we’ve somewhat gotten used to the drift and minorization conditions, in
this lecture, we will focus on proving that a drift and minorization together imply
geometric ergodicity. The proof first depends on establishing another equivalent drift
and minorization.
Recall that the previous drift and minorization conditions were: there exist a function V : X → [0, ∞), 0 < λ < 1, b < ∞ and d > 2b/(1 − λ) such that

PV(x) ≤ λV(x) + b and C = {x : V(x) ≤ d} is small.

An alternate drift condition says: there exist a function W : X → [1, ∞), 0 < ρ < 1, L < ∞ and a small set C such that

PW(x) ≤ ρW(x) + L I(x ∈ C).

We sketch why the first implies the second; the constants below are one convenient choice. Set W(x) = V(x) + 1 ≥ 1. Then

PW(x) = PV(x) + 1 ≤ λV(x) + b + 1 = λW(x) + b + (1 − λ).

If x ∈ C, this is already of the required form with L = b + (1 − λ). If x ∉ C, then V(x) > d > 2b/(1 − λ), so b < (1 − λ)V(x)/2 and W(x) > d + 1. Hence

b + (1 − λ) < (1 − λ)(V(x) + 2)/2 ≤ (1 − λ) [(d + 2)/(2(d + 1))] W(x),

where the last inequality uses V(x) > d. Therefore, for x ∉ C,

PW(x) ≤ [λ + (1 − λ)(d + 2)/(2(d + 1))] W(x) =: ρ W(x),

and ρ < 1 because (d + 2)/(2(d + 1)) < 1. Combining the two cases,

PW(x) ≤ ρW(x) + L I(x ∈ C), with ρ = λ + (1 − λ)(d + 2)/(2(d + 1)) and L = b + (1 − λ),

which establishes the second drift and minorization. A similar argument in reverse establishes the equivalence, so we may jump back and forth between the two forms.
Now we will establish a "proof" of geometric ergodicity under a drift and minorization. First, recall the coupling inequality for the coupling time T:

‖P^n(x, ·) − F(·)‖ ≤ Pr(T > n).

Suppose that T has a moment generating function. Then there exists a β > 1 so that E[β^T] < ∞, and by Markov's inequality,

β^n Pr(T > n) ≤ E[β^T I(T > n)] → 0 as n → ∞, by the dominated convergence theorem.

Thus ‖P^n(x, ·) − F(·)‖ = o(β^{−n}).
So we can potentially get a geometric rate of convergence in this way. The question is, when does T have a moment generating function? Recall that T is the coupling time, and for (the random variable) T to have thin tails, the chain must be in the small set often. Recall that τ_C = inf{n ≥ 1 : X_n ∈ C} is the first hitting time of C. We will show that

PW ≤ ρW + L I(x ∈ C) ⇒ E[β^{τ_C}] < ∞ for β sufficiently close to 1.

It will then follow that the time to a successful coupling is a geometric sum of random excursion times to C, which will imply a moment generating function for T.
Theorem 35. Suppose W : X → [1, ∞), 0 < ρ < 1, L < ∞ and a small set C are such that

PW(x) ≤ ρW(x) + L I(x ∈ C).

For 1 < β < 1/ρ,

E_x[β^{τ_C}] ≤ β(ρW(x) + L) for x ∈ C, and E_x[β^{τ_C}] ≤ W(x) for x ∉ C.

Proof. Note that for A ∈ B(X) and x ∈ X,

P_x(τ_A = k) = Pr(τ_A = k | X_0 = x),

and P_x(τ_A = 1) = P(x, A). Consequently, due to the Markov property, for all k > 1,

P_x(τ_A = k) = ∫_{A^c} P(x, dy) P_y(τ_A = k − 1) = ∫_{A^c} P(x, dy) ⋯ ∫_{A^c} P(y_{k−2}, dy_{k−1}) P(y_{k−1}, A).

Suppose x ∉ C. Then by the drift condition (PW ≤ ρW off C) and later using W ≥ 1,

W(x) ≥ ρ^{−1} PW(x)
= ρ^{−1} ∫_X W(y) P(x, dy)
= ρ^{−1} ∫_{C^c} W(y) P(x, dy) + ρ^{−1} ∫_C W(y) P(x, dy)
≥ ρ^{−1} ∫_{C^c} W(y) P(x, dy) + ρ^{−1} ∫_C P(x, dy)
= ρ^{−1} ∫_{C^c} W(y) P(x, dy) + ρ^{−1} P_x(τ_C = 1)
≥ ρ^{−1} ∫_{C^c} [ρ^{−1} ∫_X W(z) P(y, dz)] P(x, dy) + ρ^{−1} P_x(τ_C = 1),

and continuing with the same steps,

≥ Σ_{k=1}^∞ ρ^{−k} P_x(τ_C = k) = E_x[ρ^{−τ_C}] ≥ E_x[β^{τ_C}].

Now suppose x ∈ C and condition on the first move x → y. By the previous result and W ≥ 1,

E_x[β^{τ_C}] = ∫_X E[β^{τ_C} | X_1 = y] P(x, dy)
= ∫_C β P(x, dy) + ∫_{C^c} E_y[β^{τ_C + 1}] P(x, dy)
= ∫_C β P(x, dy) + ∫_{C^c} β E_y[β^{τ_C}] P(x, dy)
≤ ∫_C β W(y) P(x, dy) + ∫_{C^c} β W(y) P(x, dy)
= β PW(x) ≤ β(ρW(x) + L).
The rest of the argument gets quite messy. But essentially, we have established bounds for E[β^{τ_C}], which provide bounds on E[β^T]. Together, we will get that for some constant K and δ < 1,

‖P^n(x, ·) − F(·)‖ ≤ K W(x) δ^n.

The final forms of K and δ are complicated, and hence are not presented here. Note that throughout we have used fairly loose bounds; every time we introduce an inequality, we weaken the bound. To begin with, the coupling inequality itself need not be tight.
End of Lecture 2
Lecture 3
In this lecture we will see how the drift and minorization can give us quantifiable upper
bounds on the total variation distance.
Recall that we have at least one instance of a quantifiable upper bound. If X is small with minorization constant ε, then

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^n.
Also recall that the proof of this result was simple - since the whole support is small,
coupling time is a geometric random variable. Additionally, since the upper bound is
not dependent on the starting value, we conclude that the chain is uniformly ergodic
in this case.
What if X is not small, but a subset C is small? Then the above result can be changeda little. Let us first introduce some notation. Recall a set C is small if there exists
� > 0 and a measure Q such that for all x ∈ C
P (x, ·) ≥ �Q(·) .
Let $\{X_n\}$ and $\{X'_n\}$ be two Markov chains such that $X'_0 \sim F$ (started in stationarity). Define
$$t_1 = \inf\left\{m : (X_m, X'_m) \in C \times C\right\}\,,$$
which is the first time both chains are in the small set. Additionally, define
$$t_i = \inf\left\{m : m \geq t_{i-1} + 1,\ (X_m, X'_m) \in C \times C\right\}\,,$$
which is the time stamp for the $i$th time both chains are in $C$. For chains run for $n$ steps, let $N_n$ denote the number of returns to the set $C$ by both chains. That is,
$$N_n = \max\left\{i : t_i < n\right\}\,.$$
Theorem 36. Under the above definitions,
$$\|P^n(x, \cdot) - F(\cdot)\| \leq (1 - \epsilon)^n + \Pr(N_n < n)\,.$$
Proof. As usual, the proof is based on the coupling inequality. Recall that
$$\|P^n(x, \cdot) - F(\cdot)\| \leq \Pr(T > n)\,.$$
Next, note that $N_n \leq n$, and $N_n = n$ means that at every step both Markov chains were in $C$. Thus $T > n$ and $N_n = n$ means that at every step both chains were in $C$ but failed to couple; the probability of this is at most $(1 - \epsilon)^n$. Hence
\begin{align*}
\|P^n(x, \cdot) - F(\cdot)\| &\leq \Pr(T > n)\\
&= \Pr(T > n,\ N_n = n) + \Pr(T > n,\ N_n < n)\\
&\leq (1 - \epsilon)^n + \Pr(T > n,\ N_n < n)\\
&\leq (1 - \epsilon)^n + \Pr(N_n < n)\,.
\end{align*}
Through a series of lemmas and interesting proof techniques, Rosenthal (1995) ar-
rives at an upper bound for Pr(Nn < n) that is based on the drift and minorization
conditions.
Theorem 37. Suppose a drift condition holds with $V : \mathcal{X} \to [0, \infty)$ such that for some $0 < \lambda < 1$ and $b \in \mathbb{R}$,
$$PV(x) \leq \lambda V(x) + b\,.$$
Additionally, let $C = \{x : V(x) \leq d\}$ be a small set with minorization constant $\epsilon$, where $d > 2b/(1 - \lambda)$. Then for any $0 < r < 1$,
$$\|P^n(x, \cdot) - F(\cdot)\| \leq (1 - \epsilon)^{rn} + \left(\alpha^{-(1-r)} A^{r}\right)^{n} \left(1 + \frac{b}{1 - \lambda} + V(x)\right)\,,$$
where
$$\alpha^{-1} = \frac{1 + 2b + \lambda d}{1 + d} < 1 \quad \text{and} \quad A = 1 + 2(\lambda d + b)\,.$$
The above theorem is quite critical since it allows us to do the following: if we can establish a drift and a minorization condition for a Markov chain, we can then bound the distance from the stationary distribution. This upper bound will yield a time stamp $n^*$ at which the chain is sufficiently close to the stationary distribution. That $n^*$ can then be treated as the number of samples that it is reasonable to discard from the sample.
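To make Theorem 37 concrete, here is a small numerical sketch in Python. All the drift and minorization parameters ($\epsilon$, $\lambda$, $b$, $d$) and the tolerance are entirely hypothetical, chosen only for illustration; the function name `rosenthal_bound` is my own.

```python
def rosenthal_bound(n, eps, lam, b, d, r, v_x):
    """Evaluate the Theorem 37 upper bound on ||P^n(x, .) - F(.)||,
    given minorization constant eps, drift parameters (lam, b),
    small-set level d > 2b/(1 - lam), r in (0, 1), and V(x) = v_x."""
    alpha_inv = (1 + 2 * b + lam * d) / (1 + d)   # alpha^{-1} < 1
    A = 1 + 2 * (lam * d + b)
    term1 = (1 - eps) ** (r * n)
    term2 = (alpha_inv ** (1 - r) * A ** r) ** n * (1 + b / (1 - lam) + v_x)
    return term1 + term2

# Hypothetical parameters, purely for illustration.
eps, lam, b = 0.1, 0.9, 0.5
d = 50.0      # satisfies d > 2b / (1 - lam) = 10
r = 0.01      # r must be small enough that the second term decays

# Smallest n at which the bound drops below a 0.01 tolerance.
n_star = next(n for n in range(1, 100_000)
              if rosenthal_bound(n, eps, lam, b, d, r, 1.0) < 0.01)
```

Even with these made-up numbers the bound only drops below 0.01 after a few thousand steps, illustrating how conservative such $n^*$ values tend to be.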
However, the sad truth is that the upper bound is usually quite loose, and often produces $n^*$ values in the millions, so that these bounds are not practically useful. There are some exceptions.
End of Lecture 3
Lecture 4
At this point we have essentially completed the theoretical study of Markov chains, and we now proceed to the Monte Carlo aspects of Markov chain Monte Carlo. Particularly, we want to address the following:
• Sufficient burn-in: Given a bad starting value that is far away from an area of high posterior density, we want to be able to assess in how many steps the Markov chain will approximately start sampling from $F$. In the previous lecture, we learned of a technique to potentially do this rigorously; however, the bounds on the TV distance are often much too loose to be useful. In such instances, we rely mainly on trace plots to assess when the Markov chain has moved away from the starting value.
What you shouldn't do is blindly throw away some portion of the Markov chain without reasonable justification. Throwing away samples is a waste of resources, unless you are informed either theoretically or visually about why you're throwing away these samples. Here is a fantastic rant about this: Charlie Geyer's rant on burn-in.
• Existence of CLT: We now have the tools to check whether the central limit theorem holds. We can verify if the Markov chain is uniformly ergodic, or if it is geometrically ergodic by establishing a drift and minorization. Every Markov chain is different, so a special analysis of every chain must be done. Further, if interest is in estimating the mean of a function $g$ under $F$, then $\mathrm{E}_F |g|^{2+\delta} < \infty$ for some $\delta > 0$ is also required.
Verify on your own that $F = N(0, \sigma^2/(1 - \rho^2))$ for $|\rho| < 1$ is the invariant distribution.
1. Starting value: Let's look at the impact of the starting value. We generate the Markov chain for 500 steps with $\rho = 0.95$ and starting value $x_1 = 0$. The stationary distribution is a normal centered at 0, thus this is a good starting value. Figure 12 below shows the trace plot for this run; the stability of the chain is evident, and there is no reasonable need to discard the starting values.
Figure 12: AR(1) process with starting value 0.
On the other hand, when we start far (far) away from an area of high probability,
the chain takes a while to reach a reasonable level of stationarity. We start the
Markov chain from 100 and the trace plot in Figure 13 indicates that it takes
about 60-70 steps before the Markov chain reaches a point of stability.
Figure 13: AR(1) process with starting value 100.
As explained before, we may want to remove these 60-70 steps from the Markov chain because we have some visual confirmation about the quality of the starting value. Alternatively, if we just run the Markov chain long enough, the effect of a few "bad" samples in the beginning can mostly be ignored.
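The experiment above can be sketched in a few lines of Python (standard library only; the function name `ar1_chain` is my own, not from the notes):

```python
import random

def ar1_chain(x0, n, rho=0.95, sigma=1.0, seed=42):
    """Simulate n steps of the AR(1) chain X_t = rho * X_{t-1} + eps_t,
    with eps_t ~ N(0, sigma^2), started at x0."""
    rng = random.Random(seed)
    chain = [x0]
    for _ in range(n - 1):
        chain.append(rho * chain[-1] + rng.gauss(0.0, sigma))
    return chain

good_start = ar1_chain(0.0, 500)    # started at the stationary mean
bad_start = ar1_chain(100.0, 500)   # started far in the tail
# The deterministic part of the pull back to 0 is 100 * rho^t, which
# falls below the stationary standard deviation (about 3.2) near
# t = 67, matching the 60-70 steps visible in the trace plot.
```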
2. Existence of CLT: For a CLT to hold, we need the Markov chain to be either uniformly, geometrically, or polynomially ergodic, and we need appropriate moment conditions to hold. Consider estimating the mean of $g(x) = x^r$ for $r > 1$. Since the target distribution has a moment generating function, all moments are finite. We move on to establishing drift and minorization.
An alternative form is $X_t \mid X_{t-1} \sim N(\rho X_{t-1}, \sigma^2)$. The Markov transition kernel is
$$P(x, A) = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - \rho x)^2 \right\} dy\,.$$
We will prove geometric ergodicity by establishing a drift and minorization condition. In trying to find a drift function, notice that the target distribution is normal, centered at 0. Recall that ideally we want the drift function to be centered around an area of high probability. Considering this, let
$$V(x) = x^2\,.$$
We need to show the drift condition is satisfied for this function:
\begin{align*}
\int V(y)\, k(x, y)\, dy &= \mathrm{E}\left[V(X_1) \mid X_0 = x\right]\\
&= \mathrm{E}\left[X_1^2 \mid X_0 = x\right]\\
&= \mathrm{Var}(X_1 \mid X_0 = x) + \left(\mathrm{E}[X_1 \mid X_0 = x]\right)^2\\
&= \sigma^2 + \rho^2 x^2\,.
\end{align*}
Since $\rho^2 < 1$, the drift condition holds with $\lambda = \rho^2$ and $b = \sigma^2$. Also, since the drift function $V$ is unbounded off compact sets, the minorization condition is satisfied.
□
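As a sanity check on this drift computation, the following Monte Carlo sketch (the helper `drift_check` is hypothetical, not from the notes) estimates $PV(x) = \mathrm{E}[X_1^2 \mid X_0 = x]$ by simulation and compares it to the exact value $\rho^2 x^2 + \sigma^2$:

```python
import random

def drift_check(x, rho=0.95, sigma=1.0, reps=200_000, seed=7):
    """Monte Carlo estimate of PV(x) = E[X_1^2 | X_0 = x] for the
    AR(1) kernel, returned alongside the exact value rho^2 x^2 + sigma^2."""
    rng = random.Random(seed)
    est = sum((rho * x + rng.gauss(0.0, sigma)) ** 2
              for _ in range(reps)) / reps
    return est, rho**2 * x**2 + sigma**2

est, exact = drift_check(2.0)   # exact = 0.9025 * 4 + 1 = 4.61
```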
End of Lecture 4
Lecture 5
8 Estimating the asymptotic variance
Recall that we are interested in estimating $\mu_g := \mathrm{E}_F g = \int g(x) F(dx)$. Samples $X_1, \dots, X_n$ are obtained from a Markov chain with transition kernel $P$. The chosen estimator of $\mu_g$ is
$$\hat{\mu} = \frac{1}{n} \sum_{t=1}^{n} g(X_t) \overset{a.s.}{\to} \mu_g\,.$$
Additionally, under appropriate conditions, if a Markov chain CLT holds for $\hat{\mu}$, then as $n \to \infty$,
$$\sqrt{n}\left(\hat{\mu} - \mu_g\right) \overset{d}{\to} N(0, \sigma^2_g)\,,$$
where
$$\sigma^2_g = \sum_{k=-\infty}^{\infty} \mathrm{Cov}_F\left(g(X_1), g(X_{1+k})\right)\,.$$
In this section we will estimate $\sigma^2_g$ using two methods. To assess the quality of estimation, it would be useful to compare the performance of various estimators for a Markov chain where the true $\sigma^2_g$ value is known.
Example 32 (AR(1) continued). The AR(1) process is especially useful since a closed-form expression of $\sigma^2_g$ is available for $g(x) = x$. Recall that the stationary distribution of the AR(1) process is $N(0, \sigma^2/(1 - \rho^2))$, where $\sigma^2$ is the variance of the errors $\epsilon_t$. Thus, $\mathrm{Cov}_F(X_1, X_{1+k}) = \rho^k \sigma^2/(1 - \rho^2)$ for $k \geq 0$, and
\begin{align*}
\sigma^2_g &= \sum_{k=-\infty}^{\infty} \mathrm{Cov}_F\left(g(X_1), g(X_{1+k})\right)\\
&= \mathrm{Var}_F(X_1) + 2 \sum_{k=1}^{\infty} \mathrm{Cov}_F\left(g(X_1), g(X_{1+k})\right)\\
&= \frac{\sigma^2}{1 - \rho^2} + 2 \sum_{k=1}^{\infty} \rho^k \frac{\sigma^2}{1 - \rho^2}\\
&= \frac{\sigma^2}{(1 - \rho)^2}\,.
\end{align*}
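The geometric series above can be checked numerically. The following sketch (with $\rho = 0.95$, $\sigma = 1$ chosen as illustrative values) truncates the sum at a large lag and compares it to the closed form $\sigma^2/(1-\rho)^2$:

```python
rho, sigma = 0.95, 1.0
var_F = sigma**2 / (1 - rho**2)        # stationary variance of X_1
# sigma_g^2 = Var_F(X_1) + 2 * sum_{k>=1} rho^k * Var_F(X_1),
# truncated at lag 2000 (rho^2000 is negligible).
sigma2_g = var_F + 2 * sum(rho**k * var_F for k in range(1, 2000))
closed_form = sigma**2 / (1 - rho)**2  # equals 400 for these values
```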
Estimating $\sigma^2_g$ poses some additional challenges in MCMC. Note that we will be using estimates of $\sigma^2_g$ to determine when to terminate the simulation. That is, the simulation is terminated at a random time, so in fact $n = T(\hat{\mu}, \hat{\sigma}^2_g)$ is random. Glynn and Whitt (1992) essentially show that in order for the simulation to terminate "adequately well", the estimator of $\sigma^2_g$ must be strongly consistent.
8.1 Spectral variance estimators
The asymptotic variance is an infinite sum, and we have $n$ samples. So obviously, estimating $\sigma^2_g$ is a difficult problem. A popular, but computationally burdensome, estimator is the spectral variance estimator. Let $R(k) = \mathrm{Cov}_F\left(g(X_1), g(X_{1+k})\right)$ be the lag-$k$ covariance. The sample lag-$k$ covariance is
$$\hat{R}(k) = \frac{1}{n} \sum_{t=1}^{n-k} \left(g(X_t) - \hat{\mu}\right)\left(g(X_{t+k}) - \hat{\mu}\right)\,.$$
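A plain-Python sketch of the sample lag-$k$ covariance, plus the simplest flat-window truncated estimator $\hat{R}(0) + 2\sum_{k=1}^{b}\hat{R}(k)$ (practical spectral variance estimators additionally down-weight large lags with a lag window). The function names are my own, and the input is assumed to already be the sequence $g(X_1), \dots, g(X_n)$:

```python
def sample_lag_cov(x, k):
    """Sample lag-k covariance R_hat(k), with the 1/n normalization."""
    n = len(x)
    mu = sum(x) / n
    return sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / n

def truncated_sv(x, b):
    """Flat-window truncated estimator R_hat(0) + 2 * sum_{k=1}^b R_hat(k)."""
    return sample_lag_cov(x, 0) + 2 * sum(sample_lag_cov(x, k)
                                          for k in range(1, b + 1))
```

On the toy sequence `[1, 2, 3, 4]`, for instance, $\hat{R}(0) = 1.25$ and $\hat{R}(1) = 0.3125$, which is easy to verify by hand.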
Ideally, we would like to estimate $\sigma^2_g$ with $\sum_{k=-\infty}^{\infty} \hat{R}(k)$, but we can only go from $k = -(n-1)$ up to $k = n-1$. So we could potentially use $\sum_{k=-(n-1)}^{n-1} \hat{R}(k)$. But notice that to estimate $\hat{R}(n-1)$ there is only one sample point available; thus larger-order lag covariances are not going to be estimated well. So the summation can't go all the way to $\pm(n-1)$, and instead we will make it go to some truncation point $b$. Further, since the quality of estimat