MTH 707A: Markov chain Monte Carlo

    Instructor: Dootika Vats

    Spring 2020

Contents

1 Course overview and motivation
2 Markov chains basics
3 Stochastic Stability
4 Constructing MCMC algorithms
  4.1 Metropolis-Hastings
    4.1.1 Types of proposal distributions
  4.2 General Accept-Reject MCMC
  4.3 Combining Kernels
  4.4 Component-wise Updates
    4.4.1 Gibbs sampler
    4.4.2 Metropolis-within-Gibbs
  4.5 Linchpin variable sampler
5 Convergence
  5.1 F-irreducibility and aperiodicity
  5.2 Harris recurrence
6 Minorization and coupling
  6.1 Minorization
  6.2 The Split Chain
  6.3 Coupling
7 Rate of Convergence and CLT
  7.1 Uniform Ergodicity
  7.2 Random Scan Gibbs Sampler
  7.3 Geometric ergodicity
    7.3.1 Drift and minorization conditions
8 Estimating the asymptotic variance
  8.1 Spectral variance estimators
  8.2 Batch means estimator
9 Terminating MCMC simulation
  9.1 Effective sample size
10 Miscellaneous topics
  10.1 Multiple chains
  10.2 Thinning
  10.3 Multivariate analysis
11 Errata

1 Course overview and motivation

This is a graduate level (PhD level) course on Markov chain Monte Carlo. Unlike other MCMC courses, this one will focus specifically on the theoretical analysis of MCMC samplers: their rates of convergence and the theoretical tools for their analysis.

MCMC is a computational simulation technique for drawing samples from complicated distributions. Throughout this course, I will denote by f a target density and by F the associated target distribution (measure). We will only discuss probability measures, so I will refer to them as distributions. Formally, let F be a probability measure defined on a measurable space (X, B(X)).

Given a complicated integration problem, MCMC is a potential solution.

Example 1 (Estimating integrals). Suppose we want to integrate

µ = ∫_0^∞ 1 / [(1 + x)^2.3 (log(x + 3))^2] dx .

We can rewrite this as

µ = ∫_0^∞ 1 / [(1 + x)^2.3 (log(x + 3))^2] dx
  = ∫_0^∞ e^x / [(1 + x)^2.3 (log(x + 3))^2] e^{−x} dx
  = E_F[ e^X / ((1 + X)^2.3 (log(X + 3))^2) ] ,

where F is the Exp(1) distribution.
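A minimal sketch of the resulting Monte Carlo estimator in R, assuming iid draws are acceptable here (the sample size n and seed are arbitrary choices):

set.seed(1)
n <- 1e5
x <- rexp(n, rate = 1)                        # X1, ..., Xn iid from F = Exp(1)
g <- exp(x) / ((1 + x)^2.3 * (log(x + 3))^2)  # g(X), so that mu = E_F[g(X)]
mean(g)                                       # strong-law estimate of mu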

Example 2 (Expectations). Suppose for a function h : X → R, we are interested in

E_F[h(X)] = ∫_X h(x) F(dx) .

Example 3 (Quantiles). Let V = h(X) and 0 < q < 1, where X ∼ F. Define the q-quantile of V as

ξ_q = F_V^{−1}(q) = inf{v : F_V(v) ≥ q} .

Example 4 (Densities). Let V = h(X) and X ∼ F. How do we estimate the density of V?

In order to solve all these problems, we would want to obtain a sample from the distribution F. That is, we want

X1, X2, . . . , Xn ∼ F .

    This is a non-trivial problem.

• How do we even draw samples from known distributions? (MTH511 material; see the sketch after this list.)

– Inversion method (Exponential example)
– Accept-reject sampling
– Ratio of uniforms

• How do we draw samples from nice unknown distributions? (MTH511 material)

– Accept-reject sampling
– Ratio of uniforms

• How do we draw samples from bad unknown distributions? What are bad distributions?

– Suppose X = R^1000, so the distribution is defined on a high-dimensional space.
– F is such that f(x) is complicated.
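As a reminder of the first item above, a minimal sketch of the inversion method for the Exponential example: if U ∼ U(0, 1), then X = −log(U)/rate is Exp(rate), since the Exp CDF F(x) = 1 − e^{−rate·x} inverts to F^{−1}(u) = −log(1 − u)/rate and 1 − U has the same distribution as U.

rinv_exp <- function(n, rate = 1) -log(runif(n)) / rate  # inversion method
x <- rinv_exp(1e5, rate = 2)
mean(x)   # should be close to 1/rate = 0.5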

Example 5 (Bayesian one-way random effects model). Suppose we have, for i = 1, . . . , k and j = 1, . . . , mi,

Yij | θ, λe ∼ N(θi, λe^{−1})
θi | µ, λθ ∼ N(µ, λθ^{−1})
µ ∼ N(m0, s0^{−1}) ,  λe ∼ Gamma(a1, b1) ,  λθ ∼ Gamma(a2, b2) .

We are interested in the posterior distribution

q(Θ) = q(θ, µ, λe, λθ | y) ∝ f(y | θ, λe) f(θ | µ, λθ) f(µ) f(λθ) f(λe) .

We are interested in the Bayes estimator E[Θ | y].

So our goal will be: given any proper distribution F, can we draw an (approximate) sample

X1, . . . , Xn ≈ F

so that we can estimate Θ with a sample estimate Θ̂, along with a measure of the uncertainty Θ̂ − Θ?

How does sampling work? Suppose we want to draw samples from a standard normal distribution N(0, 1). If you have taken a simulation class, you know how to do this. Samples will look like: Make plot.

From these samples we know how to use sample statistics to estimate almost any population feature: strong law, CLT, mean consistency, estimating variance, etc.

Now suppose we have no idea how to draw samples from N(0, 1), but given one point in R, we can decide how to obtain the next point so that eventually my samples behave like a N(0, 1) sample. Make plot. Since these are not iid samples, I have many questions:

• If my first point is somehow N(0, 1), are all points N(0, 1)?

• If my first point is not N(0, 1), then what happens to all the later points? Are they eventually N(0, 1)?

    • Does a strong law hold?

    • Does a CLT hold?

    • How do we estimate variance in the CLT?

    • How many samples do we need?

    By the end of the class, you should be able to answer all of these questions.


2 Markov chains basics

Note: Throughout this document, we will mainly consider densities with respect to the Lebesgue measure. For this reason, we will often ignore the measure-theoretic representation of densities. That is, for a distribution F, we will write F(dx) = f(x) dx, as opposed to F(dx) = f(x) µ(dx). This is purely for notational convenience, and to avoid confusion for a non-measure-theoretic audience.

    - - -

Let (X, B(X)) be a measurable space. We will usually work with subsets of R^d and the Borel σ-algebra. We will also almost exclusively be discussing discrete-time, continuous (general) state space Markov chains.

An X-valued sequence of random variables Φ = {X0, X1, X2, . . . } is a time-homogeneous Markov chain if for all A ∈ B(X) and for all n,

Pr(Xn+1 ∈ A | Xn, . . . , X1, X0) = Pr(Xn+1 ∈ A | Xn) .

    There are two properties embedded within this:

• The Markov property: the distribution of Xn+1 given Xn, Xn−1, . . . , X0 is the same as the distribution of Xn+1 given Xn.

    • Stationary transition: the distribution of Xn+1 given Xn is the same for all n.

The key operating object for a discrete-time general state space Markov chain is its Markov transition kernel.

Definition 1 (Markov transition kernel). A Markov transition kernel is a map P : X × B(X) → [0, 1] such that

1. for all A ∈ B(X), P(·, A) is a measurable function on X;
2. for all x ∈ X, P(x, ·) is a probability measure on B(X).

Informally, this is just like a conditional probability. For x ∈ X and A ∈ B(X),

P(x, A) = Pr(X1 ∈ A | X0 = x) .

Definition 2 (Markov transition density). Denote by k : X × X → [0,∞) the Markov transition density, defined via

P(x, dy) = k(x, y) dy .

Remark 1. When X is discrete, B(X) is the set of all subsets of X, and P is a matrix of transition probabilities with elements P(x, y) such that P(x, y) ≥ 0 and ∑_y P(x, y) = 1. We will almost never discuss this case.

Example 6. Let X = (0, 1). Draw U ∼ U(0, 1). If U ≤ 1/2, then Xn+1 ∼ U(0, Xn), and if U > 1/2, then Xn+1 ∼ U(Xn, 1). Then for Xn = x,

P(x, A) = ∫_A [ (1/2)(1/x) I(y ∈ (0, x)) + (1/2)(1/(1 − x)) I(y ∈ (x, 1)) ] dy .
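A minimal sketch simulating this chain in R (chain length and starting value are arbitrary choices):

set.seed(1)
n <- 1e4
x <- numeric(n)
x[1] <- 0.5                         # arbitrary starting value in (0, 1)
for (t in 1:(n - 1)) {
  if (runif(1) <= 1/2) {
    x[t + 1] <- runif(1, 0, x[t])   # U(0, X_t) with probability 1/2
  } else {
    x[t + 1] <- runif(1, x[t], 1)   # U(X_t, 1) with probability 1/2
  }
}
hist(x, breaks = 50)                # empirical marginal distribution of the chain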

Let F be a probability measure (distribution) on B(X). Define

FP(A) = ∫_X F(dx) P(x, A) .

We will use the shorthand notation FP to denote FP(A) for any generic A. If Xn ∼ F and Xn+1 | (Xn = x) ∼ P(x, ·), then FP is the marginal distribution of Xn+1.

Similarly, if X1 | X0 ∼ P(x, ·), then we can define the marginal distribution of X2 | X0 as

P^2(x, A) = Pr(X2 ∈ A | X0 = x) = ∫_X P(x, dy) P(y, A) .

In this way, we get the n-step Markov transition kernel P^n:

P^n(x, A) = Pr(Xn ∈ A | X0 = x) = ∫_X P(x, dx1) P(x1, dx2) · · · P(xn−1, A) .  (1)

(Chapman-Kolmogorov equation) In general, for 0 ≤ m ≤ n ∈ N, P^n = P^{n−m} P^m, and thus

P^n(x, A) = ∫_X P^m(x, dy) P^{n−m}(y, A) .

Given a Markov chain transition kernel P and a starting value X0 = x, we generate the Markov chain Φ = {X0, X1, X2, . . . }.

Definition 3. Let F be the initial distribution of X0. The distribution F is the invariant or stationary distribution if FP = F.

    Intuition: If we start from F , and use P to get another sample, then the next sample

    is also from F .

    Theorem 1. If F is the invariant distribution of P , then for all n ≥ 1, FP n = F .

Proof. F is invariant for P means that FP = F. So,

FP^n(A) = ∫_X F(dx) P^n(x, A) .

Using the Chapman-Kolmogorov equation with m = 1,

FP^n(A) = ∫_X F(dx) ∫_X P(x, dy) P^{n−1}(y, A)
 = ∫_X ( ∫_X F(dx) P(x, dy) ) P^{n−1}(y, A)
 = ∫_X F(dy) P^{n−1}(y, A)
 = · · ·
 = ∫_X F(dy) P(y, A)
 = F(A) .

    Intuition: If we start from F , and use P repeatedly, we will keep getting samples from

    F . This is exactly what we want, so that is great! Now there are still two concerns:

    we can’t really start from F in most problems, and how do we construct P?

    Definition 4. The kernel P is F -symmetric if

    F (dx)P (x, dy) = F (dy)P (y, dx) .

This equation is also often called detailed balance, and the phenomenon is also called F-reversibility.

    Theorem 2. F -symmetry implies F is invariant for P .

Proof. We need to show that FP = F.

FP(A) = ∫_X F(dx) P(x, A)
 = ∫_X ∫_A F(dx) P(x, dy)
 = ∫_A ∫_X F(dy) P(y, dx)   (by F-symmetry)
 = ∫_A F(dy) ∫_X P(y, dx)
 = ∫_A F(dy)
 = F(A) .

    Intuition: Recall that our goal is to construct an F -invariant transition P . So, if we

    construct P such that it is F -symmetric (reversible), then we are good to go.

3 Stochastic Stability

Unfortunately, finding a stationary distribution is not sufficient to get representative samples. In other words, we also want the Markov chain to "explore the space X". We will require some additional structure on Φ.

Example 7. For intuition, a finite state space works better. Let

P = [ 1/2  1/2  0
      1/2  1/2  0
      0    0    1 ] .

For this transition matrix, no matter what the invariant F is, you won't ever get a representative sample, since the chain gets stuck in the third state.

    Let φ be any non-trivial positive measure on B(X ).

Definition 5. A set A ∈ B(X) is F-communicating if, for all B ⊆ A such that F(B) > 0 and B ∈ B(X), and for all x ∈ A, there exists n such that

P^n(x, B) > 0 .

Intuitively, the chain will come back to anywhere in A eventually. This is a weak property, since it does not ask the chain to move out of A.

Figure 1: Pictorial depiction of F-communicating. Here we can go from x to B in 3 steps.

Definition 6. P is F-irreducible if for all x ∈ X and all A ∈ B(X) such that F(A) > 0, there exists n such that

P^n(x, A) > 0 .

Otherwise, P is reducible.

    Figure 2: Pictorial depiction of F -irreducible. Here we can go from x to A in 6 steps.

Example 8 (Broken support). Consider the following density:

f(x) = (1/4) I(0 < x < 2) + (1/4) I(3 < x < 5) ,

and let F be the associated distribution.

    Figure 3: Given x, we cannot jump support in this case.

Now consider the Markov chain that, given Xn, draws uniformly from (Xn − 1, Xn + 1). That is,

P(x, A) = ∫_A (1/2) I(x − 1 < y < x + 1) dy .

(F is not an invariant distribution of P.)

But P is not F-irreducible, since if X0 = x ∈ (0, 2), the set A = (3, 5) can never be reached. The kernel P would be F-irreducible if it made jumps in (x − 1.2, x + 1.2), for example.

Remark 2 (Positivity condition). A trivial way of getting F-irreducibility is the positivity condition: construct P so that P(x, A) > 0 for all x and all A with F(A) > 0! We will use this often.

Example 9 (Limited jumps). Consider the density

f(x) = (1/√(2π)) exp{−x^2/2} .

    Figure 4: Given x, we can explore the whole support.

Let F be the associated distribution. Consider the MTK

P(x, A) = ∫_A (1/2) I(x − 1 < y < x + 1) dy .

(Again, here F is not stationary for P.) It is clear that P is irreducible, since it is possible to go from anywhere to anywhere in some number of steps. But showing this mathematically is hard.

Theorem 3 (Geyer (1998), Theorem 4.1). Suppose

(a) the state space X is a separable metric space;
(b) for every non-empty open set A ∈ B(X), F(A) > 0;
(c) every point has an F-communicating neighborhood.

Then the chain is F-irreducible.

Proof. The proof is complicated, but the intuition can be seen from Example 9. First, X is a separable metric space that is connected (and second countable): there exists a countable collection of open sets U such that every open subset of X is a union of sets from U. Condition (b) for our purpose means that the distribution F has full support X, and (c) means that given a point x, there is a neighborhood A of x that is F-communicating. These three points together allow movement from anywhere to anywhere in finitely many steps.

Definition 7. For d ≥ 2, consider disjoint sets A1, . . . , Ad such that, with N = (A1 ∪ · · · ∪ Ad)^c, F(N) = 0. Further, let the sets satisfy F(Ai) > 0 and P(x, Ai+1) = 1 for all x ∈ Ai, 1 ≤ i ≤ d − 1, and P(x, A1) = 1 for all x ∈ Ad. If such sets do not exist, then Φ is aperiodic. Otherwise, Φ is periodic.

Figure 5: The Markov chain jumps from one set to the other deterministically. The above is a periodic Markov chain.

4 Constructing MCMC algorithms

Before we study more theoretical properties of Markov chains, let's get introduced to some important MCMC algorithms. This will help build intuition and give some relevance to the kinds of questions we want to ask and answer.

    4.1 Metropolis-Hastings

Perhaps the most common MCMC algorithm is the Metropolis-Hastings (MH) algorithm. Let Q be a Markov kernel (conditional distribution) Q(x, ·) with density q(x, y). (This is effectively q(y | x).)

Let Xn = x. Then:

1. Draw Y ∼ Q(x, ·) and, independently, U ∼ Unif(0, 1).
2. If

U < min{ 1, [f(y) q(y, x)] / [f(x) q(x, y)] } ,

then set Xn+1 = y.
3. Else set Xn+1 = x.

Here r(x, y) is the Hastings ratio, where

r(x, y) = [f(y) q(y, x)] / [f(x) q(x, y)] ,

and α(x, y) = min{1, r(x, y)} is called the acceptance probability.

Remark 3. The Metropolis-Hastings algorithm is characterized by α(x, y). There are other acceptance probabilities that yield other algorithms. We will discuss these later.

Note: for a given x, δx(A) = 1 only if x ∈ A. This is the Dirac measure.

Figure 6: Metropolis-Hastings algorithm: some intuition. Here, the proposed point y is in a lower probability region, so there is a good chance of rejection.

Theorem 4. The MH algorithm defines a valid Markov kernel:

P(x, dy) = Q(x, dy) α(x, y) + δx(dy) ∫ [1 − α(x, u)] Q(x, du) .

Proof.

Pr(Xn+1 ∈ dy | Xn = x) = Pr(Xn+1 ∈ dy, U ≤ α(x, Y) | Xn = x)  [I]
 + Pr(Xn+1 ∈ dy, U > α(x, Y) | Xn = x)  [II] .

For the first term,

I = E[ I(Y ∈ dy) E[ I(U ≤ α(x, Y)) | Y ] | Xn = x ]
 = E[ I(Y ∈ dy) α(x, Y) | Xn = x ]
 = α(x, y) Q(x, dy) .

For the second term,

II = E[ I(x ∈ dy) E( I(U > α(x, Y)) | Y ) | Xn = x ]
 = E[ I(x ∈ dy) (1 − α(x, Y)) | Xn = x ]
 = δx(dy) ∫ (1 − α(x, u)) Q(x, du) .

Now that we know the form of the kernel, we would first like to see whether it is F-invariant. We know one tool: F-symmetry.

Theorem 5. The Metropolis-Hastings kernel is F-symmetric.

Proof. To show F-symmetry, we need to show that

F(dx) P(x, dy) = F(dy) P(y, dx) .

This is trivially true if x = y, so it suffices to consider the case x ≠ y, which is the case when the MH algorithm moves.

F(dx) P(x, dy) = f(x) q(x, y) α(x, y) dx dy
 = f(x) q(x, y) min{ 1, [f(y) q(y, x)] / [f(x) q(x, y)] } dx dy
 = min{ f(x) q(x, y), f(y) q(y, x) } dx dy
 = f(y) q(y, x) min{ 1, [f(x) q(x, y)] / [f(y) q(y, x)] } dx dy
 = F(dy) P(y, dx) .

Theorem 6. If q(x, y) > 0 for all x, y ∈ X, then P is F-irreducible.

Proof. Homework. If q(x, y) > 0 for all x, y, then P(x, A) > 0 for all A with F(A) > 0; thus P is F-irreducible.

In cases when this does not hold, Geyer's theorem can often be used to establish irreducibility.

Example 10 (χ^2 target). Consider drawing from a χ^2_k distribution. Of course, this is not a complicated distribution; we already know how to draw from it. But for the sake of demonstration:

f(x) = 1/(2^{k/2} Γ(k/2)) · x^{k/2−1} e^{−x/2} I(x > 0) .

To implement an MH algorithm, we need to choose a proposal distribution. Let Q(x, ·) = N(x, h), where h is the variance. The density is

q(x, y) = (1/√(2πh)) exp{ −(y − x)^2/(2h) } = (1/√(2πh)) exp{ −(x − y)^2/(2h) } = q(y, x) .

Here, the acceptance ratio is

α(x, y) = min{ 1, [f(y)/f(x)] · [q(y, x)/q(x, y)] } = min{ 1, [y^{k/2−1} e^{−y/2} I(y > 0)] / [x^{k/2−1} e^{−x/2}] } .

We now have all the tools to implement the MH algorithm. Note that we don't have to worry about f(x) = 0 in the denominator, as long as we start from within the support of the target distribution: x will only be a new accepted value if f(x) > 0.

    Run R code here.
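A minimal sketch of such an implementation (the degrees of freedom k, proposal variance h, chain length, and starting value are all arbitrary choices):

set.seed(1)
k <- 5          # chi^2 degrees of freedom (target)
h <- 4          # proposal variance
n <- 1e4
x <- numeric(n)
x[1] <- 3       # starting value inside the support
log_f <- function(x) if (x > 0) (k/2 - 1) * log(x) - x/2 else -Inf  # log target, up to constants
for (t in 1:(n - 1)) {
  y <- rnorm(1, mean = x[t], sd = sqrt(h))          # symmetric N(x, h) proposal
  if (log(runif(1)) < log_f(y) - log_f(x[t])) {
    x[t + 1] <- y                                   # accept
  } else {
    x[t + 1] <- x[t]                                # reject: stay at current value
  }
}
mean(x)         # should be near the target mean, k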

    4.1.1 Types of proposal distributions

Independence MH

Here Q(x, ·) = Q(·); that is, the proposal distribution does not depend on the current value. Then q(x, y) = q(y), so the MH ratio is

r(x, y) = [f(y)/f(x)] · [q(x)/q(y)] = w(y)/w(x) ,

where w(y) = f(y)/q(y) is a weight function. Later, we will see that bounded weight functions are important here.

Symmetric MH

Here q(x, y) = q(y, x). Then, just like in the example,

r(x, y) = f(y)/f(x) .

The symmetric MH is the most common proposal, since it makes evaluating the ratio easier. The two most common proposals are N(x, h) and U(x − h, x + h).

Random Walk MH

Suppose Z ∼ G(·), where G does not depend on the current state Xn. Set Y = Z + Xn. Then q(x, y) = g(y − x) and

r(x, y) = [f(y)/f(x)] · [g(x − y)/g(y − x)] .

Often G is Normal or Uniform as well, but we can also use, for example, a t_d proposal distribution.

Langevin MH (MALA)

Here, we use some information about the target distribution in our proposal. We shift the mean of the proposal distribution towards a higher probability area under F:

Q(x, ·) = N( x + (h/2) ∇ log f(x), h ) .

Figure 7: MALA shifts the proposal mean towards an area of higher target mass.

Consider

∇ log f(x) = ∇f(x)/f(x) .

Thus, when f(x) is small, even a modest gradient produces a large displacement of the mean, and if f(x) is large and the gradient is small (like at a local maximum), then there will be a small displacement.
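A minimal sketch of MALA for the standard normal target of Example 9, for which ∇ log f(x) = −x (the step size h and chain length are arbitrary choices):

set.seed(1)
h <- 0.5
n <- 1e4
x <- numeric(n)
log_f    <- function(x) -x^2 / 2           # log target, up to constants
grad_log <- function(x) -x                 # gradient of the log target
for (t in 1:(n - 1)) {
  mu_x <- x[t] + (h / 2) * grad_log(x[t])  # Langevin-shifted proposal mean
  y    <- rnorm(1, mu_x, sqrt(h))
  mu_y <- y + (h / 2) * grad_log(y)
  # log Hastings ratio: the N(mu_x, h) proposal is not symmetric in (x, y)
  log_r <- log_f(y) - log_f(x[t]) +
    dnorm(x[t], mu_y, sqrt(h), log = TRUE) - dnorm(y, mu_x, sqrt(h), log = TRUE)
  x[t + 1] <- if (log(runif(1)) < log_r) y else x[t]
}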

    Some unanswered questions:

    • How do we choose the starting value?

    • How do we choose the size of the proposal, h?

    • How long do we run the Markov chain for?

    4.2 General Accept-Reject MCMC

    Metropolis-Hastings is only one particular type of accept-reject style MCMC algorithm.

    The reason α(x, y) in MH works is because it yields F -reversibility. So the question

    is, are there other such acceptance probabilities? The following is an argument from

    Billera and Diaconis (2001).

Let

r(x, y) = [f(y)/f(x)] · [q(y, x)/q(x, y)] .

Note that the MTK remains the same; only the choice of α will change. For a generic α(x, y), F-reversibility requires (for x ≠ y)

f(x) k(x, y) = f(y) k(y, x)
f(x) α(x, y) q(x, y) = f(y) α(y, x) q(y, x)
α(x, y) = α(y, x) r(x, y) .

We want 0 ≤ α(x, y) ≤ 1, and since α(y, x) is also a probability, we want α(y, x) ≤ 1 as well. But α(y, x) = α(x, y)/r(x, y) ≤ 1, so α(x, y) ≤ r(x, y). Thus,

0 ≤ α(x, y) ≤ min{1, r(x, y)} .  (2)

Thus, if α(x, y) satisfies (2), then setting

α(y, x) = α(x, y)/r(x, y)

yields an F-reversible Markov chain.

In addition to F-reversibility, we also want the acceptance probability to be useful, so it must depend on f only through the ratio f(y)/f(x); otherwise unknown constants will not cancel out. Thus it is natural to consider functions α(x, y) = g(r(x, y)). Then g must satisfy

g(x) = x g(1/x) , 0 ≤ x ≤ 1 .

Another popular acceptance probability (especially among chemists) is Barker's acceptance probability (Barker, 1965):

α_B(x, y) = [f(y) q(y, x)] / [f(x) q(x, y) + f(y) q(y, x)] = r(x, y) / (1 + r(x, y)) .

Barker's acceptance probability is not used often, and we will learn later as to why.

    4.3 Combining Kernels

Suppose P1, . . . , Pd are d transition kernels such that FPi = F for all i. These could be d different proposals, or d different acceptance probabilities.

The composition kernel is

Pc(x, ·) = (P1 · · · Pd)(x, ·) .

Recall that the product notation for kernels refers to the marginal distribution. So

P1P2(x, A) = ∫_X P1(x, dy) P2(y, A) .

Let ri > 0 with ∑_{i=1}^d ri = 1. Then the mixing kernel is

Pm(x, ·) = r1 P1(x, ·) + · · · + rd Pd(x, ·) .

The mixing and composition kernels are both F-invariant. That is, FPc = F and FPm = F. These two kernels are very useful in constructing component-wise updates.

Figure 8: Left: composition kernel. Right: mixing kernel.

    4.4 Component-wise Updates

Example 11 (Multivariate normal). We will first motivate with a multivariate example. Let F = N2(0, I2), so that the target is a bivariate normal. Consider Q(x, ·) = N2(x, h I2) (trivial, but just for demonstration).

In this case, the algorithm is still the same, except the proposal draw is multivariate: Y ∼ N2(x, h I2) is drawn as (y1, y2), and we accept or reject the full vector as a whole. The MH ratio is

α(x, y) = min{ 1, f(y)/f(x) } = min{ 1, [f1(y1) f2(y2)] / [f1(x1) f2(x2)] } .

Naturally, as the dimension increases, h will need to decrease to ensure the same acceptance probabilities. Sometimes it can then be difficult to tune h, so component-wise updates are sometimes preferred (not always though).

Let X = X1 × · · · × Xd, where each Xi ⊆ R^{bi}. If x = (x1, x2, . . . , xd) ∈ X, set x(i) = x \ xi. If f(x) is the joint density associated with F, let f(xi | x(i)) be the full conditional density of xi.

Let pi(x, yi) be an MH MTD with invariant density f(xi | x(i)) and proposal q((xi, x(i)), yi). Then

pi(x, yi) = q((xi, x(i)), yi) α(xi, yi) + δxi(yi) ∫ [1 − α(xi, ui)] q((xi, x(i)), ui) dui .

So pi updates the ith component according to an MH step. Each of the other components is kept at its previous value, so the overall Markov kernel is defined as:

Pi(x, A) = ∫_A pi(yi | x) δx(i)(y(i)) dyi .

Theorem 7. Pi is F-invariant for all i.

Proof. By construction, pi is reversible with respect to the density f(xi | x(i)), so that

pi((xi, x(i)), yi) f(xi | x(i)) = pi((yi, x(i)), xi) f(yi | x(i)) .

So,

pi((xi, x(i)), yi) f(xi, x(i)) = pi((xi, x(i)), yi) f(xi | x(i)) f(x(i))
 = pi((yi, x(i)), xi) f(yi | x(i)) f(x(i))
 = pi((yi, x(i)), xi) f(yi, x(i)) .

So maybe we could use one of the Pi's as our final kernel, since each is F-invariant. However, this won't work: Pi only updates the ith component, so it is naturally reducible. We can instead combine the d kernels:

• Random scan:

PRS(x, A) = ∑_{i=1}^d ri Pi(x, A) , with MTD kRS(x, y) = ∑_{i=1}^d ri pi(yi | xi, x(i)) δx(i)(y(i)) .

We can show that PRS is F-symmetric.

• Deterministic scan:

PDS(x, A) = (P1 · · · Pd)(x, A) ,

with

kDS(x, y) = p1((x1, x(1)), y1) p2((y1, x(1)), y2) p3((y1, y2, x(1,2)), y3) · · · pd((y(d), xd), yd) .

We can show that PDS is not generally F-symmetric.

• Random sequence scan: There are d! orders for the composition. Let p ≤ d! and r = (r1, . . . , rp) with rj > 0 and ∑_{j=1}^p rj = 1. Define

PRQ(x, A) = ∑_{j=1}^p rj Pc,j(x, A) , with MTD kRQ(x, y) = ∑_{j=1}^p rj kc,j(x, y) .

We can show that PRQ is F-symmetric.

4.4.1 Gibbs sampler

An important special case of component-wise updates is the Gibbs sampler. Here the proposal distribution for updating the ith component, qi(x, yi), is the full conditional distribution itself. That is, qi(yi | x) = f(yi | x(i)). Then

α((x(i), xi), yi) = min{ 1, [f(yi | x(i)) f(xi | x(i))] / [f(xi | x(i)) f(yi | x(i))] } = 1 .

This is not surprising, since we are proposing from the target distribution! However, in this case the target distribution for MH is not the real "target distribution" F; rather, it is the full conditional distribution. So if we can sample from the full conditional distribution directly, then the component update happens with probability 1. Then the MTK for the ith component is

Pi(x, A) = ∫_A f(yi | x(i)) δx(i)(y(i)) dyi .

We will use the notation PRSGS for the random scan and PDSGS for the deterministic scan Gibbs sampler.

Example 12 (Multivariate normal distribution). We have already sampled from a multivariate normal using the Metropolis-Hastings algorithm. We will now implement a Gibbs sampler. Let

F = N( (µ1, µ2)^T, [ Σ11  Σ12 ; Σ21  Σ22 ] ) ,

where µ1 ∈ R^{p1} and µ2 ∈ R^{p2}. It is then known that if X = (X1, X2)^T, then

X1 | X2 = x2 ∼ N( µ1 + Σ12 Σ22^{−1} (x2 − µ2), Σ11 − Σ12 Σ22^{−1} Σ21 )
X2 | X1 = x1 ∼ N( µ2 + Σ21 Σ11^{−1} (x1 − µ1), Σ22 − Σ21 Σ11^{−1} Σ12 ) .

Since the full conditional distributions of X1 and X2 are known, the Gibbs sampler can be easily implemented.

Figure 9: Deterministic scan Gibbs sampler for two different Normal covariance structures.

DSGS: The deterministic scan Gibbs sampler can then do the following update:

1. Draw X1,n+1 ∼ X1 | X2,n
2. Draw X2,n+1 ∼ X2 | X1,n+1

The Markov transition density updating from state (x1, x2) to (x′1, x′2) is

kDSGS((x1, x2), (x′1, x′2)) = f(x′1 | x2) f(x′2 | x′1) .

RSGS: The random scan Gibbs sampler updates as follows:

1. Pick index i with probability ri
2. Draw Xi,n+1 ∼ Xi | X−i,n
3. Set X−i,n+1 = X−i,n

The MTD updating (x1, x2) to (x′1, x′2) is

kRSGS((x1, x2), (x′1, x′2)) = r1 f(x′1 | x2) δx2(x′2) + r2 f(x′2 | x1) δx1(x′1) .
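A minimal sketch of the DSGS for a bivariate normal with zero means, unit variances, and correlation ρ, in which case X1 | X2 = x2 ∼ N(ρ x2, 1 − ρ^2) and symmetrically for X2 | X1 (ρ, chain length, and the start at the origin are arbitrary choices):

set.seed(1)
rho <- 0.8
n <- 1e4
x1 <- x2 <- numeric(n)
for (t in 1:(n - 1)) {
  # draws from the two full conditionals of N2(0, [1 rho; rho 1])
  x1[t + 1] <- rnorm(1, rho * x2[t],     sqrt(1 - rho^2))
  x2[t + 1] <- rnorm(1, rho * x1[t + 1], sqrt(1 - rho^2))
}
cor(x1, x2)   # should be close to rho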

Example 13 (Gibbs sampler irreducibility). Consider a joint distribution F(x, y) with joint density

f(x, y) = (1/2) f1(x) g1(y) + (1/2) f2(x) g2(y) .

We will study a Gibbs sampler that targets this distribution. Let us first set up the system (the conditional distributions). See that

f(y) = ∫ f(x, y) dx = ∫ [ (1/2) f1(x) g1(y) + (1/2) f2(x) g2(y) ] dx = (1/2) g1(y) + (1/2) g2(y) .

Similarly,

f(x) = (1/2) f1(x) + (1/2) f2(x) .

So,

f(x | y) = [f1(x) g1(y) + f2(x) g2(y)] / [g1(y) + g2(y)]  and  f(y | x) = [f1(x) g1(y) + f2(x) g2(y)] / [f1(x) + f2(x)] .

A Gibbs sampler moves from a state (x, y) to a state (x′, y′) by updating x and y from the conditional distributions:

1. Update X′ ∼ X | Y
2. Update Y′ ∼ Y | X′.

The Markov transition kernel for this Gibbs sampler is

P((x, y), A) = ∫_A f(x′ | y) f(y′ | x′) dx′ dy′ .

Ok, now that this is set up, let us consider two different target distributions.

Figure 10: Deterministic scan Gibbs sampler for the two targets. Left is irreducible, right is reducible.

A: Suppose the target density is

f(x, y) = (1/2) I(0 < x < 1) I(0 < y < 1) + (1/2) I(0 < x < 1) I(2 < y < 3) .

The conditional distributions are

f(x | y) ∝ I(0 < x < 1) [I(0 < y < 1) + I(2 < y < 3)]
f(y | x) ∝ [I(0 < y < 1) + I(2 < y < 3)] I(0 < x < 1) .

So, if we are at x ∈ (0, 1) and y ∈ (0, 1), then

f(x | y) = I(0 < x < 1) ,  f(y | x) = (1/2) [I(0 < y < 1) + I(2 < y < 3)] .

Here, the sampler is free to jump between the two portions of the support.

B: Suppose the target density is

f(x, y) = (1/2) I(0 < x < 1) I(0 < y < 1) + (1/2) I(2 < x < 3) I(2 < y < 3) .

The conditional distributions are

f(x | y) = [I(0 < x < 1) I(0 < y < 1) + I(2 < x < 3) I(2 < y < 3)] / [I(0 < y < 1) + I(2 < y < 3)]
f(y | x) = [I(0 < x < 1) I(0 < y < 1) + I(2 < x < 3) I(2 < y < 3)] / [I(0 < x < 1) + I(2 < x < 3)] .

For x ∈ (0, 1) and y ∈ (0, 1), both f(x | y) = I(0 < x < 1) and f(y | x) = I(0 < y < 1). So we will never be able to jump to the other part of the support of the distribution. Thus, this Gibbs sampler is not irreducible.
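A minimal sketch simulating the Gibbs sampler for target B in R; started in the lower square, the chain never visits (2, 3) × (2, 3):

set.seed(1)
n <- 1e4
x <- y <- numeric(n)
x[1] <- y[1] <- 0.5                  # start in the lower square
for (t in 1:(n - 1)) {
  # each conditional is uniform on whichever square the other coordinate lies in
  x[t + 1] <- if (y[t] < 1) runif(1, 0, 1) else runif(1, 2, 3)
  y[t + 1] <- if (x[t + 1] < 1) runif(1, 0, 1) else runif(1, 2, 3)
}
max(x)   # stays below 1: the upper square is never reached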

4.4.2 Metropolis-within-Gibbs

(Also known as conditional Metropolis-Hastings.)

When at least one of the full conditionals fi(xi | x(i)) is not available to sample from, we use a general proposal qi as described before.

Example 14 (Bayesian reliability model). For i = 1, . . . , m, let ti denote the observed failure time for a lamp (where data on m lamps are collected). Suppose

Ti | λ, β ∼ Weibull(λ, β) ,

where λ > 0 is the scale parameter and β is the shape parameter. In a Bayesian paradigm, we further assume prior distributions on these:

λ ∼ Gamma(a0, b0) and β ∼ Gamma(a1, b1) .

The resulting posterior distribution is complicated, and its normalizing constant is not known:

f(λ, β | T) ∝ λ^{m+a0−1} β^{m+a1−1} ( ∏_{i=1}^m ti )^{β−1} exp{ −λ ∑_{i=1}^m ti^β } exp{−b1 β} exp{−b0 λ} .

It can also be shown that

λ | β, T ∼ Gamma( m + a0, b0 + ∑_{i=1}^m ti^β ) .

However, β | λ, T does not have a closed-form expression:

f(β | λ, T) ∝ β^{m+a1−1} ( ∏_{i=1}^m ti )^{β−1} exp{ −λ ∑_{i=1}^m ti^β } exp{−b1 β} .

In this case, we can implement the following (deterministic scan) Metropolis-within-Gibbs sampler:

1. Draw λn+1 ∼ λ | βn, T.
2. Propose Y ∼ Q((λn+1, βn), ·) and draw U ∼ U[0, 1].
3. If U ≤ α((λn+1, βn), y), where

α((λn+1, βn), y) = min{ 1, [f(y | λn+1, T) / f(βn | λn+1, T)] · [q((y, λn+1), βn) / q((βn, λn+1), y)] } ,

then set βn+1 = Y.
4. Else βn+1 = βn.

The MTK for this is

k((λ, β), (λ′, β′)) = f(λ′ | β) p((λ′, β), (λ′, β′)) ,

where the second factor is the MH kernel for β. We can also flip the updates so that β is updated first and then λ.
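A minimal sketch of this sampler with a symmetric random walk proposal on β (the data, hyperparameters, proposal scale, and chain length are all hypothetical choices for illustration):

set.seed(1)
t_obs <- c(2.1, 0.8, 3.5, 1.2, 2.7)   # hypothetical failure times
m <- length(t_obs)
a0 <- b0 <- a1 <- b1 <- 1             # hypothetical hyperparameters
n <- 1e4
lambda <- beta <- rep(1, n)
log_fbeta <- function(b, lam) {       # log f(beta | lambda, T), up to constants
  if (b <= 0) return(-Inf)
  (m + a1 - 1) * log(b) + (b - 1) * sum(log(t_obs)) - lam * sum(t_obs^b) - b1 * b
}
for (i in 1:(n - 1)) {
  # Gibbs step: lambda | beta, T ~ Gamma(m + a0, b0 + sum(t_i^beta))
  lambda[i + 1] <- rgamma(1, shape = m + a0, rate = b0 + sum(t_obs^beta[i]))
  # MH step for beta with a symmetric N(beta, 0.5^2) proposal
  y <- rnorm(1, beta[i], 0.5)
  log_a <- log_fbeta(y, lambda[i + 1]) - log_fbeta(beta[i], lambda[i + 1])
  beta[i + 1] <- if (log(runif(1)) < log_a) y else beta[i]
}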

    4.5 Linchpin variable sampler

Suppose one of the full conditionals, fi(xi | x(i)), is available to sample from, but the other(s) are not. This is similar to Metropolis-within-Gibbs. However, instead of running a Markov chain on the joint distribution F, we can run a Markov chain for the marginal distribution of x(i). This situation is easier to see with two variables.

Consider the joint target density

f(x1, x2) = f(x1 | x2) f(x2) .

Suppose f(x1 | x2) is a known and nice enough distribution, in the sense that it is possible to generate iid samples from it, but it is not possible to draw iid samples from f(x2). Then X2 is called the linchpin variable.

Instead of running a component-wise algorithm, consider running a Markov chain with f(x2) as the target density. That is, construct a Markov transition density kL(x2, x′2) such that

F(dx′2) = ∫_X F(dx2) kL(x2, x′2) .  (3)

Having obtained samples of X2, we can "plug them" into the conditional distribution of X1 | X2 to get samples of X1, yielding joint samples. That is, the final Markov transition density is

k((x1, x2), (x′1, x′2)) = f(x′1 | x′2) kL(x2, x′2) .

There are multiple possible advantages of linchpin variable samplers over Metropolis-within-Gibbs. Two of these are:

1. The target distribution is of smaller dimension, since X1 is no longer part of the target distribution of the Markov chain.
2. Sometimes Markov chains are not well behaved due to the interaction between X1 and X2. Since the target distribution is the marginal distribution of X2, any annoyance due to this interaction is avoided.

Example 15 (Bayesian reliability model). Recall the joint posterior distribution

f(λ, β | T) ∝ λ^{m+a0−1} β^{m+a1−1} ( ∏_{i=1}^m ti )^{β−1} exp{ −λ ∑_{i=1}^m ti^β } exp{−b1 β} exp{−b0 λ} .

We know that

λ | β, T ∼ Gamma( m + a0, b0 + ∑_{i=1}^m ti^β ) .

We can write

f(λ, β | T) = f(λ | β, T) f(β | T) .

Homework: find the marginal posterior density f(β | T) up to proportionality.

In this case, we can implement the following algorithm:

1. Propose Y ∼ Q(βn, ·) and draw U ∼ U[0, 1].
2. If U ≤ α(βn, y), where

α(βn, y) = min{ 1, [f(y | T) / f(βn | T)] · [q(y, βn) / q(βn, y)] } ,

then βn+1 = Y.
3. Else βn+1 = βn.
4. Draw λn+1 ∼ λ | βn+1, T.

The MTK for this is

k((λ, β), (λ′, β′)) = f(λ′ | β′) kL(β, β′) ,

where kL is the linchpin kernel for β.

5 Convergence

5.1 F-irreducibility and aperiodicity

Definition 8. Let V1 and V2 be probability measures. Then the total variation distance between V1 and V2 is defined as

‖V1(·) − V2(·)‖ = sup_{A ∈ B(X)} |V1(A) − V2(A)| .

We want to answer the following questions:

• Does ‖P^n(x, ·) − F(·)‖ → 0 as n → ∞, for all x?
• For what n is ‖P^n(x, ·) − F(·)‖ ≤ ε? That is, when can we say we have converged?
• How fast does ‖P^n(x, ·) − F(·)‖ → 0?

Notationally, ν1 f = ∫_X f(x) ν1(dx), and in particular P^n f(x) = ∫_X f(y) P^n(x, dy).

Proposition 1. For measures ν1 and ν2, the following holds:

‖ν1 − ν2‖ = (1/(b − a)) sup_{f : X → [a,b]} |ν1 f − ν2 f| .

Proposition 2. The following properties of TV hold:

(a) ‖P^n(x, ·) − F(·)‖ is non-increasing in n if P is F-invariant.
(b) ‖ν1 P − ν2 P‖ ≤ ‖ν1 − ν2‖ .

In order to get convergence, we will naturally have to require irreducibility and aperiodicity of the Markov chain (since otherwise the Markov chain does not explore the support, or gets stuck in a cycle).

Theorem 8. If FP = F and P is F-irreducible and aperiodic, then for F-a.e. x,

lim_{n→∞} ‖P^n(x, ·) − F(·)‖ = 0 .

Proof. The proof uses coupling methods and will be given later. For now, assume this to be true. Here, "F-a.e. x ∈ X" means that the set of starting values where convergence does not hold has F-measure zero.

    When are the algorithms we have learned so far aperiodic?

Theorem 9. Suppose the Markov chain P is F-irreducible and there exists S such that F(S) > 0 and P(x, {x}) > 0 for all x ∈ S. Then P is aperiodic.

    Proof. By contradiction. This will be homework.

    Thus, most reasonable MH algorithms should then be aperiodic. If there is always a

    positive probability of staying where we are, then certainly we have aperiodicity.

Theorem 10. An F-irreducible Gibbs sampler is aperiodic.

Proof. Consider the RSGS and suppose, by way of contradiction, that the RSGS is periodic with period d. Then there exist A1, . . . , Ad ⊆ X such that P(x, Ai+1) = 1 for all x ∈ Ai, 1 ≤ i ≤ d − 1, and P(x, A1) = 1 for all x ∈ Ad. Also, F(Ai) > 0.

Let the current state be Xn ∈ Ak. The RSGS chooses one component at random to update. Let In be this random choice and let hIn(Xn) be the part of Xn which is not updated. Then Xn and Xn+1 are conditionally independent given hIn(Xn), i.e.

Pr(Xn+1 ∈ Ak | Xn ∈ Ak, hIn(Xn)) = Pr(Xn+1 ∈ Ak | hIn(Xn)) .

Also, by periodicity, Pr(Xn+1 ∈ Ak | Xn ∈ Ak) = 0, but

Pr(Xn+1 ∈ Ak | Xn ∈ Ak) = E[ Pr(Xn+1 ∈ Ak | Xn ∈ Ak, hIn(Xn)) | Xn ∈ Ak ]
 = E[ Pr(Xn+1 ∈ Ak | hIn(Xn)) | Xn ∈ Ak ] .

Therefore, Pr(Xn+1 ∈ Ak | hIn(Xn)) is F-almost surely 0, which in turn implies Pr(Xn+1 ∈ Ak) = 0. This is a contradiction, since marginally Pr(Xn+1 ∈ Ak) should be 1/d. Thus, the RSGS is aperiodic. A similar argument can be made for the DSGS.

So, nearly all standard MCMC samplers will satisfy, for F-a.e. x ∈ X,

‖P^n(x, ·) − F(·)‖ → 0 as n → ∞ .

    But there is still this annoying null set.

Example 16. Let X = [0, 1] and let U denote the U[0, 1] distribution. If x = 1/m for some m ∈ Z+, define the kernel

P(x, ·) = x^2 U + (1 − x^2) δ_{1/(m+1)} .

Otherwise,

P(x, ·) = U .

That is, if x = 1/m, with probability x^2 the kernel draws from a uniform, and with probability 1 − x^2 it sets the next value to be exactly 1/(m + 1). Otherwise, the next update is just a draw from a uniform.

Then for F = U[0, 1], FP = F and P is F-irreducible and aperiodic. But if x0 = 1/m, m ≥ 2, then

Pr( Xn = 1/(m + n) for all n | X0 = x0 ) = ∏_{j=m}^∞ (1 − 1/j^2) > 0 .

Thus, ‖P^n(1/m, ·) − F(·)‖ does not converge to 0. Here the set {1/m : m ≥ 2} is of measure zero under F. Thus, convergence holds F-a.e. x ∈ X.

To get rid of this annoying set, we will require the additional property of Harris recurrence.

    5.2 Harris recurrence

Definition 9. Let A ∈ B(X) and define

τA = inf{n ≥ 1 : Xn ∈ A} .

τA is called the first return time to A. If Xn ∉ A for all n ≥ 1, set τA = ∞.

Definition 10. If FP = F and P is F-irreducible, then P is Harris recurrent if for all A ∈ B(X) with F(A) > 0 and all x ∈ X,

Pr(τA < ∞ | X0 = x) = 1 .

Suppose this fails. Then there exist A with F(A) > 0, some x ∈ X and some N such that

Pr(Xn ∉ A for all n ≥ N | X0 = x) > 0 .

But the Markov chain is time homogeneous. Thus, this implies that there exists y ∈ X such that

Pr(τA = ∞ | X0 = y) > 0 .

Theorem 12 (Roberts and Rosenthal (2006)). Suppose FP = F and P is F-irreducible. Then the following are equivalent:

1. P is Harris recurrent;
2. ∀x ∈ X, A ∈ B(X) with F(A) = 0, Pr(Xn ∈ A ∀n | X0 = x) = 0;
3. ∀x ∈ X, A ∈ B(X) with F(A) = 0, Pr(Xn ∈ A ∀n | X0 = x) < 1.

Proof. 1 ⇒ 2 ⇔ 3 are straightforward. See the proof of 3 ⇒ 1 in Roberts and Rosenthal (2006).

    Theorem 13. Every F -irreducible M-H sampler is Harris recurrent.

Proof. Let s(x) = ∫ q(x, y)[1 − α(x, y)] µ(dy). Then s(x) is the probability of staying at x. We can write the MH kernel as

P(x, A) = Pr(X1 ∈ A | X0 = x)
 = Pr(X1 ∈ A | X0 = x, X1 = X0) Pr(X1 = X0) + Pr(X1 ∈ A | X0 = x, X1 ≠ X0) Pr(X1 ≠ X0)
 = δx(A) s(x) + (1 − s(x)) Pr(X1 ∈ A | X0 = x, X1 ≠ X0)
 := δx(A) s(x) + (1 − s(x)) M(x, A) .

Here M(x, ·) is the kernel conditional on Xn+1 ≠ Xn. We are assuming that M(x, ·) is absolutely continuous with respect to the Lebesgue measure (µ) for all x ∈ X.

Since P is F-irreducible and F({x}) < 1, it follows that s(x) < 1 for all x ∈ X, since otherwise the chain would never move from x and hence could not be irreducible.

Of course, our target distribution is also absolutely continuous, so that

F(A) = ∫_A f(x) µ(dx) .

Suppose for a set A, F(A) = 1. Then F(A^c) = 0, and since f(x) > 0 for x ∈ X, µ(A^c) = 0. Thus, by absolute continuity, M(x, A^c) = 0 ⇒ M(x, A) = 1.

In conclusion, if the current state is x, the chain will eventually move according to M(x, ·), at which point it will necessarily move into A. Thus Pr(τA < ∞ | X0 = x) = 1. The result follows from Theorem 12.

So we know that full MH algorithms that are F-irreducible are aperiodic and Harris recurrent. We next move on to component-wise algorithms.

Lemma 1 (without proof). For A such that F(A) = 0,

Pr(Xn ∈ A | X0 = x) ≤ Pr(Dn | X0 = x) ,

where Dn is the event that by time n, the chain has not yet moved in each coordinate direction.

Theorem 14 (Component-wise algorithms). Suppose P is a component-wise MH MTK. If FP = F, P is F-irreducible, and for all x ∈ X, with probability 1 there will eventually be at least one move in every coordinate direction, then P is Harris recurrent.

Proof. Our conditions imply

lim_{n→∞} Pr(Dn | X0 = x) = 0 .

Suppose F(A) = 0. Since f(x) > 0 for all x ∈ X, it follows that µ(A) = 0. By the lemma,

Pr(Xn ∈ A ∀n | X0 = x) ≤ lim_{n→∞} Pr(Xn ∈ A | X0 = x) ≤ lim_{n→∞} Pr(Dn | X0 = x) = 0 .

Theorem 15 (Ergodicity). If FP = F and P is F-irreducible, aperiodic, and Harris recurrent, then for every initial distribution λ,

lim_{n→∞} ‖λP^n − F‖ = 0 .

Consequently, for all x ∈ X,

lim_{n→∞} ‖P^n(x, ·) − F(·)‖ = 0 .

Moreover, for any two initial distributions λ1 and λ2,

lim_{n→∞} ‖λ1 P^n − λ2 P^n‖ = 0 .

The Markov chain is then said to be ergodic.

Since virtually all MCMC algorithms satisfy the above conditions, theoretical convergence of the Markov chain is guaranteed. Before we can give a proof of the above theorem, we will need to understand coupling, for which we need the minorization condition.

6 Minorization and coupling

    In this section, we will prove Theorem 15. For that we need the concept of “coupling”,

    for which we need the “split chain”, for which we need “minorization”.

    6.1 Minorization

Definition 11. A minorization condition holds if there exist C ⊆ X, a positive integer m, a constant ε > 0 and a probability measure Q on B(X) such that for all x ∈ C and for all A ∈ B(X),

P^m(x, A) ≥ ε Q(A) .

C is said to be small.

Intuitively, this means that all m-step transitions from within C have an ε-overlap; that is, they have an ε component that is common to them. We will generally only care about m = 1. A minorization condition always exists (we will not prove this).

Example 18 (Independence M-H sampler). This example illustrates the minorization condition for the independence M-H sampler (the proposal density does not depend on the current state of the Markov chain). Let q(y) be the proposal density. Then

P(x, A) ≥ ∫_A q(y) min{ 1, [f(y) q(x)] / [f(x) q(y)] } µ(dy)
 = ∫_A f(y) min{ q(y)/f(y), q(x)/f(x) } µ(dy) .

Let ε = inf_x q(x)/f(x) = (sup_x f(x)/q(x))^{−1} > 0. Then

P(x, A) ≥ ε ∫_A f(y) µ(dy) = ε F(A) .

Since x ∈ X was arbitrary, X is small as long as sup_{x∈X} f(x)/q(x) < ∞. Thus, we require a bounded weight function.

Example 19 (Two-variable deterministic scan Gibbs). We will bound the MTD for the DSGS:

k((x, y), (x′, y′)) = f_{X|Y}(x′ | y) f_{Y|X}(y′ | x′) .

Let C be such that h(x′) := inf_{y∈C} f_{X|Y}(x′ | y) > 0. Then for y ∈ C,

k((x, y), (x′, y′)) ≥ h(x′) f_{Y|X}(y′ | x′) = ε q(x′, y′) ,

where ε = ∫_X h(x′) dx′ and q(x′, y′) = h(x′) f_{Y|X}(y′ | x′)/ε is a probability density. Thus X × C is small.

Example 20 (Random scan Gibbs sampler). We will connect the RSGS to the two-variable DSGS by using a minorization condition with m = 2. Using the MTDs, we get

∫∫ k((x, y), (u, v)) k((u, v), (x′, y′)) du dv
 = ∫∫ [ p f_{X|Y}(u | y) δy(v) + (1 − p) f_{Y|X}(v | x) δx(u) ] [ p f_{X|Y}(x′ | v) δv(y′) + (1 − p) f_{Y|X}(y′ | u) δu(x′) ] du dv
 ≥ p(1 − p) ∫∫ f_{X|Y}(u | y) δy(v) f_{Y|X}(y′ | u) δu(x′) du dv
 = p(1 − p) f_{X|Y}(x′ | y) f_{Y|X}(y′ | x′)
 = p(1 − p) kDSGS((x, y), (x′, y′))
 ≥ p(1 − p) ε q(x′, y′) ,

where ε, q and C are as defined in the previous example.

    6.2 The Split Chain

This is a really nice idea that allows us to "forget" the starting value. Suppose the minorization condition holds, i.e., there exist C, ε > 0 and a probability measure Q such that

P(x, A) ≥ ε Q(A)

for all x ∈ C and A ∈ B(X). When x ∈ C,

P(x, A) = ε Q(A) + (1 − ε) · [P(x, A) − ε Q(A)] / (1 − ε)
 = ε Q(A) + (1 − ε) R(x, A) ,

where R(x, A) := [P(x, A) − ε Q(A)] / (1 − ε) is the residual kernel.

This suggests a recipe for simulating the split chain, living on X × {0, 1}, having marginal P(x, ·):

1. If x ∉ C, generate Xn+1 ∼ P(x, ·).
2. If x ∈ C, draw δn ∼ Bernoulli(ε):
 • if δn = 1, then Xn+1 ∼ Q(·);
 • if δn = 0, then Xn+1 ∼ R(x, ·).

Notice that every time Xn+1 ∼ Q, the Markov chain resets.

    6.3 Coupling

The idea here is to simulate two independent chains, using the split chain concept, until both chains can simultaneously be generated from Q. At that time, the two chains have coupled, and each has forgotten its starting value.

That is, we will run a Markov chain Xn and a Markov chain X′n, so that each has stationary distribution F, but at some point they "meet". Once they meet, they become one. Like soulmates.

Let X0 ∼ λ, X′0 ∼ F, n = 0, and let C be a small set. Given (Xn, X′n):

1. If Xn = X′n, then Xn+1 = X′n+1 ∼ P(xn, ·).
2. Otherwise, if (Xn, X′n) ∈ C × C:
 • with probability ε, Xn+1 = X′n+1 ∼ Q;
 • with probability 1 − ε, Xn+1 ∼ R(xn, ·) and X′n+1 ∼ R(x′n, ·), independently.
3. Otherwise, if (Xn, X′n) ∉ C × C, then Xn+1 ∼ P(xn, ·) and X′n+1 ∼ P(x′n, ·), independently.

Let T be the (random) coupling time, so T = inf{n ≥ 1 : Xn = X′n}.

Proposition 3. If P is aperiodic and Harris recurrent, then

1. T < ∞ with probability 1;
2. Pr(T > n) → 0 as n → ∞.

Proof. The above two statements about T are equivalent, more or less, but need to be written explicitly. The proposition follows from the definition of Harris recurrence.

Theorem 16. For all initial distributions λ, ‖λP^n − F‖ ≤ Pr(T > n).

Proof.

|λP^n(A) − F(A)| = |Pr(Xn ∈ A) − Pr(X′n ∈ A)|
 = |Pr(Xn ∈ A, Xn = X′n) + Pr(Xn ∈ A, Xn ≠ X′n) − Pr(X′n ∈ A, Xn = X′n) − Pr(X′n ∈ A, Xn ≠ X′n)|
 = |Pr(Xn ∈ A, Xn ≠ X′n) − Pr(X′n ∈ A, Xn ≠ X′n)|
 ≤ max{ Pr(Xn ∈ A, Xn ≠ X′n), Pr(X′n ∈ A, Xn ≠ X′n) }
 ≤ Pr(Xn ≠ X′n)
 ≤ Pr(T > n) → 0 .

Since Pr(T > n) → 0, this proves that for all initial distributions λ,

‖λP^n − F‖ → 0 as n → ∞ .

7 Rate of Convergence and CLT

We have been studying Markov chains so far, but this is a "Markov chain Monte Carlo" course; now we will learn about the "Monte Carlo" part. That is, we now know how to obtain samples from a target distribution so that eventually they are from the right distribution. But the samples are correlated, so are they useful? Suppose F is the target distribution and, for a function g, µg = E_F[g]. Having obtained X1, X2, . . . , Xn, we can form

µn = (1/n) ∑_{t=1}^n g(Xt) .

In general we want the following in the MCMC context:

1. A representative sample X1, . . . , Xn from F.
2. µn → µg with probability 1 as n → ∞.
3. √n (µn − µg) →d N(0, σ^2) as n → ∞.
4. Can I estimate σ^2?

    So far we have shown that if P is F -invariant, F -irreducible, aperiodic and Harris

    recurrent, then we can do 1. In addition we will see that 2 holds as well.

Theorem 17 (Birkhoff ergodic theorem). Suppose that P is F-invariant, F-irreducible, aperiodic and Harris recurrent. If µg = ∫_X g(x) F(dx) exists, then

µn = (1/n) ∑_{i=1}^n g(Xi) → µg with probability 1, as n → ∞.

Also, let FV be the distribution function of g(X). Consider estimating the quantile ξq = inf{v : FV(v) ≥ q} for 0 < q < 1. Let Yn(j) denote the jth order statistic of the sample. Then for j − 1 < nq < j,

ξn = Yn(j) → ξq with probability 1 .

Proof. Without proof.

The Markov Chain CLT

We know that strong convergence of µn holds under Harris ergodicity of the Markov chain. Thus, those conditions are necessary for a Markov chain CLT to hold. In addition, a CLT would exist if there exists a σ^2_g < ∞ such that, as n → ∞,

√n (µn − µg) →d N(0, σ^2_g) .

Moreover, we will show that

σ^2_g = lim_{n→∞} (1/n) E_F[ ( ∑_{i=1}^n g(Xi) − n µg )^2 ] = Var_F(g(X1)) + 2 ∑_{k=1}^∞ Cov_F(g(X1), g(X1+k)) .

Remark 4. Harris ergodicity and E_F[g(X)^2] < ∞ are not by themselves enough for a CLT, as the following result illustrates.

Theorem 19. Suppose X = R and set A(x) = 1 − r(x). If A(x) is monotonically decreasing for sufficiently large x, and

lim_{x→∞} | f(x)/A′(x) | = ∞ ,

then lim_{n→∞} n E[ (g(X) − µg)^2 r^n(X) ] = ∞.

Proof. See Roberts (1999).

Example 21. Suppose F = Exp(1) and consider an independence M-H sampler with proposal Q = Exp(θ), θ > 2. Then

A(x) = ∫_X q(y) α(x, y) dy = e^{−θx} [θ e^x + (1 − θ)] ,

and

| f(x)/A′(x) | = e^{x(θ−2)} / [θ(θ − 1)(1 − e^{−x})] → ∞ as x → ∞ ,

since θ > 2.

So, when does a CLT hold? In addition to Harris ergodicity and a moment condition, we need the Markov chain to converge rapidly. An intuitive way to understand this is:

√n (µn − µg) →d N(0, σ^2_g) ,

where

σ^2_g = Var_F(g(X1)) + 2 ∑_{k=1}^∞ Cov_F(g(X1), g(X1+k)) = ∑_{k=−∞}^∞ Cov_F(g(X1), g(X1+k)) .

For a CLT, we want σ^2_g < ∞.
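Estimating σ^2_g is the topic of Section 8; as a preview, here is a minimal sketch of the batch means estimator applied to the chain output g(X1), . . . , g(Xn), assuming a batch size of ⌊√n⌋ (a common default, not a prescription from these notes):

bm_var <- function(g_vals) {
  n <- length(g_vals)
  b <- floor(sqrt(n))                 # batch size
  a <- floor(n / b)                   # number of batches
  bms <- sapply(1:a, function(k) mean(g_vals[((k - 1) * b + 1):(k * b)]))
  b * sum((bms - mean(g_vals))^2) / (a - 1)   # estimates sigma^2_g
}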

Convergence Rates

Let M : X → R+ and ψ : N → [0, 1] be such that

‖P^n(x, ·) − F(·)‖ ≤ M(x) ψ(n) for all x, n .

1. Polynomial ergodicity of order k: ψ(n) = n^{−k} for some k > 0.
2. Geometric ergodicity: ψ(n) = t^n for some 0 ≤ t < 1.
3. Uniform ergodicity: ψ(n) = t^n for some 0 ≤ t < 1 and sup_x M(x) < ∞.


Remark 5. Consider {Xn, n ≥ 0} and {g(Xn), n ≥ 0}. Let Wn = g(Xn), n ≥ 0, and M_k^m = σ(Wk, . . . , Wm). Then M_k^m ⊆ F_k^m. If αW and α are the mixing coefficients for {Wn} and {Xn}, then αW(n) ≤ α(n).

Theorem 20. Suppose X0 ∼ F and the Markov chain is Harris ergodic. Then the Markov chain is strongly mixing.

Proof. Since the Markov chain is time homogeneous, we consider arbitrary A, B ∈ B(X), with X0 ∈ B and Xn ∈ A. By the coupling inequality,

∫_B Px(T > n) F(dx) ≥ ∫_B |P^n(x, A) − F(A)| F(dx)
 ≥ | ∫_B (P^n(x, A) − F(A)) F(dx) |
 = |Pr(X0 ∈ B, Xn ∈ A) − F(A) F(B)| .

Therefore, α(n) ≤ E_F[Px(T > n)]. Recall that Px(T > n) → 0 for all x ∈ X as n → ∞, and a dominated convergence argument shows that

E_F[Px(T > n)] → 0 as n → ∞ .

Hence α(n) → 0 as n → ∞.

Theorem 21. Suppose P is Harris ergodic and X0 ∼ F. If

‖P^n(x, ·) − F(·)‖ ≤ M(x) ψ(n) ,

then α(n) ≤ ψ(n) E_F[M(X)]. This would imply that

∑_{n=1}^∞ n^a ψ(n)^b < ∞ ⇒ ∑_{n=1}^∞ n^a α(n)^b < ∞ .

n Var_F(µn) = n Var_F( (1/n) ∑_{t=1}^n g(Xt) )
 = (1/n) ∑_{t=1}^n ∑_{l=1}^n Cov_F(g(Xt), g(Xl))
 = (1/n) ∑_{t=1}^n Var_F(g(Xt)) + (1/n) ∑_{t≠l} Cov_F(g(Xt), g(Xl))
 = (1/n) ∑_{t=1}^n Var_F(g(Xt)) + (2/n) ∑_{t<l} Cov_F(g(Xt), g(Xl))
 = Var_F(g(X1)) + 2 ∑_{k=1}^{n−1} (1 − k/n) Cov_F(g(X1), g(X1+k)) ,

using stationarity (X0 ∼ F). Letting n → ∞ recovers the expression for σ^2_g above.

Then

σ^2 = E_F(Y1)^2 + 2 ∑_{j=2}^∞ E_F(Y1 Yj) < ∞ ,

and if σ^2 > 0, then as n → ∞,

√n µn →d N(0, σ^2) .

    The following theorem is a corollary to the above result. The proof follows directly

    from the above theorem.

Theorem 24. Suppose P is F-Harris ergodic. In addition, let X0 ∼ F, ‖P^n(x, ·) − F(·)‖ ≤ M(x) ψ(n) and E_F[M(X)] < ∞. Let g : X → R and suppose at least one of the following holds:

1. sup_x |g(x)| < ∞ and ∑_{n=1}^∞ ψ(n) < ∞;
2. E_F[|g(X)|^{2+δ}] < ∞ for some δ > 0 and ∑_{n=1}^∞ ψ(n)^{δ/(2+δ)} < ∞.

Then, as n → ∞, √n (µn − µg) →d N(0, σ^2_g). The proof uses the theory of harmonic functions, so it is not provided here.

    Remark 6. The same result holds for the law of large numbers as well.

Now, geometric and polynomial ergodicity give forms of ψ(n) such that ∑_{n=1}^∞ ψ(n) < ∞.

Theorem 27. Suppose P is geometrically ergodic and g : X → R is such that E_F[|g(X)|^{2+δ}] < ∞ for some δ > 0. Then a CLT holds for g.

7.1 Uniform Ergodicity

Theorem 28. Suppose there exist ε > 0 and a probability measure Q such that

P(x, A) ≥ ε Q(A) for all x ∈ X and A ∈ B(X) .

Then P is uniformly ergodic and

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^n .

Proof. By the coupling inequality, we know that

‖P^n(x, ·) − F(·)‖ ≤ Pr(T > n) .

But since the whole space X is small, the chains couple with probability ε at every step; that is, T ∼ Geometric(ε). Thus, Pr(T > n) = (1 − ε)^n.

Remark 8. Notice that the theorem yields a quantitative upper bound, so we can easily see that to be within δ of stationarity in total variation, we need

n ≥ log δ / log(1 − ε) .

Unfortunately, these bounds are often extremely conservative, because ε is tiny.
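A quick sketch of this bound in R (the values of ε and δ are arbitrary illustrative choices):

eps <- 0.01
delta <- 0.001
ceiling(log(delta) / log(1 - eps))  # iterations guaranteeing TV distance <= delta
# returns 688 for these values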

Remark 9. Suppose we can only establish

P^{n0}(x, A) ≥ ε Q(A) for all x .

Then Φ is still uniformly ergodic and

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^{⌊n/n0⌋} .

Example 23. If X is finite, then every F-irreducible, F-invariant, recurrent Markov chain is uniformly ergodic.

Proof. We will construct a minorization condition. Let |X| = t. Since the Markov chain is F-irreducible, for each xi ∈ X there exists ni such that P^{ni}(xi, dy) > 0. Let n be the least common multiple of n1, . . . , nt. Then for all x ∈ X, P^n(x, dy) > 0. We then have the following:

P^n(x, dy) ≥ inf_{x∈X} P^n(x, dy)
 = [ ∑_{i=1}^t inf_{x∈X} P^n(x, dyi) ] · [ inf_{x∈X} P^n(x, dy) / ∑_{i=1}^t inf_{x∈X} P^n(x, dyi) ]
 = ε Q(dy) ,

with ε = ∑_{i=1}^t inf_{x∈X} P^n(x, dyi) and Q(dy) ∝ inf_{x∈X} P^n(x, dy). Thus, X is small, and the theorem applies.

Example 24 (Independence M-H). Recall that if ε = inf_{x∈X} q(x)/f(x) > 0, then X is small. So as long as ε > 0, the independence M-H sampler is uniformly ergodic. Mengersen et al. (1996) show that if ε = 0, then the chain is not even geometrically ergodic (proof not included here).

Example 25 (Exp(1) example). Let F = Exp(1) and Q = Exp(θ). Then

ε = inf_{x∈X} q(x)/f(x) = inf_{x∈X} θ e^{−xθ}/e^{−x} = θ inf_{x∈X} e^{−x(θ−1)} .

So ε = θ if θ < 1. When θ = 1, that is, the proposal is the target, ε = 1, which yields the best bound.

Example 26 (M-H on compact X). Suppose X is compact and let P be an M-H kernel with proposal q which is continuous in both arguments. If f(x) ≤ k on X, then P is uniformly ergodic:

P(x, A) ≥ ∫_A q(x, y) α(x, y) µ(dy)
 = ∫_A q(x, y) min{ 1, [f(y) q(y, x)] / [f(x) q(x, y)] } µ(dy)
 = ∫_A min{ q(x, y), f(y) q(y, x)/f(x) } µ(dy)
 ≥ δ ∫_A min{ 1, f(y)/f(x) } µ(dy) , where δ = inf_{x,y∈X} q(x, y)
 ≥ δ ∫_A min{ 1, f(y)/k } µ(dy)
 = δ ∫_A h(y) µ(dy) .

    Thus, if f and q are both continuous on a compact space, then P is uniformly ergodic.

Example 27 (Two-variable Gibbs sampler). Consider the two-variable Gibbs sampler with invariant density f(x, y) on X × Y. Then k((x, y), (x′, y′)) = f_{X|Y}(x′ | y) f_{Y|X}(y′ | x′). If inf_{y∈Y} f_{X|Y}(x′ | y) = h(x′) > 0 and we let ε = ∫ h(x′) dx′, then:

P((x, y), A) = ∫_A k((x, y), (x′, y′)) dx′ dy′ ≥ ε · (1/ε) ∫_A h(x′) f_{Y|X}(y′ | x′) dx′ dy′ .

But this is rarely useful, unless Y is compact and f_{X|Y}(x | y) is continuous in y.

We will now look at an example of the two-variable Gibbs sampler which is not defined on a compact space, but which will nevertheless turn out to be uniformly ergodic.

Example 28. Suppose for i = 1, . . . , m, with a, c, d > 0 known,

Y_i ind∼ Poisson(λ_i)

λ_i iid∼ Gamma(a, β)

β ∼ Gamma(c, d) .

Then f(λ, β|y) ∝ ( ∏_{i=1}^m λ_i^{a+y_i−1} e^{−(β+1)λ_i} ) β^{ma+c−1} e^{−dβ}. We can show that

λ | β, y ∼ ∏_{i=1}^m Gamma(a + y_i, β + 1)   (independently across i)

β | λ, y ∼ Gamma( ma + c, d + Σ_{i=1}^m λ_i ) .

Consider the Gibbs sampler (β, λ) → (β′, λ) → (β′, λ′), so that

k((β, λ), (β′, λ′)) = f_{β|λ}(β′|λ, y) f_{λ|β}(λ′|β′, y) .

Now

f_{β|λ}(β′|λ, y) = [ (d + Σ_{i=1}^m λ_i)^{ma+c} / Γ(ma + c) ] (β′)^{ma+c−1} e^{−(d + Σ_{i=1}^m λ_i)β′}

≥ [ d^{ma+c} / Γ(ma + c) ] (β′)^{ma+c−1} e^{−(d + Σ_{i=1}^m λ_i)β′} ,

which is not bounded below as a function of λ, since the exponential factor vanishes as Σλ_i → ∞. Thus it would seem like in this particular case we have created for ourselves a chain that is not uniformly ergodic. We will however see that this is not the case.

Two Variable Gibbs: A special algorithm

The example above is a good motivation to illustrate how the two-variable Gibbs sampler is a special Markov chain. Continuing notation from the previous example, the marginal sequence {β_n, n ≥ 1} (or equally the λ sequence) is a Markov chain with kernel

P_β(β, A) = ∫_A [ ∫ f_{β|λ}(β′|λ′) f_{λ|β}(λ′|β) dλ′ ] dβ′ = ∫_A k(β, β′) dβ′ .

We can show that k(β, β′) is a transition density and that the marginal chain is reversible.

k(β, β′) is a density:

∫_{X_β} k(β, β′) dβ′ = ∫_{X_β} ∫_{X_λ} f_{β|λ}(β′|λ′) f_{λ|β}(λ′|β) dλ′ dβ′

= ∫_{X_λ} ∫_{X_β} f_{β|λ}(β′|λ′) f_{λ|β}(λ′|β) dβ′ dλ′

= ∫_{X_λ} f_{λ|β}(λ′|β) [ ∫_{X_β} f_{β|λ}(β′|λ′) dβ′ ] dλ′

= 1 .

Moreover, {β_n} converges in total variation norm at the same rate as the joint Markov chain {β_n, λ_n}. Thus if {β_n} is uniformly ergodic, then so is {β_n, λ_n}. We need to understand de-initializing to see this.

Definition 12. Let {X_n, n ≥ 0} be a Markov chain and {Y_n, n ≥ 0} be a stochastic process. We say {Y_n} is de-initializing for {X_n} if for all n ≥ 1,

L(X_n | X_0, Y_n) = L(X_n | Y_n) ,

where L(· | ·) denotes the conditional distribution (law).

Theorem 29. Suppose µ and µ′ are probability measures. If {Y_n} is de-initializing for {X_n}, then

‖L(X_n | X_0 ∼ µ) − L(X_n | X_0 ∼ µ′)‖ ≤ ‖L(Y_n | X_0 ∼ µ) − L(Y_n | X_0 ∼ µ′)‖ .

Proof. Recall that

‖ν − ν′‖ = sup_S |ν(S) − ν′(S)| = sup_{0 ≤ f ≤ 1} | ∫ f dν − ∫ f dν′ | .

Let S ∈ B(X). Then consider

|P(X_n ∈ S | X_0 ∼ µ) − P(X_n ∈ S | X_0 ∼ µ′)|

= | ∫ P(X_n ∈ S | X_0 = x) µ(dx) − ∫ P(X_n ∈ S | X_0 = x) µ′(dx) |

= | ∫∫ P(X_n ∈ S | X_0 = x, Y_n = y) P(Y_n ∈ dy | X_0 = x) µ(dx) − ∫∫ P(X_n ∈ S | X_0 = x, Y_n = y) P(Y_n ∈ dy | X_0 = x) µ′(dx) | .

By de-initializing, P(X_n ∈ S | X_0 = x, Y_n = y) = P(X_n ∈ S | Y_n = y) =: f_S(y), where 0 ≤ f_S(y) ≤ 1, so the above equals

| ∫∫ f_S(y) P(Y_n ∈ dy | X_0 = x) µ(dx) − ∫∫ f_S(y) P(Y_n ∈ dy | X_0 = x) µ′(dx) | .

So by the alternate (functional) definition of the total variation norm,

|P(X_n ∈ S | X_0 ∼ µ) − P(X_n ∈ S | X_0 ∼ µ′)| ≤ ‖L(Y_n | X_0 ∼ µ) − L(Y_n | X_0 ∼ µ′)‖ ,

which holds for all S ∈ B(X), and thus the claim follows.

Example 29. Consider a two-variable Gibbs sampler (x, y) → (x, y′) → (x′, y′):

Y′ ∼ f_{Y|X}(·|x), and then X′ ∼ f_{X|Y}(·|y′) .

Claim: The rate of convergence of {X_t} is the same as the rate of convergence of {X_t, Y_t}. Proof sketch: given Y_n, the draw X_n ∼ f_{X|Y}(·|Y_n) does not depend on the past, so {Y_n} is de-initializing for the joint chain {X_n, Y_n}; Theorem 29 then bounds the joint distance by the distance for {Y_n}, while marginal distances are always bounded by the joint distance, and a matching argument handles {X_n}.

Thus, to study convergence rates of a two-variable Gibbs sampler, it is sufficient to study the convergence rate of either marginal Markov chain:

P_X(x, A) = ∫_A ∫_Y f_{X|Y}(x′|y′) f_{Y|X}(y′|x) dy′ dx′

P_Y(y, A) = ∫_A ∫_X f_{Y|X}(y′|x′) f_{X|Y}(x′|y) dx′ dy′ .

Example 30 (Two-variable Gibbs continued). Recall that for i = 1, . . . , m, with a, c, d > 0 known,

Y_i ind∼ Poisson(λ_i), λ_i iid∼ Gamma(a, β), β ∼ Gamma(c, d) ,

and

k((β, λ), (β′, λ′)) = f_{β|λ}(β′|λ, y) f_{λ|β}(λ′|β′, y) .

Then the marginal chain on just β has transition density

k(β, β′) = ∫ f_{β|λ}(β′|λ, y) f_{λ|β}(λ|β, y) dλ

= ∫ [ (d + Σλ_i)^{ma+c} / Γ(ma + c) ] (β′)^{ma+c−1} e^{−(d + Σλ_i)β′} ∏_{i=1}^m [ (β + 1)^{a+y_i} / Γ(a + y_i) ] λ_i^{a+y_i−1} e^{−(β+1)λ_i} dλ

≥ [ d^{ma+c} / Γ(ma + c) ] (β′)^{ma+c−1} e^{−dβ′} ∏_{i=1}^m ∫_0^∞ [ (β + 1)^{a+y_i} / Γ(a + y_i) ] λ_i^{a+y_i−1} e^{−(β′+β+1)λ_i} dλ_i

= [ d^{ma+c} / Γ(ma + c) ] (β′)^{ma+c−1} e^{−dβ′} ∏_{i=1}^m ( (β + 1)/(β′ + β + 1) )^{a+y_i}

≥ [ d^{ma+c} / Γ(ma + c) ] (β′)^{ma+c−1} e^{−dβ′} ∏_{i=1}^m ( 1/(β′ + 1) )^{a+y_i} =: h(β′) ,

where the last inequality uses (β + 1)/(β′ + β + 1) ≥ 1/(β′ + 1) for all β > 0.

Setting ε = ∫ h(β′) dβ′, we get

k(β, β′) ≥ ε · h(β′)/ε for all β .

Hence, the marginal Markov chain {β_t} is uniformly ergodic, and thus {β_t, λ_t} is uniformly ergodic.
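For concreteness, here is a minimal Python sketch (ours, with made-up data and hyperparameters) of this two-variable Gibbs sampler, using the full conditionals derived above:

    import numpy as np

    rng = np.random.default_rng(7)

    def gibbs_poisson_gamma(y, a, c, d, n_iter, lam0=None):
        """Gibbs sampler (beta, lambda) -> (beta', lambda) -> (beta', lambda') for the
        hierarchical Poisson model above; numpy's gamma uses scale = 1/rate."""
        m = len(y)
        lam = np.ones(m) if lam0 is None else lam0
        betas, lams = np.empty(n_iter), np.empty((n_iter, m))
        for t in range(n_iter):
            # beta' | lambda, y ~ Gamma(ma + c, d + sum(lambda))
            beta = rng.gamma(m * a + c, 1.0 / (d + lam.sum()))
            # lambda_i' | beta', y ~ Gamma(a + y_i, beta' + 1), independently
            lam = rng.gamma(a + y, 1.0 / (beta + 1.0))
            betas[t], lams[t] = beta, lam
        return betas, lams

    y = np.array([2.0, 1.0, 0.0, 3.0, 1.0])     # made-up counts
    betas, lams = gibbs_poisson_gamma(y, a=1.0, c=1.0, d=1.0, n_iter=5_000)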

Theorem 30 (The First Comparison Theorem). Suppose P and Q are Markov kernels and that there exists δ > 0 such that

P(x, A) ≥ δQ(x, A) for all x ∈ X, A ∈ B(X) .

If P and Q both have invariant distribution F and Q is uniformly ergodic, then so is P.

Proof. Since Q is uniformly ergodic, X is small w.r.t. Q. That is, there exist m ≥ 1, ε > 0 and a probability measure ν such that

Q^m(x, A) ≥ εν(A) for all x ∈ X, A ∈ B(X) .

Due to the comparison condition, iterated m times,

P^m(x, A) ≥ δ^m Q^m(x, A) ≥ δ^m ε ν(A) ,

so X is small w.r.t. P as well.

7.2 Random Scan Gibbs Sampler

Recall the MTD of a two-variable RSGS:

k_RSGS((x, y), (x′, y′)) = r f_{X|Y}(x′|y) δ(y′ − y) + (1 − r) f_{Y|X}(y′|x) δ(x′ − x) ,

with kernel

P_RSGS((x, y), A) = r P_X((x, y), A) + (1 − r) P_Y((x, y), A) ,

where

P_X((x, y), A) = ∫_{{x′ : (x′, y) ∈ A}} f_{X|Y}(x′|y) µ_X(dx′) and P_Y((x, y), A) = ∫_{{y′ : (x, y′) ∈ A}} f_{Y|X}(y′|x) µ_Y(dy′) .


Theorem 31. If P_RSGS is uniformly ergodic for some selection probability r* ∈ (0, 1), then it is uniformly ergodic for all selection probabilities r ∈ (0, 1).

Proof. Writing P_RSGS,r for the kernel with selection probability r,

P_RSGS,r((x, y), A) = r P_X((x, y), A) + (1 − r) P_Y((x, y), A)

= (r/r*) r* P_X((x, y), A) + ((1 − r)/(1 − r*)) (1 − r*) P_Y((x, y), A)

≥ min{ r/r*, (1 − r)/(1 − r*) } P_RSGS,r*((x, y), A) ,

and thus by the first comparison theorem the claim follows, since P_RSGS,r* is uniformly ergodic.

Theorem 32. If P_DUGS is uniformly ergodic, then so is P_RSGS. This is true even outside of the two-variable case.

Proof. We have established before that

P^{2n}_RSGS((x, y), A) ≥ [r(1 − r)]^n P^n_DUGS((x, y), A) ,

and so the result follows from the same argument as in the first comparison theorem.

Two-variable Conditional Metropolis-Hastings

1. Draw Y′ ∼ F_{Y|X}(·|x).

2. Draw V ∼ q((x, y′), ·) and, independently, U ∼ Uniform(0, 1). If

U ≤ [ f_{X|Y}(v|y′) q((v, y′), x) ] / [ f_{X|Y}(x|y′) q((x, y′), v) ] ,

set x′ = v; otherwise set x′ = x.

This simulates a Markov chain having the following MTD:

k_CMH((x, y), (x′, y′)) = f_{Y|X}(y′|x) h((x, y′), x′) ,

where h((x, y′), x′) = q((x, y′), x′) α((x, y′), x′) + δ(x′ − x) r(x, y′), and r(x, y′) is the probability of rejecting the proposal. Similarly, we can use a random scan:

k_RCMH((x, y), (x′, y′)) = r f_{Y|X}(y′|x) δ(x′ − x) + (1 − r) h((x, y), x′) δ(y′ − y) .

Just as we did with the Gibbs sampler, we can write the kernels as

P_CMH = P_{Y|X} P_{MH:X} and P_RCMH = r P_{Y|X} + (1 − r) P_{MH:X} .

Consider P_CMH. Notice that if we choose the proposal density so that

q((v, y), x) ≈ f_{X|Y}(x|y) ,

then the acceptance probability will satisfy α ≈ 1, so the conditional M-H should behave a lot like DUGS. Notice that, because of the update order choice, the marginal sequence {X_n, n ≥ 1} is a Markov chain having MTD

h_X(x, x′) = ∫_Y h((x, y′), x′) f_{Y|X}(y′|x) µ(dy′) ,

and the marginal kernel is F_X-symmetric (reversible with respect to F_X).
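A generic one-step sketch of this update in Python (ours; all function arguments are user-supplied stand-ins for the conditionals and the proposal):

    import numpy as np

    rng = np.random.default_rng(5)

    def cmh_step(x, y, draw_y_given_x, log_f_x_given_y, draw_q, log_q):
        """One conditional M-H update: refresh y from its full conditional, then
        make an M-H move on x targeting f_{X|Y}(. | y').
        draw_q(x, y) proposes v ~ q((x, y), .); log_q(u, y, v) = log q((u, y), v)."""
        y_new = draw_y_given_x(x)
        v = draw_q(x, y_new)
        log_alpha = (log_f_x_given_y(v, y_new) + log_q(v, y_new, x)
                     - log_f_x_given_y(x, y_new) - log_q(x, y_new, v))
        if np.log(rng.uniform()) < log_alpha:
            x = v
        return x, y_new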

In-class teaching stopped here due to COVID-19. What follows are typed-up notes. There are discussion questions that go along with them.

7.3 Geometric ergodicity

Lecture 1:

Recall that P is geometrically ergodic if there exist M : X → (0, ∞) and 0 < t < 1 such that

‖P^n(x, ·) − F(·)‖ ≤ M(x) t^n .

If P is geometrically ergodic and E_F|g|^{2+δ} < ∞ for some δ > 0, then as n → ∞,

√n (µ_n − µ_g) →d N(0, σ_g²) .

One of the most common ways of showing that a CLT exists is to show that the Markov chain is geometrically ergodic. This is often done by establishing a drift and an associated minorization condition.

7.3.1 Drift and minorization conditions

A drift condition holds if there exists a function V : X → [0, ∞) such that for some 0 < λ < 1 and b < ∞,

PV(x) ≤ λV(x) + b for all x ∈ X ,

with the associated set C = {x : V(x) ≤ d} for some d > 2b/(1 − λ).

Recall that a small set is a set C for which there exist ε > 0 and a probability measure Q such that for all x ∈ C, P(x, ·) ≥ εQ(·). Whenever the Markov chain is in the small set, there is a chance to couple, and as soon as the Markov chain couples, it forgets its starting value (and thus starts sampling from F). So Markov chains that couple quickly converge quickly, and quick coupling happens when the Markov chain visits the small set often.

Together, the drift and minorization conditions make that happen.

To see this, consider Figure 11, where a target density π, a drift function V, and a small set C are presented. The drift condition says that, on average (not all the time), the next move of the Markov chain will be to a smaller value of the drift function; that is, it drifts downwards. Since the small set is a level set of V, there is a good chance that the next state of the Markov chain lands in the small set. Thus, if a drift and a minorization condition hold, they guarantee a reasonably fast coupling time, implying a fast, i.e., geometric, rate of convergence.

Figure 11: A target density π, a drift function V, and the small set C (a level set of V).

    Theorem 33. P is geometrically ergodic if a drift condition holds and the set C is

    small.

Example 31. Consider the following model: for i = 1, . . . , m with m ≥ 5,

Y_i iid∼ N(µ, θ) and ν(µ, θ) ∝ 1/√θ .

This seems like a weird prior, but let's just go with it. The posterior distribution is

f(µ, θ|y) ∝ θ^{−(m+1)/2} exp{ −(1/(2θ)) Σ_{j=1}^m (y_j − µ)² } .

Let ȳ = m^{−1} Σ y_j and s² = Σ_j (y_j − ȳ)². The full conditional distributions are

µ | θ, y ∼ N(ȳ, θ/m)

θ | µ, y ∼ IG( (m − 1)/2, (s² + m(ȳ − µ)²)/2 ) .

We will show that the deterministic scan Gibbs sampler is geometrically ergodic. The MTD is

k(µ′, θ′|µ, θ) = f_{θ|µ}(θ′|µ, y) f_{µ|θ}(µ′|θ′, y) .

Let V(θ, µ) = (µ − ȳ)². Then,

PV(θ, µ) = ∫∫ V(θ′, µ′) k(µ′, θ′|µ, θ) dµ′ dθ′

= ∫∫ (µ′ − ȳ)² f_{θ|µ}(θ′|µ, y) f_{µ|θ}(µ′|θ′, y) dµ′ dθ′

= ∫ [ ∫ (µ′ − ȳ)² f_{µ|θ}(µ′|θ′, y) dµ′ ] f_{θ|µ}(θ′|µ, y) dθ′    (inner integral = Var(µ′|θ′, y) = θ′/m)

= (1/m) ∫ θ′ f_{θ|µ}(θ′|µ, y) dθ′ = (1/m) E[θ′|µ, y]

= (1/m) · (s² + m(µ − ȳ)²)/(m − 3)

= (µ − ȳ)²/(m − 3) + s²/(m(m − 3)) .

Thus,

PV(θ, µ) ≤ λV(θ, µ) + b

for any λ with 1/(m − 3) < λ < 1 (possible since m ≥ 5) and b = s²/(m(m − 3)). Thus, a drift condition holds. Next, we need to show that a minorization condition holds.

Consider the set

C = { (θ, µ) : (µ − ȳ)² ≤ d } .

We need to show that there exist ε > 0 and a density q(θ, µ) such that if (θ, µ) ∈ C,

k(µ′, θ′|µ, θ) = f_{θ|µ}(θ′|µ, y) f_{µ|θ}(µ′|θ′, y) ≥ ε q(θ′, µ′) .

First note that

k(µ′, θ′|µ, θ) ≥ f_{µ|θ}(µ′|θ′, y) inf_{µ ∈ C} f_{θ|µ}(θ′|µ, y) .

Let g(θ) = inf_{µ∈C} f(θ | µ, y), and let

θ* = md / [ (m − 1) log(1 + md/s²) ] ;

then, it can be shown that (do this yourself)

g(θ) = IG( (m − 1)/2, (s² + md)/2; θ )  for θ < θ*

g(θ) = IG( (m − 1)/2, s²/2; θ )  for θ ≥ θ* ,

where IG(α, B; θ) denotes the IG(α, B) density evaluated at θ. Setting ε = ∫ g(θ′) dθ′, we get

k(µ′, θ′|µ, θ) ≥ ε f_{µ|θ}(µ′|θ′, y) g(θ′)/ε ,

a minorization with q(θ′, µ′) = f_{µ|θ}(µ′|θ′, y) g(θ′)/ε.
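This Gibbs sampler is easy to implement; a minimal Python sketch (ours, with simulated data) using the two full conditionals:

    import numpy as np

    rng = np.random.default_rng(11)

    def gibbs_normal(y, n_iter, mu0=0.0):
        """Gibbs sampler for Example 31, following the MTD order: draw
        theta' | mu, y from the inverse gamma, then mu' | theta', y."""
        m, ybar = len(y), y.mean()
        s2 = ((y - ybar) ** 2).sum()
        mu, out = mu0, np.empty((n_iter, 2))
        for t in range(n_iter):
            # theta' | mu, y ~ IG((m-1)/2, (s2 + m(ybar - mu)^2)/2), drawn via 1/Gamma
            theta = 1.0 / rng.gamma((m - 1) / 2.0, 2.0 / (s2 + m * (ybar - mu) ** 2))
            # mu' | theta', y ~ N(ybar, theta'/m)
            mu = rng.normal(ybar, np.sqrt(theta / m))
            out[t] = mu, theta
        return out

    y = rng.normal(2.0, 1.0, size=20)   # m = 20 >= 5, as the drift argument requires
    draws = gibbs_normal(y, n_iter=5_000)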

It is often quite tricky to establish drift and minorization; one has to cook up a function V and then demonstrate these two conditions. You can use some intuition for constructing V: since we want the Markov chain to be in the small set often, it is natural to have the small set be an area of high probability under the target. This means that V should take small values in this area, so the Markov chain can "drift" down to it.

The following is a useful result that helps avoid checking the minorization condition:

Theorem 34. If V is unbounded off compact sets, that is, if for any r ∈ R₊ the level set {x : V(x) ≤ r} is compact, then a minorization condition holds for V.

Proof. Since V is unbounded off compact sets, by definition the set C = {x : V(x) ≤ d} is compact. Then

P(x, dy) ≥ inf_{x∈C} P(x, dy) = ε [ ε^{−1} inf_{x∈C} P(x, dy) ] ,

where ε = ∫ inf_{x∈C} P(x, dy) normalizes inf_{x∈C} P(x, dy) into a probability measure.

    End of Lecture 1


Lecture 2

Now that we have somewhat gotten used to the drift and minorization conditions, in this lecture we will focus on proving that a drift and minorization together imply geometric ergodicity. The proof first depends on establishing another, equivalent drift and minorization.

Recall that the previous drift and minorization conditions were: there exist a function V : X → [0, ∞), 0 < λ < 1, b < ∞, and d > 2b/(1 − λ) such that

PV(x) ≤ λV(x) + b and C = {x : V(x) ≤ d} is small.

An alternate drift condition says: there exist a function W : X → [1, ∞), 0 < ρ < 1, and L < ∞ such that

PW(x) ≤ ρW(x) + L I(x ∈ C) .

To go from the first form to the second, set β = (1 − λ)/2 and W(x) = V(x) + 1, and write ∆W(x) = PW(x) − W(x). Since PW(x) = PV(x) + 1 ≤ λV(x) + b + 1 = λW(x) + b + (1 − λ),

∆W(x) ≤ −(1 − λ)W(x) + b + (1 − λ) = −2βW(x) + b + (1 − λ) .

Suppose x ∉ C, and take d a little larger than strictly required, say d ≥ 2b/(1 − λ) + 1 = b/β + 1. Then

V(x) > d ≥ b/β + 1 ⇒ W(x) > b/β + 2 = (b + β)/β + 1 .

Define s(x) = W(x) − (b + β)/β > 1. Then

∆W(x) ≤ −2βW(x) + b + (1 − λ)

= −βW(x) − βW(x) + b + (1 − λ)

= −βW(x) − β[(b + β)/β + s(x)] + b + (1 − λ)

= −βW(x) − b − β − βs(x) + b + (1 − λ)

= −βW(x) − βs(x) + (1 − λ)/2

≤ −βW(x) ,

since βs(x) > β = (1 − λ)/2. If x ∈ C, then

∆W(x) ≤ −2βW(x) + b + (1 − λ) ≤ −βW(x) + b + (1 − λ) .

Combining the two, we get

∆W(x) ≤ −βW(x) + (b + (1 − λ)) I(x ∈ C)

⇒ PW(x) ≤ (1 − β)W(x) + (b + (1 − λ)) I(x ∈ C)

⇒ PW(x) ≤ ρW(x) + L I(x ∈ C) ,

where ρ = (1 + λ)/2 and L = b + (1 − λ). This establishes the second drift and minorization. A similar argument in reverse establishes the equivalence, so we may jump back and forth between these two drift and minorizations.

Now we will establish a "proof" of geometric ergodicity under a drift and minorization. First, recall the coupling inequality for the coupling time T:

‖P^n(x, ·) − F(·)‖ ≤ Pr(T > n) .

Suppose that T has a moment generating function, so that there exists β > 1 with E(β^T) < ∞. Then

Pr(T > n) ≤ β^{−n} E[β^T I(T > n)] ,

and E[β^T I(T > n)] → 0 as n → ∞ by the dominated convergence theorem. Thus ‖P^n(x, ·) − F(·)‖ = o(β^{−n}).

So we can potentially get a geometric rate of convergence in this way. The question is: when does T have a moment generating function? Recall that T is the coupling time, and for (the random variable) T to have thin tails, the chain must be in the small set often. Recall τ_C = inf{n ≥ 1 : X_n ∈ C} is the first hitting time of C. We will show that

PW(x) ≤ ρW(x) + L I(x ∈ C) ⇒ E[β^{τ_C}] < ∞ for 1 < β < 1/ρ ,

with explicit bounds of order d + L/ρ when starting in C. It will then follow that the time to a successful coupling is a geometric sum of random excursion times to C, which implies a moment generating function for T.

Theorem 35. Suppose W : X → [1, ∞), 0 < ρ < 1, L < ∞ and a small set C are such that

PW(x) ≤ ρW(x) + L I(x ∈ C) .

Then for 1 < β < 1/ρ,

E_x[β^{τ_C}] ≤ β(ρW(x) + L) for x ∈ C ,

E_x[β^{τ_C}] ≤ W(x) for x ∉ C .

Proof. Note that for A ∈ B(X) and x ∈ X,

P_x(τ_A = k) = Pr(τ_A = k | X_0 = x) ,

and P_x(τ_A = 1) = P(x, A). Consequently, due to the Markov property, for all k > 1,

P_x(τ_A = k) = ∫_{A^c} P(x, dy) P_y(τ_A = k − 1) = ∫_{A^c} P(x, dy_1) · · · ∫_{A^c} P(y_{k−2}, dy_{k−1}) P(y_{k−1}, A) .

Suppose x ∉ C. Then by the drift condition (PW(x) ≤ ρW(x) off C) and later using W ≥ 1,

W(x) ≥ ρ^{−1} PW(x)

= ρ^{−1} ∫_X W(y) P(x, dy)

= ρ^{−1} ∫_{C^c} W(y) P(x, dy) + ρ^{−1} ∫_C W(y) P(x, dy)

≥ ρ^{−1} ∫_{C^c} W(y) P(x, dy) + ρ^{−1} ∫_C P(x, dy)

= ρ^{−1} ∫_{C^c} W(y) P(x, dy) + ρ^{−1} P_x(τ_C = 1)

≥ ρ^{−1} ∫_{C^c} [ ρ^{−1} ∫_X W(z) P(y, dz) ] P(x, dy) + ρ^{−1} P_x(τ_C = 1) .

Continuing with the same steps,

W(x) ≥ Σ_{k=1}^∞ ρ^{−k} P_x(τ_C = k) = E_x[ρ^{−τ_C}] ≥ E_x[β^{τ_C}] ,

since β < 1/ρ. Now suppose x ∈ C, and condition on the first move x → y. Using the previous result and W ≥ 1,

E_x[β^{τ_C}] = ∫_X E[β^{τ_C} | X_1 = y] P(x, dy)

= ∫_C E[β^{τ_C} | X_1 = y] P(x, dy) + ∫_{C^c} E[β^{τ_C} | X_1 = y] P(x, dy)

= ∫_C β P(x, dy) + ∫_{C^c} E_y[β^{τ_C + 1}] P(x, dy)

= ∫_C β P(x, dy) + ∫_{C^c} β E_y[β^{τ_C}] P(x, dy)

≤ ∫_C β W(y) P(x, dy) + ∫_{C^c} β W(y) P(x, dy)

= β PW(x) ≤ β(ρW(x) + L) .

The rest of the argument gets quite messy, but essentially we have established bounds for E[β^{τ_C}], which provide bounds on E[β^T]. Together, we get that for some constant K and some δ < 1,

‖P^n(x, ·) − F(·)‖ ≤ K W(x) δ^n .

The final forms of K and δ are complicated, and hence are not presented here. Note that throughout we have used fairly loose bounds: every time we introduce an inequality, we weaken the final bound, and to begin with, the coupling inequality itself need not be tight.

    End of Lecture 2


Lecture 3

In this lecture we will see how drift and minorization can give us quantifiable upper bounds on the total variation distance.

Recall that we have at least one instance of a quantifiable upper bound: if X is small with minorization constant ε, then

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^n .

Also recall that the proof of this result was simple: since the whole support is small, the coupling time is a geometric random variable. Additionally, since the upper bound does not depend on the starting value, we conclude that the chain is uniformly ergodic in this case.

What if X is not small, but a subset C is small? Then the above result can be changed a little. Let us first introduce some notation. Recall that a set C is small if there exist ε > 0 and a measure Q such that for all x ∈ C,

P(x, ·) ≥ εQ(·) .

Let {X_n} and {X′_n} be two Markov chains such that X′_0 ∼ F (started at stationarity). Define

t_1 = inf{ m : (X_m, X′_m) ∈ C × C } ,

which is the first time both chains are in the small set. Additionally, define

t_i = inf{ m : m ≥ t_{i−1} + 1, (X_m, X′_m) ∈ C × C } ,

which is the time stamp of the i-th time both chains are in C. For a chain run for n steps, let N_n denote the number of returns to the set C for both chains. That is,

N_n = max{ i : t_i < n } .

Theorem 36. Under the above definitions,

‖P^n(x, ·) − F(·)‖ ≤ (1 − ε)^n + Pr(N_n < n) .

Proof. As usual, the proof is based on the coupling inequality,

‖P^n(x, ·) − F(·)‖ ≤ Pr(T > n) .

Next, note that N_n ≤ n, and N_n = n means that at every step both Markov chains were in C. Thus T > n and N_n = n means that at every step both chains were in C but they failed to couple; the probability of this is at most (1 − ε)^n. Hence

‖P^n(x, ·) − F(·)‖ ≤ Pr(T > n)

= Pr(T > n, N_n = n) + Pr(T > n, N_n < n)

≤ (1 − ε)^n + Pr(N_n < n) .

Through a series of lemmas and interesting proof techniques, Rosenthal (1995) arrives at an upper bound for Pr(N_n < n) that is based on the drift and minorization conditions.

Theorem 37. Suppose a drift condition with V : X → [0, ∞) holds, so that for some 0 < λ < 1 and b < ∞,

PV(x) ≤ λV(x) + b .

Additionally, let C = {x : V(x) ≤ d} be a small set with minorization constant ε, where d > 2b/(1 − λ). Then for any 0 < r < 1,

‖P^n(x₀, ·) − F(·)‖ ≤ (1 − ε)^{rn} + (α^{−(1−r)} A^r)^n ( 1 + b/(1 − λ) + V(x₀) ) ,

where

α^{−1} = (1 + 2b + λd)/(1 + d) < 1 and A = 1 + 2(λd + b) .

The above theorem is quite useful since it allows us to do the following: if we can establish a drift and minorization condition for a Markov chain, we can then bound the distance from the stationary distribution. This upper bound yields a time stamp n* at which the chain is sufficiently close to the stationary distribution. That n* can then be treated as the number of samples that it is reasonable to discard from the beginning of the chain.

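Evaluating the bound is mechanical once the constants are known; here is a small Python sketch (ours, with arbitrary example constants) that searches over r:

    import numpy as np

    def rosenthal_bound(n, eps, lam, b, d, V0, r):
        """Upper bound on ||P^n(x0, .) - F|| from Theorem 37, given the drift
        constants (lam, b), level d > 2b/(1 - lam), minorization eps, V0 = V(x0)."""
        alpha_inv = (1 + 2 * b + lam * d) / (1 + d)    # this is 1/alpha < 1
        A = 1 + 2 * (lam * d + b)
        return ((1 - eps) ** (r * n)
                + (alpha_inv ** (1 - r) * A ** r) ** n * (1 + b / (1 - lam) + V0))

    # The second term only shrinks when alpha^{-(1-r)} A^r < 1, which typically
    # forces r to be very small; minimize the bound over a grid of r:
    rs = np.geomspace(1e-4, 0.99, 300)
    best = min(rosenthal_bound(5_000, eps=0.1, lam=0.9, b=1.0, d=25.0, V0=1.0, r=r)
               for r in rs)
    print(best)   # even with n = 5000 the bound is far from tight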

However, the sad truth is that the upper bound is usually quite loose, and often produces n* values in the millions, so that these are not practically useful. There are some exceptions.

    End of Lecture 3


Lecture 4

At this point, we have essentially completed the theoretical study of Markov chains, and we now proceed to the Monte Carlo aspects of Markov chain Monte Carlo. Particularly, we want to address the following:

• Sufficient burn-in: Given a bad starting value that is far away from an area of high posterior density, we want to be able to assess in how many steps the Markov chain will approximately start sampling from F. In the previous lecture, we learned of a technique to potentially do this rigorously; however, the bounds on the TV distance are often much too loose to be useful. In such instances, we rely mainly on trace plots to assess when the Markov chain has gotten away from the starting value.

What you shouldn't do is blindly throw away some portion of the Markov chain without some reasonable justification. Throwing away samples is a waste of resources, unless you are informed either theoretically or visually about why you're throwing away these samples. Here is a fantastic rant about this: Charlie Geyer's rant on burn-in.

• Existence of CLT: We now have the tools to check whether the central limit theorem holds. We can verify whether the Markov chain is uniformly ergodic, or geometrically ergodic by establishing a drift and minorization. Every Markov chain is different, so a special analysis of every chain must be done. Further, if interest is in estimating the mean of a function g under F, then E_F|g|^{2+δ} < ∞ must also hold for some δ > 0.

We illustrate both points with an AR(1) example. Consider the AR(1) Markov chain X_{t+1} = ρX_t + ε_{t+1}, where the ε_t are iid N(0, σ²). Verify on your own that F = N(0, σ²/(1 − ρ²)) for |ρ| < 1 is the invariant distribution.
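The experiments below are easy to reproduce; here is a minimal Python sketch (ours) of the AR(1) simulation behind the trace plots:

    import numpy as np

    rng = np.random.default_rng(0)

    def ar1(n, rho, sigma=1.0, x0=0.0):
        """Simulate X_{t+1} = rho * X_t + eps_{t+1} with eps ~ N(0, sigma^2)."""
        x = np.empty(n)
        x[0] = x0
        for t in range(1, n):
            x[t] = rho * x[t - 1] + rng.normal(0.0, sigma)
        return x

    chain = ar1(500, rho=0.95, x0=0.0)         # the run shown in Figure 12
    far_chain = ar1(500, rho=0.95, x0=100.0)   # the run shown in Figure 13
    # Sanity check of the invariant variance sigma^2/(1 - rho^2) with a long run:
    print(ar1(200_000, rho=0.95).var(), 1 / (1 - 0.95**2))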

1. Starting value: Let's look at the impact of the starting value. We generate the Markov chain for 500 steps with ρ = 0.95 and starting value x₁ = 0. The stationary distribution is a normal centered at 0, so this is a good starting value. Figure 12 below shows the trace plot for this run; the chain is evidently stable, and there is no reasonable need to discard the starting values.

Figure 12: AR(1) process with starting value 0.

On the other hand, when we start far (far) away from an area of high probability, the chain takes a while to reach a reasonable level of stationarity. We start the Markov chain from 100, and the trace plot in Figure 13 indicates that it takes about 60-70 steps before the Markov chain reaches a point of stability.

Figure 13: AR(1) process with starting value 100.

As explained before, we may want to remove these 60-70 steps from the Markov chain because we have some visual confirmation about the quality of the starting value. Alternatively, if we just run the Markov chain long enough, the effect of a few "bad" samples in the beginning can mostly be ignored.

2. Existence of CLT: For a CLT to hold, we need the Markov chain to be either uniformly, geometrically, or polynomially ergodic, and we need appropriate moment conditions to hold. Consider estimating the mean of g(x) = x^r for r > 1. Since the target distribution has a moment generating function, E_F|g|^{2+δ} < ∞ for any δ > 0. We move on to establishing drift and minorization.

An alternative form of the chain is X_t | X_{t−1} ∼ N(ρX_{t−1}, σ²). The Markov transition kernel is

P(x, A) = ∫_A ( 1/√(2πσ²) ) exp{ −(y − ρx)²/(2σ²) } dy .

We will prove geometric ergodicity by establishing a drift and minorization condition. In trying to find a drift function, notice that the target distribution is normal centered at 0. Recall that ideally we want the drift function to be small in an area of high probability. Considering this, let

V(x) = x² .

We need to show the drift condition is satisfied for this function:

∫ V(y) k(x₀, y) dy = E[V(X_1) | X_0 = x₀]

= E[X_1² | X_0 = x₀]

= Var(X_1 | X_0 = x₀) + (E[X_1 | X_0 = x₀])²

= σ² + ρ²x₀²

≤ |ρ| x₀² + σ² .

Since |ρ| < 1, the drift condition holds with λ = |ρ| and b = σ². Also, since the drift function V is unbounded off compact sets, the minorization condition is satisfied.
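The drift computation can be sanity-checked by simulation; a quick Monte Carlo check (ours) of E[V(X_1) | x₀] = σ² + ρ²x₀²:

    import numpy as np

    rng = np.random.default_rng(2)

    x0, rho, sigma = 5.0, 0.95, 1.0
    x1 = rho * x0 + rng.normal(0.0, sigma, size=200_000)  # draws of X_1 | X_0 = x0
    print((x1**2).mean())                  # Monte Carlo estimate of E[V(X_1) | x0]
    print(sigma**2 + rho**2 * x0**2)       # exact value, 23.5625
    print(abs(rho) * x0**2 + sigma**2)     # drift bound lambda * V(x0) + b, 24.75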

    End of Lecture 4


Lecture 5

8 Estimating the asymptotic variance

Recall that we are interested in estimating µ_g := E_F g = ∫ g(x) F(dx), where X_1, . . . , X_n are n samples from a Markov chain with transition kernel P. The chosen estimator of µ_g is

µ̂ = (1/n) Σ_{t=1}^n g(X_t) →a.s. µ_g .

Additionally, under any of the conditions discussed previously, a Markov chain CLT holds for µ̂: as n → ∞,

√n(µ̂ − µ_g) →d N(0, σ_g²) ,

where

σ_g² = Σ_{k=−∞}^∞ Cov_F( g(X_1), g(X_{1+k}) ) .

In this section we will estimate σ_g² using two methods. To assess the quality of estimation, it is useful to compare the performance of various estimators on a Markov chain where the true value of σ_g² is known.

Example 32 (AR(1) continued). The AR(1) process is especially useful since a closed-form expression for σ_g² is available for g(x) = x. Recall that the stationary distribution of the AR(1) process is N(0, σ²/(1 − ρ²)), where σ² is the variance of the errors ε_t. Moreover, Cov_F(X_1, X_{1+k}) = ρ^k σ²/(1 − ρ²). Thus,

σ_g² = Σ_{k=−∞}^∞ Cov_F( g(X_1), g(X_{1+k}) )

= Var_F(X_1) + 2 Σ_{k=1}^∞ Cov_F( X_1, X_{1+k} )

= σ²/(1 − ρ²) + 2 Σ_{k=1}^∞ ρ^k σ²/(1 − ρ²)

= σ²/(1 − ρ)² .
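A quick numerical illustration (ours) of why σ_g², and not the stationary variance, is the relevant quantity here:

    import numpy as np

    rng = np.random.default_rng(4)
    rho, sigma, n = 0.95, 1.0, 200_000
    x = np.empty(n)
    x[0] = 0.0
    for t in range(1, n):                    # AR(1): X_t = rho X_{t-1} + eps_t
        x[t] = rho * x[t - 1] + rng.normal(0.0, sigma)
    print(x.var())                           # ~ Var_F(X) = sigma^2/(1 - rho^2) = 10.26
    print(sigma**2 / (1 - rho) ** 2)         # sigma_g^2 = 400: about 40 times larger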

Estimating σ_g² poses some additional challenges in MCMC. Note that we will be using estimates of σ_g² to determine when to terminate the simulation. That is, the simulation is terminated at a random time, so in fact, theoretically, n = T(µ̂, σ̂_g²) is random. Glynn and Whitt (1992) essentially show that in order for the simulation to terminate "adequately well", the estimator of σ_g² must be strongly consistent.

8.1 Spectral variance estimators

The asymptotic variance is an infinite sum, and we have n samples, so estimating σ_g² is a difficult problem. A popular, but computationally burdensome, estimator is the spectral variance estimator. Let R(k) = Cov_F(g(X_1), g(X_{1+k})) be the lag-k covariance. The sample lag-k covariance is

R̂(k) = (1/n) Σ_{t=1}^{n−k} ( g(X_t) − µ̂ )( g(X_{t+k}) − µ̂ ) .
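In code (a sketch, ours), the sample lag-k covariance and the plain truncated sum discussed next look like this:

    import numpy as np

    def sample_lag_cov(g, k):
        """R_hat(k) = (1/n) sum_{t=1}^{n-k} (g_t - mu_hat)(g_{t+k} - mu_hat)."""
        n, d = len(g), g - g.mean()
        return np.dot(d[:n - k], d[k:]) / n

    def truncated_sv(g, b):
        """Plain truncated spectral variance estimate: sum over |k| <= b of R_hat(k)."""
        return sample_lag_cov(g, 0) + 2 * sum(sample_lag_cov(g, k) for k in range(1, b + 1))

Applied to the AR(1) chain above with, say, b = ⌊√n⌋, truncated_sv approaches σ²/(1 − ρ)², while the naive lag-0 term alone badly underestimates it.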

Ideally, we would like to estimate σ_g² with Σ_{k=−∞}^∞ R̂(k), but we can only go from k = −(n − 1) up to k = (n − 1). So we could potentially use Σ_{k=−(n−1)}^{n−1} R̂(k). But notice that to estimate R̂(n − 1) there is only one sample point available; larger-order lag covariances are not going to be estimated well. So the summation can't go up to ±(n − 1), and instead we will make it go up to some truncation point b. Further, since the quality of estimat

