
Post on 07-Feb-2018


Gibbs and Metropolis sampling (MCMC methods) and relations of Gibbs to EM

Lecture Outline

1. Gibbs
• the algorithm
• a bivariate example
• an elementary convergence proof for a (discrete) bivariate case
• more than two variables
• a counterexample

2. EM – again
• EM as a maximization/maximization method
• Gibbs as a variation of Generalized EM

3. Generating a Random Variable
• Continuous r.v.s and an exact method based on transforming the cdf
• The “accept/reject” algorithm
• The Metropolis Algorithm


Gibbs Sampling

We have a joint density f(x, y1, …, yk)

and we are interested, say, in some features of the marginal density

f(x) = ∫∫…∫ f(x, y1, …, yk) dy1 dy2 … dyk.

For instance, suppose that we are interested in the average

E[X] = ∫ x f(x) dx. If we can sample X1, X2, … from the marginal distribution, then

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i = E[X]$$

without using f(x) explicitly in integration. Similar reasoning applies to any other characteristic of the statistical model, i.e., of the population.


The Gibbs Algorithm for computing this average.

Assume we can sample from the k+1 univariate conditional densities:

f(X | y1, …, yk)
f(Y1 | x, y2, …, yk)
f(Y2 | x, y1, y3, …, yk)
…
f(Yk | x, y1, y2, …, yk-1).

Choose, arbitrarily, k initial values: $Y_1 = y_1^0$, $Y_2 = y_2^0$, …, $Y_k = y_k^0$.

Create:
$x^1$ by a draw from $f(X \mid y_1^0, \dots, y_k^0)$
$y_1^1$ by a draw from $f(Y_1 \mid x^1, y_2^0, \dots, y_k^0)$
$y_2^1$ by a draw from $f(Y_2 \mid x^1, y_1^1, y_3^0, \dots, y_k^0)$
…
$y_k^1$ by a draw from $f(Y_k \mid x^1, y_1^1, \dots, y_{k-1}^1)$.


This constitutes one Gibbs “pass” through the k+1 conditional distributions,

yielding values: $(x^1, y_1^1, y_2^1, \dots, y_k^1)$.

Iterate the sampling to form the second “pass” $(x^2, y_1^2, y_2^2, \dots, y_k^2)$, and so on.

Theorem: (under general conditions)

The distribution of $x^n$ converges to F(x) as n → ∞.

Thus, we may take the last n X-values of a single sequence, after many (m) initial Gibbs passes:

$$\frac{1}{n}\sum_{i=m+1}^{m+n} x^i \approx E[X]$$

or take just the last value, $x_i^{n_i}$, of each of n-many sequences of Gibbs passes (i = 1, …, n):

$$\frac{1}{n}\sum_{i=1}^{n} x_i^{n_i} \approx E[X]$$

to solve for the average, E[X] = ∫ x f(x) dx.
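The pass structure and the two averaging schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy target (a standard bivariate normal with correlation `RHO`, whose conditionals are the well-known normals noted in the comments) is not from the lecture, and the names `gibbs_pass`, `long_run_average`, and `terminal_average` are hypothetical.

```python
import math
import random

def gibbs_pass(state, conditional_samplers, rng):
    """One Gibbs pass: redraw each coordinate from its conditional given the rest."""
    for j, sampler in enumerate(conditional_samplers):
        state[j] = sampler(state, rng)
    return state

# Toy target (an assumption for illustration, not from the lecture):
# (X, Y) standard bivariate normal with correlation RHO, so that
# X | y ~ N(RHO * y, 1 - RHO**2) and Y | x ~ N(RHO * x, 1 - RHO**2).
RHO = 0.5
SD = math.sqrt(1.0 - RHO ** 2)
samplers = [
    lambda s, rng: rng.gauss(RHO * s[1], SD),  # draw X given y
    lambda s, rng: rng.gauss(RHO * s[0], SD),  # draw Y given x
]

def long_run_average(n, m, rng):
    """Average of the n X-values that follow m initial passes of one chain."""
    state = [0.0, 3.0]  # x is overwritten on the first pass; y0 = 3.0 is arbitrary
    total = 0.0
    for i in range(m + n):
        gibbs_pass(state, samplers, rng)
        if i >= m:
            total += state[0]
    return total / n

def terminal_average(n_chains, passes, rng):
    """Average of the last X-value from n_chains independent short chains."""
    total = 0.0
    for _ in range(n_chains):
        state = [0.0, 3.0]
        for _ in range(passes):
            gibbs_pass(state, samplers, rng)
        total += state[0]
    return total / n_chains

rng = random.Random(0)
est_long = long_run_average(n=5000, m=500, rng=rng)
est_terminal = terminal_average(n_chains=2000, passes=20, rng=rng)
print(est_long, est_terminal)  # both estimate E[X], which is 0 for this toy target
```

The single long chain reuses every pass after the first m, while the many-short-chains scheme gives independent terminal values at the cost of discarding all earlier passes.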


A bivariate example of the Gibbs Sampler.

Example: Let X and Y have similar truncated conditional exponential distributions:

$f(X \mid y) \propto y\,e^{-yx}$ for 0 < X < b

$f(Y \mid x) \propto x\,e^{-xy}$ for 0 < Y < b

where b is a known, positive constant. Though it is not convenient to calculate, the marginal density f(X) is readily simulated by Gibbs sampling from these (truncated) exponentials.

Below is a histogram for X, b = 5.0, using a sample of 500 terminal observations with 15 Gibbs passes per trial, $x_i^{n_i}$ (i = 1, …, 500; n_i = 15) (from Casella and George, 1992).

[Figure: histogram for X, b = 5.0, using a sample of 500 terminal observations $x_i^{n_i}$ with 15 Gibbs passes per trial (i = 1, …, 500; n_i = 15). Taken from (Casella and George, 1992).]
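A minimal sketch of this simulation follows, assuming each truncated exponential is drawn by inverting its cdf on (0, b). The starting value `y0 = 1.0`, the seed, and the helper names are hypothetical choices, and the crude text histogram is only a stand-in for the published figure.

```python
import math
import random

def draw_truncated_exponential(rate, b, rng):
    """Inverse-cdf draw from the density proportional to rate*exp(-rate*t) on (0, b)."""
    u = rng.random()
    # Solves (1 - exp(-rate*t)) / (1 - exp(-rate*b)) = u for t.
    return -math.log1p(u * math.expm1(-rate * b)) / rate

def terminal_x(b, passes, y0, rng):
    """Run `passes` Gibbs passes from y0 and return the final x."""
    y = y0
    for _ in range(passes):
        x = draw_truncated_exponential(y, b, rng)  # X | y
        y = draw_truncated_exponential(x, b, rng)  # Y | x
    return x

rng = random.Random(1992)
B = 5.0
xs = [terminal_x(B, passes=15, y0=1.0, rng=rng) for _ in range(500)]

# Crude text histogram with 10 equal-width bins on (0, b).
counts = [0] * 10
for x in xs:
    counts[min(int(x / B * 10), 9)] += 1
for k, c in enumerate(counts):
    print(f"{k * B / 10:4.1f}-{(k + 1) * B / 10:4.1f} {'#' * (c // 5)}")
```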


Here is an alternative way to compute the marginal f(X) using the same Gibbs Sampler. Recall the law of conditional expectations (assuming E[X] exists):

E[ E[X | Y] ] = E[X]

Thus E[f(x | Y)] = ∫ f(x | y) f(y) dy = f(x).

Now, use the fact that the Gibbs sampler gives us a simulation of the marginal density f(Y) using the penultimate values (for Y) in each Gibbs sequence, above: $y_i^{n_i - 1}$ (i = 1, …, 500; n_i = 15). Calculate $f(x \mid y_i^{n_i - 1})$, which by assumption is feasible. Then note that:

$$f(x) \approx \frac{1}{n}\sum_{i=1}^{n} f\bigl(x \mid y_i^{n_i - 1}\bigr)$$
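The averaging of conditional densities can be sketched as follows for the truncated exponential example. The normalized conditional used here, f(x | y) = y e^{-yx} / (1 − e^{-yb}) on (0, b), follows from the stated proportionality; the starting value, seed, grid size, and function names are assumptions made for illustration.

```python
import math
import random

B = 5.0

def draw_truncated_exponential(rate, b, rng):
    """Inverse-cdf draw from the density proportional to rate*exp(-rate*t) on (0, b)."""
    u = rng.random()
    return -math.log1p(u * math.expm1(-rate * b)) / rate

def conditional_density(x, y, b):
    """Normalized f(x | y) = y*exp(-y*x) / (1 - exp(-y*b)) on (0, b)."""
    return y * math.exp(-y * x) / (-math.expm1(-y * b))

def penultimate_y(b, passes, y0, rng):
    """Return y^(n-1), the Y-value from which the final x of the chain would be drawn."""
    y = y0
    for _ in range(passes - 1):
        x = draw_truncated_exponential(y, b, rng)
        y = draw_truncated_exponential(x, b, rng)
    return y

rng = random.Random(7)
ys = [penultimate_y(B, passes=15, y0=1.0, rng=rng) for _ in range(500)]

def marginal_estimate(x):
    """Average of f(x | y_i) over the simulated penultimate y-values."""
    return sum(conditional_density(x, y, B) for y in ys) / len(ys)

grid = [(k + 0.5) * B / 200 for k in range(200)]  # midpoints of 200 cells
density = [marginal_estimate(x) for x in grid]
area = sum(density) * B / 200  # midpoint-rule integral over (0, b)
print(area)
```

Because every conditional density integrates to one, the estimate is itself a proper density on (0, b), which is one reason it tends to be smoother than a histogram built from the same 500 sequences.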


The solid line graphs the alternative Gibbs Sampler estimate of the marginal f(x) from the same 500 Gibbs sequences, using ∫ f(x | y) f(y) dy = f(x). The dashed line is the exact solution. Taken from (Casella and George, 1992).


An elementary proof of convergence in the case of 2 x 2 Bernoulli data

Let (X, Y) be a bivariate variable; marginally, each is Bernoulli.

          X = 0   X = 1
Y = 0      p1      p2
Y = 1      p3      p4

where pi ≥ 0, ∑ pi = 1, marginally

P(X=0) = p1+p3 and P(X=1) = p2+p4

P(Y=0) = p1+p2 and P(Y=1) = p3+p4.


The conditional probabilities P(X|y) and P(Y|x) are evident:

P(Y | x):
           X = 0          X = 1
Y = 0   p1/(p1+p3)    p2/(p2+p4)
Y = 1   p3/(p1+p3)    p4/(p2+p4)

P(X | y):
           X = 0          X = 1
Y = 0   p1/(p1+p2)    p2/(p1+p2)
Y = 1   p3/(p3+p4)    p4/(p3+p4)


Suppose (for illustration) that we want to generate the marginal distribution of X by the Gibbs Sampler, using the sequence of iterations of draws between the two conditional probabilities P(X|y) and P(Y|x).

That is, we are interested in the sequence <$x^i$ : i = 1, …> created from the starting value $y^0 = 0$ or $y^0 = 1$.

Note that:

$P(X^n = 0 \mid x^i : i = 1, \dots, n-1) = P(X^n = 0 \mid x^{n-1})$   (the Markov property)

$= P(X^n = 0 \mid y^{n-1} = 0)\,P(Y^{n-1} = 0 \mid x^{n-1}) + P(X^n = 0 \mid y^{n-1} = 1)\,P(Y^{n-1} = 1 \mid x^{n-1})$


Thus, we have the four (positive) transition probabilities:

$P(X^n = j \mid x^{n-1} = i) = p_{ij} > 0$, with $\sum_j p_{ij} = 1$ for each i (i, j = 0, 1).

With the transition probabilities positive, it is an (old) ergodic theorem that $P(X^n)$ converges to a (unique) stationary distribution, independent of the starting value ($y^0$).

Next, we confirm the easy fact that the marginal distribution P(X) is that same distinguished stationary point of this Markov process.


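$P(X^n = 0)$

$= P(X^n = 0 \mid x^{n-1} = 0)\,P(X^{n-1} = 0) + P(X^n = 0 \mid x^{n-1} = 1)\,P(X^{n-1} = 1)$

$= P(X^n = 0 \mid y^{n-1} = 0)\,P(Y^{n-1} = 0 \mid x^{n-1} = 0)\,P(X^{n-1} = 0)$

$+\ P(X^n = 0 \mid y^{n-1} = 1)\,P(Y^{n-1} = 1 \mid x^{n-1} = 0)\,P(X^{n-1} = 0)$

$+\ P(X^n = 0 \mid y^{n-1} = 0)\,P(Y^{n-1} = 0 \mid x^{n-1} = 1)\,P(X^{n-1} = 1)$

$+\ P(X^n = 0 \mid y^{n-1} = 1)\,P(Y^{n-1} = 1 \mid x^{n-1} = 1)\,P(X^{n-1} = 1)$

$= E_P[\,E_P[\,X^n = 0 \mid X^{n-1}\,]\,]$

$= E_P[\,X^n = 0\,] = P(X^n = 0)$.

With $X^{n-1}$ distributed as the marginal P(X), the four terms above equal p1·p1/(p1+p2), p3·p3/(p3+p4), p1·p2/(p1+p2) and p3·p4/(p3+p4), which sum to p1 + p3 = P(X = 0). So the marginal is reproduced after one pass.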
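The stationarity claim and the convergence from an arbitrary start can be checked numerically. The sketch below uses hypothetical cell probabilities p1, …, p4 (any positive values summing to one would do) and builds the transition matrix of the X-chain from the two conditional tables.

```python
# Hypothetical cell probabilities p1..p4 (any positive values summing to 1 will do).
p1, p2, p3, p4 = 0.1, 0.4, 0.3, 0.2

# P(Y = y | X = x), indexed as y_given_x[x][y]
y_given_x = [[p1 / (p1 + p3), p3 / (p1 + p3)],
             [p2 / (p2 + p4), p4 / (p2 + p4)]]
# P(X = x | Y = y), indexed as x_given_y[y][x]
x_given_y = [[p1 / (p1 + p2), p2 / (p1 + p2)],
             [p3 / (p3 + p4), p4 / (p3 + p4)]]

# Transition matrix of the X-chain: T[i][j] = P(X^n = j | x^(n-1) = i),
# obtained by summing over the intermediate value of Y.
T = [[sum(y_given_x[i][y] * x_given_y[y][j] for y in (0, 1)) for j in (0, 1)]
     for i in (0, 1)]

def step(dist):
    """Push a distribution over {0, 1} through one Gibbs pass."""
    return [sum(dist[i] * T[i][j] for i in (0, 1)) for j in (0, 1)]

marginal = [p1 + p3, p2 + p4]
after_one_pass = step(marginal)  # stationarity: this reproduces the marginal

dist = [1.0, 0.0]                # start from X = 0 with certainty
for _ in range(50):
    dist = step(dist)            # ergodicity: this approaches the marginal

print(T, after_one_pass, dist)
```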


The Ergodic Theorem:

Definitions:

• A Markov chain, X0, X1, …, satisfies

P(Xn | xi : i = 0, …, n-1) = P(Xn | xn-1).

• The distribution F(x), with density f(x), for a Markov chain is stationary (or invariant) if, for each set A,

∫A f(x) dx = ∫ P(Xn ∈ A | xn-1 = x) f(x) dx.

• The Markov chain is irreducible if each set with positive P-probability is visited at some point (almost surely).

• An irreducible Markov chain is recurrent if, for each set A having positive P-probability, with positive P-probability the chain visits A infinitely often.

• A Markov chain is periodic if for some integer k > 1, there is a partition into k sets {A1, …, Ak} such that P(Xn+1 ∈ Aj+1 | xn ∈ Aj) = 1 for all j = 1, …, k (indices taken mod k, so that Ak+1 = A1). That is, the chain cycles through the partition. Otherwise, the chain is aperiodic.


Theorem: If the Markov chain X0, X1, … is irreducible with an invariant probability distribution F(x), then:

1. the Markov chain is recurrent;

2. F is the unique invariant distribution.

If, in addition, the chain is aperiodic, then for F-almost all x0, both

3. $\lim_{n\to\infty} \sup_A \left| P(X_n \in A \mid X_0 = x_0) - \int_A f(x)\,dx \right| = 0$

and, for any function h with ∫ |h(x)| f(x) dx < ∞,

4. $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h(X_i) = \int h(x) f(x)\,dx \;(= E_F[h(X)])$.

That is, the time average of h(X) equals its state average, a.e. F.


A (now-familiar) puzzle.

Example (continued): Let X and Y have similar conditional exponential distributions:

$f(X \mid y) \propto y\,e^{-yx}$ for 0 < X

$f(Y \mid x) \propto x\,e^{-xy}$ for 0 < Y

To solve for the marginal density f(X), use Gibbs sampling from these exponential distributions. The resulting sequence does not converge!

Question: Why does this happen?

Answer: (Hint: Recall HW #1, problem 2.) Let θ be the statistical parameter for X, with f(X | θ) the exponential model. What “prior” density for θ yields the posterior $f(\theta \mid x) \propto x\,e^{-x\theta}$?

Then, what is the “prior” expectation for X?

Remark: Note that W = Xθ is pivotal. What is its distribution?


More on this puzzle: The conjugate prior for the parameter θ in the exponential distribution is the Gamma Γ(α, β):

$$f(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\theta^{\alpha-1} e^{-\beta\theta} \quad \text{for } \theta, \alpha, \beta > 0.$$

Then the posterior for θ based on x = (x1, …, xn), n iid observations from the exponential distribution, is Gamma Γ(α′, β′), where α′ = α + n and β′ = β + Σ xi.

Let n = 1, and consider the limiting distribution as α, β → 0.

This produces the “posterior” density $f(\theta \mid x) \propto x\,e^{-x\theta}$, which is mimicked in Bayes' theorem by the improper “prior” density f(θ) ∝ 1/θ. But then E_F(θ) does not exist!
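The failure to converge can be seen in a small simulation. Since W = Xθ is pivotal with a standard exponential distribution, each Gibbs pass multiplies x by a ratio of two independent standard exponentials, so log x performs a random walk whose spread keeps growing. The sketch below, with a hypothetical starting value `y0 = 1.0` and an arbitrary seed, compares the spread of log x across independent sequences after 10 and after 200 passes.

```python
import math
import random

def log_x_after(passes, y0, rng):
    """Run Gibbs on the untruncated exponentials and return log of the final x."""
    y = y0
    for _ in range(passes):
        x = rng.expovariate(y)  # X | y is exponential with rate y
        y = rng.expovariate(x)  # Y | x is exponential with rate x
    return math.log(x)

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(3)
var_10 = sample_variance([log_x_after(10, 1.0, rng) for _ in range(300)])
var_200 = sample_variance([log_x_after(200, 1.0, rng) for _ in range(300)])
print(var_10, var_200)  # the spread keeps growing instead of settling down
```

With the truncation to (0, b) restored, the same experiment would show the spread stabilizing after a few passes.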


Part 2) EM – again

• EM as a maximization/maximization method
• Gibbs as a variation of Generalized EM


EM as a maximization/maximization method.

Recall:

L(θ ; x) is the likelihood function for θ with respect to the incomplete data x.

L(θ ; (x, z)) is the likelihood for θ with respect to the complete data (x,z).

And L(θ ; z | x) is a conditional likelihood for θ with respect to z, given x;

which is based on h(z | x, θ): the conditional density for the data z, given (x,θ).

Then as f(X | θ) = f(X, Z | θ) / h(Z | x, θ)

we have log L(θ ; x) = log L(θ ; (x, z)) – log L(θ ; z | x) (*)

As below, we use the EM algorithm to compute the mle

θ̂ = argmaxΘ L(θ ; x)


With θ̂0 an arbitrary choice, define (E-step)

Q(θ | x, θ̂0) = ∫Z [log L(θ ; x, z)] h(z | x, θ̂0) dz

and

H(θ | x, θ̂0) = ∫Z [log L(θ ; z | x)] h(z | x, θ̂0) dz.

Then log L(θ ; x) = Q(θ | x, θ̂0) − H(θ | x, θ̂0), as we have integrated out z from (*) using the conditional density h(z | x, θ̂0).

The EM algorithm is an iteration of

i. the E-step: determine the integral Q(θ | x, θ̂j),
ii. the M-step: define θ̂j+1 as argmaxΘ Q(θ | x, θ̂j).

Continue until there is convergence of the θ̂j.
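As a concrete instance of the E-step/M-step iteration, here is a minimal sketch for a model chosen purely for illustration (it is not from the lecture): a two-component normal mixture with known components N(0, 1) and N(4, 1) and unknown mixing weight θ = π. The augmented data z are the unobserved component labels, and all names and numerical settings are assumptions.

```python
import math
import random

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(pi, xs):
    """Incomplete-data log-likelihood log L(pi; x)."""
    return sum(math.log(pi * normal_pdf(x, 0.0) + (1 - pi) * normal_pdf(x, 4.0))
               for x in xs)

def em_step(pi, xs):
    # E-step: h(z_i = 1 | x_i, pi), the probability that x_i came from N(0, 1).
    resp = [pi * normal_pdf(x, 0.0)
            / (pi * normal_pdf(x, 0.0) + (1 - pi) * normal_pdf(x, 4.0))
            for x in xs]
    # M-step: Q is maximized in closed form by the mean responsibility.
    return sum(resp) / len(resp)

rng = random.Random(11)
TRUE_PI = 0.3  # hypothetical value used only to simulate data
xs = [rng.gauss(0.0, 1.0) if rng.random() < TRUE_PI else rng.gauss(4.0, 1.0)
      for _ in range(500)]

pi_hat = 0.9  # arbitrary starting value
trace = [log_likelihood(pi_hat, xs)]
for _ in range(30):
    pi_hat = em_step(pi_hat, xs)
    trace.append(log_likelihood(pi_hat, xs))
print(pi_hat, trace[0], trace[-1])
```

The recorded trace of log L(θ̂j ; x) should never decrease from one iteration to the next, which is the usual monotonicity property of EM.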


Now, for a Generalized EM algorithm. Let P(Z) be any distribution over the augmented data Z, with density p(z). Define the function F by:

F(θ, P(Z)) = ∫Z [log L(θ; x, z)] p(z) dz − ∫Z [log p(z)] p(z) dz = EP[log L(θ; x, z)] − EP[log p(z)].

When p(Z) = h(Z | x, θ̂0) from above, then F(θ̂0, P(Z)) = log L(θ̂0 ; x).

Claim: For a fixed (arbitrary) value θ = θ̂0, F(θ̂0, P(Z)) is maximized over distributions P(Z) by choosing p(Z) = h(Z | x, θ̂0).

Thus, the EM algorithm is a sequence of M-M steps: the old E-step now is a max over the second argument of F(θ̂0, P(Z)), given the first argument. The second step remains (as in EM) a max over θ for a fixed second argument P(Z); the term EP[log p(z)] does not involve θ.
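One way to see why the claim holds is to rewrite F using the factorization L(θ; x, z) = L(θ; x) h(z | x, θ). The Kullback-Leibler notation KL(· ‖ ·) is introduced here for convenience and is not part of the original slides.

```latex
\begin{aligned}
F(\theta, P) &= \int_Z p(z)\,\log\frac{L(\theta; x, z)}{p(z)}\,dz \\
 &= \int_Z p(z)\,\log\frac{L(\theta; x)\,h(z \mid x, \theta)}{p(z)}\,dz \\
 &= \log L(\theta; x) \;-\; \int_Z p(z)\,\log\frac{p(z)}{h(z \mid x, \theta)}\,dz \\
 &= \log L(\theta; x) \;-\; \mathrm{KL}\bigl(p \,\|\, h(\cdot \mid x, \theta)\bigr).
\end{aligned}
```

Since the KL term is nonnegative and vanishes exactly when p = h(· | x, θ), fixing θ = θ̂0 and maximizing over P gives p(z) = h(z | x, θ̂0), with maximum value log L(θ̂0 ; x).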


Suppose that the augmented data Z are multidimensional. Consider the GEM approach and, instead of maximizing the choice of P(Z) over all of the augmented data (the old E-step), maximize over only one coordinate of Z at a time, alternating with the (old) M-step. This gives us the following link with the Gibbs algorithm: instead of maximizing at each of these two steps, we use the conditional distributions and sample from them!


Part 3) Generating a Random Variable

3.1) Continuous r.v.’s: an Exact Method using transformation of the CDF

• Let Y be a continuous r.v. with cdf FY(•). Then the range of FY(•) is (0, 1), and, as a r.v., U = FY(Y) is distributed Uniform(0, 1). Thus the inverse transformation $F_Y^{-1}(U)$ gives us the desired distribution for Y.

Examples:

• If Y ~ Exponential(λ), with mean λ, then $F_Y^{-1}(U) = -\lambda \ln(1 - U)$ is the desired Exponential.

And from known relationships between the Exponential distribution and other members of the Exponential Family, we may proceed as follows.


Let Uj be iid Uniform(0,1), so that Yj = −λ ln(Uj) are iid Exponential(λ).

• $Z = -2 \sum_{j=1}^{n} \ln(U_j) \sim \chi^2_{2n}$, a chi-squared distribution on 2n degrees of freedom. Note that only even integer degrees of freedom are possible here, alas!

• $Z = -\beta \sum_{j=1}^{a} \ln(U_j) \sim$ Gamma Γ(a, β), with integer values only for a.

• $Z = \dfrac{\sum_{j=1}^{a} \ln(U_j)}{\sum_{j=1}^{a+b} \ln(U_j)} \sim$ Beta(a, b), with integer values only for a and b.
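These transformations translate directly into code. The sketch below uses hypothetical function names and parameter values, and uses 1 − U in place of U inside the logarithm (both are Uniform(0,1)) only to avoid taking the log of zero. It compares sample means with the theoretical means λ, 2n, aβ, and a/(a+b).

```python
import math
import random

def exponential(lam, rng):
    """Inverse-cdf draw with mean lam: F^{-1}(u) = -lam * ln(1 - u)."""
    return -lam * math.log(1.0 - rng.random())

def sum_log_uniform(count, rng):
    """Sum of ln(U_j) over `count` iid Uniform(0, 1) draws (a negative number)."""
    return sum(math.log(1.0 - rng.random()) for _ in range(count))

def chi_squared_even(n, rng):
    """Chi-squared on 2n degrees of freedom."""
    return -2.0 * sum_log_uniform(n, rng)

def gamma_integer_shape(a, beta, rng):
    """Gamma with integer shape a and scale beta."""
    return -beta * sum_log_uniform(a, rng)

def beta_integer(a, b, rng):
    """Beta(a, b) for integer a and b; the first a log-uniforms are shared."""
    numerator = sum_log_uniform(a, rng)
    return numerator / (numerator + sum_log_uniform(b, rng))

rng = random.Random(5)
N = 20000
mean_exp = sum(exponential(2.0, rng) for _ in range(N)) / N             # theory: 2
mean_chi = sum(chi_squared_even(3, rng) for _ in range(N)) / N          # theory: 6
mean_gam = sum(gamma_integer_shape(3, 0.5, rng) for _ in range(N)) / N  # theory: 1.5
mean_bet = sum(beta_integer(2, 6, rng) for _ in range(N)) / N           # theory: 0.25
print(mean_exp, mean_chi, mean_gam, mean_bet)
```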


3.2) The “Accept/Reject” algorithm for approximations using pdf’s.

Suppose we want to generate Y ~ Beta(a, b), for non-integer values of a and b, say a = 2.7 and b = 6.3.

Let (U, V) be independent Uniform(0, 1) random variables. Let c ≥ maxy fY(y). Now calculate P(Y < y) as follows:

$$P\bigl(V < y,\ U < \tfrac{1}{c} f_Y(V)\bigr) = \int_0^y \int_0^{f_Y(v)/c} du\,dv = \frac{1}{c}\int_0^y f_Y(v)\,dv = \frac{1}{c}\,P(Y < y).$$

So:
(i) generate independent (U, V) Uniform(0, 1);
(ii) if U < (1/c) fY(V), set Y = V; otherwise, return to step (i).

Note: The expected waiting time (the number of (U, V) pairs generated) for one value of Y with this algorithm is c, so we want c small. Thus, choose c = maxy fY(y). But we waste generated values of U, V whenever U > (1/c) fY(V), so we want to choose a better approximating distribution for V than the uniform.
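Steps (i) and (ii) can be sketched for the Beta(2.7, 6.3) target as follows. The constant c is taken as the density at its mode (a − 1)/(a + b − 2), a standard fact for a, b > 1; the seed, sample size, and names are assumptions for illustration.

```python
import math
import random

A, B = 2.7, 6.3
NORM = math.gamma(A + B) / (math.gamma(A) * math.gamma(B))

def beta_pdf(y):
    return NORM * y ** (A - 1) * (1 - y) ** (B - 1)

MODE = (A - 1) / (A + B - 2)  # maximizer of the Beta(a, b) density for a, b > 1
C = beta_pdf(MODE)            # c = max_y f_Y(y)

def draw_beta(rng):
    """Return (y, number of (U, V) pairs used to obtain it)."""
    trials = 0
    while True:
        trials += 1
        u, v = rng.random(), rng.random()  # step (i)
        if u < beta_pdf(v) / C:            # step (ii)
            return v, trials

rng = random.Random(27)
draws = [draw_beta(rng) for _ in range(5000)]
mean_y = sum(y for y, _ in draws) / len(draws)
mean_trials = sum(t for _, t in draws) / len(draws)
print(C, mean_y, mean_trials)
```

The average number of trials per accepted value should be close to c, and the sample mean close to a/(a + b) = 0.3.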


Let Y ~ fY(y) and V ~ fV(v).

• Assume that these two have common support, i.e., the smallest closed sets of measure one are the same.

• Also, assume that M = supy [fY(y) / fV(y)] exists, i.e., M < ∞.

Then generate the r.v. Y ~ fY(y) using

U ~ Uniform(0,1) and V ~ fV(v), with (U, V) independent, as follows:

(i) Generate values (u, v).

(ii) If u < (1/M) fY(v) / fV(v) then set y = v.

If not, return to step (i) and redraw (u,v).


Proof of correctness for the accept/reject algorithm:

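The generated r.v. Y has a cdf

$$\begin{aligned}
P(Y \le y) &= P(V \le y \mid \text{stop}) \\
&= P\Bigl(V \le y \Bigm| U < \tfrac{1}{M}\,\tfrac{f_Y(V)}{f_V(V)}\Bigr) \\
&= \frac{P\bigl(V \le y,\ U < \tfrac{1}{M}\, f_Y(V)/f_V(V)\bigr)}{P\bigl(U < \tfrac{1}{M}\, f_Y(V)/f_V(V)\bigr)} \\
&= \frac{\int_{-\infty}^{y} \int_0^{(1/M) f_Y(v)/f_V(v)} du\; f_V(v)\,dv}{\int_{-\infty}^{\infty} \int_0^{(1/M) f_Y(v)/f_V(v)} du\; f_V(v)\,dv} \\
&= \int_{-\infty}^{y} f_Y(v)\,dv.
\end{aligned}$$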

Example: Generate Y ~ Beta(2.7,6.3).

Let V ~ Beta(2,6). Then M = 1.67 and we may proceed with the algorithm.
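A minimal sketch of this example follows. It assumes the Beta(2, 6) proposal is itself generated from logs of uniforms (possible because its parameters are integers), and it evaluates M at y = 0.7, where the ratio fY/fV, proportional to y^0.7 (1 − y)^0.3, attains its maximum. Function names, seed, and sample size are hypothetical.

```python
import math
import random

def beta_pdf(y, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * y ** (a - 1) * (1 - y) ** (b - 1)

def draw_beta_2_6(rng):
    """Beta(2, 6) from logs of uniforms (integer parameters only)."""
    top = sum(math.log(1.0 - rng.random()) for _ in range(2))
    rest = sum(math.log(1.0 - rng.random()) for _ in range(6))
    return top / (top + rest)

# The ratio f_Y / f_V is proportional to y**0.7 * (1 - y)**0.3, maximized at y = 0.7.
M = beta_pdf(0.7, 2.7, 6.3) / beta_pdf(0.7, 2, 6)

def draw_target(rng):
    """Accept/reject draw from Beta(2.7, 6.3) using Beta(2, 6) proposals."""
    while True:
        u, v = rng.random(), draw_beta_2_6(rng)                  # step (i)
        if u < beta_pdf(v, 2.7, 6.3) / (M * beta_pdf(v, 2, 6)):  # step (ii)
            return v

rng = random.Random(63)
ys = [draw_target(rng) for _ in range(5000)]
mean_y = sum(ys) / len(ys)
print(M, mean_y)
```

With M well below the constant c needed for a uniform proposal, fewer proposals are wasted per accepted value.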


3.3) Metropolis algorithm for “heavy-tailed” target densities. As before, let Y ~ fY(y), V ~ fV(v), U ~ Uniform(0,1), with (U,V) independent.

Assume only that Y and V have a common support.

Metropolis Algorithm:

Step 0: Generate $v_0$ and set $z_0 = v_0$. For i = 1, 2, …,

Step i: Generate $(u_i, v_i)$.

Define

$$\rho_i = \min\left\{ \frac{f_Y(v_i)}{f_V(v_i)} \times \frac{f_V(z_{i-1})}{f_Y(z_{i-1})},\ 1 \right\}$$

Let $z_i = v_i$ if $u_i < \rho_i$, and $z_i = z_{i-1}$ if $u_i \ge \rho_i$.

Then, as i → ∞, the r.v. $Z_i$ converges in distribution to the random variable Y.
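The steps above can be sketched as a short function. For continuity with the earlier examples, the illustration reuses the Beta(2.7, 6.3) target with a Uniform(0, 1) proposal rather than a heavy-tailed target; that choice, the burn-in of 1000 steps, and all names are assumptions. Because ρ involves only ratios, the target density is needed only up to its normalizing constant.

```python
import random

def metropolis(target_pdf, proposal_pdf, proposal_draw, steps, rng):
    """Chain z_0, z_1, ... built from independent proposals v_i ~ f_V."""
    z = proposal_draw(rng)                       # Step 0: z_0 = v_0
    chain = [z]
    for _ in range(steps):
        u, v = rng.random(), proposal_draw(rng)  # Step i: generate (u_i, v_i)
        rho = min(target_pdf(v) / proposal_pdf(v)
                  * proposal_pdf(z) / target_pdf(z), 1.0)
        if u < rho:
            z = v                                # accept the proposal
        chain.append(z)                          # otherwise z_i = z_{i-1}
    return chain

def target(y):
    """Beta(2.7, 6.3) density up to its normalizing constant."""
    return y ** 1.7 * (1.0 - y) ** 5.3

rng = random.Random(94)
chain = metropolis(target, lambda v: 1.0, lambda r: r.random(), 20000, rng)
kept = chain[1000:]  # discard an initial stretch of the chain
mean_z = sum(kept) / len(kept)
print(mean_z)
```

Unlike accept/reject, no bound M on fY/fV is required; a rejected proposal simply repeats the previous value, so successive z-values are dependent.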


Additional References

Casella, G. and George, E. (1992) “Explaining the Gibbs Sampler,” Amer. Statistician 46, 167-174.

Flury, B. and Zoppe, A. (2000) “Exercises in EM,” Amer. Statistician 54, 207-209.

Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning. New York: Springer-Verlag, sections 8.5-8.6.

Tierney, L. (1994) “Markov chains for exploring posterior distributions” (with discussion) Annals of Statistics 22, 1701-1762.