Page 1:

Markov Chain Monte Carlo Methods

Sargur N. Srihari, [email protected]

Page 2:

Topics in Monte Carlo Methods

1. Sampling and Monte Carlo Methods
2. Importance Sampling
3. Markov Chain Monte Carlo Methods
4. Gibbs Sampling
5. Mixing between separated modes

Page 3:

Topics in Markov Chain Monte Carlo

• Limitations of plain Monte Carlo methods
• Markov Chains
• Metropolis-Hastings Algorithm
• MCMC and Energy-based models

Page 4:

Limitations of plain Monte Carlo

• Direct (unconditional) sampling

– Hard to get rare events in high-dimensional spaces
– Infeasible for MRFs, unless we know the partition function Z

• Rejection sampling, importance sampling
  – These don't work well if the proposal q(x) is very different from p(x)
  – Yet constructing a q(x) similar to p(x) can be difficult

• Making a good proposal usually requires knowledge of the analytic form of p(x) – but if we had that, we wouldn’t even need to sample!

• Intuition of MCMC
  – Instead of a fixed proposal q(x), use an adaptive proposal

Page 5:

MCMC and Adaptive Proposals

• MCMC:
  – Instead of q(x), use q(x'|x), where x' is the new state being sampled and x is the previous sample
• As x changes, q(x'|x) changes as a function of x'

[Figure: Left, importance sampling from p(x) with a bad proposal q(x). Right, MCMC sampling from p(x) with an adaptive proposal q(x'|x), showing successive proposals q(x2|x1), q(x3|x2), q(x4|x3) around samples x1, x2, x3.]

Page 6:

Markov Chain Monte Carlo Methods

• In many cases we wish to use a Monte Carlo technique, but there is no tractable method for drawing exact samples from p_model(x) or from a good (low-variance) importance sampling distribution q(x)
• In deep learning this happens most often when p_model(x) is represented by an undirected model
• In this case we use a mathematical tool called a Markov chain to sample from p_model(x)
• Algorithms that use Markov chains to perform Monte Carlo estimates are called MCMC methods

Page 7:

Markov Chain

• A sequence of random variables S0, S1, S2, …, with each Si ∈ {1, 2, …, d} taking one of d possible values representing the state of a system
  – The initial state is distributed according to p(S0)
  – Subsequent states are generated from a conditional distribution that depends only on the previous state
    • i.e., Si is distributed according to p(Si | Si−1)

[Figure: A Markov chain over three states. The weighted directed edges indicate probabilities of transitioning to a different state.]
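To make the definition concrete, here is a minimal Python sketch, not taken from the slides: the three-state transition matrix T is invented for illustration. It draws S0 from p(S0) and each subsequent state from p(Si | Si−1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition probabilities for d = 3 states;
# T[i, j] = p(S_t = j | S_{t-1} = i), so each row sums to 1.
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

p0 = np.array([1.0, 0.0, 0.0])       # initial distribution p(S0)

def simulate_chain(T, p0, n_steps):
    """Draw S0 ~ p(S0), then S_i ~ p(S_i | S_{i-1}) for each step."""
    s = rng.choice(len(p0), p=p0)
    states = [s]
    for _ in range(n_steps):
        s = rng.choice(T.shape[1], p=T[s])  # next state depends only on s
        states.append(s)
    return states

print(simulate_chain(T, p0, n_steps=10))
```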

Page 8:

Idea of MCMC

• The idea of MCMC is to:
  – Construct a Markov chain whose states will be joint assignments to the variables in the model
  – And whose stationary distribution will equal the model probability p

Page 9:

Metropolis-Hastings

• The user specifies a transition kernel q(x'|x) and an acceptance probability A(x'|x)
• M-H algorithm:
  – Draw a sample x' from q(x'|x), where x is the previous sample
  – The new sample x' is accepted or rejected with probability A(x'|x)

• This acceptance probability is

  A(x' \mid x) = \min\left(1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)

• It encourages us to move towards more likely points in the distribution

Page 10:

Acceptance probability

• A(x'|x) is like a ratio of importance sampling weights
  – p(x')/q(x'|x) is the importance weight for x'; p(x)/q(x|x') is the importance weight for x
  – We divide the importance weight for x' by that of x
  – Notice that we only need to compute the ratio p(x')/p(x), rather than p(x') or p(x) separately
• A(x'|x) ensures that, after sufficient draws, samples will come from the true distribution p(x)

A(x' \mid x) = \min\left(1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)

Page 11:

The M-H Algorithm (1)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

Initialize x(0)

Page 12:

The M-H Algorithm (2)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

Initialize x(0)
Draw, accept x(1)

Page 13:

The M-H Algorithm (3)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

Initialize x(0)
Draw, accept x(1)
Draw, accept x(2)

Page 14:

The M-H Algorithm (4)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

Initialize x(0)
Draw, accept x(1)
Draw, accept x(2)
Draw, but reject; set x(3) = x(2)

Page 15:

The M-H Algorithm (5)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

Initialize x(0)
Draw, accept x(1)
Draw, accept x(2)
Draw, but reject; set x(3) = x(2)
Draw, accept x(4)

Page 16:

The M-H Algorithm (6)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

Initialize x(0)
Draw, accept x(1)
Draw, accept x(2)
Draw, but reject; set x(3) = x(2)
Draw, accept x(4)
Draw, accept x(5)
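This walkthrough corresponds to running the sketch from page 10 (hypothetical names): initialize x(0), then each iteration either accepts the proposed x' or keeps the previous sample.

```python
# Uses the metropolis_hastings sketch defined after page 10.
samples = metropolis_hastings(n_samples=5000, x0=0.0, sigma=1.0)
print(samples[:5])   # the first few states of the chain
# A histogram of `samples` (after discarding early draws) should
# recover both modes of the bimodal p(x).
```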

Page 17:

MCMC and Energy-Based Models

• Theoretical guarantees for MCMC hold only when the model does not assign zero probability to any state
• It is therefore convenient to present these techniques as sampling from an energy-based model (EBM): p(x) ∝ exp(−E(x))

p(x) = \frac{1}{Z}\,\tilde{p}(x) \qquad \tilde{p}(x) = \prod_{C \in \mathcal{G}} \phi(C) \qquad Z = \int \tilde{p}(x)\,dx \qquad \tilde{p}(x) = \exp(-E(x))
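One practical consequence, shown in this small sketch (the quadratic energy is a stand-in chosen for illustration, not from the slides): the M-H ratio for an EBM needs only energy differences, since p̃(x')/p̃(x) = exp(E(x) − E(x')) and Z cancels.

```python
import numpy as np

def energy(x):
    """E(x); a simple quadratic stand-in for illustration."""
    return 0.5 * np.dot(x, x)

def unnorm_ratio(x_new, x_old):
    """p~(x')/p~(x) = exp(E(x) - E(x')); the partition function Z cancels."""
    return np.exp(energy(x_old) - energy(x_new))

print(unnorm_ratio(np.array([0.5]), np.array([1.0])))  # > 1: x' has lower energy
```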

Page 18:

Need more than ancestral sampling

• For directed acyclic graphs:
  – Start with the lowest numbered node
  – Draw a sample from the distribution p(x1), which we call x̂1
  – Work through each of the nodes in order
    • For node n, we draw a sample from the conditional distribution p(x_n | pa_n)
  – This defines an efficient single-pass algorithm (see the sketch below)
• Not so simple in EBMs: a chicken-and-egg problem
  – In order to sample from A we need to draw from p(A|B,D)
  – In order to sample from B we need to sample from p(B|A,C)
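For the directed case, here is a minimal Python sketch of that single pass; the three-node chain x1 → x2 → x3 and its Bernoulli conditionals are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bern(p):
    """Draw a single Bernoulli(p) sample."""
    return int(rng.random() < p)

# One ancestral pass in topological order: x1, then x2 | x1, then x3 | x2.
x1 = bern(0.6)                    # x̂1 ~ p(x1)
x2 = bern(0.8 if x1 else 0.3)     # x̂2 ~ p(x2 | x1)
x3 = bern(0.9 if x2 else 0.1)     # x̂3 ~ p(x3 | x2)
print(x1, x2, x3)                 # a joint sample from the DAG
```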

Page 19:

Avoiding chicken-and-egg in EBMs

• We avoid the chicken-and-egg problem by using a Markov chain
• Core idea of a Markov chain:
  – Have a state x that begins with an arbitrary value
  – Over time, we repeatedly update x
  – Eventually, x becomes a fair sample from p(x)
  – A Markov chain is defined by a random state x and a transition distribution T(x' | x)

Page 20:

Theoretical understanding of MCMC

• Reparameterize the problem
  – Restrict attention to the case where the random variable x has countably many states
    • Represent the state as an integer x; different integer values of x map back to different states x in the original problem
  – Consider what happens when we run infinitely many Markov chains in parallel
    • All the different states of these Markov chains are drawn from some distribution q(t)(x), where t is the number of time steps
    • q(0) is some distribution that we use to initialize the chains
    • Our goal is for q(t)(x) to converge to p(x)

Page 21:

Equilibrium Distribution

• Because we have reparameterized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v with q(x = i) = v_i
• The stationary distribution, also called the equilibrium distribution, is given by the eigenvector equation v' = Av = v, where A is the matrix of transition probabilities (see the sketch below)
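A small NumPy sketch of both views (the transition matrix A is invented for illustration; its columns sum to 1 so that q(t+1) = A q(t)): running the parallel chains forward corresponds to repeated multiplication by A, and the stationary v is the eigenvector with eigenvalue 1.

```python
import numpy as np

# Hypothetical transition matrix: A[i, j] = probability of moving to
# state i from state j, so each column sums to 1 and q(t+1) = A q(t).
A = np.array([[0.5, 0.1, 0.2],
              [0.3, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

v = np.array([1.0, 0.0, 0.0])        # q(0): all chains start in state 0
for _ in range(100):                 # run the parallel chains forward in time
    v = A @ v                        # one step of q(t+1) = A q(t)
print(v)                             # approaches the stationary distribution

# The same v solves the eigenvector equation v = Av (eigenvalue 1).
vals, vecs = np.linalg.eig(A)
v_star = np.real(vecs[:, np.argmax(np.real(vals))])
print(v_star / v_star.sum())         # normalized to sum to 1; matches v
```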

Page 22:

Choice of Transition Distribution

• If we have chosen T correctly then the stationary distribution q will be equal to the distribution p we wish to sample from

• We describe how to choose T next, when we discuss Gibbs sampling
