
Markov Chain Monte Carlo Methods

Sargur N. Srihari
srihari@cedar.buffalo.edu


Topics in Monte Carlo Methods

1. Sampling and Monte Carlo Methods
2. Importance Sampling
3. Markov Chain Monte Carlo Methods
4. Gibbs Sampling
5. Mixing between separated modes


Topics in Markov Chain Monte Carlo

• Limitations of plain Monte Carlo methods
• Markov chains
• Metropolis-Hastings algorithm
• MCMC and energy-based models


Limitations of plain Monte Carlo

• Direct (unconditional) sampling
  – Hard to capture rare events in high-dimensional spaces
  – Infeasible for MRFs unless we know the partition function Z
• Rejection sampling and importance sampling
  – Do not work well if the proposal q(x) is very different from p(x); see the sketch after this list
  – Yet constructing a q(x) similar to p(x) can be difficult
  – Making a good proposal usually requires knowledge of the analytic form of p(x); but if we had that, we wouldn't even need to sample!
• Intuition of MCMC
  – Instead of a fixed proposal q(x), use an adaptive proposal
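A minimal sketch (not from the slides; the Gaussian target and the deliberately mismatched proposal are illustrative assumptions) of how importance sampling degrades when q is far from p:

```python
import numpy as np

rng = np.random.default_rng(4)

# Target p = N(0, 1); deliberately mismatched proposal q = N(4, 0.5^2)
def log_p(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_q(x):
    return -0.5 * ((x - 4.0) / 0.5)**2 - np.log(0.5) - 0.5 * np.log(2 * np.pi)

x = rng.normal(4.0, 0.5, size=10_000)   # draws from the proposal q
w = np.exp(log_p(x) - log_q(x))         # importance weights p(x)/q(x)

# A handful of huge weights dominate: the self-normalized estimate
# of E_p[x^2] (true value: 1) is badly behaved
print("max weight / mean weight:", w.max() / w.mean())
print("estimate of E_p[x^2]:", np.sum(w * x**2) / np.sum(w))
```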


MCMC and Adaptive Proposals

• MCMC: instead of a fixed q(x), use q(x'|x), where x' is the new state being sampled and x is the previous sample
• As x changes, q(x'|x) changes as a function of x'


[Figure: left panel, importance sampling from p(x) with a bad proposal q(x); right panel, MCMC sampling from p(x) with an adaptive proposal q(x'|x), drawn as successive proposals q(x2|x1), q(x3|x2), q(x4|x3) centered on the samples x1, x2, x3]


Markov Chain Monte Carlo Methods

• In many cases we wish to use a Monte Carlo technique but there is no tractable method for drawing exact samples from pmodel(x) or from a good (low-variance) importance sampling distribution q(x)
• In deep learning this happens most often when pmodel(x) is represented by an undirected model
• In this case we use a mathematical tool called a Markov chain to sample from pmodel(x)
• Algorithms that use Markov chains to perform Monte Carlo estimates are called MCMC methods

Markov Chain

• A sequence of random variables S0, S1, S2, …, with each Si ∈ {1, 2, …, d} taking one of d possible values representing the state of a system
  – The initial state is distributed according to p(S0)
  – Subsequent states are generated from a conditional distribution that depends only on the previous state, i.e., Si is distributed according to p(Si | Si−1)

[Figure: a Markov chain over three states; the weighted directed edges indicate the probabilities of transitioning to a different state]
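As a minimal sketch (the three-state transition matrix below is an assumption, not from the slides), simulating such a chain only requires drawing each state from the row of the transition matrix selected by the previous state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 3-state transition matrix: row i holds p(S_t = j | S_{t-1} = i)
T = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.20, 0.20, 0.60]])
p0 = np.array([1.0, 0.0, 0.0])  # initial distribution p(S0)

def simulate_chain(T, p0, n_steps):
    """Draw S0 ~ p(S0), then each S_t from p(S_t | S_{t-1})."""
    s = rng.choice(len(p0), p=p0)
    states = [s]
    for _ in range(n_steps):
        s = rng.choice(T.shape[1], p=T[s])  # next state depends only on s
        states.append(s)
    return states

print(simulate_chain(T, p0, n_steps=10))
```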

Idea of MCMC

• Construct a Markov chain whose states are joint assignments to the variables in the model
• And whose stationary distribution equals the model probability p


Metropolis-Hastings

• The user specifies a transition kernel q(x'|x) and an acceptance probability A(x'|x)
• M-H algorithm:
  – Draw a sample x' from q(x'|x), where x is the previous sample
  – The new sample x' is accepted or rejected with probability A(x'|x)
• This acceptance probability is

$$A(x' \mid x) = \min\!\left(1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$$

• It encourages us to move towards more likely points in the distribution
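A minimal sketch of the algorithm (the bimodal mixture-of-Gaussians target and the Gaussian random-walk proposal below are illustrative assumptions, in the spirit of the worked example that follows):

```python
import numpy as np

rng = np.random.default_rng(1)

def p(x):
    """Unnormalized bimodal target: mixture of two Gaussians."""
    return np.exp(-0.5 * (x + 2.0)**2) + np.exp(-0.5 * (x - 2.0)**2)

def metropolis_hastings(p, x0, n_samples, proposal_std=1.0):
    """Gaussian random-walk M-H. Because this q is symmetric,
    q(x|x') = q(x'|x), and A reduces to min(1, p(x')/p(x))."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = rng.normal(x, proposal_std)   # draw x' ~ q(x'|x)
        A = min(1.0, p(x_new) / p(x))         # acceptance probability
        if rng.random() < A:
            x = x_new                         # accept x'
        # on rejection, keep the previous sample: x(t+1) = x(t)
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(p, x0=0.0, n_samples=10_000)
print(samples.mean(), samples.std())
```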


Acceptance probability

• A(x'|x) is like a ratio of importance sampling weights
  – p(x')/q(x'|x) is the importance weight for x', and p(x)/q(x|x') is the importance weight for x
  – We divide the importance weight for x' by that of x
  – Notice that we only need to compute the ratio p(x')/p(x), rather than p(x') or p(x) separately, so any normalizing constant cancels
  – In particular, when q is symmetric, i.e., q(x'|x) = q(x|x'), A reduces to min(1, p(x')/p(x))
• A(x'|x) ensures that, after sufficient draws, samples will come from the true distribution p(x)



The M-H Algorithm: worked example

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)
• The chain proceeds step by step:
  1. Initialize x(0)
  2. Draw, accept x(1)
  3. Draw, accept x(2)
  4. Draw, but reject; set x(3) = x(2)
  5. Draw, accept x(4)
  6. Draw, accept x(5)

[Figure sequence: at each step, the proposal Gaussian is re-centered on the current sample, and the draw is accepted or rejected under the bimodal p(x)]


MCMC and Energy-Based Models

• Guarantees for MCMC hold only when the model does not assign zero probability to any state
• It is thus convenient to present these techniques as sampling from an energy-based model (EBM), p(x) ∝ exp(−E(x))

For a general undirected model, the unnormalized distribution p̃(x) is a product of clique potentials; for an EBM it takes the exponential form:

$$p(x) = \frac{1}{Z}\,\tilde{p}(x), \qquad \tilde{p}(x) = \prod_{C \in \mathcal{G}} \phi(C), \qquad Z = \int \tilde{p}(x)\, dx, \qquad \tilde{p}(x) = \exp(-E(x))$$
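Because M-H only ever uses the ratio p(x')/p(x), the intractable partition function Z cancels, so we can sample an EBM from p̃(x) = exp(−E(x)) alone. A minimal sketch, assuming a simple quadratic energy (a standard Gaussian in disguise):

```python
import numpy as np

rng = np.random.default_rng(2)

def energy(x):
    """Assumed example energy; any E(x) with integrable exp(-E) works."""
    return 0.5 * np.dot(x, x)

def mh_ebm(x0, n_samples, step=0.5):
    """M-H on an EBM: the acceptance ratio p_tilde(x')/p_tilde(x)
    equals exp(E(x) - E(x')), so Z is never computed."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        x_new = x + step * rng.standard_normal(x.shape)
        if rng.random() < min(1.0, np.exp(energy(x) - energy(x_new))):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

samples = mh_ebm(np.zeros(2), n_samples=5_000)
print(samples.mean(axis=0))  # should approach the mode at the origin
```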


Need more than ancestral sampling

• Ancestral sampling for directed acyclic graphs:
  – Start with the lowest-numbered node
  – Draw a sample from the distribution p(x1), which we call x̂1
  – Work through each of the nodes in order: for node n, draw a sample from the conditional distribution p(xn | pan), where pan denotes the sampled values of the parents of node n
  – This defines an efficient single-pass algorithm; a sketch follows below
• Not so simple in EBMs: a chicken-and-egg problem
  – In order to sample from A we need to draw from p(A | B, D)
  – In order to sample from B we need to sample from p(B | A, C)
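A minimal sketch of ancestral sampling on an assumed three-node binary chain x1 → x2 → x3 (the conditional probability tables are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative CPTs for a binary chain x1 -> x2 -> x3
p_x1 = np.array([0.6, 0.4])               # p(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],     # p(x2 | x1 = 0)
                          [0.2, 0.8]])    # p(x2 | x1 = 1)
p_x3_given_x2 = np.array([[0.9, 0.1],     # p(x3 | x2 = 0)
                          [0.5, 0.5]])    # p(x3 | x2 = 1)

def ancestral_sample():
    """Sample nodes in topological order, conditioning each node
    on the already-sampled values of its parents."""
    x1 = rng.choice(2, p=p_x1)
    x2 = rng.choice(2, p=p_x2_given_x1[x1])
    x3 = rng.choice(2, p=p_x3_given_x2[x2])
    return x1, x2, x3

print([ancestral_sample() for _ in range(5)])
```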


Avoiding chicken-and-egg in EBM

• We avoid the chicken-and-egg problem using a Markov chain
• Core idea of a Markov chain:
  – Have a state x that begins with an arbitrary value
  – Over time we repeatedly update x
  – Eventually x becomes a fair sample from p(x)
  – The Markov chain is defined by a random state x and a transition distribution T(x' | x)

Theoretical understanding of MCMC

• Reparameterize the problem
  – Restrict attention to the case where the random variable x has countably many states
  – Represent the state as an integer x; different integer values of x map back to different states x in the original problem
• Consider running infinitely many Markov chains in parallel
  – The states of these chains are drawn from some distribution q(t)(x), where t is the number of elapsed time steps
  – q(0) is the distribution with which we initialize the chains
• Our goal is for q(t)(x) to converge to p(x)


Equilibrium Distribution

• Because we have reparameterized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v, with q(x = i) = vi
• The stationary distribution, also called the equilibrium distribution, is given by the eigenvector equation v' = Av = v, where A is the matrix of transition probabilities, with Ai,j giving the probability of transitioning to state i from state j
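A minimal sketch (the three-state matrix is an assumed toy example) that finds v both by repeatedly applying the update q(t+1) = A q(t), mirroring the infinitely-many-parallel-chains view, and directly as the eigenvector of A with eigenvalue 1:

```python
import numpy as np

# Column-stochastic transition matrix: A[i, j] = T(x' = i | x = j)
# (the transpose of a row-stochastic matrix like the T used earlier)
A = np.array([[0.90, 0.10, 0.20],
              [0.05, 0.80, 0.20],
              [0.05, 0.10, 0.60]])

# Power iteration: apply q(t+1) = A q(t) until q stops changing
q = np.array([1.0, 0.0, 0.0])  # q(0): arbitrary initial distribution
for _ in range(1000):
    q = A @ q
print("power iteration:", q)

# Equivalently, v is the eigenvector of A with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(A)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v = v / v.sum()  # normalize to a probability distribution
print("eigenvector:   ", v)
```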


Choice of Transition Distribution

• If we have chosen T correctly, then the stationary distribution q will be equal to the distribution p we wish to sample from
• We describe how to choose T next, when we discuss Gibbs sampling