Deep Learning Srihari
Topics in Monte Carlo Methods
1. Sampling and Monte Carlo Methods
2. Importance Sampling
3. Markov Chain Monte Carlo Methods
4. Gibbs Sampling
5. Mixing between separated modes
Topics in Markov Chain Monte Carlo
• Limitations of plain Monte Carlo methods
• Markov Chains
• Metropolis-Hastings Algorithm
• MCMC and Energy-based models
Limitations of plain Monte Carlo
• Direct (unconditional) sampling
  – Hard to get rare events in high-dimensional spaces
  – Infeasible for MRFs, unless we know the partition function Z
• Rejection sampling, Importance sampling
  – Do not work well if the proposal q(x) is very different from p(x)
  – Yet constructing a q(x) similar to p(x) can be difficult
• Making a good proposal usually requires knowledge of the analytic form of p(x) – but if we had that, we wouldn't even need to sample!
• Intuition of MCMC
  – Instead of a fixed proposal q(x), use an adaptive proposal
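The proposal-mismatch problem can be seen numerically. Below is a minimal sketch (target, proposals, and all numbers are hypothetical, chosen only for illustration) comparing the effective sample size of importance weights under a badly placed proposal versus a reasonable one:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def ess(weights):
    # effective sample size of importance weights: (sum w)^2 / sum w^2
    return weights.sum() ** 2 / (weights ** 2).sum()

n = 10_000
# target p(x) = N(0, 1)
# bad proposal, far from p: q(x) = N(4, 1)
x_bad = rng.normal(4.0, 1.0, n)
w_bad = normal_pdf(x_bad, 0.0, 1.0) / normal_pdf(x_bad, 4.0, 1.0)
# reasonable proposal, close to p: q(x) = N(0, 1.5^2)
x_good = rng.normal(0.0, 1.5, n)
w_good = normal_pdf(x_good, 0.0, 1.0) / normal_pdf(x_good, 0.0, 1.5)

print(ess(w_bad), ess(w_good))  # the bad proposal wastes almost all of its samples
```

A few huge weights dominate under the bad proposal, so the estimator's variance explodes even though it remains unbiased.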
MCMC and Adaptive Proposals
• MCMC:
  – Instead of q(x), use q(x'|x), where x' is the new state being sampled and x is the previous sample
• As x changes, q(x'|x) changes as a function of x'
[Figure: left panel shows importance sampling from p(x) with a bad proposal q(x); right panel shows MCMC sampling from p(x) with an adaptive proposal q(x'|x), i.e. q(x2|x1), q(x3|x2), q(x4|x3) as the chain moves through x1, x2, x3]
Markov Chain Monte Carlo Methods
• In many cases we wish to use a Monte Carlo technique but there is no tractable method for drawing exact samples from pmodel(x) or from a good (low variance) importance sampling distribution q(x)
• In deep learning this happens most often when pmodel(x) is represented by an undirected model
• In this case we use a mathematical tool called a Markov chain to sample from pmodel(x)
• Algorithms that use Markov chains to perform Monte Carlo estimates are called MCMC methods
Markov Chain
• A sequence of random variables S0, S1, S2, … with each Si ∈ {1, 2, …, d} taking one of d possible values representing the state of a system
  – Initial state distributed according to p(S0)
  – Subsequent states generated from a conditional distribution that depends only on the previous state
    • i.e., Si is distributed according to p(Si | Si−1)

[Figure: a Markov chain over three states; the weighted directed edges indicate probabilities of transitioning to a different state]
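A chain like the one in the figure can be simulated directly. The 3-state transition matrix below is hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical transition matrix: T[i, j] = p(S_t = j | S_{t-1} = i)
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def simulate(T, s0, steps, rng):
    # generate a trajectory: each state depends only on the previous one
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(T), p=T[states[-1]]))
    return states

chain = simulate(T, s0=0, steps=10_000, rng=rng)
freq = np.bincount(chain, minlength=3) / len(chain)
print(freq)  # long-run state frequencies approach the stationary distribution
```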
Idea of MCMC
• Construct a Markov chain whose states will be joint assignments to the variables in the model
• And whose stationary distribution will equal the model probability p
Metropolis-Hastings
• User specifies a transition kernel q(x'|x) and an acceptance probability A(x'|x)
• M-H Algorithm
  – Draw sample x' from q(x'|x), where x is the previous sample
  – The new sample x' is accepted or rejected with probability A(x'|x)
• This acceptance probability is

  A(x'|x) = min( 1, [p(x') q(x|x')] / [p(x) q(x'|x)] )

• It encourages us to move towards more likely points in the distribution
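A minimal sketch of the algorithm, using the common special case of a symmetric Gaussian proposal (for which q(x|x')/q(x'|x) = 1 and A reduces to min(1, p(x')/p(x))); the target and step size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_p, x0, n_samples, step=1.0, rng=rng):
    """M-H with a symmetric Gaussian proposal q(x'|x) = N(x, step^2).
    Since q is symmetric, A(x'|x) = min(1, p(x')/p(x))."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = x + step * rng.normal()
        # accept with probability min(1, p(x')/p(x)), computed in log space
        if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
            x = x_new
        samples.append(x)  # a rejection repeats the current state
    return np.array(samples)

# target: standard normal, known only up to the normalizer Z
log_p = lambda x: -0.5 * x ** 2
s = metropolis_hastings(log_p, x0=0.0, n_samples=50_000)
print(s.mean(), s.var())  # approach 0 and 1 after many draws
```

Note that `log_p` is unnormalized: only the ratio p(x')/p(x) enters the algorithm, so Z cancels.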
Acceptance probability
• A(x'|x) is like a ratio of importance sampling weights

  A(x'|x) = min( 1, [p(x') q(x|x')] / [p(x) q(x'|x)] )

  – p(x')/q(x'|x) is the importance weight for x'; p(x)/q(x|x') is the importance weight for x
  – We divide the importance weight for x' by that of x
  – Notice that we only need to compute the ratio p(x')/p(x), rather than p(x') or p(x) separately
• A(x'|x) ensures that, after sufficient draws, samples will come from the true distribution p(x)
The M-H Algorithm: worked example
• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)
• Successive steps of the chain (shown in the slides as a sequence of figures):
  1. Initialize x(0)
  2. Draw, accept x(1)
  3. Draw, accept x(2)
  4. Draw, but reject; set x(3) = x(2)
  5. Draw, accept x(4)
  6. Draw, accept x(5)
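The walkthrough above can be reproduced in a few lines. The bimodal target (an equal mixture of two Gaussians) and the proposal width are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# bimodal target: equal-weight mixture of N(-3, 1) and N(3, 1), unnormalized
def log_p(x):
    return np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2)

x, samples = 0.0, []
for _ in range(200_000):
    x_new = x + 2.5 * rng.normal()  # wide Gaussian proposal can jump between modes
    if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
        x = x_new                   # accept
    samples.append(x)               # on rejection, x(t+1) = x(t)

samples = np.array(samples)
left = (samples < 0).mean()
print(left)  # roughly half the mass ends up in each mode
```

With a narrower proposal the chain would still be correct asymptotically, but would hop between the two modes far less often, previewing the mixing problem discussed later in the deck.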
MCMC and Energy-Based Models
• Guarantees for MCMC hold when the model does not assign zero probability to any state
• Thus it is convenient to present these techniques as sampling from an energy-based model (EBM): p(x) ∝ exp(−E(x))

  p(x) = p̃(x)/Z,   p̃(x) = ∏_{C∈G} φ(C),   Z = ∫ p̃(x) dx,   p̃(x) = exp(−E(x))
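The practical point is that M-H needs only energy differences, so Z never has to be computed. A minimal sketch with a hypothetical quadratic energy (chosen so the target is a standard 2-D normal):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical energy function of an EBM; p(x) ∝ exp(-E(x)), Z never appears
def energy(x):
    return 0.5 * (x ** 2).sum()  # quadratic energy → standard normal target

x = np.zeros(2)
samples = []
for _ in range(50_000):
    x_new = x + 0.8 * rng.normal(size=2)
    # acceptance uses only an energy difference, so Z cancels:
    #   A = min(1, exp(-E(x')) / exp(-E(x))) = min(1, exp(E(x) - E(x')))
    if np.log(rng.uniform()) < energy(x) - energy(x_new):
        x = x_new
    samples.append(x.copy())

samples = np.array(samples)
print(samples.mean(axis=0))  # both coordinates approach 0
```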
Need more than ancestral sampling
• For directed acyclic graphs:
  – Start with the lowest numbered node
  – Draw a sample from the distribution p(x1), which we call x̂1
  – Work through each of the nodes in order
    • For node n we draw a sample from the conditional distribution p(xn | pan)
  – This defines an efficient single-pass algorithm
• Not so simple in EBMs: a chicken-and-egg problem
  – In order to sample from A we need to draw from p(A|B,D)
  – In order to sample from B we need to sample from p(B|A,C)
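Ancestral sampling itself is straightforward when the graph is directed. A sketch on a hypothetical three-node chain x1 → x2 → x3 of binary variables (all probabilities invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical CPDs for the chain DAG x1 -> x2 -> x3
p_x1 = 0.7                          # p(x1 = 1)
p_x2_given_x1 = {0: 0.2, 1: 0.9}    # p(x2 = 1 | x1)
p_x3_given_x2 = {0: 0.5, 1: 0.1}    # p(x3 = 1 | x2)

def ancestral_sample(rng):
    # one pass through the nodes in topological order;
    # each node conditions only on its already-sampled parents
    x1 = int(rng.uniform() < p_x1)
    x2 = int(rng.uniform() < p_x2_given_x1[x1])
    x3 = int(rng.uniform() < p_x3_given_x2[x2])
    return x1, x2, x3

draws = [ancestral_sample(rng) for _ in range(100_000)]
freq_x1 = sum(d[0] for d in draws) / len(draws)
freq_x2 = sum(d[1] for d in draws) / len(draws)
print(freq_x1, freq_x2)  # match the marginals p(x1=1)=0.7, p(x2=1)=0.69
```

In an undirected model there is no such topological order, which is exactly the chicken-and-egg problem above.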
Avoiding chicken-and-egg in EBM
• We avoid the chicken-and-egg problem using a Markov chain
• Core idea of a Markov chain
  – Have a state x that begins with an arbitrary value
  – Over time we repeatedly update x
  – Eventually x becomes a fair sample from p(x)
  – The Markov chain is defined by a random state x and a transition distribution T(x′ | x)
Theoretical understanding of MCMC
• Reparameterize the problem
  – Restrict attention to the case where the r.v. x has countably many states
    • Represent the state as an integer x
    • Different integer values of x map back to different states x in the original problem
  – Consider what happens when we run infinitely many Markov chains in parallel
    • The states of these Markov chains are drawn from some distribution q(t)(x), where t is the number of time steps
    • q(0) is some distribution that we use to initialize the chains
    • Our goal is for q(t)(x) to converge to p(x)
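The evolution of q(t) can be watched directly by repeatedly applying the transition probabilities to the distribution vector. The 3-state matrix A below (columns sum to 1, so q(t+1) = A q(t)) is hypothetical:

```python
import numpy as np

# hypothetical column-stochastic matrix: A[i, j] = probability of
# moving from state j to state i, so that q(t+1) = A q(t)
A = np.array([[0.5, 0.1, 0.2],
              [0.3, 0.6, 0.2],
              [0.2, 0.3, 0.6]])

q = np.array([1.0, 0.0, 0.0])  # q(0): every chain starts in state 0
for t in range(200):
    q = A @ q                   # q(t+1)(x') = sum_x A(x', x) q(t)(x)
print(q)                        # converged: applying A no longer changes q
```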
Equilibrium Distribution
• Because we have reparameterized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v with q(x = i) = v_i
• The stationary distribution, also called the equilibrium distribution, is given by the eigenvector equation v′ = Av = v, where A is the matrix of transition probabilities A(x′, x) = T(x′ | x)
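The eigenvector equation v′ = Av = v can be checked numerically: the stationary distribution is the eigenvector of A with eigenvalue 1, rescaled to sum to one. The transition matrix is hypothetical:

```python
import numpy as np

# hypothetical column-stochastic transition matrix (columns sum to 1)
A = np.array([[0.5, 0.1, 0.2],
              [0.3, 0.6, 0.2],
              [0.2, 0.3, 0.6]])

# stationary distribution v solves v = A v: eigenvector with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(A)
i = np.argmin(np.abs(eigvals - 1.0))  # locate the unit eigenvalue
v = np.real(eigvecs[:, i])
v = v / v.sum()                        # normalize to a probability vector
print(v)                               # satisfies v' = A v = v
```

A stochastic matrix always has 1 as an eigenvalue, which is why such a stationary distribution exists.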
Choice of Transition Distribution
• If we have chosen T correctly then the stationary distribution q will be equal to the distribution p we wish to sample from
• We describe how to choose T next, when we discuss Gibbs sampling