7/31/2019 MCMC Algorithms - Saquib
MCMC algorithms: Metropolis-
Hastings and its variants
Data Mining Seminar Fall 2012
Nazmus Saquib
Motivation
Metropolis is among the top 10 algorithms in
science and engineering.
Used in statistics, econometrics, physics, and
computer science.
Example: high-dimensional problems such as
computing the volume of a convex body in d
dimensions.
Motivation
Normalizing factor in Bayes Theorem:
Statistical Mechanics: Partition function Z
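The formulas on this slide did not survive extraction; the standard forms of the two quantities named above (a reconstruction, not the slide's exact notation) are:

```latex
% Normalizing factor (evidence) in Bayes' theorem:
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}
                        {\int p(x \mid \theta')\, p(\theta')\, d\theta'}

% Partition function in statistical mechanics
% (sum over all configurations x at temperature T):
Z = \sum_{x} e^{-E(x)/kT}
```

In both cases the denominator/sum is a high-dimensional integral or sum, which motivates sampling methods.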
Back to Monte Carlo
Monte Carlo Simulation:
Draw an i.i.d. set of N samples {x^(i)} from the target p(x).
The empirical average (1/N) Σ f(x^(i)) almost surely converges to E[f(x)] (strong law of large numbers).
The estimation error is characterized using the central limit theorem.
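The estimator above can be sketched in a few lines (the function names here are illustrative, not from the slides):

```python
import random

def monte_carlo_mean(f, sampler, n=100_000):
    """Estimate E[f(X)] by averaging f over n i.i.d. samples of X."""
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
# Example: E[X^2] for X ~ N(0, 1) is exactly 1.
est = monte_carlo_mean(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

By the CLT the error shrinks like O(1/sqrt(N)) regardless of dimension, which is the appeal of Monte Carlo.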
Rejection Sampling
Sample from another, easy-to-sample distribution q(x)
that satisfies p(x) ≤ M q(x) for some M < ∞; accept a draw x ~ q with probability p(x) / (M q(x)).
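A minimal sketch of this accept/reject loop, assuming a toy target p(x) = 2x on [0, 1] with a uniform proposal and envelope M = 2 (these choices are illustrative):

```python
import random

def rejection_sample(p, q_sample, q_pdf, M):
    """Draw one sample from p using proposal q with p(x) <= M * q(x)."""
    while True:
        x = q_sample()
        # Accept x with probability p(x) / (M q(x)).
        if random.random() <= p(x) / (M * q_pdf(x)):
            return x

random.seed(0)
# Target p(x) = 2x on [0, 1]; proposal Uniform(0, 1); M = 2 bounds p/q.
samples = [rejection_sample(lambda x: 2 * x, random.random, lambda x: 1.0, 2.0)
           for _ in range(20_000)]
mean = sum(samples) / len(samples)  # true mean of p is 2/3
```

The expected number of proposals per accepted sample is M, which is why a tight envelope matters in high dimensions.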
Importance Sampling
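The formulas on this slide were lost in extraction. The idea is to estimate E_p[f(X)] from samples of a proposal q by weighting each sample with w(x) = p(x)/q(x); a self-normalized sketch (names and densities here are illustrative):

```python
import random, math

def importance_mean(f, p_pdf, q_sample, q_pdf, n=100_000):
    """Estimate E_p[f(X)] from samples of q, weighted by w = p/q."""
    num = den = 0.0
    for _ in range(n):
        x = q_sample()
        w = p_pdf(x) / q_pdf(x)
        num += w * f(x)
        den += w
    return num / den  # self-normalized: unknown constants in p, q cancel

random.seed(0)
# Target p = N(0,1) (unnormalized); proposal q = N(0,2). True E[X^2] = 1.
p = lambda x: math.exp(-x * x / 2)
q_pdf = lambda x: math.exp(-x * x / 8)
q_sample = lambda: random.gauss(0.0, 2.0)
est = importance_mean(lambda x: x * x, p, q_sample, q_pdf)
```

Because the estimator is self-normalized, the normalizing constants of p and q cancel, which matters for the Bayesian setting above.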
Why MCMC?
Plain sampling wastes resources: we need to spend more
time in the tail of the proposal that overlaps with the region of interest E.
MCMC Principles
Even with adaptation, it is often impossible to obtain proposal distributions that are easy to sample from and good approximations at the same time.
A Markov chain is used to explore the state space X.
Transition matrices (kernels) are constructed so that the chain spends more time in the important regions.
MCMC Principles
For any starting point, the chain will converge
to the invariant distribution p(x), as long as T is a stochastic transition matrix that satisfies:
Irreducibility: the graph should be connected.
Aperiodicity: the chain should not get trapped in cycles.
Detailed Balance (reversibility)
Condition: p(x) T(x → x') = p(x') T(x' → x).
One way to design an MCMC sampler is to
satisfy this condition. However, convergence speed plays a more
crucial role in practice.
Spectral Theory and Convergence
(brief review)
Note that p(x) is the left eigenvector of the matrix
T with corresponding eigenvalue 1 (Perron-Frobenius theorem).
The remaining eigenvalues have modulus less than 1.
The second-largest eigenvalue therefore determines the rate of convergence; it should be as small as possible.
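The left-eigenvector statement can be checked numerically by iterating p ← pT on a small chain (the 2-state transition matrix below is a hypothetical example, not from the slides):

```python
# Hypothetical 2-state chain; rows sum to 1 (row-stochastic).
T = [[0.9, 0.1],
     [0.5, 0.5]]

def step(p, T):
    """One step of p_{t+1} = p_t T (row vector times matrix)."""
    return [sum(p[i] * T[i][j] for i in range(len(p)))
            for j in range(len(T[0]))]

p = [1.0, 0.0]            # arbitrary starting distribution
for _ in range(100):
    p = step(p, T)
# The iterates converge to the stationary p solving pT = p,
# here p = (5/6, 1/6); the error decays like |lambda_2|^t = 0.4^t.
```

The eigenvalues of this T are 1 and 0.4, so after t steps the distance to stationarity has shrunk by a factor 0.4^t, illustrating why a small second eigenvalue means fast convergence.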
Application: PageRank (Google)
T = L + E, where L is a large link matrix.
L_(i,j) = normalized number of links from website i to website j.
E = a uniform random matrix of small magnitude added to L to ensure irreducibility and aperiodicity (addition of noise).
p(x_(i+1)) = p(x_i) [L + E]; the ranking is the left eigenvector p = p [L + E].
Transition matrices as kernels: design different kernels to introduce bias etc. to make the results more interesting.
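A toy sketch of this construction on a 3-page web (the link structure and the weight eps are hypothetical, chosen only for illustration):

```python
# Toy PageRank: T = (1 - eps) * L + eps * uniform matrix, then power-iterate.
n = 3
links = {0: [1, 2], 1: [2], 2: [0]}   # page -> pages it links to
eps = 0.15                             # weight of the uniform "noise" matrix E

# Row-stochastic transition matrix.
T = [[eps / n] * n for _ in range(n)]
for i, outs in links.items():
    for j in outs:
        T[i][j] += (1 - eps) / len(outs)

p = [1.0 / n] * n                      # start from the uniform distribution
for _ in range(200):                   # power iteration: p <- p T
    p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
# p now approximates the stationary distribution = the page ranking.
```

Page 2, which is linked from both other pages, ends up with the largest stationary mass; the uniform E term guarantees the chain is irreducible and aperiodic so the iteration converges from any start.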
Mathematical Representation
Based on different kernels, different kinds of
Markov Chain algorithms are possible.
The most celebrated is the Metropolis-Hastings algorithm.
Metropolis-Hastings Algorithm
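The algorithm's pseudocode did not survive extraction. A minimal sketch of the Metropolis-Hastings loop for a 1-D target, using a symmetric Gaussian random-walk proposal so that the q terms in the acceptance ratio cancel (function and parameter names are illustrative):

```python
import random, math

def metropolis_hastings(log_p, x0, proposal_std=1.0, n=50_000):
    """MH chain targeting p; log_p may omit the normalizing constant."""
    x, chain = x0, []
    for _ in range(n):
        x_new = random.gauss(x, proposal_std)   # propose x' ~ q(. | x)
        # Symmetric q cancels: accept with prob min(1, p(x') / p(x)).
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        chain.append(x)                         # rejected moves repeat x
    return chain

random.seed(0)
# Target: standard normal, via its unnormalized log density -x^2/2.
chain = metropolis_hastings(lambda x: -x * x / 2, x0=0.0)
mean = sum(chain) / len(chain)
```

Note that on rejection the current state is recorded again; dropping rejected steps would bias the chain away from p.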
Metropolis-Hastings Algorithm
(properties)
Kernel:
Rejection Term:
Detailed Balance:
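The formulas for these three properties were lost in extraction; their standard statements in the usual MH notation (a reconstruction, not the slide's exact text) are:

```latex
% Kernel: move via proposal q and acceptance A, or stay put.
K_{MH}(x' \mid x) = q(x' \mid x)\, \mathcal{A}(x, x')
                  + \delta_{x}(x')\, r(x)

% Rejection term: total probability of remaining at x.
r(x) = \int q(x'' \mid x)\,\bigl(1 - \mathcal{A}(x, x'')\bigr)\, dx''

% Acceptance probability.
\mathcal{A}(x, x') = \min\!\left\{1,\;
    \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right\}

% Detailed balance, which K_{MH} satisfies by construction:
p(x)\, K_{MH}(x' \mid x) = p(x')\, K_{MH}(x \mid x')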
Independent Sampler Algorithm
The proposal is independent of the current state: q(x' | x) = q(x').
The algorithm is close to importance sampling, but
now the samples are correlated, since each proposal is accepted or
rejected by comparison with the current sample.
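This special case keeps the full MH acceptance ratio, since q no longer cancels; a sketch under illustrative choices of target and proposal:

```python
import random, math

def independent_mh(log_p, log_q, q_sample, x0, n=50_000):
    """MH where the proposal ignores the current state (independent sampler)."""
    x, chain = x0, []
    for _ in range(n):
        x_new = q_sample()
        # Acceptance ratio: p(x') q(x) / (p(x) q(x')).
        if math.log(random.random()) < (log_p(x_new) - log_p(x)
                                        + log_q(x) - log_q(x_new)):
            x = x_new
        chain.append(x)
    return chain

random.seed(0)
# Target N(0,1); independent proposal N(0,2), wider than the target.
chain = independent_mh(lambda x: -x * x / 2,      # unnormalized log p
                       lambda x: -x * x / 8,      # unnormalized log q
                       lambda: random.gauss(0.0, 2.0), x0=0.0)
mean = sum(chain) / len(chain)
```

As in importance sampling, the proposal should have heavier tails than the target; otherwise the chain can get stuck for long stretches.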
Metropolis Algorithm
Assumes a symmetric random-walk proposal, q(x' | x) = q(x | x'), so the acceptance probability reduces to min{1, p(x')/p(x)}.
Metropolis Algorithm
The normalizing constant of the target distribution
is not required: it cancels in the acceptance ratio.
Parallelization: several independent chains
can be simulated in parallel.
Success or failure depends on the parameters
selected for the proposal distribution.
Simulated Annealing
Global optimization: find the mode of p(x).
Could be estimated from samples by
arg max p(x^(i)) over x^(i), i = 1..N.
Inefficient, because random samples rarely come from the vicinity of the mode (blind sampling, unless the distribution has large probability mass around the mode).
Simulated annealing is a variant of MCMC/Metropolis-Hastings that solves this problem.
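The annealing pseudocode on the following slides was lost in extraction. The idea is to run Metropolis steps on p(x)^(1/T_i) while the temperature T_i decreases, so the chain concentrates on the mode; a sketch with an illustrative logarithmic cooling schedule and toy target:

```python
import random, math

def simulated_annealing(log_p, x0, n=20_000, t0=5.0):
    """Metropolis on p^(1/T) with decreasing temperature; returns best state."""
    x, best = x0, x0
    for i in range(1, n + 1):
        t = t0 / math.log(i + 1)          # one common cooling schedule
        x_new = random.gauss(x, 1.0)
        # Accept with prob min(1, (p(x')/p(x))^(1/t)), via log comparison.
        if math.log(random.random()) * t < log_p(x_new) - log_p(x):
            x = x_new
        if log_p(x) > log_p(best):        # track the best state visited
            best = x
    return best

random.seed(0)
# Toy target: log p(x) proportional to -(x - 3)^2, so the mode is x = 3.
best = simulated_annealing(lambda x: -(x - 3.0) ** 2, x0=-10.0)
```

At high temperature the chain moves almost freely (exploration); as T → 0 it accepts almost only uphill moves, pinning it near the mode.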
Other Methods
Mixture of kernels! Can be very useful when the target distribution has many
peaks. Can incorporate global proposals to explore vast regions
of the state space (the global proposal locks onto peaks).
Local proposals discover finer details (explore the space around peaks).
Gibbs sampling, etc.: Parasaran.
Thank you!