Stochastic Optimization: Markov Chain Monte Carlo

Ethan Fetaya
Weizmann Institute of Science

Page 2: Outline

1 Introduction
   Motivation
   Markov chains
   Stationary distribution
   Mixing time

2 Algorithms
   Metropolis-Hastings
   Simulated Annealing
   Rejectionless Sampling

Page 4: Introduction - Motivation

Assume we have a discrete/non-convex function f(x) we wish to optimize.

Example: the knapsack problem. Given m items with weights w = (w1, ..., wm) and values v = (v1, ..., vm), find the subset with maximal value under a weight constraint:

max vᵀz
s.t. wᵀz ≤ C, zi ∈ {0, 1}

These problems are in general NP-hard.

For simplicity we will assume the search space X is finite, but our results can be generalized easily.
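
As a concrete toy illustration of the objective above, here is a minimal brute-force sketch; the weights, values and capacity are made-up numbers, and the exhaustive search is exactly what becomes infeasible for large m:

```python
import itertools

# Toy instance (hypothetical numbers, for illustration only).
w = [2, 3, 4, 5]   # weights
v = [3, 4, 5, 8]   # values
C = 8              # capacity

def value(z):
    """Objective v^T z if the weight constraint w^T z <= C holds, else -inf."""
    if sum(wi * zi for wi, zi in zip(w, z)) > C:
        return float("-inf")
    return sum(vi * zi for vi, zi in zip(v, z))

# Exhaustive search over all 2^m subsets -- feasible only for tiny m,
# which is why stochastic methods are needed for realistic instances.
best = max(itertools.product([0, 1], repeat=len(w)), key=value)
print(best, value(best))
```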

Page 10: Introduction - Motivation

Stochastic approach: pick samples x1, ..., xN randomly from the search space X, and return arg max_{i∈[N]} f(xi).

What probability distribution should we use?

Simple (bad) distribution: pick x uniformly from X. Problem: we might spend most of the time sampling junk.

Great distribution: the softmax p(x) = e^{f(x)/T}/Z, where T is a temperature parameter and Z = ∑_{x∈X} e^{f(x)/T} is the partition function. Problem: how can you sample from p(x) when you cannot compute Z?

To solve this problem we use MCMC (Markov chain Monte Carlo) sampling.
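
A minimal sketch of why the softmax is attractive, on a toy space small enough that Z can be computed by direct enumeration (the objective f and temperature T below are made up; in real problems X is exponentially large and this normalization is intractable):

```python
import math
import random

random.seed(0)

# Tiny search space where Z *can* be computed by enumeration.
X = list(range(10))
f = lambda x: -(x - 7) ** 2   # made-up objective, maximized at x = 7
T = 1.0

Z = sum(math.exp(f(x) / T) for x in X)    # partition function
p = [math.exp(f(x) / T) / Z for x in X]   # softmax distribution

# Softmax samples concentrate near the maximizer of f,
# unlike uniform samples over X.
samples = random.choices(X, weights=p, k=1000)
print(sum(s == 7 for s in samples) / 1000)
```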

Page 18: Introduction - Markov chains

Definition 1.1 (Markov chain)
A sequence of random variables X1, ..., Xt, ... is a Markov chain if P(X_{i+1} = y | Xi, ..., X1) = P(X_{i+1} = y | Xi).

Example: the random walk X_{i+1} = Xi + ∆xi, where the ∆xi are i.i.d., is a Markov chain.

Example: X_{i+1} is an element of [N] not seen before. This is not a Markov chain.

We will consider homogeneous Markov chains, where P(X_{i+1} = y | Xi) does not depend on i.
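
The random-walk example can be simulated in a few lines (the ±1 step distribution is an arbitrary choice of i.i.d. increments):

```python
import random

random.seed(0)

def random_walk(n_steps, x0=0):
    """Simulate X_{i+1} = X_i + dx_i with i.i.d. steps dx_i in {-1, +1}.
    The next state depends only on the current one: the Markov property."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        x += random.choice([-1, 1])
        path.append(x)
    return path

path = random_walk(10)
print(path)
```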

Page 22: Introduction - Markov chains

We will use matrix notation: define a distribution as a row vector π such that π(x) is the probability of x. We can then think of a Markov chain as a sequence of distributions π0, π1, ..., πn, ....

Define the transition matrix P such that Pij = P_{i→j} = P(X_{n+1} = j | Xn = i).

We then have π_{n+1} = πnP, and therefore πn = π0P^n:

π_{n+1}(j) = P(X_{n+1} = j) = ∑_i P(X_{n+1} = j | Xn = i) P(Xn = i) = ∑_i Pij πn(i) = (πnP)(j).
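
A quick numerical sketch of πn = π0P^n, using a made-up 2-state transition matrix:

```python
import numpy as np

# Rows index the current state, columns the next state,
# so each row sums to 1 (row-stochastic).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

pi0 = np.array([1.0, 0.0])   # start deterministically in state 0

# Iterate pi_{n+1} = pi_n P ...
pi = pi0.copy()
for _ in range(100):
    pi = pi @ P
print(pi)

# ... which agrees with pi_n = pi_0 P^n computed in one shot.
print(pi0 @ np.linalg.matrix_power(P, 100))
```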

Page 30: Introduction - Stationary distribution

For well-behaved Markov chains a nice property holds: πn = π0P^n → π* independently of π0.

Definition 1.2 (Irreducibility)
A Markov chain is called irreducible if for all i, j there is a k such that (P^k)_{ij} > 0, i.e. you can get to any state from any state.

Definition 1.3 (Aperiodicity)
A Markov chain is called aperiodic if there exists a k such that (P^k)_{ij} > 0 for all i, j.

A simple trick to make a Markov chain aperiodic is to ensure Pii > 0.
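
Definition 1.3 can be checked mechanically for a small chain by searching for an all-positive power of P; the two chains below are toy instances (the search bound used is Wielandt's (n−1)² + 1, after which no new power can become positive):

```python
import numpy as np

def is_primitive(P):
    """Search for k with all entries of P^k > 0 (irreducible + aperiodic).
    Wielandt's bound: if such k exists, one appears by k = (n-1)^2 + 1."""
    n = P.shape[0]
    Q = np.eye(n)
    for k in range(1, (n - 1) ** 2 + 2):
        Q = Q @ P
        if (Q > 0).all():
            return True, k
    return False, None

# Periodic 2-cycle: P^k alternates between the swap and the identity,
# so no power is all-positive.
cycle = np.array([[0.0, 1.0], [1.0, 0.0]])
# Adding self-loops (P_ii > 0) is the simple trick that fixes this.
lazy = np.array([[0.5, 0.5], [0.5, 0.5]])

print(is_primitive(cycle), is_primitive(lazy))
```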

Page 35: Introduction - Stationary distribution

Theorem 1.4 (Stationary distribution)
If a Markov chain P is homogeneous, irreducible and aperiodic, then for any distribution π0 we have πn → π*, where π* is the unique solution to π = πP.

Proof sketch.
Since P is row-stochastic, P1 = 1, so 1 is an eigenvalue and there exists π* such that π*P = π*. By the Perron-Frobenius theorem this vector is positive and unique (up to scalar), and every other eigenvalue λ satisfies |λ| < 1. P may not have an eigendecomposition, but this is enough (with some work) to prove convergence.

How is this helpful? We will show how to build a Markov chain with any desired π*; then sampling from π* is easy: just run the chain to convergence (hopefully fast...).
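
π* can also be computed directly as the left eigenvector of P for eigenvalue 1, i.e. the solution of π = πP; a sketch with a made-up 2-state chain:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Left eigenvectors of P are the (right) eigenvectors of P^T.
vals, vecs = np.linalg.eig(P.T)
i = np.argmin(np.abs(vals - 1.0))   # pick the eigenvalue-1 eigenvector
pi_star = np.real(vecs[:, i])
pi_star = pi_star / pi_star.sum()   # normalize to a probability vector

print(pi_star)
print(pi_star @ P)   # fixed by P, as Theorem 1.4 requires
```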

Page 42: Introduction - Stationary distribution

Our interest is in reversible Markov chains, where detailed balance holds.

Lemma 1.5 (Detailed balance)
If the detailed balance equation πi Pij = πj Pji holds, then π = π*.

Proof: (πP)(j) = ∑_i π(i) Pij = ∑_i π(j) Pji = π(j).

So in order to make π the steady state we need Pij / Pji = πj / πi.

One can show that there exists a symmetric positive matrix A such that P is A after row-normalization.
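
The row-normalization remark can be sketched in the other direction: row-normalizing a symmetric positive matrix A yields a reversible P whose stationary distribution is proportional to the row sums of A (the matrix below is made up):

```python
import numpy as np

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 2.0, 4.0]])
assert np.allclose(A, A.T) and (A > 0).all()

row_sums = A.sum(axis=1)
P = A / row_sums[:, None]          # row-stochastic transition matrix
pi = row_sums / row_sums.sum()     # candidate stationary distribution

# Detailed balance pi_i P_ij = pi_j P_ji holds entrywise, since
# pi_i P_ij = A_ij / sum(A) is symmetric in (i, j) ...
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
# ... and therefore pi is stationary, by Lemma 1.5.
print(pi @ P)
```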

Page 47: Introduction - Mixing time

How fast does a Markov chain converge? There is a huge literature on mixing times; we will state one simple result.

The mixing time tmix(ε) is the minimal time such that, no matter where we started, for all n ≥ tmix(ε) we have ||πn − π*||_TV = ||πn − π*||_1 ≤ ε.

If P is reversible it has an eigendecomposition with 1 = λ1 > λ2 ≥ ... ≥ λ_{|X|} > −1. Define λ* = max{λ2, |λ_{|X|}|}.
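
λ* can be computed numerically for a small reversible chain (the matrix below is a toy example); the gap 1 − λ* is the quantity that controls how fast the chain mixes:

```python
import numpy as np

# Any 2-state chain is reversible; this matrix is a made-up example.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# For a reversible chain all eigenvalues are real; sort descending.
vals = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
assert np.isclose(vals[0], 1.0)            # top eigenvalue is always 1

lam_star = max(vals[1], abs(vals[-1]))     # lambda_* = max{lambda_2, |lambda_|X||}
print(lam_star, 1 - lam_star)              # lambda_* and the spectral gap
```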

Page 51: Stochastic optimization - Markov Chain Monte Carloethanf/MCMC/stochastic optimization.pdf · Stochastic Optimization Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya

Stochastic Optimization

Introduction

Mixing time

Theorem 1.6 (Mixing time)

If a Markov chain P has all previous requirements and is reversible then

tmix(ε) ≤ log( 1 / (ε · mini π∗(i)) ) · 1/(1 − λ∗)

tmix(ε) ≥ log( 1 / (2ε) ) · λ∗/(1 − λ∗)

This shows that the spectral gap 1 − λ∗ controls the rate of convergence.
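As a sanity check, both bounds can be evaluated numerically. The three-state chain, ε and tolerances below are illustrative choices, not from the lecture:

```python
import numpy as np

# A small reversible chain: lazy random walk on {0, 1, 2}.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()                      # here pi = (0.25, 0.5, 0.25)

# Eigenvalues of P are real since the chain is reversible.
lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
lam_star = max(lam[1], abs(lam[-1]))    # lambda* = max{lambda_2, |lambda_|X||}

eps = 0.01
t_upper = np.log(1 / (eps * pi.min())) / (1 - lam_star)
t_lower = np.log(1 / (2 * eps)) * lam_star / (1 - lam_star)
print(lam_star, t_lower, t_upper)
```

For this chain the spectrum is {1, 0.5, 0}, so λ∗ = 0.5 and the gap is 0.5; the two bounds bracket the true mixing time.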


Stochastic Optimization

Algorithms

Metropolis-Hastings

1 Introduction: Motivation, Markov chains, Stationary distribution, Mixing time

2 Algorithms: Metropolis-Hastings, Simulated Annealing, Rejectionless Sampling


The Metropolis-Hastings algorithm allows us to build a Markov chain with a desired stationary distribution. The algorithm requires:

1) A desired distribution known up to a constant, e.g. π(x) = exp(f(x)/T)/Z.

2) A Markov chain Q(i → j) called the proposal distribution. This is where we should look around state i. For example, in the knapsack problem it could be uniform over all possibilities of switching a single element zk. For continuous state spaces Q(x0 → x) = N(x0, σI) is a common choice.
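The single-element switch for the knapsack problem can be sketched as a proposal function (the function name and list representation are my own choices):

```python
import random

def propose_flip(z, rng=random):
    """Symmetric knapsack proposal: pick one of the m items uniformly
    and flip its inclusion bit, so Q(z -> z') = Q(z' -> z) = 1/m."""
    z_new = list(z)                 # copy so the current state is untouched
    k = rng.randrange(len(z_new))
    z_new[k] = 1 - z_new[k]
    return z_new
```

Because this Q is symmetric, the Q-ratio cancels in the acceptance probability of the algorithm below.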


Algorithm Metropolis-Hastings

Input: x0, π and Q.
for i = 0 : N do
    Pick proposition x∗ from distribution Q(xi → ·)
    α = min{1, [π(x∗)Q(x∗ → xi)] / [π(xi)Q(xi → x∗)]}
    With probability α set xi+1 = x∗, else xi+1 = xi
end for

Notice we use only ratios of π, so the unknown normalization constant is eliminated.

For example, if Q is symmetric and π ∝ exp(f(x)/T), then if f(x∗) ≥ f(xi) we always move to x∗; else we move with probability exp(−|∆f|/T).
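A minimal sketch of this loop for the knapsack motivation, assuming a symmetric single-flip proposal; the weights, values, capacity and temperature below are made up for illustration:

```python
import math
import random

def metropolis_hastings(f, propose, x0, T=1.0, n_steps=5000, rng=None):
    """Maximize f by sampling pi(x) ∝ exp(f(x)/T) with a symmetric
    proposal, so alpha = min(1, exp((f(x*) - f(xi))/T))."""
    rng = rng or random.Random(0)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for _ in range(n_steps):
        x_new = propose(x, rng)
        f_new = f(x_new)
        # Uphill moves always accepted; downhill with prob exp(-|df|/T).
        if f_new >= fx or rng.random() < math.exp((f_new - fx) / T):
            x, fx = x_new, f_new
        if fx > fbest:
            best, fbest = x, fx
    return best, fbest

# Toy knapsack instance (weights w, values v, capacity cap are illustrative).
w, v, cap = [2, 3, 4, 5], [3, 4, 5, 6], 5

def f(z):
    if sum(wi for wi, zi in zip(w, z) if zi) > cap:
        return float("-inf")  # infeasible subset: pi(z) = 0, never accepted
    return sum(vi for vi, zi in zip(v, z) if zi)

def propose(z, rng):
    z2 = list(z)              # flip a uniformly chosen item in or out
    k = rng.randrange(len(z2))
    z2[k] = 1 - z2[k]
    return z2

best, fbest = metropolis_hastings(f, propose, [0, 0, 0, 0], T=2.0)
```

Tracking the best state seen turns the sampler into an optimizer; setting f = −∞ on infeasible subsets makes the chain never leave the feasible region.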


Theorem 2.1

The MH algorithm defines a Markov chain P with stationary distribution π∗ = π.

Proof.

We will show P has detailed balance. Assume w.l.o.g. πjQj→i ≤ πiQi→j. Then

πiPi→j = πiQi→j · min{1, πjQj→i/(πiQi→j)} = πiQi→j · πjQj→i/(πiQi→j)
= πjQj→i = πjQj→i · min{1, πiQi→j/(πjQj→i)} = πjPj→i.


Remarks:

Q must be irreducible!

The convergence rate depends heavily on the auxiliary distribution Q.

The algorithm is derivative-free.

Convergence can be exponentially slow.

Can have low complexity per iteration, depends on Q.

π can be known up to a constant.

Optimization is just one application of the MH algorithm.


Stochastic Optimization

Algorithms

Simulated Annealing

1 Introduction: Motivation, Markov chains, Stationary distribution, Mixing time

2 Algorithms: Metropolis-Hastings, Simulated Annealing, Rejectionless Sampling


Consider running MH with π ∝ exp(f(x)/T ). What value of T to use?

For large T - rapid mixing time, but π∗ is almost uniform.

For small T - π∗ is highly concentrated on the maximum, but there can be an (exponentially) long mixing time.

The idea behind simulated annealing - start with high T, then decrease it slowly over time.


While simulated annealing is not a homogeneous process, if T changes slowly enough it is a close approximation.

One can show that for finite/compact spaces, simulated annealing with Ti = 1/(C ln(T0 + i)) converges to the global optimum.
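A minimal sketch of the annealed loop under this schedule; the objective, the constants C and T0, and the Gaussian proposal below are illustrative assumptions, not from the lecture:

```python
import math
import random

def simulated_annealing(f, propose, x0, C=1.0, T0=2.0, n_steps=5000, rng=None):
    """MH acceptance with the logarithmic cooling schedule
    T_i = 1 / (C * ln(T0 + i)) from the text."""
    rng = rng or random.Random(1)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for i in range(n_steps):
        T = 1.0 / (C * math.log(T0 + i))   # temperature decreases slowly
        x_new = propose(x, rng)
        f_new = f(x_new)
        if f_new >= fx or rng.random() < math.exp((f_new - fx) / T):
            x, fx = x_new, f_new
        if fx > fbest:
            best, fbest = x, fx
    return best, fbest

# Toy objective with two global maxima at x = ±1, where f(±1) = 0:
f = lambda x: -(x * x - 1.0) ** 2
propose = lambda x, rng: x + rng.gauss(0.0, 0.5)
best, fbest = simulated_annealing(f, propose, x0=3.0)
```

Early iterations (high T) let the chain cross the barrier between the two modes; late iterations (low T) concentrate it near one maximum.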


Online demo - http://www.youtube.com/watch?v=iaq_Fpr4KZc

Counter-example: On the blackboard.


Stochastic Optimization

Algorithms

Rejectionless Sampling

1 Introduction: Motivation, Markov chains, Stationary distribution, Mixing time

2 Algorithms: Metropolis-Hastings, Simulated Annealing, Rejectionless Sampling


If we are at a local maximum or a high-probability state, we might reject any proposal with high probability. This is very wasteful.

The idea - sample the next accepted state directly.

This only works for discrete problems where Q(x0 → x) has a reasonably sized support.


Define w(x) = Q(xi → x) · min{1, [π(x)Q(x → xi)] / [π(xi)Q(xi → x)]}, the probability to choose and accept x.

Define W = ∑x w(x). This is computable if the support of Q(xi → ·) is small and simple.

The probability that x is the next accepted state in the MH run is w(x)/W. Use this to pick the next state instead of the regular iteration.


Algorithm Rejectionless-MH

Input: x0, π and Q.
for i = 0 : N do
    For each x ∈ supp(Q(xi → ·)) compute
        w(x) = Q(xi → x) · min{1, [π(x)Q(x → xi)] / [π(xi)Q(xi → x)]}
    W = ∑x∈supp(Q(xi→·)) w(x)
    Select xi+1 = x with probability w(x)/W
end for

This can be much slower per iteration, but worth it if W is low enough.
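One iteration of this loop can be sketched for a uniform (hence symmetric) proposal over a small discrete neighborhood; the `neighbors` interface and the π ∝ exp(f/T) form are assumptions for illustration:

```python
import math
import random

def rejectionless_step(x, f, neighbors, T, rng):
    """Sample the next accepted state directly, with pi ∝ exp(f/T)
    and a uniform proposal over neighbors(x)."""
    nbrs = neighbors(x)
    q = 1.0 / len(nbrs)   # Q(x -> y), uniform over the neighborhood
    fx = f(x)
    # w(y) = Q(x -> y) * min(1, pi(y)/pi(x)); the Q-ratio cancels by symmetry.
    w = [q * (1.0 if f(y) >= fx else math.exp((f(y) - fx) / T)) for y in nbrs]
    W = sum(w)            # probability that a plain-MH step leaves x at all
    r = rng.random() * W  # inverse-CDF sampling of y with prob w(y)/W
    acc = 0.0
    for y, wy in zip(nbrs, w):
        acc += wy
        if r <= acc:
            return y, W
    return nbrs[-1], W    # guard against floating-point round-off

f = lambda x: -abs(x)                  # peak at 0
neighbors = lambda x: [x - 1, x + 1]   # nearest integers
y, W = rejectionless_step(0, f, neighbors, T=1.0, rng=random.Random(0))
```

Here 1/W is the expected number of plain-MH iterations spent waiting at x before any move is accepted, so rejectionless sampling pays off exactly when W is small.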
