
Based on summer literature review

School of Physics, Peking University

Ziming Liu

Advisor: Zheng Zhang, UCSB ECE

Langevin-type sampling methods

The big party

[Word-cloud slide connecting three fields and their shared tools:
• Computer Science: Machine Learning, Complexity, Robustness, Information Theory
• Math: Tensor Analysis, Partial Differential Equations
• Physics: Quantum Field Theory, Statistical Physics, Quantum Information, Quantum Dynamics, Ising model, Renormalization Group
• Bridges: Hamiltonian Monte Carlo, Tensor Network, Matrix product state, Tensor train, Tensorized neural network, Schrodinger equation, Kohn-Sham equation, entanglement, Classical-quantum correspondence]

The goal of this talk: Quantum-inspired Hamiltonian Monte Carlo

Overview

1. Introduction to Bayesian models

2. Introduction to Langevin dynamics (1st-, 2nd-, 3rd-order)

3. Hamiltonian Monte Carlo (HMC)


1. Introduction to Bayesian models

Maxwell-Boltzmann distribution

Description: a single-particle isothermal system at temperature $T$. For a state $x$ with energy $E(x)$, the probability density is $p(x) \propto \exp(-E(x)/k_B T)$.

Theory
• Maximum entropy principle for an isolated system (canonical ensemble)
• Minimum free energy principle for an isothermal system

The link between a pdf and an energy function:
• Ideal gas: $E = \frac{q^2}{2m}$, so $p(q) \propto \exp(-\frac{q^2}{2mk_BT})$
• Static isothermal atmosphere: $E = U(x)$, so $p(x) \propto \exp(-\frac{U(x)}{k_BT})$
• Gas in a well: $E = U(x) + \frac{q^2}{2m}$, so $p(x,q) \propto \exp(-\frac{U(x) + q^2/2m}{k_BT})$

Bayesian model

What?

$p(\theta|D) \propto p(D|\theta)\,p(\theta)$
(posterior ∝ likelihood × prior; $\theta$: model parameters)

Link to regression models: write the posterior as $p(\theta|D) = \exp(-U(\theta))$ with

$U(\theta) = -\log p(D|\theta) - \log p(\theta)$
(regression error + regularization term)

Then $\theta^* = \arg\min_\theta U(\theta)$ is the same as $\theta^* = \arg\max_\theta p(\theta|D)$: the global optimum of $U$ is the maximum a posteriori (MAP) estimate.
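The link between $U(\theta)$ and the posterior can be made concrete with a tiny conjugate-Gaussian sketch (the model, data, and prior variance below are illustrative assumptions, not from the talk):

```python
import numpy as np

# Tiny conjugate-Gaussian sketch of U(θ) = -log p(D|θ) - log p(θ).
# Model (illustrative): y_i ~ N(θ, 1), prior θ ~ N(0, s2).
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=50)
s2 = 10.0  # prior variance

def U(theta):
    """Energy = regression error + regularization (up to additive constants)."""
    return 0.5 * np.sum((data - theta) ** 2) + 0.5 * theta ** 2 / s2

def grad_U(theta):
    return np.sum(theta - data) + theta / s2

# MAP estimate = argmin U(θ) = argmax p(θ|D), here via plain gradient descent
theta = 0.0
for _ in range(500):
    theta -= 0.01 * grad_U(theta)

# For this conjugate model the MAP is also available in closed form
print(theta, np.sum(data) / (len(data) + 1.0 / s2))
```

The gradient descent on $U$ recovers the closed-form MAP of this conjugate model, illustrating that minimizing the energy and maximizing the posterior are the same problem.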

Bayesian model

Why?

$\theta^* = \arg\min_\theta U(\theta)$, i.e. $\theta^* = \arg\max_\theta p(\theta|D)$ (MAP estimation, the global optimum)

Limitations of MAP:
1. No uncertainty quantification (it is a point estimate)
2. Risk of overfitting!

[Figure contrasting MAP with full Bayesian inference, from https://zhuanlan.zhihu.com/p/33563623]

Markov Chain Monte Carlo (MCMC)

Idea: construct a Markov chain whose steady distribution equals the posterior of the Bayesian model,
$P_s(\theta) = P(\theta|D)$

Given a steady distribution, how to construct such a Markov chain? The problem is simplified to a sufficient condition, detailed balance:
$p(x)\,Q(x \to y) = p(y)\,Q(y \to x)$

Detailed balance implies a steady distribution, but not conversely: a three-state chain with a constant circular flow $1 \to 2 \to 3 \to 1$ has a steady distribution but no detailed balance.

Detailed balance

With $N_1$ particles (probability $p_1$) in state 1, $N_2$ particles (probability $p_2$) in state 2, and transition rates $Q(1 \to 2) = Q(2 \to 1) = 1/2$, detailed balance requires the opposing particle flows $N_1\,Q(1 \to 2)$ and $N_2\,Q(2 \to 1)$ to be equal.

Metropolis-Hastings (MH)

The MH algorithm for sampling from a target distribution $p(x)$, using a transition kernel $Q$, consists of the following steps:
• For $t = 1, 2, \dots$
  • Sample $y$ from $Q(y|x_t)$. Think of $y$ as a proposed value of $x_{t+1}$.
  • Compute the acceptance probability
    $A(x_t \to y) = \min\left(1, \frac{p(y)\,Q(x_t|y)}{p(x_t)\,Q(y|x_t)}\right)$
  • With probability $A$ accept the proposal and set $x_{t+1} = y$; otherwise set $x_{t+1} = x_t$.

Particle interpretation:
• $p(x)$: number of particles at state $x$
• $Q(y|x)$: transition rate from $x$ to $y$
• $p(x)\,Q(y|x)$: number of particles transiting from $x$ to $y$
• $p(x)\,Q(y|x)\,A(x \to y)$: number of accepted particles transiting from $x$ to $y$

The acceptance rule enforces detailed balance:
$p(x)\,Q(y|x)\,A(x \to y) = p(y)\,Q(x|y)\,A(y \to x)$

Metropolis algorithm

The Metropolis algorithm for sampling from a target distribution $p(x)$, with transitions through a random walk, consists of the following steps:
• For $t = 1, 2, \dots$
  • Random-walk to $y$ from $x_t$. Think of $y$ as a proposed value of $x_{t+1}$.
  • Compute the acceptance probability
    $A(x_t \to y) = \min\left(1, \frac{p(y)}{p(x_t)}\right)$
  • With probability $A$ accept the proposal and set $x_{t+1} = y$; otherwise set $x_{t+1} = x_t$.

Comment: the random walk is symmetric, so $Q(y|x) = Q(x|y)$.

In thermodynamic models, $P(x) \sim \exp(-U(x)/T)$ (Boltzmann distribution).

Step-size tradeoff for the random walk on $U(x)$:
• large: low acceptance rate
• small: samples are correlated, not independent
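The random-walk Metropolis steps and the step-size tradeoff can be sketched on a 1D Boltzmann target $p(x) \propto \exp(-U(x)/T)$ with $U(x) = x^2/2$ (the target and step sizes here are illustrative choices):

```python
import numpy as np

# Random-walk Metropolis on a 1D Boltzmann target p(x) ∝ exp(-U(x)/T),
# U(x) = x²/2 and T = 1 (a standard normal).
rng = np.random.default_rng(0)

def U(x):
    return 0.5 * x * x

def metropolis(step, n=20_000, x0=0.0, T=1.0):
    x, accepted = x0, 0
    samples = np.empty(n)
    for t in range(n):
        y = x + rng.uniform(-step, step)               # symmetric: Q(y|x) = Q(x|y)
        if rng.random() < np.exp(-(U(y) - U(x)) / T):  # A = min(1, p(y)/p(x))
            x, accepted = y, accepted + 1
        samples[t] = x
    return samples, accepted / n

for step in (0.1, 1.0, 10.0):
    s, rate = metropolis(step)
    print(f"step={step:5}: acceptance={rate:.2f}, var={s.var():.2f}")
```

The acceptance rate falls as the step size grows, while tiny steps accept almost everything but produce highly correlated samples, which is exactly the tradeoff noted above.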

2. Introduction to Langevin Dynamics

Zoo of Langevin dynamics

• Stochastic Gradient Langevin Dynamics (1st order, general; 718 citations): Welling, Max, and Yee W. Teh. "Bayesian learning via stochastic gradient Langevin dynamics." ICML 2011.
• Stochastic Gradient Fisher Scoring (1st order, Gaussian; 207 citations): Ahn, Sungjin, Anoop Korattikara, and Max Welling. "Bayesian posterior sampling via stochastic gradient Fisher scoring." arXiv:1206.6380 (2012).
• Stochastic Gradient Hamiltonian Monte Carlo (2nd order; 300 citations): Chen, Tianqi, Emily Fox, and Carlos Guestrin. "Stochastic gradient Hamiltonian Monte Carlo." ICML 2014.
• Stochastic sampling using a Nose-Hoover thermostat (3rd order; 140 citations): Ding, Nan, et al. "Bayesian sampling using stochastic gradient thermostats." NeurIPS 2014.

1st order Langevin dynamics

(also known as Brownian motion / a Wiener-process model)

$dx = -\nabla f(x)\,dt + \sqrt{2\beta^{-1}}\,dW(t)$, with stationary density $\rho_s \propto \exp(-\beta f(x))$

$f$: energy function (Bayesian) / loss function (optimization)

Physical intuition (take $f(x) = \mathrm{const}$, a ball of mass $m$). The properties of the medium:
• a heat bath at temperature $T$ that hits the ball every $t_0$ (saving up one big kick), transferring momentum $p \sim \exp(-\frac{p^2}{2Tt_0})$
• overdamping (friction coefficient $\gamma$ large), i.e. a small relaxation time

① The ball gains a momentum $p$ from the fluctuating particles around it.
② It travels in the damping medium: $m\ddot{x} = -\gamma\dot{x}$, so $\dot{x} = \frac{p}{m}\exp(-\frac{\gamma}{m}t)$ and $x = \frac{p}{\gamma}(1 - \exp(-\frac{\gamma}{m}t))$.
③ In the overdamped regime, $\exp(-\frac{\gamma}{m}t_0) \to 0$, so by time $t_0$ the total displacement is $\frac{p}{\gamma} \propto p$. That is, $dx \propto \frac{1}{\gamma}p$ with $p \sim \exp(-\frac{p^2}{2Tt_0})$, which gives $dx \propto \sqrt{T}\,dW(t_0)$ (with $\beta = 1/T$).
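A direct Euler-Maruyama discretization shows the stationary density emerging (a sketch; the choices $f(x) = x^2/2$, $\beta = 1$, and the step size are illustrative):

```python
import numpy as np

# Euler-Maruyama discretization of 1st-order Langevin dynamics
#   dx = -f'(x) dt + sqrt(2/beta) dW,
# whose stationary density is rho_s ∝ exp(-beta * f(x)).
rng = np.random.default_rng(1)
beta, dt, n_steps = 1.0, 0.01, 200_000

def grad_f(x):
    return x  # f(x) = x²/2, so rho_s is a standard normal

x = 0.0
xs = np.empty(n_steps)
kicks = np.sqrt(2.0 * dt / beta) * rng.standard_normal(n_steps)
for t in range(n_steps):
    x += -grad_f(x) * dt + kicks[t]
    xs[t] = x

burn = xs[n_steps // 10:]          # discard the transient
print(burn.mean(), burn.var())     # should approach 0 and 1/beta = 1
```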

2nd order Langevin dynamics

$d\vec{x} = \vec{p}\,dt$
$d\vec{p} = -\nabla f(\vec{x})\,dt - A\vec{p}\,dt + \sqrt{2AT}\,dW$

(conservative force $-\nabla f(\vec{x})$; damping force $-A\vec{p}$; thermal "force" $\sqrt{2AT}\,dW$)

Invariant measure: $P_s(\vec{x},\vec{p}) \propto \exp(-(\frac{p^2}{2} + f(\vec{x}))/T)$
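The invariant measure can be checked with a discretized simulation (a sketch; $f(x) = x^2/2$, $A = 1$, $T = 1$, and the step size are illustrative choices):

```python
import numpy as np

# Euler-Maruyama simulation of 2nd-order (underdamped) Langevin dynamics
#   dx = p dt,   dp = -f'(x) dt - A p dt + sqrt(2 A T) dW,
# with invariant measure ∝ exp(-(p²/2 + f(x)) / T).
rng = np.random.default_rng(2)
A, T, dt, n_steps = 1.0, 1.0, 0.01, 200_000

x, p = 0.0, 0.0
xs, ps = np.empty(n_steps), np.empty(n_steps)
kicks = np.sqrt(2.0 * A * T * dt) * rng.standard_normal(n_steps)
for t in range(n_steps):
    x += p * dt
    p += (-x - A * p) * dt + kicks[t]   # conservative + damping + thermal forces
    xs[t], ps[t] = x, p

tail = slice(n_steps // 10, None)
print(xs[tail].var(), ps[tail].var())   # both marginal variances should approach T = 1
```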

Fokker-Planck Eq for 2nd order LD

One-dim random walk (invariance principle): a walker at site $x$ hops to $x-1$ or $x+1$ with probability $1/2$ per step, so the occupation density $\phi$ obeys

$\frac{\partial \phi}{\partial t} = \frac{1}{2}(\phi_{x-1} + \phi_{x+1} - 2\phi_x) \approx \frac{1}{2}\frac{\partial^2 \phi}{\partial x^2}$
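For reference, the full kinetic Fokker-Planck equation matching the 2nd-order Langevin dynamics above is the standard result (stated here as a sketch rather than taken from the slides):

```latex
\partial_t \rho(x,p,t)
  = -\, p \cdot \nabla_x \rho
  + \nabla_p \cdot \big[ \left( \nabla f(x) + A\, p \right) \rho \big]
  + A\, T \, \Delta_p \rho
```

Substituting $\rho_s \propto \exp(-(p^2/2 + f(x))/T)$ makes the right-hand side vanish, recovering the invariant measure quoted above.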

Dynamical equations ↔ Fokker-Planck equations

3rd order Langevin dynamics (special)

$d\vec{\theta} = \vec{p}\,dt$
$d\vec{p} = -\nabla U(\vec{\theta})\,dt - \zeta\vec{p}\,dt + \sqrt{2AT}\,dW$
$d\zeta = (\frac{p^2}{n} - T_0)\,dt$  (thermal term: the thermostat)

When $p \uparrow$, $\frac{p^2}{n} \sim T > T_0$, so $\zeta \uparrow$, putting more friction on $p$, so $p \downarrow$: a negative feedback loop.
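The negative feedback loop can be sketched in a minimal simulation (all coefficients, and $U(\theta) = \theta^2/2$ with $n = 1$, are illustrative choices):

```python
import numpy as np

# Sketch of the thermostat's negative feedback for the special 3rd-order system
#   dθ = p dt
#   dp = -U'(θ) dt - ζ p dt + sqrt(2 A T) dW
#   dζ = (p²/n - T0) dt
# with U(θ) = θ²/2 and n = 1.
rng = np.random.default_rng(3)
A, T, T0, dt, n_steps = 1.0, 1.0, 1.0, 0.01, 200_000

theta, p, zeta = 0.0, 1.0, 0.0
p2 = np.empty(n_steps)
kicks = np.sqrt(2.0 * A * T * dt) * rng.standard_normal(n_steps)
for t in range(n_steps):
    theta += p * dt
    p += (-theta - zeta * p) * dt + kicks[t]
    zeta += (p * p - T0) * dt   # p² above T0 raises the friction ζ, and vice versa
    p2[t] = p * p

print(p2[n_steps // 10:].mean())   # kinetic temperature, regulated toward T0 = 1
```

The long-run average of $p^2$ settles near $T_0$: whenever the kinetic temperature drifts high, the thermostat variable $\zeta$ grows and damps it back.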

3rd order Langevin dynamics (general)

$dq = M^{-1}p\,dt$
$dp = -\nabla U(q)\,dt + \sigma_F\sqrt{\Delta t}\,M^{1/2}\,dW - \zeta p\,dt + \sigma_A M^{1/2}\,dW$
$d\zeta = \frac{1}{\mu}(p^T M^{-1}p - N_d k_B T)\,dt - \gamma\zeta\,dt + \sqrt{2k_B T\gamma}\,dW$

Invariant measure: $\exp(-\beta U(q))\,\exp(-\beta p^T M^{-1}p/2)\,\exp(-\mu(\zeta - \hat{\gamma})^2/2)$, where $\hat{\gamma} = \beta(\sigma_F^2 + \sigma_A^2)/2$.


Slides from: https://ergodic.org.uk/~bl/Data/Slides/MD4.pdf

3rd order Langevin dynamics

$dq = M^{-1}p\,dt$
$dp = -\nabla U(q)\,dt + \sigma_F\sqrt{\Delta t}\,M^{1/2}\,dW - \zeta p\,dt + \sigma_A M^{1/2}\,dW$
$d\zeta = \frac{1}{\mu}(p^T M^{-1}p - N_d k_B T)\,dt - \gamma\zeta\,dt + \sqrt{2k_B T\gamma}\,dW$

Building blocks:
• Hamiltonian dynamics: $\exp(-\beta U(q))\,\exp(-\beta p^T M^{-1}p/2)$
• Thermostat coupling $p$ and $\zeta$: $\exp(-\beta p^T M^{-1}p/2)\,\exp(-\mu\zeta^2/2)$
• OU process for $\zeta$: $\exp(-\mu\zeta^2/2)$
• Noise on $p$ shifts the thermostat variable: $\exp(-\frac{\mu\zeta^2}{2}) \to \exp(-\frac{\mu(\zeta - \hat{\gamma})^2}{2})$

Invariant measure: $\exp(-\beta U(q))\,\exp(-\beta p^T M^{-1}p/2)\,\exp(-\mu(\zeta - \hat{\gamma})^2/2)$, where $\hat{\gamma} = \beta(\sigma_F^2 + \sigma_A^2)/2$.

3. Hamiltonian Monte Carlo (HMC)

Neal, Radford M. "MCMC using Hamiltonian dynamics." Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.

Betancourt, Michael. "A conceptual introduction to Hamiltonian Monte Carlo." arXiv preprint arXiv:1701.02434 (2017).

2nd order Langevin dynamics

$d\vec{x} = \vec{p}\,dt$
$d\vec{p} = -\nabla f(\vec{x})\,dt - A\vec{p}\,dt + \sqrt{2AT}\,dW$

(conservative force; damping force; thermal "force")

Invariant measure: $P_s(\vec{x},\vec{p}) \propto \exp(-(\frac{p^2}{2} + f(\vec{x}))/T)$

Setting $A = 0$ drops the damping and thermal terms, leaving
$d\vec{x} = \vec{p}\,dt$
$d\vec{p} = -\nabla f(\vec{x})\,dt$

Hamiltonian dynamics

Hamiltonian equations:
$d\vec{x} = M^{-1}\vec{p}\,dt$  (definition of momentum)
$d\vec{p} = -\nabla U(\vec{x})\,dt$  (momentum theorem; $-\nabla U(\vec{x})$ is a conservative force)

Hamiltonian (kinetic energy + potential energy):
$H(x,p) = \frac{1}{2}p^T M^{-1}p + U(x)$

Energy conservation:
$dH = p^T M^{-1}dp + dU(x) = -dx^T\nabla U(x) + dU(x) = 0$

Steady distribution

Hamiltonian equations:
$d\vec{x} = M^{-1}\vec{p}\,dt$
$d\vec{p} = -\nabla U(\vec{x})\,dt$

Steady distribution:
$p_s(x,q) \propto \exp(-U(\vec{x}) - \frac{1}{2}q^T M^{-1}q)$
$p_s(x) \propto \exp(-U(\vec{x})) = p(x|D)$

Fokker-Planck equation (here $q$ denotes the momentum):
$\partial_t p + \left(\frac{\partial p}{\partial x}\right)^T \frac{\partial H}{\partial q} - \left(\frac{\partial p}{\partial q}\right)^T \frac{\partial H}{\partial x} = 0$

(also known as Liouville's theorem in physics)

Example: 1d spring-mass system

$E = \frac{1}{2}kx^2 + \frac{q^2}{2m}$

The trajectory circles a single energy level in $(x,q)$ phase space:
$x = A\sin(\omega t + \phi_0)$
$q = m\omega A\cos(\omega t + \phi_0)$, where $\omega = \sqrt{k/m}$

No ergodicity?

Example: 1d spring-mass system interacting with a heat bath

① Momentum resampling ($m = k_B T = 1$): draw $q \sim N(0,1)$, i.e. from the Maxwell-Boltzmann distribution $p(q) \propto \exp(-\frac{q^2}{2mk_BT})$.
② Travel on an energy level for a certain time ($L$ steps).

(ensemble average over resampled momenta ↔ time average along each energy level)

2nd LD & HMC

Continuous scattering (2nd-order LD) vs. discrete scattering (HMC, saving up one big kick between resamplings).

Algorithm

The sampler alternates three steps:
1. Momentum resampling
2. Hamiltonian dynamics (leapfrog scheme)
3. Metropolis-Hastings accept/reject
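The three steps above can be sketched as a minimal HMC sampler (the target $U(x) = x^2/2$, step size, and path length are illustrative choices):

```python
import numpy as np

# A minimal HMC sampler: momentum resampling, leapfrog integration,
# and a Metropolis-Hastings correction, targeting U(x) = x²/2.
rng = np.random.default_rng(4)

def U(x):      return 0.5 * x * x
def grad_U(x): return x

def hmc(n_samples=5000, eps=0.2, L=20, x0=0.0):
    x, accepts = x0, 0
    out = np.empty(n_samples)
    for i in range(n_samples):
        q = rng.standard_normal()              # 1) resample momentum q ~ N(0, 1)
        x_new, q_new = x, q
        q_new -= 0.5 * eps * grad_U(x_new)     # 2) leapfrog: initial half step in q
        for _ in range(L):
            x_new += eps * q_new               #    full step in x
            q_new -= eps * grad_U(x_new)       #    full step in q
        q_new += 0.5 * eps * grad_U(x_new)     #    roll back the extra half step
        dH = (U(x_new) + 0.5 * q_new**2) - (U(x) + 0.5 * q**2)
        if rng.random() < np.exp(-dH):         # 3) MH accept/reject on the energy error
            x, accepts = x_new, accepts + 1
        out[i] = x
    return out, accepts / n_samples

samples, rate = hmc()
print(rate, samples.mean(), samples.var())     # high acceptance; moments near 0 and 1
```

Because leapfrog nearly conserves $H$, the energy error $\Delta H$ stays small and the acceptance rate is close to one, unlike random-walk Metropolis.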

Euler vs leapfrog

Euler's method:
$x(t+\epsilon) = x(t) + \epsilon\,\frac{q(t)}{m}$
$q(t+\epsilon) = q(t) - \epsilon\,\frac{\partial U}{\partial x}(x(t))$

Leapfrog method:
$q(t+\frac{\epsilon}{2}) = q(t) - \frac{\epsilon}{2}\frac{\partial U}{\partial x}(x(t))$
$x(t+\epsilon) = x(t) + \epsilon\,\frac{q(t+\epsilon/2)}{m}$
$q(t+\epsilon) = q(t+\frac{\epsilon}{2}) - \frac{\epsilon}{2}\frac{\partial U}{\partial x}(x(t+\epsilon))$

E.g. for $U(x) = \frac{1}{2}kx^2$:

Euler: $\begin{pmatrix} x(t+\epsilon) \\ q(t+\epsilon) \end{pmatrix} = \begin{pmatrix} 1 & \epsilon/m \\ -k\epsilon & 1 \end{pmatrix} \begin{pmatrix} x(t) \\ q(t) \end{pmatrix}$; det $= 1 + \frac{k\epsilon^2}{m} > 1$, not preserving volume!

Leapfrog: $\begin{pmatrix} x(t+\epsilon) \\ q(t+\epsilon) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -\frac{k\epsilon}{2} & 1 \end{pmatrix} \begin{pmatrix} 1 & \epsilon/m \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -\frac{k\epsilon}{2} & 1 \end{pmatrix} \begin{pmatrix} x(t) \\ q(t) \end{pmatrix}$; det $= 1$, preserving volume!

Euler vs leapfrog

In practice, Euler's method diverges while leapfrog remains stable.

MCMC & HMC

Random-walk MCMC moves in position $x$ alone, targeting $P(x) \sim \exp(-U(x)/T)$.

HMC moves in position $x$ plus momentum $q$, targeting $P(x,q) \sim \exp(-(U(x) + K(q))/T)$. Hamiltonian dynamics conserves energy, so proposals are always accepted (as the step size $\to 0$)!

MCMC & HMC

(Figure from Neal, Radford M. "MCMC using Hamiltonian dynamics." Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.)

HMC limitations

① Ill-conditioned distributions: need different masses in different directions
② Multimodal distributions: hard to escape from one mode
③ Discontinuous distributions: large energy gap, so a low acceptance rate
④ Spiky distributions: large gradients, so a low acceptance rate
⑤ Large training datasets: expensive gradient computation

HMC variants

• Riemannian HMC: Girolami, Mark, Ben Calderhead, and Siu A. Chin. "Riemannian manifold Hamiltonian Monte Carlo." arXiv:0907.1100 (2009).
• Magnetic HMC: Tripuraneni, Nilesh, et al. "Magnetic Hamiltonian Monte Carlo." ICML 2017.
• Wormhole HMC: Lan, Shiwei, Jeffrey Streets, and Babak Shahbaba. "Wormhole Hamiltonian Monte Carlo." AAAI 2014.
• Continuously tempered HMC: Graham, Matthew M., and Amos J. Storkey. "Continuously tempered Hamiltonian Monte Carlo." arXiv:1704.03338 (2017).
• Stochastic Gradient HMC: Chen, Tianqi, Emily Fox, and Carlos Guestrin. "Stochastic gradient Hamiltonian Monte Carlo." ICML 2014.
• Stochastic Gradient Thermostat: Ding, Nan, et al. "Bayesian sampling using stochastic gradient thermostats." NeurIPS 2014.
• Relativistic Monte Carlo: Lu, Xiaoyu, et al. "Relativistic Monte Carlo." arXiv:1609.04388 (2016).
• Optics HMC: Afshar, Hadi Mohasel, and Justin Domke. "Reflection, refraction, and Hamiltonian Monte Carlo." NeurIPS 2015.