Langevin Dynamics
Loucas Pillaud-Vivien
November 7, 2019
Introduction
Sampling distributions over high-dimensional spaces is an important topic in computational statistics and machine learning.
Example of application: Bayesian inference for high-dimensional models.
Problems:
1 Most sampling techniques do not scale to high dimension. Big d.
2 Nor to a large number of data points (recall HMC needs the full gradient). Big N.
Example: Bayesian setting
A Bayesian model is specified by:
1 a sampling distribution of the observed data: the likelihood Y ∼ L(·|θ)
2 a prior distribution p on the parameter space θ ∈ Rd
The inference is based on the posterior distribution
π(dθ) = p(dθ) L(Y|θ) / ∫ L(Y|u) p(du)
The normalizing constant is often intractable (too high-dimensional); we can only compute:
π(dθ) ∝ p(dθ)L(Y |θ)
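To make the unnormalized posterior concrete, here is a small sketch in Python; the model (prior N(0, 1), Gaussian likelihood) and the function name are my own illustrative choices, not from the slides.

```python
import numpy as np

# Unnormalized log-posterior log p(θ) + log L(Y|θ) for a toy model:
# prior θ ~ N(0, 1), likelihood Y_i ~ N(θ, 1). Constants are dropped,
# exactly as in π(dθ) ∝ p(dθ) L(Y|θ).
def log_posterior_unnorm(theta, Y):
    log_prior = -0.5 * theta**2
    log_lik = -0.5 * np.sum((Y - theta) ** 2)
    return log_prior + log_lik

Y = np.array([1.8, 2.1, 2.4])
# Completing the square shows the mode is sum(Y)/(len(Y)+1) = 1.575 here.
print(log_posterior_unnorm(1.575, Y), log_posterior_unnorm(0.0, Y))
```

Everything downstream (Langevin algorithms included) only ever needs this unnormalized quantity, or rather its gradient.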
Outline
1 Diffusions and their numerical approximation
  Setting
  Continuous time Markov process: diffusions
  Discretized Langevin diffusion
2 Applications of Langevin algorithms
  Sampling a strongly convex potential
  Stochastic Gradient Langevin Dynamics
  Non-convex Learning via SGLD
Framework
We want to sample the following measure, which has a density w.r.t. Lebesgue known up to a normalization factor:
dµ(x) = e−V(x) dx / ∫Rd e−V(y) dy
We assume that V is L-smooth, i.e. continuously differentiable and ∃L > 0 s.t.
‖∇V(x) − ∇V(y)‖ ≤ L‖x − y‖
Convergence to equilibrium for Diffusions
Let us consider the overdamped Langevin diffusion in Rd:
dXt = −∇V(Xt) dt + √2 dBt
L-smoothness of V gives existence and uniqueness of a solution.
Stationary measure: dµ(x) = e−V(x) dx / ∫Rd e−V(y) dy.
Semi-group: Pt(f)(x) = E[f(Xt) | X0 = x] −→ "law of Xt".
Infinitesimal generator: Lφ = ∆φ − ∇V · ∇φ.
We can verify that the semi-group follows the dynamics:
(d/dt) Pt(f) = L Pt(f).
−→ Question: what is the speed of convergence?
Convergence to equilibrium for Diffusions
Theorem (Poincaré implies convergence to equilibrium)
With the notations above, the following propositions are equivalent:
1 µ satisfies a Poincaré inequality with constant P
2 For all f smooth, Varµ(Pt(f)) ≤ e−2t/P Varµ(f) for all t ≥ 0.

Proof: by the integration by parts formula (µ is reversible),
−∫ f (Lg) dµ = ∫ ∇f · ∇g dµ = −∫ (Lf) g dµ,
hence, since the mean ∫ Pt(f) dµ = ∫ f dµ is constant in t,
(d/dt) Varµ(Pt(f)) = (d/dt) ∫ (Pt(f))² dµ = 2 ∫ Pt(f) (L Pt(f)) dµ
= −2 ∫ ‖∇Pt(f)‖² dµ
≤ −(2/P) Varµ(Pt(f))
Poincaré inequalities: definition in modern language

Definition (Poincaré inequality)
µ ∈ P(Rd) satisfies a Poincaré inequality with constant P if
Varµ(f) ≤ P ∫ ‖∇f‖² dµ,
for all (bounded) f : Rd −→ R of class C1.

Recall that:
Varµ(f) = ∫ f² dµ − (∫ f dµ)² = ∫ (f − ∫ f dµ)² dµ
∫ ‖∇f‖² dµ = E(f) is the Dirichlet energy.
Spectral interpretation: E(f) = ∫ ∇f · ∇f dµ = ∫ f (−Lf) dµ
−→ 1/P = λ2, the first non-trivial eigenvalue of −L.
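A quick Monte Carlo sanity check of the inequality for the standard Gaussian (which, as the next slide states, satisfies Poincaré with P = 1); the test function f below is an arbitrary choice of mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.standard_normal((100_000, d))  # samples from µ = N(0, I_d)

# Test function f(x) = sin(x_1) + x_2, whose gradient has ‖∇f‖² = cos²(x_1) + 1.
f = np.sin(x[:, 0]) + x[:, 1]
grad_sq = np.cos(x[:, 0]) ** 2 + 1.0

var_f = f.var()          # Var_µ(f), analytically (1 - e^{-2})/2 + 1 ≈ 1.432
dirichlet = grad_sq.mean()  # E(f),   analytically (1 + e^{-2})/2 + 1 ≈ 1.568
print(var_f, dirichlet, var_f <= dirichlet)
```

Both sides can be computed in closed form here, and the Monte Carlo estimates match: Varµ(f) ≈ 1.43 ≤ 1.57 ≈ P · E(f) with P = 1.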
Application to the Ornstein-Uhlenbeck process
The Ornstein-Uhlenbeck process follows the SDE in Rd:
dXt = −Xt dt + √2 dBt
Denote L the operator Lφ = ∆φ − x · ∇φ. Then:
1 For dµ(x) = (2π)−d/2 e−‖x‖²/2 dx, L is self-adjoint in L²µ
2 µ is the stationary measure of the O-U process
3 µ satisfies a Poincaré inequality with constant 1
4 For all f smooth and all t ≥ 0,
Varµ(Pt(f)) ≤ e−2t Varµ(f).
Poincaré inequalities

Long story short:
Poincaré inequality ⇐⇒ spectral gap for L ⇐⇒ exponential convergence for the diffusion
Poincaré inequalities

For what distributions do they hold?
When V is m-strongly convex: P = 1/m (compare with the linear convergence of gradient descent).
When V is only convex: yes, but with no bound...
A generic condition for a not necessarily convex potential:
(1/2)|∇V|² − ∆V ≥ α
For a mixture of Gaussians, P explodes exponentially.
Ok, fine. But how do I get back to the real world and draw samples?
Discretized Langevin Diffusion
Idea: sample the diffusion paths using the Euler-Maruyama scheme:
dXt = −∇V(Xt) dt + √2 dBt
Xk+1 = Xk − γk+1 ∇V(Xk) + √(2γk+1) ξk+1
where
(ξk)k is i.i.d. N(0, Id)
(γk)k is a sequence of stepsizes, either constant or decreasing to 0.
Note the similarity with gradient descent or its stochastic counterpart.
This algorithm is referred to as the Unadjusted Langevin Algorithm, Langevin Monte Carlo, or Gradient Langevin Dynamics.
Discretized Langevin Diffusion: constant stepsize
When ∀k, γk = γ, (Xk)k is a homogeneous Markov chain with Markov kernel Rγ.
Under some mild assumptions Rγ is irreducible and positive recurrent, and hence has an invariant distribution dµγ ≠ dµ.
Typical questions:
For a given precision ε, how do we choose the stepsize γ and the number of iterations n such that
dist(δx Rγⁿ, dµ) ≤ ε ?
How do we choose x?
How do we quantify dist(dµγ, dµ)?
Result for a strongly convex potential
Theorem (Durmus, Moulines 2016)
Assume that V is m-strongly convex and L-smooth. Set γ ∈ (0, 1/(m + L)] and κ = mL/(m + L). Then for all x ∈ Rd,
W₂²(δx Rγⁿ, π) ≤ 2(1 − κγ)ⁿ W₂²(δx, π) + C d γ

Remarks:
Decomposition bias + variance, as for SGD.
Geometric rate, then distance from dµ to dµγ.
One may choose γ s.t. for n = Θ(d/ε²) iterations, W₂²(δx Rγⁿ, π) ≤ ε.
Explicit way of choosing γ (it was a problem! See MALA).
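The bias term C d γ can be seen in closed form on the 1D Ornstein-Uhlenbeck target (V(x) = x²/2, so m = L = 1); this is a sketch of mine, not a computation from the paper.

```python
import numpy as np

# For V(x) = x²/2, the ULA chain X_{k+1} = (1-γ) X_k + sqrt(2γ) ξ is Gaussian,
# with invariant variance 2γ / (1 - (1-γ)²) = 1/(1 - γ/2): a bias vanishing
# with γ, consistent with the C·d·γ term of the theorem (here d = 1).
for gamma in (0.2, 0.1, 0.05):
    inv_var = 2 * gamma / (1 - (1 - gamma) ** 2)
    # Squared W2 distance between the invariant law N(0, inv_var) and the target N(0, 1):
    w2_sq = (np.sqrt(inv_var) - 1.0) ** 2
    print(gamma, inv_var, w2_sq)
```

Halving γ visibly shrinks the asymptotic W₂² gap, while the geometric contraction term handles the transient from δx.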
Result for a strongly convex potential : remarks
Remarks:
Exactly the same results hold for
Total variation (Dalalyan 2014)
KL divergence (Bartlett et al. 2017)
Same result with decreasing step sizes, but then no parameter to tune!
Quadratic improvement by Jordan et al. 2018 by considering the underdamped Langevin diffusion (similar to HMC): for n = Θ(√d/ε) iterations, W₂²(δx Rγⁿ, π) ≤ ε (also needed only strong convexity outside of a ball).
Grrrrr... But you know... I do not like to compute all the gradients...
Stochastic Gradient Langevin Dynamics (SGLD)
Recall: the ULA algorithm is a discretization of the overdamped Langevin diffusion, which leaves the target distribution dµ invariant.
To further reduce the computational cost, SGLD uses unbiased estimates of the gradient!
Initially proposed by Welling, M. and Teh, Y.W. 2011.
SGLD algorithm
We are interested in situations where the distribution dµ arises as the posterior distribution in a Bayesian inference problem, with prior dµ0 and a large number N ≫ 1 of i.i.d. observations zi with likelihoods p(zi|X):
dµ(X | z1, . . . , zN) ∝ dµ0(X) ∏i=1..N p(zi|X).
Denote, for i ∈ {1, . . . , N}:
Vi(X) = −log(p(zi|X))
V0(X) = −log(dµ0(X))
V = Σi=0..N Vi
The cost of one ULA iteration is Nd, which is prohibitively large.
SGLD algorithm
Welling, M. and Teh, Y.W. suggested replacing ∇V with an unbiased estimate
∇V0 + (N/p) Σi∈S ∇Vi,
where S is a minibatch of size p.
A single update of SGLD is thus (cost pd):
Xk+1 = Xk − γ (∇V0(Xk) + (N/p) Σi∈Sk+1 ∇Vi(Xk)) + √(2γ) Zk+1
Same idea as SGD.
Two sources of randomness: the estimate of the gradient, and the Gaussian noise added to sample.
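Here is a toy run of the update (my own example, not from the slides): Gaussian mean estimation with prior N(0, 1) and likelihood zi ∼ N(θ, 1), so ∇V0(x) = x, ∇Vi(x) = x − zi, and the exact posterior is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, gamma = 10_000, 100, 1e-5
z = rng.normal(2.0, 1.0, size=N)  # data z_i ~ N(theta, 1), true theta = 2

# Closed-form posterior for this model: N(sum(z)/(N+1), 1/(N+1))
post_mean, post_var = z.sum() / (N + 1), 1.0 / (N + 1)

x, samples = 0.0, []
for k in range(20_000):
    idx = rng.choice(N, size=p, replace=False)   # minibatch S_{k+1} of size p
    grad = x + (N / p) * np.sum(x - z[idx])      # ∇V0(x) + (N/p) Σ_{i∈S} ∇V_i(x)
    x += -gamma * grad + np.sqrt(2 * gamma) * rng.standard_normal()
    if k >= 5_000:
        samples.append(x)

print(post_mean, np.mean(samples))  # means agree
print(post_var, np.var(samples))    # empirical variance inflated by minibatch noise
```

The posterior mean is recovered, but the sampled variance is noticeably larger than the true posterior variance: at this scaling the minibatch gradient noise dominates the injected Gaussian noise, which is exactly the problem the next slide addresses.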
SGLD algorithm: need for variance reduction
Xk+1 = Xk − γ (∇V0(Xk) + (N/p) Σi∈Sk+1 ∇Vi(Xk)) + √(2γ) Zk+1
Two sources of noise. For γ = γ0/N:
1 The noise from the gradient estimates is too big ⇒ no sampling.
2 Need to decrease the variance: assume x∗ is the unique minimizer of V,
Xk+1 = Xk − γ (∇V0(Xk) − ∇V0(x∗) + (N/p) Σi∈Sk+1 (∇Vi(Xk) − ∇Vi(x∗))) + √(2γ) Zk+1
(this estimate is still unbiased since ∇V(x∗) = 0).
If γ = γ0/N, SGLD ∼ SGD. Use variance control to sample.
Precise analysis in Moulines et al. (2018).
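The same Gaussian toy model as before (my own example) shows the effect of the control variates; for this quadratic potential each term ∇Vi(x) − ∇Vi(x∗) = x − x∗ is deterministic, so the minibatch noise vanishes entirely and the bias disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, gamma = 10_000, 100, 1e-5
z = rng.normal(2.0, 1.0, size=N)

post_mean, post_var = z.sum() / (N + 1), 1.0 / (N + 1)
x_star = post_mean  # minimizer of V = V0 + Σ V_i for this model

x, samples = 0.0, []
for k in range(20_000):
    idx = rng.choice(N, size=p, replace=False)
    # Control variates: subtract the gradients evaluated at x* from each term.
    grad = (x - x_star) + (N / p) * np.sum((x - z[idx]) - (x_star - z[idx]))
    x += -gamma * grad + np.sqrt(2 * gamma) * rng.standard_normal()
    if k >= 5_000:
        samples.append(x)

print(np.mean(samples), np.var(samples))  # both close to the true posterior now
```

Compared with the plain SGLD run, the empirical variance now matches the posterior variance 1/(N+1) up to the usual O(γ) discretization bias.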
Non-convex Learning via SGLD
Classical learning problem: find the minimum of F(w) := EP[f(w, Z)], where f is not necessarily convex.
Call FZ(w) := (1/n) Σi=1..n f(w, zi).
Consider the Langevin diffusion and its associated discretization:
dXt = −∇FZ(Xt) dt + √(2β−1) dBt
Xk+1 = Xk − η ∇f(Xk, zk) + √(2ηβ−1) ξk
This converges to dµz(dw) ∝ exp(−βFz(w)); when β ∼ 1/T is big, it concentrates around the minimizers of Fz, and hence of F.
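The concentration of exp(−βFz) around minimizers can be checked directly on a grid; the double-well potential below is my own illustrative choice.

```python
import numpy as np

# Gibbs measure ∝ exp(-β F) for the double well F(w) = (w² - 1)²,
# minimized at w = ±1. Mass near the minimizers grows with β.
w = np.linspace(-3, 3, 10_001)
F = (w**2 - 1) ** 2

masses = []
for beta in (1.0, 10.0, 100.0):
    dens = np.exp(-beta * F)
    dens /= dens.sum()  # normalize on the grid
    masses.append(dens[np.abs(np.abs(w) - 1) < 0.2].sum())
print(masses)  # increasing in beta
```

At low temperature (large β) essentially all the mass sits within 0.2 of w = ±1, which is why annealing β is a route to global optimization; the price, as the metastability discussion in the conclusion suggests, is that crossing between the two wells then takes exponentially long.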
Non-convex Learning via SGLD

Xk+1 = Xk − η ∇f(Xk, zk) + √(2ηβ−1) ξk
(Xk) converges to dµz(w) ∝ exp(−βFz(w)), β ∼ 1/T.

Theorem (Raginsky, Rakhlin, Telgarsky (2018))
For k ≥ ε−4 and η ≤ ε4,
E F(Xk) − F∗ ≤ c ε + (β + d)²/n + d log(β + 1)/β

Sketch of proof: control of three terms:
how far the algorithm is from the true diffusion and its invariant measure exp(−βFz(w))
how far FZ is from F
how close a sample from exp(−βFz(w)) is to a minimizer of FZ, in terms of β
Conclusion
We have seen how Langevin dynamics can be used to derive new algorithms for:
Sampling
Bayesian learning
Non-convex optimization
Problem with non-convexity: metastability of the Markov process −→ an old problem in computational chemistry.
"Particles remain trapped in wells for a long time before escaping."
There has been a huge effort in that community to tackle this problem.
Inspiration for Machine Learning?