Entropy-SGD: biasing gradient descent into wide valleys
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina
Empirical validation
Hessian of small-LeNet at an optimum
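[Figure: two histograms of the Hessian eigenvalues, with Eigenvalues on the x-axis and Frequency (log scale) on the y-axis. One panel covers eigenvalues from about -5 to 40; the other zooms in on the negative range from -0.5 to 0.0.]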
Short negative tail
Motivation
All-CNN on CIFAR-10
[Figure: % Error and Cross-Entropy Loss versus Epochs × L (0 to 200), SGD compared with Entropy-SGD. Annotated values: 7.71% and 7.81% error; 0.0353 and 0.0336 cross-entropy loss.]
PTB and char-RNN
[Figure: Perplexity versus Epochs × L (0 to 50). Adam compared with Entropy-Adam, annotated 1.224 (Test: 1.226) and 1.213 (Test: 1.217). SGD compared with Entropy-SGD, annotated 81.43 (Test: 78.6) and 80.116 (Test: 77.656).]
Local entropy amplifies wide minima
Discrete Perceptrons
‣ What is the shape of the energy landscape?
‣ Reinforce SGD with properties of the loss function
‣ Does geometry connect to generalization?
vs. complexity of training
Modify the loss function
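[Figure: a one-dimensional loss $f(x)$ plotted together with its local entropy $F(x, 10^3)$ and $F(x, 2 \times 10^4)$. Marked points: $x_{\text{candidate}}$, $\hat{x}_F$, and $\hat{x}_f$. The plot labels the original global minimum and the new global minimum.]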
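The effect of the modified loss can be reproduced numerically in one dimension. The following is a minimal sketch, not the authors' code: it assumes a hypothetical toy loss `toy_loss` with one narrow, slightly deeper valley and one wide, shallower valley, takes the kernel to be a normalized Gaussian with variance `gamma`, and evaluates the convolution by brute-force summation on a grid.

```python
import numpy as np

def local_entropy_1d(f_vals, grid, gamma):
    """Approximate F(x, gamma) = -log[(G_gamma * exp(-f))(x)] on a uniform grid."""
    dx = grid[1] - grid[0]
    # Squared distance between every evaluation point x (rows) and x' (columns).
    sq_dist = (grid[:, None] - grid[None, :]) ** 2
    # Log of the integrand: Gaussian kernel of variance gamma times exp(-f(x')).
    log_integrand = (-f_vals[None, :] - sq_dist / (2.0 * gamma)
                     - 0.5 * np.log(2.0 * np.pi * gamma))
    # Numerically stable log-sum-exp over x'.
    peak = log_integrand.max(axis=1, keepdims=True)
    integral = np.exp(log_integrand - peak).sum(axis=1) * dx
    return -(peak[:, 0] + np.log(integral))

def toy_loss(x):
    """Hypothetical loss: narrow valley at x = -1, wide valley at x = +1."""
    narrow = np.exp(-(x + 1.0) ** 2 / (2.0 * 0.05 ** 2))
    wide = 0.8 * np.exp(-(x - 1.0) ** 2 / (2.0 * 0.5 ** 2))
    return 1.0 - narrow - wide

grid = np.linspace(-3.0, 3.0, 1201)
f_vals = toy_loss(grid)
F_vals = local_entropy_1d(f_vals, grid, gamma=0.05)

x_min_f = grid[np.argmin(f_vals)]  # minimizer of the original loss
x_min_F = grid[np.argmin(F_vals)]  # minimizer of the local entropy
print(f"argmin f = {x_min_f:.2f}, argmin F = {x_min_F:.2f}")
```

In this toy setup the minimizer of `f_vals` sits in the narrow valley, while the minimizer of `F_vals` moves to the wide one, which mirrors the "original global minimum" and "new global minimum" labels in the figure.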
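‣ Modified energy landscape is smoother by a factor $\frac{1}{1+\gamma c}$ if there exists $c > 0$ such that $\lambda\left(\nabla^2 f(x)\right) \notin [-2\gamma^{-1},\, c]$
Theorem: Bound generalization error using stability (Hardt et al., ‘15). If $f(x)$ is $\alpha$-Lipschitz and $\beta$-smooth,
$$\epsilon_{\text{Entropy-SGD}} \lesssim \left(\frac{\alpha}{T}\right)^{\left[1 - \frac{1}{1+\gamma c}\right]\beta} \epsilon_{\text{SGD}}$$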
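As a sanity check on the smoothing factor, a one-dimensional quadratic loss can be worked out in closed form. This is an illustrative sketch under two assumptions that the poster does not state: the loss is $f(x) = \tfrac{c}{2}x^2$, and $G_\gamma$ is a normalized Gaussian with variance $\gamma$.

```latex
\begin{align*}
e^{-F(x,\gamma)}
  &= \int \frac{1}{\sqrt{2\pi\gamma}}
     \exp\!\left(-\frac{(x-x')^2}{2\gamma}\right)
     \exp\!\left(-\frac{c}{2}x'^2\right) dx'
   = \frac{1}{\sqrt{1+\gamma c}}
     \exp\!\left(-\frac{c\,x^2}{2(1+\gamma c)}\right), \\
F(x,\gamma)
  &= \frac{c}{2(1+\gamma c)}\,x^2 + \frac{1}{2}\log(1+\gamma c), \\
\nabla^2 F(x,\gamma)
  &= \frac{c}{1+\gamma c} = \frac{1}{1+\gamma c}\,\nabla^2 f(x).
\end{align*}
```

For this quadratic the curvature shrinks by exactly $\frac{1}{1+\gamma c}$; with $\gamma c = 1$, for instance, the curvature is halved.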
‣ Simulated annealing fails
Braunstein, Zecchina ’05
Baldassi et al. ‘16
$$F(x, d) = \log \left|\left\{ x' : \#\,\text{mistakes}(x') = 0,\ \|x - x'\| = d \right\}\right|$$
dense clusters provably generalize better, absent in the standard replica analysis
isolated solutions
‣ Local entropy counts #solutions in a neighborhood
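For a very small binary perceptron, the count behind the discrete local entropy can be done by brute force. The sketch below is only an illustration under stated assumptions: weights and patterns take values in {-1, +1}, the distance $\|x - x'\|$ is read as Hamming distance, and the labels come from a random "teacher" vector so that at least one zero-mistake solution exists. The names `count_mistakes` and `discrete_local_entropy` are hypothetical helpers, not taken from the cited work.

```python
import itertools
import math
import numpy as np

def count_mistakes(w, patterns, labels):
    """Number of patterns misclassified by the perceptron sign(patterns @ w)."""
    return int(np.sum(np.sign(patterns @ w) != labels))

def discrete_local_entropy(x_ref, d, patterns, labels):
    """log of the number of zero-mistake weight vectors at Hamming distance d
    from x_ref; returns -inf when no such vector exists."""
    n_solutions = 0
    for flip in itertools.combinations(range(len(x_ref)), d):
        w = x_ref.copy()
        w[np.array(flip, dtype=int)] *= -1  # flip exactly d weights
        if count_mistakes(w, patterns, labels) == 0:
            n_solutions += 1
    return math.log(n_solutions) if n_solutions > 0 else -math.inf

rng = np.random.default_rng(0)
n_weights, n_patterns = 11, 5  # odd n_weights, so the pre-activation is never 0
teacher = rng.choice([-1, 1], size=n_weights)
patterns = rng.choice([-1, 1], size=(n_patterns, n_weights))
labels = np.sign(patterns @ teacher)

# Local entropy around the teacher for distances d = 0, 1, 2, 3.
entropies = [discrete_local_entropy(teacher, d, patterns, labels) for d in range(4)]
print(entropies)
```

Exhaustive enumeration scales combinatorially with the number of weights, so this only serves to make the definition concrete.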
slow-down
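‣ Smooth using a convolution, the “local entropy”:
$$F(x, \gamma) = -\log\left[ G_\gamma * e^{-f(x)} \right]$$
Here $f(x)$ is the original loss, $G_\gamma$ is a Gaussian kernel, and $\gamma$ is the “scope”.
‣ Gradient:
$$\nabla F(x, \gamma) = \gamma^{-1} \left( x - \langle x' \rangle \right)$$
where $\langle x' \rangle$ is the expected value of a local Gibbs distribution:
$$\langle x' \rangle = \frac{1}{Z(x)} \int_{x'} x' \exp\left( -f(x') - \frac{1}{2\gamma} \|x - x'\|^2 \right) dx'$$
‣ Estimate the gradient using MCMC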
focuses on a neighborhood
extremely general and scalable
decrease with training iterations
Baldassi et al., ‘15
Langevin dynamics
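The MCMC estimate of the local-entropy gradient can be sketched with unadjusted Langevin dynamics. This is a minimal illustration rather than the authors' algorithm: it uses full gradients, a plain running average, and a hypothetical quadratic loss $f(x) = \tfrac{a}{2}x^2$, for which the exact gradient of the local entropy is $a x / (1 + a\gamma)$. The function name `local_entropy_grad` and all hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def local_entropy_grad(x, grad_f, gamma, n_steps=20000, burn_in=1000,
                       step=0.05, seed=0):
    """Estimate grad F(x, gamma) = (x - <x'>) / gamma by sampling x' from the
    local Gibbs distribution exp(-f(x') - |x - x'|^2 / (2 gamma))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x_prime = x.copy()
    running_sum = np.zeros_like(x)
    for t in range(n_steps + burn_in):
        # Gradient of the local energy f(x') + |x' - x|^2 / (2 gamma) w.r.t. x'.
        drift = grad_f(x_prime) + (x_prime - x) / gamma
        noise = rng.standard_normal(x.shape)
        # Unadjusted Langevin step: gradient descent plus Gaussian noise.
        x_prime = x_prime - step * drift + np.sqrt(2.0 * step) * noise
        if t >= burn_in:
            running_sum += x_prime
    mean_x_prime = running_sum / n_steps  # Monte Carlo estimate of <x'>
    return (x - mean_x_prime) / gamma

a, gamma = 2.0, 0.5
x0 = np.array([1.0])
estimate = local_entropy_grad(x0, lambda z: a * z, gamma)
exact = a * x0 / (1.0 + a * gamma)  # closed form for the quadratic loss
print(estimate, exact)
```

The estimate is noisy and depends on the step size and the number of samples; the comparison against the closed form is only available because the toy loss is quadratic.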