Page 1:

ON THE OPTIMIZATION LANDSCAPE OF NEURAL NETWORKS

JOAN BRUNA, CIMS + CDS, NYU

in collaboration with D. Freeman (UC Berkeley), Luca Venturi & Afonso Bandeira (NYU)

Page 2:

MOTIVATION

➤ We consider the standard Empirical Risk Minimization setup:

$E(\Theta) = \mathbb{E}_{(X,Y)\sim P}\, \ell(\Phi(X;\Theta), Y)$, with $\ell(z)$ convex;

$\hat{E}(\Theta) = \mathbb{E}_{(X,Y)\sim \hat{P}}\, \ell(\Phi(X;\Theta), Y) + R(\Theta)$, with $\hat{P} = \frac{1}{n}\sum_i \delta_{(x_i,y_i)}$ and $R(\Theta)$ a regularization term.

Page 3:

MOTIVATION

➤ We consider the standard Empirical Risk Minimization setup:

$E(\Theta) = \mathbb{E}_{(X,Y)\sim P}\, \ell(\Phi(X;\Theta), Y)$, with $\ell(z)$ convex;

$\hat{E}(\Theta) = \mathbb{E}_{(X,Y)\sim \hat{P}}\, \ell(\Phi(X;\Theta), Y) + R(\Theta)$, with $\hat{P} = \frac{1}{L}\sum_{l\leq L} \delta_{(x_l,y_l)}$ and $R(\Theta)$ a regularization term.

➤ Population loss decomposition (aka "fundamental theorem of ML"):

$E(\Theta^*) = \underbrace{\hat{E}(\Theta^*)}_{\text{training error}} + \underbrace{E(\Theta^*) - \hat{E}(\Theta^*)}_{\text{generalization gap}}$   (a numerical sketch follows below).

➤ Long history of techniques to provably control the generalization error via appropriate regularization.

➤ Generalization error and optimization are entangled [Bottou & Bousquet].
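A minimal numerical sketch of this decomposition (not from the slides: the linear model, quadratic loss, and sample sizes are arbitrary choices, and the regularization term R(Θ) is omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 10                                   # training-set size and input dimension

    theta_true = rng.normal(size=d)                 # population: Y = <theta_true, X> + noise

    def sample(m):
        X = rng.normal(size=(m, d))
        Y = X @ theta_true + 0.5 * rng.normal(size=m)
        return X, Y

    def risk(X, Y, theta):                          # quadratic loss ell
        return np.mean((X @ theta - Y) ** 2)

    Xtr, Ytr = sample(n)
    theta_star = np.linalg.lstsq(Xtr, Ytr, rcond=None)[0]   # empirical risk minimizer Theta*

    train_err = risk(Xtr, Ytr, theta_star)                  # hat{E}(Theta*): training error
    Xte, Yte = sample(200_000)                              # large fresh sample approximates E(Theta*)
    gap = risk(Xte, Yte, theta_star) - train_err            # generalization gap
    print(f"training error {train_err:.3f}   generalization gap {gap:.3f}")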

Page 4:

MOTIVATION

➤ However, when $\Phi(X;\Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:

➤ Stochastic Optimization
➤ "During training, it adds the sampling noise that corresponds to the empirical-population mismatch" [Léon Bottou].

➤ Make the model convolutional and very large.
➤ see e.g. "Understanding Deep Learning Requires Rethinking Generalization", [Ch. Zhang et al., ICLR'17].

Page 5:

MOTIVATION

➤ However, when $\Phi(X;\Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:

➤ Stochastic Optimization
➤ Make the model convolutional and as large as possible.

➤ We first address how overparametrization affects the energy landscapes $E(\Theta), \hat{E}(\Theta)$.

➤ Goal 1: Study simple topological properties of these landscapes for half-rectified neural networks.

➤ Goal 2: Estimate simple geometric properties with efficient, scalable algorithms. Diagnostic tool.

Page 6:

OUTLINE

➤ Topology of Neural Network Energy Landscapes

➤ Geometry of Neural Network Energy Landscapes

[First page of the following paper reproduced on the slide:]

VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS

Hao Li^1, Zheng Xu^1, Gavin Taylor^2, Tom Goldstein^1
^1 University of Maryland, College Park, ^2 United States Naval Academy
{haoli,xuzh,tomg}@cs.umd.edu, [email protected]

ABSTRACT

Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, are not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.

Figure 1: The loss surfaces of ResNet-56 (a) without and (b) with skip connections. The vertical axis is logarithmic to show dynamic range. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.

arXiv:1712.09913v1 [cs.LG] 28 Dec 2017

[Li et al.'17]

Page 7:

PRIOR RELATED WORK

➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].

➤ Tensor factorization models capture some of the essence of the non-convexity [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].

Page 8:

PRIOR RELATED WORK

➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].

➤ Tensor factorization models capture some of the essence of the non-convexity [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].

➤ [Safran and Shamir '15] studies basins of attraction in neural networks in the overparametrized regime.

➤ [Soudry '16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the overparametrized regime.

Page 9:

PRIOR RELATED WORK

➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].

➤ Tensor factorization models capture some of the essence of the non-convexity [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].

➤ [Safran and Shamir '15] studies basins of attraction in neural networks in the overparametrized regime.

➤ [Soudry '16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the overparametrized regime.

➤ [Tian '17] studies learning dynamics in a Gaussian generative setting.

➤ [Chaudhari et al.'17] studies local smoothing of the energy landscape using the local entropy method from statistical physics.

➤ [Pennington & Bahri '17]: Hessian analysis using Random Matrix Theory.

➤ [Soltanolkotabi, Javanmard & Lee '17]: layer-wise quadratic NNs.

Page 10:

NON-CONVEXITY ≠ NOT OPTIMIZABLE

➤ We can perturb any convex function in such a way that it is no longer convex, but such that gradient descent still converges.

➤ E.g. quasi-convex functions.

Page 11:

NON-CONVEXITY ≠ NOT OPTIMIZABLE

➤ We can perturb any convex function in such a way that it is no longer convex, but such that gradient descent still converges.

➤ E.g. quasi-convex functions.

➤ In particular, deep models have internal symmetries:

$F(\theta) = F(g.\theta)$, for $g \in G$ compact.

Page 12:

ANALYSIS OF NON-CONVEX LOSS SURFACES

➤ Given a loss $E(\theta)$, $\theta \in \mathbb{R}^d$, we consider its representation in terms of level sets:

$E(\theta) = \int_0^\infty \mathbf{1}(\theta \in \Omega_u)\, du$, where $\Omega_u = \{y \in \mathbb{R}^d \,;\, E(y) \leq u\}$.

Page 13:

ANALYSIS OF NON-CONVEX LOSS SURFACES

➤ Given a loss $E(\theta)$, $\theta \in \mathbb{R}^d$, we consider its representation in terms of level sets:

$E(\theta) = \int_0^\infty \mathbf{1}(\theta \in \Omega_u)\, du$, where $\Omega_u = \{y \in \mathbb{R}^d \,;\, E(y) \leq u\}$.

➤ A first notion we address is the topology of the level sets $\Omega_u$.

➤ In particular, we ask how connected they are, i.e. how many connected components $N_u$ are there at each energy level $u$?

Page 14:

ANALYSIS OF NON-CONVEX LOSS SURFACES

➤ Given a loss $E(\theta)$, $\theta \in \mathbb{R}^d$, we consider its representation in terms of level sets:

$E(\theta) = \int_0^\infty \mathbf{1}(\theta \in \Omega_u)\, du$, where $\Omega_u = \{y \in \mathbb{R}^d \,;\, E(y) \leq u\}$.

➤ A first notion we address is the topology of the level sets $\Omega_u$.

➤ In particular, we ask how connected they are, i.e. how many connected components $N_u$ are there at each energy level $u$? (See the numerical sketch below.)

➤ Related to the presence of poor local minima:

Proposition: If $N_u = 1$ for all $u$, then $E$ has no poor local minima
(i.e. no local minima $y^*$ s.t. $E(y^*) > \min_y E(y)$).
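A minimal numerical sketch of $N_u$ on a hand-made two-dimensional loss (the toy function, grid resolution, and scipy connected-component labelling are illustrative choices, not part of the slides):

    import numpy as np
    from scipy import ndimage

    def E(x, y):
        # Toy non-convex loss: two paraboloid basins whose minimum values differ by 0.3.
        return np.minimum((x - 1.5) ** 2 + y ** 2, (x + 1.5) ** 2 + y ** 2 + 0.3)

    xs = np.linspace(-4.0, 4.0, 400)
    X, Y = np.meshgrid(xs, xs)
    Z = E(X, Y)

    for u in [0.1, 0.5, 2.0, 8.0]:
        mask = Z <= u                               # discretized sublevel set Omega_u
        _, N_u = ndimage.label(mask)                # number of connected components on the grid
        print(f"u = {u:4.1f}   N_u = {N_u}")

At low $u$ only one basin is populated, at intermediate $u$ the two basins give $N_u = 2$, and for large $u$ the sublevel set merges back into a single component.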

Page 15:

LINEAR VS NON-LINEAR DEEP MODELS

➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:

$E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \|W_K \cdots W_1 X - Y\|^2$, with $X \in \mathbb{R}^n$, $Y \in \mathbb{R}^m$, $W_k \in \mathbb{R}^{n_k \times n_{k-1}}$.

Page 16:

LINEAR VS NON-LINEAR DEEP MODELS

➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:

$E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \|W_K \cdots W_1 X - Y\|^2$, with $X \in \mathbb{R}^n$, $Y \in \mathbb{R}^m$, $W_k \in \mathbb{R}^{n_k \times n_{k-1}}$.

Theorem [Kawaguchi'16]: If $\Sigma = \mathbb{E}(XX^T)$ and $\mathbb{E}(XY^T)$ are full-rank and $\Sigma$ has distinct eigenvalues, then $E(\Theta)$ has no poor local minima.

• studying critical points.
• later generalized in [Hardt & Ma'16, Lu & Kawaguchi'17].

Page 17:

LINEAR VS NON-LINEAR DEEP MODELS

$E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \|W_K \cdots W_1 X - Y\|^2$.

Proposition [BF'16]:
1. If $n_k > \min(n,m)$ for $0 < k < K$, then $N_u = 1$ for all $u$.
2. (2-layer case, ridge regression) $E(W_1,W_2) = \mathbb{E}_{(X,Y)\sim P} \|W_2 W_1 X - Y\|^2 + \lambda(\|W_1\|^2 + \|W_2\|^2)$ satisfies $N_u = 1\ \forall u$ if $n_1 > \min(n,m)$.

➤ We pay an extra redundancy price to get simple topology.

Page 18:

LINEAR VS NON-LINEAR DEEP MODELS

$E(W_1, \ldots, W_K) = \mathbb{E}_{(X,Y)\sim P} \|W_K \cdots W_1 X - Y\|^2$.

Proposition [BF'16]:
1. If $n_k > \min(n,m)$ for $0 < k < K$, then $N_u = 1$ for all $u$.
2. (2-layer case, ridge regression) $E(W_1,W_2) = \mathbb{E}_{(X,Y)\sim P} \|W_2 W_1 X - Y\|^2 + \lambda(\|W_1\|^2 + \|W_2\|^2)$ satisfies $N_u = 1\ \forall u$ if $n_1 > \min(n,m)$.

➤ We pay an extra redundancy price to get simple topology.

➤ This simple topology is an "artifact" of the linearity of the network:

Proposition [BF'16]: For any architecture (choice of internal dimensions), there exists a distribution $P(X,Y)$ such that $N_u > 1$ in the ReLU case $\rho(z) = \max(0,z)$.

Page 19:

PROOF SKETCH

➤ Goal: Given $\Theta^A = (W_1^A, \ldots, W_K^A)$ and $\Theta^B = (W_1^B, \ldots, W_K^B)$, we construct a path $\gamma(t)$ that connects $\Theta^A$ with $\Theta^B$ such that $E(\gamma(t)) \leq \max(E(\Theta^A), E(\Theta^B))$.

Page 20:

PROOF SKETCH

➤ Goal: Given $\Theta^A = (W_1^A, \ldots, W_K^A)$ and $\Theta^B = (W_1^B, \ldots, W_K^B)$, we construct a path $\gamma(t)$ that connects $\Theta^A$ with $\Theta^B$ such that $E(\gamma(t)) \leq \max(E(\Theta^A), E(\Theta^B))$.

➤ Main idea:
1. Induction on $K$.
2. Lift the parameter space to $\widetilde{W} = W_1 W_2$: the problem is convex $\Rightarrow$ there exists a (linear) path $\tilde{\gamma}(t)$ that connects $\Theta^A$ and $\Theta^B$.
3. Write the path in terms of the original coordinates by factorizing $\tilde{\gamma}(t)$ (see the numerical sketch below).

➤ Simple fact: If $M_0, M_1 \in \mathbb{R}^{n \times n'}$ with $n' > n$, then there exists a path $t \in [0,1] \mapsto \gamma(t)$ with $\gamma(0) = M_0$, $\gamma(1) = M_1$ and $M_0, M_1 \in \mathrm{span}(\gamma(t))$ for all $t \in (0,1)$.
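A minimal numerical sketch of the lifting step for $K = 2$ (dimensions, data and the SVD refactorization are arbitrary choices; the proof additionally uses the "simple fact" above to make the factors move continuously, which this sketch does not attempt):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, n1 = 5, 3, 8                               # hidden width n1 > min(n, m)
    X = rng.normal(size=(n, 200))
    Y = rng.normal(size=(m, 200))

    def E(W1, W2):                                   # empirical loss of the linear network W2 W1
        return np.mean(np.sum((W2 @ W1 @ X - Y) ** 2, axis=0))

    # Two arbitrary parameter configurations Theta_A, Theta_B.
    W1a, W2a = rng.normal(size=(n1, n)), rng.normal(size=(m, n1))
    W1b, W2b = rng.normal(size=(n1, n)), rng.normal(size=(m, n1))

    # Lift to M = W2 W1: the loss is convex in M, so the linear path between the lifted
    # endpoints never exceeds the larger endpoint energy; then refactor each M_t with
    # inner dimension n1.
    Ma, Mb = W2a @ W1a, W2b @ W1b
    for t in np.linspace(0.0, 1.0, 11):
        Mt = (1 - t) * Ma + t * Mb
        U, s, Vt = np.linalg.svd(Mt)
        W1t = np.zeros((n1, n)); W1t[:len(s)] = np.diag(s) @ Vt[:len(s)]
        W2t = np.zeros((m, n1)); W2t[:, :len(s)] = U
        print(f"t = {t:.1f}   E = {E(W1t, W2t):.3f}")    # stays <= max of the endpoint energies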

Page 21:

MODEL SYMMETRIES [with L. Venturi, A. Bandeira, '17]

➤ How much extra redundancy are we paying to achieve $N_u = 1$ instead of simply no poor local minima?

Page 22:

MODEL SYMMETRIES [with L. Venturi, A. Bandeira, '17]

➤ How much extra redundancy are we paying to achieve $N_u = 1$ instead of simply no poor local minima?

➤ In the multilinear case, we don't need $n_k > \min(n,m)$:

$(W_1, W_2, \ldots, W_K) \sim (\widetilde{W}_1, \ldots, \widetilde{W}_K)$, with $\widetilde{W}_k = U_k W_k U_{k-1}^{-1}$, $U_k \in GL(\mathbb{R}^{n_k \times n_k})$.

Page 23:

MODEL SYMMETRIES [with L. Venturi, A. Bandeira, '17]

➤ How much extra redundancy are we paying to achieve $N_u = 1$ instead of simply no poor local minima?

➤ In the multilinear case, we don't need $n_k > \min(n,m)$:

$(W_1, W_2, \ldots, W_K) \sim (\widetilde{W}_1, \ldots, \widetilde{W}_K)$, with $\widetilde{W}_k = U_k W_k U_{k-1}^{-1}$, $U_k \in GL(\mathbb{R}^{n_k \times n_k})$ (numerical check below).

➤ We do the same analysis in the quotient space defined by this equivalence relationship.

➤ Construct paths on the Grassmannian manifold of linear subspaces.

➤ Generalizes the best known results for the multilinear case (no assumptions on the covariance).

Theorem [LBB'17]: The multilinear regression $\mathbb{E}_{(X,Y)\sim P}\|W_K \cdots W_1 X - Y\|^2$ has no poor local minima.
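A minimal numerical check of this symmetry (hypothetical dimensions; the changes of basis act on the hidden layers only, i.e. $U_0$ and $U_K$ are taken to be the identity so that the end-to-end map, and hence the loss, is preserved):

    import numpy as np

    rng = np.random.default_rng(2)
    dims = [4, 6, 5, 3]                                        # n0 (input), n1, n2, n3 (output)
    Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]

    # Invertible changes of basis on the hidden layers; U_0 = U_K = identity.
    Us = [np.eye(dims[0])] + [rng.normal(size=(d, d)) for d in dims[1:-1]] + [np.eye(dims[-1])]
    Ws_tilde = [Us[k + 1] @ Ws[k] @ np.linalg.inv(Us[k]) for k in range(3)]

    def end_to_end(mats):                                      # W_K ... W_1
        out = np.eye(dims[0])
        for W in mats:
            out = W @ out
        return out

    print(np.allclose(end_to_end(Ws), end_to_end(Ws_tilde)))   # True: same map, same loss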

Page 24:

BETWEEN LINEAR AND RELU: POLYNOMIAL NETS

➤ Quadratic nonlinearities $\rho(z) = z^2$ are a simple extension of the linear case, by lifting or "kernelizing":

$\rho(Wx) = A_W X$, with $X = xx^T$ and $A_W = (W_k W_k^T)_{k \leq M}$ (here $W_k$ denotes the $k$-th row of $W$).

Page 25:

BETWEEN LINEAR AND RELU: POLYNOMIAL NETS

➤ Quadratic nonlinearities $\rho(z) = z^2$ are a simple extension of the linear case, by lifting or "kernelizing":

$\rho(Wx) = A_W X$, with $X = xx^T$ and $A_W = (W_k W_k^T)_{k \leq M}$.

➤ Level sets are connected with sufficient overparametrisation:

Proposition: If $M_k \geq 3N_k^2\ \forall k \leq K$, then the landscape of the $K$-layer quadratic network is simple: $N_u = 1\ \forall u$.

Page 26:

BETWEEN LINEAR AND RELU: POLYNOMIAL NETS

➤ Quadratic nonlinearities $\rho(z) = z^2$ are a simple extension of the linear case, by lifting or "kernelizing" (see the numerical check below):

$\rho(Wx) = A_W X$, with $X = xx^T$ and $A_W = (W_k W_k^T)_{k \leq M}$.

➤ Level sets are connected with sufficient overparametrisation:

Proposition: If $M_k \geq 3N_k^2\ \forall k \leq K$, then the landscape of the $K$-layer quadratic network is simple: $N_u = 1\ \forall u$.

➤ No poor local minima with much better bounds in the scalar-output two-layer case:

Theorem [LBB'17]: The two-layer quadratic network optimization $L(U,W) = \mathbb{E}_{(X,Y)\sim P}\|U(WX)^2 - Y\|^2$ has no poor local minima if $M \geq 2N$.
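A minimal numerical check of the quadratic lifting (sizes are arbitrary): each coordinate of $\rho(Wx)$ with $\rho(z) = z^2$ is linear in the lifted variable $X = xx^T$.

    import numpy as np

    rng = np.random.default_rng(3)
    N, M = 6, 10
    x = rng.normal(size=N)
    W = rng.normal(size=(M, N))

    lhs = (W @ x) ** 2                              # rho(Wx) with rho(z) = z^2
    X_lift = np.outer(x, x)                         # X = x x^T
    A_W = np.stack([np.outer(w, w) for w in W])     # A_W = (W_k W_k^T)_{k <= M}, rows of W
    rhs = np.einsum('kij,ij->k', A_W, X_lift)       # <A_W, X> coordinate-wise
    print(np.allclose(lhs, rhs))                    # True: the layer is linear in X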

Page 27:

ASYMPTOTIC CONNECTEDNESS OF RELU

➤ Good behavior is recovered with nonlinear ReLU networks, provided they are sufficiently overparametrized:

➤ Setup: two-layer ReLU network: $\Phi(X;\Theta) = W_2 \rho(W_1 X)$, $\rho(z) = \max(0,z)$, with $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^m$.

Page 28:

ASYMPTOTIC CONNECTEDNESS OF RELU

➤ Good behavior is recovered with nonlinear ReLU networks, provided they are sufficiently overparametrized:

➤ Setup: two-layer ReLU network: $\Phi(X;\Theta) = W_2 \rho(W_1 X)$, $\rho(z) = \max(0,z)$, with $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^m$.

Theorem [BF'16]: For any $\Theta^A, \Theta^B \in \mathbb{R}^{m \times n} \times \mathbb{R}^m$ with $E(\Theta^{\{A,B\}}) \leq \delta$, there exists a path $\gamma(t)$ from $\Theta^A$ to $\Theta^B$ such that $\forall\, t,\ E(\gamma(t)) \leq \max(\delta, \epsilon)$ and $\epsilon \sim m^{-\frac{1}{n}}$.

➤ Overparametrisation "wipes out" local minima (and group symmetries). (An illustrative experiment follows below.)

➤ The bound is cursed by dimensionality, i.e. exponential in $n$.

➤ The result is based on a local linearization of the ReLU kernel (hence the exponential price).
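An illustrative experiment in the spirit of this theorem (entirely a sketch: the data, the plain full-batch gradient descent loop, and the untuned hyperparameters are arbitrary choices, and the straight segment below is only one candidate path, not the path constructed in the proof): train two independent two-layer ReLU networks of width m and measure the largest empirical loss along the straight line between them.

    import numpy as np

    rng = np.random.default_rng(4)
    n, B = 10, 512
    X = rng.normal(size=(n, B))
    Y = np.sin(X.sum(axis=0, keepdims=True))             # arbitrary scalar target

    def loss(W1, W2):
        return np.mean((W2 @ np.maximum(W1 @ X, 0) - Y) ** 2)

    def train(m, steps=3000, lr=0.01, seed=0):
        r = np.random.default_rng(seed)
        W1 = r.normal(size=(m, n)) / np.sqrt(n)
        W2 = r.normal(size=(1, m)) / np.sqrt(m)
        for _ in range(steps):                            # plain full-batch gradient descent
            H = np.maximum(W1 @ X, 0)
            err = W2 @ H - Y
            gW2 = 2 * err @ H.T / B
            gW1 = 2 * ((W2.T @ err) * (H > 0)) @ X.T / B
            W1 -= lr * gW1
            W2 -= lr * gW2
        return W1, W2

    for m in [4, 32, 256]:                                # increasing overparametrisation
        netA, netB = train(m, seed=1), train(m, seed=2)
        barrier = max(loss((1 - t) * netA[0] + t * netB[0],
                           (1 - t) * netA[1] + t * netB[1]) for t in np.linspace(0, 1, 21))
        print(f"m = {m:3d}   endpoint losses {loss(*netA):.3f} / {loss(*netB):.3f}"
              f"   max loss on segment {barrier:.3f}")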

Page 29:

KERNELS ARE BACK?

➤ The underlying technique we described consists in "convexifying" the problem, by mapping neural parameters $\Theta = (W_1, \ldots, W_k)$ to canonical parameters $A(\Theta)$:

$\Phi(x;\Theta) = W_k \rho(W_{k-1} \ldots \rho(W_1 X))$, and $\Phi(X;\Theta) = \langle \psi(X), A(\Theta) \rangle$.

[Diagram on slide: $\Theta^A, \Theta^B$ mapped to $A(\Theta^A), A(\Theta^B)$.]

Page 30:

KERNELS ARE BACK?

➤ The underlying technique we described consists in "convexifying" the problem, by mapping neural parameters $\Theta = (W_1, \ldots, W_k)$ to canonical parameters $A(\Theta)$:

$\Phi(x;\Theta) = W_k \rho(W_{k-1} \ldots \rho(W_1 X))$, and $\Phi(X;\Theta) = \langle \psi(X), A(\Theta) \rangle$.

➤ This includes Empirical Risk Minimization (since the RKHS is only queried on a finite number of datapoints).

➤ See [Bietti & Mairal '17, Zhang et al.'17, Bach '17] for related work.

Corollary [BBV'17]: If $\dim\{A(w),\, w \in \mathbb{R}^n\} = q < \infty$, then $E(W,U) = \mathbb{E}|U\rho(WX) - Y|^2$, $W \in \mathbb{R}^{M \times N}$, has no poor local minima if $M \geq 2q$.

Page 31:

PARAMETRIC VS MANIFOLD OPTIMIZATION

➤ This suggests thinking about the problem in the functional space generated by the model:

$\mathcal{F}_\Phi = \{\varphi : \mathbb{R}^n \to \mathbb{R}^m \,;\, \varphi(x) = \Phi(x;\Theta) \text{ for some } \Theta\}$,

with target $g^* : x \mapsto \mathbb{E}(Y|x)$, objective $\min_{\varphi \in \mathcal{F}_\Phi} \|\varphi - g^*\|_P$, and inner product $\langle f, g \rangle_P := \mathbb{E}\{f(X) g(X)\}$.

Page 32:

PARAMETRIC VS MANIFOLD OPTIMIZATION

➤ This suggests thinking about the problem in the functional space generated by the model:

$\mathcal{F}_\Phi = \{\varphi : \mathbb{R}^n \to \mathbb{R}^m \,;\, \varphi(x) = \Phi(x;\Theta) \text{ for some } \Theta\}$,

with target $g^* : x \mapsto \mathbb{E}(Y|x)$, objective $\min_{\varphi \in \mathcal{F}_\Phi} \|\varphi - g^*\|_P$, and inner product $\langle f, g \rangle_P := \mathbb{E}\{f(X) g(X)\}$.

➤ Sufficient conditions for success so far: $\mathcal{F}_\Phi$ convex and $\Theta$ sufficiently large so that we can move freely within it.

➤ What happens when the model is not overparametrised?

Page 33:

FROM SIMPLE LANDSCAPES TO ENERGY BARRIER

➤ The energy landscapes of several prototypical models in statistical physics exhibit a so-called energy barrier, e.g. spherical spin glasses (a small sampling sketch follows below):

$H_{N,p}(\sigma) = N^{-(p-1)/2} \sum_{i_1,\ldots,i_p=1}^{N} J_{i_1,\ldots,i_p}\, \sigma_{i_1} \cdots \sigma_{i_p}$, with $\sigma \in S^{N-1}(\sqrt{N})$ and $J_{\mathbf{i}} \sim \mathcal{N}(0,1)$.

[The slide reproduces an excerpt of "Random Matrices and Complexity of Spin Glasses": the expected number of critical points of index $k$ below level $Nu$ satisfies $\lim_{N\to\infty} \frac{1}{N}\log \mathbb{E}\,\mathrm{Crt}_{N,k}(u) = \Theta_{k,p}(u)$ (Theorem 2.5), and the expected total number of critical values below level $Nu$ satisfies $\lim_{N\to\infty} \frac{1}{N}\log \mathbb{E}\,\mathrm{Crt}_{N}(u) = \Theta_{p}(u)$ (Theorem 2.8); Figure 1 plots the non-decreasing functions $\Theta_{k,p}$ for $p=3$ and several values of $k$.]

[Auffinger, Ben Arous, Cerny '11]
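A minimal sketch that draws one such Hamiltonian for $p = 3$ and a small $N$ and evaluates it at random points of the sphere (purely illustrative; the results quoted above concern the $N \to \infty$ asymptotics):

    import numpy as np

    rng = np.random.default_rng(5)
    N, p = 30, 3
    J = rng.normal(size=(N,) * p)                    # i.i.d. couplings J_{i1,...,ip} ~ N(0, 1)

    def H(sigma):
        # H_{N,p}(sigma) = N^{-(p-1)/2} * sum_{i1,...,ip} J_{i1...ip} sigma_{i1} ... sigma_{ip}
        return N ** (-(p - 1) / 2) * np.einsum('ijk,i,j,k->', J, sigma, sigma, sigma)

    for _ in range(3):
        s = rng.normal(size=N)
        s *= np.sqrt(N) / np.linalg.norm(s)          # sigma on the sphere S^{N-1}(sqrt(N))
        print(f"H/N = {H(s) / N:+.3f}")              # intensive energy level u = H/N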

Page 34:

FROM SIMPLE LANDSCAPES TO ENERGY BARRIER?

➤ Does a similar macroscopic picture arise in our setting?

➤ Given $\rho(z)$ homogeneous, assume $\rho(\langle w, X\rangle) = \langle A_w, \psi(X)\rangle$, with $\dim(\psi(X)) = f(N)$.

➤ Define

$\delta(M,N) = \inf_{S;\,\dim(S)=f^{-1}(M)} \ \inf_{\substack{U \in \mathbb{R}^{m\times M} \\ W \in \mathbb{R}^{M\times f^{-1}(M)}}} \ \sup_{\substack{\mathbb{E}\|Z\| \leq N - f^{-1}(M) \\ P_S Z = 0}} \ \mathbb{E}\|U\rho(W P_S X + Z) - Y\|^2$.

➤ Best loss obtained by first projecting the data onto the best possible subspace of dimension $f^{-1}(M)$ and adding bounded noise in the complement.

➤ $\delta(M,N)$ decreases with $M$, and $\delta(f(N),N) = \min_{U,W} E(U,W)$.

Page 35:

FROM SIMPLE LANDSCAPES TO ENERGY BARRIER

➤ Does a similar macroscopic picture arise in our setting?

➤ Given $\rho(z)$ homogeneous, assume $\rho(\langle w, X\rangle) = \langle A_w, \psi(X)\rangle$, with $\dim(\psi(X)) = f(N)$.

➤ Define

$\delta(M,N) = \inf_{S;\,\dim(S)=f^{-1}(M)} \ \inf_{\substack{U \in \mathbb{R}^{m\times M} \\ W \in \mathbb{R}^{M\times f^{-1}(M)}}} \ \sup_{\substack{\mathbb{E}\|Z\| \leq N - f^{-1}(M) \\ P_S Z = 0}} \ \mathbb{E}\|U\rho(W P_S X + Z) - Y\|^2$.

➤ Best loss obtained by first projecting the data onto the best possible subspace of dimension $f^{-1}(M)$ and adding bounded noise in the complement.

➤ $\delta(M,N)$ decreases with $M$, and $\delta(f(N),N) = \min_{U,W} E(U,W)$.

Conjecture [LBB'18]: The loss $L(U,W) = \mathbb{E}\|U\rho(WX) - Y\|^2$ has no poor local minima above the energy barrier $\delta(M,N)$.

Page 36:

FROM TOPOLOGY TO GEOMETRY

➤ The next question we are interested in is conditioning for descent.

➤ Even if level sets are connected, how easy is it to navigate through them?

➤ How "large" and regular are they?

[Illustration: easy to move from one energy level to a lower one vs. hard to move from one energy level to a lower one.]

Page 37:

FROM TOPOLOGY TO GEOMETRY

➤ The next question we are interested in is conditioning for descent.

➤ Even if level sets are connected, how easy is it to navigate through them?

➤ We estimate level set geodesics and measure their length.

[Illustration: easy vs. hard to move from one energy level to a lower one; endpoints $\theta_A$, $\theta_B$ in each panel.]

Page 38:

FINDING CONNECTED COMPONENTS

➤ Suppose $\theta_1, \theta_2$ are such that $E(\theta_1) = E(\theta_2) = u_0$.

➤ They are in the same connected component of $\Omega_{u_0}$ iff there is a path $\gamma(t)$, $\gamma(0) = \theta_1$, $\gamma(1) = \theta_2$, such that $\forall\, t \in (0,1)$, $E(\gamma(t)) \leq u_0$.

➤ Moreover, we penalize the length of the path:

$\forall\, t \in (0,1),\ E(\gamma(t)) \leq u_0$ and $\int \|\dot{\gamma}(t)\|\, dt \leq M$.

Page 39:

FINDING CONNECTED COMPONENTS

➤ Suppose $\theta_1, \theta_2$ are such that $E(\theta_1) = E(\theta_2) = u_0$.

➤ They are in the same connected component of $\Omega_{u_0}$ iff there is a path $\gamma(t)$, $\gamma(0) = \theta_1$, $\gamma(1) = \theta_2$, such that $\forall\, t \in (0,1)$, $E(\gamma(t)) \leq u_0$.

➤ Moreover, we penalize the length of the path:

$\forall\, t \in (0,1),\ E(\gamma(t)) \leq u_0$ and $\int \|\dot{\gamma}(t)\|\, dt \leq M$.

➤ Dynamic programming approach:

[Illustration: endpoints $\theta_1$, $\theta_2$.]

Page 40:

FINDING CONNECTED COMPONENTS

➤ Suppose $\theta_1, \theta_2$ are such that $E(\theta_1) = E(\theta_2) = u_0$.

➤ They are in the same connected component of $\Omega_{u_0}$ iff there is a path $\gamma(t)$, $\gamma(0) = \theta_1$, $\gamma(1) = \theta_2$, such that $\forall\, t \in (0,1)$, $E(\gamma(t)) \leq u_0$.

➤ Moreover, we penalize the length of the path:

$\forall\, t \in (0,1),\ E(\gamma(t)) \leq u_0$ and $\int \|\dot{\gamma}(t)\|\, dt \leq M$.

➤ Dynamic programming approach:

$\theta_m = \frac{\theta_1 + \theta_2}{2}$, $\qquad \theta_3 = \arg\min_{\theta \in H;\, E(\theta) \leq u_0} \|\theta - \theta_m\|$.

[Illustration: $\theta_1$, $\theta_2$, midpoint hyperplane $H$, projected point $\theta_3$.]

Page 41:

FINDING CONNECTED COMPONENTS

➤ Suppose $\theta_1, \theta_2$ are such that $E(\theta_1) = E(\theta_2) = u_0$.

➤ They are in the same connected component of $\Omega_{u_0}$ iff there is a path $\gamma(t)$, $\gamma(0) = \theta_1$, $\gamma(1) = \theta_2$, such that $\forall\, t \in (0,1)$, $E(\gamma(t)) \leq u_0$.

➤ Moreover, we penalize the length of the path:

$\forall\, t \in (0,1),\ E(\gamma(t)) \leq u_0$ and $\int \|\dot{\gamma}(t)\|\, dt \leq M$.

➤ Dynamic programming approach:

$\theta_m = \frac{\theta_1 + \theta_2}{2}$, $\qquad \theta_3 = \arg\min_{\theta \in H;\, E(\theta) \leq u_0} \|\theta - \theta_m\|$.

[Illustration: $\theta_1$, $\theta_2$, midpoint $\theta_m$, projected point $\theta_3$.]

Page 42:

FINDING CONNECTED COMPONENTS

➤ Suppose $\theta_1, \theta_2$ are such that $E(\theta_1) = E(\theta_2) = u_0$.

➤ They are in the same connected component of $\Omega_{u_0}$ iff there is a path $\gamma(t)$, $\gamma(0) = \theta_1$, $\gamma(1) = \theta_2$, such that $\forall\, t \in (0,1)$, $E(\gamma(t)) \leq u_0$.

➤ Moreover, we penalize the length of the path:

$\forall\, t \in (0,1),\ E(\gamma(t)) \leq u_0$ and $\int \|\dot{\gamma}(t)\|\, dt \leq M$.

➤ Dynamic programming approach (a numerical sketch of this recursion follows below):

$\theta_m = \frac{\theta_1 + \theta_2}{2}$, $\qquad \theta_3 = \arg\min_{\theta \in H;\, E(\theta) \leq u_0} \|\theta - \theta_m\|$.

[Illustration: $\theta_1$, $\theta_2$, midpoint hyperplane $H$, projected point $\theta_3$.]
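A minimal sketch of this recursion on a hand-made two-dimensional loss (everything here is an illustrative stand-in: the toy loss, the finite-difference gradient, and the projection step, which simply descends E until the iterate drops below u0 instead of solving the constrained argmin with the hyperplane H):

    import numpy as np

    def E(theta):                                    # toy 2-D loss whose low-loss set is a ring
        x, y = theta
        return (x ** 2 + y ** 2 - 1.0) ** 2

    def grad_E(theta, eps=1e-5):                     # finite-difference gradient
        g = np.zeros(2)
        for i in range(2):
            d = np.zeros(2); d[i] = eps
            g[i] = (E(theta + d) - E(theta - d)) / (2 * eps)
        return g

    def project(theta, u0, steps=200, lr=0.05):
        # Stand-in for theta_3: descend E from the midpoint until E(theta) <= u0.
        for _ in range(steps):
            if E(theta) <= u0:
                break
            theta = theta - lr * grad_E(theta)
        return theta

    def connect(t1, t2, u0, depth=6):
        # Insert the projected midpoint, then recurse on both halves (the bisection above).
        if depth == 0:
            return [t1, t2]
        tm = project((t1 + t2) / 2.0, u0)
        return connect(t1, tm, u0, depth - 1)[:-1] + connect(tm, t2, u0, depth - 1)

    u0 = 0.05
    theta1, theta2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # both satisfy E <= u0
    path = connect(theta1, theta2, u0)
    length = sum(np.linalg.norm(b - a) for a, b in zip(path, path[1:]))
    print(f"max E at the path nodes: {max(E(p) for p in path):.3f}")
    print(f"normalized geodesic length: {length / np.linalg.norm(theta2 - theta1):.2f}")

The final print gives the quantity used in the experiments on the next slides: the path length divided by the straight-line distance between the endpoints.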

Page 43:

NUMERICAL EXPERIMENTS

➤ Compute the length of the geodesic in $\Omega_u$ obtained by the algorithm and normalize it by the Euclidean distance between the endpoints: a measure of the curviness of the level sets.

[Plots: cubic polynomial, CNN/MNIST.]

Page 44:

NUMERICAL EXPERIMENTS

➤ Compute the length of the geodesic in $\Omega_u$ obtained by the algorithm and normalize it by the Euclidean distance between the endpoints: a measure of the curviness of the level sets.

[Plots: CNN/CIFAR-10, LSTM/Penn.]

[Excerpt of the companion ICLR 2017 submission reproduced on the slide:]

Figure 1: (Column a) Average normalized geodesic length and (Column b) average number of beads versus loss. (1) A quadratic regression task. (2) A cubic regression task. (3) A convnet for MNIST. (4) A convnet inspired by Krizhevsky for CIFAR10. (5) An RNN inspired by Zaremba for PTB next-word prediction.

The cubic regression task exhibits an interesting feature around L0 = .15 in Table 1, Fig. 2, where the normalized length spikes, but the number of required beads remains low. Up until this point, the cubic model is strongly convex, so this first spike seems to indicate the onset of non-convex behavior and a concomitant radical change in the geometry of the loss surface for lower loss.

4.2 CONVOLUTIONAL NEURAL NETWORKS

To test the algorithm on larger architectures, we ran it on the MNIST handwritten digit recognition task as well as the CIFAR10 image recognition task, indicated in Table 1, Figs. 3 and 4. Again, the data exhibits strong qualitative similarity with the previous models: normalized length remains low until a threshold loss value, after which it grows approximately as a power law.

Page 45:

ANALYSIS AND PERSPECTIVES

➤ The number of components does not increase: no poor local minima detected so far when using typical datasets and typical architectures (at energy levels explored by SGD).

➤ Level sets become more irregular as energy decreases.

➤ Presence of an "energy barrier"? Extend to truncated Taylor expansions?

➤ Kernels are back? CNN RKHS?

➤ Open: is there a "sweet spot" between overparametrisation and overfitting?

➤ Open: what is the role of Stochastic Optimization in this story?

Page 46:

THANKS!