
Intro. GANs Interpolation

Some New Insights on Regularization and Interpolation Motivated from Neural Networks

Tengyuan Liang

Econometrics and Statistics

1 / 60


Intro. GANs Interpolation

OUTLINE

Generative Adversarial Networks (unsupervised)
• statistical rates
• pair regularization
• optimization

Interpolation (supervised)
• regularization?
• kernel ridgeless regression
• GD on two-layer ReLU networks

2 / 60

Intro. GANs Interpolation

GANs

3 / 60

Intro. GANs Interpolation

GENERATIVE ADVERSARIAL NETWORKS

• GAN: Goodfellow et al. (2014)

• WGAN: Arjovsky et al. (2017); Arjovsky and Bottou (2017)

• MMD GAN: Li, Swersky, and Zemel (2015); Dziugaite, Roy, and Ghahramani (2015); Arbel, Sutherland, Binkowski, and Gretton (2018)

• f-GAN: Nowozin, Cseke, and Tomioka (2016)

• Sobolev GAN: Mroueh et al. (2017)

• many others... Liu, Bousquet, and Chaudhuri (2017); Tolstikhin, Gelly, Bousquet, Simon-Gabriel, and Scholkopf (2017)

Generator g_θ, discriminator f_ω:

U(θ, ω) = E_{X∼P_real} h₁(f_ω(X)) − E_{Z∼P_input} h₂(f_ω(g_θ(Z)))

min_θ max_ω U(θ, ω)

GANs are widely used in practice; however...

4 / 60


Intro. GANs Interpolation

MUCH NEEDS TO BE UNDERSTOOD, IN THEORY

• Approximation:

what dist. can be approximated by the generator gθ(Z)?

• Statistical:

given n samples, what is the statistical/generalization error rate?

• Computational:

local convergence for practical optimization, how to stabilize?

• Landscape:

are local saddle points good globally?

5 / 60

Intro. GANs Interpolation

FORMULATION

D_G: distribution class realized by the generator; F_D: function class realized by the discriminator; ν: target distribution.

population:  µ* := argmin_{µ∈D_G} max_{f∈F_D} [ E_{Y∼µ} f(Y) − E_{X∼ν} f(X) ]

With ν_n the empirical distribution,

empirical:  µ_n := argmin_{µ∈D_G} max_{f∈F_D} [ E_{Y∼µ} f(Y) − E_{X∼ν_n} f(X) ]

• Density learning/estimation: long history in nonparametric statistics; target density ν ∈ W^α, a Sobolev space with smoothness α ≥ 0

Stone (1982); Nemirovski (2000); Tsybakov (2009); Wasserman (2006)

• GAN statistical theory is needed: Arora and Zhang (2017); Arora et al. (2017a,b); Liu et al. (2017)

6 / 60


Intro. GANs Interpolation

DISCRIMINATOR METRIC

Define the critic metric (IPM)

d_F(µ, ν) := sup_{f∈F} [ E_{Y∼µ} f(Y) − E_{X∼ν} f(X) ].

• F Lip-1: Wasserstein metric d_W
• F bounded by 1: total variation/Radon metric d_TV
• RKHS H, F = {f ∈ H : ∥f∥_H ≤ 1}: MMD GAN
• F with Sobolev smoothness β: Sobolev GAN

Statistical question: what is the statistical error rate with n i.i.d. samples, E d_F(ν, µ_n)?
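For the RKHS choice of F, the IPM has a simple closed-form empirical estimate (the MMD V-statistic). A minimal sketch, assuming a Gaussian kernel and a hand-picked bandwidth, neither of which is specified on the slide:

```python
import numpy as np

def mmd2(X, Y, bandwidth=1.0):
    """V-statistic estimate of d_F(mu, nu)^2 when F is the unit ball of the RKHS
    with Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, size=(500, 2))   # Y ~ mu
nu_samples = rng.normal(0.5, 1.0, size=(500, 2))   # X ~ nu
print("empirical MMD^2:", mmd2(mu_samples, nu_samples))
```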

7 / 60


Intro. GANs Interpolation

MINIMAX OPTIMAL RATES: SOBOLEV GAN

Consider the target density ν ∈ G = W^α, a Sobolev space with smoothness α > 0, and the evaluation metric F = W^β with smoothness β > 0.

The minimax optimal rate is

inf_{ν_n} sup_{ν∈G} E d_F(ν, ν_n) ≍ n^{−(α+β)/(2α+d)} ∨ n^{−1/2}.

Theorem (L. ’17 & L. ’18, Sobolev).

Here ν_n is any estimator based on n samples, and d is the dimension.

Mair and Ruymgaart (1996); Liang (2017); Singh et al. (2018)

8 / 60

Intro. GANs Interpolation

MINIMAX OPTIMAL RATES: MMD GAN

Consider a reproducing kernel Hilbert space (RKHS) H:
• integral operator T with eigenvalue decay t_i ≍ i^{−κ}, 0 < κ < ∞
• evaluation metric F = {f ∈ H : ∥f∥_H ≤ 1}
• target density ν(x) in G = {ν : ∥T^{−(α−1)/2} ν∥_H ≤ 1} with smoothness α > 0

The minimax optimal rate is

inf_{ν_n} sup_{ν∈G} E d_F(ν, ν_n) ≾ n^{−(α+1)κ/(2ακ+2)} ∨ n^{−1/2}.

Theorem (L. ’18, RKHS).

κ > 1: intrinsic dimension ∑_{i≥1} t_i = ∑_{i≥1} i^{−κ} ≤ C, parametric rate n^{−(α+1)κ/(2ακ+2)} ∨ n^{−1/2} = n^{−1/2}.

κ < 1: the sample complexity scales as n = ε^{−2 − (2/(α+1))(1/κ − 1)}, with “effective dimension” 1/κ.

9 / 60


Intro. GANs Interpolation

ORACLE INEQUALITY

The generator class D_G may not contain the target density ν: oracle approach.

Let D_G be any generator class, the discriminator metric F_D = W^β, and the target density ν ∈ W^α.

With the empirical density ν_n(x) as plug-in, the GAN estimator

µ_n ∈ argmin_{µ∈D_G} max_{f∈F_D} [ ∫ f(x)µ(x)dx − ∫ f(x)ν_n(x)dx ]

attains a sub-optimal rate

E d_{F_D}(µ_n, ν) ≤ min_{µ∈D_G} d_{F_D}(µ, ν) + n^{−β/d} ∨ (log n / √n).

Corollary (L. ’18).

In contrast, with a smoothed/regularized empirical density ν̃_n(x) as plug-in,

µ̃_n ∈ argmin_{µ∈D_G} max_{f∈F_D} [ ∫ f(x)µ(x)dx − ∫ f(x)ν̃_n(x)dx ],

a faster rate is attainable:

E d_{F_D}(µ̃_n, ν) ≤ min_{µ∈D_G} d_{F_D}(µ, ν) + n^{−(α+β)/(2α+d)} ∨ (1/√n).

Corollary (L. ’18).

Canas and Rosasco (2012)

10 / 60


Intro. GANs Interpolation

SUB-OPTIMALITY AND REGULARIZATION

Regularization helps achieve a faster rate! However, notions of regularization/complexity are not yet well understood for neural nets...

Use ν̃_n, a “smoothed” empirical estimate, which serves as regularization.

For example, kernel smoothing: ν̃_n(x) = (1/(n h_n)) ∑_{i=1}^n K((x − x_i)/h_n)

In practice, SGD still carries through, since sampling from ν̃_n is as easy as sampling from a Gaussian mixture.

It turns out this is used in practice, called “instance noise” or “data augmentation”.

Sønderby et al. (2016); Liang et al. (2017); Arjovsky and Bottou (2017); Mescheder et al. (2018)
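A minimal sketch of the smoothed plug-in ν̃_n as a sampler, assuming a Gaussian kernel K and a hand-picked bandwidth h_n (both illustrative choices): drawing from ν̃_n amounts to picking a real data point and adding kernel noise, which is exactly the instance-noise trick.

```python
import numpy as np

def sample_smoothed_empirical(data, n_samples, bandwidth, rng):
    """Draw samples from the kernel-smoothed empirical density: a Gaussian mixture
    with one component centered at each data point (bandwidth = h_n)."""
    idx = rng.integers(0, len(data), size=n_samples)              # pick data points uniformly
    noise = rng.normal(scale=bandwidth, size=(n_samples, data.shape[1]))
    return data[idx] + noise                                      # "instance noise" on real samples

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 3))                 # stand-in for the real dataset
smoothed = sample_smoothed_empirical(real, n_samples=256, bandwidth=0.1, rng=rng)
# `smoothed` can be fed to the discriminator in place of the raw real batch during SGD.
```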

11 / 60

Intro. GANs Interpolation

Parametric results and pair regularization

12 / 60

Intro. GANs Interpolation

Consider the parametrized GAN estimator

θ̂_{m,n} ∈ argmin_{θ: g_θ∈G} max_{ω: f_ω∈F} [ Ê_m f_ω(g_θ(Z)) − Ê_n f_ω(X) ],

where m and n denote the number of generator samples and real samples, respectively.

13 / 60

Intro. GANs Interpolation

GENERALIZED ORACLE INEQUALITY

approx. err.:  A₁(F, G, ν) := sup_θ inf_ω ∥ log(ν/µ_θ) − f_ω ∥_∞,   A₂(G, ν) := inf_θ ∥ log(µ_θ/ν) ∥_∞^{1/2},

sto. err.:  S_{n,m}(F, G) := Pdim(F) · (log m/m ∨ log n/n) ∨ Pdim(F∘G) · (log m/m),

where Pdim(·) is the pseudo-dimension of the neural network function class.

E d²_TV(ν, µ_{θ̂_{m,n}}),  E d²_W(ν, µ_{θ̂_{m,n}}),  E d_KL(ν ∥ µ_{θ̂_{m,n}}) + E d_KL(µ_{θ̂_{m,n}} ∥ ν)

≤ A₁(F, G, ν) + A₂(G, ν) + S_{n,m}(F, G).

Theorem (L. ’18).

We emphasize the interplay between (G, F) as a pair of tuning parameters for regularization.

14 / 60

Intro. GANs Interpolation

PAIR REGULARIZATION

For instance, one simple form of the interplay is:
fix G, as F increases: A₁(F, G, ν) decreases, A₂(G, ν) stays constant, S_{n,m}(F, G) increases;
fix F, as G increases: A₁(F, G, ν) increases, A₂(G, ν) decreases, S_{n,m}(F, G) increases.

[Figure: diagram over the generator-class vs. discriminator-class plane, indicating in each regime which of the terms A₁, A₂, S_{n,m} dominates.]

15 / 60

Intro. GANs Interpolation

Applications of pair regularization

16 / 60

Intro. GANs Interpolation

APPLICATION I: PARAMETRIC RATES FOR LEAKY RELU NETWORKS

When the generator G and discriminator F are both leaky ReLU networks with depth L (width properly chosen, depending on the dimension).

[Figure: leaky ReLU generator and discriminator architectures.]

When the target density is realizable by the generator,

log µ_θ(x) = c₁ ∑_{l=1}^{L−1} ∑_{i=1}^{d} 1{m_{li}(x) ≥ 0} + c₀.

Bai et al. (2018)

17 / 60

Intro. GANs Interpolation

APPLICATION I: PARAMETRIC RATES FOR LEAKY RELU NETWORKS

When the generator G and discriminator F are both leaky ReLU networks with depth L (width properly chosen, depending on the dimension),

E d²_TV(ν, µ_{θ̂_{m,n}}) ≾ d² L² log(dL) · (log m/m ∨ log n/n).

Theorem (L. ’18, leaky ReLU).

The results hold for very deep networks with depth L = o(√(n/log n)).

17 / 60

Intro. GANs Interpolation

APPLICATION II: LEARNING MULTIVARIATE GAUSSIAN

GANs enjoy near-optimal sampling complexity (w.r.t. the dimension d), with proper choices of the architecture and activation:

E d²_TV(ν, µ_{θ̂_{m,n}}) ≾ √( d² log d / (n ∧ m) ).

Corollary (L. ’18, Gaussian).

18 / 60

Intro. GANs Interpolation

PAIR REGULARIZATION: WHY GANS MIGHT BE BETTER

[Figure: the generator-class vs. discriminator-class plane, with regimes labeled “classic parametric models,” “nonparametric density estimation,” and “data-memorization, empirical deviation,” indicating which term dominates where.]

19 / 60

Intro. GANs Interpolation

Optimization: local convergence

20 / 60

Intro. GANs Interpolation

FORMULATION

Generator g_θ, discriminator f_ω:

U(θ, ω) = E_{X∼P_real} h₁(f_ω(X)) − E_{Z∼P_input} h₂(f_ω(g_θ(Z)))

min_θ max_ω U(θ, ω)

• global optimization for general U(θ, ω) is hard: Singh et al. (2000); Pfau and Vinyals (2016); Salimans et al. (2016)

Local saddle point (θ∗,ω∗) such that no incentive to deviate locally

U(θ∗,ω) ≤ U(θ∗,ω∗) ≤ U(θ,ω∗) ,

for (θ,ω) in an open neighborhood of (θ∗,ω∗).

• also called local Nash Equilibrium (NE)
• modest goal: initialized properly, the algorithm converges to a local NE

21 / 60


Intro. GANs Interpolation

MAIN MESSAGE: INTERACTION MATTERS

Exponential local convergence to a stable equilibrium
- analog of GD in single-player optimization, strongly convex case
- intuitive picture: discrete-time SGA cycles inward to a stable equilibrium fast

However, the “interaction term” matters and slows down the convergence ⇐ curse
- compared to conventional GD in the strongly convex case, due to the presence of ∇_θω U ∇_θω Uᵀ
- we also show a lower bound on T_SGA, showing the curse is necessary

What if the equilibrium is unstable? It turns out the “interaction term” matters; utilizing it renders exponential convergence ⇐ blessing

• SGA fails; modify the dynamics to utilize the interaction
• the analogy to single-player optimization in the non-strongly convex case is surprising
• single-player: first-order methods cannot obtain error better than 1/T² in the smooth but non-strongly convex case, a classic result Nemirovski and Yudin (1983); Nesterov (2013)
• two-player: we will show a first-order method can obtain exponential convergence exp(−cT) to an unstable equilibrium

22 / 60

Intro. GANs Interpolation

“However, no guarantees are known beyond the convex-concave setting and, more importantly for the paper, even in convex-concave games, no guarantees are known for the last-iterate pair.”

— Daskalakis, Ilyas, Syrgkanis, and Zeng (2017)

23 / 60

Intro. GANs Interpolation

EXPONENTIAL CONVERGENCE TO UNSTABLE EQUILIBRIUM

OMD proposed in Daskalakis et al. (2017)

θ_{t+1} = θ_t − 2η ∇_θ U(θ_t, ω_t) + η ∇_θ U(θ_{t−1}, ω_{t−1})

ω_{t+1} = ω_t + 2η ∇_ω U(θ_t, ω_t) − η ∇_ω U(θ_{t−1}, ω_{t−1})

For the bilinear game U(θ, ω) = θᵀCω, to obtain an ε-close solution:

shown in Daskalakis et al. (2017):  T ≿ ε^{−4} log(1/ε) · Poly( λ_max(CCᵀ) / λ_min(CCᵀ) )

we show:  T ≿ log(1/ε) · λ_max(CCᵀ) / λ_min(CCᵀ)

Corollary (L. & Stokes, ’18).
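A minimal sketch of the OMD updates above on a bilinear game U(θ, ω) = θᵀCω; the specific matrix C (built to be well-conditioned), the step size, and the iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Ul, _ = np.linalg.qr(rng.normal(size=(d, d)))
Vr, _ = np.linalg.qr(rng.normal(size=(d, d)))
C = Ul @ np.diag(np.linspace(0.5, 1.5, d)) @ Vr.T        # well-conditioned C for a quick demo
eta = 0.1

theta, omega = rng.normal(size=d), rng.normal(size=d)
prev_gt, prev_go = np.zeros(d), np.zeros(d)
start = np.linalg.norm(theta) + np.linalg.norm(omega)

for t in range(2000):
    gt = C @ omega                                       # grad_theta U = C omega
    go = C.T @ theta                                     # grad_omega U = C^T theta
    # Optimistic mirror descent: twice the current gradient minus the previous one.
    theta = theta - 2 * eta * gt + eta * prev_gt
    omega = omega + 2 * eta * go - eta * prev_go
    prev_gt, prev_go = gt, go

end = np.linalg.norm(theta) + np.linalg.norm(omega)
# The distance to the unique equilibrium (0, 0) should shrink geometrically,
# at a rate governed by lambda_max(CC^T) / lambda_min(CC^T).
print(f"distance to equilibrium: {start:.3f} -> {end:.3e}")
```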

24 / 60

Intro. GANs Interpolation

Interpolation

25 / 60


Intro. GANs Interpolation

HOW DO WE TEACH STAT/ML?

[Figure: the classic U-shaped picture of generalization error vs. model complexity (∞ ← λ on the left, λ → 0 on the right): bias decreases, variance increases, optimal complexity at λ_opt, and the λ → 0 “overfitting” end marked “DANGER ZONE!!”.]

26 / 60

Intro. GANs Interpolation

IS THIS REALLY WHAT’S HAPPENING IN PRACTICE?

[Figure: generalization error vs. model complexity again (∞ ← λ to λ → 0), with the λ → 0 “overfitting” region now marked “DANGER ZONE??”.]

27 / 60

Intro. GANs Interpolation

Is explicit regularization λopt really needed?

Is interpolation really bad for statistics and machine learning?

28 / 60


Intro. GANs Interpolation

AN EMPIRICAL EXAMPLE

[Figure: Kernel Regression on MNIST — test error (log scale) vs. the ridge parameter λ ∈ [0, 1.2], for several digit pairs [i, j] with i ∈ {2, 3, 4} and j ∈ {5, …, 9}.]

λ = 0: the interpolated solution, perfect fit on training data.

MNIST data from LeCun et al. (2010)
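A sketch of the experiment's flavor, using the small scikit-learn digits dataset as a stand-in for MNIST and an RBF kernel with ±1 labels (all illustrative choices, not necessarily the exact setup behind the plot); λ = 0 corresponds to the interpolating solution.

```python
import numpy as np
from sklearn.datasets import load_digits     # small 8x8 digits, a stand-in for MNIST

def rbf(A, B, gamma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

digits = load_digits()
mask = np.isin(digits.target, [3, 8])                        # one digit pair, e.g. [3, 8]
X = digits.data[mask] / 16.0
y = np.where(digits.target[mask] == 3, 1.0, -1.0)
n = 200
Xtr, ytr, Xte, yte = X[:n], y[:n], X[n:], y[n:]

gamma = 1.0 / Xtr.shape[1]
K, Kte = rbf(Xtr, Xtr, gamma), rbf(Xte, Xtr, gamma)
for lam in [0.0, 1e-3, 1e-1, 1.0]:
    # Kernel ridge regression dual coefficients; lam = 0 gives the interpolant.
    alpha = np.linalg.lstsq(K + n * lam * np.eye(n), ytr, rcond=None)[0]
    err = np.mean(np.sign(Kte @ alpha) != yte)
    print(f"lambda = {lam:g}: test classification error = {err:.3f}")
```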

29 / 60


Intro. GANs Interpolation

ISOLATED PHENOMENON? NO

• Methodology: deep learning, kernel learning, boosting, random forests . . . Zhang, Bengio, Hardt, Recht, and Vinyals (2016); Wyner, Olson, Bleich, and Mease (2017); Maennel, Bousquet, and Gelly (2018)

• Datasets: MNIST, CIFAR-10, others Belkin, Ma, and Mandal (2018b)

30 / 60

Intro. GANs Interpolation

PUZZLES

Interpolated solutions perform very well in practice, for many (modern) methodologies and datasets!

What is happening? “Overfitting” is not that bad . . .

31 / 60

Intro. GANs Interpolation

OUR MESSAGE

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

32 / 60

Intro. GANs Interpolation

Potential theory in statistics/learning for interpolated solution:

• Analysis through explicit regularization (✗)
• Capacity control (✗)
• Early stopping (algorithmic) (✗)
• Stability analysis (algorithmic) (✗)
• Nonparametric smoothing analysis (✗)
• Inductive bias (✓?), at least promising

Belkin, Hsu, and Mitra (2018a)

33 / 60

Intro. GANs Interpolation

INDUCTIVE BIAS

There are many functions that behave exactly the same on training data, but the method/algorithm prefers certain functions:

• kernels/RKHS: Representer Theorem, min-norm interpolation

• over-parametrized linear regression: 0-initialization, min-norm interpolation

• matrix factorization, etc. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, and Srebro (2017); Li, Ma, and Zhang (2017)

• two-layer ReLU networks Maennel, Bousquet, and Gelly (2018)

• Inductive bias (✓?), at least promising

Zhang et al. (2016); Neyshabur et al. (2017); Liang et al. (2017); Belkin et al. (2018b)

34 / 60


Intro. GANs Interpolation

HISTORY: INTERPOLATION RULES

Understudied in the literature: especially when there is label noise

Recent progress on local/direct interpolation schemes:

• Geometric simplicial interpolation and weighted kNN: Belkin, Hsu, and Mitra (2018a)

• Nonparametric Nadaraya-Watson estimator with singular kernels: Shepard (1968); Devroye, Gyorfi, and Krzyzak (1998); Belkin, Rakhlin, and Tsybakov (2018c)

35 / 60

Intro. GANs Interpolation

SIMPLICIAL INTERPOLATION

Belkin, Hsu, and Mitra (2018a) showed that, under regularity conditions, the simplicial interpolation f̂_n satisfies

lim sup_{n→∞} E(f̂_n(x) − f*(x))² ≤ (2/(d + 2)) · E(f*(x) − y)²

36 / 60

Intro. GANs Interpolation

SINGULAR KERNEL

Shepard (1968); Devroye et al. (1998); Belkin et al. (2018c) showed that for singular kernels

K(u) := ∥u∥^{−a} · 1{∥u∥ ≤ 1}  ⇒  lim_{u→0} K(u) = ∞,

the Nadaraya-Watson estimator

f̂_n(x) = ∑_{i=1}^n y_i K((x − x_i)/h) / ∑_{i=1}^n K((x − x_i)/h)

achieves the optimal error when f* lies in a Hölder space with smoothness β:

E(f̂_n(x) − f*(x))² ∼ n^{−2β/(2β+d)}

37 / 60

Intro. GANs Interpolation

Global/inverse interpolation methods (kernel machines/neural networks/boosting) perform better than the local interpolation schemes empirically.

Several conjectures have been made about global/inverse interpolation methods, such as kernel machines in Belkin, Hsu, and Mitra (2018a) and two-layer ReLU nets in Maennel, Bousquet, and Gelly (2018).

38 / 60

Intro. GANs Interpolation

Interpolated min-norm solution for kernel ridge regression

39 / 60

Intro. GANs Interpolation

PROBLEM FORMULATION

Given n i.i.d. pairs (x_i, y_i) drawn from an unknown µ: the x_i are d-dimensional covariates in Ω ⊂ R^d, and the y_i ∈ R are the responses/labels.

We want to estimate

f*(x) = E(y | x = x),

which is assumed to lie in a Reproducing Kernel Hilbert Space (RKHS) H with kernel K(·, ·).

Smola and Scholkopf (1998); Wahba (1990); Shawe-Taylor and Cristianini (2004)

40 / 60

Intro. GANs Interpolation

Conventional wisdom: Kernel Ridge Regression, with explicit regularization λ ≠ 0 added when H is high- or infinite-dimensional:

min_{f∈H} (1/n) ∑_{i=1}^n (f(x_i) − y_i)² + λ ∥f∥²_H.

41 / 60

Intro. GANs Interpolation

We study the min-norm interpolation estimator f̂:

f̂ := argmin_{f∈H} ∥f∥_H  s.t.  f(x_i) = y_i, ∀ i ≤ n.

Equivalently,

f̂(x) = K(x, X) K(X, X)^{−1} Y

when K(X, X) ∈ R^{n×n} is invertible.
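A minimal numerical sketch of the closed form above, assuming a Gaussian kernel and synthetic data (illustrative choices): the estimator interpolates the labels, and its RKHS norm YᵀK(X, X)⁻¹Y is the smallest among all interpolants.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=n)

def k(A, B):                                       # Gaussian kernel, an illustrative choice
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

K = k(X, X)
alpha = np.linalg.solve(K, y)                      # K(X, X)^{-1} Y
f_hat = lambda x: k(np.atleast_2d(x), X) @ alpha   # f_hat(x) = K(x, X) K(X, X)^{-1} Y

print("max |f_hat(x_i) - y_i| =", np.abs(K @ alpha - y).max())   # ~0: it interpolates
print("RKHS norm^2 of f_hat   =", y @ alpha)       # = Y^T K(X, X)^{-1} Y, the min-norm value
print("prediction at a new x  =", f_hat(rng.normal(size=d) / np.sqrt(d)))
```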

42 / 60


Intro. GANs Interpolation

We look for adaptive, “data-dependent” bounds φ_{n,d}(X, f*) to understand when and why the interpolated estimator f̂ generalizes.

We provide high-probability bounds on the

integrated squared risk:  E(f̂(x) − f*(x))² ≤ φ_{n,d}(X, f*)

generalization error:  E(f̂(x) − y)² − E(f*(x) − y)² ≤ φ_{n,d}(X, f*)

43 / 60

Intro. GANs Interpolation

ASSUMPTIONS

(A.1) High-dim: c ≤ d/n ≤ C; Σ_d = E_µ[x x*] satisfies ∥Σ_d∥ ≤ C and Tr(Σ_d)/d ≥ c.

(A.2) (8 + m)-moments: z_i := Σ_d^{−1/2} x_i; the entries of z_i are i.i.d., mean zero, with bounded (8 + m)-th moments.

(A.3) Noise: E[(f*(x) − y)² | x = x] ≤ σ² for all x ∈ Ω.

(A.4) Non-linear kernel: for a non-linear smooth function h(·),

K(x, x′) = h(⟨x, x′⟩ / d)

44 / 60

Intro. GANs Interpolation

Define the following quantities related to the curvature of h(·):

α := h(0) + h″(0) · Tr(Σ_d²)/d²,   β := h′(0),

γ := h(Tr(Σ_d)/d) − h(0) − h′(0) · Tr(Σ_d)/d.

45 / 60

Intro. GANs Interpolation

MAIN RESULTS

Define

φ_{n,d}(X, f*) := (8σ²∥Σ_d∥/d) ∑_j λ_j(XX*/d + (α/β)·11*) / [ γ/β + λ_j(XX*/d + (α/β)·11*) ]²
  + ∥f*∥²_H · inf_{0≤k≤n} { (1/n) ∑_{j>k} λ_j(K_X K_X*) + 2M √(k/n) }.

Under (A.1)-(A.4), with probability 1 − 2δ − d^{−2}, the interpolation estimator f̂ satisfies

E_{Y|X} ∥f̂ − f*∥²_{L²_µ},   E_{Y|X} [ E(f̂(x) − y)² − E(f*(x) − y)² ]  ≤  φ_{n,d}(X, f*) + ε(n, d).

The remainder term ε(n, d) = O(d^{−m/(8+m)} log^{4.1} d) + O(n^{−1/2} log^{0.5}(n/δ)).

Theorem (L. & Rakhlin, ’18).
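A sketch that evaluates the two terms of φ_{n,d}(X, f*) on simulated data; the plug-in constants σ², ∥Σ_d∥, ∥f*∥_H, M, the identity covariance, and the choice h = exp are all illustrative assumptions, so the numbers only indicate how the bound is computed, not a claim about any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 400
X = rng.normal(size=(n, d))                    # design with Sigma_d = I_d
h = np.exp                                     # K(x, x') = h(<x, x'> / d), illustrative choice

# Curvature quantities for Sigma_d = I and h = exp (closed form):
# alpha = h(0) + h''(0)/d, beta = h'(0), gamma = h(1) - h(0) - h'(0).
alpha, beta, gamma = 1.0 + 1.0 / d, 1.0, np.e - 2.0
sigma2, f_norm2, M, op_norm_Sigma = 1.0, 1.0, 1.0, 1.0   # illustrative plug-in constants

# Variance term: (8 sigma^2 ||Sigma_d|| / d) * sum_j lam_j(A) / (gamma/beta + lam_j(A))^2.
A = X @ X.T / d + (alpha / beta) * np.ones((n, n))
lam_A = np.linalg.eigvalsh(A)
variance_term = 8 * sigma2 * op_norm_Sigma / d * np.sum(lam_A / (gamma / beta + lam_A) ** 2)

# Bias term: ||f*||_H^2 * inf_k { (1/n) sum_{j>k} lam_j(K_X K_X^*) + 2 M sqrt(k/n) }.
K = h(X @ X.T / d)
lam_K = np.sort(np.linalg.eigvalsh(K @ K.T))[::-1]
tails = np.concatenate(([lam_K.sum()], lam_K.sum() - np.cumsum(lam_K)))
ks = np.arange(n + 1)
bias_term = f_norm2 * np.min(tails / n + 2 * M * np.sqrt(ks / n))

print(f"variance term ~ {variance_term:.4f}, bias term ~ {bias_term:.4f}")
```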

46 / 60


Intro. GANs Interpolation

MAIN MESSAGE

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

φ_{n,d}(X, f*) := (8σ²∥Σ_d∥/d) ∑_j λ_j(XX*/d + (α/β)·11*) / [ γ/β + λ_j(XX*/d + (α/β)·11*) ]²
  + ∥f*∥²_H · inf_{0≤k≤n} { (1/n) ∑_{j>k} λ_j(K_X K_X*) + 2M √(k/n) }

Proof is different from classic RKHS analysis with explicit regularization.

47 / 60

Intro. GANs Interpolation

GEOMETRIC PROPERTIES OF DESIGN

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

(Recall the bound φ_{n,d}(X, f*) displayed above: a variance term driven by λ_j(XX*/d + (α/β)·11*) and a bias term driven by λ_j(K_X K_X*).)

[Figure: Spectral Decay of Kernels on MNIST and Spectral Decay of Cov on MNIST — eigenvalue (log scale) vs. index, for digit pairs [2,5], [2,9], [3,6], [3,8], [4,7].]

Spectral decay of λ_j(K_X K_X*) and λ_j(XX*/d).

Preferable geometric properties of the design. Not all designs work!

48 / 60

Intro. GANs Interpolation

HIGH DIMENSIONALITY

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

(Recall the bound φ_{n,d}(X, f*) displayed above.)

Scalings:
• c < d/n < C, the typical high-dimensional scaling in RMT, El Karoui (2010); Johnstone (2001)
• scaling K(x, x′) = h(⟨x, x′⟩/d), the default choice for high-dimensional data in computing packages, e.g. Scikit-learn Pedregosa et al. (2011)
• the bounds work in the large (d, n) regime: ε(n, d) = O(d^{−m/(8+m)} log^{4.1} d) + O(n^{−1/2} log^{0.5}(n/δ))

Blessings of high dimensionality:
• a similar effect is observed for local/direct interpolating schemes in Belkin et al. (2018a), for simplicial interpolation and weighted kNN
• kernel “ridgeless” regression is a global/inverse interpolation scheme

49 / 60


Intro. GANs Interpolation

CURVATURE AND IMPLICIT REGULARIZATION

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

(Recall the bound φ_{n,d}(X, f*) displayed above.)

Role of the implicit regularization γ/β ≠ 0: due to the curvature/non-linearity of the kernel.

• the analysis is very different from that for explicit regularization in RKHS, Caponnetto and De Vito (2007)
• we borrow tools from recent developments in RMT for kernel matrices, El Karoui (2010)

effective dimension:
  classic analysis: ∑_j λ_j / (λ* + λ_j)
  our analysis:     ∑_j λ_j / (λ* + λ_j)²

50 / 60

Intro. GANs Interpolation

CURVATURE AND IMPLICIT REGULARIZATION

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

(Recall the bound φ_{n,d}(X, f*) displayed above.)

Role of the implicit regularization γ/β ≠ 0: due to the curvature/non-linearity of the kernel.

effective dimension:
  classic: ∑_j λ_j / (λ* + λ_j)
  ours:    ∑_j λ_j / (λ* + λ_j)²
  naive:   ∑_j 1/λ_j

51 / 60

Intro. GANs Interpolation

MAIN MESSAGE

Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

“implicit regularization” + “inductive bias”

52 / 60

Intro. GANs Interpolation

“Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.”

— Zhang, Bengio, Hardt, Recht, and Vinyals (2016)

53 / 60

Intro. GANs Interpolation

Gradient descent on two-layer ReLU networks

54 / 60

Intro. GANs Interpolation

FORMULATION

Two-layer ReLU networks

f_t(x) = ∑_{i=1}^m w_i(t) σ(xᵀ u_i(t)),

with gradient descent (GD) on the parameters w_i(t), u_i(t):

dw_i(t)/dt = −E_z[ ∂ℓ(y, f(x))/∂f · σ(xᵀ u_i) ]

du_i(t)/dt = −E_z[ ∂ℓ(y, f(x))/∂f · w_i · 1{xᵀ u_i ≥ 0} · x ]

Initialization: m large, u_i drawn from the uniform spherical distribution, with |w_i| = ∥u_i∥ = 1/√m.

Algorithmic approximation: given (x, y), run GD on two-layer ReLU networks; how does f_t(x) approximate f*(x) = E(y | x = x)? Does it interpolate y?

No further assumption on f* besides that it lies in L². Maennel, Bousquet, and Gelly (2018)
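A minimal sketch of these dynamics, discretized as plain GD on a finite sample with squared loss ℓ(y, f) = (y − f)²/2 and a toy target; the sample-based loss, the step size, and the target function are illustrative assumptions rather than the talk's population gradient flow.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 5, 512
X = rng.normal(size=(n, d))
y = np.maximum(X[:, 0], 0.0) - 0.5 * X[:, 1]         # a simple stand-in target f*(x)

# Initialization from the slide: u_i uniform on the sphere, |w_i| = ||u_i|| = 1/sqrt(m).
U = rng.normal(size=(m, d))
U = U / np.linalg.norm(U, axis=1, keepdims=True) / np.sqrt(m)
w = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

eta = 0.5
for t in range(3000):
    pre = X @ U.T                                    # pre-activations x^T u_i, shape (n, m)
    act = np.maximum(pre, 0.0)                       # sigma(x^T u_i)
    resid = act @ w - y                              # dl/df for the squared loss
    grad_w = act.T @ resid / n                       # E[ dl/df * sigma(x^T u_i) ]
    grad_U = ((resid[:, None] * (pre > 0) * w).T @ X) / n   # E[ dl/df * w_i 1{x^T u_i >= 0} x ]
    w -= eta * grad_w
    U -= eta * grad_U

# The residual shrinks along the directions picked out by the (data-dependent) kernel K_t.
print("training MSE after GD:", np.mean((np.maximum(X @ U.T, 0.0) @ w - y) ** 2))
```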

55 / 60

Intro. GANs Interpolation

VIEW GD ON RELU NETWORK AS DYNAMIC KERNELS

d E_x[(f*(x) − f_t(x))²] = −2 E_{x,x̃}[ (f*(x) − f_t(x)) K_t(x, x̃) (f*(x̃) − f_t(x̃)) ] dt.

Lemma (Dou & L. ’18+).

View NN as fixed kernel:

lim_{m→∞} K₀(x, x̃) = 2 [ ((π − arccos(t))/π) · t + √(1 − t²) ],  where t = ⟨x, x̃⟩

Rahimi and Recht (2008); Cho and Saul (2009); Daniely et al. (2016); Bach (2017)

We view the NN as a dynamic kernel! We provide a mean-field approximation (as m → ∞) and a PDE characterization (distribution dynamics) for ρ_t and thus K_t.

Mei, Montanari, and Nguyen (2018); Rotskoff and Vanden-Eijnden (2018)

56 / 60

Intro. GANs Interpolation

REPRESENTATION BENEFITS

NN: data-dependent basis, an adaptive representation learned from data.
Classic nonparametrics: fixed basis from analysis, not adaptive to data.

Heuristic justification, but what does it really mean mathematically?

Follow the GD dynamics to any stationary point; denote the corresponding RKHS by K⋆ (kernel K⋆):

For any f* ∈ L²_µ:

• the function computed by GD on the NN is a projection onto the RKHS K⋆:

lim_{t→∞} f_t^GD = H⋆ f* ∈ K⋆

• the residual lies in a smaller space:

residual := f* − lim_{t→∞} f_t^GD,   residual ∈ K⊥_GD ⊂ K⊥⋆

Theorem (Dou & L., ’18+).

K⋆ is adaptive to f∗! Gap in space: non-trivial decomposition.

57 / 60


Intro. GANs Interpolation

INTERPOLATION BENEFITS

Running GD on a NN is learning the data-dependent kernel and performing least squares (RKHS) simultaneously.

The kernel is adaptive to the task f*, so the least-squares projection f^GD_∞ lies in K⋆, and the residual f* − f^GD_∞ is smaller.

Having an additional (trainable) layer serves as “implicit regularization” on K_t, giving faster interpolation.

For more benefits and generalizations, see Dou & L. ’18+, to be posted.

58 / 60

Intro. GANs Interpolation

CONCLUSION

• Minimax optimal rates do not explain the empirical success of neural networks.

• One needs a new adaptive (to properties of the data), data-dependent framework/understanding.

• This requires new insights on regularization and interpolation.

image credit to Internet

59 / 60


References

Thank you!

Liang, T. (2018). On How Well Generative Adversarial Networks Learn Densities: Nonparametric and Parametric Results. Available on arXiv:1811.03179. Under review.

Liang, T. & Rakhlin, A. (2018). Just Interpolate: Kernel “Ridgeless” Regression Can Generalize. Available on arXiv:1808.00387. Revision invited.

Liang, T. & Stokes, J. (2018). Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks. Available on arXiv:1802.06132. AISTATS 2019, to appear.

Liang, T., Poggio, T., Rakhlin, A. & Stokes, J. (2017). Fisher-Rao Metric, Geometry, and Complexity of Neural Networks. Available on arXiv:1711.01530. AISTATS 2019, to appear.

Dou, X. & Liang, T. (2018+). Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits. Available here.

Michael Arbel, Dougal J. Sutherland, Mikołaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. arXiv preprint arXiv:1805.11565, 2018.
Martin Arjovsky and Leon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.
Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017a.
Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv preprint arXiv:1711.02651, 2017b.
Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.
Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. arXiv preprint arXiv:1806.05161, 2018a.
Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018b.
Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018c.
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.
Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
Luc Devroye, Laszlo Gyorfi, and Adam Krzyzak. The Hilbert kernel regression estimate. Journal of Multivariate Analysis, 65(2):209–227, 1998.
Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1–50, 2010.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.
Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718–1727, 2015.
Tengyuan Liang. How well can generative adversarial networks learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017.
Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.
Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991, 2017.
Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
Bernard A. Mair and Frits H. Ruymgaart. Statistical inverse estimation in Hilbert scales. SIAM Journal on Applied Mathematics, 56(5):1424–1444, 1996.
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.
Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. arXiv preprint arXiv:1711.04894, 2017.
A. Nemirovski and D. Yudin. Information-based complexity of mathematical programming. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika (translated to English as Engineering Cybernetics. Soviet J. Computer & Systems Sci.), 1, 1983.
Arkadi Nemirovski. Topics in non-parametric statistics. Ecole d'Ete de Probabilites de Saint-Flour, 28:85, 2000.
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Donald Shepard. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, pages 517–524. ACM, 1968.
Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548. Morgan Kaufmann Publishers Inc., 2000.
Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabas Poczos. Nonparametric density estimation under adversarial losses. arXiv preprint arXiv:1805.08836, 2018.
Alex J. Smola and Bernhard Scholkopf. Learning with Kernels, volume 4. Citeseer, 1998.
Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszar. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982.
Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Scholkopf. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424–5433, 2017.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.
Grace Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.
Larry Wasserman. All of Nonparametric Statistics. Springer Science+Business Media, New York, 2006.
Abraham J. Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558–1590, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

60 / 60

