The Marginal Value of Adaptive Methods in Machine Learning

Ashia C. Wilson*, Rebecca Roelofs*, Mitchell Stern*, Nathan Srebro†, Benjamin Recht*

* University of California, Berkeley    † Toyota Technological Institute at Chicago

Introduction

• Adaptive optimization methods, which include AdaGrad, RMSProp, and Adam, are very popular for training deep neural networks.

• We show that for simple over-parameterized problems, adaptive methods often find drastically different solutions than non-adaptive methods such as stochastic gradient descent (SGD).

• We construct an illustrative binary classification problem where the data is linearly separable, SGD achieves zero test error, and AdaGrad and Adam attain test errors arbitrarily close to 1/2.

• We study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that adaptive methods have worse (often much worse) test error than SGD.

Illustrative Example

• We consider the following over-parameterized least squares problem, with $X \in \mathbb{R}^{n \times d}$ and $d \gg n$:

$$\min_w\ f(w) := \tfrac{1}{2}\,\|Xw - y\|_2^2$$

• Label $y_i \in \{-1, 1\}$ is equal to $1$ with probability $\tfrac{1}{2} + \epsilon$.

• The data is constructed in the following way: the first feature of $x_i$ is the label $y_i$, the next two features are always $1$, and each example has $1$'s at feature locations unique to that example (one unique $1$ when $y_i = 1$, five when $y_i = -1$); all remaining features are $0$.

• SGD (+momentum) finds the minimum norm solution $w_{\text{sgd}} = X^T (XX^T)^{-1} y$.

• If it is feasible, adaptive methods find the uniform-weight solution $w_{\text{ada}} \propto \operatorname{sign}(X^T y)$.
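To make the two solutions concrete, here is a minimal numpy sketch of the construction described above. The helper `make_data`, the sizes `n_train`, `n_test`, and `d`, and the margin `eps` are our illustrative choices, not values from the poster.

```python
import numpy as np

def make_data(n, d, eps=0.1, offset=0, rng=None):
    """Linearly separable data in the style of the example above:
    x[0] = y_i, x[1] = x[2] = 1, plus 1's in a block of features unique to
    example i (one 1 if y_i = +1, five 1's if y_i = -1); all else is 0."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = np.where(rng.random(n) < 0.5 + eps, 1.0, -1.0)
    X = np.zeros((n, d))
    X[:, 0] = y
    X[:, 1:3] = 1.0
    for i in range(n):
        start = 3 + 5 * (offset + i)          # block reserved for this example
        width = 1 if y[i] == 1.0 else 5
        X[i, start:start + width] = 1.0
    return X, y

n_train, n_test = 50, 500
d = 3 + 5 * (n_train + n_test)                # d >> n_train
X, y = make_data(n_train, d)
X_test, y_test = make_data(n_test, d, offset=n_train, rng=np.random.default_rng(1))

# Minimum-norm interpolant (the solution SGD converges to): w = X^T (X X^T)^{-1} y
w_sgd = X.T @ np.linalg.solve(X @ X.T, y)

# Uniform-weight solution (what adaptive methods converge to when it is feasible)
w_ada = np.sign(X.T @ y)

test_error = lambda w: np.mean(np.sign(X_test @ w) != y_test)
print("min-norm solution test error:", test_error(w_sgd))
print("sign solution test error:    ", test_error(w_ada))
```

Because the sign solution puts the same weight on the always-on features as on the label feature, it ends up predicting the majority class on unseen examples, which is why its test error approaches 1/2 as claimed above.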

Experiments

• We conduct experiments on four datasets.

• Minimal changes were made to online codebases, and experiments were repeated 5 times with random initialization.

Step-size analysis on CIFAR-10

• Pick the largest SGD step-size that does not diverge.
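One way to read this protocol is as a simple sweep over a geometric grid of learning rates, discarding any run that diverges and keeping the largest survivor. The sketch below is only a schematic of that idea; `train_briefly`, the grid, and the loss threshold are hypothetical placeholders, not the authors' tuning code.

```python
import math

def largest_stable_step_size(train_briefly, grid=None):
    """Return the largest step size on a geometric grid whose short training
    run neither produces NaN/inf nor exceeds a loose loss threshold."""
    grid = grid or [2.0 ** k for k in range(2, -12, -1)]   # 4, 2, 1, 0.5, ...
    for lr in grid:                                        # try largest first
        loss = train_briefly(lr)                           # e.g. a few epochs of SGD
        if math.isfinite(loss) and loss < 10.0:            # crude divergence check
            return lr
    raise ValueError("no step size on the grid was stable")
```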

Experiment 2: Penn Treebank Generative Parsing with 3-Layer LSTM

Conclusion

• With the same hyper-parameter tuning, SGD and SGD with momentum outperform adaptive methods on unseen data.

• Adaptive methods display fast initial progress, then plateau.

• What implications does this have for model selection?


Dataset          Task                     Architecture
CIFAR-10         Image Classification     Deep Convolutional
War & Peace      Language Model           2-Layer LSTM
Penn Treebank    Generative Parsing       3-Layer LSTM
Penn Treebank    Discriminative Parsing   2-Layer LSTM + FeedForward

Background

Popular algorithms for deep learning:

• Stochastic Gradient Descent: $w_{k+1} = w_k - \alpha_k \nabla f(w_k)$

• Momentum: $w_{k+1} = w_k - \alpha_k \nabla f(w_k) + \text{mom}$, where $\text{mom} := \beta_k (w_k - w_{k-1})$

• Adaptivity: $w_{k+1} = w_k - \alpha_k H_k^{-1} \nabla f(w_k)$, where $H_k := \operatorname{diag}\!\left(\left(\sum_{i=1}^{k} \eta_i\, \nabla f(x_i) \circ \nabla f(x_i)\right)^{1/2}\right)$

• Adam: $w_{k+1} = w_k - \alpha_k H_k^{-1} \nabla f(w_k) + H_k^{-1} H_{k-1}\, \text{mom}$

• Adaptive methods "adapt" to the relative scale of parameters. The scale information, however, might be important for learning, causing us to lose out!
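For reference, here is a minimal numpy sketch that covers the updates above in one function: SGD ($H_k = I$, $\beta_k = 0$), momentum ($H_k = I$), and adaptive methods with a diagonal $H_k$, including the $H_k^{-1} H_{k-1}\,\text{mom}$ term from the Adam line. This is our paraphrase of the poster's formulas for a single (deterministic) gradient; it omits the exponential moving averages and bias corrections that Adam adds.

```python
import numpy as np

def generic_step(w, w_prev, grad, G, alpha, beta, eta=1.0, eps=1e-8, adaptive=True):
    """One step of
        w_{k+1} = w_k - alpha_k H_k^{-1} grad_k + H_k^{-1} H_{k-1} mom_k,
        mom_k   = beta_k (w_k - w_{k-1}),
    with diagonal H_k = (sum_i eta_i * grad_i**2)**0.5 for adaptive methods and
    H_k = I for SGD / momentum.  G carries the running sum of eta * grad**2.
    Returns (w_next, new_w_prev, new_G)."""
    if adaptive:
        H_prev = np.sqrt(G) + eps        # H_{k-1}, stored as a diagonal vector
        G = G + eta * grad * grad        # accumulate eta_i * grad_i * grad_i
        H = np.sqrt(G) + eps             # H_k
    else:
        H_prev = H = np.ones_like(w)     # identity preconditioner
    mom = beta * (w - w_prev)
    w_next = w - alpha * grad / H + (H_prev / H) * mom
    return w_next, w, G

# Toy usage on the least squares objective f(w) = 0.5 * ||Xw - y||_2^2
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
w = w_prev = np.zeros(5)
G = np.zeros(5)
for _ in range(500):
    grad = X.T @ (X @ w - y)
    w, w_prev, G = generic_step(w, w_prev, grad, G, alpha=0.05, beta=0.9)
```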

Experiment 1: CIFAR-10 with VGG+BN+Dropout model
