Optimization Methods for Large-Scale Machine Learning
Frank E. Curtis, Lehigh University
presented at
East Coast Optimization Meeting, George Mason University
Fairfax, Virginia
April 2, 2021
References
▶ Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223–311, 2018.
▶ Frank E. Curtis and Katya Scheinberg. Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning. In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.
Motivating questions
▶ How do optimization problems arise in machine learning applications, and what makes them challenging?
▶ What have been the most successful optimization methods for large-scale machine learning, and why?
▶ What recent advances have been made in the design of algorithms, and what are open questions in this research area?
Outline
GD and SG
GD vs. SG
Beyond SG
Noise Reduction Methods
Second-Order Methods
Conclusion
Learning problems and (surrogate) optimization problems
Learn a prediction function $h : \mathcal{X} \to \mathcal{Y}$ to solve

$$\max_{h \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}[h(x) \approx y] \, dP(x, y)$$

Various meanings for $h(x) \approx y$ depending on the goal:

▶ Binary classification, with $y \in \{-1, +1\}$: $y \cdot h(x) > 0$.
▶ Regression, with $y \in \mathbb{R}^{n_y}$: $\|h(x) - y\| \le \delta$.

Parameterizing $h$ by $w \in \mathbb{R}^d$, we aim to solve

$$\max_{w \in \mathbb{R}^d} \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}[h(w; x) \approx y] \, dP(x, y)$$

Now, common practice is to replace the indicator with a smooth loss. . .
Stochastic optimization
Over a parameter vector $w \in \mathbb{R}^d$ and given

$$\ell(\cdot\,; y) \ \text{(loss w.r.t. “true label”)} \qquad\text{and}\qquad h(w; x) \ \text{(prediction w.r.t. “features”)},$$

consider the unconstrained optimization problem

$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)].$$

Given a training set $\{(x_i, y_i)\}_{i=1}^n$, an approximate problem is given by

$$\min_{w \in \mathbb{R}^d} f_n(w), \quad \text{where } f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i).$$
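As a concrete (if simplistic) sketch of evaluating $f_n$, the snippet below assumes a linear prediction function $h(w; x) = w^T x$ and a user-supplied loss; the data and loss are illustrative, not from the talk.

```python
import numpy as np

def empirical_risk(w, X, Y, loss):
    """f_n(w) = (1/n) sum_i loss(h(w; x_i), y_i) for the linear model h(w; x) = w^T x."""
    preds = X @ w                                   # h(w; x_i) for every sample i
    return np.mean([loss(p, y) for p, y in zip(preds, Y)])

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                   # n = 100 samples, d = 5 features
Y = rng.standard_normal(100)
print(empirical_risk(np.zeros(5), X, Y, loss=lambda p, y: (p - y) ** 2))
```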
Text classification
[Image: first page of Léon Bottou, Frank E. Curtis, and Jorge Nocedal, “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, 60(2):223–311, 2018, shown as an example document.]

Regularized logistic regression for classifying documents (e.g., as “math” vs. “poetry”):

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \log\big(1 + \exp(-y_i (w^T x_i))\big) + \frac{\lambda}{2} \|w\|_2^2$$
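A minimal sketch of this objective and its gradient, assuming feature vectors $x_i$ (e.g., word counts) are already given; the function names are illustrative.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """(1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||_2^2, with y_i in {-1, +1}."""
    margins = y * (X @ w)                          # y_i * w^T x_i
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))           # derivative of each log-loss term
    return X.T @ coeff / len(y) + lam * w
```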
Image / speech recognition
What pixel combinations represent the number 4?
What sounds are these? (“Here comes the sun” – The Beatles)
Deep neural networks
$$h(w; x) = a_l(W_l(\,\cdots\, a_2(W_2(a_1(W_1 x + \omega_1)) + \omega_2) \,\cdots\,))$$

Figure: Illustration of a DNN with an input layer ($x_1, \dots, x_5$), two hidden layers, an output layer, and weight matrices $W_1$, $W_2$, $W_3$.
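A minimal sketch of this forward pass for a fully connected network; the layer sizes match the figure, while the activations and random weights are illustrative.

```python
import numpy as np

def dnn_forward(x, weights, biases, activations):
    """h(w; x) = a_l(W_l(... a_2(W_2(a_1(W_1 x + omega_1)) + omega_2) ...))."""
    h = x
    for W, omega, a in zip(weights, biases, activations):
        h = a(W @ h + omega)                       # affine map, then elementwise activation
    return h

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(0)
shapes = [(4, 5), (4, 4), (3, 4)]                  # 5 inputs -> 4 -> 4 -> 3 outputs, as in the figure
weights = [rng.standard_normal(s) for s in shapes]
biases = [rng.standard_normal(s[0]) for s in shapes]
print(dnn_forward(rng.standard_normal(5), weights, biases, [relu, relu, identity]))
```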
Tradeoffs of large-scale learning
Bottou, Bousquet (2008) and Bottou (2010)
Notice that we went from our true problem

$$\max_{h \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}[h(x) \approx y] \, dP(x, y)$$

to say that we’ll find our solution $h \equiv h(w; \cdot)$ by (approximately) solving

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i).$$
Three sources of error:

▶ approximation (from restricting attention to the family $\mathcal{H}$)
▶ estimation (from optimizing over a finite sample rather than the distribution $P$)
▶ optimization (from solving the empirical problem only approximately)
Approximation error
Choice of prediction function family H has important implications; e.g.,
$$\mathcal{H}_C := \{h \in \mathcal{H} : \Omega(h) \le C\}.$$
Figure: Misclassification rate (training and testing curves) as a function of the capacity parameter $C$ (left) and of training time (right).
Problems of interest
Let’s focus on the expected loss/risk problem
$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)],$$

and the empirical loss/risk problem

$$\min_{w \in \mathbb{R}^d} f_n(w), \quad \text{where } f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i).$$
For this talk, let’s assume
▶ $f$ is continuously differentiable, bounded below, and potentially nonconvex;
▶ $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(w) - \nabla f(\bar{w})\|_2 \le L \|w - \bar{w}\|_2$ for all $w, \bar{w} \in \mathbb{R}^d$.
Gradient descent

Aim: Find a stationary point, i.e., $w$ with $\nabla f(w) = 0$.

Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^d$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3:   set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
Figure: At $w_k$, the objective $f$ lies below the quadratic upper model
$$f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} L \|w - w_k\|_2^2$$
and, under $c$-strong convexity, above the lower model
$$f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} c \|w - w_k\|_2^2.$$
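A minimal sketch of Algorithm GD; the quadratic objective at the bottom is illustrative, chosen so that $L$ and $c$ are easy to read off.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Algorithm GD with fixed stepsize alpha (take alpha <= 1/L for the theory below)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)                  # w_{k+1} <- w_k - alpha * grad f(w_k)
    return w

A = np.diag([1.0, 10.0])                           # f(w) = 0.5 w^T A w, so c = 1 and L = 10
w = gradient_descent(lambda w: A @ w, w0=[1.0, 1.0], alpha=1.0 / 10.0, num_iters=200)
print(w)                                           # approaches the stationary point [0, 0]
```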
GD theory
Theorem GD

If $\alpha \in (0, 1/L]$, then
$$\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty, \quad \text{which implies} \quad \nabla f(w_k) \to 0.$$
If, in addition, $f$ is $c$-strongly convex, then for all $k \ge 1$:
$$f(w_k) - f_* \le (1 - \alpha c)^k (f(w_0) - f_*).$$
Proof.
$$\begin{aligned}
f(w_{k+1}) &\le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 \\
&\le f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2 \qquad \text{(due to stepsize choice)} \\
&\le f(w_k) - \alpha c (f(w_k) - f_*) \qquad \text{(strong convexity: } \|\nabla f(w_k)\|_2^2 \ge 2c(f(w_k) - f_*)\text{)}
\end{aligned}$$
$$\implies f(w_{k+1}) - f_* \le (1 - \alpha c)(f(w_k) - f_*).$$
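A quick numerical sanity check of the contraction factor $(1 - \alpha c)$, on an illustrative strongly convex quadratic (not an example from the talk):

```python
import numpy as np

A = np.diag([1.0, 10.0])                 # f(w) = 0.5 w^T A w: c = 1, L = 10, f_* = 0
f = lambda w: 0.5 * w @ A @ w
alpha, w = 1.0 / 10.0, np.array([1.0, 1.0])
for k in range(5):
    gap_before = f(w)
    w = w - alpha * (A @ w)              # one GD step
    print(f"k={k}: (f(w_k+1)-f_*)/(f(w_k)-f_*) = {f(w) / gap_before:.3f} <= 0.9")
```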
GD illustration
Figure: GD with fixed stepsize
Stochastic gradient method (SG)
Invented by Herbert Robbins and Sutton Monro in 1951.
Sutton Monro, former Lehigh faculty member
Stochastic gradient descent
Approximate the gradient only; e.g., sample a random index $i_k$ so that $\mathbb{E}[\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \mid w] = \nabla f(w)$.

Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^d$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3:   set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for

Not a descent method! . . . but one can guarantee eventual descent in expectation (with $\mathbb{E}_k[g_k] = \nabla f(w_k)$):

$$\begin{aligned}
f(w_{k+1}) &\le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 \\
&= f(w_k) - \alpha_k \nabla f(w_k)^T g_k + \tfrac{1}{2} \alpha_k^2 L \|g_k\|_2^2 \\
\implies \mathbb{E}_k[f(w_{k+1})] &\le f(w_k) - \alpha_k \|\nabla f(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_k[\|g_k\|_2^2].
\end{aligned}$$

Markov process: $w_{k+1}$ depends only on $w_k$ and the random choice at iteration $k$.
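A minimal sketch of Algorithm SG for the empirical risk, where $g_k$ is the gradient of a single uniformly sampled term (hence unbiased for $\nabla f_n(w_k)$); the least-squares instance and stepsize schedule are illustrative.

```python
import numpy as np

def stochastic_gradient(grad_i, n, w0, stepsize, num_iters, seed=0):
    """Algorithm SG: g_k = gradient of one sampled term, so E_k[g_k] = grad f_n(w_k)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for k in range(num_iters):
        i = rng.integers(n)                        # i_k uniform on {0, ..., n-1}
        w = w - stepsize(k) * grad_i(w, i)
    return w

# Illustrative instance: f_n(w) = (1/n) sum_i 0.5 (x_i^T w - y_i)^2
rng = np.random.default_rng(1)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
w = stochastic_gradient(grad_i, n=1000, w0=np.zeros(5),
                        stepsize=lambda k: 1.0 / (k + 10), num_iters=5000)
```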
SG theory
Theorem SG

If $\mathbb{E}_k[\|g_k\|_2^2] \le M + \|\nabla f(w_k)\|_2^2$, then:

$$\alpha_k = \frac{1}{L} \;\implies\; \mathbb{E}\Bigg[\frac{1}{k} \sum_{j=1}^k \|\nabla f(w_j)\|_2^2\Bigg] \le M$$

$$\alpha_k = O\Big(\frac{1}{k}\Big) \;\implies\; \mathbb{E}\Bigg[\sum_{j=1}^k \alpha_j \|\nabla f(w_j)\|_2^2\Bigg] < \infty.$$

If, in addition, $f$ is $c$-strongly convex, then:

$$\alpha_k = \frac{1}{L} \;\implies\; \mathbb{E}[f(w_k) - f_*] \le O\Big(\frac{(\alpha L)(M/c)}{2}\Big)$$

$$\alpha_k = O\Big(\frac{1}{k}\Big) \;\implies\; \mathbb{E}[f(w_k) - f_*] = O\Big(\frac{(L/c)(M/c)}{k}\Big).$$
(*Assumed unbiased gradient estimates; see paper for more generality.)
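A small simulation of the strongly convex case (illustrative, not from the talk): with a fixed stepsize the optimality gap stalls at a noise floor proportional to $\alpha L M / (2c)$, while $\alpha_k = O(1/k)$ drives it to zero.

```python
import numpy as np

# f(w) = 0.5 w^2 (c = L = 1, f_* = 0) with additive unit-variance gradient noise
rng = np.random.default_rng(0)

def run_sg(stepsize, num_iters=10000):
    w = 1.0
    for k in range(num_iters):
        g = w + rng.normal()                 # unbiased estimate of grad f(w) = w
        w -= stepsize(k) * g
    return 0.5 * w**2                        # final optimality gap f(w) - f_*

print("fixed alpha = 0.5:     ", run_sg(lambda k: 0.5))              # stalls at a noise floor
print("diminishing 1/(k + 2): ", run_sg(lambda k: 1.0 / (k + 2)))    # keeps decreasing
```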
Why O(1/k)?
Mathematically:
$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{while} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$
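For instance, the standard choice (not specific to this talk)

$$\alpha_k = \frac{a}{b + k} \ \text{ for } a, b > 0 \quad\text{gives}\quad \sum_{k=1}^{\infty} \frac{a}{b + k} = \infty \ \text{ while } \ \sum_{k=1}^{\infty} \frac{a^2}{(b + k)^2} < \infty.$$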
Graphically (sequential version of constant stepsize result):
SG illustration
Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)
Why SG over GD for large-scale machine learning?
GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ (linear convergence)
SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ (sublinear convergence)
So why SG?
Motivation    Explanation
Intuitive     data “redundancy”
Empirical     SG vs. L-BFGS with batch gradient (figure below)
Theoretical   $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ and $\mathbb{E}[f(w_k) - f_*] = O(1/k)$
Figure: Empirical risk vs. number of accessed data points for SGD and L-BFGS.
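The figure’s experiment is not reproduced here; as an illustrative stand-in, the sketch below compares SG against one full-batch gradient step (in place of L-BFGS) on least squares under an equal budget of accessed data points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # linear model plus noise
risk = lambda w: 0.5 * np.mean((X @ w - y) ** 2)

w_sg, w_gd = np.zeros(d), np.zeros(d)
for k in range(n):                                 # SG: n steps, one data point each
    w_sg -= 0.01 * (X[k] @ w_sg - y[k]) * X[k]
w_gd -= 0.1 * X.T @ (X @ w_gd - y) / n             # GD: one step costs n data accesses

print(f"after {n} accessed points: SG risk = {risk(w_sg):.4f}, GD risk = {risk(w_gd):.4f}")
```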
Work complexity

Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
          Convergence rate                                  Time per iteration      Time for ε-optimality
GD:  $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$      +      $O(n)$      $\implies$      $n \log(1/\epsilon)$
SG:  $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$         +      $O(1)$      $\implies$      $1/\epsilon$
Considering the total (estimation + optimization) error as

$$\mathcal{E} = \mathbb{E}[f(\bar{w}_n) - f(w_*)] + \mathbb{E}[f(\tilde{w}_n) - f(\bar{w}_n)] \sim \frac{1}{n} + \epsilon,$$

where $\bar{w}_n$ minimizes $f_n$ and $\tilde{w}_n$ is the approximate solution returned within the budget, and a time budget $T$, one finds:

▶ SG: Process as many samples as possible ($n \sim T$), leading to
$$\mathcal{E} \sim \frac{1}{T}.$$
▶ GD: With $n \sim T / \log(1/\epsilon)$, minimizing $\mathcal{E}$ yields $\epsilon \sim 1/T$ and
$$\mathcal{E} \sim \frac{\log(T)}{T} + \frac{1}{T}.$$
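A numeric rendering of this comparison with all constants set to one; the grid search over $\epsilon$ stands in for the minimization described above.

```python
import numpy as np

T = 1e6                                            # time budget
sg_error = 1.0 / T                                 # SG: n ~ T  =>  E ~ 1/T
eps = np.logspace(-8, -1, 200)                     # GD: n ~ T / log(1/eps)
gd_error = np.log(1.0 / eps) / T + eps             # E ~ 1/n + eps
print(f"SG: E ~ {sg_error:.2e}")
print(f"GD: min E ~ {gd_error.min():.2e}  (about log(T)/T = {np.log(T) / T:.2e})")
```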
End of the story?
SG is great! Let’s keep proving how great it is!

▶ SG is “stable with respect to inputs”
▶ SG avoids “steep minima”
▶ SG avoids “saddle points”
▶ . . . (many more)

No, we should want more. . .

▶ SG requires a lot of “hyperparameter” tuning
▶ Sublinear convergence is not satisfactory
▶ . . . a “linearly” convergent method eventually wins
▶ . . . especially with a higher budget, faster computation, parallel?, distributed?

Also, any “gradient”-based method is not scale invariant.