Optimization Methods for Large-Scale Machine Learning
Frank E. Curtis, Lehigh University
presented at
East Coast Optimization Meeting, George Mason University
Fairfax, Virginia
April 2, 2021
References
▶ Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223–311, 2018.
▶ Frank E. Curtis and Katya Scheinberg. Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning. In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.
Motivating questions
▶ How do optimization problems arise in machine learning applications, and what makes them challenging?
▶ What have been the most successful optimization methods for large-scale machine learning, and why?
▶ What recent advances have been made in the design of algorithms, and what are open questions in this research area?
Outline
GD and SG
GD vs. SG
Beyond SG
Noise Reduction Methods
Second-Order Methods
Conclusion
Learning problems and (surrogate) optimization problems
Learn a prediction function $h : \mathcal{X} \to \mathcal{Y}$ to solve

$$\max_{h \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}[h(x) \approx y] \, dP(x, y)$$

Various meanings for $h(x) \approx y$ depending on the goal:

▶ Binary classification, with $y \in \{-1, +1\}$: $y \cdot h(x) > 0$.
▶ Regression, with $y \in \mathbb{R}^{n_y}$: $\|h(x) - y\| \le \delta$.

Parameterizing $h$ by $w \in \mathbb{R}^d$, we aim to solve

$$\max_{w \in \mathbb{R}^d} \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}[h(w; x) \approx y] \, dP(x, y)$$

Now, common practice is to replace the indicator with a smooth loss. . .
Stochastic optimization
Over a parameter vector $w \in \mathbb{R}^d$ and given

$$\ell(\cdot\,; y) \ \text{(loss w.r.t. “true label”)} \qquad\text{and}\qquad h(w; x) \ \text{(prediction w.r.t. “features”)},$$

consider the unconstrained optimization problem

$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)].$$

Given a training set $\{(x_i, y_i)\}_{i=1}^n$, an approximate problem is given by

$$\min_{w \in \mathbb{R}^d} f_n(w), \quad \text{where } f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i).$$
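As a concrete (if simplistic) sketch of evaluating $f_n$, the snippet below assumes a linear prediction function $h(w; x) = w^T x$ and a user-supplied loss; the data and loss are illustrative, not from the talk.

```python
import numpy as np

def empirical_risk(w, X, Y, loss):
    """f_n(w) = (1/n) sum_i loss(h(w; x_i), y_i) for the linear model h(w; x) = w^T x."""
    preds = X @ w                                   # h(w; x_i) for every sample i
    return np.mean([loss(p, y) for p, y in zip(preds, Y)])

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                   # n = 100 samples, d = 5 features
Y = rng.standard_normal(100)
print(empirical_risk(np.zeros(5), X, Y, loss=lambda p, y: (p - y) ** 2))
```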
Text classification
[Image: first page of Léon Bottou, Frank E. Curtis, and Jorge Nocedal, “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, 60(2):223–311, 2018, shown as an example document.]

Regularized logistic regression for classifying documents (e.g., as “math” vs. “poetry”):

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \log\big(1 + \exp(-y_i (w^T x_i))\big) + \frac{\lambda}{2} \|w\|_2^2$$
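A minimal sketch of this objective and its gradient, assuming feature vectors $x_i$ (e.g., word counts) are already given; the function names are illustrative.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """(1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||_2^2, with y_i in {-1, +1}."""
    margins = y * (X @ w)                          # y_i * w^T x_i
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))           # derivative of each log-loss term
    return X.T @ coeff / len(y) + lam * w
```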
Image / speech recognition
What pixel combinations represent the number 4?
What sounds are these? (“Here comes the sun” – The Beatles)
Deep neural networks
$$h(w; x) = a_l(W_l(\,\cdots\, a_2(W_2(a_1(W_1 x + \omega_1)) + \omega_2) \,\cdots\,))$$

Figure: Illustration of a DNN with an input layer ($x_1, \dots, x_5$), two hidden layers, an output layer, and weight matrices $W_1$, $W_2$, $W_3$.
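A minimal sketch of this forward pass for a fully connected network; the layer sizes match the figure, while the activations and random weights are illustrative.

```python
import numpy as np

def dnn_forward(x, weights, biases, activations):
    """h(w; x) = a_l(W_l(... a_2(W_2(a_1(W_1 x + omega_1)) + omega_2) ...))."""
    h = x
    for W, omega, a in zip(weights, biases, activations):
        h = a(W @ h + omega)                       # affine map, then elementwise activation
    return h

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(0)
shapes = [(4, 5), (4, 4), (3, 4)]                  # 5 inputs -> 4 -> 4 -> 3 outputs, as in the figure
weights = [rng.standard_normal(s) for s in shapes]
biases = [rng.standard_normal(s[0]) for s in shapes]
print(dnn_forward(rng.standard_normal(5), weights, biases, [relu, relu, identity]))
```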
Tradeoffs of large-scale learning
Bottou, Bousquet (2008) and Bottou (2010)
Notice that we went from our true problem

$$\max_{h \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}[h(x) \approx y] \, dP(x, y)$$

to say that we’ll find our solution $h \equiv h(w; \cdot)$ by (approximately) solving

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i).$$
Three sources of error:

▶ approximation (from restricting attention to the family $\mathcal{H}$)
▶ estimation (from optimizing over a finite sample rather than the distribution $P$)
▶ optimization (from solving the empirical problem only approximately)
Approximation error
Choice of prediction function family H has important implications; e.g.,
$$\mathcal{H}_C := \{h \in \mathcal{H} : \Omega(h) \le C\}.$$
Figure: Misclassification rate (training and testing curves) as a function of the capacity parameter $C$ (left) and of training time (right).
Problems of interest
Let’s focus on the expected loss/risk problem
$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)],$$

and the empirical loss/risk problem

$$\min_{w \in \mathbb{R}^d} f_n(w), \quad \text{where } f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i).$$
For this talk, let’s assume
▶ $f$ is continuously differentiable, bounded below, and potentially nonconvex;
▶ $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(w) - \nabla f(\bar{w})\|_2 \le L \|w - \bar{w}\|_2$ for all $w, \bar{w} \in \mathbb{R}^d$.
Gradient descent

Aim: Find a stationary point, i.e., $w$ with $\nabla f(w) = 0$.

Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^d$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3:   set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
Figure: At $w_k$, the objective $f$ lies below the quadratic upper model
$$f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} L \|w - w_k\|_2^2$$
and, under $c$-strong convexity, above the lower model
$$f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} c \|w - w_k\|_2^2.$$
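A minimal sketch of Algorithm GD; the quadratic objective at the bottom is illustrative, chosen so that $L$ and $c$ are easy to read off.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Algorithm GD with fixed stepsize alpha (take alpha <= 1/L for the theory below)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)                  # w_{k+1} <- w_k - alpha * grad f(w_k)
    return w

A = np.diag([1.0, 10.0])                           # f(w) = 0.5 w^T A w, so c = 1 and L = 10
w = gradient_descent(lambda w: A @ w, w0=[1.0, 1.0], alpha=1.0 / 10.0, num_iters=200)
print(w)                                           # approaches the stationary point [0, 0]
```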
GD theory
Theorem GD

If $\alpha \in (0, 1/L]$, then
$$\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty, \quad \text{which implies} \quad \nabla f(w_k) \to 0.$$
If, in addition, $f$ is $c$-strongly convex, then for all $k \ge 1$:
$$f(w_k) - f_* \le (1 - \alpha c)^k (f(w_0) - f_*).$$
Proof.
$$\begin{aligned}
f(w_{k+1}) &\le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 \\
&\le f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2 \qquad \text{(due to stepsize choice)} \\
&\le f(w_k) - \alpha c (f(w_k) - f_*) \qquad \text{(strong convexity: } \|\nabla f(w_k)\|_2^2 \ge 2c(f(w_k) - f_*)\text{)}
\end{aligned}$$
$$\implies f(w_{k+1}) - f_* \le (1 - \alpha c)(f(w_k) - f_*).$$
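A quick numerical sanity check of the contraction factor $(1 - \alpha c)$, on an illustrative strongly convex quadratic (not an example from the talk):

```python
import numpy as np

A = np.diag([1.0, 10.0])                 # f(w) = 0.5 w^T A w: c = 1, L = 10, f_* = 0
f = lambda w: 0.5 * w @ A @ w
alpha, w = 1.0 / 10.0, np.array([1.0, 1.0])
for k in range(5):
    gap_before = f(w)
    w = w - alpha * (A @ w)              # one GD step
    print(f"k={k}: (f(w_k+1)-f_*)/(f(w_k)-f_*) = {f(w) / gap_before:.3f} <= 0.9")
```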
GD illustration
Figure: GD with fixed stepsize
Stochastic gradient method (SG)
Invented by Herbert Robbins and Sutton Monro in 1951.
Sutton Monro, former Lehigh faculty member
Stochastic gradient descent
Approximate the gradient only; e.g., sample a random index $i_k$ so that $\mathbb{E}[\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \mid w] = \nabla f(w)$.

Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^d$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3:   set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for

Not a descent method! . . . but one can guarantee eventual descent in expectation (with $\mathbb{E}_k[g_k] = \nabla f(w_k)$):

$$\begin{aligned}
f(w_{k+1}) &\le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 \\
&= f(w_k) - \alpha_k \nabla f(w_k)^T g_k + \tfrac{1}{2} \alpha_k^2 L \|g_k\|_2^2 \\
\implies \mathbb{E}_k[f(w_{k+1})] &\le f(w_k) - \alpha_k \|\nabla f(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_k[\|g_k\|_2^2].
\end{aligned}$$

Markov process: $w_{k+1}$ depends only on $w_k$ and the random choice at iteration $k$.
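A minimal sketch of Algorithm SG for the empirical risk, where $g_k$ is the gradient of a single uniformly sampled term (hence unbiased for $\nabla f_n(w_k)$); the least-squares instance and stepsize schedule are illustrative.

```python
import numpy as np

def stochastic_gradient(grad_i, n, w0, stepsize, num_iters, seed=0):
    """Algorithm SG: g_k = gradient of one sampled term, so E_k[g_k] = grad f_n(w_k)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for k in range(num_iters):
        i = rng.integers(n)                        # i_k uniform on {0, ..., n-1}
        w = w - stepsize(k) * grad_i(w, i)
    return w

# Illustrative instance: f_n(w) = (1/n) sum_i 0.5 (x_i^T w - y_i)^2
rng = np.random.default_rng(1)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
w = stochastic_gradient(grad_i, n=1000, w0=np.zeros(5),
                        stepsize=lambda k: 1.0 / (k + 10), num_iters=5000)
```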
SG theory
Theorem SG

If $\mathbb{E}_k[\|g_k\|_2^2] \le M + \|\nabla f(w_k)\|_2^2$, then:

$$\alpha_k = \frac{1}{L} \;\implies\; \mathbb{E}\Bigg[\frac{1}{k} \sum_{j=1}^k \|\nabla f(w_j)\|_2^2\Bigg] \le M$$

$$\alpha_k = O\Big(\frac{1}{k}\Big) \;\implies\; \mathbb{E}\Bigg[\sum_{j=1}^k \alpha_j \|\nabla f(w_j)\|_2^2\Bigg] < \infty.$$

If, in addition, $f$ is $c$-strongly convex, then:

$$\alpha_k = \frac{1}{L} \;\implies\; \mathbb{E}[f(w_k) - f_*] \le O\Big(\frac{(\alpha L)(M/c)}{2}\Big)$$

$$\alpha_k = O\Big(\frac{1}{k}\Big) \;\implies\; \mathbb{E}[f(w_k) - f_*] = O\Big(\frac{(L/c)(M/c)}{k}\Big).$$
(*Assumed unbiased gradient estimates; see paper for more generality.)
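A small simulation of the strongly convex case (illustrative, not from the talk): with a fixed stepsize the optimality gap stalls at a noise floor proportional to $\alpha L M / (2c)$, while $\alpha_k = O(1/k)$ drives it to zero.

```python
import numpy as np

# f(w) = 0.5 w^2 (c = L = 1, f_* = 0) with additive unit-variance gradient noise
rng = np.random.default_rng(0)

def run_sg(stepsize, num_iters=10000):
    w = 1.0
    for k in range(num_iters):
        g = w + rng.normal()                 # unbiased estimate of grad f(w) = w
        w -= stepsize(k) * g
    return 0.5 * w**2                        # final optimality gap f(w) - f_*

print("fixed alpha = 0.5:     ", run_sg(lambda k: 0.5))              # stalls at a noise floor
print("diminishing 1/(k + 2): ", run_sg(lambda k: 1.0 / (k + 2)))    # keeps decreasing
```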
Why O(1/k)?
Mathematically:
$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{while} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$
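For instance, the standard choice (not specific to this talk)

$$\alpha_k = \frac{a}{b + k} \ \text{ for } a, b > 0 \quad\text{gives}\quad \sum_{k=1}^{\infty} \frac{a}{b + k} = \infty \ \text{ while } \ \sum_{k=1}^{\infty} \frac{a^2}{(b + k)^2} < \infty.$$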
Graphically (sequential version of constant stepsize result):
SG illustration
Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)
Why SG over GD for large-scale machine learning?
GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ (linear convergence)
SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ (sublinear convergence)
So why SG?
Motivation    Explanation
Intuitive     data “redundancy”
Empirical     SG vs. L-BFGS with batch gradient (figure below)
Theoretical   $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ and $\mathbb{E}[f(w_k) - f_*] = O(1/k)$
Figure: Empirical risk vs. number of accessed data points for SGD and L-BFGS.
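The figure’s experiment is not reproduced here; as an illustrative stand-in, the sketch below compares SG against one full-batch gradient step (in place of L-BFGS) on least squares under an equal budget of accessed data points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # linear model plus noise
risk = lambda w: 0.5 * np.mean((X @ w - y) ** 2)

w_sg, w_gd = np.zeros(d), np.zeros(d)
for k in range(n):                                 # SG: n steps, one data point each
    w_sg -= 0.01 * (X[k] @ w_sg - y[k]) * X[k]
w_gd -= 0.1 * X.T @ (X @ w_gd - y) / n             # GD: one step costs n data accesses

print(f"after {n} accessed points: SG risk = {risk(w_sg):.4f}, GD risk = {risk(w_gd):.4f}")
```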
Work complexity

Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
          Convergence rate                                  Time per iteration      Time for ε-optimality
GD:  $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$      +      $O(n)$      $\implies$      $n \log(1/\epsilon)$
SG:  $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$         +      $O(1)$      $\implies$      $1/\epsilon$
Considering the total (estimation + optimization) error as

$$\mathcal{E} = \mathbb{E}[f(\bar{w}_n) - f(w_*)] + \mathbb{E}[f(\tilde{w}_n) - f(\bar{w}_n)] \sim \frac{1}{n} + \epsilon,$$

where $\bar{w}_n$ minimizes $f_n$ and $\tilde{w}_n$ is the approximate solution returned within the budget, and a time budget $T$, one finds:

▶ SG: Process as many samples as possible ($n \sim T$), leading to
$$\mathcal{E} \sim \frac{1}{T}.$$
▶ GD: With $n \sim T / \log(1/\epsilon)$, minimizing $\mathcal{E}$ yields $\epsilon \sim 1/T$ and
$$\mathcal{E} \sim \frac{\log(T)}{T} + \frac{1}{T}.$$
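A numeric rendering of this comparison with all constants set to one; the grid search over $\epsilon$ stands in for the minimization described above.

```python
import numpy as np

T = 1e6                                            # time budget
sg_error = 1.0 / T                                 # SG: n ~ T  =>  E ~ 1/T
eps = np.logspace(-8, -1, 200)                     # GD: n ~ T / log(1/eps)
gd_error = np.log(1.0 / eps) / T + eps             # E ~ 1/n + eps
print(f"SG: E ~ {sg_error:.2e}")
print(f"GD: min E ~ {gd_error.min():.2e}  (about log(T)/T = {np.log(T) / T:.2e})")
```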
End of the story?
SG is great! Let’s keep proving how great it is!

▶ SG is “stable with respect to inputs”
▶ SG avoids “steep minima”
▶ SG avoids “saddle points”
▶ . . . (many more)

No, we should want more. . .

▶ SG requires a lot of “hyperparameter” tuning
▶ Sublinear convergence is not satisfactory
▶ . . . a “linearly” convergent method eventually wins
▶ . . . especially with a higher budget, faster computation, parallel?, distributed?

Also, any “gradient”-based method is not scale invariant.