Page 1:

First-Order Optimization Algorithms for Machine Learning: Variance-Reduced Stochastic Gradient

Mark Schmidt

University of British Columbia

Summer 2020

Page 2:

Better Methods for Smooth Objectives and Finite Datasets?

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(1) iteration cost

[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods.]

Stochastic methods:

O(1/ε) iterations, but requires only 1 gradient per iteration.
Rates are unimprovable for general stochastic objectives.

Deterministic methods:
O(log(1/ε)) iterations, but requires n gradients per iteration.
The faster rate is possible because n is finite.

For finite n, can we design a better method?

Page 3:

Hybrid Deterministic-Stochastic

Approach 1: control the sample size.
Deterministic method uses all n gradients,

\nabla f(w^k) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w^k).

Stochastic method approximates it with 1 sample,

\nabla f_{i_k}(w^k) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w^k).

A common variant is to use larger sample Bk (“mini-batch”),

\frac{1}{|B^k|} \sum_{i \in B^k} \nabla f_i(w^k) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w^k),

particularly useful for vectorization/parallelization.
For example, with 16 cores set |B^k| = 16 and compute the 16 gradients at once.
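To make the approximation concrete, here is a minimal NumPy sketch (illustrative only, not from the slides) comparing the full gradient with a mini-batch estimate for a least-squares objective; the data X, y, the iterate w, and the batch size are synthetic assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch_size = 5000, 10, 16
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w = rng.normal(size=d)

    # Full gradient of f(w) = (1/2n) sum_i (x_i^T w - y_i)^2.
    full_grad = X.T @ (X @ w - y) / n
    # Mini-batch estimate over a random sample B^k of size 16.
    B = rng.choice(n, size=batch_size, replace=False)
    mb_grad = X[B].T @ (X[B] @ w - y[B]) / batch_size

    print(np.linalg.norm(mb_grad - full_grad))  # the error e^k; it shrinks as |B^k| grows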

Page 4:

Mini-Batching as Gradient Descent with Error

The SG method with a sample B^k ("mini-batch") uses iterations

w^{k+1} = w^k - \frac{\alpha_k}{|B^k|} \sum_{i \in B^k} \nabla f_i(w^k).

Let’s view this as a “gradient method with error”,

w^{k+1} = w^k - \alpha_k \left( \nabla f(w^k) + e^k \right),

where e^k is the difference between the approximate and true gradient (e^k = g^k − ∇f(w^k) for approximation g^k).

If you use α_k = 1/L, then by the descent lemma this algorithm satisfies

f(w^{k+1}) \le f(w^k) - \underbrace{\frac{1}{2L} \|\nabla f(w^k)\|^2}_{\text{good}} + \underbrace{\frac{1}{2L} \|e^k\|^2}_{\text{bad}},

for any error ek (not necessarily unbiased or even stochastic).
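For readers wanting the missing step, the bound follows directly from the descent lemma f(v) ≤ f(w) + ∇f(w)ᵀ(v − w) + (L/2)‖v − w‖² with v = w^{k+1}; this is a standard calculation sketched here, not shown on the slide:

    \begin{align*}
    f(w^{k+1}) &\le f(w^k) + \nabla f(w^k)^\top (w^{k+1} - w^k) + \tfrac{L}{2}\,\|w^{k+1} - w^k\|^2 \\
    &= f(w^k) - \tfrac{1}{L}\,\nabla f(w^k)^\top\!\left(\nabla f(w^k) + e^k\right) + \tfrac{1}{2L}\,\|\nabla f(w^k) + e^k\|^2 \\
    &= f(w^k) - \tfrac{1}{2L}\,\|\nabla f(w^k)\|^2 + \tfrac{1}{2L}\,\|e^k\|^2 .
    \end{align*}

The second line substitutes w^{k+1} − w^k = −(1/L)(∇f(w^k) + e^k), and in the third line the cross terms (1/L)∇f(w^k)ᵀe^k cancel when the squared norm is expanded.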

Page 5:

Effect of Error on Convergence Rate

Our progress bound with α_k = 1/L and gradient error e^k is

f(w^{k+1}) \le f(w^k) - \underbrace{\frac{1}{2L} \|\nabla f(w^k)\|^2}_{\text{good}} + \underbrace{\frac{1}{2L} \|e^k\|^2}_{\text{bad}},

and notice that you are guaranteed to decrease f if ‖e^k‖ < ‖∇f(w^k)‖.

Connection between “error-free” rate and “with error” rate:

If the "error-free" rate is O(1/k), you maintain this rate if ‖e^k‖² = O(1/k).
If the "error-free" rate is O(ρ^k), you maintain this rate if ‖e^k‖² = O(ρ^k).
If the error goes to zero more slowly, then the rate at which it goes to zero becomes the bottleneck.

So to understand the effect of the batch size, we need to know how |B^k| affects ‖e^k‖².

Page 6:

Effect of Batch Size on Error

The batch size |B^k| controls the size of the error e^k.

If we sample with replacement we get

\mathbb{E}[\|e^k\|^2] = \frac{1}{|B^k|} \sigma^2,

where σ² is the variance of the gradient norms.

“Doubling the batch size cuts the error in half”.

If we sample without replacement from a training set of size n we get

\mathbb{E}[\|e^k\|^2] = \frac{n - |B^k|}{n} \cdot \frac{1}{|B^k|} \sigma^2,

which drives error to zero as batch size approaches n.

For O(ρ^k) linear convergence, we need a schedule like |B^{k+1}| = |B^k|/ρ.

For O(1/k) sublinear convergence, we need a schedule like |B^{k+1}| = |B^k| + const.
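As a sanity check on the with-replacement formula, here is a small NumPy simulation (an illustrative sketch; the gradients are stand-in random vectors, and σ² is taken here as the average of ‖∇f_i(w^k) − ∇f(w^k)‖² over the dataset):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch_size, trials = 1000, 5, 16, 20000

    G = rng.normal(size=(n, d))                    # stand-ins for the n gradients grad f_i(w^k)
    g_bar = G.mean(axis=0)                         # the full gradient grad f(w^k)
    sigma2 = np.mean(np.sum((G - g_bar) ** 2, axis=1))

    errors = np.empty(trials)
    for t in range(trials):
        idx = rng.integers(0, n, size=batch_size)  # sample B^k with replacement
        e = G[idx].mean(axis=0) - g_bar            # the gradient error e^k
        errors[t] = np.sum(e ** 2)

    print("empirical E[||e^k||^2]:", errors.mean())
    print("predicted sigma^2/|B^k|:", sigma2 / batch_size)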

Page 7:

Batching: Growing-Batch-Size Methods

The SG method with a sample Bk uses iterations

w^{k+1} = w^k - \frac{\alpha_k}{|B^k|} \sum_{i \in B^k} \nabla f_i(w^k).

For a fixed sample size |B^k|, the rate is sublinear.
With a fixed step-size, doubling the batch size halves the radius of the "ball" around the solution.
Still need the step-size to go to zero to get convergence.

But we can grow |B^k| to achieve a faster rate:
Early iterations are cheap like SG iterations.
Later iterations can use a sophisticated gradient method.

No need to set a magical step-size: use a line-search.
Can incorporate linear-time approximations to Newton.

Another approach: at some point switch from stochastic to deterministic.
Often after a small number of passes (but hard to know when to switch).
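A minimal sketch of a growing-batch method on a synthetic least-squares problem (illustrative; the doubling schedule corresponds to ρ = 1/2 above, and the data, step size, and iteration count are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 10000, 20
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    # Objective f(w) = (1/2n)||Xw - y||^2; L is the Lipschitz constant of its gradient.
    L = np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(d)
    batch = 2
    for k in range(20):
        b = min(batch, n)
        B = rng.choice(n, size=b, replace=False)   # the growing sample B^k
        grad = X[B].T @ (X[B] @ w - y[B]) / b      # mini-batch gradient
        w -= grad / L                              # step size 1/L
        batch *= 2                                 # |B^{k+1}| = 2|B^k| (rho = 1/2)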

Page 8:

Variance-Reduction

Increasing the batch size is a form of variance-reduction.

A way to decrease the size of the variance in SGD (“bad” term).

Many other forms of variance reduction exist.

Control variates, importance sampling, re-parameterization trick, and so on.

These improve constants in SGD convergence rate.

But they don't improve the rate unless the objective is smooth and the variance goes to zero.

Page 9:

Outline

1 Mini-Batches and Batching

2 Stochastic Average Gradient

Page 10:

Previously: Better Methods for Smooth Objectives and Finite Datasets

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(1) iteration cost

[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods.]

Stochastic methods:

O(1/ε) iterations, but requires only 1 gradient per iteration.

Deterministic methods:

O(log(1/ε)) iterations, but requires n gradients per iteration.

Growing-batch ("batching") or "switching" methods:

O(log(1/ε)) iterations, requires fewer than n gradients in early iterations.

Page 11:

Stochastic Average Gradient

Growing |Bk| eventually requires O(n) iteration cost.

Can we have 1 gradient per iteration and only O(log(1/ε)) iterations?
YES! The first such method was the stochastic average gradient (SAG) algorithm in 2012.

To motivate SAG, let’s view gradient descent as performing the iteration

w^{k+1} = w^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} v_i^k,

where on each step we set v_i^k = ∇f_i(w^k) for all i.

SAG method: only set v_{i_k}^k = ∇f_{i_k}(w^k) for a randomly-chosen i_k.

All other v_i^k are kept at their previous values.

Page 12:

Stochastic Average Gradient

We can think of SAG as having a memory [v_1, v_2, ..., v_n], where v_i^k is the gradient ∇f_i(w^k) from the last iteration k where i was selected.

On each iteration we:

Randomly choose one of the v_i and update it to the current gradient.
We take a step in the direction of the average of these v_i.

Page 13:

Stochastic Average Gradient

Basic SAG algorithm (maintains g = ∑_{i=1}^n v_i):

Set g = 0 and gradient approximation v_i = 0 for i = 1, 2, ..., n.
while(1)
  Sample i from {1, 2, ..., n}.
  Compute ∇f_i(w).
  g = g − v_i + ∇f_i(w).
  v_i = ∇f_i(w).
  w = w − (α/n) g.

Iteration cost is O(d), and “lazy updates” allow O(z) with sparse gradients.

For linear models where f_i(w) = h(wᵀx_i), it only requires O(n) memory:

\nabla f_i(w) = \underbrace{h'(w^\top x_i)}_{\text{scalar}} \, \underbrace{x_i}_{\text{data}}.

Least squares is h(z) = ½(z − y_i)², logistic is h(z) = log(1 + exp(−y_i z)), etc.

For neural networks, would need to store all activations (typically impractical).
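The pseudocode above translates directly into a short NumPy routine; this is a minimal sketch for L2-regularized least squares (an illustrative assumption), storing the generic O(nd) memory of gradients rather than the O(n) trick for linear models:

    import numpy as np

    def sag(X, y, alpha, lam=0.0, iters=100000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        V = np.zeros((n, d))      # memory of the most recent gradient of each f_i
        g = np.zeros(d)           # running sum g = sum_i v_i
        w = np.zeros(d)
        for _ in range(iters):
            i = rng.integers(n)                              # sample i uniformly
            grad_i = (X[i] @ w - y[i]) * X[i] + lam * w      # current grad f_i(w)
            g += grad_i - V[i]                               # update the running sum
            V[i] = grad_i                                    # refresh the memory slot
            w -= (alpha / n) * g                             # step along the average
        return w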

Page 14:

Stochastic Average Gradient

The SAG iteration is

w^{k+1} = w^k - \frac{\alpha_k}{n} \sum_{i=1}^{n} v_i^k,

where on each iteration we set v_{i_k}^k = ∇f_{i_k}(w^k) for a randomly-chosen i_k.

Unlike batching, we use a gradient for every example.

But the gradients might be out of date.

A stochastic variant of the earlier incremental aggregated gradient (IAG) method.

IAG selects i_k cyclically, which destroys its performance.

Key proof idea: v_i^k → ∇f_i(w*) at the same rate that w^k → w*:

So the variance ‖ek‖2 (“bad term”) converges linearly to 0.

Page 15:

Convergence Rate of SAG

If each ∇f_i is L-Lipschitz continuous and f is µ-strongly convex, then with α_k = 1/(16L) SAG satisfies

\mathbb{E}[f(w^k) - f(w^*)] \le O\left( \left( 1 - \min\left\{ \frac{\mu}{16L}, \frac{1}{8n} \right\} \right)^{k} \right).

Number of ∇fi evaluations to reach accuracy ε:

Stochastic: O((L/µ)(1/ε)). (Best when n is enormous.)

Gradient: O(n(L/µ) log(1/ε)).

Nesterov: O(n√(L/µ) log(1/ε)). (Best when n is small and L/µ is big.)

SAG: O(max{n, L/µ} log(1/ε)).

But note that the L values are again different between algorithms.

Page 16:

Comparing Deterministic and Stochastic Methods

Two benchmark L2-regularized logistic regression datasets:

[Figure: objective minus optimum (log scale) vs. effective passes on two datasets, comparing AFG, L-BFGS, SG, ASG, and IAG.]

Averaging makes SG work better; deterministic methods eventually catch up.

Page 17:

SAG Compared to Deterministic/Stochastic Methods

Two benchmark L2-regularized logistic regression datasets:

[Figure: objective minus optimum (log scale) vs. effective passes on the same two datasets, now including SAG with a line search (SAG-LS) alongside AFG, L-BFGS, SG, ASG, and IAG.]

SAG starts like a stochastic method but achieves a linear rate; the SAG step-size is set using an approximation L̂ of L.

Page 18:

Discussion of SAG and Beyond

Bonus slides discuss practical issues related to SAG:

Setting the step-size with an approximation to L.
Deciding when to stop.
Lipschitz sampling of training examples (improves the rate for SAG, but only changes constants for SG).

There are now a bunch of stochastic algorithms with fast rates:

SDCA, MISO, mixedGrad, SVRG, S2GD, Finito, SAGA, etc.
Accelerated/Newton-like/coordinate-wise/proximal/ADMM versions.
Analysis in non-convex settings, including new algorithms for PCA.
You can apparently get medals for research: https://ismp2018.sciencesconf.org/data/pages/_SJP8196.jpg

The most notable variation is SVRG, which gets rid of the memory...

Page 19:

Stochastic Variance-Reduced Gradient (SVRG)

SVRG algorithm: gets rid of memory by occasionally computing exact gradient.

w^{k+1} = w^k - \alpha_k \Big( \nabla f_{i_k}(w^k) \underbrace{-\, \nabla f_{i_k}(w^s) + \nabla f(w^s)}_{\text{mean zero}} \Big),

where ws is updated every m iterations.

Convergence properties similar to SAG (for suitable m).

Unbiased: E[∇fik(ws)] = ∇f(ws) (special case of “control variate”).

Theoretically m depends on L, µ, and n (some analyses randomize it).

In practice m = n seems to work well.

O(d) storage at average cost of 3 gradients per iteration.
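A minimal sketch of the SVRG loop for least squares (illustrative; the data model, step size, and epoch count are assumptions supplied by the caller, and m = n as suggested above):

    import numpy as np

    def svrg(X, y, alpha, epochs=30, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        m = n                                            # inner loop length; m = n works well in practice
        for _ in range(epochs):
            w_s = w.copy()                               # snapshot w^s
            full_grad = X.T @ (X @ w_s - y) / n          # exact gradient at the snapshot
            for _ in range(m):
                i = rng.integers(n)
                g_i = (X[i] @ w - y[i]) * X[i]           # grad f_i at the current w
                g_i_s = (X[i] @ w_s - y[i]) * X[i]       # grad f_i at the snapshot
                w -= alpha * (g_i - g_i_s + full_grad)   # variance-reduced step
        return w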

Page 20:

End of Part 2: Key Ideas

Typical ML problems are written as an optimization problem

\operatorname*{argmin}_{w \in \mathbb{R}^d} \; F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w^\top x_i) + \lambda r(w).

Coordinate optimization:
Faster than gradient descent if iterations are d-times cheaper.
Allows non-smooth r if it's separable.

Stochastic subgradient:
Iteration cost is n-times cheaper than [sub]gradient descent.
For non-smooth problems, convergence rate is same as subgradient method.
For smooth problems, number of iterations is much higher than gradient descent.
Effect of constant step size and batch size.

SAG and SVRG:
Special case when F is smooth.
Same low cost as stochastic gradient methods.
But similar convergence rate to gradient descent (many extensions exist).

Page 21:

Even Bigger Problems?

What about datasets that don't fit on one machine?

We need to consider parallel and distributed optimization.

New issues:
Synchronization: we may not want to wait for the slowest machine.
Communication: it's expensive to transfer data and parameters across machines.
Failures: in huge-scale settings, machine failure probability is non-trivial.
Batch size: for SGD, is it better to get more parallelism or more iterations?

"Embarrassingly" parallel solution:
Split the data across machines; each machine computes the gradient of its subset.
Papers present fancier methods, but always try this first ("linear speedup").

Fancier methods:
Asynchronous stochastic subgradient (works fine if you make the step-size smaller).
Parallel coordinate optimization (works fine if you make the step-size smaller).
Decentralized gradient (needs a smaller step-size and an "EXTRA" trick).

Page 22:

Skipped Topics: Kernel Methods and Dual Methods

In previous years, I've covered the following topics:

1 Kernel methods:

Allows using some exponential- or infinite-sized feature sets.
Allows defining a "similarity" between training examples rather than features.
Mercer's theorem and how to determine if a kernel is valid.
Representer theorem and models allowing the kernel trick.
Multiple kernel learning and connection to structured sparsity.
Large-scale kernel approximations that avoid the high cost.

2 Dual methods:

Lagrangian function, dual function, and convex conjugate.
Fenchel dual for deriving duals of "loss plus regularizer" problems.
Connection between the stochastic subgradient method and dual coordinate ascent.
Turning non-smooth problems into equivalent smooth problems.
Line-search for stochastic subgradient methods.

If you're interested, I put the slides on these topics here: https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.5.pdf

Page 23:

Summary

Mini-batches and the effect of batch size:

Doubling the batch size halves the variance.
Growing the batch size leads to a faster rate in terms of iterations.

And makes it easier to set the step-size and use Newton-like methods.

Stochastic average gradient: O(log(1/ε)) iterations with 1 gradient per iteration.

SVRG removes the memory requirement of SAG.

Next time: optimization with n =∞ (possibly non-IID).

Page 24:

SAG Practical Implementation Issues

Implementation tricks:
Improve performance at the start by using (1/m)g instead of (1/n)g, where m is the number of examples visited so far.

Common to use α_k = 1/L with an adaptive estimate of L.

Start with L̂ = 1 and double it whenever we don't satisfy

f_{i_k}\!\left( w^k - \frac{1}{\hat{L}} \nabla f_{i_k}(w^k) \right) \le f_{i_k}(w^k) - \frac{1}{2\hat{L}} \|\nabla f_{i_k}(w^k)\|^2,

and ‖∇f_{i_k}(w^k)‖ is non-trivial. This costs O(1) for linear models in terms of n and d.
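A minimal sketch of this doubling test (illustrative; the callables f_i and grad_i, the current iterate w, and the tolerance are hypothetical inputs supplied by the caller, not part of the slides):

    import numpy as np

    def update_lipschitz_estimate(f_i, grad_i, w, L_hat, tol=1e-8):
        """Double L_hat if the instantiated descent condition above fails."""
        g = grad_i(w)
        g_sq = np.dot(g, g)
        if g_sq <= tol:                                       # skip when ||grad f_i(w)|| is trivial
            return L_hat
        if f_i(w - g / L_hat) > f_i(w) - g_sq / (2 * L_hat):  # condition violated
            L_hat *= 2                                        # double the estimate
        return L_hat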

Can use ‖w^{k+1} − w^k‖/α = (1/n)‖g‖ ≈ ‖∇f(w^k)‖ to decide when to stop.

Lipschitz sampling of examples improves the convergence rate:
As with coordinate descent, sample the examples that can change quickly more often.
For classic SG methods, this only changes constants.

