First-Order Optimization Algorithms for Machine Learning
Variance-Reduced Stochastic Gradient
Mark Schmidt
University of British Columbia
Summer 2020
Better Methods for Smooth Objectives and Finite Datasets?
Stochastic vs. deterministic methods
• Goal = best of both worlds: linear rate with O(1) iteration cost
[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods.]
Stochastic methods:
O(1/ε) iterations but require 1 gradient per iteration.
Rates are unimprovable for general stochastic objectives.
Deterministic methods:
O(log(1/ε)) iterations but require n gradients per iteration.
The faster rate is possible because n is finite.
For finite n, can we design a better method?
Hybrid Deterministic-Stochastic
Approach 1: control the sample size.
The deterministic method uses all n gradients,
∇f(w^k) = (1/n) ∑_{i=1}^n ∇f_i(w^k).
Stochastic method approximates it with 1 sample,
∇f_{i_k}(w^k) ≈ (1/n) ∑_{i=1}^n ∇f_i(w^k).
A common variant is to use a larger sample B^k (a “mini-batch”),
(1/|B^k|) ∑_{i∈B^k} ∇f_i(w^k) ≈ (1/n) ∑_{i=1}^n ∇f_i(w^k),
particularly useful for vectorization/parallelization.
For example, with 16 cores set |B^k| = 16 and compute 16 gradients at once.
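To make this concrete, here is a minimal numpy sketch of the mini-batch gradient estimate for a least-squares objective (the data, function name, and batch size are illustrative assumptions, not part of the slides):

import numpy as np

def minibatch_gradient(w, X, y, batch_size, rng):
    # Average gradient over a random mini-batch B^k for least squares,
    # where f_i(w) = 0.5*(x_i^T w - y_i)^2 and grad f_i(w) = (x_i^T w - y_i) x_i.
    n = X.shape[0]
    batch = rng.choice(n, size=batch_size, replace=False)
    residuals = X[batch] @ w - y[batch]
    return X[batch].T @ residuals / batch_size

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 10)), rng.standard_normal(1000)
w = np.zeros(10)
g = minibatch_gradient(w, X, y, batch_size=16, rng=rng)  # e.g., one gradient per core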
Mini-Batching as Gradient Descent with Error
The SG method with a sample B^k (“mini-batch”) uses iterations
w^{k+1} = w^k − (α_k/|B^k|) ∑_{i∈B^k} ∇f_i(w^k).
Let’s view this as a “gradient method with error”,
w^{k+1} = w^k − α_k(∇f(w^k) + e^k),
where e^k is the difference between the approximate and true gradient.
(e^k = g^k − ∇f(w^k) for the gradient approximation g^k.)
If you use α_k = 1/L, then by the descent lemma this algorithm has
f(w^{k+1}) ≤ f(w^k) − (1/(2L))‖∇f(w^k)‖² (the “good” term) + (1/(2L))‖e^k‖² (the “bad” term),
for any error e^k (not necessarily unbiased or even stochastic).
Effect of Error on Convergence Rate
Our progress bound with α_k = 1/L and error e^k in the gradient is
f(w^{k+1}) ≤ f(w^k) − (1/(2L))‖∇f(w^k)‖² (the “good” term) + (1/(2L))‖e^k‖² (the “bad” term),
and notice that you are guaranteed to decrease f if ‖e^k‖ < ‖∇f(w^k)‖.
Connection between “error-free” rate and “with error” rate:
If the “error-free” rate is O(1/k), you maintain this rate if ‖e^k‖² = O(1/k).
If the “error-free” rate is O(ρ^k), you maintain this rate if ‖e^k‖² = O(ρ^k).
If the error goes to zero more slowly, then the rate at which it goes to zero becomes the bottleneck.
So to understand the effect of the batch size, we need to know how |B^k| affects ‖e^k‖².
Effect of Batch Size on Error
The batch size |B^k| controls the size of the error e^k.
If we sample with replacement we get
E[‖e^k‖²] = (1/|B^k|) σ²,
where σ² is the variance of the gradient norms (an empirical check of this scaling appears below).
“Doubling the batch size cuts the error in half”.
If we sample without replacement from a training set of size n we get
E[‖e^k‖²] = ((n − |B^k|)/n) (1/|B^k|) σ²,
which drives the error to zero as the batch size approaches n.
For O(ρ^k) linear convergence, you need a schedule like |B^{k+1}| = |B^k|/ρ.
For O(1/k) sublinear convergence, you need a schedule like |B^{k+1}| = |B^k| + const.
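Here is a quick empirical check of the 1/|B^k| scaling above, on a synthetic least-squares problem (all names and sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w = rng.standard_normal(d)

grads = X * (X @ w - y)[:, None]     # per-example gradients, one row per example
full_grad = grads.mean(axis=0)       # exact gradient

for batch_size in [1, 10, 100]:
    errs = []
    for _ in range(2000):
        batch = rng.choice(n, size=batch_size, replace=True)   # with replacement
        e = grads[batch].mean(axis=0) - full_grad
        errs.append(e @ e)
    print(batch_size, np.mean(errs))  # roughly sigma^2 / batch_size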
Batching: Growing-Batch-Size Methods
The SG method with a sample B^k uses iterations
w^{k+1} = w^k − (α_k/|B^k|) ∑_{i∈B^k} ∇f_i(w^k).
For a fixed sample size |B^k|, the rate is sublinear.
With a fixed step size, doubling the batch size halves the radius of the “ball” around the solution.
You still need the step size to go to zero to get convergence.
But we can grow |B^k| to achieve a faster rate (see the sketch after this list):
Early iterations are cheap like SG iterations.
Later iterations can use a sophisticated gradient method.
No need to set a magical step size: use a line search.
Can incorporate linear-time approximations to Newton.
Another approach: at some point switch from stochastic to deterministic:
Often after a small number of passes (but it is hard to know when to switch).
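A minimal sketch of a growing-batch-size loop under a geometric |B^{k+1}| = |B^k|/ρ schedule, for a synthetic least-squares problem (the step size and growth factor are illustrative assumptions, not tuned values):

import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

w = np.zeros(d)
alpha, batch_size, growth = 0.01, 2.0, 1.1   # growth = 1/rho > 1

for k in range(200):
    size = min(int(batch_size), n)
    batch = rng.choice(n, size=size, replace=False)
    grad = X[batch].T @ (X[batch] @ w - y[batch]) / size   # mini-batch gradient
    w -= alpha * grad
    batch_size *= growth                                   # |B^{k+1}| = |B^k| / rho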
Variance-Reduction
Increasing the batch size is a form of variance-reduction.
A way to decrease the variance in SGD (the “bad” term).
Many other forms of variance reduction exist.
Control variates, importance sampling, re-parameterization trick, and so on.
These improve constants in the SGD convergence rate.
But they don’t improve the rate unless the objective is smooth and the variance goes to zero.
Outline
1 Mini-Batches and Batching
2 Stochastic Average Gradient
Previously: Better Methods for Smooth Objectives and Finite Datasets
Stochastic vs. deterministic methods
• Goal = best of both worlds: linear rate with O(1) iteration cost
[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods.]
Stochastic methods:
O(1/ε) iterations but require 1 gradient per iteration.
Deterministic methods:
O(log(1/ε)) iterations but require n gradients per iteration.
Growing-batch (“batching”) or “switching” methods:
O(log(1/ε)) iterations, but require fewer than n gradients in early iterations.
Stochastic Average Gradient
Growing |B^k| eventually requires an O(n) iteration cost.
Can we have 1 gradient per iteration and only O(log(1/ε)) iterations?
YES! The first such method was the stochastic average gradient (SAG) algorithm in 2012.
To motivate SAG, let’s view gradient descent as performing the iteration
w^{k+1} = w^k − (α_k/n) ∑_{i=1}^n v_i^k,
where on each step we set v_i^k = ∇f_i(w^k) for all i.
SAG method: only set v_{i_k}^k = ∇f_{i_k}(w^k) for a randomly-chosen i_k.
All other v_i^k are kept at their previous values.
Stochastic Average Gradient
We can think of SAG as having a memory of vectors v_1, v_2, ..., v_n,
where v_i is the gradient ∇f_i(w^k) from the last iteration k where example i was selected.
On each iteration we:
Randomly choose one of the v_i and update it to the current gradient.
We take a step in the direction of the average of these v_i.
Stochastic Average Gradient
Basic SAG algorithm (maintains g = ∑_{i=1}^n v_i):
Set g = 0 and gradient approximation v_i = 0 for i = 1, 2, ..., n.
while(1)
  Sample i from {1, 2, ..., n}.
  Compute ∇f_i(w).
  g = g − v_i + ∇f_i(w).
  v_i = ∇f_i(w).
  w = w − (α/n) g.
(A runnable sketch of this loop appears below.)
Iteration cost is O(d), and “lazy updates” allow O(z) cost when gradients have only z non-zeroes.
For linear models where f_i(w) = h(wᵀx_i), it only requires O(n) memory:
∇f_i(w) = h′(wᵀx_i) x_i   (a scalar times the data vector x_i).
Least squares is h(z) = (1/2)(z − y_i)², logistic is h(z) = log(1 + exp(−y_i z)), etc.
For neural networks, would need to store all activations (typically impractical).
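A minimal runnable sketch of the basic SAG loop above, for a dense least-squares model (the step size is a rough 1/L_i-style choice and the problem data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

alpha = 1.0 / np.max(np.sum(X ** 2, axis=1))  # rough step size; the theory uses 1/(16L)
w = np.zeros(d)
V = np.zeros((n, d))    # memory of stored gradients v_i
g = np.zeros(d)         # running sum g = sum_i v_i

for k in range(10 * n):
    i = rng.integers(n)
    grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of f_i at the current w
    g += grad_i - V[i]                  # update the running sum
    V[i] = grad_i                       # overwrite the stored gradient
    w -= (alpha / n) * g                # step along the average of the stored v_i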
Stochastic Average Gradient
The SAG iteration is
w^{k+1} = w^k − (α_k/n) ∑_{i=1}^n v_i^k,
where on each iteration we set v_{i_k}^k = ∇f_{i_k}(w^k) for a randomly-chosen i_k.
Unlike batching, we use a gradient for every example.
But the gradients might be out of date.
SAG is a stochastic variant of the earlier incremental aggregated gradient (IAG) method.
IAG selects i_k cyclically, which destroys performance.
Key proof idea: v_i^k → ∇f_i(w*) at the same rate that w^k → w*:
So the variance ‖e^k‖² (the “bad” term) converges linearly to 0.
Convergence Rate of SAG
If each ∇f_i is L-Lipschitz continuous and f is strongly convex, then with α_k = 1/(16L) SAG has
E[f(w^k) − f(w*)] ≤ O((1 − min{μ/(16L), 1/(8n)})^k).
Number of ∇f_i evaluations to reach accuracy ε:
Stochastic: O((L/μ)(1/ε)). (Best when n is enormous.)
Gradient: O(n (L/μ) log(1/ε)).
Nesterov: O(n √(L/μ) log(1/ε)). (Best when n is small and L/μ is big.)
SAG: O(max{n, L/μ} log(1/ε)).
But note that the L values are again different between algorithms.
Comparing Deterministic and Stochastic Methods
Two benchmark L2-regularized logistic regression datasets:
[Figure: objective minus optimum (log scale) vs. effective passes on the two datasets, comparing AFG, L-BFGS, SG, ASG, and IAG.]
Averaging makes SG work better, but deterministic methods eventually catch up.
SAG Compared to Deterministic/Stochastic Methods
Two benchmark L2-regularized logistic regression datasets:
[Figure: objective minus optimum (log scale) vs. effective passes on the two datasets, comparing AFG, L-BFGS, SG, ASG, IAG, and SAG-LS.]
SAG starts like a stochastic method but has a linear rate; the SAG step size is set using an approximation L̂ of L.
Discussion of SAG and Beyond
Bonus slides discuss practical issues related to SAG:
Setting the step size with an approximation to L.
Deciding when to stop.
Lipschitz sampling of training examples.
This improves the rate for SAG, but only changes constants for SG.
There are now a bunch of stochastic algorithms with fast rates:
SDCA, MISO, mixedGrad, SVRG, S2GD, Finito, SAGA, etc.
Accelerated/Newton-like/coordinate-wise/proximal/ADMM versions.
Analysis in non-convex settings, including new algorithms for PCA.
You can apparently get medals for research: https://ismp2018.sciencesconf.org/data/pages/_SJP8196.jpg
The most notable variation is SVRG, which gets rid of the memory...
Stochastic Variance-Reduced Gradient (SVRG)
SVRG algorithm: gets rid of memory by occasionally computing exact gradient.
w^{k+1} = w^k − α_k (∇f_{i_k}(w^k) − ∇f_{i_k}(w^s) + ∇f(w^s)),
where the last two terms have mean zero and w^s is updated every m iterations (a sketch appears below).
Convergence properties similar to SAG (for suitable m).
Unbiased: E[∇f_{i_k}(w^s)] = ∇f(w^s) (a special case of a “control variate”).
Theoretically m depends on L, µ, and n (some analyses randomize it).
In practice m = n seems to work well.
O(d) storage at average cost of 3 gradients per iteration.
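A minimal sketch of the SVRG outer/inner loop for least squares, using m = n as suggested above (the step size and names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]        # gradient of f_i for least squares

alpha = 0.1 / np.max(np.sum(X ** 2, axis=1))  # rough step size (illustrative)
w = np.zeros(d)
m = n                                      # inner-loop length, m = n in practice

for s in range(20):                        # outer loop: take a snapshot w^s
    w_s = w.copy()
    full_grad = X.T @ (X @ w_s - y) / n    # exact gradient at the snapshot
    for k in range(m):                     # inner loop: variance-reduced steps
        i = rng.integers(n)
        g = grad_i(w, i) - grad_i(w_s, i) + full_grad
        w -= alpha * g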
End of Part 2: Key Ideas
Typical ML problems are written as the optimization problem
argmin_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n f_i(wᵀx_i) + λ r(w).
Coordinate optimization:
Faster than gradient descent if iterations are d-times cheaper.
Allows non-smooth r if it’s separable.
Stochastic subgradient:
Iteration cost is n-times cheaper than [sub]gradient descent.
For non-smooth problems, the convergence rate is the same as the subgradient method.
For smooth problems, the number of iterations is much higher than gradient descent.
Effect of constant step size and batch size.
SAG and SVRG:
Special case when F is smooth.
Same low cost as stochastic gradient methods.
But similar convergence rate to gradient descent (many extensions exist).
Even Bigger Problems?
What about datasets that don’t fit on one machine?
We need to consider parallel and distributed optimization.
New issues:
Synchronization: we may not want to wait for the slowest machine.
Communication: it’s expensive to transfer data and parameters across machines.
Failures: in huge-scale settings, machine failure probability is non-trivial.
Batch size: for SGD, is it better to get more parallelism or more iterations?
“Embarrassingly” parallel solution (see the sketch after this list):
Split data across machines; each machine computes the gradient of its subset.
Papers present fancier methods, but always try this first (“linear speedup”).
Fancier methods:
Asynchronous stochastic subgradient (works fine if you make the step size smaller).
Parallel coordinate optimization (works fine if you make the step size smaller).
Decentralized gradient (needs a smaller step size and an “EXTRA” trick).
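A minimal single-machine sketch of the embarrassingly parallel idea: split the data into chunks, compute each chunk’s gradient, and average (the chunk count and helper names are illustrative; on a real cluster each chunk would live on its own machine):

import numpy as np

def chunk_gradient(w, X_chunk, y_chunk):
    # Gradient of the least-squares loss over one machine's subset of the data.
    return X_chunk.T @ (X_chunk @ w - y_chunk) / len(y_chunk)

rng = np.random.default_rng(5)
X, y = rng.standard_normal((1000, 10)), rng.standard_normal(1000)
w = np.zeros(10)

num_machines = 4
X_chunks = np.array_split(X, num_machines)
y_chunks = np.array_split(y, num_machines)

# Each "machine" computes its local gradient; the results are combined by averaging.
local_grads = [chunk_gradient(w, Xc, yc) for Xc, yc in zip(X_chunks, y_chunks)]
grad = np.average(local_grads, axis=0, weights=[len(yc) for yc in y_chunks])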
Skipped Topics: Kernel Methods and Dual Methods
In previous years, I’ve covered the following topics:
1 Kernel methods:
Allows using some exponential- or infinite-sized feature sets.
Allows defining a “similarity” between training examples rather than features.
Mercer’s theorem and how to determine if a kernel is valid.
Representer theorem and models allowing the kernel trick.
Multiple kernel learning and the connection to structured sparsity.
Large-scale kernel approximations that avoid the high cost.
2 Dual methods:
Lagrangian function, dual function, and convex conjugate.
Fenchel dual for deriving duals of “loss plus regularizer” problems.
Connection between the stochastic subgradient method and dual coordinate ascent.
Turning non-smooth problems into equivalent smooth problems.
Line search for stochastic subgradient methods.
If you’re interested, I put the slides on these topics here:
https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.5.pdf
Summary
Mini-batches and the effect of batch size:
Doubling the batch size halves the variance.
Growing the batch size leads to a faster rate in terms of iterations.
And makes it easier to set the step-size and use Newton-like methods.
Stochastic average gradient: O(log(1/ε)) iterations with 1 gradient per iteration.
SVRG removes the memory requirement of SAG.
Next time: optimization with n =∞ (possibly non-IID).
SAG Practical Implementation Issues
Implementation tricks:
Improve performance at the start by using (1/m) g instead of (1/n) g,
where m is the number of examples visited so far.
Common to use α_k = 1/L̂ with an adaptive estimate L̂ of L.
Start with L̂ = 1 and double it whenever we don’t satisfy
f_{i_k}(w^k − (1/L̂) ∇f_{i_k}(w^k)) ≤ f_{i_k}(w^k) − (1/(2L̂)) ‖∇f_{i_k}(w^k)‖²,
and ‖∇f_{i_k}(w^k)‖ is non-trivial. Costs O(1) for linear models in terms of n and d.
(A sketch of this test appears after this list.)
Can use ‖w^{k+1} − w^k‖/α = (1/n)‖g‖ ≈ ‖∇f(w^k)‖ to decide when to stop.
Lipschitz sampling of examples improves the convergence rate:
As with coordinate descent, sample the examples that can change quickly more often.
For classic SG methods, this only changes constants.
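A minimal sketch of the doubling test above for a single least-squares example (the functions and names are illustrative assumptions, not the exact implementation from the SAG paper):

import numpy as np

def f_i(w, x_i, y_i):
    return 0.5 * (x_i @ w - y_i) ** 2        # least-squares loss for one example

def grad_f_i(w, x_i, y_i):
    return (x_i @ w - y_i) * x_i

def update_lipschitz_estimate(L_hat, w, x_i, y_i):
    # Double L_hat until the per-example descent condition holds.
    g = grad_f_i(w, x_i, y_i)
    if g @ g < 1e-10:                        # skip trivial gradients
        return L_hat
    while f_i(w - g / L_hat, x_i, y_i) > f_i(w, x_i, y_i) - (g @ g) / (2 * L_hat):
        L_hat *= 2
    return L_hat

rng = np.random.default_rng(6)
x_i, y_i, w = rng.standard_normal(10), 0.3, np.zeros(10)
L_hat = update_lipschitz_estimate(1.0, w, x_i, y_i)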