First-Order Optimization Algorithms for Machine Learning
Variance-Reduced Stochastic Gradient
Mark Schmidt
University of British Columbia
Summer 2020
Better Methods for Smooth Objectives and Finite Datasets?
Stochastic vs. deterministic methods
• Goal = best of both worlds: linear rate with O(1) iteration cost
[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods.]
Stochastic methods:
O(1/ε) iterations but require 1 gradient per iteration.
Rates are unimprovable for general stochastic objectives.
Deterministic methods:
O(log(1/ε)) iterations but require n gradients per iteration.
The faster rate is possible because n is finite.
For finite n, can we design a better method?
Hybrid Deterministic-Stochastic
Approach 1: control the sample size.
The deterministic method uses all n gradients,
∇f(w^k) = (1/n) ∑_{i=1}^n ∇f_i(w^k).
Stochastic method approximates it with 1 sample,
∇f_{i_k}(w^k) ≈ (1/n) ∑_{i=1}^n ∇f_i(w^k).
A common variant is to use a larger sample B^k (a “mini-batch”),
(1/|B^k|) ∑_{i∈B^k} ∇f_i(w^k) ≈ (1/n) ∑_{i=1}^n ∇f_i(w^k),
particularly useful for vectorization/parallelization.
For example, with 16 cores set |B^k| = 16 and compute 16 gradients at once.
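To make this concrete, here is a minimal numpy sketch of the mini-batch gradient estimate for a least-squares objective (the data, function name, and batch size are illustrative assumptions, not part of the slides):

import numpy as np

def minibatch_gradient(w, X, y, batch_size, rng):
    # Average gradient over a random mini-batch B^k for least squares,
    # where f_i(w) = 0.5*(x_i^T w - y_i)^2 and grad f_i(w) = (x_i^T w - y_i) x_i.
    n = X.shape[0]
    batch = rng.choice(n, size=batch_size, replace=False)
    residuals = X[batch] @ w - y[batch]
    return X[batch].T @ residuals / batch_size

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 10)), rng.standard_normal(1000)
w = np.zeros(10)
g = minibatch_gradient(w, X, y, batch_size=16, rng=rng)  # e.g., one gradient per core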
Mini-Batching as Gradient Descent with Error
The SG method with a sample B^k (“mini-batch”) uses iterations
w^{k+1} = w^k − (α_k/|B^k|) ∑_{i∈B^k} ∇f_i(w^k).
Let’s view this as a “gradient method with error”,
w^{k+1} = w^k − α_k(∇f(w^k) + e^k),
where e^k is the difference between the approximate and true gradient.
(e^k = g^k − ∇f(w^k) for the gradient approximation g^k.)
If you use α_k = 1/L, then by the descent lemma this algorithm has
f(w^{k+1}) ≤ f(w^k) − (1/(2L))‖∇f(w^k)‖² (the “good” term) + (1/(2L))‖e^k‖² (the “bad” term),
for any error e^k (not necessarily unbiased or even stochastic).
Effect of Error on Convergence Rate
Our progress bound with α_k = 1/L and error e^k in the gradient is
f(w^{k+1}) ≤ f(w^k) − (1/(2L))‖∇f(w^k)‖² (the “good” term) + (1/(2L))‖e^k‖² (the “bad” term),
and notice that you are guaranteed to decrease f if ‖e^k‖ < ‖∇f(w^k)‖.
Connection between “error-free” rate and “with error” rate:
If the “error-free” rate is O(1/k), you maintain this rate if ‖e^k‖² = O(1/k).
If the “error-free” rate is O(ρ^k), you maintain this rate if ‖e^k‖² = O(ρ^k).
If the error goes to zero more slowly, then the rate at which it goes to zero becomes the bottleneck.
So to understand the effect of the batch size, we need to know how |B^k| affects ‖e^k‖².
Effect of Batch Size on Error
The batch size |B^k| controls the size of the error e^k.
If we sample with replacement we get
E[‖e^k‖²] = (1/|B^k|) σ²,
where σ² is the variance of the gradient norms (an empirical check of this scaling appears below).
“Doubling the batch size cuts the error in half”.
If we sample without replacement from a training set of size n we get
E[‖e^k‖²] = ((n − |B^k|)/n) (1/|B^k|) σ²,
which drives the error to zero as the batch size approaches n.
For O(ρ^k) linear convergence, you need a schedule like |B^{k+1}| = |B^k|/ρ.
For O(1/k) sublinear convergence, you need a schedule like |B^{k+1}| = |B^k| + const.
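Here is a quick empirical check of the 1/|B^k| scaling above, on a synthetic least-squares problem (all names and sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w = rng.standard_normal(d)

grads = X * (X @ w - y)[:, None]     # per-example gradients, one row per example
full_grad = grads.mean(axis=0)       # exact gradient

for batch_size in [1, 10, 100]:
    errs = []
    for _ in range(2000):
        batch = rng.choice(n, size=batch_size, replace=True)   # with replacement
        e = grads[batch].mean(axis=0) - full_grad
        errs.append(e @ e)
    print(batch_size, np.mean(errs))  # roughly sigma^2 / batch_size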
Batching: Growing-Batch-Size Methods
The SG method with a sample B^k uses iterations
w^{k+1} = w^k − (α_k/|B^k|) ∑_{i∈B^k} ∇f_i(w^k).
For a fixed sample size |B^k|, the rate is sublinear.
With a fixed step size, doubling the batch size halves the radius of the “ball” around the solution.
You still need the step size to go to zero to get convergence.
But we can grow |B^k| to achieve a faster rate (see the sketch after this list):
Early iterations are cheap like SG iterations.
Later iterations can use a sophisticated gradient method.
No need to set a magical step size: use a line search.
Can incorporate linear-time approximations to Newton.
Another approach: at some point switch from stochastic to deterministic:
Often after a small number of passes (but it is hard to know when to switch).
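A minimal sketch of a growing-batch-size loop under a geometric |B^{k+1}| = |B^k|/ρ schedule, for a synthetic least-squares problem (the step size and growth factor are illustrative assumptions, not tuned values):

import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

w = np.zeros(d)
alpha, batch_size, growth = 0.01, 2.0, 1.1   # growth = 1/rho > 1

for k in range(200):
    size = min(int(batch_size), n)
    batch = rng.choice(n, size=size, replace=False)
    grad = X[batch].T @ (X[batch] @ w - y[batch]) / size   # mini-batch gradient
    w -= alpha * grad
    batch_size *= growth                                   # |B^{k+1}| = |B^k| / rho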
Variance-Reduction
Increasing the batch size is a form of variance-reduction.
A way to decrease the variance in SGD (the “bad” term).
Many other forms of variance reduction exist.
Control variates, importance sampling, re-parameterization trick, and so on.
These improve constants in the SGD convergence rate.
But they don’t improve the rate unless the objective is smooth and the variance goes to zero.
Outline
1 Mini-Batches and Batching
2 Stochastic Average Gradient
Previously: Better Methods for Smooth Objectives and Finite Datasets
Stochastic vs. deterministic methods
• Goal = best of both worlds: linear rate with O(1) iteration cost
[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods.]
Stochastic methods:
O(1/ε) iterations but require 1 gradient per iteration.
Deterministic methods:
O(log(1/ε)) iterations but require n gradients per iteration.
Growing-batch (“batching”) or “switching” methods:
O(log(1/ε)) iterations, but require fewer than n gradients in early iterations.
Stochastic Average Gradient
Growing |B^k| eventually requires an O(n) iteration cost.
Can we have 1 gradient per iteration and only O(log(1/ε)) iterations?
YES! The first such method was the stochastic average gradient (SAG) algorithm in 2012.
To motivate SAG, let’s view gradient descent as performing the iteration
w^{k+1} = w^k − (α_k/n) ∑_{i=1}^n v_i^k,
where on each step we set v_i^k = ∇f_i(w^k) for all i.
SAG method: only set v_{i_k}^k = ∇f_{i_k}(w^k) for a randomly-chosen i_k.
All other v_i^k are kept at their previous values.
Stochastic Average Gradient
We can think of SAG as having a memory of vectors v_1, v_2, ..., v_n,
where v_i is the gradient ∇f_i(w^k) from the last iteration k where example i was selected.
On each iteration we:
Randomly choose one of the v_i and update it to the current gradient.
We take a step in the direction of the average of these v_i.
Stochastic Average Gradient
Basic SAG algorithm (maintains g = ∑_{i=1}^n v_i):
Set g = 0 and gradient approximation v_i = 0 for i = 1, 2, ..., n.
while(1)
  Sample i from {1, 2, ..., n}.
  Compute ∇f_i(w).
  g = g − v_i + ∇f_i(w).
  v_i = ∇f_i(w).
  w = w − (α/n) g.
(A runnable sketch of this loop appears below.)
Iteration cost is O(d), and “lazy updates” allow O(z) cost when gradients have only z non-zeroes.
For linear models where f_i(w) = h(wᵀx_i), it only requires O(n) memory:
∇f_i(w) = h′(wᵀx_i) x_i   (a scalar times the data vector x_i).
Least squares is h(z) = (1/2)(z − y_i)², logistic is h(z) = log(1 + exp(−y_i z)), etc.
For neural networks, would need to store all activations (typically impractical).
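A minimal runnable sketch of the basic SAG loop above, for a dense least-squares model (the step size is a rough 1/L_i-style choice and the problem data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

alpha = 1.0 / np.max(np.sum(X ** 2, axis=1))  # rough step size; the theory uses 1/(16L)
w = np.zeros(d)
V = np.zeros((n, d))    # memory of stored gradients v_i
g = np.zeros(d)         # running sum g = sum_i v_i

for k in range(10 * n):
    i = rng.integers(n)
    grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of f_i at the current w
    g += grad_i - V[i]                  # update the running sum
    V[i] = grad_i                       # overwrite the stored gradient
    w -= (alpha / n) * g                # step along the average of the stored v_i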
Stochastic Average Gradient
The SAG iteration is
w^{k+1} = w^k − (α_k/n) ∑_{i=1}^n v_i^k,
where on each iteration we set v_{i_k}^k = ∇f_{i_k}(w^k) for a randomly-chosen i_k.
Unlike batching, we use a gradient for every example.
But the gradients might be out of date.
SAG is a stochastic variant of the earlier incremental aggregated gradient (IAG) method.
IAG selects i_k cyclically, which destroys performance.
Key proof idea: v_i^k → ∇f_i(w*) at the same rate that w^k → w*:
So the variance ‖e^k‖² (the “bad” term) converges linearly to 0.
Convergence Rate of SAG
If each ∇f_i is L-Lipschitz continuous and f is strongly convex, then with α_k = 1/(16L) SAG has
E[f(w^k) − f(w*)] ≤ O((1 − min{μ/(16L), 1/(8n)})^k).
Number of ∇f_i evaluations to reach accuracy ε:
Stochastic: O((L/μ)(1/ε)). (Best when n is enormous.)
Gradient: O(n (L/μ) log(1/ε)).
Nesterov: O(n √(L/μ) log(1/ε)). (Best when n is small and L/μ is big.)
SAG: O(max{n, L/μ} log(1/ε)).
But note that the L values are again different between algorithms.
Comparing Deterministic and Stochastic Methods
Two benchmark L2-regularized logistic regression datasets:
[Figure: objective minus optimum (log scale) vs. effective passes on the two datasets, comparing AFG, L-BFGS, SG, ASG, and IAG.]
Averaging makes SG work better, but deterministic methods eventually catch up.
SAG Compared to Deterministic/Stochastic Methods
Two benchmark L2-regularized logistic regression datasets:
[Figure: objective minus optimum (log scale) vs. effective passes on the two datasets, comparing AFG, L-BFGS, SG, ASG, IAG, and SAG-LS.]
SAG starts like a stochastic method but has a linear rate; the SAG step size is set using an approximation L̂ of L.
Discussion of SAG and Beyond
Bonus slides discuss practical issues related to SAG:
Setting the step size with an approximation to L.
Deciding when to stop.
Lipschitz sampling of training examples.
This improves the rate for SAG, but only changes constants for SG.
There are now a bunch of stochastic algorithms with fast rates:
SDCA, MISO, mixedGrad, SVRG, S2GD, Finito, SAGA, etc.
Accelerated/Newton-like/coordinate-wise/proximal/ADMM versions.
Analysis in non-convex settings, including new algorithms for PCA.
You can apparently get medals for research: https://ismp2018.sciencesconf.org/data/pages/_SJP8196.jpg
The most notable variation is SVRG, which gets rid of the memory...
Stochastic Variance-Reduced Gradient (SVRG)
SVRG algorithm: gets rid of memory by occasionally computing exact gradient.
w^{k+1} = w^k − α_k (∇f_{i_k}(w^k) − ∇f_{i_k}(w^s) + ∇f(w^s)),
where the last two terms have mean zero and w^s is updated every m iterations (a sketch appears below).
Convergence properties similar to SAG (for suitable m).
Unbiased: E[∇f_{i_k}(w^s)] = ∇f(w^s) (a special case of a “control variate”).
Theoretically m depends on L, µ, and n (some analyses randomize it).
In practice m = n seems to work well.
O(d) storage at average cost of 3 gradients per iteration.
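A minimal sketch of the SVRG outer/inner loop for least squares, using m = n as suggested above (the step size and names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 10
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]        # gradient of f_i for least squares

alpha = 0.1 / np.max(np.sum(X ** 2, axis=1))  # rough step size (illustrative)
w = np.zeros(d)
m = n                                      # inner-loop length, m = n in practice

for s in range(20):                        # outer loop: take a snapshot w^s
    w_s = w.copy()
    full_grad = X.T @ (X @ w_s - y) / n    # exact gradient at the snapshot
    for k in range(m):                     # inner loop: variance-reduced steps
        i = rng.integers(n)
        g = grad_i(w, i) - grad_i(w_s, i) + full_grad
        w -= alpha * g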
End of Part 2: Key Ideas
Typical ML problems are written as the optimization problem
argmin_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n f_i(wᵀx_i) + λ r(w).
Coordinate optimization:
Faster than gradient descent if iterations are d-times cheaper.
Allows non-smooth r if it’s separable.
Stochastic subgradient:
Iteration cost is n-times cheaper than [sub]gradient descent.
For non-smooth problems, the convergence rate is the same as the subgradient method.
For smooth problems, the number of iterations is much higher than gradient descent.
Effect of constant step size and batch size.
SAG and SVRG:
Special case when F is smooth.
Same low cost as stochastic gradient methods.
But similar convergence rate to gradient descent (many extensions exist).
Even Bigger Problems?
What about datasets that don’t fit on one machine?
We need to consider parallel and distributed optimization.
New issues:
Synchronization: we may not want to wait for the slowest machine.
Communication: it’s expensive to transfer data and parameters across machines.
Failures: in huge-scale settings, machine failure probability is non-trivial.
Batch size: for SGD, is it better to get more parallelism or more iterations?
“Embarrassingly” parallel solution (see the sketch after this list):
Split data across machines; each machine computes the gradient of its subset.
Papers present fancier methods, but always try this first (“linear speedup”).
Fancier methods:
Asynchronous stochastic subgradient (works fine if you make the step size smaller).
Parallel coordinate optimization (works fine if you make the step size smaller).
Decentralized gradient (needs a smaller step size and an “EXTRA” trick).
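A minimal single-machine sketch of the embarrassingly parallel idea: split the data into chunks, compute each chunk’s gradient, and average (the chunk count and helper names are illustrative; on a real cluster each chunk would live on its own machine):

import numpy as np

def chunk_gradient(w, X_chunk, y_chunk):
    # Gradient of the least-squares loss over one machine's subset of the data.
    return X_chunk.T @ (X_chunk @ w - y_chunk) / len(y_chunk)

rng = np.random.default_rng(5)
X, y = rng.standard_normal((1000, 10)), rng.standard_normal(1000)
w = np.zeros(10)

num_machines = 4
X_chunks = np.array_split(X, num_machines)
y_chunks = np.array_split(y, num_machines)

# Each "machine" computes its local gradient; the results are combined by averaging.
local_grads = [chunk_gradient(w, Xc, yc) for Xc, yc in zip(X_chunks, y_chunks)]
grad = np.average(local_grads, axis=0, weights=[len(yc) for yc in y_chunks])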
Skipped Topics: Kernel Methods and Dual Methods
In previous years, I’ve covered the following topics:
1 Kernel methods:
Allows using some exponential- or infinite-sized feature sets.
Allows defining a “similarity” between training examples rather than features.
Mercer’s theorem and how to determine if a kernel is valid.
Representer theorem and models allowing the kernel trick.
Multiple kernel learning and the connection to structured sparsity.
Large-scale kernel approximations that avoid the high cost.
2 Dual methods:
Lagrangian function, dual function, and convex conjugate.
Fenchel dual for deriving duals of “loss plus regularizer” problems.
Connection between the stochastic subgradient method and dual coordinate ascent.
Turning non-smooth problems into equivalent smooth problems.
Line search for stochastic subgradient methods.
If you’re interested, I put the slides on these topics here:
https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.5.pdf
Summary
Mini-batches and the effect of batch size:
Doubling the batch size halves the variance.
Growing the batch size leads to a faster rate in terms of iterations.
And makes it easier to set the step-size and use Newton-like methods.
Stochastic average gradient: O(log(1/ε)) iterations with 1 gradient per iteration.
SVRG removes the memory requirement of SAG.
Next time: optimization with n =∞ (possibly non-IID).
SAG Practical Implementation Issues
Implementation tricks:
Improve performance at the start by using (1/m) g instead of (1/n) g,
where m is the number of examples visited so far.
Common to use α_k = 1/L̂ with an adaptive estimate L̂ of L.
Start with L̂ = 1 and double it whenever we don’t satisfy
f_{i_k}(w^k − (1/L̂) ∇f_{i_k}(w^k)) ≤ f_{i_k}(w^k) − (1/(2L̂)) ‖∇f_{i_k}(w^k)‖²,
and ‖∇f_{i_k}(w^k)‖ is non-trivial. Costs O(1) for linear models in terms of n and d.
(A sketch of this test appears after this list.)
Can use ‖w^{k+1} − w^k‖/α = (1/n)‖g‖ ≈ ‖∇f(w^k)‖ to decide when to stop.
Lipschitz sampling of examples improves the convergence rate:
As with coordinate descent, sample the examples that can change quickly more often.
For classic SG methods, this only changes constants.
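A minimal sketch of the doubling test above for a single least-squares example (the functions and names are illustrative assumptions, not the exact implementation from the SAG paper):

import numpy as np

def f_i(w, x_i, y_i):
    return 0.5 * (x_i @ w - y_i) ** 2        # least-squares loss for one example

def grad_f_i(w, x_i, y_i):
    return (x_i @ w - y_i) * x_i

def update_lipschitz_estimate(L_hat, w, x_i, y_i):
    # Double L_hat until the per-example descent condition holds.
    g = grad_f_i(w, x_i, y_i)
    if g @ g < 1e-10:                        # skip trivial gradients
        return L_hat
    while f_i(w - g / L_hat, x_i, y_i) > f_i(w, x_i, y_i) - (g @ g) / (2 * L_hat):
        L_hat *= 2
    return L_hat

rng = np.random.default_rng(6)
x_i, y_i, w = rng.standard_normal(10), 0.3, np.zeros(10)
L_hat = update_lipschitz_estimate(1.0, w, x_i, y_i)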