A Beautiful Paper
Benedetta Morini
Dipartimento di Ingegneria Industriale, Università di Firenze
Florence, February 7, 2020
Suspense...
My options:
1. Inexact Newton methods: iterative linear algebra, forcing terms, globalization strategies in Krylov subspaces...
2. Interior Point methods: stability and accuracy. Reduced and unreduced KKT systems, stability of linear solvers, conditioning of linear systems, propagation of roundoff errors...
AFTER MAKING MARCO WORRY...
Trust-region methods
Solve an unconstrained, possibly nonconvex, optimization problem min_{x∈R^n} f(x), with f smooth and bounded below.
0. Given x_0, δ_0, γ > 1, η_1 ∈ (0, 1), η_2 > 0. Set k = 0.
1. Model construction:
   Build a model m_k^D that approximates f on B(x_k, δ_k):
   m_k^D(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T H_k s,   H_k ≈ ∇²f(x_k)
2. Step calculation:
   Compute s_k = argmin_{‖s‖≤δ_k} m_k^D(x_k + s) approximately (the standard Cauchy decrease condition is satisfied).
3. Acceptance of the trial point:
   Compute ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k^D(x_k) − m_k^D(x_k + s_k)).
   If ρ_k ≥ η_1, set x_{k+1} = x_k + s_k, δ_{k+1} = γδ_k (successful iteration);
   else set x_{k+1} = x_k, δ_{k+1} = γ^{-1}δ_k (unsuccessful iteration).
   Set k = k + 1 and go to Step 1.
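The loop above can be sketched in Python as follows. The subproblem is solved approximately with the Cauchy point, which satisfies the standard Cauchy decrease condition; all names and default constants are illustrative, not from the paper.

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta0=1.0, gamma=2.0, eta1=0.1,
                 max_iter=100, tol=1e-8):
    """Basic trust-region loop (sketch): Cauchy-point step, ratio test,
    radius update delta -> gamma*delta or delta/gamma."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        # Cauchy point: minimize the quadratic model along -g inside the ball.
        t = delta / gnorm
        gHg = g @ H @ g
        if gHg > 0:
            t = min(t, (g @ g) / gHg)
        s = -t * g
        pred = -(g @ s + 0.5 * s @ H @ s)      # model decrease m(x) - m(x+s)
        rho = (f(x) - f(x + s)) / pred
        if rho >= eta1:
            x, delta = x + s, gamma * delta    # successful iteration
        else:
            delta = delta / gamma              # unsuccessful iteration
    return x
```

On a convex quadratic the model is exact, so every iteration is successful and the radius only grows.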
Trust-region methods with approximate functions/gradients/Hessians
f is computed approximately to reduce the computational cost, or f(x, ζ) is the noisy computable version of f(x) and the noise ζ is random.
Common settings in machine learning:
1. f(x) = E_ζ[f(x, ζ)], noise unbiased for all x;
2. f(x) = (1/N) ∑_{i=1}^N f_i(x), N ≫ 1 (surrogate problem/supervised learning, sample average approximation of (1));
3. f(x) ≠ E_ζ[f(x, ζ)], large noise with some positive probability.
Build a model m_k that approximates f on B(x_k, δ_k):
m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
with f_k, g_k, H_k estimates of f(x_k), ∇f(x_k), ∇²f(x_k), respectively.
Deterministic/probabilistic/stochastic analysis
m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
1. In the finite-sum minimization case
   f(x) = (1/N) ∑_{i=1}^N f_i(x),   N ≫ 1,
   consider schemes where the model is based on subsampling:
   f_k = (1/|I_k|) ∑_{i∈I_k} f_i(x_k),   I_k ⊆ {1, . . . , N}
   g_k = (1/|I_k|) ∑_{i∈I_k} ∇f_i(x_k)
   H_k = (1/|J_k|) ∑_{i∈J_k} ∇²f_i(x_k),   |J_k| ≤ |I_k|, J_k ⊆ I_k
Deterministic analysis: if f_k and g_k have increasing, up to full, accuracy along the iterations, then the properties rely on the deterministic trust-region approach.
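The subsampled estimates above can be sketched as follows; the function names and the choice of J_k as a prefix of I_k are illustrative assumptions.

```python
import numpy as np

def subsampled_model_data(x, fs, grads, hesss, n_I, n_J, seed=0):
    """Subsampled estimates f_k, g_k, H_k (a sketch): I_k is drawn without
    replacement from {0,...,N-1}, and J_k is a subset of I_k, |J_k| <= |I_k|."""
    rng = np.random.default_rng(seed)
    N = len(fs)
    I_k = rng.choice(N, size=n_I, replace=False)
    J_k = I_k[:n_J]                                   # J_k subset of I_k
    f_k = np.mean([fs[i](x) for i in I_k])
    g_k = np.mean([grads[i](x) for i in I_k], axis=0)
    H_k = np.mean([hesss[i](x) for i in J_k], axis=0)
    return f_k, g_k, H_k
```

Taking n_I = n_J = N recovers the exact (full-sample) quantities.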
Deterministic/probabilistic/stochastic analysis
m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
1. Consider a trust-region scheme that requires at most O(ε^{-2}) iterations to satisfy ‖∇f(x_k)‖ ≤ ε if m_k is sufficiently accurate, i.e., |m_k − m_k^D| ≤ τ_k for some τ_k > 0.
Suppose we can determine a sufficiently accurate model with probability 1 − η̄:
   P(|m_k − m_k^D| ≤ τ_k) ≥ 1 − η̄.
The probability of having sufficiently accurate models along K iterations is (1 − η̄)^K (independent events).
Then the complexity result O(ε^{-2}) holds with probability 1 − η if
   1 − η = (1 − η̄)^K,
i.e.,
   η̄ = 1 − (1 − η)^{1/K} = O(η/K) = O(η ε²).
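This arithmetic is easy to check numerically; the K = 10 000 below is an illustrative stand-in for an O(ε^{-2}) iteration count.

```python
def per_iteration_failure(eta_total, K):
    """Per-iteration failure probability eta_bar solving
    (1 - eta_bar)**K = 1 - eta_total (K independent events)."""
    return 1.0 - (1.0 - eta_total) ** (1.0 / K)

# For a 95% overall guarantee over K = 10_000 iterations, each model must
# be accurate except with probability roughly eta / K.
eta_bar = per_iteration_failure(0.05, 10_000)
```

The required per-iteration failure probability is about 5·10^{-6}, i.e., ≈ 0.05/10 000.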
Deterministic/probabilistic/stochastic analysis
Bernstein inequality
For f_k = (1/|I_k|) ∑_{i∈I_k} f_i(x_k),
   P(|f_k − f(x_k)| ≤ τ_k) ≥ 1 − η
if
   |I_k| ≥ min{ N, ⌈ (2/τ_k)(V_f/τ_k + 2ω_f(x_k)/3) log(2/η) ⌉ }
where E(|f_i(x) − f(x)|²) ≤ V_f and max_{i∈{1,...,N}} |f_i(x)| ≤ ω_f(x).
The failure probability appears in the “log factor” and is not the dominating cost.
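A direct implementation of the sample-size bound shows the mild dependence on the failure probability; the numeric values of V_f, ω_f, τ_k below are illustrative.

```python
import math

def bernstein_sample_size(N, tau, Vf, omega_f, eta):
    """Smallest subsample size from the Bernstein bound above guaranteeing
    P(|f_k - f(x_k)| <= tau) >= 1 - eta, capped at the full size N."""
    bound = math.ceil((2.0 / tau) * (Vf / tau + 2.0 * omega_f / 3.0)
                      * math.log(2.0 / eta))
    return min(N, bound)

# Tightening the failure probability from 1e-2 to 1e-6 grows the sample
# size only logarithmically (log(2/eta) factor).
s_loose = bernstein_sample_size(10**6, 0.1, 1.0, 1.0, 1e-2)
s_tight = bernstein_sample_size(10**6, 0.1, 1.0, 1.0, 1e-6)
```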
Analogous results for gk, Hk.
Deterministic/probabilistic/stochastic analysis
With very small modifications, a standard trust-region method is adapted to stochastic, nonlinear, not necessarily convex functions.
Random models of f are used at each iteration to compute the next potential iterate.
Random estimates of the function values at the current iterate and the potential iterate are used to gauge the progress that is being made.
Stochastic analysis:
Convergence analysis relies on requirements that these models and these estimates are sufficiently accurate with sufficiently high probability.
The probabilities need not increase along the iterations; they only need to stay above a certain constant.
No assumptions are made about how these models and estimates are generated.
If a model or estimate is inaccurate, it can be arbitrarily inaccurate.
The paper
Ruobing Chen, Matt Menickelly, Katya Scheinberg
Stochastic Optimization using a trust-region method and random models
Mathematical Programming Ser. A, 2018
STORM: STochastic Optimization with Random Models
“With roots in statistics and computer science, ML has a mathematical optimization engine as its core”, Frank E. Curtis and Katya Scheinberg, INFORMS Tutorials in Operations Research, 2017
Using a first-order model (H_k ≡ 0), STORM is a stochastic gradient method with adaptive stepsize:
m_k(x_k + s) = f_k + g_k^T s,   s_k = argmin_{‖s‖≤δ_k} m_k(x_k + s) = −δ_k g_k/‖g_k‖
STORM: STochastic Optimization with Random Models
0. Given x_0, δ_0, γ > 1, η_1 ∈ (0, 1), η_2 > 0. Set k = 0.
1. Model construction:
   Build a model m_k that approximates f on B(x_k, δ_k):
   m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
2. Step calculation:
   Compute s_k = argmin_{‖s‖≤δ_k} m_k(x_k + s) approximately (the standard Cauchy decrease condition is satisfied).
3. Estimates calculation:
   Obtain estimates f_k^0 and f_k^s of f(x_k) and f(x_k + s_k), respectively.
4. Acceptance of the trial point:
   Compute ρ_k = (f_k^0 − f_k^s) / (m_k(x_k) − m_k(x_k + s_k)).
   If ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k, set x_{k+1} = x_k + s_k, δ_{k+1} = γδ_k (successful iteration);
   else set x_{k+1} = x_k, δ_{k+1} = γ^{-1}δ_k (unsuccessful iteration).
   Set k = k + 1 and go to Step 1.
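In the first-order case (H_k ≡ 0) the scheme above can be sketched as follows; the noise model and all constants are illustrative, and fresh noisy estimates stand in for the f_k^0, f_k^s of the algorithm.

```python
import numpy as np

def storm_first_order(f_noisy, grad_noisy, x0, delta0=1.0, gamma=2.0,
                      eta1=0.1, eta2=1.0, max_iter=200):
    """First-order STORM sketch (H_k = 0): noisy function/gradient
    estimates, step s_k = -delta_k * g_k / ||g_k||, and the extra
    acceptance test ||g_k|| >= eta2 * delta_k."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g = grad_noisy(x)
        gnorm = np.linalg.norm(g)
        if gnorm == 0.0:
            break
        s = -delta * g / gnorm
        f0, fs = f_noisy(x), f_noisy(x + s)     # estimates f_k^0, f_k^s
        pred = delta * gnorm                    # m_k(x_k) - m_k(x_k + s_k)
        rho = (f0 - fs) / pred
        if rho >= eta1 and gnorm >= eta2 * delta:
            x, delta = x + s, gamma * delta     # (possibly false) success
        else:
            delta = delta / gamma               # unsuccessful iteration
    return x
```

Note the step acceptance uses only the noisy values f0, fs, so a "successful" iteration need not decrease the true f.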
STORM is a stochastic process generating random models {M_k} and quantities {X_k, S_k, ∆_k, F_k^0, F_k^s};
m_k denotes a realization of the random model M_k, x_k a realization of the random iterate X_k, etc.
A successful iteration
   ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k
does not necessarily yield an actual reduction in the true function f, since the step acceptance decision is made on f_k^0 ≈ f(x_k) and f_k^s ≈ f(x_k + s_k).
If f_k^0 and f_k^s are not accurate enough, a successful iteration can result in an increase of the true function value.
Hence two types of successful iterations occur:
  those where f is in fact decreased ⇒ true successful iterations
  those where the decrease of f can be arbitrarily small or negative ⇒ false successful iterations
Role of δk
STORM does not detect true/false successful iterations. But, if f_k^0 and f_k^s are sufficiently accurate, true successful iterations occur sufficiently often for convergence to hold.
Consider a first-order model m_k:
m_k(x_k + s) = f_k + g_k^T s,   s_k = argmin_{‖s‖≤δ_k} m_k(x_k + s) = −δ_k g_k/‖g_k‖
A successful iteration (ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k) implies:
   f_k^0 − f_k^s ≥ η_1 (m_k(x_k) − m_k(x_k + s_k)) = η_1 δ_k ‖g_k‖ ≥ η_1 η_2 δ_k²
using ρ_k ≥ η_1 for the first inequality and ‖g_k‖ ≥ η_2 δ_k for the last.
If |f(x_k) − f_k^0| and |f(x_k + s_k) − f_k^s| are sufficiently smaller than η_1 η_2 δ_k², then the decrease in the estimates is larger than the noise.
⇓
δ_k controls the accuracy of function and gradient estimates and serves as a guess of the true function decrease.
Probabilistic models and estimates
Consider models M_k and estimates F_k^0, F_k^s whose accuracy is proportional to ∆_k with some probability, conditioned on the past.
1. Suppose ∇f is Lipschitz continuous; m_k is a κ-fully linear model of f if
   ‖∇f(y) − ∇m_k(y)‖ ≤ κ δ_k,   |f(y) − m_k(y)| ≤ κ δ_k²   ∀y ∈ B(x_k, δ_k)
2. A sequence of random models {M_k} is α-probabilistically κ-fully linear with respect to {B(X_k, ∆_k)} if M_k, conditioned on the past, is a κ-fully linear model of f with probability P ≥ α.
Use the Bernstein inequality in: averaging techniques and linear interpolation models for f(x) = E_ζ[f(x, ζ)]; subsampled sums in f(x) = (1/N) ∑_{i=1}^N f_i(x).
Conn, Scheinberg, Vicente, Introduction to derivative-free optimization, 2009
Bandeira, Scheinberg, Vicente, SIOPT 2014
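The two fully-linear inequalities can be probed empirically by sampling the ball; this checker is an illustrative sketch (it samples, it does not prove the property), and all names are assumptions.

```python
import numpy as np

def check_fully_linear(f, grad_f, m, grad_m, xk, delta, kappa,
                       n_samples=200, seed=0):
    """Empirically test the two kappa-fully-linear inequalities at random
    points of B(xk, delta); a sampling check, not a proof."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        u = rng.normal(size=xk.shape)
        y = xk + delta * rng.uniform() * u / np.linalg.norm(u)
        if np.linalg.norm(grad_f(y) - grad_m(y)) > kappa * delta:
            return False
        if abs(f(y) - m(y)) > kappa * delta ** 2:
            return False
    return True
```

For example, the first-order Taylor model of f(y) = ‖y‖² is κ-fully linear with κ = 2 (the gradient Lipschitz constant), but not with a much smaller κ.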
Probabilistic models and estimates
Consider models M_k and estimates F_k^0, F_k^s whose accuracy is proportional to ∆_k with some probability, conditioned on the past.
1. f_k^0, f_k^s are ε_f-accurate estimates of f(x_k) and f(x_k + s_k) if
   |f(x_k) − f_k^0| ≤ ε_f δ_k²,   |f(x_k + s_k) − f_k^s| ≤ ε_f δ_k²
2. A sequence of random estimates {F_k} is β-probabilistically ε_f-accurate with respect to {X_k, ∆_k, S_k} if F_k^0, F_k^s, conditioned on the past, are ε_f-accurate estimates of f(x_k) and f(x_k + s_k) respectively, with probability P ≥ β.
Use the Bernstein inequality in: averaging techniques for f(x) = E_ζ[f(x, ζ)]; subsampled sums in f(x) = (1/N) ∑_{i=1}^N f_i(x).
Properties of STORM
If STORM has access to α-probabilistically κ-fully linear models and to β-probabilistically ε_f-accurate estimates, then it enjoys convergence properties with probability one.
Consider the kth iteration (successful/unsuccessful) and the random function
Φ_k = ν f(X_k) + (1 − ν) ∆_k²,   ν ∈ (0, 1) a fixed constant
Analyze the four possible outcomes at each iteration:
1 Good model & good function estimates;
2 Good model & bad function estimates;
3 Bad model & good function estimates;
4 Bad model & bad function estimates.
Properties of STORM
Φ_k = ν f(X_k) + (1 − ν) ∆_k²,   ν ∈ (0, 1) a fixed constant
It is possible to select η_2 (in the test ‖g_k‖ ≥ η_2 δ_k), ε_f, α, β and ν, independent of k, so that, conditioned on the past,
   E[Φ_{k+1} − Φ_k] ≤ −σ ∆_k²,   σ > 0
Successful iteration: x_{k+1} = x_k + s_k, δ_{k+1} = γ δ_k with γ > 1,
   φ_{k+1} − φ_k = ν (f(x_{k+1}) − f(x_k)) + (1 − ν)(γ² − 1) δ_k²,   with the last term > 0
Unsuccessful iteration: x_{k+1} = x_k, δ_{k+1} = γ^{-1} δ_k with γ > 1,
   φ_{k+1} − φ_k = (1 − ν)(1/γ² − 1) δ_k² < 0
Properties of STORM
Φ_k = ν f(X_k) + (1 − ν) ∆_k²,   ν ∈ (0, 1) a fixed constant
1. Unless the model and the estimates are bad, a decrease in Φ_k occurs.
2. When the model and the estimates are bad, an increase of Φ_k may occur.
Good model and good estimates (probability P ≥ αβ):
   Φ_{k+1} − Φ_k ≤ (−ν C_2 + (1 − ν)(γ² − 1)) ∆_k² ≡ B_{k,2} < 0
Good model and bad estimates, or vice versa (probability P ≤ (1 − α) or P ≤ (1 − β)):
   Φ_{k+1} − Φ_k ≤ (−ν C_1 + (1 − ν)(γ² − 1)) ∆_k² ≡ B_{k,1} < 0
Bad model and bad estimates (probability P ≤ (1 − α)(1 − β)):
   Φ_{k+1} − Φ_k ≤ (ν C_3 + (1 − ν)(γ² − 1)) ∆_k² ≡ B_{k,3} > 0
Properties of STORM
Proper choice of α and β yields a decrease of Φ_k in expectation:
   E[Φ_{k+1} − Φ_k] ≤ αβ B_{k,2} + ((1 − α) + (1 − β)) B_{k,1} + (1 − α)(1 − β) B_{k,3} ≤ −σ ∆_k²
(the first two terms are negative, the third is positive), with σ > 0 if, for some τ_1 ∈ (0, 1) and τ_2 > 0,
   (1 − α)(1 − β) < τ_1 < 1   and   (αβ − 1/2) / ((1 − α)(1 − β)) > τ_2
Bricks for properties of STORM
Successful iteration: ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k.
If ‖g_k‖/δ_k is sufficiently large and m_k is κ-fully linear, then
   f(x_k + s_k) − f(x_k) < 0
(the step may be rejected if f_k^0, f_k^s are not accurate enough).
If f_k^0, f_k^s are ε_f-accurate with ε_f small enough and the iteration is successful, then
   f(x_k + s_k) − f(x_k) < 0.
If ‖g_k‖/δ_k is sufficiently large, m_k is κ-fully linear and f_k^0, f_k^s are ε_f-accurate, then the iteration is successful.
Lim-type convergence of STORM
f is bounded from below and ∆k > 0 ⇒ Φk is bounded below for all k.
∇f is Lipschitz continuous.
With probability 1:
   ∑_{k=1}^∞ ∆_k² < ∞,   lim_{k→∞} ‖∇f(X_k)‖ = 0
The second result is shown using Lipschitz continuity of ∇f and does not depend on the stochastic nature of the algorithm.
Open issues and perspectives
Numerical experiments show that the bounds on probability suggested by the theory are far from tight.
Numerical experiments show good results using adaptive sample rules such as
   |I_k| = max{ |I_0| + c k, ⌈1/δ_k²⌉ },   c ≥ 1
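The adaptive rule is a one-liner; the default values of |I_0|, c and the cap N below are illustrative assumptions.

```python
import math

def adaptive_sample_size(k, delta_k, I0=32, c=1, N=100_000):
    """Adaptive rule |I_k| = max{|I_0| + c*k, ceil(1/delta_k^2)},
    capped at the full sample size N (a sketch of the rule above)."""
    return min(N, max(I0 + c * k, math.ceil(1.0 / delta_k ** 2)))
```

The sample grows linearly in k but jumps up whenever the trust-region radius δ_k shrinks, reflecting the δ_k²-accuracy the theory demands.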
Numerical experiments on regularized logistic loss compare STORM against stochastic gradient methods that take an adaptive stepsize but do not compute estimates of the loss.
The true function value produced by stochastic gradient methods can vary widely over the horizon, while for STORM it decreases fairly stably over the successful iterations.