A Beautiful Paper
Benedetta Morini
Dipartimento di Ingegneria Industriale, Università di Firenze
Florence, February 7, 2020
Suspense...
My options:
1. Inexact Newton methods: iterative linear algebra, forcing terms, globalization strategies in Krylov subspaces...
2. Interior Point methods: stability and accuracy. Reduced and unreduced KKT systems, stability of linear solvers, conditioning of linear systems, propagation of roundoff errors...
AFTER MAKING MARCO WORRY...
Trust-region methods
Solve an unconstrained, possibly nonconvex, optimization problem min_{x∈R^n} f(x), with f smooth and bounded below.
0. Given x_0, δ_0, γ > 1, η_1 ∈ (0, 1), η_2 > 0. Set k = 0.
1. Model construction:
   Build a model m_k^D that approximates f on B(x_k, δ_k):
   m_k^D(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T H_k s,   H_k ≈ ∇²f(x_k)
2. Step calculation:
   Compute s_k = argmin_{‖s‖≤δ_k} m_k^D(x_k + s) approximately (the standard Cauchy decrease condition is satisfied).
3. Acceptance of the trial point:
   Compute ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k^D(x_k) − m_k^D(x_k + s_k)).
   If ρ_k ≥ η_1, set x_{k+1} = x_k + s_k, δ_{k+1} = γδ_k (successful iteration);
   else set x_{k+1} = x_k, δ_{k+1} = γ^{-1}δ_k (unsuccessful iteration).
   Set k = k + 1 and go to Step 1.
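The loop above can be sketched in Python as follows. The subproblem is solved approximately with the Cauchy point, which satisfies the standard Cauchy decrease condition; all names and default constants are illustrative, not from the paper.

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta0=1.0, gamma=2.0, eta1=0.1,
                 max_iter=100, tol=1e-8):
    """Basic trust-region loop (sketch): Cauchy-point step, ratio test,
    radius update delta -> gamma*delta or delta/gamma."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        # Cauchy point: minimize the quadratic model along -g inside the ball.
        t = delta / gnorm
        gHg = g @ H @ g
        if gHg > 0:
            t = min(t, (g @ g) / gHg)
        s = -t * g
        pred = -(g @ s + 0.5 * s @ H @ s)      # model decrease m(x) - m(x+s)
        rho = (f(x) - f(x + s)) / pred
        if rho >= eta1:
            x, delta = x + s, gamma * delta    # successful iteration
        else:
            delta = delta / gamma              # unsuccessful iteration
    return x
```

On a convex quadratic the model is exact, so every iteration is successful and the radius only grows.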
Trust-region methods with approximate functions/gradients/Hessians
f is computed approximately to reduce the computational cost, or f(x, ζ) is the noisy computable version of f(x) and the noise ζ is random.
Common settings in machine learning:
1. f(x) = E_ζ[f(x, ζ)], noise unbiased for all x;
2. f(x) = (1/N) ∑_{i=1}^N f_i(x), N ≫ 1 (surrogate problem/supervised learning, sample average approximation of (1));
3. f(x) ≠ E_ζ[f(x, ζ)], large noise with some positive probability.
Build a model m_k that approximates f on B(x_k, δ_k):
m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
with f_k, g_k, H_k estimates of f(x_k), ∇f(x_k), ∇²f(x_k), respectively.
Deterministic/probabilistic/stochastic analysis
m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
1. In the finite-sum minimization case
   f(x) = (1/N) ∑_{i=1}^N f_i(x),   N ≫ 1,
   consider schemes where the model is based on subsampling:
   f_k = (1/|I_k|) ∑_{i∈I_k} f_i(x_k),   I_k ⊆ {1, . . . , N}
   g_k = (1/|I_k|) ∑_{i∈I_k} ∇f_i(x_k)
   H_k = (1/|J_k|) ∑_{i∈J_k} ∇²f_i(x_k),   |J_k| ≤ |I_k|, J_k ⊆ I_k
Deterministic analysis: if f_k and g_k have increasing, up to full, accuracy along the iterations, then the properties rely on the deterministic trust-region approach.
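The subsampled estimates above can be sketched as follows; the function names and the choice of J_k as a prefix of I_k are illustrative assumptions.

```python
import numpy as np

def subsampled_model_data(x, fs, grads, hesss, n_I, n_J, seed=0):
    """Subsampled estimates f_k, g_k, H_k (a sketch): I_k is drawn without
    replacement from {0,...,N-1}, and J_k is a subset of I_k, |J_k| <= |I_k|."""
    rng = np.random.default_rng(seed)
    N = len(fs)
    I_k = rng.choice(N, size=n_I, replace=False)
    J_k = I_k[:n_J]                                   # J_k subset of I_k
    f_k = np.mean([fs[i](x) for i in I_k])
    g_k = np.mean([grads[i](x) for i in I_k], axis=0)
    H_k = np.mean([hesss[i](x) for i in J_k], axis=0)
    return f_k, g_k, H_k
```

Taking n_I = n_J = N recovers the exact (full-sample) quantities.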
Deterministic/probabilistic/stochastic analysis
m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
1. Consider a trust-region scheme that requires at most O(ε^{-2}) iterations to satisfy ‖∇f(x_k)‖ ≤ ε if m_k is sufficiently accurate, i.e., |m_k − m_k^D| ≤ τ_k for some τ_k > 0.
Suppose we can determine a sufficiently accurate model with probability 1 − η̄:
   P(|m_k − m_k^D| ≤ τ_k) ≥ 1 − η̄.
The probability of having sufficiently accurate models along K iterations is (1 − η̄)^K (independent events).
Then the complexity result O(ε^{-2}) holds with probability 1 − η if
   1 − η = (1 − η̄)^K,
i.e.,
   η̄ = 1 − (1 − η)^{1/K} = O(η/K) = O(η ε²).
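This arithmetic is easy to check numerically; the K = 10 000 below is an illustrative stand-in for an O(ε^{-2}) iteration count.

```python
def per_iteration_failure(eta_total, K):
    """Per-iteration failure probability eta_bar solving
    (1 - eta_bar)**K = 1 - eta_total (K independent events)."""
    return 1.0 - (1.0 - eta_total) ** (1.0 / K)

# For a 95% overall guarantee over K = 10_000 iterations, each model must
# be accurate except with probability roughly eta / K.
eta_bar = per_iteration_failure(0.05, 10_000)
```

The required per-iteration failure probability is about 5·10^{-6}, i.e., ≈ 0.05/10 000.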
Deterministic/probabilistic/stochastic analysis
Bernstein inequality
For f_k = (1/|I_k|) ∑_{i∈I_k} f_i(x_k),
   P(|f_k − f(x_k)| ≤ τ_k) ≥ 1 − η
if
   |I_k| ≥ min{ N, ⌈ (2/τ_k)(V_f/τ_k + 2ω_f(x_k)/3) log(2/η) ⌉ }
where E(|f_i(x) − f(x)|²) ≤ V_f and max_{i∈{1,...,N}} |f_i(x)| ≤ ω_f(x).
The failure probability appears in the “log factor” and is not the dominating cost.
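A direct implementation of the sample-size bound shows the mild dependence on the failure probability; the numeric values of V_f, ω_f, τ_k below are illustrative.

```python
import math

def bernstein_sample_size(N, tau, Vf, omega_f, eta):
    """Smallest subsample size from the Bernstein bound above guaranteeing
    P(|f_k - f(x_k)| <= tau) >= 1 - eta, capped at the full size N."""
    bound = math.ceil((2.0 / tau) * (Vf / tau + 2.0 * omega_f / 3.0)
                      * math.log(2.0 / eta))
    return min(N, bound)

# Tightening the failure probability from 1e-2 to 1e-6 grows the sample
# size only logarithmically (log(2/eta) factor).
s_loose = bernstein_sample_size(10**6, 0.1, 1.0, 1.0, 1e-2)
s_tight = bernstein_sample_size(10**6, 0.1, 1.0, 1.0, 1e-6)
```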
Analogous results for gk, Hk.
Deterministic/probabilistic/stochastic analysis
With very small modifications, a standard trust-region method is adapted to stochastic, nonlinear, not necessarily convex functions.
Random models of f are used at each iteration to compute the next potential iterate.
Random estimates of the function values at the current iterate and the potential iterate are used to gauge the progress that is being made.
Stochastic analysis:
Convergence analysis relies on requirements that these models and these estimates are sufficiently accurate with sufficiently high probability.
The probabilities need not increase along the iterations; they only need to stay above a certain constant.
No assumptions are made about how these models and estimates are generated.
If a model or estimate is inaccurate, it can be arbitrarily inaccurate.
The paper
Ruobing Chen, Matt Menickelly, Katya Scheinberg
Stochastic Optimization using a trust-region method and random models
Mathematical Programming Ser. A, 2018
STORM: STochastic Optimization with Random Models
“With roots in statistics and computer science, ML has a mathematical optimization engine as its core”, Frank E. Curtis and Katya Scheinberg, INFORMS Tutorials in Operations Research, 2017
Using a first-order model (H_k ≡ 0), STORM is a stochastic gradient method with adaptive stepsize:
m_k(x_k + s) = f_k + g_k^T s,   s_k = argmin_{‖s‖≤δ_k} m_k(x_k + s) = −δ_k g_k/‖g_k‖
STORM: STochastic Optimization with Random Models
0. Given x_0, δ_0, γ > 1, η_1 ∈ (0, 1), η_2 > 0. Set k = 0.
1. Model construction:
   Build a model m_k that approximates f on B(x_k, δ_k):
   m_k(x_k + s) = f_k + g_k^T s + (1/2) s^T H_k s
2. Step calculation:
   Compute s_k = argmin_{‖s‖≤δ_k} m_k(x_k + s) approximately (the standard Cauchy decrease condition is satisfied).
3. Estimates calculation:
   Obtain estimates f_k^0 and f_k^s of f(x_k) and f(x_k + s_k), respectively.
4. Acceptance of the trial point:
   Compute ρ_k = (f_k^0 − f_k^s) / (m_k(x_k) − m_k(x_k + s_k)).
   If ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k, set x_{k+1} = x_k + s_k, δ_{k+1} = γδ_k (successful iteration);
   else set x_{k+1} = x_k, δ_{k+1} = γ^{-1}δ_k (unsuccessful iteration).
   Set k = k + 1 and go to Step 1.
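In the first-order case (H_k ≡ 0) the scheme above can be sketched as follows; the noise model and all constants are illustrative, and fresh noisy estimates stand in for the f_k^0, f_k^s of the algorithm.

```python
import numpy as np

def storm_first_order(f_noisy, grad_noisy, x0, delta0=1.0, gamma=2.0,
                      eta1=0.1, eta2=1.0, max_iter=200):
    """First-order STORM sketch (H_k = 0): noisy function/gradient
    estimates, step s_k = -delta_k * g_k / ||g_k||, and the extra
    acceptance test ||g_k|| >= eta2 * delta_k."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g = grad_noisy(x)
        gnorm = np.linalg.norm(g)
        if gnorm == 0.0:
            break
        s = -delta * g / gnorm
        f0, fs = f_noisy(x), f_noisy(x + s)     # estimates f_k^0, f_k^s
        pred = delta * gnorm                    # m_k(x_k) - m_k(x_k + s_k)
        rho = (f0 - fs) / pred
        if rho >= eta1 and gnorm >= eta2 * delta:
            x, delta = x + s, gamma * delta     # (possibly false) success
        else:
            delta = delta / gamma               # unsuccessful iteration
    return x
```

Note the step acceptance uses only the noisy values f0, fs, so a "successful" iteration need not decrease the true f.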
STORM is a stochastic process generating random models {M_k} and quantities {X_k, S_k, ∆_k, F_k^0, F_k^s};
m_k denotes a realization of the random model M_k, x_k a realization of the random iterate X_k, etc.
A successful iteration
   ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k
does not necessarily yield an actual reduction in the true function f, since the step acceptance decision is made on f_k^0 ≈ f(x_k) and f_k^s ≈ f(x_k + s_k).
If f_k^0 and f_k^s are not accurate enough, a successful iteration can result in an increase of the true function value.
Hence two types of successful iterations occur:
  those where f is in fact decreased ⇒ true successful iterations
  those where the decrease of f can be arbitrarily small or negative ⇒ false successful iterations
Role of δk
STORM does not detect true/false successful iterations. But, if f_k^0 and f_k^s are sufficiently accurate, true successful iterations occur sufficiently often for convergence to hold.
Consider a first-order model m_k:
m_k(x_k + s) = f_k + g_k^T s,   s_k = argmin_{‖s‖≤δ_k} m_k(x_k + s) = −δ_k g_k/‖g_k‖
A successful iteration (ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k) implies:
   f_k^0 − f_k^s ≥ η_1 (m_k(x_k) − m_k(x_k + s_k)) = η_1 δ_k ‖g_k‖ ≥ η_1 η_2 δ_k²
using ρ_k ≥ η_1 for the first inequality and ‖g_k‖ ≥ η_2 δ_k for the last.
If |f(x_k) − f_k^0| and |f(x_k + s_k) − f_k^s| are sufficiently smaller than η_1 η_2 δ_k², then the decrease in the estimates is larger than the noise.
⇓
δ_k controls the accuracy of function and gradient estimates and serves as a guess of the true function decrease.
Probabilistic models and estimates
Consider models M_k and estimates F_k^0, F_k^s whose accuracy is proportional to ∆_k with some probability, conditioned on the past.
1. Suppose ∇f is Lipschitz continuous; m_k is a κ-fully linear model of f if
   ‖∇f(y) − ∇m_k(y)‖ ≤ κ δ_k,   |f(y) − m_k(y)| ≤ κ δ_k²   ∀y ∈ B(x_k, δ_k)
2. A sequence of random models {M_k} is α-probabilistically κ-fully linear with respect to {B(X_k, ∆_k)} if M_k, conditioned on the past, is a κ-fully linear model of f with probability P ≥ α.
Use the Bernstein inequality in: averaging techniques and linear interpolation models for f(x) = E_ζ[f(x, ζ)]; subsampled sums in f(x) = (1/N) ∑_{i=1}^N f_i(x).
Conn, Scheinberg, Vicente, Introduction to derivative-free optimization, 2009
Bandeira, Scheinberg, Vicente, SIOPT 2014
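The two fully-linear inequalities can be probed empirically by sampling the ball; this checker is an illustrative sketch (it samples, it does not prove the property), and all names are assumptions.

```python
import numpy as np

def check_fully_linear(f, grad_f, m, grad_m, xk, delta, kappa,
                       n_samples=200, seed=0):
    """Empirically test the two kappa-fully-linear inequalities at random
    points of B(xk, delta); a sampling check, not a proof."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        u = rng.normal(size=xk.shape)
        y = xk + delta * rng.uniform() * u / np.linalg.norm(u)
        if np.linalg.norm(grad_f(y) - grad_m(y)) > kappa * delta:
            return False
        if abs(f(y) - m(y)) > kappa * delta ** 2:
            return False
    return True
```

For example, the first-order Taylor model of f(y) = ‖y‖² is κ-fully linear with κ = 2 (the gradient Lipschitz constant), but not with a much smaller κ.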
Probabilistic models and estimates
Consider models M_k and estimates F_k^0, F_k^s whose accuracy is proportional to ∆_k with some probability, conditioned on the past.
1. f_k^0, f_k^s are ε_f-accurate estimates of f(x_k) and f(x_k + s_k) if
   |f(x_k) − f_k^0| ≤ ε_f δ_k²,   |f(x_k + s_k) − f_k^s| ≤ ε_f δ_k²
2. A sequence of random estimates {F_k} is β-probabilistically ε_f-accurate with respect to {X_k, ∆_k, S_k} if F_k^0, F_k^s, conditioned on the past, are ε_f-accurate estimates of f(x_k) and f(x_k + s_k) respectively, with probability P ≥ β.
Use the Bernstein inequality in: averaging techniques for f(x) = E_ζ[f(x, ζ)]; subsampled sums in f(x) = (1/N) ∑_{i=1}^N f_i(x).
Properties of STORM
If STORM has access to α-probabilistically κ-fully linear models and to β-probabilistically ε_f-accurate estimates, then it enjoys convergence properties with probability one.
Consider the kth iteration (successful/unsuccessful) and the random function
Φ_k = ν f(X_k) + (1 − ν) ∆_k²,   ν ∈ (0, 1) a fixed constant
Analyze the four possible outcomes at each iteration:
1 Good model & good function estimates;
2 Good model & bad function estimates;
3 Bad model & good function estimates;
4 Bad model & bad function estimates.
Properties of STORM
Φ_k = ν f(X_k) + (1 − ν) ∆_k²,   ν ∈ (0, 1) a fixed constant
It is possible to select η_2 (in the test ‖g_k‖ ≥ η_2 δ_k), ε_f, α, β and ν, independent of k, so that, conditioned on the past,
   E[Φ_{k+1} − Φ_k] ≤ −σ ∆_k²,   σ > 0
Successful iteration: x_{k+1} = x_k + s_k, δ_{k+1} = γ δ_k with γ > 1,
   φ_{k+1} − φ_k = ν (f(x_{k+1}) − f(x_k)) + (1 − ν)(γ² − 1) δ_k²,   with the last term > 0
Unsuccessful iteration: x_{k+1} = x_k, δ_{k+1} = γ^{-1} δ_k with γ > 1,
   φ_{k+1} − φ_k = (1 − ν)(1/γ² − 1) δ_k² < 0
Properties of STORM
Φ_k = ν f(X_k) + (1 − ν) ∆_k²,   ν ∈ (0, 1) a fixed constant
1. Unless the model and the estimates are bad, a decrease in Φ_k occurs.
2. When the model and the estimates are bad, an increase of Φ_k may occur.
Good model and good estimates (probability P ≥ αβ):
   Φ_{k+1} − Φ_k ≤ (−ν C_2 + (1 − ν)(γ² − 1)) ∆_k² ≡ B_{k,2} < 0
Good model and bad estimates, or vice versa (probability P ≤ (1 − α) or P ≤ (1 − β)):
   Φ_{k+1} − Φ_k ≤ (−ν C_1 + (1 − ν)(γ² − 1)) ∆_k² ≡ B_{k,1} < 0
Bad model and bad estimates (probability P ≤ (1 − α)(1 − β)):
   Φ_{k+1} − Φ_k ≤ (ν C_3 + (1 − ν)(γ² − 1)) ∆_k² ≡ B_{k,3} > 0
Properties of STORM
Proper choice of α and β yields a decrease of Φ_k in expectation:
   E[Φ_{k+1} − Φ_k] ≤ αβ B_{k,2} + ((1 − α) + (1 − β)) B_{k,1} + (1 − α)(1 − β) B_{k,3} ≤ −σ ∆_k²
(the first two terms are negative, the third is positive), with σ > 0 if, for some τ_1 ∈ (0, 1) and τ_2 > 0,
   (1 − α)(1 − β) < τ_1 < 1   and   (αβ − 1/2) / ((1 − α)(1 − β)) > τ_2
Bricks for properties of STORM
Successful iteration: ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k.
If ‖g_k‖/δ_k is sufficiently large and m_k is κ-fully linear, then
   f(x_k + s_k) − f(x_k) < 0
(the step may be rejected if f_k^0, f_k^s are not accurate enough).
If f_k^0, f_k^s are ε_f-accurate with ε_f small enough and the iteration is successful, then
   f(x_k + s_k) − f(x_k) < 0.
If ‖g_k‖/δ_k is sufficiently large, m_k is κ-fully linear and f_k^0, f_k^s are ε_f-accurate, then the iteration is successful.
Lim-type convergence of STORM
f is bounded from below and ∆k > 0 ⇒ Φk is bounded below for all k.
∇f is Lipschitz continuous.
With probability 1:
   ∑_{k=1}^∞ ∆_k² < ∞,   lim_{k→∞} ‖∇f(X_k)‖ = 0
The second result is shown using Lipschitz continuity of ∇f and does not depend on the stochastic nature of the algorithm.
Open issues and perspectives
Numerical experiments show that the bounds on probability suggested by the theory are far from tight.
Numerical experiments show good results using adaptive sample rules such as
   |I_k| = max{ |I_0| + c k, ⌈1/δ_k²⌉ },   c ≥ 1
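The adaptive rule is a one-liner; the default values of |I_0|, c and the cap N below are illustrative assumptions.

```python
import math

def adaptive_sample_size(k, delta_k, I0=32, c=1, N=100_000):
    """Adaptive rule |I_k| = max{|I_0| + c*k, ceil(1/delta_k^2)},
    capped at the full sample size N (a sketch of the rule above)."""
    return min(N, max(I0 + c * k, math.ceil(1.0 / delta_k ** 2)))
```

The sample grows linearly in k but jumps up whenever the trust-region radius δ_k shrinks, reflecting the δ_k²-accuracy the theory demands.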
Numerical experiments on regularized logistic loss compare STORM against stochastic gradient methods that take an adaptive stepsize but do not compute estimates of the loss.
The true function value produced by stochastic gradient methods can vary widely over the horizon, while for STORM it decreases fairly stably over the successful iterations.