A Beautiful Paper Benedetta Morini Dipartimento di Ingegneria Industriale, Universit` a di Firenze Firenze, 7 febbraio 2020
Page 1: A Beautiful Paper · 2020. 2. 17.

A Beautiful Paper

Benedetta MoriniDipartimento di Ingegneria Industriale, Universita di Firenze

Firenze, 7 febbraio 2020

Page 5

Suspense...

My options:

1 Inexact Newton methods: iterative linear algebra, forcing terms, globalization strategies in Krylov subspaces...

2 Interior Point methods: stability and accuracy. Reduced and unreduced KKT systems, stability of linear solvers, conditioning of linear systems, propagation of roundoff errors...

AFTER MAKING MARCO WORRY...

Page 6

Trust-region methods

Solve an unconstrained, possibly nonconvex, optimization problem min_{x∈R^n} f(x), with f smooth and bounded below.

0. Given x0, δ0, γ > 1, η1 ∈ (0, 1), η2 > 0. Set k = 0.

1. Model construction:

Build a model mDk that approximates f on B(xk, δk)

mDk(xk + s) = f(xk) + ∇f(xk)ᵀs + sᵀHk s,  Hk ≈ ∇²f(xk)

2. Step calculation:

Compute sk = argmin_{‖s‖≤δk} mDk(xk + s) approximately, so that the standard Cauchy decrease condition is satisfied

3. Acceptance of the trial point:

Compute ρk = (f(xk) − f(xk + sk)) / (mDk(xk) − mDk(xk + sk))

If ρk ≥ η1, set xk+1 = xk + sk, δk+1 = γδk (successful iteration)

Else set xk+1 = xk, δk+1 = δk/γ (unsuccessful iteration)

Set k = k + 1 and go to Step 1
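The algorithm above can be sketched in Python (an illustrative sketch, not code from the talk; the subproblem is solved at the Cauchy point, which satisfies the standard Cauchy decrease condition, and the model here carries the usual 1/2 on the quadratic term):

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta0=1.0, gamma=2.0, eta1=0.1,
                 max_iter=200, tol=1e-8):
    """Basic trust-region method with a Cauchy-point step (illustrative sketch)."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        # Cauchy point: minimize the model along -g inside the ball ||s|| <= delta
        gHg = g @ H @ g
        t = delta / np.linalg.norm(g)
        if gHg > 0:
            t = min(t, (g @ g) / gHg)
        s = -t * g
        pred = -(g @ s + 0.5 * s @ H @ s)   # model decrease m(x) - m(x + s)
        rho = (f(x) - f(x + s)) / pred      # actual vs. predicted reduction
        if rho >= eta1:                     # successful: accept step, enlarge radius
            x, delta = x + s, gamma * delta
        else:                               # unsuccessful: reject step, shrink radius
            delta /= gamma
    return x

# Example: a smooth convex quadratic with minimizer (1, -2)
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 20.0])
x_star = trust_region(f, grad, hess, np.zeros(2))
```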

Page 7

Trust-region methods with approximate functions/gradients/Hessians

f is computed approximately to reduce the computational cost, or

f(x, ζ) is the noisy computable version of f(x), where the noise ζ is random.

Common settings in machine learning:

1 f(x) = Eζ[f(x, ζ)], ∀x (unbiased noise);

2 f(x) = (1/N) ∑_{i=1}^{N} fi(x), N ≫ 1

surrogate problem/supervised learning, sample average approximation of (1)

3 f(x) ≠ Eζ[f(x, ζ)]: large noise with some positive probability

Build a model mk that approximates f on B(xk, δk)

mk(xk + s) = fk + gkᵀs + sᵀHk s

fk, gk, Hk estimates of f(xk), ∇f(xk), ∇²f(xk), respectively

Page 8

Deterministic/probabilistic/stochastic analysis

mk(xk + s) = fk + gkᵀs + sᵀHk s

1 In the finite-sum minimization case

f(x) = (1/N) ∑_{i=1}^{N} fi(x), N ≫ 1

consider schemes where the model is based on subsampling:

fk = (1/|Ik|) ∑_{i∈Ik} fi(xk), Ik ⊆ {1, . . . , N}

gk = (1/|Ik|) ∑_{i∈Ik} ∇fi(xk)

Hk = (1/|Jk|) ∑_{i∈Jk} ∇²fi(xk), Jk ⊆ Ik, |Jk| ≤ |Ik|

Deterministic analysis: if fk and gk have increasing, up to full, accuracy along the iterations, then the properties rely on the deterministic trust-region approach.
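In this finite-sum setting the subsampled estimates can be sketched as follows (a toy Python illustration; the quadratic fi and the sample sizes are my own hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite sum: f(x) = (1/N) sum_i f_i(x) with f_i(x) = 0.5 * ||x - a_i||^2
N, n = 1000, 5
A = rng.normal(size=(N, n))          # each row is a_i

def f_i(x, i):  return 0.5 * np.sum((x - A[i])**2)
def g_i(x, i):  return x - A[i]
def h_i(x, i):  return np.eye(n)

def subsampled_estimates(x, sample_size, hess_size):
    """Subsampled f_k, g_k, H_k with J_k a subset of I_k, |J_k| <= |I_k|."""
    I = rng.choice(N, size=sample_size, replace=False)
    J = I[:hess_size]                                 # J_k contained in I_k
    fk = np.mean([f_i(x, i) for i in I])
    gk = np.mean([g_i(x, i) for i in I], axis=0)
    Hk = np.mean([h_i(x, i) for i in J], axis=0)
    return fk, gk, Hk

x = np.zeros(n)
fk, gk, Hk = subsampled_estimates(x, sample_size=200, hess_size=50)
# Full-sample values the estimates approximate:
f_full = np.mean([f_i(x, i) for i in range(N)])
g_full = np.mean([g_i(x, i) for i in range(N)], axis=0)
```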

Page 9

Deterministic/probabilistic/stochastic analysis

mk(xk + s) = fk + gkᵀs + sᵀHk s

1 Consider a trust-region scheme that requires at most O(ε⁻²) iterations to satisfy ‖∇f(xk)‖ ≤ ε, provided mk is sufficiently accurate, i.e., |mk − mDk| ≤ τk for some τk > 0.

Suppose we can determine a sufficiently accurate model with probability 1 − η at each iteration:

P(|mk − mDk| ≤ τk) ≥ 1 − η

The probability of having sufficiently accurate models along K iterations is (1 − η)^K (independent events).

Then the complexity result O(ε⁻²) holds with overall probability 1 − η̄ if

1 − η̄ = (1 − η)^K

i.e.,

η = 1 − (1 − η̄)^(1/K) = O(η̄/K) = O(η̄ ε²)

so the per-iteration failure probability must shrink like ε².

Page 10

Deterministic/probabilistic/stochastic analysis

Bernstein inequality

fk = (1/|Ik|) ∑_{i∈Ik} fi(xk)

P(|fk − f(xk)| ≤ τk) ≥ 1 − η

if

|Ik| ≥ min{ N, ⌈ (2/τk) (Vf/τk + 2ωf(xk)/3) log(2/η) ⌉ }

where E(|fi(x) − f(x)|2) ≤ Vf and maxi∈{1,...,N} |fi(x)| ≤ ωf (x)

The failure probability appears only in the log factor, so it is not the dominating cost.

Analogous results for gk, Hk.
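The sample-size bound is easy to evaluate (a small Python sketch; the values of τk, η, Vf, ωf and N below are purely illustrative):

```python
import math

def bernstein_sample_size(tau, eta, Vf, omega, N):
    """Sample size |I_k| ensuring P(|f_k - f(x_k)| <= tau) >= 1 - eta
    via the Bernstein bound quoted above (capped at the full sum size N)."""
    m = math.ceil((2.0 / tau) * (Vf / tau + 2.0 * omega / 3.0) * math.log(2.0 / eta))
    return min(N, m)

# Illustrative values: variance bound Vf, uniform bound omega, N terms in the sum
size = bernstein_sample_size(tau=0.1, eta=0.01, Vf=1.0, omega=5.0, N=10**6)
# The failure probability eta enters only through log(2/eta), so tightening it
# is cheap: eta = 0.001 only grows the sample moderately.
size_strict = bernstein_sample_size(tau=0.1, eta=0.001, Vf=1.0, omega=5.0, N=10**6)
```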

Page 11

Deterministic/probabilistic/stochastic analysis

With very small modifications, a standard trust-region method adapts to stochastic, not necessarily convex, nonlinear functions.

Random models of f are used at each iteration to compute the next potential iterate.

Random estimates of the function values at the current iterate and the potential iterate are used to gauge the progress that is being made.

Stochastic analysis:

Convergence analysis relies on the requirement that these models and estimates are sufficiently accurate with sufficiently high probability.

The probabilities need not increase along the iterations; they only need to stay above a fixed constant.

No assumptions are made about how these models and estimates are generated.

If a model or estimate is inaccurate, it can be arbitrarily inaccurate.

Page 12

The paper

Ruobing Chen, Matt Menickelly, Katya Scheinberg

Stochastic Optimization using a trust-region method and random models

Mathematical Programming Ser. A, 2018

STORM: STochastic Optimization using Random Models

“With roots in statistics and computer science, ML has a mathematical optimization engine as its core”, Frank E. Curtis and Katya Scheinberg, INFORMS Tutorials in Operations Research, 2017

Using a first-order model (Hk ≡ 0), STORM is a stochastic gradient method with an adaptive stepsize:

mk(xk + s) = fk + gkᵀs,  sk = argmin_{‖s‖≤δk} mk(xk + s) = −δk gk/‖gk‖

Page 13

STORM: STochastic Optimization using Random Models

0. Given x0, δ0, γ > 1, η1 ∈ (0, 1), η2 > 0. Set k = 0.

1. Model construction:

Build a model mk that approximates f on B(xk, δk)

mk(xk + s) = fk + gTk s + sT Hks

2. Step calculation:

Compute sk = argmin_{‖s‖≤δk} mk(xk + s) approximately (the standard Cauchy decrease condition is satisfied)

3. Estimates calculation:

Obtain estimates f⁰k and fˢk of f(xk) and f(xk + sk), respectively.

4. Acceptance of the trial point:

Compute ρk = (f⁰k − fˢk) / (mk(xk) − mk(xk + sk))

If ρk ≥ η1 and ‖gk‖ ≥ η2δk, set xk+1 = xk + sk, δk+1 = γδk and go to Step 1 (successful iteration)

Else set xk+1 = xk, δk+1 = δk/γ and go to Step 1 (unsuccessful iteration)
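A minimal first-order realization of this loop (Hk ≡ 0, so sk = −δk gk/‖gk‖ and the model decrease is δk‖gk‖) might look like the following sketch on a toy finite sum; the objective, sample sizes and constants are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite sum f(x) = (1/N) sum_i 0.5*||x - a_i||^2; minimizer = mean of the a_i
N, n = 500, 3
A = rng.normal(size=(N, n))
x_opt = A.mean(axis=0)

def sample_f(x, m):
    """Subsampled estimate of f(x)."""
    I = rng.choice(N, size=m, replace=False)
    return 0.5 * np.mean(np.sum((x - A[I])**2, axis=1))

def sample_g(x, m):
    """Subsampled estimate of the gradient of f at x."""
    I = rng.choice(N, size=m, replace=False)
    return x - A[I].mean(axis=0)

def storm_first_order(x0, delta0=1.0, gamma=2.0, eta1=0.1, eta2=0.5,
                      m=100, iters=300):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(iters):
        gk = sample_g(x, m)                              # model gradient estimate
        sk = -delta * gk / (np.linalg.norm(gk) + 1e-16)  # minimizer of the linear model
        f0, fs = sample_f(x, m), sample_f(x + sk, m)     # estimates at x_k and x_k + s_k
        pred = delta * np.linalg.norm(gk)                # m_k(x_k) - m_k(x_k + s_k)
        rho = (f0 - fs) / pred
        if rho >= eta1 and np.linalg.norm(gk) >= eta2 * delta:
            x, delta = x + sk, gamma * delta             # successful (possibly "false")
        else:
            delta /= gamma                               # unsuccessful
    return x

x_end = storm_first_order(5.0 * np.ones(n))
```

Note how the radius δk both caps the step length and gates acceptance through ‖gk‖ ≥ η2δk, as discussed on the next slides.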

Page 14

STORM is a stochastic process generating random models {Mk} and quantities {Xk, Sk, ∆k, F⁰k, Fˢk}

mk denotes a realization of the random model Mk, xk is a realization of the random iterate Xk, etc.

A successful iteration

ρk ≥ η1 and ‖gk‖ ≥ η2δk

does not necessarily yield an actual reduction in the true function f, since the step acceptance decision is made on f⁰k ≈ f(xk) and fˢk ≈ f(xk + sk).

If f⁰k and fˢk are not accurate enough, a successful iteration can result in an increase of the true function value.

Hence two types of successful iterations occur:

those where f is in fact decreased ⇒ true successful iterations

those where the decrease of f can be arbitrarily small or negative ⇒ false successful iterations

Page 15

Role of δk

STORM does not detect true/false successful iterations. But, if f⁰k and fˢk are sufficiently accurate, true successful iterations occur sufficiently often for convergence to hold.

Consider a first-order model mk:

mk(xk + s) = fk + gkᵀs,  sk = argmin_{‖s‖≤δk} mk(xk + s) = −δk gk/‖gk‖

A successful iteration (ρk ≥ η1 and ‖gk‖ ≥ η2δk) implies:

f⁰k − fˢk ≥ η1 (mk(xk) − mk(xk + sk))  [since ρk ≥ η1]

= η1 δk‖gk‖ ≥ η1η2 δk²  [since ‖gk‖ ≥ η2δk]

If |f(xk) − f⁰k| and |f(xk + sk) − fˢk| are sufficiently smaller than η1η2δk², then the decrease in the estimates is larger than the noise.

δk controls the accuracy of function and gradient estimates and serves as a guess of the true function decrease.
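A quick numeric check of this threshold, with illustrative numbers:

```python
# Accepted iteration: rho_k >= eta1 and ||g_k|| >= eta2 * delta_k
eta1, eta2, delta = 0.1, 0.5, 0.2
g_norm = 0.15                        # satisfies g_norm >= eta2 * delta = 0.1
model_decrease = delta * g_norm      # m_k(x_k) - m_k(x_k + s_k) for the linear model
threshold = eta1 * eta2 * delta**2   # the eta1*eta2*delta_k^2 scale from the slide
# A successful iteration certifies an estimated decrease of at least the threshold:
assert eta1 * model_decrease >= threshold
# If both estimate errors are well below the threshold (here threshold/4),
# the true decrease f(x_k) - f(x_k + s_k) is still positive:
err = threshold / 4
assert eta1 * model_decrease - 2 * err > 0
```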

Page 16

Probabilistic models and estimates

Consider models Mk and estimates F⁰k, Fˢk whose accuracy is proportional to ∆k with some probability, conditioned on the past.

1 Suppose ∇f is Lipschitz continuous. mk is a κ-fully linear model of f on B(xk, δk) if

‖∇f(y) − ∇mk(y)‖ ≤ κ δk,  |f(y) − mk(y)| ≤ κ δk²  ∀y ∈ B(xk, δk)

2 A sequence of random models {Mk} is α-probabilistically κ-fully linear with respect to {B(Xk, ∆k)} if Mk, conditioned on the past, is a κ-fully linear model of f with probability P ≥ α.
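The two error bounds in the definition can be checked numerically; below, the first-order Taylor model of a hypothetical smooth test function plays the role of a κ-fully linear model (κ = 1 works here because the Hessian norm is at most 1):

```python
import numpy as np

# Test function f(x) = sin(x1) + cos(x2) and its gradient (illustrative choice)
f = lambda x: np.sin(x[0]) + np.cos(x[1])
grad = lambda x: np.array([np.cos(x[0]), -np.sin(x[1])])

xk = np.array([0.3, -0.7])
kappa = 1.0          # valid since the Hessian of f has norm at most 1 everywhere
rng = np.random.default_rng(0)

for delta in [0.5, 0.1, 0.02]:
    for _ in range(100):
        # Sample y in the ball B(xk, delta)
        y = xk + delta * rng.uniform(-1, 1, size=2) / np.sqrt(2)
        m_y = f(xk) + grad(xk) @ (y - xk)          # linear (Taylor) model at y
        assert abs(f(y) - m_y) <= kappa * delta**2          # function-error bound
        assert np.linalg.norm(grad(y) - grad(xk)) <= kappa * delta  # gradient-error bound
```

Both bounds shrink with δk, which is exactly why the trust-region radius can double as an accuracy control.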

Use the Bernstein inequality in: averaging techniques and linear interpolation models for f(x) = Eζ[f(x, ζ)]; subsampled sums for f(x) = (1/N) ∑_{i=1}^{N} fi(x)

Conn, Scheinberg, Vicente, Introduction to derivative-free optimization, 2009

Bandeira, Scheinberg, Vicente, SIOPT 2014

Page 17

Probabilistic models and estimates

Consider models Mk and estimates F⁰k, Fˢk whose accuracy is proportional to ∆k with some probability, conditioned on the past.

1 f⁰k, fˢk are εf-accurate estimates of f(xk) and f(xk + sk) if

|f(xk) − f⁰k| ≤ εf δk²,  |f(xk + sk) − fˢk| ≤ εf δk²

2 A sequence of random estimates {Fk} is β-probabilistically εf-accurate with respect to {Xk, ∆k, Sk} if F⁰k, Fˢk, conditioned on the past, are εf-accurate estimates of f(xk) and f(xk + sk) respectively with probability P ≥ β.

Use the Bernstein inequality in: averaging techniques for f(x) = Eζ[f(x, ζ)]; subsampled sums for f(x) = (1/N) ∑_{i=1}^{N} fi(x)

Page 18

Properties of STORM

If STORM has access to α-probabilistically κ-fully linear models and to β-probabilistically εf-accurate estimates, then it has convergence properties with probability one.

Consider the kth iteration (successful or unsuccessful) and the random function

Φk = ν f(Xk) + (1 − ν)∆k²,  with ν ∈ (0, 1) a fixed constant

Analyze the four possible outcomes at each iteration:

1 Good model & good function estimates;

2 Good model & bad function estimates;

3 Bad model & good function estimates;

4 Bad model & bad function estimates.

Page 19

Properties of STORM

Φk = ν f(Xk) + (1 − ν)∆k²,  with ν ∈ (0, 1) a fixed constant

It is possible to select η2 (from the test ‖gk‖ ≥ η2δk), εf, α, β and ν, independent of k, so that, conditioned on the past,

E[Φk+1 − Φk] ≤ −σ∆k²,  σ > 0

Successful iteration: xk+1 = xk + sk, δk+1 = γδk with γ > 1,

Φk+1 − Φk = ν(f(xk+1) − f(xk)) + (1 − ν)(γ² − 1)δk², with the second term > 0

Unsuccessful iteration: xk+1 = xk, δk+1 = δk/γ with γ > 1,

Φk+1 − Φk = (1 − ν)(1/γ² − 1)δk² < 0

Page 20

Properties of STORM

Φk = ν f(Xk) + (1 − ν)∆k²,  with ν ∈ (0, 1) a fixed constant

1 Unless the model and the estimates are bad, a decrease in Φk occurs.

2 When the model and the estimates are bad, an increase of Φk may occur.

Good model and good estimates (probability P ≥ αβ):

Φk+1 − Φk ≤ (−νC2 + (1 − ν)(γ² − 1)) ∆k² ≡ Bk,2 < 0

Good model and bad estimates, or vice versa (probability P ≤ (1 − α) or P ≤ (1 − β)):

Φk+1 − Φk ≤ (−νC2 + (1 − ν)(γ² − 1)) ∆k² ≡ Bk,1 < 0

Bad model and bad estimates (probability P ≤ (1 − α)(1 − β)):

Φk+1 − Φk ≤ (νC3 + (1 − ν)(γ² − 1)) ∆k² ≡ Bk,3 > 0

Page 21

Properties of STORM

A proper choice of α and β yields a decrease of Φk in expectation:

E[Φk+1 − Φk] ≤ αβ Bk,2 + ((1 − α) + (1 − β)) Bk,1 + (1 − α)(1 − β) Bk,3 ≤ −σ∆k²

The first two terms are < 0, the third is > 0; σ > 0 holds if, for some τ1 ∈ (0, 1) and τ2 > 0,

(1 − α)(1 − β) < τ1 < 1  and  (αβ − 1/2) / ((1 − α)(1 − β)) > τ2
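For instance, α = β = 0.95 satisfies conditions of this form with ample room (τ1 and τ2 below are illustrative thresholds, not values from the paper):

```python
alpha, beta = 0.95, 0.95
tau1, tau2 = 0.5, 10.0                    # illustrative thresholds
both_bad = (1 - alpha) * (1 - beta)       # P(bad model AND bad estimates) bound
assert both_bad < tau1 < 1                # both_bad is about 0.0025
assert (alpha * beta - 0.5) / both_bad > tau2   # about 161, comfortably above tau2
```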

Page 22

Bricks for properties of STORM

Successful iteration: ρk ≥ η1 and ‖gk‖ ≥ η2δk

If ‖gk‖/δk is sufficiently large and mk is κ-fully linear, then

f(xk + sk) − f(xk) < 0

(The step may be rejected if f⁰k, fˢk are not accurate enough.)

If f⁰k, fˢk are εf-accurate with εf small enough and the iteration is successful, then

f(xk + sk) − f(xk) < 0

If ‖gk‖/δk is sufficiently large, mk is κ-fully linear and f⁰k, fˢk are εf-accurate, then the iteration is successful.

Page 23

Lim-type convergence of STORM

f is bounded from below and ∆k > 0 ⇒ Φk is bounded below for all k.

∇f is Lipschitz continuous.

With probability 1:

∑_{k=1}^{∞} ∆k² < ∞

lim_{k→∞} ‖∇f(Xk)‖ = 0

The second result is shown using the Lipschitz continuity of ∇f and does not depend on the stochastic nature of the algorithm.

Page 24

Open issues and perspectives

Numerical experiments show that the bounds on the probabilities suggested by the theory are far from tight.

Numerical experiments show good results using adaptive sample rules such as

|Ik| = max{ |I0| + c·k, ⌈1/δk²⌉ },  c ≥ 1
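The adaptive rule is straightforward to implement (a small sketch; the values of |I0| and c are illustrative):

```python
import math

def sample_size(k, delta_k, I0=10, c=1):
    """Adaptive rule |I_k| = max{|I_0| + c*k, ceil(1/delta_k^2)} from the slide."""
    return max(I0 + c * k, math.ceil(1.0 / delta_k**2))

# A small trust-region radius forces a large sample...
assert sample_size(k=5, delta_k=0.1) == 100   # ceil(1/delta^2) dominates
# ...while with a large radius the linearly growing term dominates
assert sample_size(k=50, delta_k=1.0) == 60   # I0 + c*k dominates
```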

Numerical experiments on a regularized logistic loss compare against stochastic gradient methods that take adaptive stepsizes but do not compute estimates of the loss.

The true function value produced by stochastic gradient methods can vary widely over the horizon, while for STORM it decreases fairly stably over the successful iterations.

