
L10 quasi newton - University Of Marylandtomg/course/cmsc764/L10... · PROBLEMS WITH NEWTON’S...

Date post: 15-Oct-2020

QUASI-NEWTON METHODS


PROBLEMS WITH NEWTON’S METHOD

• Newton’s method is expensive
  • Compute mixed partials: $O(N^2)$
  • Invert Hessian: $O(N^3)$
• Gradient descent is cheap
  • Compute partials: $O(N)$
• Other problems:
  • When the quadratic approximation is bad, Newton is a waste
  • Hessian becomes poorly conditioned
  • Nonconvex problems: an indefinite Hessian can produce an ascent step

Rosenbrock banana function: $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$
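The indefinite-Hessian problem can be seen directly on the Rosenbrock function. The sketch below is only an illustration: the test point (0, 0.01) is an assumption chosen by hand, and the gradient and Hessian are derived from the formula for f above.

```python
import numpy as np

def rosenbrock_grad(x, y):
    # Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
    return np.array([-2.0 * (1.0 - x) - 400.0 * x * (y - x**2),
                     200.0 * (y - x**2)])

def rosenbrock_hess(x, y):
    # Matrix of second partial derivatives of the same f
    return np.array([[2.0 - 400.0 * (y - x**2) + 800.0 * x**2, -400.0 * x],
                     [-400.0 * x, 200.0]])

# Hand-picked illustrative point (an assumption, not from the slides)
x, y = 0.0, 0.01
g = rosenbrock_grad(x, y)
H = rosenbrock_hess(x, y)

eigs = np.linalg.eigvalsh(H)        # one negative and one positive eigenvalue
d_newton = np.linalg.solve(H, -g)   # Newton direction: H d = -g
slope = g @ d_newton                # directional derivative of f along d

print("Hessian eigenvalues:", eigs)
print("Directional derivative along Newton step:", slope)
```

A positive directional derivative means that f initially increases along the Newton direction at this point, so a pure Newton step moves uphill.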


CONVERGENCE RATES
Attain super-linear performance without all the fuss

• Linear: $\|e_{k+1}\| \le C\|e_k\|$ with $C < 1$ (gradient methods)
• Quadratic: $\|e_{k+1}\| \le C\|e_k\|^2$ (Newton)
• Superlinear: $\|e_{k+1}\| \le C\|e_k\|^p$, $p > 1$ (as good as quadratic?)


QUASI-NEWTON METHODS

Newton approximation:
$f(x) \approx f(x_k) + (x - x_k)^T g_k + \frac{1}{2}(x - x_k)^T H_k (x - x_k)$
Solve $H_k d_k = -g_k$

Quasi-Newton approximation:
$f(x) \approx f(x_k) + (x - x_k)^T g_k + \frac{1}{2}(x - x_k)^T B_k (x - x_k)$
Solve $B_k d_k = -g_k$, where $B_k$ is an approximate Hessian

What conditions should $B_k$ satisfy?


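WHEN DOES THIS WORK?

$B_k d_k = -g_k$, so the search direction is $d = -B^{-1} g$, where $g$ is the gradient.

The search direction must make an acute angle with the negative gradient:
$\langle -g, -B^{-1} g \rangle = \langle g, B^{-1} g \rangle = g^T B^{-1} g > 0$

So $B$ must be positive definite (PD).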


SECANT CONDITION

$\nabla f(x) - \nabla f(y) \approx H(x - y)$

With
$\Delta x = x_{k+1} - x_k$
$\Delta g = \nabla f(x_{k+1}) - \nabla f(x_k)$
the secant condition is
$\Delta g \approx B_{k+1} \Delta x$
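For a quadratic objective the approximation behind the secant condition is exact, which makes it easy to check numerically. The sketch below assumes a randomly generated symmetric positive definite Hessian and random test points; none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
M = rng.standard_normal((N, N))
H = M.T @ M + N * np.eye(N)      # symmetric positive definite Hessian (assumed data)
b = rng.standard_normal(N)

def grad(x):
    # Gradient of the quadratic f(x) = 0.5 x^T H x + b^T x
    return H @ x + b

x_old = rng.standard_normal(N)
x_new = rng.standard_normal(N)
dx = x_new - x_old
dg = grad(x_new) - grad(x_old)

# For a quadratic, dg equals H dx up to rounding error
secant_residual = np.linalg.norm(dg - H @ dx)
print("||dg - H dx|| =", secant_residual)
```

For a general smooth function the residual is not zero, but it shrinks as the two points get closer, which is why a matrix satisfying the secant condition is a reasonable stand-in for the Hessian.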


BFGS


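BASIC QUASI-NEWTON METHOD

Standard Newton stuff:
• Start with $x_k$, $B_k$
• Solve $B_k d = -\nabla f(x_k)$
• $x_{k+1} = x_k + \tau d$

Find new approximate Hessian: solve
$\Delta g \approx B_{k+1} \Delta x$
where
$\Delta x = x_{k+1} - x_k$
$\Delta g = \nabla f(x_{k+1}) - \nabla f(x_k)$

Many solutions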


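BFGS UPDATE
Broyden-Fletcher-Goldfarb-Shanno

Solve:
$\underset{B_{k+1}}{\text{minimize}} \ \|B_{k+1} - B_k\|^2$
subject to $B_{k+1} = B_{k+1}^T$ and the secant condition $\Delta g \approx B_{k+1} \Delta x$

Solution:
$B_{k+1} = B_k + \frac{\Delta g \Delta g^T}{\Delta g^T \Delta x} - \frac{(B_k \Delta x)(B_k \Delta x)^T}{\Delta x^T B_k \Delta x}$

The numerators are outer products; the denominators are scalars.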
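A minimal sketch of the update formula as code may help. The function name `bfgs_update` and the random test data are assumptions made for illustration. The checks at the end confirm that the updated matrix is symmetric, maps Δx to Δg, and stays positive definite when Δgᵀ Δx > 0.

```python
import numpy as np

def bfgs_update(B, dx, dg):
    """Return B_{k+1} given B_k, dx = x_{k+1} - x_k and dg = gradient difference."""
    Bdx = B @ dx
    return (B
            + np.outer(dg, dg) / (dg @ dx)       # outer product over a scalar
            - np.outer(Bdx, Bdx) / (dx @ Bdx))   # outer product over a scalar

rng = np.random.default_rng(1)
N = 4
M = rng.standard_normal((N, N))
H = M.T @ M + np.eye(N)          # illustrative SPD "true" Hessian (assumed)
dx = rng.standard_normal(N)
dg = H @ dx                      # gradient change of a quadratic with Hessian H

B_new = bfgs_update(np.eye(N), dx, dg)

print("secant residual:", np.linalg.norm(B_new @ dx - dg))
print("symmetric:", np.allclose(B_new, B_new.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(B_new).min())
```

Each update costs O(N²) because it only needs one matrix-vector product and two outer products.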


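BFGS METHOD

Standard Newton stuff:
• Start with $x_k$, $B_k$
• Solve $B_k d = -\nabla f(x_k)$
• $x_{k+1} = x_k + \tau d$

Find new approximate Hessian:
$\Delta x = x_{k+1} - x_k$
$\Delta g = \nabla f(x_{k+1}) - \nabla f(x_k)$
$B_{k+1} = B_k + \frac{\Delta g \Delta g^T}{\Delta g^T \Delta x} - \frac{(B_k \Delta x)(B_k \Delta x)^T}{\Delta x^T B_k \Delta x}$

What’s the rate-limiting step?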
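The steps above can be sketched as a short loop. This is an illustration under stated assumptions, not the course's reference implementation: the backtracking (Armijo) line search, its constants, the starting point, the stopping tolerance, and the rule for skipping an update when the curvature Δgᵀ Δx is not safely positive are all choices made here. The test function is the Rosenbrock banana function.

```python
import numpy as np

def f(x):
    # Rosenbrock banana function
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([-2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
                     200.0 * (x[1] - x[0]**2)])

def bfgs_minimize(f, grad_f, x0, tol=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                    # initial Hessian approximation
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Dense linear solve, O(N^3): the most expensive step of this variant
        d = np.linalg.solve(B, -g)
        # Backtracking line search with an Armijo sufficient-decrease test
        tau = 1.0
        while f(x + tau * d) > f(x) + 1e-4 * tau * (g @ d):
            tau *= 0.5
        x_new = x + tau * d
        g_new = grad_f(x_new)
        dx, dg = x_new - x, g_new - g
        # Update only when the curvature dg^T dx is safely positive (keeps B PD)
        if dg @ dx > 1e-10 * np.linalg.norm(dx) * np.linalg.norm(dg):
            Bdx = B @ dx
            B = B + np.outer(dg, dg) / (dg @ dx) - np.outer(Bdx, Bdx) / (dx @ Bdx)
        x, g = x_new, g_new
    return x

x_star = bfgs_minimize(f, grad_f, [-1.2, 1.0])
print("approximate minimizer:", x_star)
```

In this form the linear solve dominates the cost per iteration, since the update of B itself only needs O(N²) work.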


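RANK-1 STRATEGY

$d = -B_k^{-1} \nabla f(x_k)$

$B_{k+1} = B_k + \frac{\Delta g \Delta g^T}{\Delta g^T \Delta x} - \frac{(B_k \Delta x)(B_k \Delta x)^T}{\Delta x^T B_k \Delta x}$

Both correction terms are rank-1 matrices:
$B_{k+1} = B_k + \alpha \Delta g \Delta g^T - \beta (B_k \Delta x)(B_k \Delta x)^T$
with $\alpha = 1 / (\Delta g^T \Delta x)$ and $\beta = 1 / (\Delta x^T B_k \Delta x)$

Woodbury identity:
$(A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}$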


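RANK-1 STRATEGY

$B_{k+1} = B_k + \alpha \Delta g \Delta g^T - \beta (B_k \Delta x)(B_k \Delta x)^T$

Write the correction as a product $UV$:
$B_{k+1} = B_k + \begin{pmatrix} | & | \\ \alpha \Delta g & -\beta B_k \Delta x \\ | & | \end{pmatrix} \begin{pmatrix} \Delta g^T \\ (B_k \Delta x)^T \end{pmatrix}$
where the first factor is $U$ ($N \times 2$) and the second is $V$ ($2 \times N$).

Woodbury identity:
$(A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}$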
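The Woodbury identity is easy to sanity-check numerically. In the sketch below the matrices A, U and V are random placeholders (assumptions for illustration), with V scaled down so that A + UV is safely invertible. The point is that, once A⁻¹ is known, only a small r × r system has to be solved.

```python
import numpy as np

rng = np.random.default_rng(2)
N, r = 6, 2                               # r = 2 matches the two rank-1 terms
M = rng.standard_normal((N, N))
A = M.T @ M + N * np.eye(N)               # well-conditioned SPD matrix (assumed)
U = rng.standard_normal((N, r))
V = 0.1 * rng.standard_normal((r, N))     # small, so A + UV stays invertible

A_inv = np.linalg.inv(A)
small = np.eye(r) + V @ A_inv @ U         # only an r x r matrix
woodbury = A_inv - A_inv @ U @ np.linalg.solve(small, V @ A_inv)
direct = np.linalg.inv(A + U @ V)

print("max abs difference:", np.abs(woodbury - direct).max())
```

With r = 2, as in the BFGS correction, updating a known inverse this way needs only matrix-vector style operations rather than a fresh O(N³) inversion.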


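RANK-1 STRATEGY

$d = -B_k^{-1} \nabla f(x_k)$

BFGS update (rank-1 matrices):
$B_{k+1} = B_k + \frac{\Delta g \Delta g^T}{\Delta g^T \Delta x} - \frac{(B_k \Delta x)(B_k \Delta x)^T}{\Delta x^T B_k \Delta x}$

Applying the Woodbury identity:
$B_{k+1}^{-1} = \left(I - \frac{\Delta x \Delta g^T}{\Delta g^T \Delta x}\right) B_k^{-1} \left(I - \frac{\Delta g \Delta x^T}{\Delta g^T \Delta x}\right) + \frac{\Delta x \Delta x^T}{\Delta g^T \Delta x}$

Complexity?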
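One way to see the cost of the inverse update is to expand the product so that no N × N matrix-matrix multiplication is needed. The sketch below does this, assuming the stored inverse is symmetric; the function name `bfgs_inverse_update` and the random test data are illustrative assumptions. The result is checked against the inverse of the directly updated B.

```python
import numpy as np

def bfgs_inverse_update(Binv, dx, dg):
    """O(N^2) update of the inverse approximation (Binv assumed symmetric)."""
    rho = 1.0 / (dg @ dx)
    Binv_dg = Binv @ dg                   # one matrix-vector product, O(N^2)
    # Expansion of (I - rho dx dg^T) Binv (I - rho dg dx^T) + rho dx dx^T
    return (Binv
            - rho * np.outer(dx, Binv_dg)
            - rho * np.outer(Binv_dg, dx)
            + (rho**2 * (dg @ Binv_dg) + rho) * np.outer(dx, dx))

rng = np.random.default_rng(3)
N = 5
M1 = rng.standard_normal((N, N))
M2 = rng.standard_normal((N, N))
B = M1.T @ M1 + np.eye(N)                 # current approximation (assumed SPD)
H = M2.T @ M2 + np.eye(N)                 # illustrative "true" Hessian
dx = rng.standard_normal(N)
dg = H @ dx                               # guarantees dg^T dx > 0

# Direct update of B, then compare with the updated inverse
Bdx = B @ dx
B_new = B + np.outer(dg, dg) / (dg @ dx) - np.outer(Bdx, Bdx) / (dx @ Bdx)
Binv_new = bfgs_inverse_update(np.linalg.inv(B), dx, dg)

print("max |Binv_new B_new - I|:", np.abs(Binv_new @ B_new - np.eye(N)).max())
```

Every operation in the function is a matrix-vector product, an inner product, or an outer product, so the update is O(N²) in both time and memory.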


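SECANT METHOD

[Figure: secant-method iterates $x_{k-1}$, $x_k$, $x_{k+1}$]

Solve: $h(x) = \nabla f(x) = 0$, with $\nabla h = \nabla^2 f$

$h(x_k) - h(x_{k-1}) \approx \nabla h(x_k)(x_k - x_{k-1})$

$\Delta g \approx B_k \Delta x$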
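In one dimension the idea reduces to the classical secant method, where the derivative of h is replaced by a finite-difference slope through the last two iterates. The sketch below is an illustration only: the example function f(x) = x⁴/4 − x (so that h(x) = x³ − 1) and the starting points are assumptions, not taken from the slides.

```python
def secant_root(h, x_prev, x_curr, tol=1e-12, max_iter=50):
    """Find a root of h using slopes estimated from the last two iterates."""
    for _ in range(max_iter):
        # Secant slope: the 1-D analogue of B_k in  dg ~ B_k dx
        slope = (h(x_curr) - h(x_prev)) / (x_curr - x_prev)
        x_next = x_curr - h(x_curr) / slope     # Newton-like step with estimated slope
        if abs(x_next - x_curr) < tol:
            return x_next
        x_prev, x_curr = x_curr, x_next
    return x_curr

def h(x):
    # h = f' for the illustrative objective f(x) = x^4 / 4 - x
    return x**3 - 1.0

root = secant_root(h, 0.5, 2.0)
print("stationary point of f:", root)
```

No second derivative of f is ever evaluated; the curvature information comes entirely from differences of gradients, which is the same mechanism quasi-Newton methods use in higher dimensions.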


CONVERGENCE

Exact line search + strong convexity = n-step quadratic convergence:
$\|x_{k+n} - x^\star\| \le c\|x_k - x^\star\|^2$

Backtracking search + strong convexity + smooth Hessian = superlinear:
$\frac{\|x_{k+1} - x^\star\|}{\|x_k - x^\star\|} \to 0$

See O’Leary, “Scientific Computing”, 2009


EXAMPLE

$\text{minimize} \ c^T x - \sum_{i=1}^{m} \log(b_i - a_i^T x)$, with $n = 100$, $m = 500$
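To run any of these methods on the example, only the objective and its gradient are needed. The sketch below sets them up with randomly generated data; the way A, b and c are drawn is an assumption, since the slide gives only the sizes n = 100 and m = 500. The gradient is checked against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 500
A = rng.standard_normal((m, n))       # rows are a_i^T (random data, assumed)
b = rng.uniform(1.0, 2.0, size=m)     # b > 0 makes x = 0 strictly feasible
c = rng.standard_normal(n)

def objective(x):
    s = b - A @ x                     # slacks b_i - a_i^T x, must stay positive
    if np.any(s <= 0):
        return np.inf                 # outside the domain of the log terms
    return c @ x - np.sum(np.log(s))

def gradient(x):
    s = b - A @ x
    return c + A.T @ (1.0 / s)        # each log term contributes a_i / s_i

# Finite-difference check of the gradient along a random unit direction
x0 = np.zeros(n)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
h = 1e-6
fd_slope = (objective(x0 + h * v) - objective(x0 - h * v)) / (2 * h)
exact_slope = gradient(x0) @ v
print("finite difference:", fd_slope, " analytic:", exact_slope)
```

Returning infinity outside the domain lets a backtracking line search reject steps that would leave the feasible region.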


SQUARE-ROOT BFGS

Factored update: $B_k = L_k^T L_k \ \to \ B_{k+1} = L_{k+1}^T L_{k+1}$, cost $O(N^2)$

Inverse update: $B_k^{-1} \ \to \ B_{k+1}^{-1}$, cost $O(N^2)$

Fails when conditioning is bad

See O’Leary, “Scientific Computing”, 2009


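L-BFGS

• Pick some memory parameter: $m \approx 25$
• Assume $B_{k-m} = I$
• Store only the $m$ most recent pairs $\{\Delta x, \Delta g\}$
• Evaluate $B_{k+1}^{-1} \nabla f(x_k)$ using the recursive definition

$B_{k+1}^{-1} = \left(I - \frac{\Delta x \Delta g^T}{\Delta g^T \Delta x}\right) B_k^{-1} \left(I - \frac{\Delta g \Delta x^T}{\Delta g^T \Delta x}\right) + \frac{\Delta x \Delta x^T}{\Delta g^T \Delta x}$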
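The recursive definition can be applied to a vector without ever forming an N × N matrix. The sketch below uses the standard two-loop recursion, starting from the identity as on the slide. The function names and the random test pairs are assumptions for illustration, and the dense version exists only to check the result.

```python
import numpy as np

def lbfgs_apply_inverse(g, dx_list, dg_list):
    """Apply the implicit inverse Hessian approximation to g (pairs oldest first)."""
    q = np.array(g, dtype=float)
    alphas = []
    # First loop: newest pair to oldest
    for dx, dg in zip(reversed(dx_list), reversed(dg_list)):
        a = (dx @ q) / (dg @ dx)
        q -= a * dg
        alphas.append(a)
    r = q                                  # initial inverse approximation is I
    # Second loop: oldest pair to newest
    for dx, dg, a in zip(dx_list, dg_list, reversed(alphas)):
        beta = (dg @ r) / (dg @ dx)
        r = r + (a - beta) * dx
    return r

def dense_inverse(dx_list, dg_list):
    """Reference: build the same inverse explicitly with the recursive formula."""
    n = dx_list[0].size
    I = np.eye(n)
    Hinv = np.eye(n)
    for dx, dg in zip(dx_list, dg_list):
        rho = 1.0 / (dg @ dx)
        Hinv = ((I - rho * np.outer(dx, dg)) @ Hinv @ (I - rho * np.outer(dg, dx))
                + rho * np.outer(dx, dx))
    return Hinv

rng = np.random.default_rng(6)
n, m = 8, 3
M = rng.standard_normal((n, n))
H_true = M.T @ M + np.eye(n)               # illustrative SPD Hessian (assumed)
dx_list = [rng.standard_normal(n) for _ in range(m)]
dg_list = [H_true @ dx for dx in dx_list]  # guarantees positive curvature
g = rng.standard_normal(n)

two_loop = lbfgs_apply_inverse(g, dx_list, dg_list)
reference = dense_inverse(dx_list, dg_list) @ g
print("max abs difference:", np.abs(two_loop - reference).max())
```

The search direction is the negative of the returned vector. The recursion touches each stored pair twice, so the cost and memory are both proportional to N times m.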


REALLY LOW MEMORY: CG

Quadratic problems:
$\text{minimize} \ x^T H x + b^T x$

Conjugate descent directions:
$\langle d_i, H d_j \rangle = 0$ for $i \ne j$

$d_{k+1} = -\nabla f(x_{k+1}) + \beta d_k$

Reduced Gram-Schmidt process


REALLY LOW MEMORY: CG

• Initialize: $d_0 = -\nabla f(x_0)$
...
• Line search: $x_{k+1} = x_k + \tau d_k$
• $d_{k+1} = -\nabla f(x_{k+1}) + \beta d_k$

Polak-Ribiere rule (or others):
$\beta = \frac{(\nabla f(x_{k+1}) - \nabla f(x_k))^T \nabla f(x_{k+1})}{\nabla f(x_k)^T \nabla f(x_k)}$

See O’Leary, “Scientific Computing”, 2009
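A minimal sketch of the iteration on a quadratic problem is given below. Two assumptions are made for illustration: the objective is written as 0.5 xᵀHx + bᵀx, so that the gradient is simply Hx + b, and the line search is exact, which has a closed form for quadratics. The function name and test data are also placeholders.

```python
import numpy as np

def cg_polak_ribiere(H, b, x0, num_iters):
    """Minimize 0.5 x^T H x + b^T x using exact line searches."""
    x = np.asarray(x0, dtype=float)
    g = H @ x + b                          # gradient at the starting point
    d = -g                                 # initialize with steepest descent
    for _ in range(num_iters):
        if np.linalg.norm(g) < 1e-14:
            break
        tau = -(g @ d) / (d @ H @ d)       # exact minimizer along d
        x = x + tau * d
        g_new = H @ x + b
        beta = (g_new - g) @ g_new / (g @ g)   # Polak-Ribiere rule
        d = -g_new + beta * d
        g = g_new
    return x

rng = np.random.default_rng(7)
N = 10
M = rng.standard_normal((N, N))
H = M.T @ M + N * np.eye(N)                # SPD matrix (assumed test data)
b = rng.standard_normal(N)

x_cg = cg_polak_ribiere(H, b, np.zeros(N), num_iters=N)
print("gradient norm after N steps:", np.linalg.norm(H @ x_cg + b))
```

Only a handful of vectors are stored, so memory is O(N). On a quadratic with exact line searches the method reaches the minimizer in at most N steps, up to rounding error.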


WHEN TO USE QUASI-NEWTON?

Problem should be:
• Smooth
• Unconstrained

Cheap gradients vs. expensive gradients: choose between BFGS and conjugate gradients


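MAX-ENT LOGISTIC REGRESSION

Two categories:
$P(Y_i = 1) = \frac{e^{w_1^T x}}{1 + e^{w_1^T x}}$
$P(Y_i = 0) = \frac{1}{1 + e^{w_1^T x}}$

Many categories, with normalizer $Z = 1 + \sum_l e^{w_l^T x}$:
$P(Y_i = 0) = \frac{1}{Z}$
$P(Y_i = k) = \frac{1}{Z} e^{w_k^T x} = \frac{e^{w_k^T x}}{1 + \sum_l e^{w_l^T x}}$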


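MAXIMUM LIKELIHOOD

$\text{maximize} \ P(w) = \prod_i P(Y_i \mid w) = \prod_i \frac{e^{w_{k_i}^T x_i}}{\sum_l e^{w_l^T x_i}}$

Negative log-likelihood:
$\underset{w}{\text{minimize}} \ -\log P(w) = \sum_i -w_{k_i}^T x_i + \sum_i \log\left(\sum_l e^{w_l^T x_i}\right)$

Expensive for many labels: BFGS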
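The negative log-likelihood and its gradient are all a quasi-Newton method needs. The sketch below implements both with NumPy. The array layout (one row of W per class, one row of X per sample), the function names, and the random test data are assumptions for illustration, and the max-shift inside the log-sum-exp is a standard numerical safeguard that does not change the value.

```python
import numpy as np

def softmax_nll(W, X, labels):
    """W: (classes, dim), X: (samples, dim), labels: (samples,) integer classes."""
    scores = X @ W.T                               # scores[i, l] = w_l^T x_i
    shift = scores.max(axis=1, keepdims=True)      # stabilizes the exponentials
    log_Z = shift[:, 0] + np.log(np.exp(scores - shift).sum(axis=1))
    return np.sum(log_Z - scores[np.arange(len(labels)), labels])

def softmax_nll_grad(W, X, labels):
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)              # P[i, l] = P(Y_i = l)
    P[np.arange(len(labels)), labels] -= 1.0       # subtract the one-hot labels
    return P.T @ X                                 # same shape as W

rng = np.random.default_rng(5)
num_samples, dim, num_classes = 20, 4, 3
X = rng.standard_normal((num_samples, dim))
labels = rng.integers(0, num_classes, size=num_samples)

# With all-zero weights every class has probability 1/num_classes
nll_at_zero = softmax_nll(np.zeros((num_classes, dim)), X, labels)

# Finite-difference check of the gradient along a random direction
W = 0.1 * rng.standard_normal((num_classes, dim))
D = rng.standard_normal((num_classes, dim))
h = 1e-6
fd_slope = (softmax_nll(W + h * D, X, labels) - softmax_nll(W - h * D, X, labels)) / (2 * h)
exact_slope = np.sum(softmax_nll_grad(W, X, labels) * D)
print(nll_at_zero, fd_slope, exact_slope)
```

Each evaluation touches every sample and every class, which is why the cost per gradient grows with the number of labels.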


NEURAL NETS: LOGISTIC REGRESSION ON STEROIDS

$y = \sigma(X_3 \sigma(X_2 \sigma(X_1 D)))$

[Figure: plot of the nonlinearity $\sigma$]

Backpropagation is expensive: L-BFGS


SUMMARY: COMPLEXITY

Method    Work      Memory    Convergence
Newton    O(N^3)    O(N^2)    Fastest (quadratic)
BFGS      O(N^2)    O(N^2)    Superlinear
L-BFGS    O(NM)     O(NM)     Linear/superlinear
CG        O(N)      O(N)      Linear/superlinear

(M is the L-BFGS memory parameter.)

All of these get used!
