
OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Page 1: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

OPER 627: Nonlinear Optimization
Lecture 8: BFGS and L-BFGS

Department of Statistical Sciences and Operations Research
Virginia Commonwealth University

Sept 23, 2013


Page 2: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Announcement

Mid-term, Oct. 21st. In class, 5:30pm-6:45pm

You can bring a two-sided cheat sheet (standard A4 paper)

No notes, reference books, or computers allowed

The exam will cover all material up to Oct. 16th


Page 3: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Quiz

Why is the pure Newton method not practical?

What are the concerns with the modified Newton method?

What is the idea of the quasi-Newton method? What is its typical convergence rate? What is the requirement for updating the approximation matrix Hk?

YES/NO questions:
Modified Newton method = modified Hessian + line search
The modified Newton method has quadratic local convergence
If the initial approximation of the inverse Hessian matrix H0 is PD, the BFGS update routine is followed, and the Wolfe conditions are used, then Hk is PD at every iteration k


Page 4: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Today’s Outline

Convergence of BFGS

Limited-memory BFGS for large-scale optimization

Readings: NW Sections 6.4 and 7.2


Page 5: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Global convergence

Settings:
BFGS applied to a smooth convex function
Start from an arbitrary point x0 and an arbitrary initial PD Hessian approximation H0

Use a line search method satisfying the Wolfe conditions

Assumption 1: eig(∇2f(x)) ⊂ [m, M] for all x ∈ L := {x | f(x) ≤ f(x0)}, for some constants 0 < m ≤ M

Theorem: The sequence {xk} generated by BFGS converges to the minimizer x∗ of f. (Proof: page 154 in NW)

Warning: there is no global convergence result for general nonlinear functions!


Page 6: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Local convergence

Setting: BFGS applied to a general nonlinear smooth function

Assumption 2: ∇2f(x) is Lipschitz continuous at x∗, i.e., there exists L > 0 such that ‖∇2f(x) − ∇2f(x∗)‖ ≤ L‖x − x∗‖ for all x "close to" x∗

Theorem: Suppose Assumption 2 holds, and ‖xk − x∗‖ converges to 0 "fast enough" that

∑_{k=1}^∞ ‖xk − x∗‖ < ∞

(this will hold if Assumption 1 holds); then BFGS converges superlinearly.


Page 7: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

A couple of implementation remarks

1. Always try the stepsize αk = 1 first. Why? Eventually the unit step αk = 1 will be accepted, and it yields superlinear convergence.

2. In the Wolfe conditions, use parameters c1 = 10⁻⁴ and c2 = 0.9.
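As an illustration of these choices, here is a minimal sketch of a Wolfe line search on a toy quadratic using SciPy's `line_search` (SciPy and the test problem are my assumptions, not part of the slides; the c1, c2 values shown match the slide):

```python
import numpy as np
from scipy.optimize import line_search

# Toy strongly convex quadratic (illustrative, not from the slides).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

xk = np.array([1.0, 1.0])
pk = -grad(xk)            # a quasi-Newton method would use pk = -Hk @ grad(xk)

# Wolfe line search; c1, c2 are passed explicitly to match the slide's values.
alpha, *_ = line_search(f, grad, xk, pk, c1=1e-4, c2=0.9)
print("accepted step length:", alpha)
```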

In a practical implementation, we are concerned about:
Computational complexity (number of basic computations)
Storage cost (memory)
Many other issues


Page 8: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Large scale, large concern

1. Computational cost of the Hessian matrix, or even the gradient

2. Storage of data: memory issues

3. Solving linear systems: matrix factorization

If the problem is large-scale but sparse, there are many algorithms that exploit sparsity at the level of the linear algebra!


Page 9: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

L-BFGS: limited-memory BFGS

Key idea:
Implicit approximation using just a few vectors of size n, rather than storing the fully dense n × n matrix

Use information from the most recent iterations to construct theHessian approximation

People use it in practice: IPOPT (open-source), KNITRO (commercial)


Page 10: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

A new perspective

What we really iterate:

xk+1 = xk − αkHk∇fk

BFGS update:

Hk+1 = Vkᵀ Hk Vk + ρk sk skᵀ

where ρk = 1 / (ykᵀ sk) and Vk = I − ρk yk skᵀ
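A direct NumPy transcription of this update (a sketch with placeholder data; in an actual run sk and yk come from the iterates and gradients, and the Wolfe conditions guarantee ykᵀsk > 0):

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation:
    H_{k+1} = V_k^T H_k V_k + rho_k s_k s_k^T, with rho_k = 1/(y_k^T s_k), V_k = I - rho_k y_k s_k^T."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

# Placeholder curvature pair with y^T s > 0 (here y = A s for an SPD matrix A).
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
s = np.array([1.0, -0.5, 0.2, 0.0, 1.5])
y = A @ s
H = bfgs_update(np.eye(5), s, y)
print(np.allclose(H, H.T))   # the update preserves symmetry
```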

What we really need:

rk = Hk∇fk

Given H0, Hk+1 is determined by:

{st}, {yt}, ∀ t = 1, 2, ..., k

Solution: keep the pairs {sk, yk} (curvature information) from the m most recent iterations

m typically small, e.g., 3 to 20


Page 11: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

How to use {sk, yk} to represent the full matrix Hk+1

See notes!


Page 12: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

A two-loop recursion

Algorithm 1: Two-loop recursion with memory m
1: q ← ∇fk
2: for i = k − 1, k − 2, ..., k − m do
3:   αi ← ρi siᵀ q
4:   q ← q − αi yi
5: end for
6: r ← H̃0k q
7: for i = k − m, k − m + 1, ..., k − 1 do
8:   β ← ρi yiᵀ r
9:   r ← r + si (αi − β)
10: end for

Obtain rk = Hk∇fk after the recursion
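A minimal NumPy sketch of this recursion (the names `s_list`, `y_list`, holding the stored pairs oldest first, and `gamma` for the scaling H̃0k = γk I are mine, not from the slides):

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list, gamma=1.0):
    """Return r = Hk @ grad using only the stored (s, y) pairs and H~0_k = gamma * I."""
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = grad.copy()
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    r = gamma * q                       # r <- H~0_k q
    # Second loop: oldest pair to newest (alphas were stored newest-first).
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r

# Smoke test with a single stored pair (placeholder values).
s = np.array([1.0, 0.0]); y = np.array([2.0, 0.5]); g = np.array([0.3, -0.4])
print(two_loop_recursion(g, [s], [y], gamma=(s @ y) / (y @ y)))
```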


Page 13: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Motivation: save space

1. r: a vector of length n; αi (i = k − m, ..., k − 1): m scalars; β: a scalar

2. {yi} and {si}, i = k − m, ..., k − 1: 2m vectors of length n

In total: 2mn + m + n + 1 numbers!

How to choose H̃0k?

H̃0k = γk I

where γk := (sk−mᵀ yk−m) / (yk−mᵀ yk−m) approximates eig([∇2f(xk)]⁻¹)
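To make the saving concrete, a back-of-the-envelope comparison (the values n = 10⁶ and m = 10 are illustrative, not from the slides):

```python
n = 10**6                         # problem dimension (illustrative)
m = 10                            # L-BFGS memory (illustrative)

dense_bfgs = n * n                # storing a dense n x n inverse Hessian approximation
lbfgs = 2 * m * n + m + n + 1     # count from this slide

bytes_per_double = 8
print(f"dense BFGS: {dense_bfgs * bytes_per_double / 1e12:.0f} TB")  # about 8 TB
print(f"L-BFGS:     {lbfgs * bytes_per_double / 1e6:.0f} MB")        # about 168 MB
```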


Page 14: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

L-BFGS algorithm

Algorithm 2: L-BFGS with memory m
1: Choose x0, integer m > 0 (typically between 3 and 20), set k = 0
2: repeat
3:   Choose H̃0k = γk I
4:   Compute pk = −Hk ∇fk using the two-loop recursion
5:   Update the iterate xk+1 = xk + αk pk, with αk satisfying the Wolfe conditions
6:   if k > m then
7:     Discard {sk−m, yk−m} from storage
8:   end if
9:   Compute and store sk = xk+1 − xk, yk = ∇f(xk+1) − ∇f(xk)
10:  k = k + 1
11: until ‖∇f(xk)‖ < ε
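This is essentially the algorithm behind off-the-shelf limited-memory solvers; as a usage sketch, SciPy's L-BFGS-B routine exposes the memory m as `maxcor` (SciPy and the Rosenbrock test function are my choices, not mentioned in the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock test function and its gradient (illustrative problem).
def f(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

x0 = np.array([-1.2, 1.0])
res = minimize(f, x0, jac=grad, method="L-BFGS-B",
               options={"maxcor": 10,    # memory m: number of stored (s, y) pairs
                        "gtol": 1e-8})   # gradient-based stopping tolerance
print(res.x, res.nit)
```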


Page 15: OPER 627: Nonlinear Optimization Lecture 8: BFGS and L-BFGS

Summary of Quasi-Newton methods

THE first-order method in large-scale optimization

1. The Newton method using the exact Hessian is impractical

2. Generally better than other first-order methods

3. Weakness: still slow convergence on ill-conditioned problems, e.g., when the Hessian has a wide range of eigenvalues

Nonlinear conjugate gradient methods will do better in this case!
