OPER 627: Nonlinear Optimization
Lecture 8: BFGS and L-BFGS
Department of Statistical Sciences and Operations Research
Virginia Commonwealth University
Sept 23, 2013
Announcement
Mid-term, Oct. 21st. In class, 5:30pm-6:45pm
You can bring a two-sided cheat sheet (standard A4 paper)
No notes, reference books, or computers allowed
The exam will cover all the materials up to Oct. 16th
Quiz
Why is the pure Newton method not practically useful?
What are the concerns for the modified Newton method?
What is the idea of the quasi-Newton method? What is the typical convergence rate? What is the requirement for updating the approximation matrix $H_k$?
YES/NO questions:
Modified Newton method = modified Hessian + line search
The modified Newton method has quadratic local convergence
If the initial approximation of the inverse Hessian matrix $H_0$ is PD, the BFGS update routine is followed, and the Wolfe conditions are used, then $H_k$ is PD at every iteration $k$
Today’s Outline
Convergence of BFGS
Limit-memory BFGS for large-scale optimization
Readings: NW Chapter 6.4, Chapter 7.2
Global convergence
Settings:
BFGS applied to a smooth convex function
Start from an arbitrary point $x_0$ and an arbitrary initial PD Hessian approximation $H_0$
Use a line search method satisfying the Wolfe conditions
Assumption 1: $\mathrm{eig}(\nabla^2 f(x)) \subseteq [m, M]$ for all $x \in \mathcal{L} := \{x \mid f(x) \le f(x_0)\}$
Theorem: The sequence $\{x_k\}$ generated by BFGS converges to the minimizer $x^*$ of $f$. (Proof: page 154 in NW)
Warning: there is no global convergence result for general nonlinear functions!
Local convergence
Setting: BFGS applied to a general nonlinear smooth function
Assumption 2: $\nabla^2 f(x)$ is Lipschitz continuous at $x^*$, i.e., there exists $L$ such that $\|\nabla^2 f(x) - \nabla^2 f(x^*)\| \le L \|x - x^*\|$ for all $x$ "close to" $x^*$
Theorem: Suppose Assumption 2 holds, and $\|x_k - x^*\|$ converges to 0 "fast enough" that
$$\sum_{k=1}^{\infty} \|x_k - x^*\| < \infty$$
(this will hold if Assumption 1 holds); then BFGS converges superlinearly.
A couple of implementation remarks
1 Always try stepsize $\alpha_k = 1$ first. Why? Eventually $\alpha_k = 1$ will satisfy the conditions and yield superlinear convergence
2 In the Wolfe conditions, the parameters are $c_1 = 10^{-4}$, $c_2 = 0.9$
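As a concrete illustration of remark 2, here is a minimal Python sketch (not from the slides; the function name wolfe_satisfied and the callables f and grad are assumptions for this example) of checking the two Wolfe conditions for a trial step $\alpha$ along a direction $p$:

import numpy as np

def wolfe_satisfied(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    # Sufficient decrease (Armijo):  f(x + a p) <= f(x) + c1 * a * grad(x)^T p
    # Curvature condition:           grad(x + a p)^T p >= c2 * grad(x)^T p
    slope = grad(x) @ p                      # directional derivative; < 0 for a descent direction
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * slope
    curvature = grad(x + alpha * p) @ p >= c2 * slope
    return armijo and curvature

In a BFGS implementation one would first test alpha = 1 and fall back to a bracketing/zoom procedure when the check fails.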
In a general practical implementation, we are concerned about:
Computational complexity (# of basic computations)
Storage cost (memory)
Many others
Large scale, large concern
1 Computational cost of the Hessian matrix, or even the gradient
2 Storage of data: memory issue
3 Solving linear systems: matrix factorization
If the problem is large scale but sparse, many algorithms can exploit sparsity at the level of the linear algebra!
L-BFGS: limited-memory BFGS
Key idea:
Implicit approximation using just a few vectors of size $n$
Rather than storing the fully dense $n \times n$ matrix
Use information from the most recent iterations to construct the Hessian approximation
Used in practice: IPOPT (open-source), KNITRO (commercial)
A new perspective
What we really iterate:
$$x_{k+1} = x_k - \alpha_k H_k \nabla f_k$$
BFGS update:
$$H_{k+1} = V_k^\top H_k V_k + \rho_k s_k s_k^\top$$
where
$$\rho_k = \frac{1}{y_k^\top s_k}, \qquad V_k = I - \rho_k y_k s_k^\top$$
What we really need:
$$r_k = H_k \nabla f_k$$
Given $H_0$, $H_{k+1}$ is determined by:
$$\{s_t\}, \{y_t\}, \quad \forall t = 1, 2, \ldots, k$$
Solution: keep the $\{s_k, y_k\}$ information (curvature information) for the $m$ most recent iterations
m typically small, e.g., 3 to 20
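For contrast, here is a minimal numpy sketch (an illustration, not from the slides; the function name bfgs_inverse_update is hypothetical) of performing the inverse-Hessian update explicitly, which is exactly what L-BFGS avoids:

import numpy as np

def bfgs_inverse_update(H, s, y):
    # Dense BFGS update of the inverse-Hessian approximation:
    #   H_new = V^T H V + rho * s s^T,  with rho = 1/(y^T s), V = I - rho * y s^T
    # Requires O(n^2) storage and work per iteration; L-BFGS instead computes
    # only r_k = H_k * grad f_k from a few stored (s, y) vectors.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)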
How to use $\{s_k, y_k\}$ to represent the full matrix $H_{k+1}$
See notes!
A two-loop recursion
Algorithm 1: Two-loop recursion with memory $m$
1: $q \leftarrow \nabla f_k$
2: for $i = k-1$ down to $k-m$ do
3:   $\alpha_i \leftarrow \rho_i s_i^\top q$
4:   $q \leftarrow q - \alpha_i y_i$
5: end for
6: $r \leftarrow \tilde H_k^0 q$
7: for $i = k-m$ to $k-1$ do
8:   $\beta \leftarrow \rho_i y_i^\top r$
9:   $r \leftarrow r + s_i (\alpha_i - \beta)$
10: end for
Obtain $r_k = H_k \nabla f_k$ after the recursion
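A minimal numpy sketch of the two-loop recursion (assuming the m most recent pairs are stored oldest-first; the function name two_loop_recursion and the argument gamma for the scaling of $\tilde H_k^0$ are choices made for this illustration):

import numpy as np

def two_loop_recursion(grad, s_list, y_list, gamma):
    # Computes r = H_k * grad implicitly from the stored (s, y) pairs,
    # with the initial matrix H_k^0 = gamma * I.
    q = grad.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):      # first loop: newest pair to oldest
        alpha[i] = rho[i] * (s_list[i] @ q)
        q -= alpha[i] * y_list[i]
    r = gamma * q                               # apply H_k^0 = gamma * I
    for i in range(len(s_list)):                # second loop: oldest pair to newest
        beta = rho[i] * (y_list[i] @ r)
        r += (alpha[i] - beta) * s_list[i]
    return r                                    # r = H_k * grad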
Motivation: save space
1 $r$: a vector of length $n$; the $\alpha_i$: $m$ scalars; $\beta$: a scalar
2 $\{y_i\}_{i=k-m}^{k-1}$, $\{s_i\}_{i=k-m}^{k-1}$: $2m$ vectors of length $n$
In total: $2mn + m + n + 1$ numbers!
How to choose $\tilde H_k^0$?
$$\tilde H_k^0 = \gamma_k I, \qquad \text{where } \gamma_k := \frac{s_{k-m}^\top y_{k-m}}{y_{k-m}^\top y_{k-m}}$$
which approximates an eigenvalue of $[\nabla^2 f(x_k)]^{-1}$
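For a sense of scale (an illustrative calculation, not from the slides): with $n = 10^6$ and $m = 10$, the storage is $2mn + m + n + 1 = 2 \cdot 10 \cdot 10^6 + 10 + 10^6 + 1 \approx 2.1 \times 10^7$ numbers, versus $n^2 = 10^{12}$ for a dense $n \times n$ approximation.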
L-BFGS algorithm
Algorithm 2: L-BFGS with memory $m$
1: Choose $x_0$, integer $m > 0$ (typically 3, 5, 17), $k = 0$
2: repeat
3:   Choose $\tilde H_k^0 = \gamma_k I$
4:   Compute $p_k = -H_k \nabla f_k$ using the two-loop recursion
5:   Update the iterate $x_{k+1} = x_k + \alpha_k p_k$ with $\alpha_k$ satisfying the Wolfe conditions
6:   if $k > m$ then
7:     Discard $\{s_{k-m}, y_{k-m}\}$ from storage
8:   end if
9:   Compute and store: $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
10:  $k = k + 1$
11: until $\|\nabla f(x_k)\| < \epsilon$
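A minimal Python sketch of this loop (an illustration, not a definitive implementation), reusing the two_loop_recursion function sketched earlier and scipy.optimize.line_search for the Wolfe line search with $c_1 = 10^{-4}$, $c_2 = 0.9$; the driver name lbfgs, the crude fallback step, and the choice of $\gamma_k$ from the most recent pair (rather than the oldest stored pair as on the previous slide) are assumptions made for this sketch:

import numpy as np
from scipy.optimize import line_search   # strong-Wolfe line search, defaults c1=1e-4, c2=0.9

def lbfgs(f, grad, x0, m=10, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s_list, y_list = [], []               # the m most recent curvature pairs, oldest first
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:       # stopping test ||grad f(x_k)|| < eps
            break
        if s_list:                        # scaling gamma_k for H_k^0 = gamma_k * I
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0
        p = -two_loop_recursion(g, s_list, y_list, gamma)   # p_k = -H_k grad f_k
        alpha = line_search(f, grad, x, p, gfk=g)[0]        # alpha = 1 is tried first
        if alpha is None:
            alpha = 1e-3                  # crude fallback for this sketch
        x_new = x + alpha * p
        g_new = grad(x_new)
        s_list.append(x_new - x)          # s_k = x_{k+1} - x_k
        y_list.append(g_new - g)          # y_k = grad f(x_{k+1}) - grad f(x_k)
        if len(s_list) > m:               # discard the oldest pair from storage
            s_list.pop(0); y_list.pop(0)
        x, g = x_new, g_new
    return x

For a quick check one could call, e.g., lbfgs(scipy.optimize.rosen, scipy.optimize.rosen_der, np.zeros(100)).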
Summary of Quasi-Newton methods
THE first-order method in large-scale optimization
1 Newton method using the exact Hessian is impractical
2 Better than other first-order methods in general
3 Weakness: still slow convergence on ill-conditioned problems, e.g., when the Hessian has a wide range of eigenvalues
Nonlinear conjugate gradient methods will do better in this case!