OPER 627: Nonlinear Optimization
Lecture 8: BFGS and L-BFGS
Department of Statistical Sciences and Operations Research
Virginia Commonwealth University
Sept 23, 2013
Announcement
Mid-term, Oct. 21st. In class, 5:30pm-6:45pm
You can bring a two-sided cheat sheet (standard A4 paper)
No notes, reference books, or computers allowed
The exam will cover all the materials up to Oct. 16th
Quiz
Why is the pure Newton method not practically useful?
What are the concerns for the modified Newton method?
What is the idea of the quasi-Newton method? What is the typical convergence rate? What is the requirement for updating the approximation matrix $H_k$?
YES/NO questions:
Modified Newton method = modified Hessian + line search
The modified Newton method has quadratic local convergence
If the initial approximation of the inverse Hessian matrix $H_0$ is PD, the BFGS update routine is followed, and the Wolfe conditions are used, then $H_k$ is PD at every iteration $k$
Today’s Outline
Convergence of BFGS
Limit-memory BFGS for large-scale optimization
Readings: NW Chapter 6.4, Chapter 7.2
Global convergence
Settings:
BFGS applied to a smooth convex function
Start from an arbitrary point $x_0$ and an arbitrary initial PD Hessian approximation $H_0$
Use a line search method satisfying the Wolfe conditions
Assumption 1: $\mathrm{eig}(\nabla^2 f(x)) \subseteq [m, M]$ for all $x \in \mathcal{L} := \{x \mid f(x) \le f(x_0)\}$
Theorem: The sequence $\{x_k\}$ generated by BFGS converges to the minimizer $x^*$ of $f$. (Proof: page 154 in NW)
Warning: there is no global convergence result for general nonlinear functions!
Local convergence
Setting: BFGS applied to a general nonlinear smooth function
Assumption 2: $\nabla^2 f(x)$ is Lipschitz continuous at $x^*$, i.e., there exists $L$ such that $\|\nabla^2 f(x) - \nabla^2 f(x^*)\| \le L \|x - x^*\|$ for all $x$ "close to" $x^*$
Theorem: Suppose Assumption 2 holds, and $\|x_k - x^*\|$ converges to 0 "fast enough" that
$$\sum_{k=1}^{\infty} \|x_k - x^*\| < \infty$$
(this will hold if Assumption 1 holds); then BFGS converges superlinearly.
A couple of implementation remarks
1 Always try stepsize $\alpha_k = 1$ first. Why? Eventually $\alpha_k = 1$ will satisfy the conditions and yield superlinear convergence
2 In the Wolfe conditions, the parameters are $c_1 = 10^{-4}$, $c_2 = 0.9$
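As a concrete illustration of remark 2, here is a minimal Python sketch (not from the slides; the function name wolfe_satisfied and the callables f and grad are assumptions for this example) of checking the two Wolfe conditions for a trial step $\alpha$ along a direction $p$:

import numpy as np

def wolfe_satisfied(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    # Sufficient decrease (Armijo):  f(x + a p) <= f(x) + c1 * a * grad(x)^T p
    # Curvature condition:           grad(x + a p)^T p >= c2 * grad(x)^T p
    slope = grad(x) @ p                      # directional derivative; < 0 for a descent direction
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * slope
    curvature = grad(x + alpha * p) @ p >= c2 * slope
    return armijo and curvature

In a BFGS implementation one would first test alpha = 1 and fall back to a bracketing/zoom procedure when the check fails.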
In a general practical implementation, we are concerned about:
Computational complexity (# of basic computations)
Storage cost (memory)
Many others
Large scale, large concern
1 Computational cost of the Hessian matrix, or even the gradient
2 Storage of data: memory issue
3 Solving linear systems: matrix factorization
If the problem is large scale but sparse, many algorithms can exploit sparsity at the level of the linear algebra!
L-BFGS: limited-memory BFGS
Key idea:
Implicit approximation using just a few vectors of size $n$
Rather than storing the fully dense $n \times n$ matrix
Use information from the most recent iterations to construct the Hessian approximation
Used in practice: IPOPT (open-source), KNITRO (commercial)
A new perspective
What we really iterate:
$$x_{k+1} = x_k - \alpha_k H_k \nabla f_k$$
BFGS update:
$$H_{k+1} = V_k^\top H_k V_k + \rho_k s_k s_k^\top$$
where
$$\rho_k = \frac{1}{y_k^\top s_k}, \qquad V_k = I - \rho_k y_k s_k^\top$$
What we really need:
$$r_k = H_k \nabla f_k$$
Given $H_0$, $H_{k+1}$ is determined by:
$$\{s_t\}, \{y_t\}, \quad \forall t = 1, 2, \ldots, k$$
Solution: keep the $\{s_k, y_k\}$ information (curvature information) for the $m$ most recent iterations
m typically small, e.g., 3 to 20
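For contrast, here is a minimal numpy sketch (an illustration, not from the slides; the function name bfgs_inverse_update is hypothetical) of performing the inverse-Hessian update explicitly, which is exactly what L-BFGS avoids:

import numpy as np

def bfgs_inverse_update(H, s, y):
    # Dense BFGS update of the inverse-Hessian approximation:
    #   H_new = V^T H V + rho * s s^T,  with rho = 1/(y^T s), V = I - rho * y s^T
    # Requires O(n^2) storage and work per iteration; L-BFGS instead computes
    # only r_k = H_k * grad f_k from a few stored (s, y) vectors.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)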
How to use $\{s_k, y_k\}$ to represent the full matrix $H_{k+1}$
See notes!
A two-loop recursion
Algorithm 1: Two-loop recursion with memory $m$
1: $q \leftarrow \nabla f_k$
2: for $i = k-1$ down to $k-m$ do
3:   $\alpha_i \leftarrow \rho_i s_i^\top q$
4:   $q \leftarrow q - \alpha_i y_i$
5: end for
6: $r \leftarrow \tilde H_k^0 q$
7: for $i = k-m$ to $k-1$ do
8:   $\beta \leftarrow \rho_i y_i^\top r$
9:   $r \leftarrow r + s_i (\alpha_i - \beta)$
10: end for
Obtain $r_k = H_k \nabla f_k$ after the recursion
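A minimal numpy sketch of the two-loop recursion (assuming the m most recent pairs are stored oldest-first; the function name two_loop_recursion and the argument gamma for the scaling of $\tilde H_k^0$ are choices made for this illustration):

import numpy as np

def two_loop_recursion(grad, s_list, y_list, gamma):
    # Computes r = H_k * grad implicitly from the stored (s, y) pairs,
    # with the initial matrix H_k^0 = gamma * I.
    q = grad.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):      # first loop: newest pair to oldest
        alpha[i] = rho[i] * (s_list[i] @ q)
        q -= alpha[i] * y_list[i]
    r = gamma * q                               # apply H_k^0 = gamma * I
    for i in range(len(s_list)):                # second loop: oldest pair to newest
        beta = rho[i] * (y_list[i] @ r)
        r += (alpha[i] - beta) * s_list[i]
    return r                                    # r = H_k * grad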
Motivation: save space
1 $r$: a vector of length $n$; the $\alpha_i$: $m$ scalars; $\beta$: a scalar
2 $\{y_i\}_{i=k-m}^{k-1}$, $\{s_i\}_{i=k-m}^{k-1}$: $2m$ vectors of length $n$
In total: $2mn + m + n + 1$ numbers!
How to choose $\tilde H_k^0$?
$$\tilde H_k^0 = \gamma_k I, \qquad \text{where } \gamma_k := \frac{s_{k-m}^\top y_{k-m}}{y_{k-m}^\top y_{k-m}}$$
which approximates an eigenvalue of $[\nabla^2 f(x_k)]^{-1}$
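For a sense of scale (an illustrative calculation, not from the slides): with $n = 10^6$ and $m = 10$, the storage is $2mn + m + n + 1 = 2 \cdot 10 \cdot 10^6 + 10 + 10^6 + 1 \approx 2.1 \times 10^7$ numbers, versus $n^2 = 10^{12}$ for a dense $n \times n$ approximation.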
L-BFGS algorithm
Algorithm 2: L-BFGS with memory $m$
1: Choose $x_0$, integer $m > 0$ (typically 3, 5, 17), $k = 0$
2: repeat
3:   Choose $\tilde H_k^0 = \gamma_k I$
4:   Compute $p_k = -H_k \nabla f_k$ using the two-loop recursion
5:   Update the iterate $x_{k+1} = x_k + \alpha_k p_k$ with $\alpha_k$ satisfying the Wolfe conditions
6:   if $k > m$ then
7:     Discard $\{s_{k-m}, y_{k-m}\}$ from storage
8:   end if
9:   Compute and store: $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
10:  $k = k + 1$
11: until $\|\nabla f(x_k)\| < \epsilon$
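A minimal Python sketch of this loop (an illustration, not a definitive implementation), reusing the two_loop_recursion function sketched earlier and scipy.optimize.line_search for the Wolfe line search with $c_1 = 10^{-4}$, $c_2 = 0.9$; the driver name lbfgs, the crude fallback step, and the choice of $\gamma_k$ from the most recent pair (rather than the oldest stored pair as on the previous slide) are assumptions made for this sketch:

import numpy as np
from scipy.optimize import line_search   # strong-Wolfe line search, defaults c1=1e-4, c2=0.9

def lbfgs(f, grad, x0, m=10, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s_list, y_list = [], []               # the m most recent curvature pairs, oldest first
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:       # stopping test ||grad f(x_k)|| < eps
            break
        if s_list:                        # scaling gamma_k for H_k^0 = gamma_k * I
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0
        p = -two_loop_recursion(g, s_list, y_list, gamma)   # p_k = -H_k grad f_k
        alpha = line_search(f, grad, x, p, gfk=g)[0]        # alpha = 1 is tried first
        if alpha is None:
            alpha = 1e-3                  # crude fallback for this sketch
        x_new = x + alpha * p
        g_new = grad(x_new)
        s_list.append(x_new - x)          # s_k = x_{k+1} - x_k
        y_list.append(g_new - g)          # y_k = grad f(x_{k+1}) - grad f(x_k)
        if len(s_list) > m:               # discard the oldest pair from storage
            s_list.pop(0); y_list.pop(0)
        x, g = x_new, g_new
    return x

For a quick check one could call, e.g., lbfgs(scipy.optimize.rosen, scipy.optimize.rosen_der, np.zeros(100)).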
Summary of Quasi-Newton methods
THE first-order method in large-scale optimization
1 Newton method using the exact Hessian is impractical
2 Better than other first-order methods in general
3 Weakness: still slow convergence on ill-conditioned problems, e.g., when the Hessian has a wide range of eigenvalues
Nonlinear conjugate gradient methods will do better in this case!