AMS527: Numerical Analysis II
Supplementary Material on Numerical Optimization
Xiangmin Jiao
SUNY Stony Brook
Outline
1 BFGS Method
2 Conjugate Gradient Methods
3 Constrained Optimization
BFGS Method
- The BFGS method is one of the most effective secant updating methods for minimization
- Named after Broyden, Fletcher, Goldfarb, and Shanno
- Unlike Broyden's method, BFGS preserves the symmetry of the approximate Hessian matrix
- In addition, BFGS preserves the positive definiteness of the approximate Hessian matrix
- Reference: J. Nocedal and S. J. Wright, Numerical Optimization, 2nd edition, Springer, 2006, Section 6.1
Algorithm
x_0 = initial guess
B_0 = initial Hessian approximation
for k = 0, 1, 2, ...
    Solve B_k s_k = −∇f(x_k) for s_k
    x_{k+1} = x_k + s_k
    y_k = ∇f(x_{k+1}) − ∇f(x_k)
    B_{k+1} = B_k + (y_k y_k^T)/(y_k^T s_k) − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k)
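A minimal Python sketch of this iteration (the function name and interface are hypothetical; it assumes a gradient callable grad_f, uses the identity as the initial Hessian approximation, and omits the line search and safeguards a practical implementation would need):

```python
import numpy as np

def bfgs_minimize(grad_f, x0, max_iter=100, tol=1e-8):
    """Minimal BFGS sketch: secant update of the approximate Hessian B_k."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                 # B_0 = initial Hessian approximation
    g = grad_f(x)
    for _ in range(max_iter):
        s = np.linalg.solve(B, -g)     # solve B_k s_k = -grad f(x_k)
        x_new = x + s
        g_new = grad_f(x_new)
        y = g_new - g                  # y_k = grad f(x_{k+1}) - grad f(x_k)
        ys = y @ s
        if ys > 1e-12:                 # keep B positive definite (s_k^T y_k > 0)
            Bs = B @ s
            B += np.outer(y, y) / ys - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x
```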
Motivation of BFGS
- Let s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k)
- The matrix B_{k+1} should satisfy the secant equation
      B_{k+1} s_k = y_k
- In addition, B_{k+1} should be positive definite, which requires s_k^T y_k > 0
- There are infinitely many matrices B_{k+1} that satisfy the secant equation
- Davidon (1950s) proposed choosing B_{k+1} to be closest to B_k, i.e.,
      min_B ‖B − B_k‖  subject to  B = B^T, B s_k = y_k
- BFGS instead chooses B_{k+1} so that B_{k+1}^{-1} is closest to B_k^{-1}, i.e.,
      min_B ‖B^{-1} − B_k^{-1}‖  subject to  B = B^T, B s_k = y_k
Properties of BFGS
- BFGS normally has a superlinear convergence rate, even though the approximate Hessian does not necessarily converge to the true Hessian
- The approximate Hessian preserves positive definiteness
  - Key idea of proof: let H_k denote B_k^{-1}. For any vector z ≠ 0, let w = z − ρ_k y_k (s_k^T z), where ρ_k > 0. Then it can be shown that
        z^T H_{k+1} z = w^T H_k w + ρ_k (s_k^T z)^2 ≥ 0.
    If s_k^T z ≠ 0, the second term is positive; if s_k^T z = 0, then w = z ≠ 0, so the first term is positive. Either way, z^T H_{k+1} z > 0 (see the identity below)
- A line search can be used to enhance the effectiveness of BFGS. If an exact line search is performed at each iteration, BFGS terminates at the exact solution in at most n iterations for a quadratic objective function
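For reference, the proof sketch above relies on the standard inverse-Hessian form of the BFGS update with ρ_k = 1/(y_k^T s_k) (Nocedal & Wright, Sec. 6.1):

    H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,

so with w = z − ρ_k y_k (s_k^T z), expanding z^T H_{k+1} z gives exactly w^T H_k w + ρ_k (s_k^T z)^2.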
Outline
1 BFGS Method
2 Conjugate Gradient Methods
3 Constrained Optimization
Motivation of Conjugate Gradients
- Conjugate gradient (CG) can be used to solve a linear system Ax = b, where A is symmetric positive definite (SPD)
- If A is m × m SPD, then the quadratic function
      ϕ(x) = (1/2) x^T A x − x^T b
  has a unique minimum
- The negative gradient of this function is the residual vector,
      −∇ϕ(x) = b − Ax = r,
  so the minimum is attained precisely when Ax = b
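As a one-line check: since A is symmetric, ∇ϕ(x) = (1/2)(A + A^T) x − b = Ax − b, which gives −∇ϕ(x) = b − Ax = r.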
Search Direction in Conjugate Gradients
- Optimization methods have the form
      x_{n+1} = x_n + α_n p_n,
  where p_n is the search direction and α_n is the step length, chosen to minimize ϕ(x_n + α_n p_n)
- The line-search parameter can be determined analytically as α_n = r_n^T p_n / (p_n^T A p_n)
- In CG, p_n is chosen to be A-conjugate (or A-orthogonal) to the previous search directions, i.e., p_n^T A p_j = 0 for j < n
Optimality of Step Length
- Select the step length α_n along the direction p_{n−1} to minimize ϕ(x) = (1/2) x^T A x − x^T b
- Let x_n = x_{n−1} + α_n p_{n−1}. Then
      ϕ(x_n) = (1/2)(x_{n−1} + α_n p_{n−1})^T A (x_{n−1} + α_n p_{n−1}) − (x_{n−1} + α_n p_{n−1})^T b
             = (1/2) α_n^2 p_{n−1}^T A p_{n−1} + α_n p_{n−1}^T A x_{n−1} − α_n p_{n−1}^T b + constant
             = (1/2) α_n^2 p_{n−1}^T A p_{n−1} − α_n p_{n−1}^T r_{n−1} + constant
- Therefore,
      dϕ/dα_n = 0  ⇒  α_n p_{n−1}^T A p_{n−1} − p_{n−1}^T r_{n−1} = 0  ⇒  α_n = p_{n−1}^T r_{n−1} / (p_{n−1}^T A p_{n−1})
- In addition, p_{n−1}^T r_{n−1} = r_{n−1}^T r_{n−1} because p_{n−1} = r_{n−1} + β_{n−1} p_{n−2} and r_{n−1}^T p_{n−2} = 0 due to the following theorem
Conjugate Gradient Method
Algorithm: Conjugate Gradient Method
x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, ...
    α_n = (r_{n−1}^T r_{n−1}) / (p_{n−1}^T A p_{n−1})    [step length]
    x_n = x_{n−1} + α_n p_{n−1}                          [approximate solution]
    r_n = r_{n−1} − α_n A p_{n−1}                        [residual]
    β_n = (r_n^T r_n) / (r_{n−1}^T r_{n−1})              [improvement this step]
    p_n = r_n + β_n p_{n−1}                              [search direction]

- Only one matrix-vector product A p_{n−1} per iteration
- Apart from the matrix-vector product, the number of operations per iteration is O(m)
- CG can be viewed as minimizing the quadratic function ϕ(x) = (1/2) x^T A x − x^T b by modifying steepest descent
- First proposed by Hestenes and Stiefel in the 1950s
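A minimal dense-NumPy sketch of the algorithm above (the function name is ours; illustration only, with no preconditioning):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Linear CG for SPD A: follows the algorithm above line by line."""
    m = len(b)
    max_iter = max_iter or m
    x = np.zeros(m)
    r = b.copy()                      # r_0 = b - A x_0 with x_0 = 0
    p = r.copy()
    rr_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                    # only matrix-vector product per iteration
        alpha = rr_old / (p @ Ap)     # step length
        x += alpha * p                # approximate solution
        r -= alpha * Ap               # residual update
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr_old        # improvement this step
        p = r + beta * p              # new search direction
        rr_old = rr_new
    return x
```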
An Alternative Interpretation of CG
Algorithm: CG
x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, ...
    α_n = r_{n−1}^T r_{n−1} / (p_{n−1}^T A p_{n−1})
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = r_{n−1} − α_n A p_{n−1}
    β_n = r_n^T r_n / (r_{n−1}^T r_{n−1})
    p_n = r_n + β_n p_{n−1}

Algorithm: A non-standard CG
x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, ...
    α_n = r_{n−1}^T p_{n−1} / (p_{n−1}^T A p_{n−1})
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = b − A x_n
    β_n = −r_n^T A p_{n−1} / (p_{n−1}^T A p_{n−1})
    p_n = r_n + β_n p_{n−1}

- The non-standard version is less efficient but easier to understand
- It is easy to see that r_n = r_{n−1} − α_n A p_{n−1} = b − A x_n
Comparison of Linear and Nonlinear CG
Algorithm: Linear CG
x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, ...
    α_n = r_{n−1}^T r_{n−1} / (p_{n−1}^T A p_{n−1})
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = r_{n−1} − α_n A p_{n−1}
    β_n = r_n^T r_n / (r_{n−1}^T r_{n−1})
    p_n = r_n + β_n p_{n−1}

Algorithm: Nonlinear CG
x_0 = initial guess, g_0 = ∇f(x_0), s_0 = −g_0
for k = 0, 1, 2, ...
    Choose α_k to minimize f(x_k + α_k s_k)
    x_{k+1} = x_k + α_k s_k
    g_{k+1} = ∇f(x_{k+1})
    β_{k+1} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k)
    s_{k+1} = −g_{k+1} + β_{k+1} s_k

- The formula β_{k+1} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k) is due to Fletcher and Reeves (1964)
- An alternative formula β_{k+1} = (g_{k+1} − g_k)^T g_{k+1} / (g_k^T g_k) is due to Polak and Ribière (1969)
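A minimal Python sketch of nonlinear CG with the Fletcher-Reeves β (the function name is ours; here the line search is a simple 1-D minimization via scipy.optimize.minimize_scalar, without restarts or strong Wolfe conditions a robust implementation would use):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad_f, x0, max_iter=200, tol=1e-8):
    """Nonlinear CG (Fletcher-Reeves) sketch with a naive 1-D line search."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    s = -g
    for _ in range(max_iter):
        # choose alpha_k to (approximately) minimize f(x_k + alpha * s_k)
        alpha = minimize_scalar(lambda a: f(x + a * s)).x
        x = x + alpha * s
        g_new = grad_f(x)
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)        # Fletcher-Reeves
        # beta = (g_new - g) @ g_new / (g @ g)  # Polak-Ribiere alternative
        s = -g_new + beta * s
        g = g_new
    return x
```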
Properties of Conjugate Gradients
The Krylov subspace for Ax = b is K_n = ⟨b, Ab, ..., A^{n−1}b⟩.

Theorem
If r_{n−1} ≠ 0, the spaces spanned by the approximate solutions x_n, the search directions p_n, and the residuals r_n are all equal to the Krylov subspace
    K_n = ⟨x_1, x_2, ..., x_n⟩ = ⟨p_0, p_1, ..., p_{n−1}⟩ = ⟨r_0, r_1, ..., r_{n−1}⟩ = ⟨b, Ab, ..., A^{n−1}b⟩.
The residuals are orthogonal (i.e., r_n^T r_j = 0 for j < n) and the search directions are A-conjugate (i.e., p_n^T A p_j = 0 for j < n).

Theorem
If r_{n−1} ≠ 0, then the error e_n = x_* − x_n is minimized in the A-norm over K_n.

Because K_n grows monotonically, the error decreases monotonically.
Rate of Convergence
- Some important convergence results
  - If A has n distinct eigenvalues, CG converges in at most n steps
  - If A has 2-norm condition number κ, the errors satisfy
        ‖e_n‖_A / ‖e_0‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^n,
    which is ≈ 2 (1 − 2/√κ)^n as κ → ∞. So convergence is expected in O(√κ) iterations
- In general, CG performs well when the eigenvalues of A are clustered
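As a quick illustration of the bound: for κ = 100 the per-iteration factor is (√κ − 1)/(√κ + 1) = 9/11 ≈ 0.82, so the error bound drops by an order of magnitude roughly every ln(10)/ln(11/9) ≈ 12 iterations, consistent with the O(√κ) estimate.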
Outline
1 BFGS Method
2 Conjugate Gradient Methods
3 Constrained Optimization
Equality-Constrained Minimization
- The equality-constrained problem has the form
      min_{x ∈ R^n} f(x)  subject to  g(x) = 0,
  where the objective function is f : R^n → R and the constraints are g : R^n → R^m, with m ≤ n
- A necessary condition for a feasible point x_* to be a solution is that the negative gradient of f lie in the space spanned by the constraint normals, i.e.,
      −∇f(x_*) = J_g^T(x_*) λ,
  where J_g is the Jacobian matrix of g and λ is the vector of Lagrange multipliers
- Therefore, a constrained local minimum must be a critical point of the Lagrangian function
      L(x, λ) = f(x) + λ^T g(x)
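A small worked example (not from the slides): minimize f(x) = x_1 + x_2 subject to g(x) = x_1^2 + x_2^2 − 2 = 0. Here ∇f = (1, 1)^T and J_g(x) = [2x_1  2x_2], so −∇f(x_*) = J_g^T(x_*) λ gives x_1 = x_2 = −1/(2λ); the constraint then forces λ = ±1/2, and λ = 1/2 yields the constrained minimum x_* = (−1, −1)^T with f(x_*) = −2.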
First-Order and Second-Order Optimality Conditions
- Equality-constrained minimization can be reduced to solving
      ∇L(x, λ) = [ ∇f(x) + J_g^T(x) λ ] = 0,
                 [        g(x)        ]
  which is known as the Karush-Kuhn-Tucker (KKT) condition for a constrained local minimum
- The Hessian of the Lagrangian function is
      H_L(x, λ) = [ B(x, λ)   J_g^T(x) ]
                  [ J_g(x)       0     ],
  where B(x, λ) = H_f(x) + Σ_{i=1}^m λ_i H_{g_i}(x). H_L is sometimes called the KKT (Karush-Kuhn-Tucker) matrix. H_L is symmetric, but not in general positive definite
- A critical point (x_*, λ_*) of L is a constrained minimum if B(x_*, λ_*) is positive definite on the null space of J_g(x_*)
- Let Z form a basis of null(J_g(x_*)); then the projected Hessian Z^T B Z should be positive definite
Sequential Quadratic Programming
- ∇L(x, λ) = 0 can be solved using Newton's method. The kth Newton step solves
      [ B(x_k, λ_k)   J_g^T(x_k) ] [ s_k ]  = − [ ∇f(x_k) + J_g^T(x_k) λ_k ]
      [ J_g(x_k)          0      ] [ δ_k ]      [          g(x_k)          ],
  and then x_{k+1} = x_k + s_k and λ_{k+1} = λ_k + δ_k
- The above system of equations is the first-order optimality condition for the constrained optimization problem
      min_s  (1/2) s^T B(x_k, λ_k) s + s^T (∇f(x_k) + J_g^T(x_k) λ_k)
      subject to  J_g(x_k) s + g(x_k) = 0
- This subproblem is a quadratic programming problem, so the approach based on Newton's method is known as sequential quadratic programming
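A minimal Python sketch of one SQP (Newton-KKT) step, assuming callables grad_f, g, jac_g, and hess_lagrangian are provided (hypothetical names; a dense direct solve with no globalization or line search):

```python
import numpy as np

def sqp_step(x, lam, grad_f, g, jac_g, hess_lagrangian):
    """One Newton step on the KKT conditions for min f(x) s.t. g(x) = 0."""
    B = hess_lagrangian(x, lam)        # B(x_k, lam_k), n x n
    J = jac_g(x)                       # J_g(x_k), m x n
    n, m = len(x), len(lam)
    # Assemble the KKT matrix [[B, J^T], [J, 0]] and the right-hand side
    K = np.block([[B, J.T],
                  [J, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f(x) + J.T @ lam, g(x)])
    sol = np.linalg.solve(K, rhs)
    s, delta = sol[:n], sol[n:]
    return x + s, lam + delta
```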
Solving KKT System
- The KKT system
      [ B   J^T ] [ s ]  = − [ w ]
      [ J    0  ] [ δ ]      [ g ]
  can be solved in several ways
- Direct solution
  - Solve the system using a method for symmetric indefinite factorization, such as LDL^T with pivoting, or
  - Use an iterative method such as GMRES or MINRES
- Range-space method
  - Use block elimination to obtain the symmetric system
        (J B^{-1} J^T) δ = g − J B^{-1} w,
    and then solve
        B s = −w − J^T δ
  - The first equation finds δ in the range space of J
  - This approach is attractive when the number of constraints m is relatively small, because J B^{-1} J^T is m × m
  - However, it requires B to be nonsingular and J to have full rank. Also, the condition number of J B^{-1} J^T may be large
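A brief Python sketch of the range-space elimination (dense, assuming B is nonsingular and J has full row rank; the function name is ours):

```python
import numpy as np

def kkt_range_space(B, J, w, g):
    """Solve [[B, J^T], [J, 0]] [s; delta] = -[w; g] by block elimination."""
    Binv_w = np.linalg.solve(B, w)               # B^{-1} w
    Binv_Jt = np.linalg.solve(B, J.T)            # B^{-1} J^T
    S = J @ Binv_Jt                              # Schur complement J B^{-1} J^T (m x m)
    delta = np.linalg.solve(S, g - J @ Binv_w)   # (J B^{-1} J^T) delta = g - J B^{-1} w
    s = np.linalg.solve(B, -w - J.T @ delta)     # B s = -w - J^T delta
    return s, delta
```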
Solving KKT System, Cont'd
- Null-space method
  - Let the columns of Z form a basis of the null space of J; Z can be obtained from the QR factorization of J^T, and then JZ = 0
  - Let JY = R^T, and write s = Yu + Zv. The second block row yields
        J s = J(Yu + Zv) = R^T u = −g,
    and premultiplying the first block row by Z^T yields
        (Z^T B Z) v = −Z^T (w + B Y u)
  - Finally,
        Y^T J^T δ = R δ = −Y^T (w + B s)
  - This method is advantageous when n − m is small
  - It is more stable than the range-space method. Also, B does not need to be nonsingular
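A brief Python sketch of the null-space method using the QR factorization of J^T (dense illustration only; u, v, Y, Z, R follow the notation above and the function name is ours):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def kkt_null_space(B, J, w, g):
    """Solve the KKT system via the null-space method (J is m x n, m <= n)."""
    m, n = J.shape
    Q, Rfull = qr(J.T)                      # J^T = Q R, with Q an n x n orthogonal matrix
    Y, Z = Q[:, :m], Q[:, m:]               # range-space and null-space bases; J Z = 0
    R = Rfull[:m, :]                        # m x m upper triangular, so J Y = R^T
    u = solve_triangular(R, -g, trans='T')  # R^T u = -g
    ZBZ = Z.T @ B @ Z
    v = np.linalg.solve(ZBZ, -Z.T @ (w + B @ (Y @ u)))  # (Z^T B Z) v = -Z^T (w + B Y u)
    s = Y @ u + Z @ v
    delta = solve_triangular(R, -Y.T @ (w + B @ s))     # R delta = -Y^T (w + B s)
    return s, delta
```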