  • AMS527: Numerical Analysis II
    Supplementary Material on Numerical Optimization

    Xiangmin Jiao

    SUNY Stony Brook


  • Outline

    1 BFGS Method

    2 Conjugate Gradient Methods

    3 Constrained Optimization


  • BFGS Method

    The BFGS method is one of the most effective secant updating methods for minimization. It is named after Broyden, Fletcher, Goldfarb, and Shanno. Unlike Broyden’s method, BFGS preserves the symmetry of the approximate Hessian matrix. In addition, BFGS preserves the positive definiteness of the approximate Hessian matrix.

    Reference: J. Nocedal and S. J. Wright, Numerical Optimization, 2nd edition, Springer, 2006, Section 6.1.


  • Algorithm

    x_0 = initial guess
    B_0 = initial Hessian approximation
    for k = 0, 1, 2, ...
        Solve B_k s_k = −∇f(x_k) for s_k
        x_{k+1} = x_k + s_k
        y_k = ∇f(x_{k+1}) − ∇f(x_k)
        B_{k+1} = B_k + (y_k y_k^T)/(y_k^T s_k) − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k)
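    As an illustration, the following is a minimal NumPy sketch of this iteration (the quadratic test function, tolerance, and iteration cap are assumptions for the example, not part of the slides):

        import numpy as np

        def bfgs(grad_f, x0, B0, tol=1e-8, max_iter=100):
            """Minimal BFGS loop following the update above (no line search)."""
            x, B = np.asarray(x0, dtype=float), np.asarray(B0, dtype=float)
            for _ in range(max_iter):
                g = grad_f(x)
                if np.linalg.norm(g) < tol:
                    break
                s = np.linalg.solve(B, -g)               # solve B_k s_k = -grad f(x_k)
                x_new = x + s                            # x_{k+1} = x_k + s_k
                y = grad_f(x_new) - g                    # y_k
                B = (B + np.outer(y, y) / (y @ s)
                       - (B @ np.outer(s, s) @ B) / (s @ B @ s))   # rank-two update
                x = x_new
            return x

        # Example: quadratic f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b
        A = np.array([[3.0, 1.0], [1.0, 2.0]])
        b = np.array([1.0, 1.0])
        x_min = bfgs(lambda x: A @ x - b, x0=np.zeros(2), B0=np.eye(2))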


  • Motivation of BFGS

    Let s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k). The matrix B_{k+1} should satisfy the secant equation

    B_{k+1} s_k = y_k

    In addition, B_{k+1} should be positive definite, which requires s_k^T y_k > 0. There are infinitely many B_{k+1} that satisfy the secant equation. Davidon (1950s) proposed choosing B_{k+1} to be closest to B_k, i.e.,

    min_B ‖B − B_k‖  subject to B = B^T, B s_k = y_k.

    BFGS instead chooses B_{k+1} so that B_{k+1}^{−1} is closest to B_k^{−1}, i.e.,

    min_B ‖B^{−1} − B_k^{−1}‖  subject to B = B^T, B s_k = y_k.


  • Properties of BFGS

    BFGS normally has a superlinear convergence rate, even though the approximate Hessian does not necessarily converge to the true Hessian. The approximate Hessian preserves positive definiteness.

    - Key idea of proof: let H_k denote B_k^{−1}. For any vector z ≠ 0, let w = z − ρ_k y_k (s_k^T z), where ρ_k > 0. Then it can be shown that

      z^T H_{k+1} z = w^T H_k w + ρ_k (s_k^T z)^2 ≥ 0.

      If s_k^T z = 0, then w = z ≠ 0, so z^T H_{k+1} z > 0.

    Line search can be used to enhance the effectiveness of BFGS. If an exact line search is performed at each iteration, BFGS terminates at the exact solution in at most n iterations for a quadratic objective function.
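    A quick numerical illustration of the positive-definiteness claim for the B-update (the random matrix and vectors are made up for this check; flipping y enforces the curvature condition s^T y > 0):

        import numpy as np

        rng = np.random.default_rng(2)
        M = rng.standard_normal((4, 4))
        B = M @ M.T + 4 * np.eye(4)           # start from an SPD approximate Hessian
        s = rng.standard_normal(4)
        y = rng.standard_normal(4)
        if s @ y <= 0:                        # enforce the curvature condition s^T y > 0
            y = -y
        B_new = (B + np.outer(y, y) / (y @ s)
                   - (B @ np.outer(s, s) @ B) / (s @ B @ s))
        print(np.linalg.eigvalsh(B_new))      # all eigenvalues should come out positive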


  • Outline

    1 BFGS Method

    2 Conjugate Gradient Methods

    3 Constrained Optimization


  • Motivation of Conjugate Gradients

    Conjugate gradient can be used to solve a linear system Ax = b, where A is symmetric positive definite (SPD).

    If A is m × m SPD, then the quadratic function

    φ(x) = (1/2) x^T A x − x^T b

    has a unique minimum. The negative gradient of this function is the residual vector

    −∇φ(x) = b − Ax = r

    so the minimum is attained precisely when Ax = b.


  • Search Direction in Conjugate Gradients

    Optimization methods have the form

    x_{n+1} = x_n + α_n p_n

    where p_n is the search direction and α_n is the step length, chosen to minimize φ(x_n + α_n p_n). The line search parameter can be determined analytically as α_n = r_n^T p_n / (p_n^T A p_n). In CG, p_n is chosen to be A-conjugate (or A-orthogonal) to the previous search directions, i.e., p_n^T A p_j = 0 for j < n.


  • Optimality of Step Length

    Select the step length α_n along the vector p_{n−1} to minimize φ(x) = (1/2) x^T A x − x^T b. Let x_n = x_{n−1} + α_n p_{n−1}. Then

    φ(x_n) = (1/2) (x_{n−1} + α_n p_{n−1})^T A (x_{n−1} + α_n p_{n−1}) − (x_{n−1} + α_n p_{n−1})^T b
           = (1/2) α_n^2 p_{n−1}^T A p_{n−1} + α_n p_{n−1}^T A x_{n−1} − α_n p_{n−1}^T b + constant
           = (1/2) α_n^2 p_{n−1}^T A p_{n−1} − α_n p_{n−1}^T r_{n−1} + constant

    Therefore,

    dφ/dα_n = 0  ⇒  α_n p_{n−1}^T A p_{n−1} − p_{n−1}^T r_{n−1} = 0  ⇒  α_n = p_{n−1}^T r_{n−1} / (p_{n−1}^T A p_{n−1}).

    In addition, p_{n−1}^T r_{n−1} = r_{n−1}^T r_{n−1}, because p_{n−1} = r_{n−1} + β_{n−1} p_{n−2} and r_{n−1}^T p_{n−2} = 0 due to the following theorem.
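    A quick numerical sanity check of this step-length formula (the SPD matrix, right-hand side, and iterate below are made up for illustration):

        import numpy as np

        A = np.array([[4.0, 1.0], [1.0, 3.0]])
        b = np.array([1.0, 2.0])
        x = np.array([0.5, -0.2])               # plays the role of x_{n-1}
        r = b - A @ x                           # residual at x
        p = r.copy()                            # use the residual as a search direction

        phi = lambda z: 0.5 * z @ A @ z - z @ b
        alpha_star = (p @ r) / (p @ A @ p)      # analytic step length from the derivation

        # phi(x + alpha * p) is smallest at alpha_star (the midpoint of this grid)
        alphas = np.linspace(alpha_star - 0.5, alpha_star + 0.5, 101)
        values = [phi(x + a * p) for a in alphas]
        assert np.argmin(values) == 50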


  • Conjugate Gradient Method

    Algorithm: Conjugate Gradient Method
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = (r_{n−1}^T r_{n−1}) / (p_{n−1}^T A p_{n−1})    step length
        x_n = x_{n−1} + α_n p_{n−1}                           approximate solution
        r_n = r_{n−1} − α_n A p_{n−1}                         residual
        β_n = (r_n^T r_n) / (r_{n−1}^T r_{n−1})               improvement this step
        p_n = r_n + β_n p_{n−1}                               search direction

    Only one matrix-vector product A p_{n−1} is needed per iteration. Apart from the matrix-vector product, the number of operations per iteration is O(m). CG can be viewed as minimization of the quadratic function φ(x) = (1/2) x^T A x − x^T b by modifying steepest descent. It was first proposed by Hestenes and Stiefel in the 1950s.
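    A direct NumPy transcription of this algorithm (the stopping tolerance and the small test system are illustrative assumptions):

        import numpy as np

        def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
            """Plain linear CG for an SPD matrix A, following the algorithm above."""
            m = len(b)
            max_iter = max_iter or m
            x = np.zeros(m)
            r = b.copy()                         # r_0 = b since x_0 = 0
            p = r.copy()                         # p_0 = r_0
            rs_old = r @ r
            for _ in range(max_iter):
                Ap = A @ p                       # the one matrix-vector product per iteration
                alpha = rs_old / (p @ Ap)        # step length
                x = x + alpha * p                # approximate solution
                r = r - alpha * Ap               # residual
                rs_new = r @ r
                if np.sqrt(rs_new) < tol:
                    break
                p = r + (rs_new / rs_old) * p    # next A-conjugate search direction
                rs_old = rs_new
            return x

        A = np.array([[4.0, 1.0], [1.0, 3.0]])   # small SPD example
        b = np.array([1.0, 2.0])
        x = conjugate_gradient(A, b)             # should agree with np.linalg.solve(A, b)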


  • An Alternative Interpretation of CG

    Algorithm: CG
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = r_{n−1}^T r_{n−1} / (p_{n−1}^T A p_{n−1})
        x_n = x_{n−1} + α_n p_{n−1}
        r_n = r_{n−1} − α_n A p_{n−1}
        β_n = r_n^T r_n / (r_{n−1}^T r_{n−1})
        p_n = r_n + β_n p_{n−1}

    Algorithm: A non-standard CG
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = r_{n−1}^T p_{n−1} / (p_{n−1}^T A p_{n−1})
        x_n = x_{n−1} + α_n p_{n−1}
        r_n = b − A x_n
        β_n = −r_n^T A p_{n−1} / (p_{n−1}^T A p_{n−1})
        p_n = r_n + β_n p_{n−1}

    The non-standard one is less efficient but easier to understand. It is easy to see r_n = r_{n−1} − α_n A p_{n−1} = b − A x_n.


  • Comparison of Linear and Nonlinear CG

    Algorithm: Linear CG
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = r_{n−1}^T r_{n−1} / (p_{n−1}^T A p_{n−1})
        x_n = x_{n−1} + α_n p_{n−1}
        r_n = r_{n−1} − α_n A p_{n−1}
        β_n = r_n^T r_n / (r_{n−1}^T r_{n−1})
        p_n = r_n + β_n p_{n−1}

    Algorithm: Nonlinear CG
    x_0 = initial guess, g_0 = ∇f(x_0), s_0 = −g_0
    for k = 0, 1, 2, ...
        Choose α_k to minimize f(x_k + α_k s_k)
        x_{k+1} = x_k + α_k s_k
        g_{k+1} = ∇f(x_{k+1})
        β_{k+1} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k)
        s_{k+1} = −g_{k+1} + β_{k+1} s_k

    The formula β_{k+1} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k) is due to Fletcher and Reeves (1964). An alternative formula, β_{k+1} = (g_{k+1} − g_k)^T g_{k+1} / (g_k^T g_k), is due to Polak and Ribière (1969).


  • Properties of Conjugate Gradients

    The Krylov subspaces for Ax = b are K_n = ⟨b, Ab, ..., A^{n−1}b⟩.

    Theorem. If r_{n−1} ≠ 0, the spaces spanned by the approximate solutions x_n, the search directions p_n, and the residuals r_n are all equal to the Krylov subspaces

    K_n = ⟨x_1, x_2, ..., x_n⟩ = ⟨p_0, p_1, ..., p_{n−1}⟩ = ⟨r_0, r_1, ..., r_{n−1}⟩ = ⟨b, Ab, ..., A^{n−1}b⟩

    The residuals are orthogonal (i.e., r_n^T r_j = 0 for j < n) and the search directions are A-conjugate (i.e., p_n^T A p_j = 0 for j < n).

    Theorem. If r_{n−1} ≠ 0, then the error e_n = x* − x_n is minimized in the A-norm over K_n.

    Because K_n grows monotonically, the error decreases monotonically.
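    The orthogonality and A-conjugacy claims are easy to observe numerically (the random SPD system below is made up for this check):

        import numpy as np

        rng = np.random.default_rng(1)
        M = rng.standard_normal((6, 6))
        A = M @ M.T + 6 * np.eye(6)              # SPD test matrix
        b = rng.standard_normal(6)

        x = np.zeros(6); r = b.copy(); p = r.copy()
        residuals, directions = [r.copy()], [p.copy()]
        for n in range(6):
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)
            x = x + alpha * p
            r_new = r - alpha * Ap
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p
            r = r_new
            residuals.append(r.copy()); directions.append(p.copy())

        R = np.array(residuals); P = np.array(directions)
        # off-diagonal entries of R R^T (r_i^T r_j) and P A P^T (p_i^T A p_j) are ~0
        print(np.max(np.abs(R @ R.T - np.diag(np.diag(R @ R.T)))))
        print(np.max(np.abs(P @ A @ P.T - np.diag(np.diag(P @ A @ P.T)))))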


  • Rate of Convergence

    Some important convergence results:

    - If A has n distinct eigenvalues, CG converges in at most n steps
    - If A has 2-norm condition number κ, the errors satisfy

      ‖e_n‖_A / ‖e_0‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^n,

      which is ≈ 2 (1 − 2/√κ)^n as κ → ∞. So convergence is expected in O(√κ) iterations.

    In general, CG performs well with clustered eigenvalues.


  • Outline

    1 BFGS Method

    2 Conjugate Gradient Methods

    3 Constrained Optimization


  • Equality-Constrained Minimization

    An equality-constrained problem has the form

    min_{x∈R^n} f(x)  subject to g(x) = 0

    where the objective function is f : R^n → R and the constraints are g : R^n → R^m, with m ≤ n.

    A necessary condition for a feasible point x* to be a solution is that the negative gradient of f lie in the space spanned by the constraint normals, i.e.,

    −∇f(x*) = J_g^T(x*) λ,

    where J_g is the Jacobian matrix of g and λ is the vector of Lagrange multipliers. Therefore, a constrained local minimum must be a critical point of the Lagrangian function

    L(x, λ) = f(x) + λ^T g(x)
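    As a small worked example (not from the slides), consider minimizing f(x) = x_1^2 + x_2^2 subject to g(x) = x_1 + x_2 − 1 = 0:

        \[
        L(x,\lambda) = x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1), \qquad
        \nabla_x L = \begin{bmatrix} 2x_1 + \lambda \\ 2x_2 + \lambda \end{bmatrix} = 0
        \;\Rightarrow\; x_1 = x_2 = -\lambda/2 .
        \]

    Substituting into the constraint x_1 + x_2 = 1 gives λ* = −1 and x* = (1/2, 1/2), and indeed −∇f(x*) = (−1, −1)^T = J_g^T(x*) λ* with J_g = [1  1].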


  • First-Order and Second-Order Optimality Conditions

    Equality-constrained minimization can be reduced to solving

    ∇L(x, λ) = [ ∇f(x) + J_g^T(x) λ ;  g(x) ] = 0,

    which is known as the Karush-Kuhn-Tucker (or KKT) condition for a constrained local minimum.

    The Hessian of the Lagrangian function is

    H_L(x, λ) = [ B(x, λ)  J_g^T(x) ;  J_g(x)  0 ]

    where B(x, λ) = H_f(x) + Σ_{i=1}^m λ_i H_{g_i}(x). H_L is sometimes called the KKT (Karush-Kuhn-Tucker) matrix. H_L is symmetric, but not in general positive definite.

    A critical point (x*, λ*) of L is a constrained minimum if B(x*, λ*) is positive definite on the null space of J_g(x*). Let Z form a basis of null(J_g(x*)); then the projected Hessian Z^T B Z should be positive definite.
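    A sketch of checking the second-order condition numerically (the matrices B and J below correspond to the toy example above and are purely illustrative; scipy.linalg.null_space supplies the basis Z):

        import numpy as np
        from scipy.linalg import null_space

        B = np.array([[2.0, 0.0], [0.0, 2.0]])    # Hessian of the Lagrangian in x
        J = np.array([[1.0, 1.0]])                # constraint Jacobian J_g

        Z = null_space(J)                         # columns span null(J)
        projected_hessian = Z.T @ B @ Z           # Z^T B Z
        eigs = np.linalg.eigvalsh(projected_hessian)
        is_constrained_min = bool(np.all(eigs > 0))   # positive definite on null(J)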


  • Sequential Quadratic Programming

    ∇L(x, λ) = 0 can be solved using Newton’s method. The kth Newton step is

    [ B(x_k, λ_k)  J_g^T(x_k) ;  J_g(x_k)  0 ] [ s_k ;  δ_k ] = −[ ∇f(x_k) + J_g^T(x_k) λ_k ;  g(x_k) ],

    and then x_{k+1} = x_k + s_k and λ_{k+1} = λ_k + δ_k.

    The above system of equations is the first-order optimality condition for the constrained optimization problem

    min_s (1/2) s^T B(x_k, λ_k) s + s^T (∇f(x_k) + J_g^T(x_k) λ_k)  subject to  J_g(x_k) s + g(x_k) = 0.

    This problem is a quadratic programming problem, so the approach using Newton’s method is known as sequential quadratic programming.
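    A minimal sketch of one such Newton/SQP step (the callables grad_f, g, J_g, B and the toy data in the usage example are assumptions made for illustration):

        import numpy as np

        def sqp_newton_step(x, lam, grad_f, g, J_g, B):
            """One Newton step on grad L = 0, i.e. one SQP iteration."""
            n, m = len(x), len(lam)
            J = J_g(x)
            # assemble [B J^T; J 0] [s; delta] = -[grad_f + J^T lam; g]
            K = np.block([[B(x, lam), J.T],
                          [J, np.zeros((m, m))]])
            rhs = -np.concatenate([grad_f(x) + J.T @ lam, g(x)])
            sol = np.linalg.solve(K, rhs)
            s, delta = sol[:n], sol[n:]
            return x + s, lam + delta

        # For the earlier toy problem (min x1^2 + x2^2 s.t. x1 + x2 = 1), a single step
        # lands on the exact solution because the problem is itself a quadratic program:
        x1, lam1 = sqp_newton_step(
            x=np.array([0.0, 0.0]), lam=np.array([0.0]),
            grad_f=lambda x: 2 * x,
            g=lambda x: np.array([x[0] + x[1] - 1.0]),
            J_g=lambda x: np.array([[1.0, 1.0]]),
            B=lambda x, lam: 2 * np.eye(2),
        )   # x1 = (0.5, 0.5), lam1 = (-1.0,)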


  • Solving KKT System

    The KKT system

    [ B  J^T ;  J  0 ] [ s ;  δ ] = −[ w ;  g ]

    can be solved in several ways.

    Direct solution
    - Solve the system using a method for symmetric indefinite factorization, such as LDL^T with pivoting, or
    - Use an iterative method such as GMRES or MINRES

    Range-space method
    - Use block elimination to obtain the symmetric system

      (J B^{−1} J^T) δ = g − J B^{−1} w

      and then

      B s = −w − J^T δ

    - The first equation finds δ in the range space of J
    - It is attractive when the number of constraints m is relatively small, because J B^{−1} J^T is m × m
    - However, it requires B to be nonsingular and J to have full rank. Also, the condition number of J B^{−1} J^T may be large
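    A sketch of the range-space elimination, with a check against solving the full KKT matrix directly (the random SPD data are made up for the test):

        import numpy as np

        def solve_kkt_range_space(B, J, w, g):
            """Range-space solve of [B J^T; J 0][s; delta] = -[w; g].
            Assumes B is nonsingular and J has full row rank."""
            Binv_w = np.linalg.solve(B, w)
            Binv_Jt = np.linalg.solve(B, J.T)
            S = J @ Binv_Jt                            # J B^{-1} J^T (m x m)
            delta = np.linalg.solve(S, g - J @ Binv_w)
            s = np.linalg.solve(B, -w - J.T @ delta)   # B s = -w - J^T delta
            return s, delta

        n, m = 5, 2
        rng = np.random.default_rng(0)
        M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)   # SPD
        J = rng.standard_normal((m, n))
        w = rng.standard_normal(n); g = rng.standard_normal(m)

        s, delta = solve_kkt_range_space(B, J, w, g)
        K = np.block([[B, J.T], [J, np.zeros((m, m))]])
        assert np.allclose(np.concatenate([s, delta]),
                           np.linalg.solve(K, -np.concatenate([w, g])))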

  • Solving KKT System Cont’d

    Null-space method

    Let Z be a basis of the null space of J; it can be obtained from the QR factorization of J^T. Then J Z = 0.

    Let J Y = R^T, and write s = Y u + Z v. The second block row yields

    J s = J(Y u + Z v) = R^T u = −g

    and premultiplying the first block row by Z^T yields

    (Z^T B Z) v = −Z^T (w + B Y u)

    Finally,

    Y^T J^T δ = R δ = −Y^T (w + B s)

    This method is advantageous when n − m is small. It is more stable than the range-space method. Also, B does not need to be nonsingular.
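    A sketch of the null-space solve (assumes J has full row rank; the complete QR factorization of J^T supplies Y, Z, and R as in the slide):

        import numpy as np

        def solve_kkt_null_space(B, J, w, g):
            """Null-space solve of [B J^T; J 0][s; delta] = -[w; g]."""
            m = J.shape[0]
            Q, Rfull = np.linalg.qr(J.T, mode='complete')   # J^T = Q [R; 0]
            Y, Z, R = Q[:, :m], Q[:, m:], Rfull[:m, :]      # J Z = 0, J Y = R^T
            u = np.linalg.solve(R.T, -g)                    # R^T u = -g
            v = np.linalg.solve(Z.T @ B @ Z, -Z.T @ (w + B @ (Y @ u)))
            s = Y @ u + Z @ v
            delta = np.linalg.solve(R, -Y.T @ (w + B @ s))  # R delta = -Y^T (w + B s)
            return s, delta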



