  • Continuous Optimization Methods

    These slides are prepared with the goal of occasionally providing the instructor rest from writing on the board during lectures. As such, they contain only part of the information that was discussed in the lecture, and do not provide a summary or outline of the lectures or assigned readings.

    Winter 2019

    IOE 511/Math 562: ContOpt Page 1

  • IOE 511/Math 562: Continuous Optimization Methods

    Instructor: Marina A. Epelman [email protected]
    GSI: Geunyeong Byeon [email protected]
    Office hours: as announced on Canvas

    Online:
    I Canvas: Course materials, including announcements, lecture notes, reading assignments, grades, etc.; submission of programming assignments. Make sure to follow links to Schedule and LaTeX and Matlab resources on the web.
    I Piazza (linked to Canvas): Q&A regarding course materials, readings, homework solutions (after official solutions have been posted), etc. Please post your questions there, as well as answer questions whenever you can contribute!
    I Gradescope (linked to Canvas): Homework assignments and submissions (except for Matlab code, which will be submitted separately through Canvas).

    Make sure that you can access the Canvas, Piazza, and Gradescope sites for this course.

  • Course Logistics

    Required background: A proof-based calculus or analysis course,linear algebra, familiarity with (or willingness to learn) Matlab.

    I Homeworks (worth a total of 30%) include some computer modeling and programming
    I Midterm (worth a total of 30%), roughly at the midpoint of the course
    I Final exam (worth a total of 40%), during Finals week

    Partial honor code policies: you are allowed, indeed encouraged, to work in groups on conceptualizing the homework problems. However, each student is individually responsible for expressing their answers in their own terms, and for writing their own solutions and code to be submitted. Also, you may not acquire, read, or otherwise utilize answers from solutions handed out in other courses, or in previous terms of this course. You are also not allowed to distribute materials from this course to individuals or repositories. Please read the syllabus for complete course policies.

  • Informal (and tentative) course outline

    I Introduction to optimization
    I Optimality conditions for unconstrained problems
    I Algorithms for unconstrained problems (steepest descent, Newton's, etc.) and analysis of their convergence
    I Optimality conditions and constraint qualifications for constrained problems
    I Convexity and its role in optimization
    I Algorithms for constrained problems (SQP, barrier and penalty methods, etc.)
    I If time remains: "large-scale" optimization problems; conic optimization problems, their applications, and methods for their solution.

  • Forms of mathematical programming problems

    Unconstrained Problem:

    (NLP)  min_x  f(x)
           s.t.   x ∈ X,

    where x = (x1, . . . , xn)ᵀ ∈ Rn, f(x) : Rn → R, and X is an open set (often, but not always, X = Rn).

    Constrained Problem:

    (NLP)  min_x  f(x)
           s.t.   gi(x) ≤ 0, i = 1, . . . , m
                  hi(x) = 0, i = 1, . . . , l
                  x ∈ X,

    where g1(x), . . . , gm(x), h1(x), . . . , hl(x) : Rn → R.

  • Constraints, objectives, optimal solutions

    (NLP)  min_x  f(x)
           s.t.   gi(x) ≤ 0, i = 1, . . . , m
                  hi(x) = 0, i = 1, . . . , l
                  x ∈ X,

    I f(x) is the objective function
    I "hi(x) = 0" are equality constraints
    I "gi(x) ≤ 0" are inequality constraints
    I x ∈ Rn is feasible if it satisfies all the constraints
    I The set of all feasible points forms the feasible region
    I The goal: find a feasible point x̄ such that f(x̄) ≤ f(x) for any other feasible point x

  • Examples of NLP formulation

    I Markowitz portfolio optimization model

    I Least squares approximation

    I Maximum likelihood estimation


  • Markowitz portfolio optimization model: problem description and data

    I You have an opportunity to invest in n assets
    I The future return of asset i is a random variable Ri with expectation µi = E[Ri], i = 1, . . . , n
    I Covariances of returns are Qij = Cov(Ri, Rj), i, j = 1, . . . , n
    I At least one of these assets is a risk-free asset

  • Markowitz portfolio optimization model: a portfolio

    I Let xi, i = 1, . . . , n, be the fractions of your wealth allocated to each of the assets
    I x ≥ 0 and ∑_{i=1}^n xi = 1
    I The return of the resulting portfolio is a random variable ∑_{i=1}^n xiRi
    I Expectation: ∑_{i=1}^n xiE[Ri] = ∑_{i=1}^n xiµi
    I Variance: ∑_{i=1}^n ∑_{j=1}^n xixjCov(Ri, Rj) = ∑_{i=1}^n ∑_{j=1}^n xixjQij.

  • Markowitz portfolio optimization model: portfolio optimization

    I A portfolio is usually chosen to optimize some measure of a tradeoff between the expected return and the risk, such as

    max   ∑_{i=1}^n xiµi − α ∑_{i=1}^n ∑_{j=1}^n xixjQij
    s.t.  ∑_{i=1}^n xi = 1
          x ≥ 0,

    I Here α > 0 is a (fixed) parameter reflecting the investor's preferences in the above tradeoff.
    I The above problem is usually solved for a variety of values of α, generating the efficient frontier.
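The mean-variance objective above is easy to evaluate directly. Below is a minimal Python sketch (an editor's illustration, not from the slides; the three-asset data µ and Q, the portfolio x, and α = 2 are made up) computing ∑ xiµi − α ∑∑ xixjQij for one feasible portfolio.

```python
# Hypothetical data for n = 3 assets: expected returns mu and covariance matrix Q.
mu = [0.10, 0.07, 0.03]
Q = [[0.040, 0.010, 0.000],
     [0.010, 0.020, 0.000],
     [0.000, 0.000, 0.001]]  # third asset is nearly risk-free

def portfolio_objective(x, alpha):
    """Expected return minus alpha times variance:
    sum_i x_i mu_i - alpha * sum_i sum_j x_i x_j Q_ij."""
    n = len(x)
    exp_ret = sum(x[i] * mu[i] for i in range(n))
    var = sum(x[i] * x[j] * Q[i][j] for i in range(n) for j in range(n))
    return exp_ret - alpha * var

x = [0.5, 0.3, 0.2]                  # a feasible portfolio: x >= 0, sum x_i = 1
assert abs(sum(x) - 1.0) < 1e-12
print(portfolio_objective(x, alpha=2.0))
```

Sweeping α over a range of values and re-solving would trace out the efficient frontier mentioned above.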

  • Parameter estimation: problem description and data

    I Setup: the output (performance) of a system, y, depends on a number of input parameters (settings), a ∈ Rn, but we don't know exactly how
    I Linear measurement model: assume that y ∈ R can be expressed as a linear function

    y ≈ aᵀx

    of a for some x ∈ Rn
    I Goal: find the value of x which provides the "best fit" for the available set of input-output pairs (ai, yi), i = 1, . . . , m

  • Parameter estimation. Optimization problem: least squares

    I One measure of "fit" is the sum of squared errors between estimated and measured outputs
    I To find the best fit, solve the optimization problem:

    min_{x∈Rn}  ∑_{i=1}^m (vi)²
    s.t.  vi = yi − aiᵀx, i = 1, . . . , m

    I Same as

    min_{x∈Rn} ∑_{i=1}^m (yi − aiᵀx)² = min_{x∈Rn} ‖Ax − y‖₂²

    Here, A is the matrix with rows aiᵀ
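As a sketch of the problem above (an assumed example, not from the slides): with rows aiᵀ = (1, ti), the minimizer of ‖Ax − y‖₂² satisfies the 2×2 normal equations AᵀAx = Aᵀy, solved here by Cramer's rule on made-up data.

```python
# Fit y_i ≈ x1 + x2*t_i by minimizing sum_i (y_i - a_i^T x)^2 with a_i = (1, t_i).
t = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.1, 7.0]           # made-up measurements of roughly y = 1 + 2t

m = len(t)
# Entries of A^T A and A^T y for rows a_i^T = (1, t_i)
s1, st, stt = m, sum(t), sum(ti * ti for ti in t)
sy, sty = sum(y), sum(ti * yi for ti, yi in zip(t, y))

det = s1 * stt - st * st           # determinant of A^T A
x1 = (stt * sy - st * sty) / det   # intercept
x2 = (s1 * sty - st * sy) / det    # slope
print(x1, x2)
```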

  • Maximum likelihood estimation: one observation

    I Setup: observing a sample of realizations of a random variable Y, and trying to find out its probability distribution
    I Parametric family: assume that the distribution belongs to a family of probability distributions px(·) on R, parameterized by a vector x ∈ Rn
    I Given one observation y ∈ R, px(y) as a function of x is called the likelihood function
    I It is more convenient to work with the log-likelihood function:

    l(x) = log px(y)

    I To estimate the value of x based on one sample y, take

    x̂ = argmax_x px(y) = argmax_x l(x),

    which is the maximum likelihood (ML) estimate
    I If there is prior information available about x, we can add the constraint x ∈ C ⊆ Rn

  • Maximum likelihood estimation: multiple observations

    I Recall: we assume that the distribution of Y belongs to a parametric family of probability distributions px(·) on R, parameterized by a vector x ∈ Rn
    I For m iid samples (y1, . . . , ym), the log-likelihood function is

    l(x) = log( ∏_{i=1}^m px(yi) ) = ∑_{i=1}^m log px(yi)

    I The ML estimation is thus an optimization problem:

    max l(x) subject to x ∈ C

  • Maximum likelihood estimation. Example: linear measurement model

    I Return to the linear measurement model:
      I Previously assumed that y ≈ aᵀx
      I More specific assumption: y = aᵀx + v, where the error v is iid random noise with density p(v)
    I m measurement/output pairs (ai, yi) give us m samples of v:

    vi = yi − aiᵀx

    I The likelihood function is

    px(y) = ∏_{i=1}^m p(yi − aiᵀx),

    and the log-likelihood function is

    l(x) = ∑_{i=1}^m log p(yi − aiᵀx)

  • Maximum likelihood estimation. Example: linear measurement model, Gaussian noise

    I Suppose the noise is Gaussian with mean 0 and (unknown) standard deviation σ.
    I Density:

    p(z) = (1/√(2πσ²)) e^{−z²/(2σ²)}

    I Log-likelihood function:

    l(x) = −(m/2) log(2πσ²) − (1/(2σ²)) ‖Ax − y‖₂²

    I Therefore, the ML estimate of x is

    argmin_x ‖Ax − y‖₂²,

    the solution of the least squares approximation problem
    I This is the idea behind linear regression!

  • Calculus bootcamp

    A quick overview of definitions and results from (multivariate) calculus and analysis we will use throughout the course. Additional references:

    I Griva, Nash, and Sofer, "Linear and Nonlinear Optimization," Appendices A and B
    I Bertsekas, "Nonlinear Programming," Appendix A
    I Bazaraa, Sherali, and Shetty, "Nonlinear Programming: Theory and Algorithms," Appendix A

  • Vectors and Norms

    I Rn: set of all n-dimensional real vectors (x1, . . . , xn)ᵀ ("ᵀ" — transpose)
    I Definition: a norm ‖·‖ on Rn is a mapping of Rn into R such that:
      1. ‖x‖ ≥ 0 ∀x ∈ Rn; ‖x‖ = 0 ⇔ x = 0.
      2. ‖cx‖ = |c| · ‖x‖ ∀c ∈ R, x ∈ Rn.
      3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ ∀x, y ∈ Rn.
    I Euclidean norm ‖·‖₂: ‖x‖₂ = √(xᵀx) = (∑_{i=1}^n xi²)^{1/2}.
    I Cauchy-Schwarz inequality for the Euclidean norm: |xᵀy| ≤ ‖x‖₂ · ‖y‖₂, with equality ⇔ x = αy.
    I All norms on Rn are equivalent, i.e., for any two norms ‖·‖1 and ‖·‖2 (here denoting arbitrary norms, not necessarily the l1 and l2 norms) ∃α1, α2 > 0 s.t. α1‖x‖1 ≤ ‖x‖2 ≤ α2‖x‖1 ∀x ∈ Rn.
    I Ball of radius ε > 0 centered at x: B(x, ε) = {y : ‖y − x‖ ≤ ε} (sometimes — strict inequality).
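The norm axioms and the Cauchy-Schwarz inequality can be spot-checked numerically; this small Python sketch (an editor's illustration, not part of the slides) tests them for the Euclidean norm on random vectors.

```python
import math
import random

def norm2(x):
    """Euclidean norm ||x||_2 = sqrt(x^T x)."""
    return math.sqrt(sum(xi * xi for xi in x))

random.seed(0)
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    c = random.uniform(-3, 3)
    # Axiom 2: homogeneity ||c x|| = |c| ||x||
    assert abs(norm2([c * xi for xi in x]) - abs(c) * norm2(x)) < 1e-12
    # Axiom 3: triangle inequality ||x + y|| <= ||x|| + ||y||
    assert norm2([xi + yi for xi, yi in zip(x, y)]) <= norm2(x) + norm2(y) + 1e-12
    # Cauchy-Schwarz: |x^T y| <= ||x||_2 ||y||_2
    assert abs(sum(xi * yi for xi, yi in zip(x, y))) <= norm2(x) * norm2(y) + 1e-12
print("all checks passed")
```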

  • Sequences and Limits in R.

    I Notation: a sequence {xk : k = 1, 2, . . .} ⊂ R, {xk}k for short.
    I Definition: {xk}k ⊂ R converges to x ∈ R (written xk → x, or lim_{k→∞} xk = x) if

    ∀ε > 0 ∃Kε : |xk − x| ≤ ε (equiv., xk ∈ B(x, ε)) ∀k ≥ Kε.

    I Definition: xk → ∞ (resp. −∞) if

    ∀A ∃KA : xk ≥ A (resp. xk ≤ A) ∀k ≥ KA.

  • Other properties of sequences in R

    I Definition: {xk}k is bounded above (below): ∃A : xk ≤ A (xk ≥ A) ∀k.
    I Definition: {xk}k is bounded: {|xk|}k is bounded; equivalently, {xk}k is bounded above and below.
    I Definition: {xk}k is nonincreasing (nondecreasing): xk+1 ≤ xk (xk+1 ≥ xk) ∀k; monotone: nondecreasing or nonincreasing.
    I Proposition: Every monotone sequence in R has a limit (possibly infinite). If it is also bounded, the limit is finite.

  • Sequences in Rn

    I Notation: a sequence {xk : k = 1, 2, . . .} ⊂ Rn, {xk}k for short.
    I Definitions: {xk}k ⊂ Rn converges to x ∈ Rn (is bounded) if {xki}k, i.e., the sequence of ith coordinates of the xk's, converges to xi (is bounded) ∀i.
    I Propositions:
      I xk → x ⇔ ‖xk − x‖ → 0
      I {xk}k is bounded ⇔ {‖xk‖}k is bounded
    I Note: ‖xk‖ → ‖x‖ does not imply that xk → x!! (Unless x = 0.)

  • Limit Points of Sequences vs Limits

    I Definition: x is a limit point of {xk}k if there exists an infinite subsequence of {xk}k that converges to x.
    I To see the difference between limits of a sequence and limit points of a sequence, consider the sequence

    {0, 1/2, −1/2, 2/3, −2/3, 3/4, −3/4, . . .} ⊂ R

    I Proposition: let {xk}k ⊂ Rn
      I If {xk}k is bounded, {xk}k converges ⇔ it has a unique limit point
      I If {xk}k is bounded, it has at least one limit point
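A quick way to see the two limit points of the sequence above is to generate its terms; this Python sketch (illustration only) builds the sequence with exact fractions. The positive subsequence j/(j+1) tends to 1 and the negative subsequence to −1, so the sequence has two limit points and therefore no limit.

```python
from fractions import Fraction

def term(k):
    """k-th term (k >= 0) of the sequence 0, 1/2, -1/2, 2/3, -2/3, 3/4, -3/4, ..."""
    if k == 0:
        return Fraction(0)
    j = (k + 1) // 2                 # 1, 1, 2, 2, 3, 3, ...
    sign = 1 if k % 2 == 1 else -1   # odd indices positive, even indices negative
    return sign * Fraction(j, j + 1)

seq = [term(k) for k in range(9)]
# Odd-indexed terms approach +1, even-indexed terms approach -1:
# two limit points, so the sequence does not converge.
print(seq)
```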

  • Limit Points of Sets

    I Definition: x is a limit point of a set A ⊆ Rn if there exists an infinite sequence {xk}k ⊂ A with xk ≠ x ∀k that converges to x.
    I Examples: what are the limit points of the following set: {y : ‖y − x‖ ≤ ε}?
      What about {y : ‖y − x‖ < ε}?

  • Closed and Open Sets

    I Definition: A ⊆ Rn is closed if it contains all its limit points
    I Definition: A ⊆ Rn is open if its complement, Rn\A, is closed
    I Examples: balls centered at x:
      {y : ‖y − x‖ ≤ ε} — closed
      {y : ‖y − x‖ < ε} — open
    I Some sets are neither: (0, 1].
    I Proposition:
      1. Union of finitely many closed sets is closed.
      2. Intersection of closed sets is closed.
      3. Union of open sets is open.
      4. Intersection of finitely many open sets is open.
      5. A set is open ⇔ all of its elements are interior points.
      6. Every subspace of Rn is closed.
    I Definition: a point x ∈ A is interior if there is a neighborhood of x (i.e., B(x, ε) for some ε > 0) contained in A

  • Basic notions in optimization

    I Definitions

    I Types of optima

    I Existence of optima


  • Types of optimization problems

    Unconstrained Optimization Problem:

    (UP)  min_x  f(x)
          s.t.   x ∈ X,

    where x = (x1, . . . , xn)ᵀ ∈ Rn, f(x) : Rn → R, and X is an open set (usually X = Rn).

    Constrained Optimization Problem:

    (NLP)  min_x  f(x)
           s.t.   gi(x) ≤ 0, i = 1, . . . , m
                  hi(x) = 0, i = 1, . . . , l
                  x ∈ X,

    where g1(x), . . . , gm(x), h1(x), . . . , hl(x) : Rn → R.

  • Constraints and feasible region

    I A point x is feasible for (UP)/(NLP) if it satisfies all constraints (including x ∈ X)
    I The set F of all feasible points forms the feasible region, or feasible set:

    F = {x ∈ Rn : g1(x) ≤ 0, . . . , gm(x) ≤ 0, h1(x) = 0, . . . , hl(x) = 0, x ∈ X}

    I At a feasible point x̄, an inequality constraint gi(x) ≤ 0 is said to be binding, or active, if gi(x̄) = 0, and nonbinding, or inactive, if gi(x̄) < 0
    I All equality constraints are considered active at any feasible point.

  • Types of optimal solutions

    (P)  min_x or max_x  f(x)
         s.t.  x ∈ F

    I Recall: B(x̄, ε) := {x : ‖x − x̄‖ ≤ ε}
    I Definitions:
      1.5.2 x ∈ F is a global minimum of (P) if f(x) ≤ f(y) for all y ∈ F.
      1.5.4 x ∈ F is a strict global minimum of (P) if f(x) < f(y) for all y ∈ F, y ≠ x.
      1.5.1 x ∈ F is a local minimum of (P) if there exists ε > 0 such that f(x) ≤ f(y) for all y ∈ B(x, ε) ∩ F.
      1.5.3 x ∈ F is a strict local minimum of (P) if there exists ε > 0 such that f(x) < f(y) for all y ∈ B(x, ε) ∩ F, y ≠ x.
    I Local and global maxima (i.e., solutions of the problem max_{x∈F} f(x)) are defined analogously (1.5.5–1.5.8).

  • Infimum and Supremum

    I Let A ⊂ R.
      Supremum of A (sup A): smallest y such that x ≤ y ∀x ∈ A.
      Infimum of A (inf A): largest y such that x ≥ y ∀x ∈ A.
    I Not the same as max and min! Consider, for example, (0, 1).

  • Functions and Continuity

    I A ⊆ Rm, f : A → R — a function.
    I Definition: f is continuous at x̄ if

    ∀ε > 0 ∃δ > 0 : x ∈ A, ‖x − x̄‖ < δ ⇒ |f(x) − f(x̄)| < ε.

    I The above is the standard way to write the definition; I would've preferred "∀ε > 0 ∃δ_{x̄,ε} > 0 . . ."
    I Proposition: f is continuous at x̄ ⇔ for any {xn} ⊂ A with xn → x̄ we have f(xn) → f(x̄). (In other words, lim_{n→∞} f(xn) = f(lim_{n→∞} xn).)
    I Proposition:
      I Sums, products, and inverses of continuous functions are continuous (in the last case, provided the function is never zero on its domain).
      I The composition of two continuous functions is continuous.
      I Any vector norm is a continuous function.

  • Existence of solutions of optimization problems

    Thm. 1.6.1: Weierstrass' Theorem for sequences
    Let {xk}k, k → ∞, be an infinite sequence of points in the compact (i.e., closed and bounded) set F. Then some infinite subsequence of points xkj converges to a point contained in F.

    Thm. 1.6.2: Weierstrass' Theorem for functions
    Let f(x) be a continuous real-valued function on the compact nonempty set F ⊂ Rn. Then F contains a point that minimizes (maximizes) f on the set F.

  • Optimality conditions: unconstrained problems

    I Necessary conditions: identify candidates

    I Sufficient conditions: guarantee optimality

    I Convexity and its role in optimization


  • Local minima and descent directions

    (P)  min_x  f(x)
         s.t.   x ∈ X,

    where X = Rn or X ⊂ Rn is an open set

    Definition 2.1.1
    The direction d̄ is called a descent direction of f(·) at x̄ ∈ X if f(x̄ + εd̄) < f(x̄) for all ε > 0 sufficiently small.

    What is the relationship between local minima and descent directions?

  • Differentiable functions and gradients (Appendix B.2.1)

    Let f : X → R, where X ⊂ Rn is open.
    I The directional derivative of f at x̄ in the direction d is

    lim_{λ→0} (f(x̄ + λd) − f(x̄))/λ = ∇f(x̄)ᵀd

    I f is differentiable at x̄ ∈ X if ∃∇f(x̄) ∈ Rn such that

    α(x̄; x − x̄) = (f(x) − f(x̄) − ∇f(x̄)ᵀ(x − x̄)) / ‖x − x̄‖ → 0 as x → x̄

    I ∇f(x̄) is the gradient of f at x̄, and satisfies

    ∇f(x̄) = (∂f(x̄)/∂x1, . . . , ∂f(x̄)/∂xn)ᵀ,

    where ∂f(x̄)/∂xi is the directional derivative in direction ei
    I f is differentiable on X if f is differentiable at every x̄ ∈ X.

  • First order necessary optimality condition

    (P)  min_{x∈X}  f(x)

    where x ∈ Rn, f : Rn → R, and X is an open set.

    Theorem 2.1.1
    Suppose that f is differentiable at x̄. If ∃d ∈ Rn such that ∇f(x̄)ᵀd < 0, then for all λ > 0 sufficiently small, f(x̄ + λd) < f(x̄) (i.e., d is a descent direction; Def. 2.1.1).

    Recall: directional derivative lim_{λ→0} (f(x̄ + λd) − f(x̄))/λ = ∇f(x̄)ᵀd

    Necessary condition for local optimality: "if x̄ is a local minimum of (P), then x̄ must satisfy . . ."

    Corollary 2.1.1: First order necessary optimality condition
    Suppose f is differentiable at x̄. If x̄ is a local minimum, then ∇f(x̄) = 0 (such a point is called a stationary point).

  • Twice differentiable functions and Hessians (Appendix B.2.1)

    I Definition: the function f is twice differentiable at x̄ ∈ X if there exist a vector ∇f(x̄) and an n × n symmetric matrix H(x̄) (the Hessian of f at x̄) such that for each x ∈ X

    α(x̄; x − x̄) = (f(x) − f(x̄) − ∇f(x̄)ᵀ(x − x̄) − ½(x − x̄)ᵀH(x̄)(x − x̄)) / ‖x − x̄‖² → 0 as x → x̄.

    Note: this is a different function α than in the definition of ∇f
    I f is twice differentiable on X if f is twice differentiable at every x̄ ∈ X.
    I The Hessian is the matrix of second partial derivatives:

    [H(x̄)]ij = ∂²f(x̄)/∂xi∂xj,

    and for functions with continuous second derivatives it will always be symmetric:

    ∂²f(x̄)/∂xi∂xj = ∂²f(x̄)/∂xj∂xi

  • Positive semidefinite symmetric matrices, etc. (Appendix A)

    Definition
    An n × n matrix Q is called symmetric if Qij = Qji ∀i, j.
    A symmetric n × n matrix Q is called
    I positive definite if xᵀQx > 0 ∀x ∈ Rn, x ≠ 0 (Q ≻ 0, SPD)
    I positive semidefinite if xᵀQx ≥ 0 ∀x ∈ Rn (Q ⪰ 0, SPSD)
    I negative definite if xᵀQx < 0 ∀x ∈ Rn, x ≠ 0 (Q ≺ 0)
    I negative semidefinite if xᵀQx ≤ 0 ∀x ∈ Rn (Q ⪯ 0)
    I indefinite if ∃x, y ∈ Rn : xᵀQx > 0, yᵀQy < 0

  • Eigenvalues and decomposition of matrices

    I A number γ is an eigenvalue of Q if there exists v ≠ 0 such that Qv = γv; v is an eigenvector corresponding to γ
      I γ also satisfies det(Q − γI) = 0
    I If Q ∈ Rn×n is a symmetric matrix, then. . .
      I Prop. A.1.1: all of its eigenvalues are real numbers
      I Prop. A.1.2: its eigenvectors corresponding to different eigenvalues are orthogonal
      I Prop. A.1.3: Q has n (distinct) eigenvectors that form an orthonormal basis for Rn:

      v1, . . . , vn :  viᵀvj = 0 if i ≠ j,  viᵀvj = 1 if i = j

  • Eigenvalues and definiteness of matrices

    In these definitions and results, Q ∈ Rn×n is a symmetric matrix
    I Prop. A.1.5: If v1, . . . , vn are orthonormal eigenvectors corresponding to γ1, . . . , γn, then Q = RDRᵀ, where
      I R = [v1, . . . , vn] (note: Rᵀ = R⁻¹)
      I D = diag(γ1, . . . , γn)
    I Prop. A.1.4, mod.: Q is SPSD (SPD) if and only if all of its eigenvalues are nonnegative (positive)
    I Prop. A.1.6: If Q is SPSD, then Q = MᵀM for some matrix M
    I Prop. A.1.7: If Q is SPSD, then xᵀQx = 0 implies Qx = 0
    I Prop. A.1.8: Suppose Q is symmetric; then Q ⪰ 0 and nonsingular if and only if Q ≻ 0
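Prop. A.1.5 can be verified numerically for a small matrix. The sketch below (an editor's illustration; it reuses the 2×2 SPD matrix Q = [4 −2; −2 2] that appears later in the steepest-descent example) computes the eigenvalues of a symmetric 2×2 matrix in closed form and rebuilds Q = RDRᵀ as γ1·v1v1ᵀ + γ2·v2v2ᵀ; both eigenvalues come out positive, consistent with Q being SPD (Prop. A.1.4).

```python
import math

# Symmetric 2x2 matrix Q = [[a, b], [b, c]] with b != 0.
a, b, c = 4.0, -2.0, 2.0

mean = (a + c) / 2.0
rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
g1, g2 = mean + rad, mean - rad    # eigenvalues; both > 0 here, so Q is SPD

def unit_eigvec(g):
    # (Q - g I) v = 0 is solved by v proportional to (b, g - a) when b != 0
    vx, vy = b, g - a
    n = math.hypot(vx, vy)
    return (vx / n, vy / n)

v1, v2 = unit_eigvec(g1), unit_eigvec(g2)
# Reconstruct Q = g1 * v1 v1^T + g2 * v2 v2^T (the same as R D R^T)
Q_rebuilt = [[g1 * v1[i] * v1[j] + g2 * v2[i] * v2[j] for j in range(2)]
             for i in range(2)]
print(g1, g2)
print(Q_rebuilt)
```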

  • Second order conditions

    Theorem 2.1.2: Second order necessary conditions
    Suppose that f is twice continuously differentiable at x̄ ∈ X. If x̄ is a local minimum, then ∇f(x̄) = 0 and H(x̄) = ∇²f(x̄) is positive semidefinite.

    Necessary conditions only allow us to come up with a list of candidate points for minima. Sufficient condition for local optimality: "if x̄ satisfies . . . , then x̄ is a local minimum of (P)."

    Theorem 2.1.3: Second order sufficient conditions
    Suppose that f is twice differentiable at x̄. If ∇f(x̄) = 0 and H(x̄) ≻ 0 (positive definite), then x̄ is a (strict) local minimum.

    I If ∇f(x̄) = 0 and H(x̄) ≺ 0, then x̄ is a strict local maximum
    I If ∇f(x̄) = 0 and H(x̄) ⪰ 0 but not ≻ 0, we cannot be sure if x̄ is a local minimum

  • Convexity: Definitions (Appendix B.2, selections)

    I Let x1, x2 ∈ Rn. Points of the form λx1 + (1 − λ)x2 for λ ∈ [0, 1] are called convex combinations of x1 and x2
    I More generally, a point y is a convex combination of points x1, . . . , xk if y = ∑_{i=1}^k λi xi, where λi ≥ 0 ∀i and ∑_{i=1}^k λi = 1
    I A set S ⊂ Rn is called convex if ∀x1, x2 ∈ S and ∀λ ∈ [0, 1], λx1 + (1 − λ)x2 ∈ S
    I A function f : S → R, where S is a nonempty convex set, is a convex function if

    f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)  ∀x1, x2 ∈ S, ∀λ ∈ [0, 1]

    I A function f as above is called strictly convex if the inequality above is strict for all x1 ≠ x2 and λ ∈ (0, 1)
    I A function f : S → R is called concave (strictly concave) if (−f) is convex (strictly convex)

  • Convexity and minimization

    (CP)  min_x  f(x)
          s.t.   x ∈ F

    Theorem 2.1.4
    Suppose F is a nonempty convex set, f : F → R is a convex function, and x̄ is a local minimum of (CP). Then x̄ is a global minimum of f over F.

    Note:
    I A problem of minimizing a convex function over a convex feasible region (such as we considered in the theorem) is a convex optimization problem
    I If f is strictly convex, a local minimum is the unique global minimum
    I If f is (strictly) concave, a local maximum is a (unique) global maximum

  • How can we find out if a function is convex?

    To determine whether a function is convex,

    I Check the definition, or
    I For differentiable or twice differentiable functions, check the corresponding N&S condition

    Theorem B.2.1: Gradient inequality, a.k.a. N&S condition for convexity of differentiable functions
    Suppose X ⊆ Rn is a nonempty open convex set, and f : X → R is differentiable. Then f is convex iff ("if and only if") it satisfies the gradient inequality:

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)  ∀x, y ∈ X.

    In one dimension, the gradient inequality has the form

    f(y) ≥ f(x) + f′(x)(y − x)  ∀x, y ∈ X.
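The gradient inequality is easy to spot-check in one dimension; this sketch (an editor's illustration, not from the slides) verifies f(y) ≥ f(x) + f′(x)(y − x) on random point pairs for the convex function f(x) = x², and exhibits a violation for the nonconvex f(x) = x³.

```python
import random

# Convex case: f(x) = x^2, f'(x) = 2x.
f = lambda x: x * x
fprime = lambda x: 2 * x

random.seed(1)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    # Gradient inequality: the function lies above all its tangent lines
    assert f(y) >= f(x) + fprime(x) * (y - x) - 1e-9

# Nonconvex case: g(x) = x^3 violates the inequality, e.g. at x = -1, y = -2
g = lambda x: x ** 3
gprime = lambda x: 3 * x * x
assert g(-2) < g(-1) + gprime(-1) * (-2 - (-1))
print("gradient inequality holds for x^2, fails for x^3")
```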

  • How can we determine if a function is convex?

    Theorem B.2.2: N&S condition for convexity of twice-differentiable functions
    Suppose X is a nonempty open convex set, and f : X → R is twice differentiable. Then f is convex iff the Hessian of f, H(x), is positive semidefinite ∀x ∈ X.

    In one dimension, the Hessian condition has the form f′′(x) ≥ 0 ∀x ∈ X.

    Theorem: Sufficient conditions for strict convexity of twice-differentiable functions
    Suppose X is a nonempty open convex set, and f : X → R is twice differentiable. Then f is strictly convex if the Hessian of f, H(x), is positive definite ∀x ∈ X.

  • Optimality conditions for convex unconstrained programs

    Theorem 2.1.5: N&S global optimality conditions for differentiable unconstrained convex problems
    Suppose f : X → R is convex and differentiable on an open convex set X. Then x̄ ∈ X is a global minimum if and only if ∇f(x̄) = 0.

    For non-differentiable convex functions:
    I Theorem B.3.6, modified: If S is a convex set and f : S → R is convex, then ∀x̄ ∈ int S there is (at least one) subgradient vector, i.e., a vector ξ ∈ Rn with the property:

    f(y) ≥ f(x̄) + ξᵀ(y − x̄)  ∀y ∈ S.

    I Theorem 2.1.6: If f : X → R is convex on an open convex set X, then x̄ ∈ X is a global minimum if and only if 0 ∈ ∂f(x̄)
      I Here, ∂f(x̄) is the subdifferential: the set of all subgradients of f at x̄; ∂f(x̄) = {∇f(x̄)} if f is differentiable at x̄.

  • General optimization algorithms

    (P)  min  f(x)
         s.t.  x ∈ X

    where X ⊂ Rn is an open set and f(x) is differentiable on X

    General directional search optimization algorithm
    Initialization: Initialize at x0 ∈ X; set k ← 0
    Iteration k:
      1. If ∇f(xk) = 0, stop. Otherwise, choose dk — a search direction
      2. Choose αk > 0 — a step size
      3. Set xk+1 ← xk + αkdk ∈ X and k ← k + 1

  • Testing for optimality and terminating

    I The algorithm looks for a point such that ∇f(xk) = 0 (necessary for optimality; sufficient if f is convex)
    I It is unlikely to find a point satisfying this condition exactly
    I Rather, theoretical analysis of the algorithms deals with their limiting behavior, i.e., analyzes the limit points of the infinite sequence of iterates generated by the algorithm
    I In practice, the algorithms are terminated when the above condition is satisfied approximately, e.g., ‖∇f(xk)‖ ≤ ε for some pre-specified ε > 0.

  • Choosing the direction

    I We want f(xk+1) < f(xk) ∀k, so typically we choose dk that is a descent direction of f at xk, that is,

    f(xk + αdk) < f(xk)  ∀α ∈ (0, ᾱ]

    for some ᾱ > 0
    I Any dk such that ∇f(xk)ᵀdk < 0 is a descent direction whenever ∇f(xk) ≠ 0
    I Often, dk = −Dk∇f(xk), where Dk is SPD
      I Steepest descent: Dk = I, k = 0, 1, 2, . . .
      I Newton's method: Dk = H(xk)⁻¹ (provided H(xk) = ∇²f(xk) is positive definite)

  • Choosing the stepsize

    I After dk is fixed, αk ideally would solve the one-dimensional optimization problem

    min_α f(xk + αdk)

    I Usually also impossible to solve exactly (analytically)
    I Instead, αk is computed (via an iterative procedure referred to as line search) either to approximately solve the above optimization problem, or to ensure a "sufficient" decrease in the value of f

  • Line search: one-dimensional optimization

    I Suppose that f(x) is a continuously differentiable function, and ∇f(x̄)ᵀd̄ < 0
    I Let h(α) = f(x̄ + αd̄)
    I Note: h′(α) = ∇f(x̄ + αd̄)ᵀd̄, h′(0) = ∇f(x̄)ᵀd̄ < 0
    I Bisection algorithm for line search
      I Approximately solves min_{α>0} h(α) by searching for α̃ such that h′(α̃) ≈ 0
      I Idea:
        I h′(0) < 0; find α̂ > 0 such that h′(α̂) > 0
        I So, ∃α ∈ (0, α̂) such that h′(α) = 0
        I Check the sign of h′ at the midpoint of (0, α̂); continue the search on the left or right half of the interval
    I Details: FV, Section 2.8 (including Section 2.8.3: what to do if X ≠ Rn)
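The bisection idea above can be sketched in a few lines; the quadratic h used in the example below is a made-up test function, not from the slides.

```python
def bisection_line_search(hprime, alpha_hat, tol=1e-10, max_iter=200):
    """Find alpha with h'(alpha) ~ 0 on (0, alpha_hat),
    assuming h'(0) < 0 < h'(alpha_hat)."""
    lo, hi = 0.0, alpha_hat
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if abs(hprime(mid)) <= tol or hi - lo <= tol:
            return mid
        if hprime(mid) < 0:
            lo = mid        # h still decreasing: search the right half
        else:
            hi = mid        # h already increasing: search the left half
    return (lo + hi) / 2.0

# Example: h(alpha) = (alpha - 2)^2, so h'(alpha) = 2(alpha - 2); minimizer is 2
alpha = bisection_line_search(lambda a: 2 * (a - 2), alpha_hat=5.0)
print(alpha)
```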

  • Line search: Armijo’s rule (backtracking)

    I The Armijo rule is an inexact line search, designed to ensure that there is sufficient descent in the objective function, and that the step size is not too small
    I Define, for a given 0 < γ < 0.5,

    ĥ(α) = h(0) + αγh′(0)

    I Backtracking implementation of the rule:
      Step 0: Set k = 0 and α = 1. Choose γ ∈ (0, 0.5) and β ∈ (0, 1).
      Step k: If h(α) ≤ ĥ(α), choose α as the step size; stop. Otherwise, let α ← βα, k ← k + 1.
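The backtracking implementation above translates directly into code; the test function h and the parameter choices γ = 0.3, β = 0.5 below are made-up illustrations.

```python
def armijo_backtracking(h, hprime0, gamma=0.3, beta=0.5, max_iter=100):
    """Return the first alpha in {1, beta, beta^2, ...} satisfying
    h(alpha) <= h(0) + alpha * gamma * h'(0)."""
    assert 0 < gamma < 0.5 and 0 < beta < 1 and hprime0 < 0
    alpha = 1.0
    for _ in range(max_iter):
        if h(alpha) <= h(0.0) + alpha * gamma * hprime0:
            return alpha          # sufficient decrease achieved
        alpha *= beta             # backtrack
    return alpha

# Example: h(alpha) = (alpha - 0.1)^2 along some descent direction; h'(0) = -0.2
h = lambda a: (a - 0.1) ** 2
alpha = armijo_backtracking(h, hprime0=-0.2)
print(alpha)
```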

  • Illustration of Armijo’s rule

    I Can be adapted to ensure x̄ + αd̄ ∈ X
    I As a result of backtracking, the chosen stepsize is α = β^t, where t ≥ 0 is the smallest nonnegative integer such that h(β^t) ≤ ĥ(β^t).

  • Steepest descent algorithm for unconstrained optimization

    I General directional search optimization algorithm with dk = −∇f(xk)
    I Motivation: −∇f(xk)/‖∇f(xk)‖₂ is the (unit length) direction that minimizes the linear approximation of f at xk

    Steepest Descent Algorithm:
    Initialization: Initialize at x0 ∈ X; set k ← 0
    Iteration k:
      1. If ∇f(xk) = 0, stop. Otherwise, choose dk = −∇f(xk)
      2. Choose stepsize αk > 0 by performing exact or inexact line search
      3. Set xk+1 ← xk + αkdk ∈ X and k ← k + 1

    From the fact that dk = −∇f(xk) is a descent direction and Step 2, it follows that f(xk+1) < f(xk).
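The full loop (steepest descent direction plus Armijo backtracking from the previous slides) can be sketched as follows; the toy objective is an editor's example, not from the slides.

```python
import math

def steepest_descent(f, grad, x0, gamma=0.3, beta=0.5, eps=1e-8, max_iter=10000):
    """Steepest descent with Armijo backtracking:
    d^k = -grad f(x^k), stop when ||grad f(x^k)|| <= eps."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        d = [-gi for gi in g]
        hp0 = sum(gi * di for gi, di in zip(g, d))   # h'(0) = grad f(x)^T d < 0
        alpha, fx = 1.0, f(x)
        while (f([xi + alpha * di for xi, di in zip(x, d)])
               > fx + gamma * alpha * hp0) and alpha > 1e-16:
            alpha *= beta                            # backtrack
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x

# Toy problem: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimized at (1, -3)
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad = lambda x: [2 * (x[0] - 1), 4 * (x[1] + 3)]
print(steepest_descent(f, grad, [0.0, 0.0]))
```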

  • Convergence of Steepest Descent algorithm

    Theorem 2.2.1: Convergence of Steepest Descent with exact line search
    Suppose that f : Rn → R is continuously differentiable on the set S(x0) = {x ∈ Rn : f(x) ≤ f(x0)}, and that S(x0) is a closed and bounded set. Suppose further that the sequence {xk}k is generated by the steepest descent algorithm with step lengths αk chosen by an exact line search. Then every point x̄ that is a limit point* of the sequence {xk}k satisfies ∇f(x̄) = 0.

    *The textbook says "cluster point" — it's the same thing as a limit point.

  • Convergence of Steepest Descent algorithm

    I Definition: the gradient function ∇f(x) is Lipschitz continuous with constant G > 0 on the set S(x0) = {x : f(x) ≤ f(x0)} if

    ‖∇f(x) − ∇f(y)‖ ≤ G‖x − y‖  ∀x, y ∈ S(x0)

    Note: this is stronger than just continuity of ∇f(x).

    Convergence Theorem with backtracking line search, cf. BSS 8.6.3
    Suppose f : Rn → R is such that its gradient is Lipschitz continuous with constant G > 0 on the set S(x0). Suppose the sequence {xk}k is generated by the steepest descent algorithm with stepsizes chosen by backtracking line search with γ ∈ (0, 0.5) and β ∈ (0, 1). Then every limit point x̄ of the sequence {xk}k satisfies ∇f(x̄) = 0.

  • SD on strictly convex quadratic functions

  f(x) = (1/2) xᵀQx + qᵀx, where Q is SPD

Optimal solution:

  x⋆ = −Q⁻¹q,  f(x⋆) = −(1/2) qᵀQ⁻¹q

At x, d = −∇f(x) = −Qx − q, and the next iterate x′ is

  x′ = x + αd,  and  f(x′) = f(x + αd) = f(x) − α dᵀd + (1/2) α² dᵀQd

When stepsize α is determined by exact line search,

  x′ = x + (dᵀd / dᵀQd) d,  and  f(x′) = f(x) − (1/2) (dᵀd)² / (dᵀQd)

  • Example: SD on strictly convex quadratic function

Let

  Q = [[4, −2], [−2, 2]] ≻ 0  and  q = (2, −2)ᵀ;  x⋆ = (0, 1)ᵀ, and f(x⋆) = −1

If x0 = (0, 0),

  x1 = (−0.4, 0.4), x2 = (0, 0.8), etc.,

and

  f(x0) − f(x⋆) = 1, f(x1) − f(x⋆) = 0.2, f(x2) − f(x⋆) = 0.04, etc.,

and so f(xk) − f(x⋆) = 0.2^k

Interpretation: the difference between the objective value at the current iterate and f(x⋆) is reduced by a factor of 5 in every iteration of the algorithm.
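The run above is easy to reproduce numerically. The following sketch (variable names are illustrative) applies steepest descent with the exact-line-search stepsize α = dᵀd / dᵀQd to this Q and q and records the optimality gaps:

```python
import numpy as np

Q = np.array([[4.0, -2.0], [-2.0, 2.0]])
q = np.array([2.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x + q @ x
x_star = np.linalg.solve(Q, -q)          # x* = -Q^{-1} q = (0, 1)

x = np.zeros(2)
gaps = [f(x) - f(x_star)]
for _ in range(5):
    d = -(Q @ x + q)                     # steepest descent direction
    alpha = (d @ d) / (d @ Q @ d)        # exact line search stepsize
    x = x + alpha * d
    gaps.append(f(x) - f(x_star))

ratios = [gaps[k + 1] / gaps[k] for k in range(5)]
```

For this instance every ratio equals 0.2, matching the slide: the gap shrinks by a factor of 5 per iteration.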

  • Example: SD on strictly convex quadratic function

    Note: vertical axis uses log scale


  • SD on strictly convex quadratic functions

Recall: at iterate x, with stepsize found by exact line search, d = −Qx − q, and

  x′ = x + (dᵀd / dᵀQd) d,  and  f(x′) = f(x) − (1/2) (dᵀd)² / (dᵀQd)

Thus,

  [f(x′) − f(x⋆)] / [f(x) − f(x⋆)] = [f(x) − f(x⋆) − (1/2)(dᵀd)²/(dᵀQd)] / [f(x) − f(x⋆)] = 1 − 1/β,

where β = (dᵀQd)(dᵀQ⁻¹d) / (dᵀd)² (using the formulae for d and f(x⋆))

Kantorovich Inequality (Prop. 2.2.1)
Let A and a be the largest and the smallest eigenvalues of Q ≻ 0, respectively. Then, when d ≠ 0,

  β = (dᵀQd)(dᵀQ⁻¹d) / (dᵀd)² ≤ (A + a)² / (4Aa).
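The Kantorovich bound can be spot-checked numerically. This sketch (assuming a randomly generated SPD matrix; all names are illustrative) evaluates β over many random directions and compares the largest observed value against (A + a)²/(4Aa). Note that β ≥ 1 always, by the Cauchy–Schwarz inequality applied to Q^{1/2}d and Q^{−1/2}d.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 4 * np.eye(4)              # random SPD matrix
eig = np.linalg.eigvalsh(Q)
a, A = eig[0], eig[-1]                   # smallest and largest eigenvalues

worst = 0.0
for _ in range(1000):
    d = rng.standard_normal(4)
    beta = (d @ Q @ d) * (d @ np.linalg.solve(Q, d)) / (d @ d) ** 2
    worst = max(worst, beta)

bound = (A + a) ** 2 / (4 * A * a)       # Kantorovich upper bound
```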

  • Analysis of SD on a strictly convex quadratic function

  [f(x′) − f(x⋆)] / [f(x) − f(x⋆)] = 1 − 1/β ≤ 1 − 4Aa/(A + a)² = (A − a)²/(A + a)² = ((A/a − 1)/(A/a + 1))²

A/a ≥ 1 (by definition); it is called the condition number of Q

      A      a    Upper bound on 1 − 1/β    Upper bound on #Iter to reduce optimality gap by a factor of 10
    1.1    1.0    0.0023                      1
    3.0    1.0    0.25                        2
   10.0    1.0    0.67                        6
  100.0    1.0    0.96                       58
  200.0    1.0    0.98                      116
  400.0    1.0    0.99                      231
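The table's entries follow directly from the bound: the per-iteration factor is ((κ − 1)/(κ + 1))² with κ = A/a, and k iterations reduce the gap by at least that factor to the k-th power. A small sketch (function names are made up for the example):

```python
import math

def sd_rate_bound(A, a):
    """Worst-case per-iteration gap reduction factor for steepest descent
    with exact line search on a quadratic whose extreme Hessian
    eigenvalues are A >= a > 0."""
    kappa = A / a
    return ((kappa - 1) / (kappa + 1)) ** 2

def iters_to_reduce_gap(A, a, factor=10.0):
    """Smallest k with sd_rate_bound(A, a)**k <= 1/factor."""
    return math.ceil(math.log(1.0 / factor) / math.log(sd_rate_bound(A, a)))
```

For example, `iters_to_reduce_gap(100.0, 1.0)` reproduces the 58 in the table.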

  • Iterates of SD; A/a close to 1

    Pictured: level sets of f(x) and iterates of the SD algorithm


  • Iterates of SD; A/a >> 1

    Pictured: level sets of f(x) and iterates of the SD algorithm


  • Example 4 from FV: non-quadratic function

  f(x) = x1 − 0.6 x2 + 4 x3 + 0.25 x4 − ∑_{i=1}^4 log(xi) − log(5 − ∑_{i=1}^4 xi)

Asymptotically, f(x) − f(x⋆) is reduced by roughly a constant factor at each iteration. This behavior is described as a linear rate of convergence (see Appendix A.2).

  • Comments

▶ This type of convergence rate analysis is a worst-case analysis.
▶ Similarly, ‖xk − x⋆‖ will show a linear convergence rate (in the worst case).
▶ What about non-quadratic functions? If ∇f(x⋆) = 0 and H(x⋆) ≻ 0, f will behave as a near-quadratic function in some neighborhood of x⋆. The analysis of the non-quadratic case gets more involved; fortunately, the key intuition is obtained by analyzing the quadratic case.
▶ The worst-case rate of convergence for Steepest Descent with backtracking line search is also linear, with a constant that similarly depends on the condition number of the Hessian.
▶ The worst-case bound on the rate of convergence is attained in practice quite often, which is unfortunate.

  • Termination criteria

▶ ∇f(xk) = 0? Not verifiable in finite time with imprecise computation...
▶ ‖∇f(xk)‖ ≤ ε? Depends on the scaling of f(x)
▶ ‖∇f(xk)‖ ≤ ε|f(xk)|? What if f(x⋆) = 0?
▶ ‖∇f(xk)‖ ≤ ε(1 + |f(xk)|)

  • Newton’s method for unconstrained minimization

▶ General optimization algorithm with dk = −∇²f(xk)⁻¹∇f(xk) (Newton direction, or Newton step) and, in the “pure” case, αk = 1 (if ∇²f(xk) ≻ 0)
▶ Motivation: iterative method for solving systems of equations, applied to ∇f(x) = 0
▶ Motivation: if the current iterate is x̄, the next iterate solves

  min_x  f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (1/2)(x − x̄)ᵀH(x̄)(x − x̄)

“Pure” Newton’s Method:
Initialization: Initialize at x0 ∈ X; set k ← 0
Iteration k:
  1. dk = −∇²f(xk)⁻¹∇f(xk). If dk = 0, then stop.
  2. Choose stepsize αk = 1.
  3. Set xk+1 ← xk + αk dk ∈ X and k ← k + 1

  • Newton’s method for solving equations

▶ Let g : Rn → Rn. Goal: solve the system of equations g(x) = 0
▶ Starting at a point x0, approximate the function g by

  g(x0 + d) ≈ g(x0) + ∇g(x0)ᵀd,

  where ∇g(x0)ᵀ ∈ Rn×n is the Jacobian of g at x0
▶ Provided that ∇g(x0) is non-singular, solve the system of linear equations

  ∇g(x0)ᵀd = −g(x0)

  to obtain d
▶ Set the next iterate x1 = x0 + d, and continue.

  • Newton’s method for solving equations

▶ Well-studied method; well-known for its good performance when the starting point x0 is chosen appropriately.
▶ However, for other choices of x0 the algorithm may not converge, as demonstrated in the following well-known picture:

(Figure: graph of g(x) with tangent-step iterates x0, x1, x2, x3 that fail to approach the root.)

  • Observations about Newton’s method for optimization

▶ Work per iteration: on the order of n³; the bottleneck is solving

  H(xk)d = −∇f(xk)

▶ The iterates are, in general, equally attracted to local minima and local maxima. Indeed, the method is just trying to solve the system of equations ∇f(x) = 0.
▶ The method assumes ∇²f(xk) is nonsingular at each iteration. Moreover, unless ∇²f(xk) is positive definite, dk is not guaranteed to be a descent direction.
▶ There is no guarantee that f(xk+1) ≤ f(xk).
▶ Without a line search, iterates of Newton’s method may not converge, unless started “close enough” to the right point.

  • Example of Newton’s method

Let f(x) = 7x − ln(x). x⋆ = 1/7 = 0.142857143 is the unique global minimum. Iterates from three starting points:

   k   xk (x0 = 1)   xk (x0 = 0.1)   xk (x0 = 0.01)
   0   1             0.1             0.01
   1   −5            0.13            0.0193
   2                 0.1417          0.035992573
   3                 0.14284777      0.062916884
   4                 0.142857142     0.098124028
   5                 0.142857143     0.128849782
   6                                 0.14148377
   7                                 0.142843938
   8                                 0.142857142
   9                                 0.142857143
  10                                 0.142857143

(From x0 = 1, the first step lands at x1 = −5, outside the domain of ln, and the method fails.)
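These iterates are easy to check: for f(x) = 7x − ln x we have f′(x) = 7 − 1/x and f″(x) = 1/x², so the pure Newton update simplifies to x⁺ = x − f′(x)/f″(x) = 2x − 7x². A short sketch (the function name `newton_1d` is made up for the example):

```python
def newton_1d(x0, iters):
    """Pure Newton's method for f(x) = 7x - ln(x); the update
    x+ = x - f'(x)/f''(x) simplifies algebraically to 2x - 7x^2."""
    x = x0
    traj = [x]
    for _ in range(iters):
        x = 2 * x - 7 * x * x
        traj.append(x)
    return traj

traj_01 = newton_1d(0.1, 8)   # converges rapidly to x* = 1/7
traj_1 = newton_1d(1.0, 1)    # first step jumps to -5, outside dom f
```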


  • Example of Newton’s method

f(x) = −ln(1 − x1 − x2) − ln x1 − ln x2. x⋆ = (1/3, 1/3), f(x⋆) = 3.295836866.

   k   (xk)1               (xk)2               ‖xk − x⋆‖
   0   0.85                0.05                0.58925565098879
   1   0.717006802721088   0.096598639455782   0.450831061926011
   2   0.512975199133209   0.176479706723556   0.238483249157462
   3   0.352478577567272   0.273248784105084   0.0630610294297446
   4   0.338449016006352   0.32623807005996    0.00874716926379655
   5   0.333337722134802   0.333259330511655   7.41328482837195e−5
   6   0.333333343617612   0.33333332724128    1.19532211855443e−8
   7   0.333333333333333   0.333333333333333   1.57009245868378e−16
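This run can be reproduced with the analytic gradient and Hessian: writing s = 1 − x1 − x2, the partial derivatives are ∂f/∂xi = 1/s − 1/xi, ∂²f/∂xi² = 1/s² + 1/xi², and the mixed second derivative is 1/s². A sketch (the name `newton_2d` is illustrative):

```python
import numpy as np

def newton_2d(x0, iters):
    """Pure Newton's method for f(x) = -ln(1 - x1 - x2) - ln x1 - ln x2."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(iters):
        s = 1.0 - x[0] - x[1]
        grad = np.array([1.0 / s - 1.0 / x[0], 1.0 / s - 1.0 / x[1]])
        hess = np.array([[1.0 / s**2 + 1.0 / x[0]**2, 1.0 / s**2],
                         [1.0 / s**2, 1.0 / s**2 + 1.0 / x[1]**2]])
        x = x - np.linalg.solve(hess, grad)  # Newton step
        traj.append(x.copy())
    return traj

x_star = np.array([1.0 / 3.0, 1.0 / 3.0])
traj = newton_2d([0.85, 0.05], 7)
errs = [float(np.linalg.norm(p - x_star)) for p in traj]
```

The error sequence `errs` shrinks quadratically once the iterates are near x⋆, matching the last column of the table.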


  • Rate of convergence for sequences (Appendix A.2)

Let {sk} ⊂ R, and limk→∞ sk = s̄. We say that {sk} exhibits
▶ Linear convergence: ∃ c ∈ [0, 1) such that, for some k0,

  |sk+1 − s̄| / |sk − s̄| ≤ c  ∀k ≥ k0.

  (Example: sk = (1/10)^k: 0.1, 0.01, 0.001, etc.)
▶ Superlinear convergence: ∃ ck with limk→∞ ck = 0 such that, for some k0,

  |sk+1 − s̄| / |sk − s̄| ≤ ck  ∀k ≥ k0.

  (Example: sk = 0.1 · (1/k!): 1/10, 1/20, 1/60, 1/240, 1/1200, etc.)
▶ Quadratic convergence: ∃ c ≥ 0 such that, for some k0,

  |sk+1 − s̄| / |sk − s̄|² ≤ c  ∀k ≥ k0.

  (Example: sk = (1/10)^(2^(k−1)): 0.1, 0.01, 0.0001, 0.00000001, etc.)
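The three example sequences above can be generated and their ratios inspected directly (variable names are illustrative): the linear sequence has a constant ratio, the superlinear one a ratio tending to 0, and the quadratic one a bounded ratio of successive error to squared error.

```python
import math

s_lin = [0.1 ** k for k in range(1, 8)]                  # linear example
s_sup = [0.1 / math.factorial(k) for k in range(1, 8)]   # superlinear example
s_quad = [0.1 ** (2 ** (k - 1)) for k in range(1, 6)]    # quadratic example

lin_ratios = [s_lin[k + 1] / s_lin[k] for k in range(6)]         # constant 0.1
sup_ratios = [s_sup[k + 1] / s_sup[k] for k in range(6)]         # 1/(k+2) -> 0
quad_ratios = [s_quad[k + 1] / s_quad[k] ** 2 for k in range(4)] # constant 1
```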

  • Illustration of rates of convergence


  • Convergence rate of the Newton’s method

▶ To analyze an algorithm’s rate of convergence, study the convergence rate of ‖xk − x̄‖, or |f(xk) − f(x̄)|, where {xk}k is a sequence of iterates, and x̄ is its limit point.
▶ We discussed many pitfalls of the pure Newton’s method (i.e., with αk = 1 ∀k).
▶ However, under certain conditions, the method exhibits a quadratic rate of convergence.

  • Quadratic convergence of Newton’s method

‖M‖ ≡ max_{‖x‖=1} ‖Mx‖. So, ∀x, ‖Mx‖ ≤ ‖M‖·‖x‖ (A.1.4)

Thm. 2.5.1 (Quadratic convergence)
Suppose f(x) is twice continuously differentiable and x⋆ is a point for which ∇f(x⋆) = 0. Suppose H(x) satisfies:
▶ there exists a scalar h > 0 for which ‖[H(x⋆)]⁻¹‖ ≤ 1/h
▶ there exist scalars β > 0 and L > 0 for which ‖H(x) − H(y)‖ ≤ L‖x − y‖ for all x, y ∈ B(x⋆, β).

Let x satisfy ‖x − x⋆‖ < γ, where γ := min{β, 2h/(3L)}, and let xN := x − H(x)⁻¹∇f(x). Then:
(i) ‖xN − x⋆‖ ≤ ‖x − x⋆‖² · L/(2(h − L‖x − x⋆‖)) ≤ ‖x − x⋆‖² · 3L/(2h)
(ii) ‖xN − x⋆‖ < ‖x − x⋆‖ < γ
(iii) If {xk}k is the sequence of Newton method iterates with x0 = x, then ‖xk − x⋆‖ → 0 as k → ∞

  • Comments on the quadratic convergence theorem

▶ Another, more involved, convergence result: Theorem 2.5.2
▶ Convergence results in Theorem 2.5.1 are local
  ▶ i.e., they apply only if the algorithm is started “sufficiently close” to x⋆: ‖x − x⋆‖ < γ
  ▶ In practice, for most functions, values of β, L, and h are not known
  ▶ Moreover, Newton’s method is invariant under invertible linear transformations of the variables, but β, L, and h are not
▶ A good practical modification of Newton’s method would:
  ▶ Make the algorithm globally convergent, i.e., ensure that it converges for any starting point
  ▶ Control the behavior of the algorithm at points far from the optimum
    ▶ E.g., ensure (sufficient) descent
  ▶ Follow the pure Newton’s method once the iterates get close to the optimum, to achieve an asymptotic quadratic convergence rate

  • Modifications of Newton’s method

▶ Problem: ∇²f(xk) is nearly singular, or indefinite (so dk may not be a descent direction)
  ▶ Possible solution: use dk = −(∇²f(xk) + εkI)⁻¹∇f(xk)
  ▶ Here εk ≥ 0 is chosen so that the smallest eigenvalue of ∇²f(xk) + εkI is bounded below by δ > 0
  ▶ Also, don’t want εk too big; otherwise, we are essentially doing SD!
▶ Problem: αk = 1 does not result in (sufficient) descent
  ▶ Solution: use a line search to ensure descent, e.g., backtracking with initial α = 1
▶ If f is strongly convex, e.g., ∃μ > 0 : min eig H(x) > μ > 0 for all x, and H(x) is Lipschitz continuous, then, when backtracking is used, for some η > 0 and κ > 0,
  ▶ if ‖∇f(xk)‖ ≥ η, then f(xk+1) − f(xk) ≤ −κ, and
  ▶ if ‖∇f(xk)‖ < η, then αk = 1 will be selected, and the next iterate will satisfy ‖∇f(xk+1)‖ < η, and so will all the further iterates. Moreover, quadratic convergence will be observed in this phase.
▶ Good stopping criteria for Newton’s method: either ‖dk‖ ≤ ε, or −∇f(xk)ᵀdk/2 = ∇f(xk)ᵀH(xk)⁻¹∇f(xk)/2 ≤ ε²
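The first modification above (shifting the Hessian by εI) can be sketched as follows; the function name and the ε schedule are made up for the example, and positive definiteness is detected by attempting a Cholesky factorization, which is also then reused to solve for the direction.

```python
import numpy as np

def regularized_newton_direction(hess, grad, delta=1e-6):
    """Compute d = -(H + eps*I)^{-1} grad with the smallest eps in
    {0, delta, 10*delta, 100*delta, ...} making H + eps*I positive
    definite, as detected by an attempted Cholesky factorization."""
    n = hess.shape[0]
    eps = 0.0
    while True:
        try:
            L = np.linalg.cholesky(hess + eps * np.eye(n))
            break
        except np.linalg.LinAlgError:   # not positive definite yet
            eps = delta if eps == 0.0 else 10.0 * eps
    y = np.linalg.solve(L, -grad)       # forward solve with the factor
    return np.linalg.solve(L.T, y)      # back substitution

# Demo: an indefinite Hessian still yields a descent direction
H_indef = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = regularized_newton_direction(H_indef, g)
```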

  • Problem and notation

(P)  min f(x)
     s.t. g(x) ≤ 0
          h(x) = 0
          x ∈ X,

where X is an open set and
  g(x) = (g1(x), . . . , gm(x))ᵀ : Rn → Rm,
  h(x) = (h1(x), . . . , hl(x))ᵀ : Rn → Rl

F ≜ {x ∈ X : g(x) ≤ 0, h(x) = 0};  (P) min_{x∈F} f(x)

Jacobian matrices of g and h (notation “J” different from the book!):

  Jg(x) = [∇g1(x)ᵀ; . . . ; ∇gm(x)ᵀ] ∈ Rm×n  and  Jh(x) = [∇h1(x)ᵀ; . . . ; ∇hl(x)ᵀ] ∈ Rl×n

  • Directions of interest

(P)  min f(x)
     s.t. g(x) ≤ 0
          h(x) = 0
          x ∈ X

Suppose x̄ ∈ F. We define the following sets:
▶ F0 = {d : ∇f(x̄)ᵀd < 0}: any element of this set is an “improving” (i.e., descent) direction of f at x̄
▶ I(x̄) = {i : gi(x̄) = 0}: the indices of the binding inequality constraints at x̄
▶ G0 = {d : ∇gi(x̄)ᵀd < 0 ∀i ∈ I(x̄)}: any element of this set is an “inward” direction of the binding inequality constraints
▶ H0 = {d : ∇hi(x̄)ᵀd = 0 ∀i = 1, . . . , l}: the set of tangent directions of the equality constraints

Although not explicit in the notation, all of these sets depend on x̄

  • Geometric Necessary Optimality conditions, simple case

Theorem 3.1.1, option (i): linear equality constraints
Assume that h(x) is a linear function, i.e., h(x) = Ax − b for A ∈ Rl×n. If x̄ is a local minimum of (P), then F0 ∩ G0 ∩ H0 = ∅.

Proof:
▶ hi(x) = aᵢᵀx − bi and ∇hi(x) = ai, i.e., H0 = {d : aᵢᵀd = 0, i = 1, . . . , l} = {d : Ad = 0}.
▶ Suppose d ∈ F0 ∩ G0 ∩ H0. Then:
  ▶ For all λ > 0 sufficiently small, gi(x̄ + λd) < gi(x̄) = 0 ∀i ∈ I(x̄)
  ▶ For i ∉ I(x̄), since λ is small, gi(x̄ + λd) < 0 by continuity
  ▶ h(x̄ + λd) = (Ax̄ − b) + λAd = 0 for all λ
  Therefore, x̄ + λd ∈ F for all λ > 0 sufficiently small
  ▶ I.e., d is a feasible direction at x̄ in F (Def. 3.1.1)
▶ On the other hand, for all sufficiently small λ > 0, f(x̄ + λd) < f(x̄)
▶ ...contradicting the assumption that x̄ is a local minimum of (P)

To extend to nonlinear h, we need to make some assumptions...

  • Geometric Necessary Optimality conditions

Theorem 3.1.1, option (ii)
If x̄ is a local minimum of (P) and the gradient vectors ∇hi(x̄), i = 1, . . . , l are linearly independent, then F0 ∩ G0 ∩ H0 = ∅.


  • From Geometric to Algebraic FONC

▶ Stating that F0 ∩ G0 ∩ H0 = ∅ is equivalent to saying that the following system of linear equations and inequalities:

  ∇f(x̄)ᵀd < 0
  ∇gi(x̄)ᵀd < 0, i ∈ I(x̄)
  ∇hi(x̄)ᵀd = 0, i = 1, . . . , l

  has no solutions
▶ How do you show that a system doesn’t have a solution? We need a few results from Convex Analysis.

  • Separation of Convex Sets (Appendix B)

Definitions:
▶ A hyperplane is a set of the form H = {x ∈ Rn : pᵀx = α}, where p ≠ 0 is a vector in Rn and α ∈ R; the sets H⁺ = {x ∈ Rn : pᵀx ≥ α}, H⁻ = {x ∈ Rn : pᵀx ≤ α} are closed half-spaces (open half-spaces: use strict inequalities)
▶ H is said to separate sets S and T if pᵀx ≥ α for all x ∈ S and pᵀx ≤ α for all x ∈ T, and to strictly separate S and T if pᵀx > α for all x ∈ S and pᵀx < α for all x ∈ T
▶ H is said to strongly separate S and T if for some ε > 0, pᵀx > α + ε for all x ∈ S and pᵀx < α − ε for all x ∈ T

  • A separation theorem

Theorem B.3.1 (modified)
Let S be a nonempty closed convex set in Rn, and x̄ ∉ S. Then ∃p ≠ 0 and α such that H = {x : pᵀx = α} strongly (and therefore strictly) separates S and {x̄}.

Proposition
Let S be a nonempty closed convex set in Rn, and x̄ ∉ S. Then there exists a unique point w ∈ S with minimum distance from x̄. Furthermore, w is the minimizing point if and only if (x̄ − w)ᵀ(x − w) ≤ 0 for all x ∈ S.

  • Some implications of Theorem B.3.1

∂S: boundary of the set S

Theorem B.3.2
If S is a nonempty convex set and x̄ ∈ ∂S, then there exists a supporting hyperplane to S at x̄, i.e., ∃p ≠ 0 such that pᵀx ≤ pᵀx̄ for all x ∈ S.

Proof idea: Consider T = cl(S); x̄ ∈ ∂S = ∂T. Let {xi} → x̄, xi ∉ T ∀i; each xi can be separated from T by a hyperplane with normal vector pi ≠ 0. Wolog, ‖pi‖ = 1 ∀i. Let p be a limit point of {pi}. Then p is the normal vector of a supporting hyperplane to S at x̄.

One application: if f : S → R is a convex function and x̄ ∈ int S, Theorem B.3.2 can be used to show that f has a subgradient at x̄.

  • Some implications of Theorem B.3.1

Theorem B.3.3
If A and B are nonempty convex sets and A ∩ B = ∅, then A and B can be separated by a hyperplane.

Proof idea: Let S = {x1 − x2 : x1 ∈ A, x2 ∈ B}. Note: 0 ∉ S. Let T = cl(S).
▶ If 0 ∉ T, use a hyperplane that separates {0} from T to get the result.
▶ If 0 ∈ T, then, since 0 ∈ ∂S, use a supporting hyperplane to S at 0 to get the result.

• Theorems of alternatives (a.k.a. Motzkin’s Transposition Theorems)

Theorem B.4.13, Farkas’ Lemma
Given matrix A and vector b of appropriate dimensions, exactly one of the following two systems has a solution:
(1) Ax = b, x ≥ 0  or
(2) Aᵀy ≥ 0, bᵀy < 0 (or, equivalently, Aᵀy ≤ 0, bᵀy > 0).

Lemma 3.1.1: Key Lemma
Given matrices Ā, B, and H of appropriate dimensions, exactly one of the two following systems has a solution:
(i) Āx < 0, Bx ≤ 0, Hx = 0
(ii) Āᵀu + Bᵀw + Hᵀv = 0, u ≥ 0, w ≥ 0, eᵀu = 1.

  • Back to constrained optimization

(P)  min f(x)
     s.t. gi(x) ≤ 0, i = 1, . . . ,m
          hi(x) = 0, i = 1, . . . , l
          x ∈ X,

Suppose x̄ ∈ F. We define the following sets:
▶ F0 = {d : ∇f(x̄)ᵀd < 0}
▶ G0 = {d : ∇gi(x̄)ᵀd < 0 ∀i ∈ I(x̄)}, where I(x̄) = {i : gi(x̄) = 0}
▶ H0 = {d : ∇hi(x̄)ᵀd = 0 ∀i = 1, . . . , l}

  • Back to optimality conditions

Theorem 3.1.1
If x̄ is a local minimizer of (P), and either (i) h(x) is a linear function, or (ii) ∇hi(x̄), i = 1, . . . , l are linearly independent, then F0 ∩ G0 ∩ H0 = ∅.

Proof idea for version (ii): (details in Section 3.5)
▶ Since Jh(x̄) ∈ Rl×n has full row rank, we can find an l-by-l nonsingular submatrix
  ▶ i.e., partition x = (y, z), rearranging variables if necessary
▶ ...and apply the IFT to find s(z): x̄ = (s(z̄), z̄) and h(s(z), z) = 0 for z ∈ B(z̄, ε)
▶ Suppose d ∈ F0 ∩ G0 ∩ H0. Let d = (dy; dz) and let

  x(λ) = (y(λ), z(λ)) = (s(z̄ + λdz), z̄ + λdz)

▶ Using the definition of d and the formula for Js(z), one can show that, for small λ > 0, x(λ) is feasible and f(x(λ)) < f(x̄): a contradiction

  • Implicit Function Theorem

Theorem 3.5.1: Implicit Function Theorem
Let h(x) = h(y, z) : Rn → Rl (with y ∈ Rl and z ∈ Rn−l) and x̄ = (ȳ, z̄) satisfy:
1. h(x̄) = 0
2. h(x) is continuously differentiable in a neighborhood of x̄
3. The l × l Jacobian matrix

   Jyh(ȳ, z̄) = [ ∂h1(x̄)/∂y1 · · · ∂h1(x̄)/∂yl ; . . . ; ∂hl(x̄)/∂y1 · · · ∂hl(x̄)/∂yl ]

   is non-singular.

Then there exists ε > 0 along with a function s : Rn−l → Rl such that
▶ s(z̄) = ȳ
▶ ∀z ∈ B(z̄, ε), h(s(z), z) = 0
▶ ∀z ∈ B(z̄, ε), s(z) is continuously differentiable and

  Jzs(z) = −Jyh(s(z), z)⁻¹ Jzh(s(z), z)

• Fritz John Necessary Conditions¹

Theorem 3.1.2, Fritz John Necessary Conditions
Let x̄ be a feasible solution of (P). If x̄ is a local minimum of (P), then there exists (u0, u, v) such that

  u0∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,
  u0, u ≥ 0, (u0, u, v) ≠ 0,
  ui · gi(x̄) = 0, i = 1, . . . ,m.

(Note that the first equation can be rewritten as u0∇f(x̄) + Jg(x̄)ᵀu + Jh(x̄)ᵀv = 0.)

▶ The condition ui · gi(x̄) = 0, i = 1, . . . ,m is referred to as complementary slackness

¹A rare case of someone’s first name being included in the naming of a theorem!

  • KKT first order necessary conditions

Theorem 3.1.3, KKT First Order Necessary Conditions
Let x̄ be a feasible solution of (P) and let I(x̄) = {i : gi(x̄) = 0}. Further, suppose that ∇gi(x̄) for i ∈ I(x̄) and ∇hi(x̄) for i = 1, . . . , l are linearly independent. If x̄ is a local minimum, there exists (u, v) such that

  ∇f(x̄) + Jg(x̄)ᵀu + Jh(x̄)ᵀv = 0,        (1)
  u ≥ 0, uigi(x̄) = 0, i = 1, . . . ,m.     (2)

▶ Components of u and v are usually referred to as Lagrange multipliers
▶ A feasible point x̄ that together with some multiplier vectors u and v satisfies conditions (1) and (2) is referred to as a KKT point.

  • Proof of KKT FONC

x̄ must satisfy the Fritz John conditions:

  u0∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,
  u0, u ≥ 0, (u0, u, v) ≠ 0,
  uigi(x̄) = 0, i = 1, . . . ,m.

▶ If u0 > 0, redefine u ← u/u0 and v ← v/u0.
▶ If u0 = 0, then

  ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0, i.e.,
  ∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,

  i.e., the above gradients are linearly dependent: a contradiction

  • When are necessary conditions really necessary?

▶ Recall that the statement of FONC above has the form “if x̄ is a local minimum of (P) and some assumption on the constraints holds, then x̄ must be a KKT point”
▶ Without additional assumptions on the constraints, we cannot prove that the KKT conditions are necessary for optimality, i.e., that x̄ satisfies (1) and (2)
▶ An additional assumption on the constraints that enables us to proceed with the proof that the KKT conditions are necessary for optimality is called a constraint qualification
▶ We have already established (Theorem 3.1.3) that the following is a constraint qualification:
  Linear Independence Constraint Qualification: The gradients ∇gi(x̄), i ∈ I(x̄), ∇hi(x̄), i = 1, . . . , l are linearly independent.

  • Other CQs: Slater’s condition

Definition 3.1.2
A point x ∈ X is called a Slater point of problem (P) if x satisfies g(x) < 0 and h(x) = 0, i.e., x is feasible and satisfies all inequalities strictly.

Theorem 3.1.4 (Slater condition)
Suppose the problem (P) satisfies the Slater condition, i.e., gi, i = 1, . . . ,m are convex, hi(x), i = 1, . . . , l are linear with ∇hi(x), i = 1, . . . , l linearly independent, and (P) has a Slater point. Then the KKT conditions are necessary to characterize a local optimal solution.

Note: if the hi’s are linear, their gradients are constant, and the linear independence assumption is wolog

  • Proof of Slater’s condition

▶ Suppose the assumptions of Theorem 3.1.4 are satisfied, x0 is a Slater point, and x̄ is a local minimum
▶ Fritz John conditions: ∃(u0, u, v) ≠ 0 such that (u0, u) ≥ 0 and

  u0∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,  uigi(x̄) = 0 ∀i

▶ If u0 > 0, divide through by u0 to get the KKT conditions
▶ If u0 = 0, we have:

  0 = ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = ∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄)

  ▶ Let d = x0 − x̄
  ▶ For each i ∈ I(x̄), ∇gi(x̄)ᵀd < 0, since

    0 > gi(x0) ≥ gi(x̄) + ∇gi(x̄)ᵀ(x0 − x̄) = ∇gi(x̄)ᵀd

  ▶ Since the hi(x) are linear, ∇hi(x̄)ᵀd = 0, i = 1, . . . , l
  ▶ Thus, 0 = 0ᵀd = (∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄))ᵀd < 0, unless ui = 0 for all i ∈ I(x̄)
  ▶ Therefore, v ≠ 0 and Jh(x̄)ᵀv = 0, violating the linear independence assumption. This is a contradiction, and so u0 > 0.

  • Other CQs: linear constraints

Theorem 3.1.5, Linear constraints
If all constraints are linear, the KKT conditions are necessary to characterize a local optimal solution.

Proof:
▶ Our problem is min{f(x) : Ax ≤ b, Mx = g}.
▶ Suppose x̄ is a local minimum. Partition Ax ≤ b into two groups: AIx ≤ bI and AĪx ≤ bĪ (the first group active at x̄).
▶ The set of feasible directions at x̄ is precisely {d : Md = 0, AId ≤ 0}
▶ Therefore, the following system has no solutions:

  [AI; M; −M] d ≤ 0,  −∇f(x̄)ᵀd > 0

▶ From Farkas’ lemma, there exists (u, v1, v2) ≥ 0 such that

  AIᵀu + Mᵀv1 − Mᵀv2 = −∇f(x̄).

  Take v = v1 − v2 to finish the proof.

  • Applying FONC: summary (Section 3.6)

Goal: build an (exhaustive) list of candidates for local optimality
▶ Check if the problem (P) satisfies one of the global CQs (e.g., Slater, linear constraints)
  ▶ If yes, every local minimum is a KKT point; find all KKT points to form a list of candidates
▶ If not, check which feasible points satisfy/violate local CQs (e.g., linear independence)
  ▶ Among points that satisfy some CQ, all KKT points belong to the list of candidates
  ▶ Among points that violate (all) CQs, all Fritz John points belong to the list of candidates

  • Convex Problems and first-order sufficient conditions

The program

(P)  min f(x)
     s.t. g(x) ≤ 0
          h(x) = 0
          x ∈ X

is a convex problem if f, gi, i = 1, . . . ,m are convex functions, hi, i = 1, . . . , l are linear (more precisely, affine) functions, and X is a convex set.

Theorem 3.2.1 (paraphrased)
Suppose (P) is a convex problem and X is an open set. Then the first order KKT conditions are sufficient for optimality in a convex program.

  • Proof of Theorem 3.2.1: Analysis of the feasible region

▶ Because each gi is convex, the level sets

  Ci = {x ∈ X : gi(x) ≤ 0}, i = 1, . . . ,m

  are convex
▶ Because each hi is linear, the sets

  Di = {x ∈ X : hi(x) = 0}, i = 1, . . . , l

  are convex
▶ Thus the feasible region

  F = {x ∈ X : g(x) ≤ 0, h(x) = 0}

  is a convex set (that’s Proposition 3.2.1)

  • Proof of Theorem 3.2.1 — continued

▶ Let x ∈ F, x ≠ x̄.
▶ ∀i ∈ I(x̄),

  0 ≥ gi(x) ≥ gi(x̄) + ∇gi(x̄)ᵀ(x − x̄) = ∇gi(x̄)ᵀ(x − x̄)

▶ ∀i = 1, . . . , l, since hi(·) is linear,

  ∇hi(x̄)ᵀ(x − x̄) = 0

▶ From the KKT conditions, including complementarity,

  ∇f(x̄)ᵀ(x − x̄) = −(∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄))ᵀ(x − x̄) ≥ 0,

  and by the gradient inequality, f(x) ≥ f(x̄) for any feasible x.
▶ Can relax assumptions a bit: pseudoconvex f, quasiconvex gi’s (Sec. 3.3)

  • Lagrangian function

▶ Lagrangian function, or simply the Lagrangian:

  L(x, u, v) = f(x) + ∑_{i=1}^m ui gi(x) + ∑_{i=1}^l vi hi(x)

▶ Gradient conditions of the KKT necessary conditions:

  ∇xL(x̄, u, v) = 0

▶ Also,

  ∇²xxL(x, u, v) = ∇²f(x) + ∑_{i=1}^m ui∇²gi(x) + ∑_{i=1}^l vi∇²hi(x)

  • Second order KKT conditions

Theorem 3.4.1, KKT second order necessary conditions
Suppose x̄ is a local minimum of (P), and ∇gi(x̄), i ∈ I(x̄) and ∇hi(x̄), i = 1, . . . , l are linearly independent. Then x̄ must satisfy the KKT conditions. Furthermore, every d that satisfies

  ∇gi(x̄)ᵀd ≤ 0, i ∈ I(x̄)  and  ∇hi(x̄)ᵀd = 0, i = 1, . . . , l

must also satisfy

  dᵀ∇²xxL(x̄, u, v)d ≥ 0.

  • Second order KKT conditions

Theorem 3.4.2, KKT second order sufficient conditions
Suppose the point x̄ ∈ F together with multipliers (u, v) satisfies the first order KKT conditions. Let I⁺ = {i ∈ I(x̄) : ui > 0} and I⁰ = {i ∈ I(x̄) : ui = 0}. Additionally, suppose that every d ≠ 0 that satisfies

  ∇gi(x̄)ᵀd ≤ 0, i ∈ I⁰
  ∇gi(x̄)ᵀd = 0, i ∈ I⁺
  ∇hi(x̄)ᵀd = 0, i = 1, . . . , l

also satisfies

  dᵀ∇²xxL(x̄, u, v)d > 0.

Then x̄ is a (strict) local minimum.

  • Lagrangian duality

    Lectures on Lagrangian duality were presented on the whiteboard


  • Problems with linear equality constraints: projected steepest descent method

    Lecture presented on the whiteboard


  • Barrier methods

(P)  min f(x)
     s.t. gi(x) ≤ 0, i = 1, . . . ,m
          x ∈ X,

where X is an open set.

▶ F = {x ∈ X : gi(x) ≤ 0, i = 1, . . . ,m}
▶ Assumption: ∃x0 ∈ X : gi(x0) < 0, i = 1, . . . ,m
▶ Let b(x) be a barrier function:
  ▶ b(x) ≥ 0 for all x that satisfy g(x) < 0 (or any finite lower bound)
  ▶ b(x) → ∞ along any sequence of points x with g(x) < 0 such that maxi{gi(x)} → 0

  • Barrier problem

B(θ)  min f(x) + θb(x)
      s.t. g(x) < 0,
           x ∈ X

Barrier method for solving (P): solve B(θk), k = 1, 2, . . ., for a sequence of parameters satisfying θk > θk+1 > 0, θk →k→∞ 0.
▶ It is practical to initialize the algorithm for solving B(θk+1) at the solution xk to B(θk)
▶ An initial solution x0 can be found by applying a Barrier method to an auxiliary optimization problem

Barrier Convergence Theorem
Suppose f(x), g(x), and b(x) are continuous functions. Let xk, k = 1, 2, . . ., be a sequence of solutions of B(θk). Suppose there exists an optimal solution x⋆ of (P) for which B(x⋆, δ) ∩ {x : g(x) < 0} ≠ ∅ for every δ > 0. Then any limit point x̄ of {xk} solves (P).

  • Common barrier functions

  b(x) = ∑_{i=1}^m γ(−gi(x)),

where γ : R₊₊ → R is such that
▶ γ(t) → +∞ as t → 0⁺
▶ γ is monotone decreasing.

For example:
▶ γ(t) = t⁻q, where q > 0: b(x) = ∑_{i=1}^m (−gi(x))⁻q
▶ γ(t) = −ln t: b(x) = −∑_{i=1}^m ln(−gi(x)) (works if F is a bounded set)

Then

  ∇b(x) = −∑_{i=1}^m γ′(−gi(x))∇gi(x)
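The barrier scheme can be illustrated on a tiny 1D problem. This is a sketch under assumptions: the problem min x² s.t. x ≥ 1 (so g(x) = 1 − x and the log barrier gives the subproblem min x² − θ ln(x − 1)) and the function name `barrier_solve` are made up for the example; the inner minimization uses guarded Newton steps that keep iterates strictly feasible.

```python
import math

def barrier_solve(theta, x_init):
    """Minimize the barrier objective x^2 - theta*ln(x - 1) over x > 1,
    using Newton steps halved as needed to stay strictly feasible."""
    x = x_init
    for _ in range(100):
        grad = 2 * x - theta / (x - 1)
        hess = 2 + theta / (x - 1) ** 2
        step = grad / hess
        while x - step <= 1.0:      # keep the iterate strictly feasible
            step *= 0.5
        x = x - step
    return x

x = 2.0                              # strictly feasible starting point
for theta in [1.0, 0.1, 0.01, 0.001]:
    x = barrier_solve(theta, x)      # warm-start each subproblem at xk
```

Here each subproblem has the closed-form solution x(θ) = (1 + √(1 + 2θ))/2, so the final iterate should sit just inside the feasible region, approaching the constrained minimizer x⋆ = 1 as θ → 0.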

  • KKT multipliers in barrier methods

xk, the solution to B(θk), satisfies:

  ∇f(xk) − θk ∑_{i=1}^m γ′(−gi(xk))∇gi(xk) = 0,  or  ∇f(xk) + ∑_{i=1}^m uki∇gi(xk) = 0,

where uki = −θkγ′(−gi(xk)), i = 1, . . . ,m

Theorem BAR 2
Let (P) satisfy the conditions of the Barrier Convergence Theorem. Suppose γ(y) is continuously differentiable and let uk be defined as above. If xk → x̄, and x̄ satisfies the linear independence condition for the gradient vectors of the active constraints, then uk → ū, where ū is a vector of Karush-Kuhn-Tucker multipliers for the optimal solution x̄ of (P).

  • Penalty methods

    (P)  min f(x)
         s.t. gi(x) ≤ 0, i = 1, . . . , m
              hi(x) = 0, i = 1, . . . , l
              x ∈ X ⊆ Rn

    X is an open set; the feasible region is denoted by F.

    Penalty function

    A function p(x) : Rn → R is called a penalty function for (P) if
    p(x) satisfies:
    ▶ p(x) = 0 if g(x) ≤ 0 and h(x) = 0, and
    ▶ p(x) > 0 if g(x) ≰ 0 or h(x) ≠ 0.

  • Penalty functions

    ▶ Penalty functions are typically defined by

      p(x) = ∑_{i=1}^m φ(gi(x)) + ∑_{i=1}^l ψ(hi(x)),

      where
      ▶ φ(y) = 0 if y ≤ 0 and φ(y) > 0 if y > 0
      ▶ ψ(y) = 0 if y = 0 and ψ(y) > 0 if y ≠ 0.
    ▶ Example:

      p(x) = ∑_{i=1}^m (max{0, gi(x)})² + ∑_{i=1}^l hi(x)².

    ▶ More generally, we often use

      p(x) = ∑_{i=1}^m [max{0, gi(x)}]^q + ∑_{i=1}^l |hi(x)|^q, where q ≥ 1.

    ▶ If q = 1, we have the “linear penalty function.” This function may
      not be differentiable at points x where gi(x) = 0 or hi(x) = 0.

  • Penalty method

    Solve the penalty program

    P(c):  min f(x) + c p(x)
           s.t. x ∈ X

    for an increasing sequence of constants c as c → +∞.
    ▶ The scalar quantity c is called the penalty parameter.
    ▶ Idea: for a sufficiently large c, the solution x(c) of P(c) will be
      (almost) feasible.
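A one-variable sketch with the quadratic penalty (our own toy example, not from the course materials): min x s.t. x − 1 = 0, whose penalized minimizers x(c) = 1 − 1/(2c) are infeasible for every finite c but approach x⋆ = 1 as c grows.

```python
def quadratic_penalty_method(f, fp, h, hp, x, c=1.0, grow=10.0, c_max=1e8):
    """Quadratic-penalty sketch for min f(x) s.t. h(x) = 0 (one variable):
    minimize q(c, x) = f(x) + c*h(x)^2 for an increasing sequence of c,
    warm-starting each subproblem at the previous solution."""
    while c <= c_max:
        q = lambda y: f(y) + c * h(y)**2
        # gradient descent with Armijo backtracking on q(c, .)
        for _ in range(10000):
            grad = fp(x) + 2.0 * c * h(x) * hp(x)
            if abs(grad) < 1e-10:
                break
            t = 1.0
            while q(x - t * grad) > q(x) - 0.5 * t * grad * grad:
                t *= 0.5
            x -= t * grad
        c *= grow   # penalty parameter: c_{k+1} > c_k
    return x

# min x  s.t.  x - 1 = 0; x(c) = 1 - 1/(2c) approaches x* = 1 from outside F
x_c = quadratic_penalty_method(lambda x: x, lambda x: 1.0,
                               lambda x: x - 1, lambda x: 1.0, x=0.0)
```

Note the contrast with barrier methods: here every iterate is (slightly) infeasible, and feasibility is attained only in the limit c → ∞.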

  • Penalty methods: convergence and analysis

    Let q(c, x) = f(x) + c p(x) and xk = argmin_{x∈X} q(ck, x), where
    ck+1 > ck and ck → +∞ as k → +∞.

    PEN 1: Penalty Lemma

    1. q(ck, xk) ≤ q(ck+1, xk+1)
    2. p(xk) ≥ p(xk+1)
    3. f(xk) ≤ f(xk+1)
    4. f(x⋆) ≥ q(ck, xk) ≥ f(xk)

    PEN 2: Penalty Convergence Theorem

    Suppose that the problem is feasible and f(x), g(x), h(x), and
    p(x) are continuous functions. Let {xk}, k = 1, . . . ,∞, be a
    sequence of solutions to P(ck), and suppose the sequence {xk} is
    contained in a compact set. Then any limit point x̄ of {xk}
    solves (P).

  • KKT multipliers in penalty methods

    ▶ Suppose

      p(x) = ∑_{i=1}^m φ(gi(x)) + ∑_{i=1}^l ψ(hi(x)),

      where φ(y) and ψ(y) are as above.
    ▶ If φ(y) and ψ(y) are continuously differentiable and φ′(0) = 0,
      then p(x) is differentiable, and
      ∇p(x) = ∑_{i=1}^m φ′(gi(x))∇gi(x) + ∑_{i=1}^l ψ′(hi(x))∇hi(x)
    ▶ If xk solves P(ck), then

      ∇f(xk) + ck∇p(xk) = 0,

      that is,

      ∇f(xk) + ck ∑_{i=1}^m φ′(gi(xk))∇gi(xk) + ck ∑_{i=1}^l ψ′(hi(xk))∇hi(xk) = 0.

  • KKT multipliers in penalty methods

    Let

    [uk]i = ck φ′(gi(xk)),   [vk]i = ck ψ′(hi(xk)).

    Then

    ∇f(xk) + ∑_{i=1}^m [uk]i ∇gi(xk) + ∑_{i=1}^l [vk]i ∇hi(xk) = 0,

    and so we can interpret (uk, vk) as a sort of vector of
    Karush-Kuhn-Tucker multipliers.

    PEN 3: KKT multipliers in penalty methods

    Suppose φ(y) and ψ(y) are continuously differentiable and
    φ′(0) = 0, and that f(x), g(x), and h(x) are differentiable. Let
    (uk, vk) be defined as above. If xk → x̄ and x̄ satisfies the
    linear independence condition for the gradient vectors of active
    constraints, then (uk, vk) → (ū, v̄), where (ū, v̄) is a vector of
    Karush-Kuhn-Tucker multipliers for the optimal solution x̄ of (P).
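For a toy instance these estimates can be computed in closed form (our own example, not from the course materials): min x² s.t. 1 − x ≤ 0, whose KKT multiplier is u⋆ = 2 (from 2x − u = 0 at x̄ = 1).

```python
def penalty_multiplier_estimates(cs):
    """min x^2 s.t. 1 - x <= 0 with the quadratic penalty phi(y) = max(0, y)^2.
    The penalized minimizer solves 2x - 2c(1 - x) = 0, i.e. x(c) = c/(1 + c),
    and the multiplier estimate u(c) = c * phi'(g(x(c))) = 2c/(1 + c)
    tends to the KKT multiplier u* = 2 as c grows."""
    estimates = []
    for c in cs:
        x_c = c / (1.0 + c)                    # argmin of x^2 + c*max(0, 1-x)^2
        u_c = c * 2.0 * max(0.0, 1.0 - x_c)    # u(c) = c * phi'(g(x_c))
        estimates.append((x_c, u_c))
    return estimates

estimates = penalty_multiplier_estimates([1.0, 10.0, 100.0, 1000.0])
# x(c) -> 1 and u(c) -> 2 as c grows, illustrating PEN 3
```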

  • Exact Penalty Methods

    ▶ In most penalty methods, the xk’s are not in the feasible region;
      the solution to (P) is approached only in the limit.
    ▶ Can we choose a penalty function p(x) and a constant c so
      that x(c) is also an optimal solution of (P)?

    PEN 4: Linear penalty leads to an exact penalty method

    Suppose (P) is a convex program for which the Karush-Kuhn-Tucker
    conditions are necessary. Suppose that

    p(x) := ∑_{i=1}^m [max{0, gi(x)}] + ∑_{i=1}^l |hi(x)|.

    Then, as long as c is chosen sufficiently large, the sets of optimal
    solutions of P(c) and (P) coincide. In fact, it suffices to choose
    c > max{u⋆i, i = 1, . . . , m; |v⋆i|, i = 1, . . . , l}, where (u⋆, v⋆) is a vector
    of Karush-Kuhn-Tucker multipliers.

  • Augmented Lagrangian penalty function

    ▶ Is there an exact penalty method with a smooth penalty function?
    ▶ Augmented Lagrangian penalty (equality constraints only):

      LALAG(x, v) = f(x) + ∑_{i=1}^l vi hi(x) + c ∑_{i=1}^l hi(x)².

    ▶ If x̄ is the optimal solution and v̄ is the vector of
      corresponding multipliers,

      ∇x LALAG(x̄, v̄) = [∇f(x̄) + ∑_{i=1}^l v̄i ∇hi(x̄)] + 2c ∑_{i=1}^l hi(x̄)∇hi(x̄) = 0

    ▶ If (x̄, v̄) satisfy SOSC, then for c sufficiently large,
      x̄ = argmin_x LALAG(x, v̄)

  • ALAG — continued

    v̄ and c are not known in advance. The algorithm:

    Initialization: Select the initial multipliers v0 and penalty weight
        c0 > 0. Set k ← 0.
    Iteration k, x update: Solve the unconstrained problem of minimizing
        LALAG(x, vk) and let xk denote the optimal solution
        obtained. If termination criteria are satisfied, stop.
    Iteration k, v update: Obtain the updated multipliers vk+1
        according to appropriate formulas, increase k, and repeat
        the iteration.

    Multiplier updates:
    ▶ Constant: Keep the multipliers constant (not much different
      from the usual quadratic penalty method).
    ▶ Method of multipliers: Let vk+1 = vk + 2ck h(xk) (under the
      right circumstances, vk will converge to v̄ if xk converges to x̄).
    ▶ Other multiplier update methods – second order, exponential, etc.
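A one-variable sketch of the method of multipliers (our own toy example, not from the course materials): min x² s.t. x − 1 = 0, whose solution is x̄ = 1 with multiplier v̄ = −2; the penalty weight c is kept fixed.

```python
def method_of_multipliers(fp, h, hp, x, v=0.0, c=10.0, iters=50):
    """Augmented Lagrangian sketch for min f(x) s.t. h(x) = 0 (one variable):
    alternately minimize L(x, v_k) = f(x) + v_k h(x) + c h(x)^2 over x,
    then update v_{k+1} = v_k + 2 c h(x_k)."""
    for _ in range(iters):
        # inner minimization: Newton on dL/dx with finite-difference curvature
        for _ in range(100):
            grad = fp(x) + v * hp(x) + 2.0 * c * h(x) * hp(x)
            if abs(grad) < 1e-12:
                break
            eps = 1e-6
            gp2 = fp(x+eps) + v*hp(x+eps) + 2.0*c*h(x+eps)*hp(x+eps)
            x -= grad / ((gp2 - grad) / eps)
        v += 2.0 * c * h(x)   # method-of-multipliers update
    return x, v

# min x^2  s.t.  x - 1 = 0; x_bar = 1 and the multiplier v_bar = -2
x_bar, v_bar = method_of_multipliers(lambda x: 2*x,
                                     lambda x: x - 1, lambda x: 1.0,
                                     x=0.0)
```

Unlike the plain quadratic penalty, the multiplier update lets the iterates converge to the exact solution without driving c to infinity.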

  • Motivation

    P:   min_{x∈X} f(x)            P(θ):  min_{x∈X} f(x) + θb(x)
         s.t. Ax = b                      s.t. Ax = b
              g(x) ≤ 0

    ▶ Several aspects of general barrier methods were left without analysis:
      ▶ How many iterations of the (projected) Newton method are
        needed to solve each barrier problem P(θk)?
      ▶ How does the above depend on the relationship between θk+1 and θk?
      ▶ How precisely do we need to solve each P(θk)?
      ▶ How small does θ need to get to produce a good approximate
        solution of (P)?
    ▶ These questions are hard to answer in general, but we can do the
      analysis for specific classes of (P).

  • Linear Programming: The primal and dual problems

    Primal LP in standard form and its Lagrangian dual:

    P:  min cᵀx            D:  max bᵀπ
        s.t. Ax = b            s.t. Aᵀπ + s = c
             x ≥ 0                  s ≥ 0

    Note:
    ▶ If x and (π, s) are primal/dual feasible, then the optimality
      gap (or duality gap) between them is

      cᵀx − bᵀπ = cᵀx − (Ax)ᵀπ = (c − Aᵀπ)ᵀx = sᵀx

    ▶ KKT (first-order, necessary and sufficient) conditions for P:
      ▶ Ax = b, x ≥ 0 (primal feasibility)
      ▶ Aᵀπ + s = c, s ≥ 0 (gradient alignment, and dual feasibility)
      ▶ sᵀx = 0 (complementarity)

  • Barrier subproblem for P

    P(θ)  min cᵀx − θ ∑_{j=1}^n ln(xj)
          s.t. Ax = b
               x > 0

    ▶ Idea for an interior point (barrier) algorithm: solve a sequence
      of problems P(θ) with θ → 0+
    ▶ KKT conditions for P(θ):

      Ax = b, x > 0
      Aᵀπ + s = c
      (1/θ)XSe − e = 0

    ▶ If (x, π, s) satisfy these conditions, then x is feasible for P,
      (π, s) is feasible for D, and the duality gap between them is:

      xᵀs = eᵀXSe = θeᵀe = θn.

  • Idea: don’t need to solve each barrier problem “exactly”

    β-approximate solutions of P(θ)

    A β-approximate solution of P(θ) is defined as any vector (x, π, s)
    that satisfies

    Ax = b, x > 0
    Aᵀπ + s = c
    ‖(1/θ)Xs − e‖ ≤ β.

    Lemma 5.2.1

    If (x̄, π̄, s̄) is a β-approximate solution of P(θ) and β < 1, then x̄
    is feasible for P, (π̄, s̄) is feasible for D, and the duality gap
    between them satisfies:

    nθ(1 − β) ≤ cᵀx̄ − bᵀπ̄ = x̄ᵀs̄ ≤ nθ(1 + β).
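The gap bounds in Lemma 5.2.1 follow componentwise from the definition; a one-step derivation (our own summary, not a slide from the course):

```latex
\left\|\tfrac{1}{\theta}\bar X\bar s - e\right\| \le \beta
\;\Longrightarrow\;
(1-\beta)\,\theta \le \bar x_j \bar s_j \le (1+\beta)\,\theta
\quad \text{for each } j,
```

since each component of a vector is bounded in absolute value by the vector's norm; summing over j = 1, . . . , n gives nθ(1 − β) ≤ x̄ᵀs̄ ≤ nθ(1 + β), and x̄ᵀs̄ = cᵀx̄ − bᵀπ̄ by the duality-gap identity for feasible primal/dual pairs.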

  • Several flavors of interior point methods for LP in the book

    ▶ Section 5.3: Primal algorithm
      ▶ Essentially a special case of the barrier method we’ve already discussed
      ▶ For each P(θk), a β-approximate solution is found
      ▶ For the theoretical analysis, use a sequence of θk’s such that only
        one iteration of the (projected) Newton method is needed for each P(θk)
    ▶ Section 5.4: A primal-dual algorithm
      ▶ A variation on the above, but simultaneously updating primal
        and dual variables
      ▶ Theoretical complexity analysis
    ▶ Section 5.5: A more practical primal-dual algorithm

  • Primal-dual IPM: Solving P(θ) via a primal-dual Newton’s method

    Trying to solve the system of (KKT) equations:

    Ax = b, x > 0
    Aᵀπ + s = c
    (1/θ)XSe − e = 0  ⇔  XSe = θe

    Let (x̄, π̄, s̄) be our current primal and dual feasible iterate:

    Ax̄ = b, x̄ > 0,  Aᵀπ̄ + s̄ = c,  s̄ > 0

    The Newton equation system (to find the Newton direction)
    around the current iterate is:

    A∆x = 0
    Aᵀ∆π + ∆s = 0
    S̄∆x + X̄∆s = θe − X̄S̄e

  • Primal-dual IPM: Approaches for solving the primal-dual Newton equation system

    0. Closed-form expressions in (5.21) — not for use in practice
    1. Solve directly (e.g., using Matlab’s backslash operator):

       [ A   0   0 ] [ ∆x ]   [     0      ]
       [ 0   Aᵀ  I ] [ ∆π ] = [     0      ]
       [ S̄   0   X̄ ] [ ∆s ]   [ θe − X̄S̄e ]

    2. Better: use the special sparsity structure of the matrix:
       ▶ Solve (A X̄ S̄⁻¹ Aᵀ)∆π = A(x̄ − θS̄⁻¹e)
       ▶ Substitute ∆s = −Aᵀ∆π
       ▶ Substitute ∆x = −x̄ + θS̄⁻¹e − S̄⁻¹X̄∆s
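Approach 2 is easy to transcribe; a minimal numpy sketch (the function name and the tiny feasible instance are our own, not from the book):

```python
import numpy as np

def newton_direction(A, x, s, theta):
    """Primal-dual Newton direction for P(theta), computed via the
    normal equations: only the m x m matrix A X S^{-1} A^T is factorized."""
    AXS = A @ np.diag(x / s)                               # A X S^{-1}
    dpi = np.linalg.solve(AXS @ A.T, A @ (x - theta / s))  # solve for delta-pi
    ds = -A.T @ dpi                                        # back-substitute
    dx = -x + theta / s - (x / s) * ds
    return dx, dpi, ds

# tiny instance: min x1 + 2 x2  s.t.  x1 + x2 = 1, x >= 0
A = np.array([[1.0, 1.0]])
x = np.array([0.5, 0.5])                         # primal feasible, x > 0
pi = np.array([0.0]); s = np.array([1.0, 2.0])   # dual feasible: A^T pi + s = c
dx, dpi, ds = newton_direction(A, x, s, theta=0.1)
# (dx, dpi, ds) satisfies all three Newton equations:
# A dx = 0,  A^T dpi + ds = 0,  S dx + X ds = theta*e - X S e
```

For dense problems this replaces one (2n+m)×(2n+m) solve by one m×m solve plus matrix-vector products; for sparse A, sparsity of A X̄ S̄⁻¹ Aᵀ is what makes large LPs tractable.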

  • Primal-dual interior point algorithm (Alg. 12)

    Step 0: Initialization. Data is (x0, π0, s0, θ0) such that (x0, π0, s0)
        is a β-approximate solution of P(θ0) for some known
        value of β that satisfies β < 1/3. k ← 0.
    Step 1: Set current values (x̄, π̄, s̄) = (xk, πk, sk), θ̃ = θk.
    Step 2: Shrink θ. Set θ̄ = αθ̃ for some α ∈ (0, 1).
    Step 3: Compute the primal-dual Newton direction. Compute the
        Newton step for P(θ̄) at (x̄, π̄, s̄) by solving

        A∆x = 0
        Aᵀ∆π + ∆s = 0
        S̄∆x + X̄∆s = θ̄e − X̄S̄e.

    Step 4: Update all values.

        (x′, π′, s′) = (x̄, π̄, s̄) + (∆x, ∆π, ∆s)

    Step 5: Reset counter and continue.
        (xk+1, πk+1, sk+1) = (x′, π′, s′). θk+1 = θ̄.
        k ← k + 1. Go to Step 1.

  • Primal-dual IPM: Analysis of the algorithm

    Theorem 5.4.1: Explicit Quadratic Convergence of
    Primal-Dual Newton’s Method

    Suppose that (x̄, π̄, s̄) is a β-approximate solution of P(θ̄) and
    β < 1/3. Let (∆x, ∆π, ∆s) be the solution to the primal-dual
    Newton equations above, and let:

    (x′, π′, s′) = (x̄, π̄, s̄) + (∆x, ∆π, ∆s).

    Then (x′, π′, s′) is a ((1 + β)/(1 − β)²)·β²-approximate solution of P(θ̄).

    Proof at the end of Section 5.4. Two pages of rather boring
    algebra, with the key insight:

    ‖X̄⁻¹∆x‖ ≤ β/(1 − β) < 1  and  ‖S̄⁻¹∆s‖ ≤ β/(1 − β) < 1

  • Primal-dual IPM: Analysis of the algorithm

    Thm. 5.4.2: Shrinkage Theorem

    Suppose that (x̄, π̄, s̄) is a 3/40-approximate solution of P(θ̃). Let

    α = 1 − (1/8)/(1/5 + √n)

    and let θ̄ = αθ̃. Then (x̄, π̄, s̄) is a 1/5-approximate
    solution of P(θ̄).

    Thm. 5.4.3: Convergence theorem

    Suppose that (x0, π0, s0) is a β = 3/40-approximate solution of
    P(θ0), and let

    α = 1 − (1/8)/(1/5 + √n).

    Then, for all k = 1, 2, 3, . . ., (xk, πk, sk) is a β = 3/40-approximate
    solution of P(θk).

  • Primal-dual IPM: Complexity of the algorithm

    Thm. 5.4.4: Computational Guarantee

    Suppose that (x0, π0, s0) is a β = 3/40-approximate solution of
    P(θ0). In order to obtain primal and dual feasible solutions
    (xk, πk, sk) with a duality gap of at most ε, one needs to run the
    algorithm for at most

    k = ⌈10√n · ln((43 (x0)ᵀs0)/(37 ε))⌉

    iterations.

  • A practical interior point algorithm (Alg. 13)

    1. Given (x0, π0, s0) satisfying x0 > 0, s0 > 0, and θ0 > 0, and r
       satisfying 0 < r < 1, and ε > 0. Set k ← 0.
    2. Test the stopping criterion. Check if:

       (1) ‖Axk − b‖ ≤ ε
       (2) ‖Aᵀπk + sk − c‖ ≤ ε
       (3) (sk)ᵀxk ≤ ε.

       If so, STOP. If not, proceed.
    3. Set θ ← (1/10)·((xk)ᵀsk/n)
    4. Solve the Newton equation system:

       (1) A∆x = b − Axk =: r1
       (2) Aᵀ∆π + ∆s = c − Aᵀπk − sk =: r2
       (3) Sk∆x + Xk∆s = θe − XkSke =: r3

    5. Determine the step-sizes: for r ∈ (0, 1) (e.g., r = 0.99),

       αP = min{1, r · min_{j : ∆xj < 0} {−[xk]j/∆xj}},

       and αD is defined analogously using sk and ∆s. Then set
       xk+1 = xk + αP∆x, (πk+1, sk+1) = (πk, sk) + αD(∆π, ∆s),
       k ← k + 1, and return to Step 2.
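A compact numpy sketch of an Algorithm-13-style method (our own illustrative implementation: the function name, the tiny test LP, and the direct solve of the full Newton system are assumptions, not the book's code):

```python
import numpy as np

def practical_ipm(A, b, c, x, pi, s, r=0.99, eps=1e-8, max_iter=100):
    """Practical primal-dual interior point sketch: infeasible start is
    allowed via the residuals r1, r2; the full Newton system is solved
    directly (a serious code would use the normal equations)."""
    m, n = A.shape
    for _ in range(max_iter):
        r1 = b - A @ x                     # primal residual
        r2 = c - A.T @ pi - s              # dual residual
        gap = s @ x
        if np.linalg.norm(r1) <= eps and np.linalg.norm(r2) <= eps and gap <= eps:
            break                          # stopping criterion (step 2)
        theta = 0.1 * gap / n              # step 3
        # step 4: assemble [A 0 0; 0 A^T I; S 0 X] and solve
        K = np.zeros((2 * n + m, 2 * n + m))
        K[:m, :n] = A
        K[m:m + n, n:n + m] = A.T
        K[m:m + n, n + m:] = np.eye(n)
        K[m + n:, :n] = np.diag(s)
        K[m + n:, n + m:] = np.diag(x)
        d = np.linalg.solve(K, np.concatenate([r1, r2, theta - x * s]))
        dx, dpi, ds = d[:n], d[n:n + m], d[n + m:]
        # step 5: fraction-to-the-boundary step-sizes
        rp = -x[dx < 0] / dx[dx < 0]
        rd = -s[ds < 0] / ds[ds < 0]
        aP = min(1.0, r * rp.min()) if rp.size else 1.0
        aD = min(1.0, r * rd.min()) if rd.size else 1.0
        x = x + aP * dx
        pi, s = pi + aD * dpi, s + aD * ds
    return x, pi, s

# tiny LP: min x1 + 2 x2  s.t.  x1 + x2 = 1, x >= 0; the optimum is (1, 0)
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 2.0])
x_opt, pi_opt, s_opt = practical_ipm(A, b, c, x=np.array([0.5, 0.5]),
                                     pi=np.array([0.0]), s=np.array([1.0, 2.0]))
```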

  • What we covered in this course

    ▶ Some useful math: convexity, separation of convex sets,
      theorems of the alternative
    ▶ Optimality conditions for optimization problems with
      differentiable functions
      ▶ A brief detour into non-differentiable convex unconstrained problems
    ▶ Lagrangian duality for constrained optimization problems
      ▶ and its relationship to optimality conditions
    ▶ Algorithms for solving unconstrained problems with a
      twice-differentiable objective function
    ▶ Projection frameworks for problems with linear equality constraints
    ▶ Barrier and penalty frameworks for solving constrained
      problems, also under differentiability assumptions
      ▶ In detail: barrier methods for linear optimization
    ▶ Role of problem convexity in the analysis of optimality conditions
      and algorithmic performance

  • What we did not cover in this course: Other methods for solving unconstrained problems

    ▶ Conjugate gradient methods
    ▶ Quasi-Newton methods
    ▶ Trust-region methods
    ▶ Methods for continuous but non-differentiable functions
      ▶ E.g., convex non-differentiable functions
    ▶ First-order methods that attempt to improve upon steepest
      descent for large-scale problems

    Note: these methods are motivated by the basic direction-based
    methods we studied, and make modifications to the method
    necessitated by particular features of the problem.

  • What we did not cover in this course: Other methods for solving constrained problems

    ▶ Sequential quadratic programming
    ▶ The reduced gradient method in greater generality
    ▶ Practical versions of barrier and penalty methods
    ▶ ...
    ▶ Specialized methods for convex optimization
      ▶ e.g., cutting plane methods

    Note: most of these methods are extensions of the fundamental
    frameworks we discussed.

  • What we did not cover in this course: Other optimization problem types

    ▶ Problems with generalized inequality constraints
      ▶ Most prominent example: optimization over symmetric
        matrices with constraints “X SPSD”
    ▶ Infinite-dimensional problems (in n and/or m)
    ▶ Equilibrium problems
    ▶ Non-convex problems where global solutions are required
    ▶ Problems where (some of) the variables are required to take
      on integer values

    Note: analysis and algorithm development for such problems is
    often much more complicated, but at its root attempts to extend
    aspects of the analysis we did in the traditional framework.

  • So, you’ve got yourself an optimization problem: What algorithms/software to use?

    ▶ First, study what kind of problem you are dealing with:
      ▶ Linear, discrete, continuous nonlinear?
      ▶ Unconstrained?
      ▶ Constrained with bound/linear constraints only?
      ▶ General non-linear constraints?
      ▶ Special structure (network, fixed-point, semi-definite
        programming, least squares, etc.)?
      ▶ Convex?
      ▶ Local or global solution needed?
    ▶ Other features of the problem to take into account:
      ▶ Size, i.e., number of variables and constraints
      ▶ Differentiability (are formulas for derivatives of the appropriate
        order available; are numerical approximation techniques
        applicable, etc.)
      ▶ Function behavior
      ▶ Structure of the problem

  • So, you’ve got yourself an optimization problem: What algorithms/software to use?

    ▶ Review what you learned in this course (and expand on what
      you learned) to see what type of algorithm may be able to
      handle your problem based on the features identified above
    ▶ Explore what commercial and/or free software might be
      capable of solving your problem, based on the algorithms it
      implements. Some resources:
      ▶ Matlab’s optimization package user manual (the “deep cuts”)
      ▶ The NEOS Optimization guide and the NEOS server
        http://neos-guide.org
      ▶ “Decision Tree” for Optimization Software
        http://plato.la.asu.edu/guide.html
    ▶ If the above fails:
      ▶ Can your problem be formulated differently?
      ▶ Write your own solver (for this problem type)?
