  • Continuous Optimization Methods

    These slides are prepared with the goal of occasionally providing the instructor rest from writing on the board during lectures. As such, they contain only part of the information that was discussed in the lecture, and do not provide a summary or outline of the lectures or assigned readings.

    Winter 2019

    IOE 511/Math 562: ContOpt Page 1

  • IOE 511/Math 562: Continuous Optimization Methods

    Instructor: Marina A. Epelman [email protected]
    GSI: Geunyeong Byeon [email protected]
    Office hours: as announced on Canvas

    Online:
    I Canvas: Course materials, including announcements, lecture notes, reading assignments, grades, etc.; submission of programming assignments. Make sure to follow links to Schedule and LaTeX and Matlab resources on the web.
    I Piazza (linked to Canvas): Q&A regarding course materials, readings, homework solutions (after official solutions have been posted), etc. Please post your questions there, as well as answer questions whenever you can contribute!
    I Gradescope (linked to Canvas): Homework assignments and submissions (except for Matlab code, which will be submitted separately through Canvas).

    Make sure that you can access the Canvas, Piazza, and Gradescope sites for this course.

  • Course Logistics

    Required background: A proof-based calculus or analysis course,linear algebra, familiarity with (or willingness to learn) Matlab.

    I Homeworks (worth a total of 30%) include some computer modeling and programming
    I Midterm (worth a total of 30%), roughly at the midpoint of the course
    I Final exam (worth a total of 40%), during Finals week

    Partial honor code policies: you are allowed, indeed encouraged, to work in groups on conceptualizing the homework problems. However, each student is individually responsible for expressing their answers in their own terms, and for writing their own solutions and code to be submitted. Also, you may not acquire, read, or otherwise utilize answers from solutions handed out in other courses, or in previous terms of this course. You are also not allowed to distribute materials from this course to individuals or repositories. Please read the syllabus for complete course policies.

  • Informal (and tentative) course outline

    I Introduction to optimization
    I Optimality conditions for unconstrained problems
    I Algorithms for unconstrained problems (steepest descent, Newton's, etc.) and analysis of their convergence
    I Optimality conditions and constraint qualifications for constrained problems
    I Convexity and its role in optimization
    I Algorithms for constrained problems (SQP, barrier and penalty methods, etc.)
    I If time remains: "large-scale" optimization problems; conic optimization problems, their applications, and methods for their solution.

  • Forms of mathematical programming problems

    Unconstrained Problem:

    (NLP)  min_x  f(x)
           s.t.   x ∈ X,

    where x = (x1, . . . , xn)ᵀ ∈ Rn, f(x) : Rn → R, and X is an open set (often, but not always, X = Rn).

    Constrained Problem:

    (NLP)  min_x  f(x)
           s.t.   gi(x) ≤ 0, i = 1, . . . , m
                  hi(x) = 0, i = 1, . . . , l
                  x ∈ X,

    where g1(x), . . . , gm(x), h1(x), . . . , hl(x) : Rn → R.

  • Constraints, objectives, optimal solutions

    (NLP)  min_x  f(x)
           s.t.   gi(x) ≤ 0, i = 1, . . . , m
                  hi(x) = 0, i = 1, . . . , l
                  x ∈ X,

    I f(x) is the objective function
    I "hi(x) = 0" are equality constraints
    I "gi(x) ≤ 0" are inequality constraints
    I x ∈ Rn is feasible if it satisfies all the constraints
    I The set of all feasible points forms the feasible region
    I The goal: find a feasible point x̄ such that f(x̄) ≤ f(x) for any other feasible point x

  • Examples of NLP formulation

    I Markowitz portfolio optimization model

    I Least squares approximation

    I Maximum likelihood estimation


  • Markowitz portfolio optimization model: problem description and data

    I You have an opportunity to invest in n assets
    I The future return of asset i is a random variable Ri with expectation µi = E[Ri], i = 1, . . . , n
    I Covariances of returns are Qij = Cov(Ri, Rj), i, j = 1, . . . , n
    I At least one of these assets is a risk-free asset

  • Markowitz portfolio optimization model: a portfolio

    I Let xi, i = 1, . . . , n, be the fractions of your wealth allocated to each of the assets
    I x ≥ 0 and ∑_{i=1}^n xi = 1
    I The return of the resulting portfolio is a random variable ∑_{i=1}^n xiRi
    I Expectation: ∑_{i=1}^n xiE[Ri] = ∑_{i=1}^n xiµi
    I Variance: ∑_{i=1}^n ∑_{j=1}^n xixjCov(Ri, Rj) = ∑_{i=1}^n ∑_{j=1}^n xixjQij.

  • Markowitz portfolio optimization model: portfolio optimization

    I A portfolio is usually chosen to optimize some measure of a tradeoff between the expected return and the risk, such as

    max   ∑_{i=1}^n xiµi − α ∑_{i=1}^n ∑_{j=1}^n xixjQij
    s.t.  ∑_{i=1}^n xi = 1
          x ≥ 0,

    I Here α > 0 is a (fixed) parameter reflecting the investor's preferences in the above tradeoff.
    I The above problem is usually solved for a variety of values of α, generating the efficient frontier.
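The mean-variance objective above is easy to evaluate directly. Below is a minimal Python sketch (an editor's illustration, not from the slides; the three-asset data µ and Q, the portfolio x, and α = 2 are made up) computing ∑ xiµi − α ∑∑ xixjQij for one feasible portfolio.

```python
# Hypothetical data for n = 3 assets: expected returns mu and covariance matrix Q.
mu = [0.10, 0.07, 0.03]
Q = [[0.040, 0.010, 0.000],
     [0.010, 0.020, 0.000],
     [0.000, 0.000, 0.001]]  # third asset is nearly risk-free

def portfolio_objective(x, alpha):
    """Expected return minus alpha times variance:
    sum_i x_i mu_i - alpha * sum_i sum_j x_i x_j Q_ij."""
    n = len(x)
    exp_ret = sum(x[i] * mu[i] for i in range(n))
    var = sum(x[i] * x[j] * Q[i][j] for i in range(n) for j in range(n))
    return exp_ret - alpha * var

x = [0.5, 0.3, 0.2]                  # a feasible portfolio: x >= 0, sum x_i = 1
assert abs(sum(x) - 1.0) < 1e-12
print(portfolio_objective(x, alpha=2.0))
```

Sweeping α over a range of values and re-solving would trace out the efficient frontier mentioned above.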

  • Parameter estimation: problem description and data

    I Setup: the output (performance) of a system, y, depends on a number of input parameters (settings), a ∈ Rn, but we don't know exactly how
    I Linear measurement model: assume that y ∈ R can be expressed as a linear function

    y ≈ aᵀx

    of a for some x ∈ Rn
    I Goal: find the value of x which provides the "best fit" for the available set of input-output pairs (ai, yi), i = 1, . . . , m

  • Parameter estimation. Optimization problem: least squares

    I One measure of "fit" is the sum of squared errors between estimated and measured outputs
    I To find the best fit, solve the optimization problem:

    min_{x∈Rn}  ∑_{i=1}^m (vi)²
    s.t.  vi = yi − aiᵀx, i = 1, . . . , m

    I Same as

    min_{x∈Rn} ∑_{i=1}^m (yi − aiᵀx)² = min_{x∈Rn} ‖Ax − y‖₂²

    Here, A is the matrix with rows aiᵀ
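As a sketch of the problem above (an assumed example, not from the slides): with rows aiᵀ = (1, ti), the minimizer of ‖Ax − y‖₂² satisfies the 2×2 normal equations AᵀAx = Aᵀy, solved here by Cramer's rule on made-up data.

```python
# Fit y_i ≈ x1 + x2*t_i by minimizing sum_i (y_i - a_i^T x)^2 with a_i = (1, t_i).
t = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.1, 7.0]           # made-up measurements of roughly y = 1 + 2t

m = len(t)
# Entries of A^T A and A^T y for rows a_i^T = (1, t_i)
s1, st, stt = m, sum(t), sum(ti * ti for ti in t)
sy, sty = sum(y), sum(ti * yi for ti, yi in zip(t, y))

det = s1 * stt - st * st           # determinant of A^T A
x1 = (stt * sy - st * sty) / det   # intercept
x2 = (s1 * sty - st * sy) / det    # slope
print(x1, x2)
```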

  • Maximum likelihood estimation: one observation

    I Setup: observing a sample of realizations of a random variable Y, and trying to find out its probability distribution
    I Parametric family: assume that the distribution belongs to a family of probability distributions px(·) on R, parameterized by a vector x ∈ Rn
    I Given one observation y ∈ R, px(y) as a function of x is called the likelihood function
    I It is more convenient to work with the log-likelihood function:

    l(x) = log px(y)

    I To estimate the value of x based on one sample y, take

    x̂ = argmax_x px(y) = argmax_x l(x),

    which is the maximum likelihood (ML) estimate
    I If there is prior information available about x, we can add the constraint x ∈ C ⊆ Rn

  • Maximum likelihood estimation: multiple observations

    I Recall: we assume that the distribution of Y belongs to a parametric family of probability distributions px(·) on R, parameterized by a vector x ∈ Rn
    I For m iid samples (y1, . . . , ym), the log-likelihood function is

    l(x) = log( ∏_{i=1}^m px(yi) ) = ∑_{i=1}^m log px(yi)

    I The ML estimation is thus an optimization problem:

    max l(x) subject to x ∈ C

  • Maximum likelihood estimation. Example: linear measurement model

    I Return to the linear measurement model:
      I Previously assumed that y ≈ aᵀx
      I More specific assumption: y = aᵀx + v, where the error v is iid random noise with density p(v)
    I m measurement/output pairs (ai, yi) give us m samples of v:

    vi = yi − aiᵀx

    I The likelihood function is

    px(y) = ∏_{i=1}^m p(yi − aiᵀx),

    and the log-likelihood function is

    l(x) = ∑_{i=1}^m log p(yi − aiᵀx)

  • Maximum likelihood estimation. Example: linear measurement model, Gaussian noise

    I Suppose the noise is Gaussian with mean 0 and (unknown) standard deviation σ.
    I Density:

    p(z) = (1/√(2πσ²)) e^{−z²/(2σ²)}

    I Log-likelihood function:

    l(x) = −(m/2) log(2πσ²) − (1/(2σ²)) ‖Ax − y‖₂²

    I Therefore, the ML estimate of x is

    argmin_x ‖Ax − y‖₂²,

    the solution of the least squares approximation problem
    I This is the idea behind linear regression!

  • Calculus bootcamp

    A quick overview of definitions and results from (multivariate) calculus and analysis we will use throughout the course. Additional references:

    I Griva, Nash, and Sofer, "Linear and Nonlinear Optimization," Appendices A and B
    I Bertsekas, "Nonlinear Programming," Appendix A
    I Bazaraa, Sherali, and Shetty, "Nonlinear Programming: Theory and Algorithms," Appendix A

  • Vectors and Norms

    I Rn: set of all n-dimensional real vectors (x1, . . . , xn)ᵀ ("ᵀ" — transpose)
    I Definition: a norm ‖·‖ on Rn is a mapping of Rn into R such that:
      1. ‖x‖ ≥ 0 ∀x ∈ Rn; ‖x‖ = 0 ⇔ x = 0.
      2. ‖cx‖ = |c| · ‖x‖ ∀c ∈ R, x ∈ Rn.
      3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ ∀x, y ∈ Rn.
    I Euclidean norm ‖·‖₂: ‖x‖₂ = √(xᵀx) = (∑_{i=1}^n xi²)^{1/2}.
    I Cauchy-Schwarz inequality for the Euclidean norm: |xᵀy| ≤ ‖x‖₂ · ‖y‖₂, with equality ⇔ x = αy.
    I All norms on Rn are equivalent, i.e., for any two norms ‖·‖1 and ‖·‖2 (here denoting arbitrary norms, not necessarily the l1 and l2 norms) ∃α1, α2 > 0 s.t. α1‖x‖1 ≤ ‖x‖2 ≤ α2‖x‖1 ∀x ∈ Rn.
    I Ball of radius ε > 0 centered at x: B(x, ε) = {y : ‖y − x‖ ≤ ε} (sometimes — strict inequality).
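The norm axioms and the Cauchy-Schwarz inequality can be spot-checked numerically; this small Python sketch (an editor's illustration, not part of the slides) tests them for the Euclidean norm on random vectors.

```python
import math
import random

def norm2(x):
    """Euclidean norm ||x||_2 = sqrt(x^T x)."""
    return math.sqrt(sum(xi * xi for xi in x))

random.seed(0)
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    c = random.uniform(-3, 3)
    # Axiom 2: homogeneity ||c x|| = |c| ||x||
    assert abs(norm2([c * xi for xi in x]) - abs(c) * norm2(x)) < 1e-12
    # Axiom 3: triangle inequality ||x + y|| <= ||x|| + ||y||
    assert norm2([xi + yi for xi, yi in zip(x, y)]) <= norm2(x) + norm2(y) + 1e-12
    # Cauchy-Schwarz: |x^T y| <= ||x||_2 ||y||_2
    assert abs(sum(xi * yi for xi, yi in zip(x, y))) <= norm2(x) * norm2(y) + 1e-12
print("all checks passed")
```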

  • Sequences and Limits in R.

    I Notation: a sequence {xk : k = 1, 2, . . .} ⊂ R, {xk}k for short.
    I Definition: {xk}k ⊂ R converges to x ∈ R (written xk → x, or lim_{k→∞} xk = x) if

    ∀ε > 0 ∃Kε : |xk − x| ≤ ε (equiv., xk ∈ B(x, ε)) ∀k ≥ Kε.

    I Definition: xk → ∞ (resp. −∞) if

    ∀A ∃KA : xk ≥ A (resp. xk ≤ A) ∀k ≥ KA.

  • Other properties of sequences in R

    I Definition: {xk}k is bounded above (below): ∃A : xk ≤ A (xk ≥ A) ∀k.
    I Definition: {xk}k is bounded: {|xk|}k is bounded; equivalently, {xk}k is bounded above and below.
    I Definition: {xk}k is nonincreasing (nondecreasing): xk+1 ≤ xk (xk+1 ≥ xk) ∀k; monotone: nondecreasing or nonincreasing.
    I Proposition: Every monotone sequence in R has a limit (possibly infinite). If it is also bounded, the limit is finite.

  • Sequences in Rn

    I Notation: a sequence {xk : k = 1, 2, . . .} ⊂ Rn, {xk}k for short.
    I Definitions: {xk}k ⊂ Rn converges to x ∈ Rn (is bounded) if {xki}k, i.e., the sequence of ith coordinates of the xk's, converges to xi (is bounded) ∀i.
    I Propositions:
      I xk → x ⇔ ‖xk − x‖ → 0
      I {xk}k is bounded ⇔ {‖xk‖}k is bounded
    I Note: ‖xk‖ → ‖x‖ does not imply that xk → x!! (Unless x = 0.)

  • Limit Points of Sequences vs Limits

    I Definition: x is a limit point of {xk}k if there exists an infinite subsequence of {xk}k that converges to x.
    I To see the difference between limits of a sequence and limit points of a sequence, consider the sequence

    {0, 1/2, −1/2, 2/3, −2/3, 3/4, −3/4, . . .} ⊂ R

    I Proposition: let {xk}k ⊂ Rn
      I If {xk}k is bounded, {xk}k converges ⇔ it has a unique limit point
      I If {xk}k is bounded, it has at least one limit point
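A quick way to see the two limit points of the sequence above is to generate its terms; this Python sketch (illustration only) builds the sequence with exact fractions. The positive subsequence j/(j+1) tends to 1 and the negative subsequence to −1, so the sequence has two limit points and therefore no limit.

```python
from fractions import Fraction

def term(k):
    """k-th term (k >= 0) of the sequence 0, 1/2, -1/2, 2/3, -2/3, 3/4, -3/4, ..."""
    if k == 0:
        return Fraction(0)
    j = (k + 1) // 2                 # 1, 1, 2, 2, 3, 3, ...
    sign = 1 if k % 2 == 1 else -1   # odd indices positive, even indices negative
    return sign * Fraction(j, j + 1)

seq = [term(k) for k in range(9)]
# Odd-indexed terms approach +1, even-indexed terms approach -1:
# two limit points, so the sequence does not converge.
print(seq)
```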

  • Limit Points of Sets

    I Definition: x is a limit point of a set A ⊆ Rn if there exists an infinite sequence {xk}k ⊂ A with xk ≠ x ∀k that converges to x.
    I Examples: what are the limit points of the following set: {y : ‖y − x‖ ≤ ε}?
      What about {y : ‖y − x‖ < ε}?

  • Closed and Open Sets

    I Definition: A ⊆ Rn is closed if it contains all its limit points
    I Definition: A ⊆ Rn is open if its complement, Rn\A, is closed
    I Examples: balls centered at x:
      {y : ‖y − x‖ ≤ ε} — closed
      {y : ‖y − x‖ < ε} — open
    I Some sets are neither: (0, 1].
    I Proposition:
      1. Union of finitely many closed sets is closed.
      2. Intersection of closed sets is closed.
      3. Union of open sets is open.
      4. Intersection of finitely many open sets is open.
      5. A set is open ⇔ all of its elements are interior points.
      6. Every subspace of Rn is closed.
    I Definition: a point x ∈ A is interior if there is a neighborhood of x (i.e., B(x, ε) for some ε > 0) contained in A

  • Basic notions in optimization

    I Definitions

    I Types of optima

    I Existence of optima


  • Types of optimization problems

    Unconstrained Optimization Problem:

    (UP)  min_x  f(x)
          s.t.   x ∈ X,

    where x = (x1, . . . , xn)ᵀ ∈ Rn, f(x) : Rn → R, and X is an open set (usually X = Rn).

    Constrained Optimization Problem:

    (NLP)  min_x  f(x)
           s.t.   gi(x) ≤ 0, i = 1, . . . , m
                  hi(x) = 0, i = 1, . . . , l
                  x ∈ X,

    where g1(x), . . . , gm(x), h1(x), . . . , hl(x) : Rn → R.

  • Constraints and feasible region

    I A point x is feasible for (UP)/(NLP) if it satisfies all constraints (including x ∈ X)
    I The set F of all feasible points forms the feasible region, or feasible set:

    F = {x ∈ Rn : g1(x) ≤ 0, . . . , gm(x) ≤ 0, h1(x) = 0, . . . , hl(x) = 0, x ∈ X}

    I At a feasible point x̄, an inequality constraint gi(x) ≤ 0 is said to be binding, or active, if gi(x̄) = 0, and nonbinding, or inactive, if gi(x̄) < 0
    I All equality constraints are considered active at any feasible point.

  • Types of optimal solutions

    (P)  min_x or max_x  f(x)
         s.t.  x ∈ F

    I Recall: B(x̄, ε) := {x : ‖x − x̄‖ ≤ ε}
    I Definitions:
      1.5.2 x ∈ F is a global minimum of (P) if f(x) ≤ f(y) for all y ∈ F.
      1.5.4 x ∈ F is a strict global minimum of (P) if f(x) < f(y) for all y ∈ F, y ≠ x.
      1.5.1 x ∈ F is a local minimum of (P) if there exists ε > 0 such that f(x) ≤ f(y) for all y ∈ B(x, ε) ∩ F.
      1.5.3 x ∈ F is a strict local minimum of (P) if there exists ε > 0 such that f(x) < f(y) for all y ∈ B(x, ε) ∩ F, y ≠ x.
    I Local and global maxima (i.e., solutions of the problem max_{x∈F} f(x)) are defined analogously (1.5.5–1.5.8).

  • Infimum and Supremum

    I Let A ⊂ R.
      Supremum of A (sup A): smallest y such that x ≤ y ∀x ∈ A.
      Infimum of A (inf A): largest y such that x ≥ y ∀x ∈ A.
    I Not the same as max and min! Consider, for example, (0, 1).

  • Functions and Continuity

    I A ⊆ Rm, f : A → R — a function.
    I Definition: f is continuous at x̄ if

    ∀ε > 0 ∃δ > 0 : x ∈ A, ‖x − x̄‖ < δ ⇒ |f(x) − f(x̄)| < ε.

    I The above is the standard way to write the definition; I would've preferred "∀ε > 0 ∃δ_{x̄,ε} > 0 . . ."
    I Proposition: f is continuous at x̄ ⇔ for any {xn} ⊂ A with xn → x̄ we have f(xn) → f(x̄). (In other words, lim_{n→∞} f(xn) = f(lim_{n→∞} xn).)
    I Proposition:
      I Sums, products, and inverses of continuous functions are continuous (in the last case, provided the function is never zero on its domain).
      I The composition of two continuous functions is continuous.
      I Any vector norm is a continuous function.

  • Existence of solutions of optimization problems

    Thm. 1.6.1: Weierstrass' Theorem for sequences
    Let {xk}k, k → ∞, be an infinite sequence of points in the compact (i.e., closed and bounded) set F. Then some infinite subsequence of points xkj converges to a point contained in F.

    Thm. 1.6.2: Weierstrass' Theorem for functions
    Let f(x) be a continuous real-valued function on the compact nonempty set F ⊂ Rn. Then F contains a point that minimizes (maximizes) f on the set F.

  • Optimality conditions: unconstrained problems

    I Necessary conditions: identify candidates

    I Sufficient conditions: guarantee optimality

    I Convexity and its role in optimization


  • Local minima and descent directions

    (P)  min_x  f(x)
         s.t.   x ∈ X,

    where X = Rn or X ⊂ Rn is an open set

    Definition 2.1.1
    The direction d̄ is called a descent direction of f(·) at x̄ ∈ X if f(x̄ + εd̄) < f(x̄) for all ε > 0 sufficiently small.

    What is the relationship between local minima and descent directions?

  • Differentiable functions and gradients (Appendix B.2.1)

    Let f : X → R, where X ⊂ Rn is open.
    I The directional derivative of f at x̄ in the direction d is

    lim_{λ→0} (f(x̄ + λd) − f(x̄))/λ = ∇f(x̄)ᵀd

    I f is differentiable at x̄ ∈ X if ∃∇f(x̄) ∈ Rn such that

    α(x̄; x − x̄) = (f(x) − f(x̄) − ∇f(x̄)ᵀ(x − x̄)) / ‖x − x̄‖ → 0 as x → x̄

    I ∇f(x̄) is the gradient of f at x̄, and satisfies

    ∇f(x̄) = (∂f(x̄)/∂x1, . . . , ∂f(x̄)/∂xn)ᵀ,

    where ∂f(x̄)/∂xi is the directional derivative in direction ei
    I f is differentiable on X if f is differentiable at every x̄ ∈ X.

  • First order necessary optimality condition

    (P)  min_{x∈X}  f(x)

    where x ∈ Rn, f : Rn → R, and X is an open set.

    Theorem 2.1.1
    Suppose that f is differentiable at x̄. If ∃d ∈ Rn such that ∇f(x̄)ᵀd < 0, then for all λ > 0 sufficiently small, f(x̄ + λd) < f(x̄) (i.e., d is a descent direction; Def. 2.1.1).

    Recall: directional derivative lim_{λ→0} (f(x̄ + λd) − f(x̄))/λ = ∇f(x̄)ᵀd

    Necessary condition for local optimality: "if x̄ is a local minimum of (P), then x̄ must satisfy . . ."

    Corollary 2.1.1: First order necessary optimality condition
    Suppose f is differentiable at x̄. If x̄ is a local minimum, then ∇f(x̄) = 0 (such a point is called a stationary point).

  • Twice differentiable functions and Hessians (Appendix B.2.1)

    I Definition: the function f is twice differentiable at x̄ ∈ X if there exist a vector ∇f(x̄) and an n × n symmetric matrix H(x̄) (the Hessian of f at x̄) such that for each x ∈ X

    α(x̄; x − x̄) = (f(x) − f(x̄) − ∇f(x̄)ᵀ(x − x̄) − ½(x − x̄)ᵀH(x̄)(x − x̄)) / ‖x − x̄‖² → 0 as x → x̄.

    Note: this is a different function α than in the definition of ∇f
    I f is twice differentiable on X if f is twice differentiable at every x̄ ∈ X.
    I The Hessian is the matrix of second partial derivatives:

    [H(x̄)]ij = ∂²f(x̄)/∂xi∂xj,

    and for functions with continuous second derivatives it will always be symmetric:

    ∂²f(x̄)/∂xi∂xj = ∂²f(x̄)/∂xj∂xi

  • Positive semidefinite symmetric matrices, etc. (Appendix A)

    Definition
    An n × n matrix Q is called symmetric if Qij = Qji ∀i, j.
    A symmetric n × n matrix Q is called
    I positive definite if xᵀQx > 0 ∀x ∈ Rn, x ≠ 0 (Q ≻ 0, SPD)
    I positive semidefinite if xᵀQx ≥ 0 ∀x ∈ Rn (Q ⪰ 0, SPSD)
    I negative definite if xᵀQx < 0 ∀x ∈ Rn, x ≠ 0 (Q ≺ 0)
    I negative semidefinite if xᵀQx ≤ 0 ∀x ∈ Rn (Q ⪯ 0)
    I indefinite if ∃x, y ∈ Rn : xᵀQx > 0, yᵀQy < 0

  • Eigenvalues and decomposition of matrices

    I A number γ is an eigenvalue of Q if there exists v ≠ 0 such that Qv = γv; v is an eigenvector corresponding to γ
      I γ also satisfies det(Q − γI) = 0
    I If Q ∈ Rn×n is a symmetric matrix, then. . .
      I Prop. A.1.1: all of its eigenvalues are real numbers
      I Prop. A.1.2: its eigenvectors corresponding to different eigenvalues are orthogonal
      I Prop. A.1.3: Q has n (distinct) eigenvectors that form an orthonormal basis for Rn:

      v1, . . . , vn :  viᵀvj = 0 if i ≠ j,  viᵀvj = 1 if i = j

  • Eigenvalues and definiteness of matrices

    In these definitions and results, Q ∈ Rn×n is a symmetric matrix
    I Prop. A.1.5: If v1, . . . , vn are orthonormal eigenvectors corresponding to γ1, . . . , γn, then Q = RDRᵀ, where
      I R = [v1, . . . , vn] (note: Rᵀ = R⁻¹)
      I D = diag(γ1, . . . , γn)
    I Prop. A.1.4, mod.: Q is SPSD (SPD) if and only if all of its eigenvalues are nonnegative (positive)
    I Prop. A.1.6: If Q is SPSD, then Q = MᵀM for some matrix M
    I Prop. A.1.7: If Q is SPSD, then xᵀQx = 0 implies Qx = 0
    I Prop. A.1.8: Suppose Q is symmetric; then Q ⪰ 0 and nonsingular if and only if Q ≻ 0
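Prop. A.1.5 can be verified numerically for a small matrix. The sketch below (an editor's illustration; it reuses the 2×2 SPD matrix Q = [4 −2; −2 2] that appears later in the steepest-descent example) computes the eigenvalues of a symmetric 2×2 matrix in closed form and rebuilds Q = RDRᵀ as γ1·v1v1ᵀ + γ2·v2v2ᵀ; both eigenvalues come out positive, consistent with Q being SPD (Prop. A.1.4).

```python
import math

# Symmetric 2x2 matrix Q = [[a, b], [b, c]] with b != 0.
a, b, c = 4.0, -2.0, 2.0

mean = (a + c) / 2.0
rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
g1, g2 = mean + rad, mean - rad    # eigenvalues; both > 0 here, so Q is SPD

def unit_eigvec(g):
    # (Q - g I) v = 0 is solved by v proportional to (b, g - a) when b != 0
    vx, vy = b, g - a
    n = math.hypot(vx, vy)
    return (vx / n, vy / n)

v1, v2 = unit_eigvec(g1), unit_eigvec(g2)
# Reconstruct Q = g1 * v1 v1^T + g2 * v2 v2^T (the same as R D R^T)
Q_rebuilt = [[g1 * v1[i] * v1[j] + g2 * v2[i] * v2[j] for j in range(2)]
             for i in range(2)]
print(g1, g2)
print(Q_rebuilt)
```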

  • Second order conditions

    Theorem 2.1.2: Second order necessary conditions
    Suppose that f is twice continuously differentiable at x̄ ∈ X. If x̄ is a local minimum, then ∇f(x̄) = 0 and H(x̄) = ∇²f(x̄) is positive semidefinite.

    Necessary conditions only allow us to come up with a list of candidate points for minima. Sufficient condition for local optimality: "if x̄ satisfies . . . , then x̄ is a local minimum of (P)."

    Theorem 2.1.3: Second order sufficient conditions
    Suppose that f is twice differentiable at x̄. If ∇f(x̄) = 0 and H(x̄) ≻ 0 (positive definite), then x̄ is a (strict) local minimum.

    I If ∇f(x̄) = 0 and H(x̄) ≺ 0, then x̄ is a strict local maximum
    I If ∇f(x̄) = 0 and H(x̄) ⪰ 0 but not ≻ 0, we cannot be sure if x̄ is a local minimum

  • Convexity: Definitions (Appendix B.2, selections)

    I Let x1, x2 ∈ Rn. Points of the form λx1 + (1 − λ)x2 for λ ∈ [0, 1] are called convex combinations of x1 and x2
    I More generally, a point y is a convex combination of points x1, . . . , xk if y = ∑_{i=1}^k λi xi, where λi ≥ 0 ∀i and ∑_{i=1}^k λi = 1
    I A set S ⊂ Rn is called convex if ∀x1, x2 ∈ S and ∀λ ∈ [0, 1], λx1 + (1 − λ)x2 ∈ S
    I A function f : S → R, where S is a nonempty convex set, is a convex function if

    f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)  ∀x1, x2 ∈ S, ∀λ ∈ [0, 1]

    I A function f as above is called strictly convex if the inequality above is strict for all x1 ≠ x2 and λ ∈ (0, 1)
    I A function f : S → R is called concave (strictly concave) if (−f) is convex (strictly convex)

  • Convexity and minimization

    (CP)  min_x  f(x)
          s.t.   x ∈ F

    Theorem 2.1.4
    Suppose F is a nonempty convex set, f : F → R is a convex function, and x̄ is a local minimum of (CP). Then x̄ is a global minimum of f over F.

    Note:
    I A problem of minimizing a convex function over a convex feasible region (such as we considered in the theorem) is a convex optimization problem
    I If f is strictly convex, a local minimum is the unique global minimum
    I If f is (strictly) concave, a local maximum is a (unique) global maximum

  • How can we find out if a function is convex?

    To determine whether a function is convex,

    I Check the definition, or
    I For differentiable or twice differentiable functions, check the corresponding N&S condition

    Theorem B.2.1: Gradient inequality, a.k.a. N&S condition for convexity of differentiable functions
    Suppose X ⊆ Rn is a nonempty open convex set, and f : X → R is differentiable. Then f is convex iff ("if and only if") it satisfies the gradient inequality:

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)  ∀x, y ∈ X.

    In one dimension, the gradient inequality has the form

    f(y) ≥ f(x) + f′(x)(y − x)  ∀x, y ∈ X.
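The gradient inequality is easy to spot-check in one dimension; this sketch (an editor's illustration, not from the slides) verifies f(y) ≥ f(x) + f′(x)(y − x) on random point pairs for the convex function f(x) = x², and exhibits a violation for the nonconvex f(x) = x³.

```python
import random

# Convex case: f(x) = x^2, f'(x) = 2x.
f = lambda x: x * x
fprime = lambda x: 2 * x

random.seed(1)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    # Gradient inequality: the function lies above all its tangent lines
    assert f(y) >= f(x) + fprime(x) * (y - x) - 1e-9

# Nonconvex case: g(x) = x^3 violates the inequality, e.g. at x = -1, y = -2
g = lambda x: x ** 3
gprime = lambda x: 3 * x * x
assert g(-2) < g(-1) + gprime(-1) * (-2 - (-1))
print("gradient inequality holds for x^2, fails for x^3")
```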

  • How can we determine if a function is convex?

    Theorem B.2.2: N&S condition for convexity of twice-differentiable functions
    Suppose X is a nonempty open convex set, and f : X → R is twice differentiable. Then f is convex iff the Hessian of f, H(x), is positive semidefinite ∀x ∈ X.

    In one dimension, the Hessian condition has the form f′′(x) ≥ 0 ∀x ∈ X.

    Theorem: Sufficient conditions for strict convexity of twice-differentiable functions
    Suppose X is a nonempty open convex set, and f : X → R is twice differentiable. Then f is strictly convex if the Hessian of f, H(x), is positive definite ∀x ∈ X.

  • Optimality conditions for convex unconstrained programs

    Theorem 2.1.5: N&S global optimality conditions for differentiable unconstrained convex problems
    Suppose f : X → R is convex and differentiable on an open convex set X. Then x̄ ∈ X is a global minimum if and only if ∇f(x̄) = 0.

    For non-differentiable convex functions:
    I Theorem B.3.6, modified: If S is a convex set and f : S → R is convex, then ∀x̄ ∈ int S there is (at least one) subgradient vector, i.e., a vector ξ ∈ Rn with the property:

    f(y) ≥ f(x̄) + ξᵀ(y − x̄)  ∀y ∈ S.

    I Theorem 2.1.6: If f : X → R is convex on an open convex set X, then x̄ ∈ X is a global minimum if and only if 0 ∈ ∂f(x̄)
      I Here, ∂f(x̄) is the subdifferential: the set of all subgradients of f at x̄; ∂f(x̄) = {∇f(x̄)} if f is differentiable at x̄.

  • General optimization algorithms

    (P)  min  f(x)
         s.t.  x ∈ X

    where X ⊂ Rn is an open set and f(x) is differentiable on X

    General directional search optimization algorithm
    Initialization: Initialize at x0 ∈ X; set k ← 0
    Iteration k:
      1. If ∇f(xk) = 0, stop. Otherwise, choose dk — a search direction
      2. Choose αk > 0 — a step size
      3. Set xk+1 ← xk + αkdk ∈ X and k ← k + 1

  • Testing for optimality and terminating

    I The algorithm looks for a point such that ∇f(xk) = 0 (necessary for optimality; sufficient if f is convex)
    I It is unlikely to find a point satisfying this condition exactly
    I Rather, theoretical analysis of the algorithms deals with their limiting behavior, i.e., analyzes the limit points of the infinite sequence of iterates generated by the algorithm
    I In practice, the algorithms are terminated when the above condition is satisfied approximately, e.g., ‖∇f(xk)‖ ≤ ε for some pre-specified ε > 0.

  • Choosing the direction

    I We want f(xk+1) < f(xk) ∀k, so typically we choose dk that is a descent direction of f at xk, that is,

    f(xk + αdk) < f(xk)  ∀α ∈ (0, ᾱ]

    for some ᾱ > 0
    I Any dk such that ∇f(xk)ᵀdk < 0 is a descent direction whenever ∇f(xk) ≠ 0
    I Often, dk = −Dk∇f(xk), where Dk is SPD
      I Steepest descent: Dk = I, k = 0, 1, 2, . . .
      I Newton's method: Dk = H(xk)⁻¹ (provided H(xk) = ∇²f(xk) is positive definite)

  • Choosing the stepsize

    I After dk is fixed, αk ideally would solve the one-dimensional optimization problem

    min_α f(xk + αdk)

    I Usually also impossible to solve exactly (analytically)
    I Instead, αk is computed (via an iterative procedure referred to as line search) either to approximately solve the above optimization problem, or to ensure a "sufficient" decrease in the value of f

  • Line search: one-dimensional optimization

    I Suppose that f(x) is a continuously differentiable function, and ∇f(x̄)ᵀd̄ < 0
    I Let h(α) = f(x̄ + αd̄)
    I Note: h′(α) = ∇f(x̄ + αd̄)ᵀd̄, h′(0) = ∇f(x̄)ᵀd̄ < 0
    I Bisection algorithm for line search
      I Approximately solves min_{α>0} h(α) by searching for α̃ such that h′(α̃) ≈ 0
      I Idea:
        I h′(0) < 0; find α̂ > 0 such that h′(α̂) > 0
        I So, ∃α ∈ (0, α̂) such that h′(α) = 0
        I Check the sign of h′ at the midpoint of (0, α̂); continue the search on the left or right half of the interval
    I Details: FV, Section 2.8 (including Section 2.8.3: what to do if X ≠ Rn)
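The bisection idea above can be sketched in a few lines; the quadratic h used in the example below is a made-up test function, not from the slides.

```python
def bisection_line_search(hprime, alpha_hat, tol=1e-10, max_iter=200):
    """Find alpha with h'(alpha) ~ 0 on (0, alpha_hat),
    assuming h'(0) < 0 < h'(alpha_hat)."""
    lo, hi = 0.0, alpha_hat
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if abs(hprime(mid)) <= tol or hi - lo <= tol:
            return mid
        if hprime(mid) < 0:
            lo = mid        # h still decreasing: search the right half
        else:
            hi = mid        # h already increasing: search the left half
    return (lo + hi) / 2.0

# Example: h(alpha) = (alpha - 2)^2, so h'(alpha) = 2(alpha - 2); minimizer is 2
alpha = bisection_line_search(lambda a: 2 * (a - 2), alpha_hat=5.0)
print(alpha)
```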

  • Line search: Armijo’s rule (backtracking)

    I The Armijo rule is an inexact line search, designed to ensure that there is sufficient descent in the objective function, and that the step size is not too small
    I Define, for a given 0 < γ < 0.5,

    ĥ(α) = h(0) + αγh′(0)

    I Backtracking implementation of the rule:
      Step 0: Set k = 0 and α = 1. Choose γ ∈ (0, 0.5) and β ∈ (0, 1).
      Step k: If h(α) ≤ ĥ(α), choose α as the step size; stop. Otherwise, let α ← βα, k ← k + 1.
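The backtracking implementation above translates directly into code; the test function h and the parameter choices γ = 0.3, β = 0.5 below are made-up illustrations.

```python
def armijo_backtracking(h, hprime0, gamma=0.3, beta=0.5, max_iter=100):
    """Return the first alpha in {1, beta, beta^2, ...} satisfying
    h(alpha) <= h(0) + alpha * gamma * h'(0)."""
    assert 0 < gamma < 0.5 and 0 < beta < 1 and hprime0 < 0
    alpha = 1.0
    for _ in range(max_iter):
        if h(alpha) <= h(0.0) + alpha * gamma * hprime0:
            return alpha          # sufficient decrease achieved
        alpha *= beta             # backtrack
    return alpha

# Example: h(alpha) = (alpha - 0.1)^2 along some descent direction; h'(0) = -0.2
h = lambda a: (a - 0.1) ** 2
alpha = armijo_backtracking(h, hprime0=-0.2)
print(alpha)
```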

  • Illustration of Armijo’s rule

    I Can be adapted to ensure x̄ + αd̄ ∈ X
    I As a result of backtracking, the chosen stepsize is α = β^t, where t ≥ 0 is the smallest nonnegative integer such that h(β^t) ≤ ĥ(β^t).

  • Steepest descent algorithm for unconstrained optimization

    I General directional search optimization algorithm with dk = −∇f(xk)
    I Motivation: −∇f(xk)/‖∇f(xk)‖₂ is the (unit length) direction that minimizes the linear approximation of f at xk

    Steepest Descent Algorithm:
    Initialization: Initialize at x0 ∈ X; set k ← 0
    Iteration k:
      1. If ∇f(xk) = 0, stop. Otherwise, choose dk = −∇f(xk)
      2. Choose stepsize αk > 0 by performing exact or inexact line search
      3. Set xk+1 ← xk + αkdk ∈ X and k ← k + 1

    From the fact that dk = −∇f(xk) is a descent direction and Step 2, it follows that f(xk+1) < f(xk).
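The full loop (steepest descent direction plus Armijo backtracking from the previous slides) can be sketched as follows; the toy objective is an editor's example, not from the slides.

```python
import math

def steepest_descent(f, grad, x0, gamma=0.3, beta=0.5, eps=1e-8, max_iter=10000):
    """Steepest descent with Armijo backtracking:
    d^k = -grad f(x^k), stop when ||grad f(x^k)|| <= eps."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        d = [-gi for gi in g]
        hp0 = sum(gi * di for gi, di in zip(g, d))   # h'(0) = grad f(x)^T d < 0
        alpha, fx = 1.0, f(x)
        while (f([xi + alpha * di for xi, di in zip(x, d)])
               > fx + gamma * alpha * hp0) and alpha > 1e-16:
            alpha *= beta                            # backtrack
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x

# Toy problem: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimized at (1, -3)
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad = lambda x: [2 * (x[0] - 1), 4 * (x[1] + 3)]
print(steepest_descent(f, grad, [0.0, 0.0]))
```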

  • Convergence of Steepest Descent algorithm

    Theorem 2.2.1: Convergence of Steepest Descent with exact line search
    Suppose that f : Rn → R is continuously differentiable on the set S(x0) = {x ∈ Rn : f(x) ≤ f(x0)}, and that S(x0) is a closed and bounded set. Suppose further that the sequence {xk}k is generated by the steepest descent algorithm with step lengths αk chosen by an exact line search. Then every point x̄ that is a limit point* of the sequence {xk}k satisfies ∇f(x̄) = 0.

    *The textbook says "cluster point" — it's the same thing as a limit point.

  • Convergence of Steepest Descent algorithm

    I Definition: the gradient function ∇f(x) is Lipschitz continuous with constant G > 0 on the set S(x0) = {x : f(x) ≤ f(x0)} if

    ‖∇f(x) − ∇f(y)‖ ≤ G‖x − y‖  ∀x, y ∈ S(x0)

    Note: this is stronger than just continuity of ∇f(x).

    Convergence Theorem with backtracking line search, cf. BSS 8.6.3
    Suppose f : Rn → R is such that its gradient is Lipschitz continuous with constant G > 0 on the set S(x0). Suppose the sequence {xk}k is generated by the steepest descent algorithm with stepsizes chosen by backtracking line search with γ ∈ (0, 0.5) and β ∈ (0, 1). Then every limit point x̄ of the sequence {xk}k satisfies ∇f(x̄) = 0.

  • SD on strictly convex quadratic functions

  f(x) = (1/2) xᵀQx + qᵀx, where Q is SPD

Optimal solution:

  x⋆ = −Q⁻¹q,  f(x⋆) = −(1/2) qᵀQ⁻¹q

At x, d = −∇f(x) = −Qx − q, and the next iterate x′ is

  x′ = x + αd,  and  f(x′) = f(x + αd) = f(x) − α dᵀd + (1/2) α² dᵀQd

When stepsize α is determined by exact line search,

  x′ = x + (dᵀd / dᵀQd) d,  and  f(x′) = f(x) − (1/2) (dᵀd)² / (dᵀQd)

  • Example: SD on strictly convex quadratic function

Let

  Q = [[4, −2], [−2, 2]] ≻ 0  and  q = (2, −2)ᵀ;  x⋆ = (0, 1)ᵀ, and f(x⋆) = −1

If x0 = (0, 0),

  x1 = (−0.4, 0.4), x2 = (0, 0.8), etc.,

and

  f(x0) − f(x⋆) = 1, f(x1) − f(x⋆) = 0.2, f(x2) − f(x⋆) = 0.04, etc.,

and so f(xk) − f(x⋆) = 0.2^k

Interpretation: the difference between the objective value at the current iterate and f(x⋆) is reduced by a factor of 5 in every iteration of the algorithm.
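The run above is easy to reproduce numerically. The following sketch (variable names are illustrative) applies steepest descent with the exact-line-search stepsize α = dᵀd / dᵀQd to this Q and q and records the optimality gaps:

```python
import numpy as np

Q = np.array([[4.0, -2.0], [-2.0, 2.0]])
q = np.array([2.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x + q @ x
x_star = np.linalg.solve(Q, -q)          # x* = -Q^{-1} q = (0, 1)

x = np.zeros(2)
gaps = [f(x) - f(x_star)]
for _ in range(5):
    d = -(Q @ x + q)                     # steepest descent direction
    alpha = (d @ d) / (d @ Q @ d)        # exact line search stepsize
    x = x + alpha * d
    gaps.append(f(x) - f(x_star))

ratios = [gaps[k + 1] / gaps[k] for k in range(5)]
```

For this instance every ratio equals 0.2, matching the slide: the gap shrinks by a factor of 5 per iteration.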

  • Example: SD on strictly convex quadratic function

    Note: vertical axis uses log scale


  • SD on strictly convex quadratic functions

Recall: at iterate x, with stepsize found by exact line search, d = −Qx − q, and

  x′ = x + (dᵀd / dᵀQd) d,  and  f(x′) = f(x) − (1/2) (dᵀd)² / (dᵀQd)

Thus,

  [f(x′) − f(x⋆)] / [f(x) − f(x⋆)] = [f(x) − f(x⋆) − (1/2)(dᵀd)²/(dᵀQd)] / [f(x) − f(x⋆)] = 1 − 1/β,

where β = (dᵀQd)(dᵀQ⁻¹d) / (dᵀd)² (using the formulae for d and f(x⋆))

Kantorovich Inequality (Prop. 2.2.1)
Let A and a be the largest and the smallest eigenvalues of Q ≻ 0, respectively. Then, when d ≠ 0,

  β = (dᵀQd)(dᵀQ⁻¹d) / (dᵀd)² ≤ (A + a)² / (4Aa).
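The Kantorovich bound can be spot-checked numerically. This sketch (assuming a randomly generated SPD matrix; all names are illustrative) evaluates β over many random directions and compares the largest observed value against (A + a)²/(4Aa). Note that β ≥ 1 always, by the Cauchy–Schwarz inequality applied to Q^{1/2}d and Q^{−1/2}d.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 4 * np.eye(4)              # random SPD matrix
eig = np.linalg.eigvalsh(Q)
a, A = eig[0], eig[-1]                   # smallest and largest eigenvalues

worst = 0.0
for _ in range(1000):
    d = rng.standard_normal(4)
    beta = (d @ Q @ d) * (d @ np.linalg.solve(Q, d)) / (d @ d) ** 2
    worst = max(worst, beta)

bound = (A + a) ** 2 / (4 * A * a)       # Kantorovich upper bound
```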

  • Analysis of SD on a strictly convex quadratic function

  [f(x′) − f(x⋆)] / [f(x) − f(x⋆)] = 1 − 1/β ≤ 1 − 4Aa/(A + a)² = (A − a)²/(A + a)² = ((A/a − 1)/(A/a + 1))²

A/a ≥ 1 (by definition); it is called the condition number of Q

      A      a    Upper bound on 1 − 1/β    Upper bound on #Iter to reduce optimality gap by a factor of 10
    1.1    1.0    0.0023                      1
    3.0    1.0    0.25                        2
   10.0    1.0    0.67                        6
  100.0    1.0    0.96                       58
  200.0    1.0    0.98                      116
  400.0    1.0    0.99                      231
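The table's entries follow directly from the bound: the per-iteration factor is ((κ − 1)/(κ + 1))² with κ = A/a, and k iterations reduce the gap by at least that factor to the k-th power. A small sketch (function names are made up for the example):

```python
import math

def sd_rate_bound(A, a):
    """Worst-case per-iteration gap reduction factor for steepest descent
    with exact line search on a quadratic whose extreme Hessian
    eigenvalues are A >= a > 0."""
    kappa = A / a
    return ((kappa - 1) / (kappa + 1)) ** 2

def iters_to_reduce_gap(A, a, factor=10.0):
    """Smallest k with sd_rate_bound(A, a)**k <= 1/factor."""
    return math.ceil(math.log(1.0 / factor) / math.log(sd_rate_bound(A, a)))
```

For example, `iters_to_reduce_gap(100.0, 1.0)` reproduces the 58 in the table.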

  • Iterates of SD; A/a close to 1

    Pictured: level sets of f(x) and iterates of the SD algorithm


  • Iterates of SD; A/a >> 1

    Pictured: level sets of f(x) and iterates of the SD algorithm


  • Example 4 from FV: non-quadratic function

  f(x) = x1 − 0.6 x2 + 4 x3 + 0.25 x4 − ∑_{i=1}^4 log(xi) − log(5 − ∑_{i=1}^4 xi)

Asymptotically, f(x) − f(x⋆) is reduced by roughly a constant factor at each iteration. This behavior is described as a linear rate of convergence (see Appendix A.2).

  • Comments

▶ This type of convergence rate analysis is a worst-case analysis.
▶ Similarly, ‖xk − x⋆‖ will show a linear convergence rate (in the worst case).
▶ What about non-quadratic functions? If ∇f(x⋆) = 0 and H(x⋆) ≻ 0, f will behave as a near-quadratic function in some neighborhood of x⋆. The analysis of the non-quadratic case gets more involved; fortunately, the key intuition is obtained by analyzing the quadratic case.
▶ The worst-case rate of convergence for Steepest Descent with backtracking line search is also linear, with a constant that similarly depends on the condition number of the Hessian.
▶ The worst-case bound on the rate of convergence is attained in practice quite often, which is unfortunate.

  • Termination criteria

▶ ∇f(xk) = 0? Not verifiable in finite time with imprecise computation...
▶ ‖∇f(xk)‖ ≤ ε? Depends on the scaling of f(x)
▶ ‖∇f(xk)‖ ≤ ε|f(xk)|? What if f(x⋆) = 0?
▶ ‖∇f(xk)‖ ≤ ε(1 + |f(xk)|)

  • Newton’s method for unconstrained minimization

▶ General optimization algorithm with dk = −∇²f(xk)⁻¹∇f(xk) (Newton direction, or Newton step) and, in the “pure” case, αk = 1 (if ∇²f(xk) ≻ 0)
▶ Motivation: iterative method for solving systems of equations, applied to ∇f(x) = 0
▶ Motivation: if the current iterate is x̄, the next iterate solves

  min_x  f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (1/2)(x − x̄)ᵀH(x̄)(x − x̄)

“Pure” Newton’s Method:
Initialization: Initialize at x0 ∈ X; set k ← 0
Iteration k:
  1. dk = −∇²f(xk)⁻¹∇f(xk). If dk = 0, then stop.
  2. Choose stepsize αk = 1.
  3. Set xk+1 ← xk + αk dk ∈ X and k ← k + 1

  • Newton’s method for solving equations

▶ Let g : Rn → Rn. Goal: solve the system of equations g(x) = 0
▶ Starting at a point x0, approximate the function g by

  g(x0 + d) ≈ g(x0) + ∇g(x0)ᵀd,

  where ∇g(x0)ᵀ ∈ Rn×n is the Jacobian of g at x0
▶ Provided that ∇g(x0) is non-singular, solve the system of linear equations

  ∇g(x0)ᵀd = −g(x0)

  to obtain d
▶ Set the next iterate x1 = x0 + d, and continue.

  • Newton’s method for solving equations

▶ Well-studied method; well-known for its good performance when the starting point x0 is chosen appropriately.
▶ However, for other choices of x0 the algorithm may not converge, as demonstrated in the following well-known picture:

(Figure: graph of g(x) with tangent-step iterates x0, x1, x2, x3 that fail to approach the root.)

  • Observations about Newton’s method for optimization

▶ Work per iteration: on the order of n³; the bottleneck is solving

  H(xk)d = −∇f(xk)

▶ The iterates are, in general, equally attracted to local minima and local maxima. Indeed, the method is just trying to solve the system of equations ∇f(x) = 0.
▶ The method assumes ∇²f(xk) is nonsingular at each iteration. Moreover, unless ∇²f(xk) is positive definite, dk is not guaranteed to be a descent direction.
▶ There is no guarantee that f(xk+1) ≤ f(xk).
▶ Without a line search, iterates of Newton’s method may not converge, unless started “close enough” to the right point.

  • Example of Newton’s method

Let f(x) = 7x − ln(x). x⋆ = 1/7 = 0.142857143 is the unique global minimum. Iterates from three starting points:

   k   xk (x0 = 1)   xk (x0 = 0.1)   xk (x0 = 0.01)
   0   1             0.1             0.01
   1   −5            0.13            0.0193
   2                 0.1417          0.035992573
   3                 0.14284777      0.062916884
   4                 0.142857142     0.098124028
   5                 0.142857143     0.128849782
   6                                 0.14148377
   7                                 0.142843938
   8                                 0.142857142
   9                                 0.142857143
  10                                 0.142857143

(From x0 = 1, the first step lands at x1 = −5, outside the domain of ln, and the method fails.)
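These iterates are easy to check: for f(x) = 7x − ln x we have f′(x) = 7 − 1/x and f″(x) = 1/x², so the pure Newton update simplifies to x⁺ = x − f′(x)/f″(x) = 2x − 7x². A short sketch (the function name `newton_1d` is made up for the example):

```python
def newton_1d(x0, iters):
    """Pure Newton's method for f(x) = 7x - ln(x); the update
    x+ = x - f'(x)/f''(x) simplifies algebraically to 2x - 7x^2."""
    x = x0
    traj = [x]
    for _ in range(iters):
        x = 2 * x - 7 * x * x
        traj.append(x)
    return traj

traj_01 = newton_1d(0.1, 8)   # converges rapidly to x* = 1/7
traj_1 = newton_1d(1.0, 1)    # first step jumps to -5, outside dom f
```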


  • Example of Newton’s method

f(x) = −ln(1 − x1 − x2) − ln x1 − ln x2. x⋆ = (1/3, 1/3), f(x⋆) = 3.295836866.

   k   (xk)1               (xk)2               ‖xk − x⋆‖
   0   0.85                0.05                0.58925565098879
   1   0.717006802721088   0.096598639455782   0.450831061926011
   2   0.512975199133209   0.176479706723556   0.238483249157462
   3   0.352478577567272   0.273248784105084   0.0630610294297446
   4   0.338449016006352   0.32623807005996    0.00874716926379655
   5   0.333337722134802   0.333259330511655   7.41328482837195e−5
   6   0.333333343617612   0.33333332724128    1.19532211855443e−8
   7   0.333333333333333   0.333333333333333   1.57009245868378e−16
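This run can be reproduced with the analytic gradient and Hessian: writing s = 1 − x1 − x2, the partial derivatives are ∂f/∂xi = 1/s − 1/xi, ∂²f/∂xi² = 1/s² + 1/xi², and the mixed second derivative is 1/s². A sketch (the name `newton_2d` is illustrative):

```python
import numpy as np

def newton_2d(x0, iters):
    """Pure Newton's method for f(x) = -ln(1 - x1 - x2) - ln x1 - ln x2."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(iters):
        s = 1.0 - x[0] - x[1]
        grad = np.array([1.0 / s - 1.0 / x[0], 1.0 / s - 1.0 / x[1]])
        hess = np.array([[1.0 / s**2 + 1.0 / x[0]**2, 1.0 / s**2],
                         [1.0 / s**2, 1.0 / s**2 + 1.0 / x[1]**2]])
        x = x - np.linalg.solve(hess, grad)  # Newton step
        traj.append(x.copy())
    return traj

x_star = np.array([1.0 / 3.0, 1.0 / 3.0])
traj = newton_2d([0.85, 0.05], 7)
errs = [float(np.linalg.norm(p - x_star)) for p in traj]
```

The error sequence `errs` shrinks quadratically once the iterates are near x⋆, matching the last column of the table.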


  • Rate of convergence for sequences (Appendix A.2)

Let {sk} ⊂ R, and limk→∞ sk = s̄. We say that {sk} exhibits
▶ Linear convergence: ∃ c ∈ [0, 1) such that, for some k0,

  |sk+1 − s̄| / |sk − s̄| ≤ c  ∀k ≥ k0.

  (Example: sk = (1/10)^k: 0.1, 0.01, 0.001, etc.)
▶ Superlinear convergence: ∃ ck with limk→∞ ck = 0 such that, for some k0,

  |sk+1 − s̄| / |sk − s̄| ≤ ck  ∀k ≥ k0.

  (Example: sk = 0.1 · (1/k!): 1/10, 1/20, 1/60, 1/240, 1/1200, etc.)
▶ Quadratic convergence: ∃ c ≥ 0 such that, for some k0,

  |sk+1 − s̄| / |sk − s̄|² ≤ c  ∀k ≥ k0.

  (Example: sk = (1/10)^(2^(k−1)): 0.1, 0.01, 0.0001, 0.00000001, etc.)
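The three example sequences above can be generated and their ratios inspected directly (variable names are illustrative): the linear sequence has a constant ratio, the superlinear one a ratio tending to 0, and the quadratic one a bounded ratio of successive error to squared error.

```python
import math

s_lin = [0.1 ** k for k in range(1, 8)]                  # linear example
s_sup = [0.1 / math.factorial(k) for k in range(1, 8)]   # superlinear example
s_quad = [0.1 ** (2 ** (k - 1)) for k in range(1, 6)]    # quadratic example

lin_ratios = [s_lin[k + 1] / s_lin[k] for k in range(6)]         # constant 0.1
sup_ratios = [s_sup[k + 1] / s_sup[k] for k in range(6)]         # 1/(k+2) -> 0
quad_ratios = [s_quad[k + 1] / s_quad[k] ** 2 for k in range(4)] # constant 1
```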

  • Illustration of rates of convergence


  • Convergence rate of the Newton’s method

▶ To analyze an algorithm’s rate of convergence, study the convergence rate of ‖xk − x̄‖, or |f(xk) − f(x̄)|, where {xk}k is a sequence of iterates, and x̄ is its limit point.
▶ We discussed many pitfalls of the pure Newton’s method (i.e., with αk = 1 ∀k).
▶ However, under certain conditions, the method exhibits a quadratic rate of convergence.

  • Quadratic convergence of Newton’s method

‖M‖ ≡ max_{‖x‖=1} ‖Mx‖. So, ∀x, ‖Mx‖ ≤ ‖M‖·‖x‖ (A.1.4)

Thm. 2.5.1 (Quadratic convergence)
Suppose f(x) is twice continuously differentiable and x⋆ is a point for which ∇f(x⋆) = 0. Suppose H(x) satisfies:
▶ there exists a scalar h > 0 for which ‖[H(x⋆)]⁻¹‖ ≤ 1/h
▶ there exist scalars β > 0 and L > 0 for which ‖H(x) − H(y)‖ ≤ L‖x − y‖ for all x, y ∈ B(x⋆, β).

Let x satisfy ‖x − x⋆‖ < γ, where γ := min{β, 2h/(3L)}, and let xN := x − H(x)⁻¹∇f(x). Then:
(i) ‖xN − x⋆‖ ≤ ‖x − x⋆‖² · L/(2(h − L‖x − x⋆‖)) ≤ ‖x − x⋆‖² · 3L/(2h)
(ii) ‖xN − x⋆‖ < ‖x − x⋆‖ < γ
(iii) If {xk}k is the sequence of Newton method iterates with x0 = x, then ‖xk − x⋆‖ → 0 as k → ∞

  • Comments on the quadratic convergence theorem

▶ Another, more involved, convergence result: Theorem 2.5.2
▶ Convergence results in Theorem 2.5.1 are local
  ▶ i.e., they apply only if the algorithm is started “sufficiently close” to x⋆: ‖x − x⋆‖ < γ
  ▶ In practice, for most functions, values of β, L, and h are not known
  ▶ Moreover, Newton’s method is invariant under invertible linear transformations of the variables, but β, L, and h are not
▶ A good practical modification of Newton’s method would:
  ▶ Make the algorithm globally convergent, i.e., ensure that it converges for any starting point
  ▶ Control the behavior of the algorithm at points far from the optimum
    ▶ E.g., ensure (sufficient) descent
  ▶ Follow the pure Newton’s method once the iterates get close to the optimum, to achieve an asymptotic quadratic convergence rate

  • Modifications of Newton’s method

▶ Problem: ∇²f(xk) is nearly singular, or indefinite (so dk may not be a descent direction)
  ▶ Possible solution: use dk = −(∇²f(xk) + εkI)⁻¹∇f(xk)
  ▶ Here εk ≥ 0 is chosen so that the smallest eigenvalue of ∇²f(xk) + εkI is bounded below by δ > 0
  ▶ Also, don’t want εk too big; otherwise, we are essentially doing SD!
▶ Problem: αk = 1 does not result in (sufficient) descent
  ▶ Solution: use a line search to ensure descent, e.g., backtracking with initial α = 1
▶ If f is strongly convex, e.g., ∃μ > 0 : min eig H(x) > μ > 0 for all x, and H(x) is Lipschitz continuous, then, when backtracking is used, for some η > 0 and κ > 0,
  ▶ if ‖∇f(xk)‖ ≥ η, then f(xk+1) − f(xk) ≤ −κ, and
  ▶ if ‖∇f(xk)‖ < η, then αk = 1 will be selected, and the next iterate will satisfy ‖∇f(xk+1)‖ < η, and so will all the further iterates. Moreover, quadratic convergence will be observed in this phase.
▶ Good stopping criteria for Newton’s method: either ‖dk‖ ≤ ε, or −∇f(xk)ᵀdk/2 = ∇f(xk)ᵀH(xk)⁻¹∇f(xk)/2 ≤ ε²
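The first modification above (shifting the Hessian by εI) can be sketched as follows; the function name and the ε schedule are made up for the example, and positive definiteness is detected by attempting a Cholesky factorization, which is also then reused to solve for the direction.

```python
import numpy as np

def regularized_newton_direction(hess, grad, delta=1e-6):
    """Compute d = -(H + eps*I)^{-1} grad with the smallest eps in
    {0, delta, 10*delta, 100*delta, ...} making H + eps*I positive
    definite, as detected by an attempted Cholesky factorization."""
    n = hess.shape[0]
    eps = 0.0
    while True:
        try:
            L = np.linalg.cholesky(hess + eps * np.eye(n))
            break
        except np.linalg.LinAlgError:   # not positive definite yet
            eps = delta if eps == 0.0 else 10.0 * eps
    y = np.linalg.solve(L, -grad)       # forward solve with the factor
    return np.linalg.solve(L.T, y)      # back substitution

# Demo: an indefinite Hessian still yields a descent direction
H_indef = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = regularized_newton_direction(H_indef, g)
```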

  • Problem and notation

(P)  min f(x)
     s.t. g(x) ≤ 0
          h(x) = 0
          x ∈ X,

where X is an open set and
  g(x) = (g1(x), . . . , gm(x))ᵀ : Rn → Rm,
  h(x) = (h1(x), . . . , hl(x))ᵀ : Rn → Rl

F ≜ {x ∈ X : g(x) ≤ 0, h(x) = 0};  (P) min_{x∈F} f(x)

Jacobian matrices of g and h (notation “J” different from the book!):

  Jg(x) = [∇g1(x)ᵀ; . . . ; ∇gm(x)ᵀ] ∈ Rm×n  and  Jh(x) = [∇h1(x)ᵀ; . . . ; ∇hl(x)ᵀ] ∈ Rl×n

  • Directions of interest

(P)  min f(x)
     s.t. g(x) ≤ 0
          h(x) = 0
          x ∈ X

Suppose x̄ ∈ F. We define the following sets:
▶ F0 = {d : ∇f(x̄)ᵀd < 0}: any element of this set is an “improving” (i.e., descent) direction of f at x̄
▶ I(x̄) = {i : gi(x̄) = 0}: the indices of the binding inequality constraints at x̄
▶ G0 = {d : ∇gi(x̄)ᵀd < 0 ∀i ∈ I(x̄)}: any element of this set is an “inward” direction of the binding inequality constraints
▶ H0 = {d : ∇hi(x̄)ᵀd = 0 ∀i = 1, . . . , l}: the set of tangent directions of the equality constraints

Although not explicit in the notation, all of these sets depend on x̄

  • Geometric Necessary Optimality conditions, simple case

Theorem 3.1.1, option (i): linear equality constraints
Assume that h(x) is a linear function, i.e., h(x) = Ax − b for A ∈ Rl×n. If x̄ is a local minimum of (P), then F0 ∩ G0 ∩ H0 = ∅.

Proof:
▶ hi(x) = aᵢᵀx − bi and ∇hi(x) = ai, i.e., H0 = {d : aᵢᵀd = 0, i = 1, . . . , l} = {d : Ad = 0}.
▶ Suppose d ∈ F0 ∩ G0 ∩ H0. Then:
  ▶ For all λ > 0 sufficiently small, gi(x̄ + λd) < gi(x̄) = 0 ∀i ∈ I(x̄)
  ▶ For i ∉ I(x̄), since λ is small, gi(x̄ + λd) < 0 by continuity
  ▶ h(x̄ + λd) = (Ax̄ − b) + λAd = 0 for all λ
  Therefore, x̄ + λd ∈ F for all λ > 0 sufficiently small
  ▶ I.e., d is a feasible direction at x̄ in F (Def. 3.1.1)
▶ On the other hand, for all sufficiently small λ > 0, f(x̄ + λd) < f(x̄)
▶ ...contradicting the assumption that x̄ is a local minimum of (P)

To extend to nonlinear h, we need to make some assumptions...

  • Geometric Necessary Optimality conditions

Theorem 3.1.1, option (ii)
If x̄ is a local minimum of (P) and the gradient vectors ∇hi(x̄), i = 1, . . . , l are linearly independent, then F0 ∩ G0 ∩ H0 = ∅.


  • From Geometric to Algebraic FONC

▶ Stating that F0 ∩ G0 ∩ H0 = ∅ is equivalent to saying that the following system of linear equations and inequalities:

  ∇f(x̄)ᵀd < 0
  ∇gi(x̄)ᵀd < 0, i ∈ I(x̄)
  ∇hi(x̄)ᵀd = 0, i = 1, . . . , l

  has no solutions
▶ How do you show that a system doesn’t have a solution? We need a few results from Convex Analysis.

  • Separation of Convex Sets (Appendix B)

Definitions:
▶ A hyperplane is a set of the form H = {x ∈ Rn : pᵀx = α}, where p ≠ 0 is a vector in Rn and α ∈ R; the sets H⁺ = {x ∈ Rn : pᵀx ≥ α}, H⁻ = {x ∈ Rn : pᵀx ≤ α} are closed half-spaces (open half-spaces: use strict inequalities)
▶ H is said to separate sets S and T if pᵀx ≥ α for all x ∈ S and pᵀx ≤ α for all x ∈ T, and to strictly separate S and T if pᵀx > α for all x ∈ S and pᵀx < α for all x ∈ T
▶ H is said to strongly separate S and T if for some ε > 0, pᵀx > α + ε for all x ∈ S and pᵀx < α − ε for all x ∈ T

  • A separation theorem

Theorem B.3.1 (modified)
Let S be a nonempty closed convex set in Rn, and x̄ ∉ S. Then ∃p ≠ 0 and α such that H = {x : pᵀx = α} strongly (and therefore strictly) separates S and {x̄}.

Proposition
Let S be a nonempty closed convex set in Rn, and x̄ ∉ S. Then there exists a unique point w ∈ S with minimum distance from x̄. Furthermore, w is the minimizing point if and only if (x̄ − w)ᵀ(x − w) ≤ 0 for all x ∈ S.

  • Some implications of Theorem B.3.1

∂S: boundary of the set S

Theorem B.3.2
If S is a nonempty convex set and x̄ ∈ ∂S, then there exists a supporting hyperplane to S at x̄, i.e., ∃p ≠ 0 such that pᵀx ≤ pᵀx̄ for all x ∈ S.

Proof idea: Consider T = cl(S); x̄ ∈ ∂S = ∂T. Let {xi} → x̄, xi ∉ T ∀i; each xi can be separated from T by a hyperplane with normal vector pi ≠ 0. Wolog, ‖pi‖ = 1 ∀i. Let p be a limit point of {pi}. Then p is the normal vector of a supporting hyperplane to S at x̄.

One application: if f : S → R is a convex function and x̄ ∈ int S, Theorem B.3.2 can be used to show that f has a subgradient at x̄.

  • Some implications of Theorem B.3.1

Theorem B.3.3
If A and B are nonempty convex sets and A ∩ B = ∅, then A and B can be separated by a hyperplane.

Proof idea: Let S = {x1 − x2 : x1 ∈ A, x2 ∈ B}. Note: 0 ∉ S. Let T = cl(S).
▶ If 0 ∉ T, use a hyperplane that separates {0} from T to get the result.
▶ If 0 ∈ T, then, since 0 ∈ ∂S, use a supporting hyperplane to S at 0 to get the result.

• Theorems of alternatives (a.k.a. Motzkin’s Transposition Theorems)

Theorem B.4.13, Farkas’ Lemma
Given matrix A and vector b of appropriate dimensions, exactly one of the following two systems has a solution:
(1) Ax = b, x ≥ 0  or
(2) Aᵀy ≥ 0, bᵀy < 0 (or, equivalently, Aᵀy ≤ 0, bᵀy > 0).

Lemma 3.1.1: Key Lemma
Given matrices Ā, B, and H of appropriate dimensions, exactly one of the two following systems has a solution:
(i) Āx < 0, Bx ≤ 0, Hx = 0
(ii) Āᵀu + Bᵀw + Hᵀv = 0, u ≥ 0, w ≥ 0, eᵀu = 1.

  • Back to constrained optimization

(P)  min f(x)
     s.t. gi(x) ≤ 0, i = 1, . . . ,m
          hi(x) = 0, i = 1, . . . , l
          x ∈ X,

Suppose x̄ ∈ F. We define the following sets:
▶ F0 = {d : ∇f(x̄)ᵀd < 0}
▶ G0 = {d : ∇gi(x̄)ᵀd < 0 ∀i ∈ I(x̄)}, where I(x̄) = {i : gi(x̄) = 0}
▶ H0 = {d : ∇hi(x̄)ᵀd = 0 ∀i = 1, . . . , l}

  • Back to optimality conditions

Theorem 3.1.1
If x̄ is a local minimizer of (P), and either (i) h(x) is a linear function, or (ii) ∇hi(x̄), i = 1, . . . , l are linearly independent, then F0 ∩ G0 ∩ H0 = ∅.

Proof idea for version (ii): (details in Section 3.5)
▶ Since Jh(x̄) ∈ Rl×n has full row rank, we can find an l-by-l nonsingular submatrix
  ▶ i.e., partition x = (y, z), rearranging variables if necessary
▶ ...and apply the IFT to find s(z): x̄ = (s(z̄), z̄) and h(s(z), z) = 0 for z ∈ B(z̄, ε)
▶ Suppose d ∈ F0 ∩ G0 ∩ H0. Let d = (dy; dz) and let

  x(λ) = (y(λ), z(λ)) = (s(z̄ + λdz), z̄ + λdz)

▶ Using the definition of d and the formula for Js(z), one can show that, for small λ > 0, x(λ) is feasible and f(x(λ)) < f(x̄): a contradiction

  • Implicit Function Theorem

Theorem 3.5.1: Implicit Function Theorem
Let h(x) = h(y, z) : Rn → Rl (with y ∈ Rl and z ∈ Rn−l) and x̄ = (ȳ, z̄) satisfy:
1. h(x̄) = 0
2. h(x) is continuously differentiable in a neighborhood of x̄
3. The l × l Jacobian matrix

   Jyh(ȳ, z̄) = [ ∂h1(x̄)/∂y1 · · · ∂h1(x̄)/∂yl ; . . . ; ∂hl(x̄)/∂y1 · · · ∂hl(x̄)/∂yl ]

   is non-singular.

Then there exists ε > 0 along with a function s : Rn−l → Rl such that
▶ s(z̄) = ȳ
▶ ∀z ∈ B(z̄, ε), h(s(z), z) = 0
▶ ∀z ∈ B(z̄, ε), s(z) is continuously differentiable and

  Jzs(z) = −Jyh(s(z), z)⁻¹ Jzh(s(z), z)

• Fritz John Necessary Conditions¹

Theorem 3.1.2, Fritz John Necessary Conditions
Let x̄ be a feasible solution of (P). If x̄ is a local minimum of (P), then there exists (u0, u, v) such that

  u0∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,
  u0, u ≥ 0, (u0, u, v) ≠ 0,
  ui · gi(x̄) = 0, i = 1, . . . ,m.

(Note that the first equation can be rewritten as u0∇f(x̄) + Jg(x̄)ᵀu + Jh(x̄)ᵀv = 0.)

▶ The condition ui · gi(x̄) = 0, i = 1, . . . ,m is referred to as complementary slackness

¹A rare case of someone’s first name being included in the naming of a theorem!

  • KKT first order necessary conditions

Theorem 3.1.3, KKT First Order Necessary Conditions
Let x̄ be a feasible solution of (P) and let I(x̄) = {i : gi(x̄) = 0}. Further, suppose that ∇gi(x̄) for i ∈ I(x̄) and ∇hi(x̄) for i = 1, . . . , l are linearly independent. If x̄ is a local minimum, there exists (u, v) such that

  ∇f(x̄) + Jg(x̄)ᵀu + Jh(x̄)ᵀv = 0,        (1)
  u ≥ 0, uigi(x̄) = 0, i = 1, . . . ,m.     (2)

▶ Components of u and v are usually referred to as Lagrange multipliers
▶ A feasible point x̄ that together with some multiplier vectors u and v satisfies conditions (1) and (2) is referred to as a KKT point.

  • Proof of KKT FONC

x̄ must satisfy the Fritz John conditions:

  u0∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,
  u0, u ≥ 0, (u0, u, v) ≠ 0,
  uigi(x̄) = 0, i = 1, . . . ,m.

▶ If u0 > 0, redefine u ← u/u0 and v ← v/u0.
▶ If u0 = 0, then

  ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0, i.e.,
  ∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,

  i.e., the above gradients are linearly dependent: a contradiction

  • When are necessary conditions really necessary?

▶ Recall that the statement of FONC above has the form “if x̄ is a local minimum of (P) and some assumption on the constraints holds, then x̄ must be a KKT point”
▶ Without additional assumptions on the constraints, we cannot prove that the KKT conditions are necessary for optimality, i.e., that x̄ satisfies (1) and (2)
▶ An additional assumption on the constraints that enables us to proceed with the proof that the KKT conditions are necessary for optimality is called a constraint qualification
▶ We have already established (Theorem 3.1.3) that the following is a constraint qualification:
  Linear Independence Constraint Qualification: The gradients ∇gi(x̄), i ∈ I(x̄), ∇hi(x̄), i = 1, . . . , l are linearly independent.

  • Other CQs: Slater’s condition

Definition 3.1.2
A point x ∈ X is called a Slater point of problem (P) if x satisfies g(x) < 0 and h(x) = 0, i.e., x is feasible and satisfies all inequalities strictly.

Theorem 3.1.4 (Slater condition)
Suppose the problem (P) satisfies the Slater condition, i.e., gi, i = 1, . . . ,m are convex, hi(x), i = 1, . . . , l are linear with ∇hi(x), i = 1, . . . , l linearly independent, and (P) has a Slater point. Then the KKT conditions are necessary to characterize a local optimal solution.

Note: if the hi’s are linear, their gradients are constant, and the linear independence assumption is wolog

  • Proof of Slater’s condition

▶ Suppose the assumptions of Theorem 3.1.4 are satisfied, x0 is a Slater point, and x̄ is a local minimum
▶ Fritz John conditions: ∃(u0, u, v) ≠ 0 such that (u0, u) ≥ 0 and

  u0∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0,  uigi(x̄) = 0 ∀i

▶ If u0 > 0, divide through by u0 to get the KKT conditions
▶ If u0 = 0, we have:

  0 = ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = ∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄)

  ▶ Let d = x0 − x̄
  ▶ For each i ∈ I(x̄), ∇gi(x̄)ᵀd < 0, since

    0 > gi(x0) ≥ gi(x̄) + ∇gi(x̄)ᵀ(x0 − x̄) = ∇gi(x̄)ᵀd

  ▶ Since the hi(x) are linear, ∇hi(x̄)ᵀd = 0, i = 1, . . . , l
  ▶ Thus, 0 = 0ᵀd = (∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄))ᵀd < 0, unless ui = 0 for all i ∈ I(x̄)
  ▶ Therefore, v ≠ 0 and Jh(x̄)ᵀv = 0, violating the linear independence assumption. This is a contradiction, and so u0 > 0.

  • Other CQs: linear constraints

Theorem 3.1.5, Linear constraints
If all constraints are linear, the KKT conditions are necessary to characterize a local optimal solution.

Proof:
▶ Our problem is min{f(x) : Ax ≤ b, Mx = g}.
▶ Suppose x̄ is a local minimum. Partition Ax ≤ b into two groups: AIx ≤ bI and AĪx ≤ bĪ (the first group active at x̄).
▶ The set of feasible directions at x̄ is precisely {d : Md = 0, AId ≤ 0}
▶ Therefore, the following system has no solutions:

  [AI; M; −M] d ≤ 0,  −∇f(x̄)ᵀd > 0

▶ From Farkas’ lemma, there exists (u, v1, v2) ≥ 0 such that

  AIᵀu + Mᵀv1 − Mᵀv2 = −∇f(x̄).

  Take v = v1 − v2 to finish the proof.

  • Applying FONC: summary (Section 3.6)

Goal: build an (exhaustive) list of candidates for local optimality
▶ Check if the problem (P) satisfies one of the global CQs (e.g., Slater, linear constraints)
  ▶ If yes, every local minimum is a KKT point; find all KKT points to form a list of candidates
▶ If not, check which feasible points satisfy/violate local CQs (e.g., linear independence)
  ▶ Among points that satisfy some CQ, all KKT points belong to the list of candidates
  ▶ Among points that violate (all) CQs, all Fritz John points belong to the list of candidates

  • Convex Problems and first-order sufficient conditions

The program

(P)  min f(x)
     s.t. g(x) ≤ 0
          h(x) = 0
          x ∈ X

is a convex problem if f, gi, i = 1, . . . ,m are convex functions, hi, i = 1, . . . , l are linear (more precisely, affine) functions, and X is a convex set.

Theorem 3.2.1 (paraphrased)
Suppose (P) is a convex problem and X is an open set. Then the first order KKT conditions are sufficient for optimality in a convex program.

  • Proof of Theorem 3.2.1: Analysis of the feasible region

▶ Because each gi is convex, the level sets

  Ci = {x ∈ X : gi(x) ≤ 0}, i = 1, . . . ,m

  are convex
▶ Because each hi is linear, the sets

  Di = {x ∈ X : hi(x) = 0}, i = 1, . . . , l

  are convex
▶ Thus the feasible region

  F = {x ∈ X : g(x) ≤ 0, h(x) = 0}

  is a convex set (that’s Proposition 3.2.1)

  • Proof of Theorem 3.2.1 — continued

▶ Let x ∈ F, x ≠ x̄.
▶ ∀i ∈ I(x̄),

  0 ≥ gi(x) ≥ gi(x̄) + ∇gi(x̄)ᵀ(x − x̄) = ∇gi(x̄)ᵀ(x − x̄)

▶ ∀i = 1, . . . , l, since hi(·) is linear,

  ∇hi(x̄)ᵀ(x − x̄) = 0

▶ From the KKT conditions, including complementarity,

  ∇f(x̄)ᵀ(x − x̄) = −(∑_{i∈I(x̄)} ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄))ᵀ(x − x̄) ≥ 0,

  and by the gradient inequality, f(x) ≥ f(x̄) for any feasible x.
▶ Can relax assumptions a bit: pseudoconvex f, quasiconvex gi’s (Sec. 3.3)

  • Lagrangian function

▶ Lagrangian function, or simply the Lagrangian:

  L(x, u, v) = f(x) + ∑_{i=1}^m ui gi(x) + ∑_{i=1}^l vi hi(x)

▶ Gradient conditions of the KKT necessary conditions:

  ∇xL(x̄, u, v) = 0

▶ Also,

  ∇²xxL(x, u, v) = ∇²f(x) + ∑_{i=1}^m ui∇²gi(x) + ∑_{i=1}^l vi∇²hi(x)

  • Second order KKT conditions

Theorem 3.4.1, KKT second order necessary conditions
Suppose x̄ is a local minimum of (P), and ∇gi(x̄), i ∈ I(x̄) and ∇hi(x̄), i = 1, . . . , l are linearly independent. Then x̄ must satisfy the KKT conditions. Furthermore, every d that satisfies

  ∇gi(x̄)ᵀd ≤ 0, i ∈ I(x̄)  and  ∇hi(x̄)ᵀd = 0, i = 1, . . . , l

must also satisfy

  dᵀ∇²xxL(x̄, u, v)d ≥ 0.

  • Second order KKT conditions

Theorem 3.4.2, KKT second order sufficient conditions
Suppose the point x̄ ∈ F together with multipliers (u, v) satisfies the first order KKT conditions. Let I⁺ = {i ∈ I(x̄) : ui > 0} and I⁰ = {i ∈ I(x̄) : ui = 0}. Additionally, suppose that every d ≠ 0 that satisfies

  ∇gi(x̄)ᵀd ≤ 0, i ∈ I⁰
  ∇gi(x̄)ᵀd = 0, i ∈ I⁺
  ∇hi(x̄)ᵀd = 0, i = 1, . . . , l

also satisfies

  dᵀ∇²xxL(x̄, u, v)d > 0.

Then x̄ is a (strict) local minimum.

  • Lagrangian duality

    Lectures on Lagrangian duality were presented on the whiteboard


  • Problems with linear equality constraints: projected steepest descent method

    Lecture presented on the whiteboard


  • Barrier methods

(P)  min f(x)
     s.t. gi(x) ≤ 0, i = 1, . . . ,m
          x ∈ X,

where X is an open set.

▶ F = {x ∈ X : gi(x) ≤ 0, i = 1, . . . ,m}
▶ Assumption: ∃x0 ∈ X : gi(x0) < 0, i = 1, . . . ,m
▶ Let b(x) be a barrier function:
  ▶ b(x) ≥ 0 for all x that satisfy g(x) < 0 (or any finite lower bound)
  ▶ b(x) → ∞ along any sequence of points x with g(x) < 0 such that maxi{gi(x)} → 0

  • Barrier problem

B(θ)  min f(x) + θb(x)
      s.t. g(x) < 0,
           x ∈ X

Barrier method for solving (P): solve B(θk), k = 1, 2, . . ., for a sequence of parameters satisfying θk > θk+1 > 0, θk →k→∞ 0.
▶ It is practical to initialize the algorithm for solving B(θk+1) at the solution xk to B(θk)
▶ An initial solution x0 can be found by applying a Barrier method to an auxiliary optimization problem

Barrier Convergence Theorem
Suppose f(x), g(x), and b(x) are continuous functions. Let xk, k = 1, 2, . . ., be a sequence of solutions of B(θk). Suppose there exists an optimal solution x⋆ of (P) for which B(x⋆, δ) ∩ {x : g(x) < 0} ≠ ∅ for every δ > 0. Then any limit point x̄ of {xk} solves (P).

  • Common barrier functions

  b(x) = ∑_{i=1}^m γ(−gi(x)),

where γ : R₊₊ → R is such that
▶ γ(t) → +∞ as t → 0⁺
▶ γ is monotone decreasing.

For example:
▶ γ(t) = t⁻q, where q > 0: b(x) = ∑_{i=1}^m (−gi(x))⁻q
▶ γ(t) = −ln t: b(x) = −∑_{i=1}^m ln(−gi(x)) (works if F is a bounded set)

Then

  ∇b(x) = −∑_{i=1}^m γ′(−gi(x))∇gi(x)
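The barrier scheme can be illustrated on a tiny 1D problem. This is a sketch under assumptions: the problem min x² s.t. x ≥ 1 (so g(x) = 1 − x and the log barrier gives the subproblem min x² − θ ln(x − 1)) and the function name `barrier_solve` are made up for the example; the inner minimization uses guarded Newton steps that keep iterates strictly feasible.

```python
import math

def barrier_solve(theta, x_init):
    """Minimize the barrier objective x^2 - theta*ln(x - 1) over x > 1,
    using Newton steps halved as needed to stay strictly feasible."""
    x = x_init
    for _ in range(100):
        grad = 2 * x - theta / (x - 1)
        hess = 2 + theta / (x - 1) ** 2
        step = grad / hess
        while x - step <= 1.0:      # keep the iterate strictly feasible
            step *= 0.5
        x = x - step
    return x

x = 2.0                              # strictly feasible starting point
for theta in [1.0, 0.1, 0.01, 0.001]:
    x = barrier_solve(theta, x)      # warm-start each subproblem at xk
```

Here each subproblem has the closed-form solution x(θ) = (1 + √(1 + 2θ))/2, so the final iterate should sit just inside the feasible region, approaching the constrained minimizer x⋆ = 1 as θ → 0.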

  • KKT multipliers in barrier methods

xk, the solution to B(θk), satisfies:

  ∇f(xk) − θk ∑_{i=1}^m γ′(−gi(xk))∇gi(xk) = 0,  or  ∇f(xk) + ∑_{i=1}^m uki∇gi(xk) = 0,

where uki = −θkγ′(−gi(xk)), i = 1, . . . ,m

Theorem BAR 2
Let (P) satisfy the conditions of the Barrier Convergence Theorem. Suppose γ(y) is continuously differentiable and let uk be defined as above. If xk → x̄, and x̄ satisfies the linear independence condition for the gradient vectors of the active constraints, then uk → ū, where ū is a vector of Karush-Kuhn-Tucker multipliers for the optimal solution x̄ of (P).

  • Penalty methods

    (P)  min f(x)
         s.t. gi(x) ≤ 0, i = 1, . . . , m
              hi(x) = 0, i = 1, . . . , l
              x ∈ X ⊆ Rn

    X is an open set; the feasible region is denoted by F.

    Penalty function

    A function p(x) : Rn → R is called a penalty function for (P) if
    p(x) satisfies:
    ▶ p(x) = 0 if g(x) ≤ 0 and h(x) = 0, and
    ▶ p(x) > 0 if g(x) ≰ 0 or h(x) ≠ 0.

  • Penalty functions

    ▶ Penalty functions are typically defined by

      p(x) = ∑_{i=1}^m φ(gi(x)) + ∑_{i=1}^l ψ(hi(x)),

      where
      ▶ φ(y) = 0 if y ≤ 0 and φ(y) > 0 if y > 0
      ▶ ψ(y) = 0 if y = 0 and ψ(y) > 0 if y ≠ 0.
    ▶ Example:

      p(x) = ∑_{i=1}^m (max{0, gi(x)})² + ∑_{i=1}^l hi(x)².

    ▶ More generally, we often use

      p(x) = ∑_{i=1}^m [max{0, gi(x)}]^q + ∑_{i=1}^l |hi(x)|^q, where q ≥ 1.

    ▶ If q = 1, we have the “linear penalty function.” This function may
      not be differentiable at points x where gi(x) = 0 or hi(x) = 0.

  • Penalty method

    Solve the penalty program

    P(c):  min f(x) + c p(x)
           s.t. x ∈ X

    for an increasing sequence of constants c as c → +∞.
    ▶ The scalar quantity c is called the penalty parameter.
    ▶ Idea: for a sufficiently large c, the solution x(c) of P(c) will be
      (almost) feasible.
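A one-variable sketch with the quadratic penalty (our own toy example, not from the course materials): min x s.t. x − 1 = 0, whose penalized minimizers x(c) = 1 − 1/(2c) are infeasible for every finite c but approach x⋆ = 1 as c grows.

```python
def quadratic_penalty_method(f, fp, h, hp, x, c=1.0, grow=10.0, c_max=1e8):
    """Quadratic-penalty sketch for min f(x) s.t. h(x) = 0 (one variable):
    minimize q(c, x) = f(x) + c*h(x)^2 for an increasing sequence of c,
    warm-starting each subproblem at the previous solution."""
    while c <= c_max:
        q = lambda y: f(y) + c * h(y)**2
        # gradient descent with Armijo backtracking on q(c, .)
        for _ in range(10000):
            grad = fp(x) + 2.0 * c * h(x) * hp(x)
            if abs(grad) < 1e-10:
                break
            t = 1.0
            while q(x - t * grad) > q(x) - 0.5 * t * grad * grad:
                t *= 0.5
            x -= t * grad
        c *= grow   # penalty parameter: c_{k+1} > c_k
    return x

# min x  s.t.  x - 1 = 0; x(c) = 1 - 1/(2c) approaches x* = 1 from outside F
x_c = quadratic_penalty_method(lambda x: x, lambda x: 1.0,
                               lambda x: x - 1, lambda x: 1.0, x=0.0)
```

Note the contrast with barrier methods: here every iterate is (slightly) infeasible, and feasibility is attained only in the limit c → ∞.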

  • Penalty methods: convergence and analysis

    Let q(c, x) = f(x) + c p(x) and xk = argmin_{x∈X} q(ck, x), where
    ck+1 > ck and ck → +∞ as k → +∞.

    PEN 1: Penalty Lemma

    1. q(ck, xk) ≤ q(ck+1, xk+1)
    2. p(xk) ≥ p(xk+1)
    3. f(xk) ≤ f(xk+1)
    4. f(x⋆) ≥ q(ck, xk) ≥ f(xk)

    PEN 2: Penalty Convergence Theorem

    Suppose that the problem is feasible and f(x), g(x), h(x), and
    p(x) are continuous functions. Let {xk}, k = 1, . . . ,∞, be a
    sequence of solutions to P(ck), and suppose the sequence {xk} is
    contained in a compact set. Then any limit point x̄ of {xk}
    solves (P).

  • KKT multipliers in penalty methods

    ▶ Suppose

      p(x) = ∑_{i=1}^m φ(gi(x)) + ∑_{i=1}^l ψ(hi(x)),

      where φ(y) and ψ(y) are as above.
    ▶ If φ(y) and ψ(y) are continuously differentiable and φ′(0) = 0,
      then p(x) is differentiable, and
      ∇p(x) = ∑_{i=1}^m φ′(gi(x))∇gi(x) + ∑_{i=1}^l ψ′(hi(x))∇hi(x)
    ▶ If xk solves P(ck), then

      ∇f(xk) + ck∇p(xk) = 0,

      that is,

      ∇f(xk) + ck ∑_{i=1}^m φ′(gi(xk))∇gi(xk) + ck ∑_{i=1}^l ψ′(hi(xk))∇hi(xk) = 0.

  • KKT multipliers in penalty methods

    Let

    [uk]i = ck φ′(gi(xk)),   [vk]i = ck ψ′(hi(xk)).

    Then

    ∇f(xk) + ∑_{i=1}^m [uk]i ∇gi(xk) + ∑_{i=1}^l [vk]i ∇hi(xk) = 0,

    and so we can interpret (uk, vk) as a sort of vector of
    Karush-Kuhn-Tucker multipliers.

    PEN 3: KKT multipliers in penalty methods

    Suppose φ(y) and ψ(y) are continuously differentiable and
    φ′(0) = 0, and that f(x), g(x), and h(x) are differentiable. Let
    (uk, vk) be defined as above. If xk → x̄ and x̄ satisfies the
    linear independence condition for the gradient vectors of active
    constraints, then (uk, vk) → (ū, v̄), where (ū, v̄) is a vector of
    Karush-Kuhn-Tucker multipliers for the optimal solution x̄ of (P).
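For a toy instance these estimates can be computed in closed form (our own example, not from the course materials): min x² s.t. 1 − x ≤ 0, whose KKT multiplier is u⋆ = 2 (from 2x − u = 0 at x̄ = 1).

```python
def penalty_multiplier_estimates(cs):
    """min x^2 s.t. 1 - x <= 0 with the quadratic penalty phi(y) = max(0, y)^2.
    The penalized minimizer solves 2x - 2c(1 - x) = 0, i.e. x(c) = c/(1 + c),
    and the multiplier estimate u(c) = c * phi'(g(x(c))) = 2c/(1 + c)
    tends to the KKT multiplier u* = 2 as c grows."""
    estimates = []
    for c in cs:
        x_c = c / (1.0 + c)                    # argmin of x^2 + c*max(0, 1-x)^2
        u_c = c * 2.0 * max(0.0, 1.0 - x_c)    # u(c) = c * phi'(g(x_c))
        estimates.append((x_c, u_c))
    return estimates

estimates = penalty_multiplier_estimates([1.0, 10.0, 100.0, 1000.0])
# x(c) -> 1 and u(c) -> 2 as c grows, illustrating PEN 3
```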

  • Exact Penalty Methods

    ▶ In most penalty methods, the xk’s are not in the feasible region;
      the solution to (P) is approached only in the limit.
    ▶ Can we choose a penalty function p(x) and a constant c so
      that x(c) is also an optimal solution of (P)?

    PEN 4: Linear penalty leads to an exact penalty method

    Suppose (P) is a convex program for which the Karush-Kuhn-Tucker
    conditions are necessary. Suppose that

    p(x) := ∑_{i=1}^m [max{0, gi(x)}] + ∑_{i=1}^l |hi(x)|.

    Then, as long as c is chosen sufficiently large, the sets of optimal
    solutions of P(c) and (P) coincide. In fact, it suffices to choose
    c > max{u⋆i, i = 1, . . . , m; |v⋆i|, i = 1, . . . , l}, where (u⋆, v⋆) is a vector
    of Karush-Kuhn-Tucker multipliers.

  • Augmented Lagrangian penalty function

    ▶ Is there an exact penalty method with a smooth penalty function?
    ▶ Augmented Lagrangian penalty (equality constraints only):

      LALAG(x, v) = f(x) + ∑_{i=1}^l vi hi(x) + c ∑_{i=1}^l hi(x)².

    ▶ If x̄ is the optimal solution and v̄ is the vector of
      corresponding multipliers,

      ∇x LALAG(x̄, v̄) = [∇f(x̄) + ∑_{i=1}^l v̄i ∇hi(x̄)] + 2c ∑_{i=1}^l hi(x̄)∇hi(x̄) = 0

    ▶ If (x̄, v̄) satisfy SOSC, then for c sufficiently large,
      x̄ = argmin_x LALAG(x, v̄)

  • ALAG — continued

    v̄ and c are not known in advance. The algorithm:

    Initialization: Select the initial multipliers v0 and penalty weight
        c0 > 0. Set k ← 0.
    Iteration k, x update: Solve the unconstrained problem of minimizing
        LALAG(x, vk) and let xk denote the optimal solution
        obtained. If termination criteria are satisfied, stop.
    Iteration k, v update: Obtain the updated multipliers vk+1
        according to appropriate formulas, increase k, and repeat
        the iteration.

    Multiplier updates:
    ▶ Constant: Keep the multipliers constant (not much different
      from the usual quadratic penalty method).
    ▶ Method of multipliers: Let vk+1 = vk + 2ck h(xk) (under the
      right circumstances, vk will converge to v̄ if xk converges to x̄).
    ▶ Other multiplier update methods – second order, exponential, etc.
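A one-variable sketch of the method of multipliers (our own toy example, not from the course materials): min x² s.t. x − 1 = 0, whose solution is x̄ = 1 with multiplier v̄ = −2; the penalty weight c is kept fixed.

```python
def method_of_multipliers(fp, h, hp, x, v=0.0, c=10.0, iters=50):
    """Augmented Lagrangian sketch for min f(x) s.t. h(x) = 0 (one variable):
    alternately minimize L(x, v_k) = f(x) + v_k h(x) + c h(x)^2 over x,
    then update v_{k+1} = v_k + 2 c h(x_k)."""
    for _ in range(iters):
        # inner minimization: Newton on dL/dx with finite-difference curvature
        for _ in range(100):
            grad = fp(x) + v * hp(x) + 2.0 * c * h(x) * hp(x)
            if abs(grad) < 1e-12:
                break
            eps = 1e-6
            gp2 = fp(x+eps) + v*hp(x+eps) + 2.0*c*h(x+eps)*hp(x+eps)
            x -= grad / ((gp2 - grad) / eps)
        v += 2.0 * c * h(x)   # method-of-multipliers update
    return x, v

# min x^2  s.t.  x - 1 = 0; x_bar = 1 and the multiplier v_bar = -2
x_bar, v_bar = method_of_multipliers(lambda x: 2*x,
                                     lambda x: x - 1, lambda x: 1.0,
                                     x=0.0)
```

Unlike the plain quadratic penalty, the multiplier update lets the iterates converge to the exact solution without driving c to infinity.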

  • Motivation

    P:   min_{x∈X} f(x)            P(θ):  min_{x∈X} f(x) + θb(x)
         s.t. Ax = b                      s.t. Ax = b
              g(x) ≤ 0

    ▶ Several aspects of general barrier methods were left without analysis:
      ▶ How many iterations of the (projected) Newton method are
        needed to solve each barrier problem P(θk)?
      ▶ How does the above depend on the relationship between θk+1 and θk?
      ▶ How precisely do we need to solve each P(θk)?
      ▶ How small does θ need to get to produce a good approximate
        solution of (P)?
    ▶ These questions are hard to answer in general, but we can do the
      analysis for specific classes of (P).

  • Linear Programming: The primal and dual problems

    Primal LP in standard form and its Lagrangian dual:

    P:  min cᵀx            D:  max bᵀπ
        s.t. Ax = b            s.t. Aᵀπ + s = c
             x ≥ 0                  s ≥ 0

    Note:
    ▶ If x and (π, s) are primal/dual feasible, then the optimality
      gap (or duality gap) between them is

      cᵀx − bᵀπ = cᵀx − (Ax)ᵀπ = (c − Aᵀπ)ᵀx = sᵀx

    ▶ KKT (first-order, necessary and sufficient) conditions for P:
      ▶ Ax = b, x ≥ 0 (primal feasibility)
      ▶ Aᵀπ + s = c, s ≥ 0 (gradient alignment, and dual feasibility)
      ▶ sᵀx = 0 (complementarity)

  • Barrier subproblem for P

    P(θ)  min cᵀx − θ ∑_{j=1}^n ln(xj)
          s.t. Ax = b
               x > 0

    ▶ Idea for an interior point (barrier) algorithm: solve a sequence
      of problems P(θ) with θ → 0+
    ▶ KKT conditions for P(θ):

      Ax = b, x > 0
      Aᵀπ + s = c
      (1/θ)XSe − e = 0

    ▶ If (x, π, s) satisfy these conditions, then x is feasible for P,
      (π, s) is feasible for D, and the duality gap between them is:

      xᵀs = eᵀXSe = θeᵀe = θn.

  • Idea: don’t need to solve each barrier problem “exactly”

    β-approximate solutions of P(θ)

    A β-approximate solution of P(θ) is defined as any vector (x, π, s)
    that satisfies

    Ax = b, x > 0
    Aᵀπ + s = c
    ‖(1/θ)Xs − e‖ ≤ β.

    Lemma 5.2.1

    If (x̄, π̄, s̄) is a β-approximate solution of P(θ) and β < 1, then x̄
    is feasible for P, (π̄, s̄) is feasible for D, and the duality gap
    between them satisfies:

    nθ(1 − β) ≤ cᵀx̄ − bᵀπ̄ = x̄ᵀs̄ ≤ nθ(1 + β).
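The gap bounds in Lemma 5.2.1 follow componentwise from the definition; a one-step derivation (our own summary, not a slide from the course):

```latex
\left\|\tfrac{1}{\theta}\bar X\bar s - e\right\| \le \beta
\;\Longrightarrow\;
(1-\beta)\,\theta \le \bar x_j \bar s_j \le (1+\beta)\,\theta
\quad \text{for each } j,
```

since each component of a vector is bounded in absolute value by the vector's norm; summing over j = 1, . . . , n gives nθ(1 − β) ≤ x̄ᵀs̄ ≤ nθ(1 + β), and x̄ᵀs̄ = cᵀx̄ − bᵀπ̄ by the duality-gap identity for feasible primal/dual pairs.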

  • Several flavors of interior point methods for LP in the book

    ▶ Section 5.3: Primal algorithm
      ▶ Essentially a special case of the barrier method we’ve already discussed
      ▶ For each P(θk), a β-approximate solution is found
      ▶ For the theoretical analysis, use a sequence of θk’s such that only
        one iteration of the (projected) Newton method is needed for each P(θk)
    ▶ Section 5.4: A primal-dual algorithm
      ▶ A variation on the above, but simultaneously updating primal
        and dual variables
      ▶ Theoretical complexity analysis
    ▶ Section 5.5: A more practical primal-dual algorithm

  • Primal-dual IPM: Solving P(θ) via a primal-dual Newton’s method

    Trying to solve the system of (KKT) equations:

    Ax = b, x > 0
    Aᵀπ + s = c
    (1/θ)XSe − e = 0  ⇔  XSe = θe

    Let (x̄, π̄, s̄) be our current primal and dual feasible iterate:

    Ax̄ = b, x̄ > 0,  Aᵀπ̄ + s̄ = c,  s̄ > 0

    The Newton equation system (to find the Newton direction)
    around the current iterate is:

    A∆x = 0
    Aᵀ∆π + ∆s = 0
    S̄∆x + X̄∆s = θe − X̄S̄e

  • Primal-dual IPM: Approaches for solving the primal-dual Newton equation system

    0. Closed-form expressions in (5.21) — not for use in practice
    1. Solve directly (e.g., using Matlab’s backslash operator):

       [ A   0   0 ] [ ∆x ]   [     0      ]
       [ 0   Aᵀ  I ] [ ∆π ] = [     0      ]
       [ S̄   0   X̄ ] [ ∆s ]   [ θe − X̄S̄e ]

    2. Better: use the special sparsity structure of the matrix:
       ▶ Solve (A X̄ S̄⁻¹ Aᵀ)∆π = A(x̄ − θS̄⁻¹e)
       ▶ Substitute ∆s = −Aᵀ∆π
       ▶ Substitute ∆x = −x̄ + θS̄⁻¹e − S̄⁻¹X̄∆s
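Approach 2 is easy to transcribe; a minimal numpy sketch (the function name and the tiny feasible instance are our own, not from the book):

```python
import numpy as np

def newton_direction(A, x, s, theta):
    """Primal-dual Newton direction for P(theta), computed via the
    normal equations: only the m x m matrix A X S^{-1} A^T is factorized."""
    AXS = A @ np.diag(x / s)                               # A X S^{-1}
    dpi = np.linalg.solve(AXS @ A.T, A @ (x - theta / s))  # solve for delta-pi
    ds = -A.T @ dpi                                        # back-substitute
    dx = -x + theta / s - (x / s) * ds
    return dx, dpi, ds

# tiny instance: min x1 + 2 x2  s.t.  x1 + x2 = 1, x >= 0
A = np.array([[1.0, 1.0]])
x = np.array([0.5, 0.5])                         # primal feasible, x > 0
pi = np.array([0.0]); s = np.array([1.0, 2.0])   # dual feasible: A^T pi + s = c
dx, dpi, ds = newton_direction(A, x, s, theta=0.1)
# (dx, dpi, ds) satisfies all three Newton equations:
# A dx = 0,  A^T dpi + ds = 0,  S dx + X ds = theta*e - X S e
```

For dense problems this replaces one (2n+m)×(2n+m) solve by one m×m solve plus matrix-vector products; for sparse A, sparsity of A X̄ S̄⁻¹ Aᵀ is what makes large LPs tractable.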

  • Primal-dual interior point algorithm (Alg. 12)

    Step 0: Initialization. Data is (x0, π0, s0, θ0) such that (x0, π0, s0)
        is a β-approximate solution of P(θ0) for some known
        value of β that satisfies β < 1/3. k ← 0.
    Step 1: Set current values (x̄, π̄, s̄) = (xk, πk, sk), θ̃ = θk.
    Step 2: Shrink θ. Set θ̄ = αθ̃ for some α ∈ (0, 1).
    Step 3: Compute the primal-dual Newton direction. Compute the
        Newton step for P(θ̄) at (x̄, π̄, s̄) by solving

        A∆x = 0
        Aᵀ∆π + ∆s = 0
        S̄∆x + X̄∆s = θ̄e − X̄S̄e.

    Step 4: Update all values.

        (x′, π′, s′) = (x̄, π̄, s̄) + (∆x, ∆π, ∆s)

    Step 5: Reset counter and continue.
        (xk+1, πk+1, sk+1) = (x′, π′, s′). θk+1 = θ̄.
        k ← k + 1. Go to Step 1.

  • Primal-dual IPM: Analysis of the algorithm

    Theorem 5.4.1: Explicit Quadratic Convergence of
    Primal-Dual Newton’s Method

    Suppose that (x̄, π̄, s̄) is a β-approximate solution of P(θ̄) and
    β < 1/3. Let (∆x, ∆π, ∆s) be the solution to the primal-dual
    Newton equations above, and let:

    (x′, π′, s′) = (x̄, π̄, s̄) + (∆x, ∆π, ∆s).

    Then (x′, π′, s′) is a ((1 + β)/(1 − β)²)·β²-approximate solution of P(θ̄).

    Proof at the end of Section 5.4. Two pages of rather boring
    algebra, with the key insight:

    ‖X̄⁻¹∆x‖ ≤ β/(1 − β) < 1  and  ‖S̄⁻¹∆s‖ ≤ β/(1 − β) < 1

  • Primal-dual IPM: Analysis of the algorithm

    Thm. 5.4.2: Shrinkage Theorem

    Suppose that (x̄, π̄, s̄) is a 3/40-approximate solution of P(θ̃). Let

    α = 1 − (1/8)/(1/5 + √n)

    and let θ̄ = αθ̃. Then (x̄, π̄, s̄) is a 1/5-approximate
    solution of P(θ̄).

    Thm. 5.4.3: Convergence theorem

    Suppose that (x0, π0, s0) is a β = 3/40-approximate solution of
    P(θ0), and let

    α = 1 − (1/8)/(1/5 + √n).

    Then, for all k = 1, 2, 3, . . ., (xk, πk, sk) is a β = 3/40-approximate
    solution of P(θk).

  • Primal-dual IPM: Complexity of the algorithm

    Thm. 5.4.4: Computational Guarantee

    Suppose that (x0, π0, s0) is a β = 3/40-approximate solution of
    P(θ0). In order to obtain primal and dual feasible solutions
    (xk, πk, sk) with a duality gap of at most ε, one needs to run the
    algorithm for at most

    k = ⌈10√n · ln((43 (x0)ᵀs0)/(37 ε))⌉

    iterations.

  • A practical interior point algorithm (Alg. 13)

    1. Given (x0, π0, s0) satisfying x0 > 0, s0 > 0, and θ0 > 0, and r
       satisfying 0 < r < 1, and ε > 0. Set k ← 0.
    2. Test the stopping criterion. Check if:

       (1) ‖Axk − b‖ ≤ ε
       (2) ‖Aᵀπk + sk − c‖ ≤ ε
       (3) (sk)ᵀxk ≤ ε.

       If so, STOP. If not, proceed.
    3. Set θ ← (1/10)·((xk)ᵀsk/n)
    4. Solve the Newton equation system:

       (1) A∆x = b − Axk =: r1
       (2) Aᵀ∆π + ∆s = c − Aᵀπk − sk =: r2
       (3) Sk∆x + Xk∆s = θe − XkSke =: r3

    5. Determine the step-sizes: for r ∈ (0, 1) (e.g., r = 0.99),

       αP = min{1, r · min_{j : ∆xj < 0} {−[xk]j/∆xj}},

       and αD is defined analogously using sk and ∆s. Then set
       xk+1 = xk + αP∆x, (πk+1, sk+1) = (πk, sk) + αD(∆π, ∆s),
       k ← k + 1, and return to Step 2.
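A compact numpy sketch of an Algorithm-13-style method (our own illustrative implementation: the function name, the tiny test LP, and the direct solve of the full Newton system are assumptions, not the book's code):

```python
import numpy as np

def practical_ipm(A, b, c, x, pi, s, r=0.99, eps=1e-8, max_iter=100):
    """Practical primal-dual interior point sketch: infeasible start is
    allowed via the residuals r1, r2; the full Newton system is solved
    directly (a serious code would use the normal equations)."""
    m, n = A.shape
    for _ in range(max_iter):
        r1 = b - A @ x                     # primal residual
        r2 = c - A.T @ pi - s              # dual residual
        gap = s @ x
        if np.linalg.norm(r1) <= eps and np.linalg.norm(r2) <= eps and gap <= eps:
            break                          # stopping criterion (step 2)
        theta = 0.1 * gap / n              # step 3
        # step 4: assemble [A 0 0; 0 A^T I; S 0 X] and solve
        K = np.zeros((2 * n + m, 2 * n + m))
        K[:m, :n] = A
        K[m:m + n, n:n + m] = A.T
        K[m:m + n, n + m:] = np.eye(n)
        K[m + n:, :n] = np.diag(s)
        K[m + n:, n + m:] = np.diag(x)
        d = np.linalg.solve(K, np.concatenate([r1, r2, theta - x * s]))
        dx, dpi, ds = d[:n], d[n:n + m], d[n + m:]
        # step 5: fraction-to-the-boundary step-sizes
        rp = -x[dx < 0] / dx[dx < 0]
        rd = -s[ds < 0] / ds[ds < 0]
        aP = min(1.0, r * rp.min()) if rp.size else 1.0
        aD = min(1.0, r * rd.min()) if rd.size else 1.0
        x = x + aP * dx
        pi, s = pi + aD * dpi, s + aD * ds
    return x, pi, s

# tiny LP: min x1 + 2 x2  s.t.  x1 + x2 = 1, x >= 0; the optimum is (1, 0)
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 2.0])
x_opt, pi_opt, s_opt = practical_ipm(A, b, c, x=np.array([0.5, 0.5]),
                                     pi=np.array([0.0]), s=np.array([1.0, 2.0]))
```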

  • What we covered in this course

    ▶ Some useful math: convexity, separation of convex sets,
      theorems of the alternative
    ▶ Optimality conditions for optimization problems with
      differentiable functions
      ▶ A brief detour into non-differentiable convex unconstrained problems
    ▶ Lagrangian duality for constrained optimization problems
      ▶ and its relationship to optimality conditions
    ▶ Algorithms for solving unconstrained problems with a
      twice-differentiable objective function
    ▶ Projection frameworks for problems with linear equality constraints
    ▶ Barrier and penalty frameworks for solving constrained
      problems, also under differentiability assumptions
      ▶ In detail: barrier methods for linear optimization
    ▶ Role of problem convexity in the analysis of optimality conditions
      and algorithmic performance

  • What we did not cover in this course: Other methods for solving unconstrained problems

    ▶ Conjugate gradient methods
    ▶ Quasi-Newton methods
    ▶ Trust-region methods
    ▶ Methods for continuous but non-differentiable functions
      ▶ E.g., convex non-differentiable functions
    ▶ First-order methods that attempt to improve upon steepest
      descent for large-scale problems

    Note: these methods are motivated by the basic direction-based
    methods we studied, and make modifications to the method
    necessitated by particular features of the problem.

  • What we did not cover in this course: Other methods for solving constrained problems

    ▶ Sequential quadratic programming
    ▶ The reduced gradient method in greater generality
    ▶ Practical versions of barrier and penalty methods
    ▶ ...
    ▶ Specialized methods for convex optimization
      ▶ e.g., cutting plane methods

    Note: most of these methods are extensions of the fundamental
    frameworks we discussed.

  • What we did not cover in this course: Other optimization problem types

    ▶ Problems with generalized inequality constraints
      ▶ Most prominent example: optimization over symmetric
        matrices with constraints “X SPSD”
    ▶ Infinite-dimensional problems (in n and/or m)
    ▶ Equilibrium problems
    ▶ Non-convex problems where global solutions are required
    ▶ Problems where (some of) the variables are required to take
      on integer values

    Note: analysis and algorithm development for such problems is
    often much more complicated, but at its root attempts to extend
    aspects of the analysis we did in the traditional framework.

  • So, you’ve got yourself an optimization problem: What algorithms/software to use?

    ▶ First, study what kind of problem you are dealing with:
      ▶ Linear, discrete, continuous nonlinear?
      ▶ Unconstrained?
      ▶ Constrained with bound/linear constraints only?
      ▶ General non-linear constraints?
      ▶ Special structure (network, fixed-point, semi-definite
        programming, least squares, etc.)?
      ▶ Convex?
      ▶ Local or global solution needed?
    ▶ Other features of the problem to take into account:
      ▶ Size, i.e., number of variables and constraints
      ▶ Differentiability (are formulas for derivatives of the appropriate
        order available; are numerical approximation techniques
        applicable, etc.)
      ▶ Function behavior
      ▶ Structure of the problem

  • So, you’ve got yourself an optimization problem: What algorithms/software to use?

    ▶ Review what you learned in this course (and expand on what
      you learned) to see what type of algorithm may be able to
      handle your problem based on the features identified above
    ▶ Explore what commercial and/or free software might be
      capable of solving your problem, based on the algorithms it
      implements. Some resources:
      ▶ Matlab’s optimization package user manual (the “deep cuts”)
      ▶ The NEOS Optimization guide and the NEOS server
        http://neos-guide.org
      ▶ “Decision Tree” for Optimization Software
        http://plato.la.asu.edu/guide.html
    ▶ If the above fails:
      ▶ Can your problem be formulated differently?
      ▶ Write your own solver (for this problem type)?
