Chapter IV
Unconstrained Multivariate Optimization
– Introduction
– Sequential Univariate Searches
– Gradient-Based Methods
– Newton’s Method
– Quasi-Newton Methods
1 Introduction
Multivariate optimization means optimization of a scalar function of several variables:

y = P(x)

and has the general form:

min_x P(x)

where P(x) is a nonlinear scalar-valued function of the vector variable x.
Background
Before we discuss optimization methods, we need to talk about how to characterize nonlinear, multivariable functions such as P(x).

Consider the 2nd order Taylor series expansion about the point x0:

P(x) ≈ P(x0) + ∇xP|x0 (x − x0) + (1/2)(x − x0)^T ∇²xP|x0 (x − x0)

If we let:

a = P(x0) − ∇xP|x0 x0 + (1/2) x0^T ∇²xP|x0 x0

b^T = ∇xP|x0 − x0^T ∇²xP|x0

H = ∇²xP|x0
Then we can re-write the Taylor series expansion as a quadratic approximation for P(x):

P(x) = a + b^T x + (1/2) x^T H x

and the derivatives are:

∇xP(x) = b^T + x^T H

∇²xP(x) = H

We can describe some of the local geometric properties of P(x) using its gradient and Hessian. In fact, there are only a few possibilities for the local geometry, which can easily be differentiated by the eigenvalues of the Hessian (H).

Recall that the eigenvalues of a square matrix (H) are computed by finding all of the roots (λ) of its characteristic equation:

|λI − H| = 0
The possible geometries are :
1. if λi < 0 (i = 1, 2, . . . , n), the Hessian is said to be negative definite. This object has a unique maximum and is what we commonly refer to as a hill (in three dimensions).
2. if λi > 0 (i = 1, 2, . . . , n), the Hessian is said to be positive definite. This object has a unique minimum and is what we commonly refer to as a valley (in three dimensions).
3. if λi < 0 (i = 1, 2, . . . , m) and λi > 0 (i = m + 1, . . . , n), the Hessian is said to be indefinite. This object has neither a unique minimum nor a unique maximum and is what we commonly refer to as a saddle (in three dimensions).
4. if λi < 0 (i = 1, 2, . . . , m) and λi = 0 (i = m + 1, . . . , n), the Hessian is said to be negative semi-definite. This object does not have a unique maximum and is what we commonly refer to as a ridge (in three dimensions).
5. if λi > 0 (i = 1, 2, . . . , m) and λi = 0 (i = m + 1, . . . , n), the Hessian is said to be positive semi-definite. This object does not have a unique minimum and is what we commonly refer to as a trough (in three dimensions).
A well-posed problem has a unique optimum, so we will limit our discussions to either problems with positive definite Hessians (for minimization) or negative definite Hessians (for maximization).

Further, we would prefer to choose units for our decision variables (x) so that the eigenvalues of the Hessian all have approximately the same magnitude. This will scale the problem so that the profit contours are concentric circles and will condition our optimization calculations.
Necessary and Sufficient Conditions
For a twice continuously differentiable scalar function P(x), a point x∗ is an optimum if:

∇xP|x∗ = 0

and:

∇²xP|x∗ is positive definite (a minimum)

∇²xP|x∗ is negative definite (a maximum)

We can use these conditions directly, but doing so usually involves solving a set of simultaneous nonlinear equations (which is usually just as tough as the original optimization problem).
Consider:

P(x) = x^T A x e^{x^T A x}

Then:

∇xP = 2(x^T A) e^{x^T A x} + x^T A x · 2(x^T A) e^{x^T A x}

= 2(x^T A) e^{x^T A x} (1 + x^T A x)

and stationarity of the gradient requires that:

∇xP = 2(x^T A) e^{x^T A x} (1 + x^T A x) = 0

This is a set of very nonlinear equations in the variables x.
Example

Consider the scalar function:

P(x) = 3 + x1 + 2x2 + 4x1x2 + x1² + x2²

or:

P(x) = 3 + [1 2] x + x^T [1 2; 2 1] x

where x = [x1 x2]^T.

Stationarity of the gradient requires:

∇xP = [1 2] + 2x^T [1 2; 2 1] = 0

x = −(1/2) [1 2; 2 1]⁻¹ [1; 2] = [−0.5; 0]
Check the Hessian to classify the type of stationary point:

∇²P = 2 [1 2; 2 1] = [2 4; 4 2]

The eigenvalues of the Hessian are:

|λI − [2 4; 4 2]| = |λ−2 −4; −4 λ−2| = (λ − 2)² − 16 = 0

λ1 = 6 and λ2 = −2

Thus the Hessian is indefinite and the stationary point is a saddle point. This is a trivial example, but it highlights the general procedure for direct use of the optimality conditions.
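The same classification can be checked numerically; a minimal sketch (the helper name `classify_stationary_point` is illustrative, not part of the notes):

```python
import numpy as np

def classify_stationary_point(H, tol=1e-12):
    """Classify a stationary point from the eigenvalues of the (symmetric) Hessian H."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "minimum (positive definite)"
    if np.all(eig < -tol):
        return "maximum (negative definite)"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle (indefinite)"
    return "semi-definite (ridge/trough)"

# Example from the notes: P(x) = 3 + x1 + 2*x2 + 4*x1*x2 + x1**2 + x2**2
H = np.array([[2.0, 4.0], [4.0, 2.0]])   # Hessian = 2*[1 2; 2 1]
b = np.array([1.0, 2.0])

# stationarity: [1 2] + 2 x^T [1 2; 2 1] = 0  =>  solve [1 2; 2 1] x = -(1/2) b
x_star = -0.5 * np.linalg.solve(H / 2.0, b)
print(x_star)                            # -> [-0.5  0. ]
print(np.linalg.eigvalsh(H))             # -> [-2.  6.]
print(classify_stationary_point(H))      # -> saddle (indefinite)
```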
Like the univariate search methods we studied earlier, multivariate optimization methods can be separated into two groups:

i) those which depend solely on function evaluations,

ii) those which use derivative information (either analytical derivatives or numerical approximations).

Regardless of which method you choose to solve a multivariate optimization problem, the general procedure will be:

1) select a starting point(s),
2) choose a search direction,
3) minimize in the chosen search direction,
4) repeat steps 2 & 3 until converged.
Also, successful solution of an optimization problem will require specification of a convergence criterion. Some possibilities include:

‖xk+1 − xk‖ ≤ γ

‖P(xk+1) − P(xk)‖ ≤ δ

‖∇xP|xk‖ ≤ ε
2 Sequential Univariate Searches
Perhaps the simplest multivariable search technique to implement would be a system of sequential univariate searches along some fixed set of directions. Consider the two-dimensional case, where the chosen search directions are parallel to the coordinate axes:
In this algorithm, you:

1. select a starting point x0,

2. select a coordinate direction (e.g. s = [0 1]^T, or [1 0]^T),

3. perform a univariate search:

min_α P(xk + αs)

4. repeat steps 2 and 3, alternating between search directions, until converged.

The problem with this method is that a very large number of iterations may be required to attain a reasonable level of accuracy. If we knew something about the "orientation" of the objective function, the rate of convergence could be greatly enhanced.
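The loop above can be sketched directly; this is a minimal illustration (the helper names and the golden-section line search are illustrative choices, not part of the notes):

```python
import numpy as np

def golden_section(f, lo=-10.0, hi=10.0, tol=1e-8):
    """Minimize a univariate function f on [lo, hi] by golden-section search."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

def coordinate_search(P, x0, n_sweeps=10):
    """Sequential univariate searches along the coordinate axes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_sweeps):
        for i in range(x.size):              # alternate through the axes
            s = np.zeros_like(x)
            s[i] = 1.0
            alpha = golden_section(lambda a: P(x + a * s))
            x = x + alpha * s
    return x

# Quadratic test problem used later in the notes
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
x_min = coordinate_search(P, [3.0, 3.0])
print(x_min)    # -> approximately [1. 1.]
```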
Consider the previous two-dimensional problem, but using the independent search directions [1 1]^T and [1 −1]^T.

Of course, finding the optimum in n steps only works for quadratic objective functions where the Hessian is known and each line search is exact. There is a large number of these optimization techniques, which vary only in the way that the search directions are chosen.
Nelder-Mead Simplex

This approach has nothing to do with the SIMPLEX method of linear programming. The method derives its name from an n-dimensional polytope (and as a result is often referred to as the "polytope" method).

A polytope is an n-dimensional object that has n + 1 vertices:

n    polytope        number of vertices
1    line segment    2
2    triangle        3
3    tetrahedron     4

It is an easily implemented direct search method that relies only on objective function evaluations. As a result, it can be robust to some types of discontinuities and so forth. However, it is often slow to converge and is not useful for larger problems (> 10 variables).
To illustrate the basic idea of the method, consider the two-dimensional minimization problem:
Step 1:

- P1 = max(P1, P2, P3); define P4 as the reflection of P1 through the centroid of the line joining P2 and P3.

- If P4 ≤ max(P1, P2, P3), form a new polytope from the points P2, P3, P4.

Step 2:

- repeat the procedure in Step 1 to form new polytopes.

We can repeat the procedure until we reach the polytope defined by P4, P5, P6. Further iterations will cause us to flip between two polytopes. We can eliminate these problems by introducing two further operations into the method:

- contraction when the reflection step does not offer any improvement,

- expansion to accelerate convergence when the reflection step is an improvement.
For minimization, the algorithm is:

(a) Order according to the values at the vertices:

P(x1) ≤ P(x2) ≤ . . . ≤ P(xn+1)

(b) Calculate xo, the centre of gravity of all points except xn+1:

xo = (1/n) Σ_{i=1}^{n} xi

(c) Reflection: compute the reflected point

xr = xo + α(xo − xn+1)

If P(x1) ≤ P(xr) < P(xn):

The reflected point is better than the second worst, but not better than the best; obtain a new simplex by replacing the worst point xn+1 with the reflected point xr, and go to step (a).
If P(xr) < P(x1):

The reflected point is the best point so far; go to step (d).

(d) Expansion: compute the expanded point

xe = xo + γ(xo − xn+1)

If P(xe) < P(xr):

The expanded point is better than the reflected point; obtain a new simplex by replacing the worst point xn+1 with the expanded point xe, and go to step (a).

Else:

Obtain a new simplex by replacing the worst point xn+1 with the reflected point xr, and go to step (a).
Else:

The reflected point is worse than the second worst; continue at step (e).

(e) Contraction: here it is certain that P(xr) ≥ P(xn). Compute the contracted point

xc = xn+1 + ρ(xo − xn+1)

If P(xc) ≤ P(xn+1):

The contracted point is better than the worst point; obtain a new simplex by replacing the worst point xn+1 with the contracted point xc, and go to step (a).

Else:

go to step (f).
(f) Reduction: for all but the best point, replace the point with

xi = x1 + σ(xi − x1), ∀i ∈ {2, . . . , n + 1}

and go to step (a).

(g) End of the algorithm.

Note: α, γ, ρ and σ are respectively the reflection, expansion, contraction and shrink coefficients. Standard values are α = 1, γ = 2, ρ = 1/2 and σ = 1/2.
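Steps (a) to (f) can be rendered compactly in code; the following is a minimal, unoptimized sketch of the algorithm above (the initial simplex construction and termination test are illustrative choices):

```python
import numpy as np

def nelder_mead(P, x0, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5,
                max_iter=200, tol=1e-10):
    """Minimize P with the Nelder-Mead polytope method (steps (a)-(f))."""
    n = len(x0)
    simplex = [np.asarray(x0, float)]
    for i in range(n):                       # x0 plus a unit step along each axis
        v = np.asarray(x0, float)
        v[i] += 1.0
        simplex.append(v)
    for _ in range(max_iter):
        simplex.sort(key=P)                  # (a) order vertices by value
        if abs(P(simplex[-1]) - P(simplex[0])) < tol:
            break
        xo = np.mean(simplex[:-1], axis=0)   # (b) centroid, excluding the worst
        worst = simplex[-1]
        xr = xo + alpha * (xo - worst)       # (c) reflection
        if P(simplex[0]) <= P(xr) < P(simplex[-2]):
            simplex[-1] = xr
            continue
        if P(xr) < P(simplex[0]):            # (d) expansion
            xe = xo + gamma * (xo - worst)
            simplex[-1] = xe if P(xe) < P(xr) else xr
            continue
        xc = worst + rho * (xo - worst)      # (e) contraction
        if P(xc) <= P(worst):
            simplex[-1] = xc
            continue
        # (f) reduction (shrink all vertices toward the best point)
        simplex = [simplex[0]] + [simplex[0] + sigma * (v - simplex[0])
                                  for v in simplex[1:]]
    return min(simplex, key=P)

# Worked example from the notes: min x1^2 + x2^2 - 2x1 - 2x2 + 2
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
x_min = nelder_mead(P, [3.0, 3.0])
print(x_min)    # -> approximately [1. 1.]
```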
For the reflection: since xn+1 is the vertex with the highest value among the vertices, we can expect to find a lower value at the reflection of xn+1 in the opposite face formed by all vertices xi except xn+1.

For the expansion: if the reflected point xr is the new minimum among the vertices, we can expect to find interesting values along the direction from xo to xr.

Concerning the contraction: if P(xr) ≥ P(xn), we can expect that a better value will be inside the simplex formed by all the vertices xi.

The initial simplex is important; indeed, a too-small initial simplex can lead to a local search, and consequently the method can more easily get stuck. This simplex should therefore depend on the nature of the problem.

The Nelder-Mead Simplex method can be slow to converge, but it is useful for functions whose derivatives cannot be calculated or approximated (e.g. some non-smooth functions).
Example

min_x x1² + x2² − 2x1 − 2x2 + 2

k    x1            x2         x3            P1     P2    P3     xr            Pr     notes
0    [3 3]^T       [3 2]^T    [2 2]^T       8      5     2      [2 1]^T       1      expansion
1    [2 1]^T       [3 2]^T    [2 2]^T       1      5     2      [1 1]^T       0      expansion
2    [2 1]^T       [1 1]^T    [2 2]^T       1      0     2      [1 0]^T       1      contraction
3    [2 1]^T       [1 1]^T    [7/4 3/2]^T   1      0     13/16  [3/4 3/2]^T   5/16
4    [3/4 3/2]^T   [1 1]^T    [7/4 3/2]^T   5/16   0     13/16
3 Gradient-Based Methods

This family of optimization methods uses first-order derivatives to determine a "descent" direction. (Recall that the gradient gives the direction of quickest increase for the objective function. Thus, the negative of the gradient gives the direction of quickest decrease.)
Steepest Descent

It would seem that the fastest way to find an optimum would be to always move in the direction in which the objective function decreases the fastest. Although this idea is intuitively appealing, it is very rarely true; however, with a good line search every iteration will move closer to the optimum.

The Steepest Descent algorithm is:

1. choose a starting point x0,

2. calculate the search direction:

sk = −[∇xP|xk]^T

3. determine the next iterate for x:

xk+1 = xk + λk sk
by solving the univariate optimization problem:

min_{λk} P(xk + λk sk)

4. repeat steps 2 & 3 until termination. Some termination criteria to consider:

λk ≤ ε

‖xk+1 − xk‖ ≤ ε

‖∇xP|xk‖ ≤ ε

‖P(xk+1) − P(xk)‖ ≤ ε

where ε is some small positive scalar.
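For the quadratic examples that follow, the line search in step 3 has a closed form, so the whole algorithm can be sketched in a few lines (the function name and the quadratic-only exact step are illustrative assumptions):

```python
import numpy as np

def steepest_descent_quadratic(H, b, x0, tol=1e-10, max_iter=500):
    """Steepest descent with exact line search for P(x) = 0.5 x^T H x + b^T x."""
    x = np.asarray(x0, float)
    iters = 0
    for _ in range(max_iter):
        g = H @ x + b                     # gradient (as a column vector)
        if np.linalg.norm(g) <= tol:      # termination: ||grad|| <= eps
            break
        lam = (g @ g) / (g @ H @ g)       # exact step length for a quadratic
        x = x - lam * g                   # move along s = -g
        iters += 1
    return x, iters

# Example 1 from the notes: P = x1^2 + x2^2 - 2x1 - 2x2 + 2,
# i.e. H = 2I and b = [-2, -2] (the constant +2 does not affect the minimizer)
x, iters = steepest_descent_quadratic(2.0 * np.eye(2), np.array([-2.0, -2.0]),
                                      [3.0, 3.0])
print(x, iters)     # -> [1. 1.] in a single iteration

# Example 2: the poorly scaled quadratic needs many more iterations
H2 = np.array([[101.0, -99.0], [-99.0, 101.0]])
x2, iters2 = steepest_descent_quadratic(H2, np.zeros(2), [2.0, 1.0])
print(iters2)       # many iterations, due to the elliptical contours
```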
Example 1

min_x x1² + x2² − 2x1 − 2x2 + 2

∇xP^T = [2x1 − 2; 2x2 − 2]

start at x0 = [3 3]^T

k    xk        ∇xP^T|xk    λk     P(xk)
0    [3 3]^T   [4 4]^T     0.5    8
1    [1 1]^T   [0 0]^T            0
start at x0 = [1 2]^T

k    xk        ∇xP^T|xk    λk     P(xk)
0    [1 2]^T   [0 2]^T     0.5    1
1    [1 1]^T   [0 0]^T            0

This example shows that the method of Steepest Descent is very efficient for well-scaled quadratic optimization problems (the profit contours are concentric circles). This method is not nearly as efficient for non-quadratic objective functions with very elliptical profit contours (poorly scaled).
Example 2

Consider the minimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

Then:

∇xP = x^T [101 −99; −99 101] = [101x1 − 99x2   −99x1 + 101x2]

To develop an objective function for our line search, substitute:

xk+1 = xk + λk sk

into the original objective function:
P(xk + λk sk) = (1/2) (xk − λk ∇xP^T|xk)^T [101 −99; −99 101] (xk − λk ∇xP^T|xk)

= (1/2) (xk − λk [101 −99; −99 101] xk)^T [101 −99; −99 101] (xk − λk [101 −99; −99 101] xk)

To simplify our work, let:

H = [101 −99; −99 101]

which yields:

P(xk + λk sk) = (1/2) [xk^T H xk − 2λk xk^T H² xk + λk² xk^T H³ xk]
This expression is quadratic in λk, thus making it easy to perform an exact line search:

dP/dλk = −xk^T H² xk + λk xk^T H³ xk = 0

Solving for the step length yields:

λk = (xk^T H² xk) / (xk^T H³ xk)

λk = (xk^T [20,002 −19,998; −19,998 20,002] xk) / (xk^T [4,000,004 −3,999,996; −3,999,996 4,000,004] xk)
Starting at x0 = [2 1]^T:

k    xk                             P(xk)           ∇xP^T|xk             λk
0    [2 1]^T                        54.50           [103 −97]^T          0.0050
1    [1.4845 1.4854]^T              4.4104          [2.8809 3.0591]^T    0.4591
2    [0.1618 0.0809]^T              0.3569          [8.3353 −7.8497]^T   0.0050
3    [0.1201 0.1202]^T              0.0289          [0.2331 0.2476]^T    0.4591
4    [0.0131 0.0065]^T              0.0023          [0.6745 −0.6352]^T   0.0050
5    [0.0097 0.0097]^T              1.8915 · 10⁻⁴   [0.0189 0.0200]^T    0.4591
6    [0.0011 0.0005]^T              1.5307 · 10⁻⁵   [0.0547 −0.0514]^T   0.0050
7    [7.868 · 10⁻⁴ 7.872 · 10⁻⁴]^T  1.2387 · 10⁻⁶   [0.0015 0.0016]^T    0.4591
Conjugate Gradients

Steepest Descent was based on the idea that the minimum will be found provided that we always move in the downhill direction. The second example showed that the gradient does not always point in the direction of the optimum, due to the geometry of the problem. Fletcher and Reeves (1964) developed a method which attempts to account for local curvature in determining the next search direction.

The algorithm is:

1. choose a starting point x0,

2. set the initial search direction:

s0 = −[∇xP|x0]^T
3. determine the next iterate for x:

xk+1 = xk + λk sk

by solving the univariate optimization problem:

min_{λk} P(xk + λk sk)

4. calculate the next search direction as:

sk+1 = −∇xP^T|xk+1 + sk (∇xP|xk+1 ∇xP^T|xk+1) / (∇xP|xk ∇xP^T|xk)

5. repeat steps 3 & 4 until termination.
The manner in which the successive search directions are calculated is important. For quadratic functions, these successive search directions are conjugate with respect to the Hessian. This means that for the quadratic function:

P(x) = a + b^T x + (1/2) x^T H x

successive search directions satisfy:

sk^T H sk+1 = 0

As a result, these successive search directions have incorporated within them information about the geometry of the optimization problem.
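The Fletcher-Reeves iteration can be sketched on the poorly scaled quadratic from the notes; the exact line search below is valid for quadratics only, and the function name is illustrative:

```python
import numpy as np

def conjugate_gradients_quadratic(H, x0, tol=1e-8, max_iter=50):
    """Fletcher-Reeves conjugate gradients for P(x) = 0.5 x^T H x."""
    x = np.asarray(x0, float)
    g = H @ x                  # gradient of 0.5 x^T H x
    s = -g                     # first direction: steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return x, k
        lam = -(x @ H @ s) / (s @ H @ s)   # exact step length (quadratic case)
        x = x + lam * s
        g_new = H @ x
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
        s = -g_new + beta * s              # next (conjugate) direction
        g = g_new
    return x, max_iter

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x, iters = conjugate_gradients_quadratic(H, [2.0, 1.0])
print(x, iters)    # -> approximately [0. 0.] after n = 2 iterations
```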
Example

Consider again the optimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

We saw that Steepest Descent was slow to converge on this problem because of its poor scaling. As in the previous example, we will use an exact line search. It can be shown that the optimal step length is:

λk = −(xk^T [101 −99; −99 101] sk) / (sk^T [101 −99; −99 101] sk)
start at x0 = [2 1]^T

k    xk                   P(xk)     ∇xP^T|xk             sk                     λk
0    [2 1]^T              54.50     [103 −97]^T          [−103 97]^T            0.0050
1    [1.4845 1.4854]^T    4.4104    [2.8809 3.0591]^T    [−2.9717 −2.9735]^T    0.50
2    [0 0]^T              0
Notice the much quicker convergence of the algorithm. For a quadratic objective with exact line searches, the Conjugate Gradients method exhibits quadratic convergence.

The quicker convergence comes at an increased computational requirement. This includes:

– a more complex search direction calculation,

– increased storage for maintaining previous search directions and gradients.

These costs are usually small in relation to the performance improvements.
4 Newton’s Method

The Conjugate Gradients method showed that there was a considerable performance gain to be realized by incorporating some problem geometry into the optimization algorithm. However, that method did so only in an approximate sense. Newton’s method does it by using second-derivative information.

In the multivariate case, Newton’s method determines a search direction by using the Hessian to modify the gradient. The method is developed directly from the Taylor series approximation of the objective function. Recall that an objective function can be locally approximated:

P(x) ≈ P(xk) + ∇xP|xk (x − xk) + (1/2)(x − xk)^T ∇²xP|xk (x − xk)
Then, stationarity can be approximated as:

∇xP|x ≈ ∇xP|xk + (x − xk)^T ∇²xP|xk = 0

which can be used to determine an expression for calculating the next point in the minimization procedure:

xk+1 = xk − (∇²xP|xk)⁻¹ (∇xP|xk)^T

Notice that in this case we get both the direction and the step length for the minimization. Further, if the objective function is quadratic, the optimum is found in a single step.
At a given point xk, the algorithm for Newton’s method is:

1. calculate the gradient at the current point, ∇xP|xk,

2. calculate the Hessian at the current point, ∇²xP|xk,

3. calculate the Newton step:

∆xk = −(∇²xP|xk)⁻¹ (∇xP|xk)^T

4. calculate the next point:

xk+1 = xk + ∆xk

5. repeat steps 1 through 4 until termination.
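The four steps above translate almost line-for-line into code; a minimal sketch with analytic gradient and Hessian callables (the function name is illustrative):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: repeatedly take the step dx = -(Hessian)^-1 gradient."""
    x = np.asarray(x0, float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            return x, k
        dx = -np.linalg.solve(hess(x), g)   # Newton step (solve, don't invert)
        x = x + dx
    return x, max_iter

# Quadratic example from the notes: P(x) = 0.5 x^T H x, so grad = H x, hess = H
H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x, iters = newton(lambda x: H @ x, lambda x: H, [2.0, 1.0])
print(x, iters)    # -> [0. 0.] in a single step
```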
Example

Consider again the optimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

The gradient and Hessian for this example are:

∇xP = x^T [101 −99; −99 101],   ∇²xP = [101 −99; −99 101]

start at x0 = [2 1]^T

k    xk        P(xk)    ∇xP^T|xk      ∇²xP|xk
0    [2 1]^T   54.50    [103 −97]^T   [101 −99; −99 101]
1    [0 0]^T   0
As expected, Newton’s method was very effective, even though the problem was not well scaled. Poor scaling was compensated for by using the inverse of the Hessian.

Steepest Descent was very good for the first two iterations, when we were quite far from the optimum. But it slowed rapidly as:

x → x∗,  ∇xP|xk → 0

The Conjugate Gradients approach avoided this difficulty by including some measure of problem geometry in the algorithm. However, the method started out using the Steepest Descent direction and built in curvature information as the algorithm progressed to the optimum.

For higher-dimensional and non-quadratic objective functions, the number of iterations required for convergence can increase substantially.
In general, Newton’s method will yield the best convergence properties, at the cost of an increased computational load to calculate and store the Hessian. The Conjugate Gradients approach can be effective in situations where computational requirements are an issue. However, there are other methods in which the Hessian is approximated using gradient information and the approximate Hessian is used to calculate a Newton step. These are called the Quasi-Newton methods.

As we saw in the univariate case, care must be taken when using Newton’s method for complex functions. Since Newton’s method searches for a stationary point, it can oscillate between such points. For such complex functions, Newton’s method is often implemented with an alternative to step 4:

4'. calculate the next point:

xk+1 = xk + λk ∆xk
where the step length is determined by either:

i) fixing 0 < λk < 1 (so we don’t take a full step), or

ii) performing a line search on λk.

Generally, Newton and Newton-like methods are preferred to other methods; the main difficulty is determining the Hessian.
5 Approximate Newton Methods

As we saw in the univariate case, the derivatives can be approximated by finite differences. In multiple dimensions this can require substantially more calculations. As an example, consider the forward difference approximation of the gradient:

∇xP|xk = [∂P/∂xi]|xk ≈ [P(xk + δi) − P(xk)] / ‖δi‖

where δi is a perturbation in the direction of xi. This approach requires 1 additional objective function evaluation per dimension for forward or backward differencing, and 2 additional objective function evaluations per dimension for central differencing.
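A minimal forward difference gradient sketch (the helper name and the step size h are illustrative choices):

```python
import numpy as np

def fd_gradient(P, x, h=1e-6):
    """Forward-difference gradient: one extra P evaluation per dimension."""
    x = np.asarray(x, float)
    g = np.zeros_like(x)
    Px = P(x)                        # base evaluation, shared by all dimensions
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = h                     # perturbation delta_i along coordinate i
        g[i] = (P(x + d) - Px) / h
    return g

# Quadratic example from the notes: gradient at [3, 3] is [4, 4]
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
g_approx = fd_gradient(P, [3.0, 3.0])
print(g_approx)    # -> approximately [4. 4.]
```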
To approximate the Hessian using only finite differences in the objective function, one could use the form:

∇²xP|xk = [∂²P/∂xi∂xj]|xk ≈ {[P(xk + δi + δj) − P(xk + δj)] − [P(xk + δi) − P(xk)]} / (‖δi‖‖δj‖)

Finite difference approximation of the Hessian requires at least an additional 2 objective function evaluations per dimension. Some differencing schemes will give better performance than others; however, the increased computational load required for difference approximation of the second derivatives precludes the use of this approach for larger problems.

Alternatively, the Hessian can be approximated in terms of gradient information. In the forward difference case, column i of the Hessian can be approximated as:

∇²xP|xk = [∂²P/∂xi∂xj]|xk ≈ [hi],   hi ≈ [∇xP(xk + δi) − ∇xP(xk)] / ‖δi‖
Then, we can form a finite difference approximation to the Hessian as:

H = [hi]

The problem is that this approximate Hessian is often non-symmetric. A symmetric approximation can be formed as:

H̄ = (1/2) [H + H^T]

Whichever approach is used to approximate the derivatives, the Newton step is calculated from them. In general, Newton’s methods based on finite differences can produce results similar to analytical results, provided care is taken in choosing the perturbations δi. (Recall that as the algorithm proceeds, xk → x∗ and ∇xP|xk → 0; then small approximation errors will affect convergence and accuracy.) Some alternatives which can require fewer computations, yet build curvature information as the algorithm proceeds, are the Quasi-Newton family.
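The column-by-column gradient differencing and the symmetrization step can be sketched as follows (assumes an analytic gradient callable; names are illustrative):

```python
import numpy as np

def fd_hessian(grad, x, h=1e-6):
    """Approximate the Hessian column-by-column from gradient differences,
    then symmetrize via 0.5*(H + H^T)."""
    x = np.asarray(x, float)
    n = x.size
    H = np.zeros((n, n))
    g0 = grad(x)
    for i in range(n):
        d = np.zeros(n)
        d[i] = h
        H[:, i] = (grad(x + d) - g0) / h   # column h_i of the approximation
    return 0.5 * (H + H.T)                 # symmetric approximation

# Quadratic example from the notes: grad = A x, so the true Hessian is A
A = np.array([[101.0, -99.0], [-99.0, 101.0]])
H_approx = fd_hessian(lambda x: A @ x, [2.0, 1.0])
print(H_approx)    # -> approximately [[101. -99.], [-99. 101.]]
```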
5.1 Quasi-Newton Methods

This family of methods is the multivariate analog of the univariate Secant methods. They build second derivative (Hessian) information as the algorithm proceeds to the solution, using available gradient information.

Consider the Taylor series expansion for the gradient of the objective function along the step direction sk:

∇xP|xk+1 = ∇xP|xk + ∇²xP|xk (∆xk) + . . .

Truncating the Taylor series at the second-order terms and re-arranging slightly yields the so-called Quasi-Newton Condition:

Hk+1 (∆xk) = ∇xP|xk+1 − ∇xP|xk = ∆gk
Thus, any algorithm which satisfies this condition can build up curvature information as it proceeds from xk to xk+1 in the direction ∆xk; however, the curvature information is gained in only one direction. Then, as these algorithms proceed from iteration to iteration, the approximate Hessian can differ by only a rank-one matrix:

Hk+1 = Hk + vu^T

Combining the Quasi-Newton Condition and the update formula yields:

Hk+1 (∆xk) = (Hk + vu^T)(∆xk) = ∆gk

which yields a solution for the vector v of:

v = (1/(u^T ∆xk)) [∆gk − Hk ∆xk]
Provided ∆xk and u are not orthogonal, the Hessian update formula is given by:

Hk+1 = Hk + (1/(u^T ∆xk)) [∆gk − Hk ∆xk] u^T

Unfortunately, this update formula is not unique with respect to the Quasi-Newton Condition and the step vector sk.

We could add an additional term wz^T to the update formula, where the vector w is any vector from the null space of ∆xk (recall that if w ∈ null(∆xk), then ∆xk^T w = 0). Thus there is a family of such Quasi-Newton methods which differ only in this additional term wz^T.
Perhaps the most successful member of the family is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update:

Hk+1 = Hk + (∆gk ∆gk^T)/(∆gk^T ∆xk) − (Hk ∆xk ∆xk^T Hk)/(∆xk^T Hk ∆xk)

The BFGS update has several advantages, including that it is symmetric and that it can be further simplified if the step is calculated according to:

Hk ∆xk = −λk ∇xP^T|xk

In this case the BFGS update formula simplifies to:

Hk+1 = Hk + (∆gk ∆gk^T)/(∆gk^T ∆xk) + (λk ∇xP^T|xk ∇xP|xk)/(∇xP|xk ∆xk)
General Quasi-Newton Algorithm

1. Initialization:

- choose a starting point x0,

- choose an initial Hessian approximation (usually H0 = I).

2. At iteration k:

- calculate the gradient ∇xP|xk,

- calculate the search direction:

sk = −(Hk)⁻¹ (∇xP^T|xk)

- calculate the next iterate:

xk+1 = xk + λk sk
using a line search on λk,

- compute ∇xP|xk+1, ∆gk and ∆xk,

- update the Hessian approximation (Hk+1) using the method of your choice (e.g. BFGS).

3. Repeat step 2 until termination.
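The algorithm above can be sketched as follows for the quadratic test problem, using the full BFGS update and an exact line search (a practical implementation would use a Wolfe-type line search instead; all names are illustrative):

```python
import numpy as np

def bfgs_quadratic(A, x0, tol=1e-8, max_iter=50):
    """Quasi-Newton (BFGS) minimization of P(x) = 0.5 x^T A x,
    starting from H0 = I, with exact line searches."""
    x = np.asarray(x0, float)
    H = np.eye(x.size)                   # initial Hessian approximation H0 = I
    g = A @ x
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return x, H, k
        s = -np.linalg.solve(H, g)       # search direction s = -H^-1 g
        lam = -(g @ s) / (s @ A @ s)     # exact step length for a quadratic
        dx = lam * s
        x_new = x + dx
        g_new = A @ x_new
        dg = g_new - g
        # BFGS update of the approximate Hessian
        H = (H + np.outer(dg, dg) / (dg @ dx)
               - np.outer(H @ dx, H @ dx) / (dx @ H @ dx))
        x, g = x_new, g_new
    return x, H, max_iter

A = np.array([[101.0, -99.0], [-99.0, 101.0]])
x, H, iters = bfgs_quadratic(A, [2.0, 1.0])
print(x)    # -> approximately [0. 0.]
print(H)    # -> approximately [[101. -99.], [-99. 101.]]
```

Note how the approximate Hessian converges to the true Hessian after the updates, exactly as in the iteration table below.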
Example

Consider again the optimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

The gradient and Hessian for this example are:

∇xP = x^T [101 −99; −99 101],   ∇²xP = [101 −99; −99 101]
start at x0 = [2 1]^T

k    xk                             P(xk)        ∇xP|xk                    Hk                              sk                              λk
0    [2 1]^T                        54.5         [103 −97]                 [1 0; 0 1]                      [−103 97]^T                     0.005
1    [1.48 1.48]^T                  4.41         [2.88 3.05]               [100.52 −99.5; −99.5 100.46]    [−2.97 −2.97]^T                 0.50
2    [7.8 · 10⁻⁵ 7.8 · 10⁻⁵]^T      1.2 · 10⁻⁸   [1.5 · 10⁻⁴ 1.6 · 10⁻⁴]   [101 −99; −99 101]              [−7.8 · 10⁻⁵ −7.8 · 10⁻⁵]^T     1
3    [0 0]^T

It is worth noting that the optimum is found in 3 steps (although we are very close within 2). Also, notice that we have a very accurate approximation for the Hessian (to 7 significant figures) after 2 updates.
The BFGS algorithm was:

- not quite as efficient as Newton’s method,

- approximately as efficient as the Conjugate Gradients approach,

- much more efficient than Steepest Descent.

This is what should be expected for quadratic functions of low dimension. As the dimension of the problem increases, the Newton and Quasi-Newton methods will usually outperform the other methods.

A cautionary note:

In theory, the BFGS algorithm guarantees that the approximate Hessian remains positive definite (and invertible), provided that the initial matrix is chosen to be positive definite. However, round-off errors, and sometimes the search history, can cause the approximate Hessian to become badly conditioned. When this occurs, the approximate Hessian should be reset to some value (often the identity matrix I).
5.2 Convergence

During our discussions we have mentioned the convergence properties of various algorithms. There are a number of ways in which the convergence properties of an algorithm can be characterized. The most conventional approach is to determine the asymptotic behaviour of the sequence of distances between the iterates (xk) and the optimum (x∗) as the iterations of a particular algorithm proceed. Typically, the distance between an iterate and the optimum is defined as:

‖xk − x∗‖

We are most interested in the behaviour of an algorithm within some neighbourhood of the optimum; hence the asymptotic analysis.
The asymptotic rate of convergence is defined as:

lim_{k→∞} ‖xk+1 − x∗‖ / ‖xk − x∗‖^p = β

with the asymptotic order p and the asymptotic error β.

In general, most algorithms show convergence properties which fall within three categories:

- linear (0 ≤ β < 1, p = 1),

- super-linear (β = 0, p = 1),

- quadratic (β > 0, p = 2).
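The definition can be checked numerically by tracking the ratios ‖xk+1 − x∗‖/‖xk − x∗‖ along a run; the sketch below does this for exact-line-search steepest descent on the poorly scaled quadratic (x∗ = 0), measuring distance in the H-norm √(x^T H x) so that the ratio is constant (an illustrative choice; in the plain 2-norm the ratios oscillate between two values):

```python
import numpy as np

def linear_rate(errors):
    """Successive error ratios: estimates beta at order p = 1."""
    return [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]

# Steepest descent with exact line search on P(x) = 0.5 x^T H x
H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x = np.array([2.0, 1.0])
errors = [np.sqrt(x @ H @ x)]           # H-norm distance to x* = 0
for _ in range(8):
    g = H @ x
    lam = (g @ g) / (g @ H @ g)         # exact step length
    x = x - lam * g
    errors.append(np.sqrt(x @ H @ x))

rates = linear_rate(errors)
print(rates)    # ratios are nearly constant (beta about 0.28 < 1): linear convergence
```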
For the methods we have studied, it can be shown that:

(a) the Steepest Descent method exhibits linear convergence in most cases. For quadratic objective functions, convergence becomes very fast (β → 0) as κ(H) → 1 and very slow (β → 1) as κ(H) → ∞,

(b) the Conjugate Gradients method will generally show super-linear (and often nearly quadratic) convergence properties,

(c) Newton’s method converges quadratically,

(d) Quasi-Newton methods can closely approach quadratic rates of convergence.
As in the univariate case, there are two basic approaches:

- direct search (Nelder-Mead Simplex, . . . ),

- derivative-based (Steepest Descent, Newton’s method, . . . ).

Choosing which method to use for a particular problem is a trade-off among:

- ease of implementation,

- convergence speed and accuracy,

- computational limitations.
Unfortunately, there is no best method in all situations. For nearly quadratic functions, Newton’s and Quasi-Newton methods will usually be very efficient. For nearly flat or non-smooth objective functions, the direct search methods are often a good choice. For very large problems (more than 1000 decision variables), algorithms which use only gradient information (Steepest Descent, Conjugate Gradients, Quasi-Newton, etc.) can be the only practical methods.

Remember that the solution to a multivariable optimization problem requires:

- a function (and derivatives) that can be evaluated,

- an appropriate optimization method.