Chapter IV
Unconstrained Multivariate Optimization
– Introduction
– Sequential Univariate Searches
– Gradient-Based Methods
– Newton’s Method
– Quasi-Newton Methods
1 Introduction
Multivariate optimization means optimization of a scalar function of several variables:

y = P(x)

and has the general form:

min_x P(x)

where P(x) is a nonlinear scalar-valued function of the vector variable x.
Background
Before we discuss optimization methods, we need to talk about how to characterize nonlinear, multivariable functions such as P(x).

Consider the 2nd order Taylor series expansion about the point x0:

P(x) ≈ P(x0) + ∇xP|x0 (x − x0) + (1/2)(x − x0)^T ∇²xP|x0 (x − x0)

If we let:

a = P(x0) − ∇xP|x0 x0 + (1/2) x0^T ∇²xP|x0 x0

b^T = ∇xP|x0 − x0^T ∇²xP|x0

H = ∇²xP|x0
Then we can re-write the Taylor series expansion as a quadratic approximation for P(x):

P(x) = a + b^T x + (1/2) x^T H x

and the derivatives are:

∇xP(x) = b^T + x^T H

∇²xP(x) = H

We can describe some of the local geometric properties of P(x) using its gradient and Hessian. In fact, there are only a few possibilities for the local geometry, which can easily be differentiated by the eigenvalues of the Hessian (H).

Recall that the eigenvalues of a square matrix (H) are computed by finding all of the roots (λ) of its characteristic equation:

|λI − H| = 0
The possible geometries are :
1. if λi < 0 (i = 1, 2, . . . , n), the Hessian is said to be negative definite. This object has a unique maximum and is what we commonly refer to as a hill (in three dimensions).
2. if λi > 0 (i = 1, 2, . . . , n), the Hessian is said to be positive definite. This object has a unique minimum and is what we commonly refer to as a valley (in three dimensions).
3. if λi < 0 (i = 1, 2, . . . , m) and λi > 0 (i = m + 1, . . . , n), the Hessian is said to be indefinite. This object has neither a unique minimum nor a unique maximum and is what we commonly refer to as a saddle (in three dimensions).
4. if λi < 0 (i = 1, 2, . . . , m) and λi = 0 (i = m + 1, . . . , n), the Hessian is said to be negative semi-definite. This object does not have a unique maximum and is what we commonly refer to as a ridge (in three dimensions).
5. if λi > 0 (i = 1, 2, . . . , m) and λi = 0 (i = m + 1, . . . , n), the Hessian is said to be positive semi-definite. This object does not have a unique minimum and is what we commonly refer to as a trough (in three dimensions).
A well-posed problem has a unique optimum, so we will limit our discussions to either problems with positive definite Hessians (for minimization) or negative definite Hessians (for maximization).

Further, we would prefer to choose units for our decision variables (x) so that the eigenvalues of the Hessian all have approximately the same magnitude. This will scale the problem so that the profit contours are concentric circles and will condition our optimization calculations.
Necessary and Sufficient Conditions
For a twice continuously differentiable scalar function P(x), a point x∗ is an optimum if:

∇xP|x∗ = 0

and:

∇²xP|x∗ is positive definite (a minimum)

∇²xP|x∗ is negative definite (a maximum)

We can use these conditions directly, but doing so usually involves solving a set of simultaneous nonlinear equations (which is usually just as tough as the original optimization problem).
Consider:

P(x) = x^T A x e^{x^T A x}

Then:

∇xP = 2(x^T A) e^{x^T A x} + x^T A x · 2(x^T A) e^{x^T A x}

= 2(x^T A) e^{x^T A x} (1 + x^T A x)

and stationarity of the gradient requires that:

∇xP = 2(x^T A) e^{x^T A x} (1 + x^T A x) = 0

This is a set of very nonlinear equations in the variables x.
Example

Consider the scalar function:

P(x) = 3 + x1 + 2x2 + 4x1x2 + x1² + x2²

or:

P(x) = 3 + [1 2] x + x^T [1 2; 2 1] x

where x = [x1 x2]^T.

Stationarity of the gradient requires:

∇xP = [1 2] + 2x^T [1 2; 2 1] = 0

x = −(1/2) [1 2; 2 1]⁻¹ [1; 2] = [−0.5; 0]
Check the Hessian to classify the type of stationary point:

∇²P = 2 [1 2; 2 1] = [2 4; 4 2]

The eigenvalues of the Hessian are:

|λI − [2 4; 4 2]| = |λ−2 −4; −4 λ−2| = (λ − 2)² − 16 = 0

λ1 = 6 and λ2 = −2

Thus the Hessian is indefinite and the stationary point is a saddle point. This is a trivial example, but it highlights the general procedure for direct use of the optimality conditions.
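The same classification can be checked numerically; a minimal sketch (the helper name `classify_stationary_point` is illustrative, not part of the notes):

```python
import numpy as np

def classify_stationary_point(H, tol=1e-12):
    """Classify a stationary point from the eigenvalues of the (symmetric) Hessian H."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "minimum (positive definite)"
    if np.all(eig < -tol):
        return "maximum (negative definite)"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle (indefinite)"
    return "semi-definite (ridge/trough)"

# Example from the notes: P(x) = 3 + x1 + 2*x2 + 4*x1*x2 + x1**2 + x2**2
H = np.array([[2.0, 4.0], [4.0, 2.0]])   # Hessian = 2*[1 2; 2 1]
b = np.array([1.0, 2.0])

# stationarity: [1 2] + 2 x^T [1 2; 2 1] = 0  =>  solve [1 2; 2 1] x = -(1/2) b
x_star = -0.5 * np.linalg.solve(H / 2.0, b)
print(x_star)                            # -> [-0.5  0. ]
print(np.linalg.eigvalsh(H))             # -> [-2.  6.]
print(classify_stationary_point(H))      # -> saddle (indefinite)
```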
Like the univariate search methods we studied earlier, multivariate optimization methods can be separated into two groups:

i) those which depend solely on function evaluations,

ii) those which use derivative information (either analytical derivatives or numerical approximations).

Regardless of which method you choose to solve a multivariate optimization problem, the general procedure will be:

1) select a starting point(s),
2) choose a search direction,
3) minimize in the chosen search direction,
4) repeat steps 2 & 3 until converged.
Also, successful solution of an optimization problem will require specification of a convergence criterion. Some possibilities include:

‖xk+1 − xk‖ ≤ γ

‖P(xk+1) − P(xk)‖ ≤ δ

‖∇xP|xk‖ ≤ ε
2 Sequential Univariate Searches
Perhaps the simplest multivariable search technique to implement would be a system of sequential univariate searches along some fixed set of directions. Consider the two-dimensional case, where the chosen search directions are parallel to the coordinate axes:
In this algorithm, you:

1. select a starting point x0,

2. select a coordinate direction (e.g. s = [0 1]^T, or [1 0]^T),

3. perform a univariate search:

min_α P(xk + αs)

4. repeat steps 2 and 3, alternating between search directions, until converged.

The problem with this method is that a very large number of iterations may be required to attain a reasonable level of accuracy. If we knew something about the "orientation" of the objective function, the rate of convergence could be greatly enhanced.
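The loop above can be sketched directly; this is a minimal illustration (the helper names and the golden-section line search are illustrative choices, not part of the notes):

```python
import numpy as np

def golden_section(f, lo=-10.0, hi=10.0, tol=1e-8):
    """Minimize a univariate function f on [lo, hi] by golden-section search."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

def coordinate_search(P, x0, n_sweeps=10):
    """Sequential univariate searches along the coordinate axes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_sweeps):
        for i in range(x.size):              # alternate through the axes
            s = np.zeros_like(x)
            s[i] = 1.0
            alpha = golden_section(lambda a: P(x + a * s))
            x = x + alpha * s
    return x

# Quadratic test problem used later in the notes
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
x_min = coordinate_search(P, [3.0, 3.0])
print(x_min)    # -> approximately [1. 1.]
```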
Consider the previous two-dimensional problem, but using the independent search directions [1 1]^T and [1 −1]^T.

Of course, finding the optimum in n steps only works for quadratic objective functions where the Hessian is known and each line search is exact. There is a large number of these optimization techniques, which vary only in the way that the search directions are chosen.
Nelder-Mead Simplex

This approach has nothing to do with the SIMPLEX method of linear programming. The method derives its name from an n-dimensional polytope (and as a result is often referred to as the "polytope" method).

A polytope is an n-dimensional object that has n + 1 vertices:

n    polytope        number of vertices
1    line segment    2
2    triangle        3
3    tetrahedron     4

It is an easily implemented direct search method that relies only on objective function evaluations. As a result, it can be robust to some types of discontinuities and so forth. However, it is often slow to converge and is not useful for larger problems (> 10 variables).
To illustrate the basic idea of the method, consider the two-dimensional minimization problem:
Step 1:

- P1 = max(P1, P2, P3); define P4 as the reflection of P1 through the centroid of the line joining P2 and P3.

- If P4 ≤ max(P1, P2, P3), form a new polytope from the points P2, P3, P4.

Step 2:

- repeat the procedure in Step 1 to form new polytopes.

We can repeat the procedure until we reach the polytope defined by P4, P5, P6. Further iterations will cause us to flip between two polytopes. We can eliminate these problems by introducing two further operations into the method:

- contraction when the reflection step does not offer any improvement,

- expansion to accelerate convergence when the reflection step is an improvement.
For minimization, the algorithm is:

(a) Order according to the values at the vertices:

P(x1) ≤ P(x2) ≤ . . . ≤ P(xn+1)

(b) Calculate xo, the centre of gravity of all points except xn+1:

xo = (1/n) Σ_{i=1}^{n} xi

(c) Reflection: compute the reflected point

xr = xo + α(xo − xn+1)

If P(x1) ≤ P(xr) < P(xn):

The reflected point is better than the second worst, but not better than the best; obtain a new simplex by replacing the worst point xn+1 with the reflected point xr, and go to step (a).
If P(xr) < P(x1):

The reflected point is the best point so far; go to step (d).

(d) Expansion: compute the expanded point

xe = xo + γ(xo − xn+1)

If P(xe) < P(xr):

The expanded point is better than the reflected point; obtain a new simplex by replacing the worst point xn+1 with the expanded point xe, and go to step (a).

Else:

Obtain a new simplex by replacing the worst point xn+1 with the reflected point xr, and go to step (a).
Else:

The reflected point is worse than the second worst; continue at step (e).

(e) Contraction: here it is certain that P(xr) ≥ P(xn). Compute the contracted point

xc = xn+1 + ρ(xo − xn+1)

If P(xc) ≤ P(xn+1):

The contracted point is better than the worst point; obtain a new simplex by replacing the worst point xn+1 with the contracted point xc, and go to step (a).

Else:

go to step (f).
(f) Reduction: for all but the best point, replace the point with

xi = x1 + σ(xi − x1), ∀i ∈ {2, . . . , n + 1}

and go to step (a).

(g) End of the algorithm.

Note: α, γ, ρ and σ are respectively the reflection, expansion, contraction and shrink coefficients. Standard values are α = 1, γ = 2, ρ = 1/2 and σ = 1/2.
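Steps (a) to (f) can be rendered compactly in code; the following is a minimal, unoptimized sketch of the algorithm above (the initial simplex construction and termination test are illustrative choices):

```python
import numpy as np

def nelder_mead(P, x0, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5,
                max_iter=200, tol=1e-10):
    """Minimize P with the Nelder-Mead polytope method (steps (a)-(f))."""
    n = len(x0)
    simplex = [np.asarray(x0, float)]
    for i in range(n):                       # x0 plus a unit step along each axis
        v = np.asarray(x0, float)
        v[i] += 1.0
        simplex.append(v)
    for _ in range(max_iter):
        simplex.sort(key=P)                  # (a) order vertices by value
        if abs(P(simplex[-1]) - P(simplex[0])) < tol:
            break
        xo = np.mean(simplex[:-1], axis=0)   # (b) centroid, excluding the worst
        worst = simplex[-1]
        xr = xo + alpha * (xo - worst)       # (c) reflection
        if P(simplex[0]) <= P(xr) < P(simplex[-2]):
            simplex[-1] = xr
            continue
        if P(xr) < P(simplex[0]):            # (d) expansion
            xe = xo + gamma * (xo - worst)
            simplex[-1] = xe if P(xe) < P(xr) else xr
            continue
        xc = worst + rho * (xo - worst)      # (e) contraction
        if P(xc) <= P(worst):
            simplex[-1] = xc
            continue
        # (f) reduction (shrink all vertices toward the best point)
        simplex = [simplex[0]] + [simplex[0] + sigma * (v - simplex[0])
                                  for v in simplex[1:]]
    return min(simplex, key=P)

# Worked example from the notes: min x1^2 + x2^2 - 2x1 - 2x2 + 2
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
x_min = nelder_mead(P, [3.0, 3.0])
print(x_min)    # -> approximately [1. 1.]
```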
For the reflection: since xn+1 is the vertex with the highest value among the vertices, we can expect to find a lower value at the reflection of xn+1 in the opposite face formed by all vertices xi except xn+1.

For the expansion: if the reflected point xr is the new minimum among the vertices, we can expect to find interesting values along the direction from xo to xr.

Concerning the contraction: if P(xr) ≥ P(xn), we can expect that a better value will be inside the simplex formed by all the vertices xi.

The initial simplex is important; indeed, a too-small initial simplex can lead to a local search, and consequently the method can more easily get stuck. This simplex should therefore depend on the nature of the problem.

The Nelder-Mead Simplex method can be slow to converge, but it is useful for functions whose derivatives cannot be calculated or approximated (e.g. some non-smooth functions).
Example

min_x x1² + x2² − 2x1 − 2x2 + 2

k    x1            x2         x3            P1     P2    P3     xr            Pr     notes
0    [3 3]^T       [3 2]^T    [2 2]^T       8      5     2      [2 1]^T       1      expansion
1    [2 1]^T       [3 2]^T    [2 2]^T       1      5     2      [1 1]^T       0      expansion
2    [2 1]^T       [1 1]^T    [2 2]^T       1      0     2      [1 0]^T       1      contraction
3    [2 1]^T       [1 1]^T    [7/4 3/2]^T   1      0     13/16  [3/4 3/2]^T   5/16
4    [3/4 3/2]^T   [1 1]^T    [7/4 3/2]^T   5/16   0     13/16
3 Gradient-Based Methods

This family of optimization methods uses first-order derivatives to determine a "descent" direction. (Recall that the gradient gives the direction of quickest increase for the objective function. Thus, the negative of the gradient gives the direction of quickest decrease.)
Steepest Descent

It would seem that the fastest way to find an optimum would be to always move in the direction in which the objective function decreases the fastest. Although this idea is intuitively appealing, it is very rarely true; however, with a good line search every iteration will move closer to the optimum.

The Steepest Descent algorithm is:

1. choose a starting point x0,

2. calculate the search direction:

sk = −[∇xP|xk]^T

3. determine the next iterate for x:

xk+1 = xk + λk sk
by solving the univariate optimization problem:

min_{λk} P(xk + λk sk)

4. repeat steps 2 & 3 until termination. Some termination criteria to consider:

λk ≤ ε

‖xk+1 − xk‖ ≤ ε

‖∇xP|xk‖ ≤ ε

‖P(xk+1) − P(xk)‖ ≤ ε

where ε is some small positive scalar.
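For the quadratic examples that follow, the line search in step 3 has a closed form, so the whole algorithm can be sketched in a few lines (the function name and the quadratic-only exact step are illustrative assumptions):

```python
import numpy as np

def steepest_descent_quadratic(H, b, x0, tol=1e-10, max_iter=500):
    """Steepest descent with exact line search for P(x) = 0.5 x^T H x + b^T x."""
    x = np.asarray(x0, float)
    iters = 0
    for _ in range(max_iter):
        g = H @ x + b                     # gradient (as a column vector)
        if np.linalg.norm(g) <= tol:      # termination: ||grad|| <= eps
            break
        lam = (g @ g) / (g @ H @ g)       # exact step length for a quadratic
        x = x - lam * g                   # move along s = -g
        iters += 1
    return x, iters

# Example 1 from the notes: P = x1^2 + x2^2 - 2x1 - 2x2 + 2,
# i.e. H = 2I and b = [-2, -2] (the constant +2 does not affect the minimizer)
x, iters = steepest_descent_quadratic(2.0 * np.eye(2), np.array([-2.0, -2.0]),
                                      [3.0, 3.0])
print(x, iters)     # -> [1. 1.] in a single iteration

# Example 2: the poorly scaled quadratic needs many more iterations
H2 = np.array([[101.0, -99.0], [-99.0, 101.0]])
x2, iters2 = steepest_descent_quadratic(H2, np.zeros(2), [2.0, 1.0])
print(iters2)       # many iterations, due to the elliptical contours
```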
Example 1

min_x x1² + x2² − 2x1 − 2x2 + 2

∇xP^T = [2x1 − 2; 2x2 − 2]

start at x0 = [3 3]^T

k    xk        ∇xP^T|xk    λk     P(xk)
0    [3 3]^T   [4 4]^T     0.5    8
1    [1 1]^T   [0 0]^T            0
start at x0 = [1 2]^T

k    xk        ∇xP^T|xk    λk     P(xk)
0    [1 2]^T   [0 2]^T     0.5    1
1    [1 1]^T   [0 0]^T            0

This example shows that the method of Steepest Descent is very efficient for well-scaled quadratic optimization problems (the profit contours are concentric circles). This method is not nearly as efficient for non-quadratic objective functions with very elliptical profit contours (poorly scaled).
Example 2

Consider the minimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

Then:

∇xP = x^T [101 −99; −99 101] = [101x1 − 99x2   −99x1 + 101x2]

To develop an objective function for our line search, substitute:

xk+1 = xk + λk sk

into the original objective function:
P(xk + λk sk) = (1/2) (xk − λk ∇xP^T|xk)^T [101 −99; −99 101] (xk − λk ∇xP^T|xk)

= (1/2) (xk − λk [101 −99; −99 101] xk)^T [101 −99; −99 101] (xk − λk [101 −99; −99 101] xk)

To simplify our work, let:

H = [101 −99; −99 101]

which yields:

P(xk + λk sk) = (1/2) [xk^T H xk − 2λk xk^T H² xk + λk² xk^T H³ xk]
This expression is quadratic in λk, thus making it easy to perform an exact line search:

dP/dλk = −xk^T H² xk + λk xk^T H³ xk = 0

Solving for the step length yields:

λk = (xk^T H² xk) / (xk^T H³ xk)

λk = (xk^T [20,002 −19,998; −19,998 20,002] xk) / (xk^T [4,000,004 −3,999,996; −3,999,996 4,000,004] xk)
Starting at x0 = [2 1]^T:

k    xk                             P(xk)           ∇xP^T|xk             λk
0    [2 1]^T                        54.50           [103 −97]^T          0.0050
1    [1.4845 1.4854]^T              4.4104          [2.8809 3.0591]^T    0.4591
2    [0.1618 0.0809]^T              0.3569          [8.3353 −7.8497]^T   0.0050
3    [0.1201 0.1202]^T              0.0289          [0.2331 0.2476]^T    0.4591
4    [0.0131 0.0065]^T              0.0023          [0.6745 −0.6352]^T   0.0050
5    [0.0097 0.0097]^T              1.8915 · 10⁻⁴   [0.0189 0.0200]^T    0.4591
6    [0.0011 0.0005]^T              1.5307 · 10⁻⁵   [0.0547 −0.0514]^T   0.0050
7    [7.868 · 10⁻⁴ 7.872 · 10⁻⁴]^T  1.2387 · 10⁻⁶   [0.0015 0.0016]^T    0.4591
Conjugate Gradients

Steepest Descent was based on the idea that the minimum will be found provided that we always move in the downhill direction. The second example showed that the gradient does not always point in the direction of the optimum, due to the geometry of the problem. Fletcher and Reeves (1964) developed a method which attempts to account for local curvature in determining the next search direction.

The algorithm is:

1. choose a starting point x0,

2. set the initial search direction:

s0 = −[∇xP|x0]^T
3. determine the next iterate for x:

xk+1 = xk + λk sk

by solving the univariate optimization problem:

min_{λk} P(xk + λk sk)

4. calculate the next search direction as:

sk+1 = −∇xP^T|xk+1 + sk (∇xP|xk+1 ∇xP^T|xk+1) / (∇xP|xk ∇xP^T|xk)

5. repeat steps 3 & 4 until termination.
The manner in which the successive search directions are calculated is important. For quadratic functions, these successive search directions are conjugate with respect to the Hessian. This means that for the quadratic function:

P(x) = a + b^T x + (1/2) x^T H x

successive search directions satisfy:

sk^T H sk+1 = 0

As a result, these successive search directions have incorporated within them information about the geometry of the optimization problem.
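The Fletcher-Reeves iteration can be sketched on the poorly scaled quadratic from the notes; the exact line search below is valid for quadratics only, and the function name is illustrative:

```python
import numpy as np

def conjugate_gradients_quadratic(H, x0, tol=1e-8, max_iter=50):
    """Fletcher-Reeves conjugate gradients for P(x) = 0.5 x^T H x."""
    x = np.asarray(x0, float)
    g = H @ x                  # gradient of 0.5 x^T H x
    s = -g                     # first direction: steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return x, k
        lam = -(x @ H @ s) / (s @ H @ s)   # exact step length (quadratic case)
        x = x + lam * s
        g_new = H @ x
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
        s = -g_new + beta * s              # next (conjugate) direction
        g = g_new
    return x, max_iter

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x, iters = conjugate_gradients_quadratic(H, [2.0, 1.0])
print(x, iters)    # -> approximately [0. 0.] after n = 2 iterations
```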
Example

Consider again the optimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

We saw that Steepest Descent was slow to converge on this problem because of its poor scaling. As in the previous example, we will use an exact line search. It can be shown that the optimal step length is:

λk = −(xk^T [101 −99; −99 101] sk) / (sk^T [101 −99; −99 101] sk)
start at x0 = [2 1]^T

k    xk                   P(xk)     ∇xP^T|xk             sk                     λk
0    [2 1]^T              54.50     [103 −97]^T          [−103 97]^T            0.0050
1    [1.4845 1.4854]^T    4.4104    [2.8809 3.0591]^T    [−2.9717 −2.9735]^T    0.50
2    [0 0]^T              0
Notice the much quicker convergence of the algorithm. For a quadratic objective with exact line searches, the Conjugate Gradients method exhibits quadratic convergence.

The quicker convergence comes at an increased computational requirement. This includes:

– a more complex search direction calculation,

– increased storage for maintaining previous search directions and gradients.

These costs are usually small in relation to the performance improvements.
4 Newton’s Method

The Conjugate Gradients method showed that there was a considerable performance gain to be realized by incorporating some problem geometry into the optimization algorithm. However, that method did so only in an approximate sense. Newton’s method does it by using second-derivative information.

In the multivariate case, Newton’s method determines a search direction by using the Hessian to modify the gradient. The method is developed directly from the Taylor series approximation of the objective function. Recall that an objective function can be locally approximated:

P(x) ≈ P(xk) + ∇xP|xk (x − xk) + (1/2)(x − xk)^T ∇²xP|xk (x − xk)
Then, stationarity can be approximated as:

∇xP|x ≈ ∇xP|xk + (x − xk)^T ∇²xP|xk = 0

which can be used to determine an expression for calculating the next point in the minimization procedure:

xk+1 = xk − (∇²xP|xk)⁻¹ (∇xP|xk)^T

Notice that in this case we get both the direction and the step length for the minimization. Further, if the objective function is quadratic, the optimum is found in a single step.
At a given point xk, the algorithm for Newton’s method is:

1. calculate the gradient at the current point, ∇xP|xk,

2. calculate the Hessian at the current point, ∇²xP|xk,

3. calculate the Newton step:

∆xk = −(∇²xP|xk)⁻¹ (∇xP|xk)^T

4. calculate the next point:

xk+1 = xk + ∆xk

5. repeat steps 1 through 4 until termination.
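The four steps above translate almost line-for-line into code; a minimal sketch with analytic gradient and Hessian callables (the function name is illustrative):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: repeatedly take the step dx = -(Hessian)^-1 gradient."""
    x = np.asarray(x0, float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            return x, k
        dx = -np.linalg.solve(hess(x), g)   # Newton step (solve, don't invert)
        x = x + dx
    return x, max_iter

# Quadratic example from the notes: P(x) = 0.5 x^T H x, so grad = H x, hess = H
H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x, iters = newton(lambda x: H @ x, lambda x: H, [2.0, 1.0])
print(x, iters)    # -> [0. 0.] in a single step
```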
Example

Consider again the optimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

The gradient and Hessian for this example are:

∇xP = x^T [101 −99; −99 101],   ∇²xP = [101 −99; −99 101]

start at x0 = [2 1]^T

k    xk        P(xk)    ∇xP^T|xk      ∇²xP|xk
0    [2 1]^T   54.50    [103 −97]^T   [101 −99; −99 101]
1    [0 0]^T   0
As expected, Newton’s method was very effective, even though the problem was not well scaled. Poor scaling was compensated for by using the inverse of the Hessian.

Steepest Descent was very good for the first two iterations, when we were quite far from the optimum. But it slowed rapidly as:

x → x∗,  ∇xP|xk → 0

The Conjugate Gradients approach avoided this difficulty by including some measure of problem geometry in the algorithm. However, the method started out using the Steepest Descent direction and built in curvature information as the algorithm progressed to the optimum.

For higher-dimensional and non-quadratic objective functions, the number of iterations required for convergence can increase substantially.
In general, Newton’s method will yield the best convergence properties, at the cost of an increased computational load to calculate and store the Hessian. The Conjugate Gradients approach can be effective in situations where computational requirements are an issue. However, there are other methods in which the Hessian is approximated using gradient information and the approximate Hessian is used to calculate a Newton step. These are called the Quasi-Newton methods.

As we saw in the univariate case, care must be taken when using Newton’s method for complex functions. Since Newton’s method searches for a stationary point, it can oscillate between such points. For such complex functions, Newton’s method is often implemented with an alternative to step 4:

4'. calculate the next point:

xk+1 = xk + λk ∆xk
where the step length is determined by either:

i) fixing 0 < λk < 1 (so we don’t take a full step), or

ii) performing a line search on λk.

Generally, Newton and Newton-like methods are preferred to other methods; the main difficulty is determining the Hessian.
5 Approximate Newton Methods

As we saw in the univariate case, the derivatives can be approximated by finite differences. In multiple dimensions this can require substantially more calculations. As an example, consider the forward difference approximation of the gradient:

∇xP|xk = [∂P/∂xi]|xk ≈ [P(xk + δi) − P(xk)] / ‖δi‖

where δi is a perturbation in the direction of xi. This approach requires 1 additional objective function evaluation per dimension for forward or backward differencing, and 2 additional objective function evaluations per dimension for central differencing.
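A minimal forward difference gradient sketch (the helper name and the step size h are illustrative choices):

```python
import numpy as np

def fd_gradient(P, x, h=1e-6):
    """Forward-difference gradient: one extra P evaluation per dimension."""
    x = np.asarray(x, float)
    g = np.zeros_like(x)
    Px = P(x)                        # base evaluation, shared by all dimensions
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = h                     # perturbation delta_i along coordinate i
        g[i] = (P(x + d) - Px) / h
    return g

# Quadratic example from the notes: gradient at [3, 3] is [4, 4]
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
g_approx = fd_gradient(P, [3.0, 3.0])
print(g_approx)    # -> approximately [4. 4.]
```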
To approximate the Hessian using only finite differences in the objective function, one could use the form:

∇²xP|xk = [∂²P/∂xi∂xj]|xk ≈ {[P(xk + δi + δj) − P(xk + δj)] − [P(xk + δi) − P(xk)]} / (‖δi‖‖δj‖)

Finite difference approximation of the Hessian requires at least an additional 2 objective function evaluations per dimension. Some differencing schemes will give better performance than others; however, the increased computational load required for difference approximation of the second derivatives precludes the use of this approach for larger problems.

Alternatively, the Hessian can be approximated in terms of gradient information. In the forward difference case, column i of the Hessian can be approximated as:

∇²xP|xk = [∂²P/∂xi∂xj]|xk ≈ [hi],   hi ≈ [∇xP(xk + δi) − ∇xP(xk)] / ‖δi‖
Then, we can form a finite difference approximation to the Hessian as:

H = [hi]

The problem is that this approximate Hessian is often non-symmetric. A symmetric approximation can be formed as:

H̄ = (1/2) [H + H^T]

Whichever approach is used to approximate the derivatives, the Newton step is calculated from them. In general, Newton’s methods based on finite differences can produce results similar to analytical results, provided care is taken in choosing the perturbations δi. (Recall that as the algorithm proceeds, xk → x∗ and ∇xP|xk → 0; then small approximation errors will affect convergence and accuracy.) Some alternatives which can require fewer computations, yet build curvature information as the algorithm proceeds, are the Quasi-Newton family.
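The column-by-column gradient differencing and the symmetrization step can be sketched as follows (assumes an analytic gradient callable; names are illustrative):

```python
import numpy as np

def fd_hessian(grad, x, h=1e-6):
    """Approximate the Hessian column-by-column from gradient differences,
    then symmetrize via 0.5*(H + H^T)."""
    x = np.asarray(x, float)
    n = x.size
    H = np.zeros((n, n))
    g0 = grad(x)
    for i in range(n):
        d = np.zeros(n)
        d[i] = h
        H[:, i] = (grad(x + d) - g0) / h   # column h_i of the approximation
    return 0.5 * (H + H.T)                 # symmetric approximation

# Quadratic example from the notes: grad = A x, so the true Hessian is A
A = np.array([[101.0, -99.0], [-99.0, 101.0]])
H_approx = fd_hessian(lambda x: A @ x, [2.0, 1.0])
print(H_approx)    # -> approximately [[101. -99.], [-99. 101.]]
```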
5.1 Quasi-Newton Methods

This family of methods is the multivariate analog of the univariate Secant methods. They build second derivative (Hessian) information as the algorithm proceeds to the solution, using available gradient information.

Consider the Taylor series expansion for the gradient of the objective function along the step direction sk:

∇xP|xk+1 = ∇xP|xk + ∇²xP|xk (∆xk) + . . .

Truncating the Taylor series at the second-order terms and re-arranging slightly yields the so-called Quasi-Newton Condition:

Hk+1 (∆xk) = ∇xP|xk+1 − ∇xP|xk = ∆gk
Thus, any algorithm which satisfies this condition can build up curvature information as it proceeds from xk to xk+1 in the direction ∆xk; however, the curvature information is gained in only one direction. Then, as these algorithms proceed from iteration to iteration, the approximate Hessian can differ by only a rank-one matrix:

Hk+1 = Hk + vu^T

Combining the Quasi-Newton Condition and the update formula yields:

Hk+1 (∆xk) = (Hk + vu^T)(∆xk) = ∆gk

which yields a solution for the vector v of:

v = (1/(u^T ∆xk)) [∆gk − Hk ∆xk]
Provided ∆xk and u are not orthogonal, the Hessian update formula is given by:

Hk+1 = Hk + (1/(u^T ∆xk)) [∆gk − Hk ∆xk] u^T

Unfortunately, this update formula is not unique with respect to the Quasi-Newton Condition and the step vector sk.

We could add an additional term wz^T to the update formula, where the vector w is any vector from the null space of ∆xk (recall that if w ∈ null(∆xk), then ∆xk^T w = 0). Thus there is a family of such Quasi-Newton methods which differ only in this additional term wz^T.
Perhaps the most successful member of the family is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update:

Hk+1 = Hk + (∆gk ∆gk^T)/(∆gk^T ∆xk) − (Hk ∆xk ∆xk^T Hk)/(∆xk^T Hk ∆xk)

The BFGS update has several advantages, including that it is symmetric and that it can be further simplified if the step is calculated according to:

Hk ∆xk = −λk ∇xP^T|xk

In this case the BFGS update formula simplifies to:

Hk+1 = Hk + (∆gk ∆gk^T)/(∆gk^T ∆xk) + (λk ∇xP^T|xk ∇xP|xk)/(∇xP|xk ∆xk)
General Quasi-Newton Algorithm

1. Initialization:

- choose a starting point x0,

- choose an initial Hessian approximation (usually H0 = I).

2. At iteration k:

- calculate the gradient ∇xP|xk,

- calculate the search direction:

sk = −(Hk)⁻¹ (∇xP^T|xk)

- calculate the next iterate:

xk+1 = xk + λk sk
using a line search on λk,

- compute ∇xP|xk+1, ∆gk and ∆xk,

- update the Hessian approximation (Hk+1) using the method of your choice (e.g. BFGS).

3. Repeat step 2 until termination.
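The algorithm above can be sketched as follows for the quadratic test problem, using the full BFGS update and an exact line search (a practical implementation would use a Wolfe-type line search instead; all names are illustrative):

```python
import numpy as np

def bfgs_quadratic(A, x0, tol=1e-8, max_iter=50):
    """Quasi-Newton (BFGS) minimization of P(x) = 0.5 x^T A x,
    starting from H0 = I, with exact line searches."""
    x = np.asarray(x0, float)
    H = np.eye(x.size)                   # initial Hessian approximation H0 = I
    g = A @ x
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return x, H, k
        s = -np.linalg.solve(H, g)       # search direction s = -H^-1 g
        lam = -(g @ s) / (s @ A @ s)     # exact step length for a quadratic
        dx = lam * s
        x_new = x + dx
        g_new = A @ x_new
        dg = g_new - g
        # BFGS update of the approximate Hessian
        H = (H + np.outer(dg, dg) / (dg @ dx)
               - np.outer(H @ dx, H @ dx) / (dx @ H @ dx))
        x, g = x_new, g_new
    return x, H, max_iter

A = np.array([[101.0, -99.0], [-99.0, 101.0]])
x, H, iters = bfgs_quadratic(A, [2.0, 1.0])
print(x)    # -> approximately [0. 0.]
print(H)    # -> approximately [[101. -99.], [-99. 101.]]
```

Note how the approximate Hessian converges to the true Hessian after the updates, exactly as in the iteration table below.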
Example

Consider again the optimization problem:

min_x (1/2) x^T [101 −99; −99 101] x

The gradient and Hessian for this example are:

∇xP = x^T [101 −99; −99 101],   ∇²xP = [101 −99; −99 101]
start at x0 = [2 1]^T

k    xk                             P(xk)        ∇xP|xk                    Hk                              sk                              λk
0    [2 1]^T                        54.5         [103 −97]                 [1 0; 0 1]                      [−103 97]^T                     0.005
1    [1.48 1.48]^T                  4.41         [2.88 3.05]               [100.52 −99.5; −99.5 100.46]    [−2.97 −2.97]^T                 0.50
2    [7.8 · 10⁻⁵ 7.8 · 10⁻⁵]^T      1.2 · 10⁻⁸   [1.5 · 10⁻⁴ 1.6 · 10⁻⁴]   [101 −99; −99 101]              [−7.8 · 10⁻⁵ −7.8 · 10⁻⁵]^T     1
3    [0 0]^T

It is worth noting that the optimum is found in 3 steps (although we are very close within 2). Also, notice that we have a very accurate approximation for the Hessian (to 7 significant figures) after 2 updates.
The BFGS algorithm was:

- not quite as efficient as Newton’s method,

- approximately as efficient as the Conjugate Gradients approach,

- much more efficient than Steepest Descent.

This is what should be expected for quadratic functions of low dimension. As the dimension of the problem increases, the Newton and Quasi-Newton methods will usually outperform the other methods.

A cautionary note:

In theory, the BFGS algorithm guarantees that the approximate Hessian remains positive definite (and invertible), provided that the initial matrix is chosen to be positive definite. However, round-off errors, and sometimes the search history, can cause the approximate Hessian to become badly conditioned. When this occurs, the approximate Hessian should be reset to some value (often the identity matrix I).
5.2 Convergence

During our discussions we have mentioned the convergence properties of various algorithms. There are a number of ways in which the convergence properties of an algorithm can be characterized. The most conventional approach is to determine the asymptotic behaviour of the sequence of distances between the iterates (xk) and the optimum (x∗) as the iterations of a particular algorithm proceed. Typically, the distance between an iterate and the optimum is defined as:

‖xk − x∗‖

We are most interested in the behaviour of an algorithm within some neighbourhood of the optimum; hence the asymptotic analysis.
The asymptotic rate of convergence is defined as:

lim_{k→∞} ‖xk+1 − x∗‖ / ‖xk − x∗‖^p = β

with the asymptotic order p and the asymptotic error β.

In general, most algorithms show convergence properties which fall within three categories:

- linear (0 ≤ β < 1, p = 1),

- super-linear (β = 0, p = 1),

- quadratic (β > 0, p = 2).
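The definition can be checked numerically by tracking the ratios ‖xk+1 − x∗‖/‖xk − x∗‖ along a run; the sketch below does this for exact-line-search steepest descent on the poorly scaled quadratic (x∗ = 0), measuring distance in the H-norm √(x^T H x) so that the ratio is constant (an illustrative choice; in the plain 2-norm the ratios oscillate between two values):

```python
import numpy as np

def linear_rate(errors):
    """Successive error ratios: estimates beta at order p = 1."""
    return [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]

# Steepest descent with exact line search on P(x) = 0.5 x^T H x
H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x = np.array([2.0, 1.0])
errors = [np.sqrt(x @ H @ x)]           # H-norm distance to x* = 0
for _ in range(8):
    g = H @ x
    lam = (g @ g) / (g @ H @ g)         # exact step length
    x = x - lam * g
    errors.append(np.sqrt(x @ H @ x))

rates = linear_rate(errors)
print(rates)    # ratios are nearly constant (beta about 0.28 < 1): linear convergence
```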
For the methods we have studied, it can be shown that:

(a) the Steepest Descent method exhibits linear convergence in most cases. For quadratic objective functions, convergence becomes very fast (β → 0) as κ(H) → 1 and very slow (β → 1) as κ(H) → ∞,

(b) the Conjugate Gradients method will generally show super-linear (and often nearly quadratic) convergence properties,

(c) Newton’s method converges quadratically,

(d) Quasi-Newton methods can closely approach quadratic rates of convergence.
As in the univariate case, there are two basic approaches:

- direct search (Nelder-Mead Simplex, . . . ),

- derivative-based (Steepest Descent, Newton’s method, . . . ).

Choosing which method to use for a particular problem is a trade-off among:

- ease of implementation,

- convergence speed and accuracy,

- computational limitations.
Unfortunately, there is no best method in all situations. For nearly quadratic functions, Newton’s and Quasi-Newton methods will usually be very efficient. For nearly flat or non-smooth objective functions, the direct search methods are often a good choice. For very large problems (more than 1000 decision variables), algorithms which use only gradient information (Steepest Descent, Conjugate Gradients, Quasi-Newton, etc.) can be the only practical methods.

Remember that the solution to a multivariable optimization problem requires:

- a function (and derivatives) that can be evaluated,

- an appropriate optimization method.