8/9/2019 0456_PDF_07
1/30
Chapter seven
Optimization algorithms
7.1 Optimization
We have already seen that symbolic learning by induction is a search process,
where the search for the correct rule, relationship, or statement is steered by
the examples that are encountered. Numerical learning systems can be viewed
in the same light. An initial model is set up, and its parameters are
progressively refined in the light of experience. The goal is invariably to
determine the maximum or minimum value of some function of one or more
variables. This is the process of optimization. Often the optimization problem
is considered to be one of determining a minimum, and the function that is
being minimized is referred to as a cost function. The cost function might
typically be the difference, or error, between a desired output and the actual
output. Alternatively, optimization is sometimes viewed as maximizing the
value of a function, known then as a fitness function. In fact the two
approaches are equivalent, because the fitness can simply be taken to be the
negation of the cost and vice versa, with the optional addition of a constant
value to keep both cost and fitness positive. Similarly, fitness and cost are
sometimes taken as the reciprocals of each other. The term objective function
embraces both fitness and cost. Optimization of the objective function might
mean either minimizing the cost or maximizing the fitness.
7.2 The search space
The potential solutions to a search problem constitute the search space or
parameter space. If a value is sought for a single variable, or parameter, the
search space is one-dimensional. If simultaneous values of n variables are
sought, the search space is n-dimensional. Invalid combinations of parameter
values can be either explicitly excluded from the search space, or included on
the assumption that they will be rejected by the optimization algorithm. In
combinatorial problems, the search space comprises combinations of values,
2001 by CRC Press LLC
the order of which has no particular significance provided that the meaning of
each value is known. For example, in a steel rolling mill the combination of
parameters that describe the profiles of the rolls can be optimized to maximize
the flatness of the manufactured steel [1]. Here, each possible combination of
parameter values represents a point in the search space. The extent of the
search space is constrained by any limits that apply to the variables.
In contrast, permutation problems involve the ordering of certain
attributes. One of the best known examples is the traveling salesperson
problem, in which the salesperson must find the shortest route between cities of known
location, visiting each city only once. This sort of problem has many real
applications, such as in the routing of electrical connections on a
semiconductor chip. For each permutation of cities, known as a tour, we can
evaluate the cost function as the sum of distances traveled. Each possible tour
represents a point in the search space. Permutation problems are often cyclic,
so the tour ABCDE is considered the same as BCDEA.
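A sketch of how such a cost function might be evaluated (the city names and coordinates here are hypothetical, chosen only for illustration):

```python
import math

def tour_cost(tour, coords):
    """Sum of Euclidean distances around a cyclic tour."""
    total = 0.0
    for i in range(len(tour)):
        a = coords[tour[i]]
        b = coords[tour[(i + 1) % len(tour)]]  # wrap around: the tour is cyclic
        total += math.dist(a, b)
    return total

# Hypothetical city coordinates on a unit square
coords = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}

# Because tours are cyclic, a rotation of a tour has the same cost
print(tour_cost("ABCD", coords) == tour_cost("BCDA", coords))  # True
```

Each distinct tour is one point in the search space, and `tour_cost` assigns it a cost.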
The metaphor of space relies on the notion that certain points in the search
space can be considered closer together than others. In the traveling
salesperson example, the tour ABCDE is close to ABDCE, but DACEB is
distant from both of them. This separation can be measured
intuitively in terms of the number of pair-wise swaps required to turn one tour
into another. In the case of binary patterns, the separation of the patterns is
usually measured as the Hamming distance between them, i.e., the number of
bit positions that contain different values. For instance, the binary patterns
01101 and 11110 have a Hamming separation of 3.
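The Hamming distance can be computed directly by counting differing positions, for example:

```python
def hamming(a, b):
    """Number of bit positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("01101", "11110"))  # 3
```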
We can associate a fitness value with each point in the search space. By
plotting the fitness for a two-dimensional search space, we obtain a fitness
landscape (Figure 7.1). Here the two search parameters are x and y,
constrained within a range of allowed values. For higher dimensions of search
space a fitness landscape still exists, but is difficult to visualize. A suitable
optimization algorithm would involve finding peaks in the fitness landscape or
valleys in the cost landscape. Regardless of the number of dimensions, there is
a risk of finding a local optimum rather than the global optimum for the
function. A global optimum is the point in the search space with the highest
fitness. A local optimum is a point whose fitness is higher than that of all its
near neighbors but lower than that of the global optimum.
If neighboring points in the search space have a similar fitness, the
landscape is said to be smooth or correlated. The fitness of any individual
point in the search space is, therefore, representative of the quality of the
surrounding region. Where neighboring points have very different fitnesses,
the landscape is said to be rugged. Rugged landscapes typically have large
numbers of local optima, and the fitness of an individual point in the search
space will not necessarily reflect that of its neighbors.
The idea of a fitness landscape assumes that the function to be optimized
remains constant during the optimization process. If this assumption cannot be
made, as might be the case in a real-time system, we can think of the problem
as finding an optimum in a fitness seascape [2].
7.3 Searching the search space
Determining the optimum for an objective function of multiple variables is not
straightforward, even when the landscape is static. Although exhaustively
evaluating the fitness of each point in the search space will always reveal the
optimum, this is usually impracticable because of the enormity of the search
space. Thus, the essence of all the numerical optimization techniques is to
determine the optimum point in the search space by examining only a fraction
of all possible candidates.
The techniques described here are all based upon the idea of choosing a
starting point and then altering one or more variables in an attempt to increase
the fitness or reduce the cost. The various approaches have the following two
key characteristics.
(i) Whether they are based on a single candidate or a population of
candidates
Some of the methods to be described, such as hill-climbing, maintain a
single best solution so far which is refined until no further increase in
fitness can be achieved.

Figure 7.1 A fitness landscape, plotting fitness against the two search parameters x and y, with the global optimum and local optima marked

Genetic algorithms, on the other hand, maintain a
population of candidate solutions. The overall fitness of the population
generally improves with each generation, although some decidedly unfit
individual candidates may be added along the way.
(ii) Whether new candidates can be distant in the search space from the
existing ones
Methods such as hill-climbing take small steps from the start point until
they reach either a local or global optimum. To guard against missing the
global optimum, it is advisable to repeat the process several times, starting
from different points in the search space. An alternative approach, adopted
in genetic algorithms and simulated annealing, is to begin with the
freedom to roam around the whole of the search space in order to find the
regions of highest fitness. This initial exploration phase is followed by
exploitation, i.e., a detailed search of the best regions of the search space
identified during exploration. Methods, such as genetic algorithms, that
use a population of candidates rather than just one allow several regions to
be explored at the same time.
7.4 Hill-climbing and gradient descent algorithms
7.4.1 Hill-climbing
The name hill-climbing implies that optimization is viewed as the search for a
maximum in a fitness landscape. However, the method can equally be applied
to a cost landscape, in which case a better name might be valley descent. It is
the simplest of the optimization procedures described here. The algorithm is
easy to implement, but is inefficient and offers no protection against finding a
local optimum rather than the global one. From a randomly selected start point
in the search space, i.e., a trial solution, a step is taken in a random direction. If
the fitness of the new point is greater than that of the previous position, it is accepted
as the new trial solution. Otherwise the trial solution is unchanged. The process
is repeated until the algorithm no longer accepts any steps from the trial
solution. At this point the trial solution is assumed to be the optimum. As noted
above, one way of guarding against the trap of detecting a local optimum is to
repeat the process many times with different starting points.
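A minimal sketch of this procedure with repeated random restarts (the fitness function, step size, and iteration counts here are illustrative assumptions, not values from the text):

```python
import random

def hill_climb(fitness, start, step=0.1, max_iter=1000):
    """Random-step hill-climbing from a single start point."""
    x = start
    for _ in range(max_iter):
        candidate = x + random.uniform(-step, step)  # step in a random direction
        if fitness(candidate) > fitness(x):          # accept only improvements
            x = candidate
    return x

def restarts(fitness, n=10):
    """Guard against local optima by restarting from several random points."""
    return max((hill_climb(fitness, random.uniform(-5, 5)) for _ in range(n)),
               key=fitness)

# Hypothetical fitness with its maximum at x = 3
best = restarts(lambda x: -(x - 3) ** 2)
# best converges near the global maximum at x = 3
```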
7.4.2 Steepest gradient descent or ascent
Steepest gradient descent (or ascent) is a refinement of hill-climbing that can
speed the convergence toward a minimum cost (or maximum fitness). It is only
slightly more sophisticated than hill-climbing, and it offers no protection
against finding a local minimum rather than the global one. From a given
starting point, i.e., a trial solution, the direction of steepest descent is
determined. A point lying a small distance along this direction is then taken as
the new trial solution. The process is repeated until it is no longer possible to
descend, at which point it is assumed that the optimum has been reached.
If the search space is not continuous but discrete, i.e., it is made up of
separate individual points, at each step the new trial solution is the neighbor
with the highest fitness or lowest cost. The most extreme form of discrete data
is where the search parameters are binary, i.e., they have only two possible
values. The parameters can then be placed together so that any point in the
search space is represented as a binary string and neighboring points are those
at a Hamming distance (see Section 7.2) of 1 from the current trial solution.
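A sketch of discrete steepest ascent over binary strings, using the number of 1 bits as an illustrative fitness function (an assumption for demonstration, not an example from the text):

```python
def steepest_ascent(fitness, bits):
    """Repeatedly move to the fittest neighbor at Hamming distance 1."""
    current = bits
    while True:
        # All strings differing from the current one in exactly one bit position
        neighbors = [current[:i] + ("0" if current[i] == "1" else "1") + current[i + 1:]
                     for i in range(len(current))]
        best = max(neighbors, key=fitness)
        if fitness(best) <= fitness(current):
            return current  # no neighbor improves: a local (or global) optimum
        current = best

# Fitness = count of 1 bits; the optimum is the all-ones string
print(steepest_ascent(lambda s: s.count("1"), "01001"))  # 11111
```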
7.4.3 Gradient-proportional descent
Gradient-proportional descent, often simply called gradient descent, is a
variant of steepest gradient descent that can be applied in a cost landscape that
is continuous and differentiable, i.e., where the variables can take any value
within the allowed range and the cost varies smoothly. Rather than choosing a
fixed step size, the size of the steps is allowed to vary in proportion to the local
gradient of descent.
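A minimal sketch of gradient-proportional descent, assuming a simple quadratic cost whose gradient is known analytically (the cost, learning rate, and starting point are illustrative choices):

```python
def gradient_descent(grad, x, rate=0.1, steps=100):
    """Take steps proportional to the local gradient: large on steep
    slopes, small near the optimum."""
    for _ in range(steps):
        x = x - rate * grad(x)
    return x

# Hypothetical cost f(x) = (x - 2)^2, whose gradient is 2(x - 2); minimum at x = 2
x_min = gradient_descent(lambda x: 2 * (x - 2), x=10.0)
print(round(x_min, 4))  # 2.0
```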
7.4.4 Conjugate gradient descent or ascent
Conjugate gradient descent (or ascent) is a simple attempt at avoiding the
problem of finding a local, rather than global, optimum in the cost (or fitness)
landscape. From a given starting point in the cost landscape, the direction of
steepest descent is initially chosen. New trial solutions are then taken by
stepping along this direction, with the same direction being retained until the
slope begins to curve uphill. When this happens, an alternative direction
having a downhill gradient is chosen. When the direction that has been
followed curves uphill, and all of the alternative directions are also uphill, it is
assumed that the optimum has been reached. As the method does not
continually hunt for the sharpest descent, it may be more successful than the
steepest gradient descent method in finding the global minimum. However, the
technique will never cause a gradient to be climbed, even though this would be
necessary in order to escape a local minimum and thereby reach the global
minimum.
7.5 Simulated annealing
Simulated annealing [3] owes its name to its similarity to the problem of atoms
rearranging themselves in a cooling metal. In the cooling metal, atoms move to
form a near-perfect crystal lattice, even though they may have to overcome a
localized energy barrier called the activation energy, Ea, in order to do so. The
atomic rearrangements within the crystal are probabilistic. The probability P of
an atom jumping into a neighboring site is given by:

P = exp(−Ea / kT)    (7.1)

where k is Boltzmann's constant and T is temperature. At high temperatures,
the probability approaches 1, while at T = 0 the probability is 0.
In simulated annealing, a trial solution is chosen and the effects of taking a
small random step from this position are tested. If the step results in a
reduction in the cost function, it replaces the previous solution as the current
trial solution. If it does not result in a cost saving, the solution still has a
probability P of being accepted as the new trial solution, given by:

P = exp(−ΔE / T)    (7.2)

This function is shown in Figure 7.2(a). Here, ΔE is the increase in the cost
function that would result from the step and is, therefore, analogous to the
activation energy in the atomic system. There is no need to include
Boltzmann's constant, as ΔE and T no longer represent real energies or
temperatures.
The temperature T is simply a numerical value that determines the stability
of a trial solution. If T is high, new trial solutions will be generated
continually. If T is low, the trial solution will move to a local or global cost
minimum if it is not there already and will remain there. The value of T
is initially set high and is periodically reduced according to a cooling schedule.
A commonly used simple cooling schedule is:
Tt+1 = αTt    (7.3)

where Tt is the temperature at step number t and α is a constant close to, but
below, 1. While T is high, the optimization routine is free to accept many
varied solutions, but as T drops, this freedom diminishes. At T = 0, the method
is equivalent to the hill-climbing algorithm, as shown in Figure 7.2(b).
If the optimization is successful, the final solution will be the global
minimum. The success of the technique is dependent upon values chosen for
starting temperature, the size and frequency of the temperature decrement, and
the size of perturbations applied to the trial solutions. A flowchart for the
simulated annealing algorithm is given in Figure 7.3.
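The procedure can also be sketched in code, combining the acceptance rule of equation 7.2 with the cooling schedule of equation 7.3 (the cost function, starting temperature, cooling constant, and step size below are illustrative assumptions, not values from the text):

```python
import math
import random

def simulated_annealing(cost, x, step=1.0, t=10.0, alpha=0.95, iters=2000):
    """Minimize cost using acceptance probability P = exp(-dE / T)
    and the geometric cooling schedule T(t+1) = alpha * T(t)."""
    for _ in range(iters):
        candidate = x + random.uniform(-step, step)  # small random perturbation
        d_e = cost(candidate) - cost(x)
        # Always accept improvements; accept worse moves with probability exp(-dE/T)
        if d_e < 0 or random.random() < math.exp(-d_e / t):
            x = candidate
        t *= alpha  # cooling schedule (equation 7.3)
    return x

# Hypothetical cost with its minimum at x = 1
result = simulated_annealing(lambda x: (x - 1) ** 2, 8.0)
# result settles near the minimum once T has fallen and only improvements are accepted
```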
Johnson and Picton [4] have described a variant of simulated annealing in
which the acceptance of a trial solution is always probabilistic, even
Figure 7.2 The probability P of accepting a trial solution, plotted against ΔE / T