8/9/2019 0456_PDF_07
1/30
Chapter seven
Optimization algorithms
7.1 Optimization
We have already seen that symbolic learning by induction is a search process,
where the search for the correct rule, relationship, or statement is steered by
the examples that are encountered. Numerical learning systems can be viewed
in the same light. An initial model is set up, and its parameters are
progressively refined in the light of experience. The goal is invariably to
determine the maximum or minimum value of some function of one or more
variables. This is the process of optimization. Often the optimization problem
is considered to be one of determining a minimum, and the function that is
being minimized is referred to as a cost function. The cost function might
typically be the difference, or error, between a desired output and the actual
output. Alternatively, optimization is sometimes viewed as maximizing the
value of a function, known then as a fitness function. In fact the two
approaches are equivalent, because the fitness can simply be taken to be the
negation of the cost and vice versa, with the optional addition of a constant
value to keep both cost and fitness positive. Similarly, fitness and cost are
sometimes taken as the reciprocals of each other. The term objective function
embraces both fitness and cost. Optimization of the objective function might
mean either minimizing the cost or maximizing the fitness.
7.2 The search space
The potential solutions to a search problem constitute the search space or
parameter space. If a value is sought for a single variable, or parameter, the
search space is one-dimensional. If simultaneous values of n variables are
sought, the search space is n-dimensional. Invalid combinations of parameter
values can be either explicitly excluded from the search space, or included on
the assumption that they will be rejected by the optimization algorithm. In
combinatorial problems, the search space comprises combinations of values,
2001 by CRC Press LLC
the order of which has no particular significance provided that the meaning of
each value is known. For example, in a steel rolling mill the combination of
parameters that describe the profiles of the rolls can be optimized to maximize
the flatness of the manufactured steel [1]. Here, each possible combination of
parameter values represents a point in the search space. The extent of the
search space is constrained by any limits that apply to the variables.
In contrast, permutation problems involve the ordering of certain
attributes. One of the best known examples is the traveling salesperson
problem, in which the salesperson must find the shortest route between cities of known
location, visiting each city only once. This sort of problem has many real
applications, such as in the routing of electrical connections on a
semiconductor chip. For each permutation of cities, known as a tour, we can
evaluate the cost function as the sum of distances traveled. Each possible tour
represents a point in the search space. Permutation problems are often cyclic,
so the tour ABCDE is considered the same as BCDEA.
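A sketch of how such a cost function might be evaluated (the city names and coordinates here are hypothetical, chosen only for illustration):

```python
import math

def tour_cost(tour, coords):
    """Sum of Euclidean distances around a cyclic tour."""
    total = 0.0
    for i in range(len(tour)):
        a = coords[tour[i]]
        b = coords[tour[(i + 1) % len(tour)]]  # wrap around: the tour is cyclic
        total += math.dist(a, b)
    return total

# Hypothetical city coordinates on a unit square
coords = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}

# Because tours are cyclic, a rotation of a tour has the same cost
print(tour_cost("ABCD", coords) == tour_cost("BCDA", coords))  # True
```

Each distinct tour is one point in the search space, and `tour_cost` assigns it a cost.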
The metaphor of space relies on the notion that certain points in the search
space can be considered closer together than others. In the traveling
salesperson example, the tour ABCDE is close to ABDCE, but DACEB is
distant from both of them. This separation can be measured
intuitively in terms of the number of pair-wise swaps required to turn one tour
into another. In the case of binary patterns, the separation of the patterns is
usually measured as the Hamming distance between them, i.e., the number of
bit positions that contain different values. For instance, the binary patterns
01101 and 11110 have a Hamming separation of 3.
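The Hamming distance can be computed directly by counting differing positions, for example:

```python
def hamming(a, b):
    """Number of bit positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("01101", "11110"))  # 3
```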
We can associate a fitness value with each point in the search space. By
plotting the fitness for a two-dimensional search space, we obtain a fitness
landscape (Figure 7.1). Here the two search parameters are x and y,
constrained within a range of allowed values. For higher dimensions of search
space a fitness landscape still exists, but is difficult to visualize. A suitable
optimization algorithm would involve finding peaks in the fitness landscape or
valleys in the cost landscape. Regardless of the number of dimensions, there is
a risk of finding a local optimum rather than the global optimum for the
function. A global optimum is the point in the search space with the highest
fitness. A local optimum is a point whose fitness is higher than that of all its
near neighbors but lower than that of the global optimum.
If neighboring points in the search space have a similar fitness, the
landscape is said to be smooth or correlated. The fitness of any individual
point in the search space is, therefore, representative of the quality of the
surrounding region. Where neighboring points have very different fitnesses,
the landscape is said to be rugged. Rugged landscapes typically have large
numbers of local optima, and the fitness of an individual point in the search
space will not necessarily reflect that of its neighbors.
The idea of a fitness landscape assumes that the function to be optimized
remains constant during the optimization process. If this assumption cannot be
made, as might be the case in a real-time system, we can think of the problem
as finding an optimum in a fitness seascape [2].
7.3 Searching the search space
Determining the optimum for an objective function of multiple variables is not
straightforward, even when the landscape is static. Although exhaustively
evaluating the fitness of each point in the search space will always reveal the
optimum, this is usually impracticable because of the enormity of the search
space. Thus, the essence of all the numerical optimization techniques is to
determine the optimum point in the search space by examining only a fraction
of all possible candidates.
The techniques described here are all based upon the idea of choosing a
starting point and then altering one or more variables in an attempt to increase
the fitness or reduce the cost. The various approaches have the following two
key characteristics.
(i) Whether they are based on a single candidate or a population of
candidates
Some of the methods to be described, such as hill-climbing, maintain a
single best solution so far which is refined until no further increase in
fitness can be achieved.

Figure 7.1 A fitness landscape, plotting fitness against the two search parameters x and y, with the global optimum and local optima marked

Genetic algorithms, on the other hand, maintain a
population of candidate solutions. The overall fitness of the population
generally improves with each generation, although some decidedly unfit
individual candidates may be added along the way.
(ii) Whether new candidates can be distant in the search space from the
existing ones
Methods such as hill-climbing take small steps from the start point until
they reach either a local or global optimum. To guard against missing the
global optimum, it is advisable to repeat the process several times, starting
from different points in the search space. An alternative approach, adopted
in genetic algorithms and simulated annealing, is to begin with the
freedom to roam around the whole of the search space in order to find the
regions of highest fitness. This initial exploration phase is followed by
exploitation, i.e., a detailed search of the best regions of the search space
identified during exploration. Methods, such as genetic algorithms, that
use a population of candidates rather than just one allow several regions to
be explored at the same time.
7.4 Hill-climbing and gradient descent algorithms
7.4.1 Hill-climbing
The name hill-climbing implies that optimization is viewed as the search for a
maximum in a fitness landscape. However, the method can equally be applied
to a cost landscape, in which case a better name might be valley descent. It is
the simplest of the optimization procedures described here. The algorithm is
easy to implement, but is inefficient and offers no protection against finding a
local optimum rather than the global one. From a randomly selected start point
in the search space, i.e., a trial solution, a step is taken in a random direction. If
the fitness of the new point is greater than that of the previous position, it is accepted
as the new trial solution. Otherwise the trial solution is unchanged. The process
is repeated until the algorithm no longer accepts any steps from the trial
solution. At this point the trial solution is assumed to be the optimum. As noted
above, one way of guarding against the trap of detecting a local optimum is to
repeat the process many times with different starting points.
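A minimal sketch of this procedure with repeated random restarts (the fitness function, step size, and iteration counts here are illustrative assumptions, not values from the text):

```python
import random

def hill_climb(fitness, start, step=0.1, max_iter=1000):
    """Random-step hill-climbing from a single start point."""
    x = start
    for _ in range(max_iter):
        candidate = x + random.uniform(-step, step)  # step in a random direction
        if fitness(candidate) > fitness(x):          # accept only improvements
            x = candidate
    return x

def restarts(fitness, n=10):
    """Guard against local optima by restarting from several random points."""
    return max((hill_climb(fitness, random.uniform(-5, 5)) for _ in range(n)),
               key=fitness)

# Hypothetical fitness with its maximum at x = 3
best = restarts(lambda x: -(x - 3) ** 2)
# best converges near the global maximum at x = 3
```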
7.4.2 Steepest gradient descent or ascent
Steepest gradient descent (or ascent) is a refinement of hill-climbing that can
speed the convergence toward a minimum cost (or maximum fitness). It is only
slightly more sophisticated than hill-climbing, and it offers no protection
against finding a local minimum rather than the global one. From a given
starting point, i.e., a trial solution, the direction of steepest descent is
determined. A point lying a small distance along this direction is then taken as
the new trial solution. The process is repeated until it is no longer possible to
descend, at which point it is assumed that the optimum has been reached.
If the search space is not continuous but discrete, i.e., it is made up of
separate individual points, at each step the new trial solution is the neighbor
with the highest fitness or lowest cost. The most extreme form of discrete data
is where the search parameters are binary, i.e., they have only two possible
values. The parameters can then be placed together so that any point in the
search space is represented as a binary string and neighboring points are those
at a Hamming distance (see Section 7.2) of 1 from the current trial solution.
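A sketch of discrete steepest ascent over binary strings, using the number of 1 bits as an illustrative fitness function (an assumption for demonstration, not an example from the text):

```python
def steepest_ascent(fitness, bits):
    """Repeatedly move to the fittest neighbor at Hamming distance 1."""
    current = bits
    while True:
        # All strings differing from the current one in exactly one bit position
        neighbors = [current[:i] + ("0" if current[i] == "1" else "1") + current[i + 1:]
                     for i in range(len(current))]
        best = max(neighbors, key=fitness)
        if fitness(best) <= fitness(current):
            return current  # no neighbor improves: a local (or global) optimum
        current = best

# Fitness = count of 1 bits; the optimum is the all-ones string
print(steepest_ascent(lambda s: s.count("1"), "01001"))  # 11111
```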
7.4.3 Gradient-proportional descent
Gradient-proportional descent, often simply called gradient descent, is a
variant of steepest gradient descent that can be applied in a cost landscape that
is continuous and differentiable, i.e., where the variables can take any value
within the allowed range and the cost varies smoothly. Rather than choosing a
fixed step size, the size of the steps is allowed to vary in proportion to the local
gradient of descent.
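A minimal sketch of gradient-proportional descent, assuming a simple quadratic cost whose gradient is known analytically (the cost, learning rate, and starting point are illustrative choices):

```python
def gradient_descent(grad, x, rate=0.1, steps=100):
    """Take steps proportional to the local gradient: large on steep
    slopes, small near the optimum."""
    for _ in range(steps):
        x = x - rate * grad(x)
    return x

# Hypothetical cost f(x) = (x - 2)^2, whose gradient is 2(x - 2); minimum at x = 2
x_min = gradient_descent(lambda x: 2 * (x - 2), x=10.0)
print(round(x_min, 4))  # 2.0
```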
7.4.4 Conjugate gradient descent or ascent
Conjugate gradient descent (or ascent) is a simple attempt at avoiding the
problem of finding a local, rather than global, optimum in the cost (or fitness)
landscape. From a given starting point in the cost landscape, the direction of
steepest descent is initially chosen. New trial solutions are then taken by
stepping along this direction, with the same direction being retained until the
slope begins to curve uphill. When this happens, an alternative direction
having a downhill gradient is chosen. When the direction that has been
followed curves uphill, and all of the alternative directions are also uphill, it is
assumed that the optimum has been reached. As the method does not
continually hunt for the sharpest descent, it may be more successful than the
steepest gradient descent method in finding the global minimum. However, the
technique will never cause a gradient to be climbed, even though this would be
necessary in order to escape a local minimum and thereby reach the global
minimum.
7.5 Simulated annealing
Simulated annealing [3] owes its name to its similarity to the problem of atoms
rearranging themselves in a cooling metal. In the cooling metal, atoms move to
form a near-perfect crystal lattice, even though they may have to overcome a
localized energy barrier called the activation energy, Ea, in order to do so. The
atomic rearrangements within the crystal are probabilistic. The probability P of
an atom jumping into a neighboring site is given by:

P = exp(−Ea / kT)    (7.1)

where k is Boltzmann's constant and T is temperature. At high temperatures,
the probability approaches 1, while at T = 0 the probability is 0.
In simulated annealing, a trial solution is chosen and the effects of taking a
small random step from this position are tested. If the step results in a
reduction in the cost function, it replaces the previous solution as the current
trial solution. If it does not result in a cost saving, the solution still has a
probability P of being accepted as the new trial solution, given by:

P = exp(−ΔE / T)    (7.2)

This function is shown in Figure 7.2(a). Here, ΔE is the increase in the cost
function that would result from the step and is, therefore, analogous to the
activation energy in the atomic system. There is no need to include
Boltzmann's constant, as ΔE and T no longer represent real energies or
temperatures.
The temperature T is simply a numerical value that determines the stability
of a trial solution. If T is high, new trial solutions will be generated
continually. If T is low, the trial solution will move to a local or global cost
minimum if it is not there already and will remain there. The value of T
is initially set high and is periodically reduced according to a cooling schedule.
A commonly used simple cooling schedule is:
Tt+1 = αTt    (7.3)

where Tt is the temperature at step number t and α is a constant close to, but
below, 1. While T is high, the optimization routine is free to accept many
varied solutions, but as T drops, this freedom diminishes. At T = 0, the method
is equivalent to the hill-climbing algorithm, as shown in Figure 7.2(b).
If the optimization is successful, the final solution will be the global
minimum. The success of the technique is dependent upon values chosen for
starting temperature, the size and frequency of the temperature decrement, and
the size of perturbations applied to the trial solutions. A flowchart for the
simulated annealing algorithm is given in Figure 7.3.
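The procedure can also be sketched in code, combining the acceptance rule of equation 7.2 with the cooling schedule of equation 7.3 (the cost function, starting temperature, cooling constant, and step size below are illustrative assumptions, not values from the text):

```python
import math
import random

def simulated_annealing(cost, x, step=1.0, t=10.0, alpha=0.95, iters=2000):
    """Minimize cost using acceptance probability P = exp(-dE / T)
    and the geometric cooling schedule T(t+1) = alpha * T(t)."""
    for _ in range(iters):
        candidate = x + random.uniform(-step, step)  # small random perturbation
        d_e = cost(candidate) - cost(x)
        # Always accept improvements; accept worse moves with probability exp(-dE/T)
        if d_e < 0 or random.random() < math.exp(-d_e / t):
            x = candidate
        t *= alpha  # cooling schedule (equation 7.3)
    return x

# Hypothetical cost with its minimum at x = 1
result = simulated_annealing(lambda x: (x - 1) ** 2, 8.0)
# result settles near the minimum once T has fallen and only improvements are accepted
```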
Johnson and Picton [4] have described a variant of simulated annealing in
which the acceptance of a trial solution is always probabilistic, even
Figure 7.2 The probability P of accepting a trial solution, plotted against ΔE / T