
    Chapter seven

    Optimization algorithms

    7.1 Optimization

    We have already seen that symbolic learning by induction is a search process,

    where the search for the correct rule, relationship, or statement is steered by

    the examples that are encountered. Numerical learning systems can be viewedin the same light. An initial model is set up, and its parameters are

    progressively refined in the light of experience. The goal is invariably to

    determine the maximum or minimum value of some function of one or more

    variables. This is the process of optimization. Often the optimization problem

    is considered to be one of determining a minimum, and the function that is

    being minimized is referred to as a cost function. The cost function might

    typically be the difference, or error, between a desired output and the actual

    output. Alternatively, optimization is sometimes viewed as maximizing the

value of a function, known then as a fitness function. In fact the two approaches are equivalent, because the fitness can simply be taken to be the

    negation of the cost and vice versa, with the optional addition of a constant

    value to keep both cost and fitness positive. Similarly, fitness and cost are

    sometimes taken as the reciprocals of each other. The term objective function

    embraces both fitness and cost. Optimization of the objective function might

    mean either minimizing the cost or maximizing the fitness.

    7.2 The search space

    The potential solutions to a search problem constitute the search space or

    parameter space. If a value is sought for a single variable, or parameter, the

    search space is one-dimensional. If simultaneous values of n variables are

    sought, the search space is n-dimensional. Invalid combinations of parameter

    values can be either explicitly excluded from the search space, or included on

    the assumption that they will be rejected by the optimization algorithm. In

    combinatorial problems, the search space comprises combinations of values,

© 2001 by CRC Press LLC

    the order of which has no particular significance provided that the meaning of

    each value is known. For example, in a steel rolling mill the combination of

    parameters that describe the profiles of the rolls can be optimized to maximize

    the flatness of the manufactured steel [1]. Here, each possible combination of

    parameter values represents a point in the search space. The extent of the

    search space is constrained by any limits that apply to the variables.

    In contrast, permutation problems involve the ordering of certain

    attributes. One of the best known examples is the traveling salesperson

    problem, where he or she must find the shortest route between cities of known

    location, visiting each city only once. This sort of problem has many real

    applications, such as in the routing of electrical connections on a

    semiconductor chip. For each permutation of cities, known as a tour, we can

    evaluate the cost function as the sum of distances traveled. Each possible tour

    represents a point in the search space. Permutation problems are often cyclic,

    so the tour ABCDE is considered the same as BCDEA.
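As an illustrative sketch, the cost function for a tour can be evaluated as the sum of the distances between consecutive cities, wrapping around from the last city back to the first so that the tour is cyclic. The city names and coordinates below are hypothetical, chosen only to make the example runnable:

```python
import math

# Hypothetical city coordinates, for illustration only.
cities = {"A": (0, 0), "B": (1, 5), "C": (4, 4), "D": (6, 1), "E": (3, -2)}

def tour_cost(tour):
    """Cost of a cyclic tour: the sum of distances between consecutive
    cities, wrapping around from the last city back to the first."""
    total = 0.0
    for i in range(len(tour)):
        x1, y1 = cities[tour[i]]
        x2, y2 = cities[tour[(i + 1) % len(tour)]]  # wrap-around makes it cyclic
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Because tours are cyclic, ABCDE and BCDEA cost the same.
assert abs(tour_cost("ABCDE") - tour_cost("BCDEA")) < 1e-9
```

The wrap-around index is what makes ABCDE and BCDEA the same point in the search space.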

    The metaphor of space relies on the notion that certain points in the search

    space can be considered closer together than others. In the traveling

    salesperson example, the tour ABCDE is close to ABDCE, but DACEB is

    distant from both of them. This separation of patterns can be measured

    intuitively in terms of the number of pair-wise swaps required to turn one tour

    into another. In the case of binary patterns, the separation of the patterns is

usually measured as the Hamming distance between them, i.e., the number of

    bit positions that contain different values. For instance, the binary patterns

    01101 and 11110 have a Hamming separation of 3.
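The Hamming distance can be computed by counting the positions at which two patterns differ, as in this short sketch:

```python
def hamming(a, b):
    """Hamming distance: the number of positions at which two equal-length
    patterns hold different values."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming("01101", "11110"))  # prints 3
```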

    We can associate a fitness value with each point in the search space. By

    plotting the fitness for a two-dimensional search space, we obtain a fitness

    landscape (Figure 7.1). Here the two search parameters are x and y,

    constrained within a range of allowed values. For higher dimensions of search

    space a fitness landscape still exists, but is difficult to visualize. A suitable

    optimization algorithm would involve finding peaks in the fitness landscape or

    valleys in the cost landscape. Regardless of the number of dimensions, there is

    a risk of finding a local optimum rather than the global optimum for the

    function. A global optimum is the point in the search space with the highest

    fitness. A local optimum is a point whose fitness is higher than all its near

    neighbors but lower than that of the global optimum.

    If neighboring points in the search space have a similar fitness, the

    landscape is said to be smooth or correlated. The fitness of any individual

    point in the search space is, therefore, representative of the quality of the

    surrounding region. Where neighboring points have very different fitnesses,

    the landscape is said to be rugged. Rugged landscapes typically have large


numbers of local optima and the fitness of an individual point in the search space will not necessarily reflect that of its neighbors.

    The idea of a fitness landscape assumes that the function to be optimized

    remains constant during the optimization process. If this assumption cannot be

    made, as might be the case in a real-time system, we can think of the problem

    as finding an optimum in a fitness seascape [2].

    7.3 Searching the search space

    Determining the optimum for an objective function of multiple variables is not

    straightforward, even when the landscape is static. Although exhaustively

    evaluating the fitness of each point in the search space will always reveal the

    optimum, this is usually impracticable because of the enormity of the search

    space. Thus, the essence of all the numerical optimization techniques is to

    determine the optimum point in the search space by examining only a fraction

    of all possible candidates.

    The techniques described here are all based upon the idea of choosing a

starting point and then altering one or more variables in an attempt to increase the fitness or reduce the cost. The various approaches have the following two

    key characteristics.

    (i) Whether they are based on a single candidate or a population of

    candidates

    Some of the methods to be described, such as hill-climbing, maintain a

single best solution so far, which is refined until no further increase in fitness can be achieved.

Figure 7.1 A fitness landscape (fitness plotted against the search parameters x and y, with the global optimum and local optima marked)

Genetic algorithms, on the other hand, maintain a

    population of candidate solutions. The overall fitness of the population

    generally improves with each generation, although some decidedly unfit

    individual candidates may be added along the way.

(ii) Whether new candidates can be distant in the search space from the existing ones

    Methods such as hill-climbing take small steps from the start point until

    they reach either a local or global optimum. To guard against missing the

    global optimum, it is advisable to repeat the process several times, starting

    from different points in the search space. An alternative approach, adopted

    in genetic algorithms and simulated annealing, is to begin with the

    freedom to roam around the whole of the search space in order to find the

    regions of highest fitness. This initial exploration phase is followed by

exploitation, i.e., a detailed search of the best regions of the search space identified during exploration. Methods, such as genetic algorithms, that

    use a population of candidates rather than just one allow several regions to

    be explored at the same time.

    7.4 Hill-climbing and gradient descent algorithms

    7.4.1 Hill-climbing

The name hill-climbing implies that optimization is viewed as the search for a

    maximum in a fitness landscape. However, the method can equally be applied

    to a cost landscape, in which case a better name might be valley descent. It is

    the simplest of the optimization procedures described here. The algorithm is

    easy to implement, but is inefficient and offers no protection against finding a

local minimum rather than the global one. From a randomly selected start point in the search space, i.e., a trial solution, a step is taken in a random direction. If

    the fitness of the new point is greater than the previous position, it is accepted

    as the new trial solution. Otherwise the trial solution is unchanged. The process

    is repeated until the algorithm no longer accepts any steps from the trial

    solution. At this point the trial solution is assumed to be the optimum. As noted

    above, one way of guarding against the trap of detecting a local optimum is to

    repeat the process many times with different starting points.
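A minimal sketch of this procedure follows. The step size, stopping rule, and number of restarts are arbitrary choices for illustration, not values prescribed by the text:

```python
import random

def hill_climb(fitness, start, step=0.1, max_failures=200):
    """Hill-climbing: take a random step; keep it only if fitness improves."""
    trial = list(start)
    best_fit = fitness(trial)
    failures = 0
    while failures < max_failures:  # stop once no step is accepted for a while
        candidate = [v + random.uniform(-step, step) for v in trial]
        f = fitness(candidate)
        if f > best_fit:            # accept the step as the new trial solution
            trial, best_fit = candidate, f
            failures = 0
        else:                       # reject: trial solution unchanged
            failures += 1
    return trial

# Repeating from several random start points guards against local optima.
peak = lambda p: -(p[0] - 1.0) ** 2  # illustrative fitness with a peak at x = 1
runs = [hill_climb(peak, [random.uniform(-5, 5)]) for _ in range(5)]
best = max(runs, key=peak)
```

Taking the best result over several restarts is the safeguard described above.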


    7.4.2 Steepest gradient descent or ascent

    Steepest gradient descent (or ascent) is a refinement of hill-climbing that can

    speed the convergence toward a minimum cost (or maximum fitness). It is only

    slightly more sophisticated than hill-climbing, and it offers no protection

    against finding a local minimum rather than the global one. From a given

    starting point, i.e., a trial solution, the direction of steepest descent is

    determined. A point lying a small distance along this direction is then taken as

    the new trial solution. The process is repeated until it is no longer possible to

    descend, at which point it is assumed that the optimum has been reached.

    If the search space is not continuous but discrete, i.e., it is made up of

    separate individual points, at each step the new trial solution is the neighbor

    with the highest fitness or lowest cost. The most extreme form of discrete data

    is where the search parameters are binary, i.e., they have only two possible

    values. The parameters can then be placed together so that any point in the

    search space is represented as a binary string and neighboring points are those

    at a Hamming distance (see Section 7.2) of 1 from the current trial solution.
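For the binary case just described, steepest ascent can be sketched as follows, using a simple bit-count ("one-max") fitness purely for illustration:

```python
def steepest_ascent_binary(fitness, n_bits, start):
    """Move, at each step, to the Hamming-distance-1 neighbour with the
    highest fitness; stop when no neighbour improves on the trial solution."""
    trial = list(start)
    while True:
        neighbours = []
        for i in range(n_bits):
            nb = trial.copy()
            nb[i] ^= 1           # flip one bit: a neighbour at Hamming distance 1
            neighbours.append(nb)
        best = max(neighbours, key=fitness)
        if fitness(best) <= fitness(trial):
            return trial         # no neighbour is fitter: assumed optimum
        trial = best

# Illustrative "one-max" fitness: the number of 1 bits in the string.
result = steepest_ascent_binary(sum, 5, [0, 1, 0, 0, 1])  # → [1, 1, 1, 1, 1]
```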

    7.4.3 Gradient-proportional descent

    Gradient-proportional descent, often simply called gradient descent, is a

    variant of steepest gradient descent that can be applied in a cost landscape that

    is continuous and differentiable, i.e., where the variables can take any value

    within the allowed range and the cost varies smoothly. Rather than choosing a

    fixed step size, the size of the steps is allowed to vary in proportion to the local

    gradient of descent.
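A minimal sketch, assuming a differentiable cost whose gradient is available; the rate constant below is an arbitrary illustrative choice:

```python
def gradient_descent(grad, start, rate=0.1, steps=100):
    """Gradient-proportional descent: each step is the local gradient scaled
    by a fixed rate, so steeper slopes produce bigger steps."""
    x = list(start)
    for _ in range(steps):
        g = grad(x)
        x = [xi - rate * gi for xi, gi in zip(x, g)]
    return x

# Cost (x - 3)^2 + (y + 1)^2 has gradient (2(x - 3), 2(y + 1)) and its
# minimum at (3, -1).
minimum = gradient_descent(lambda p: [2 * (p[0] - 3), 2 * (p[1] + 1)], [0.0, 0.0])
```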

    7.4.4 Conjugate gradient descent or ascent

    Conjugate gradient descent (or ascent) is a simple attempt at avoiding the

    problem of finding a local, rather than global, optimum in the cost (or fitness)

    landscape. From a given starting point in the cost landscape, the direction of

    steepest descent is initially chosen. New trial solutions are then taken by

    stepping along this direction, with the same direction being retained until the

    slope begins to curve uphill. When this happens, an alternative direction

    having a downhill gradient is chosen. When the direction that has been

    followed curves uphill, and all of the alternative directions are also uphill, it is

    assumed that the optimum has been reached. As the method does not

    continually hunt for the sharpest descent, it may be more successful than the

    steepest gradient descent method in finding the global minimum. However, the

    technique will never cause a gradient to be climbed, even though this would be

    necessary in order to escape a local minimum and thereby reach the global

    minimum.
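A rough sketch of this scheme follows. As a simplification it samples random candidate directions rather than computing the direction of steepest descent, and it treats "every sampled direction is uphill" as the stopping test; the step size and sample count are arbitrary:

```python
import random

def directional_descent(cost, start, step=0.05, tries=50):
    """Keep stepping along one direction while the cost falls; when the slope
    curves uphill, switch to another downhill direction; stop when no sampled
    direction is downhill."""
    x = list(start)
    while True:
        # Find a downhill direction by random sampling (a simplification:
        # the text initially chooses the direction of steepest descent).
        for _ in range(tries):
            d = [random.gauss(0.0, 1.0) for _ in x]
            if cost([xi + step * di for xi, di in zip(x, d)]) < cost(x):
                break
        else:
            return x  # every sampled direction is uphill: assumed optimum
        # Retain this direction until the slope begins to curve uphill.
        while True:
            trial = [xi + step * di for xi, di in zip(x, d)]
            if cost(trial) >= cost(x):
                break
            x = trial

bowl = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2  # illustrative cost
solution = directional_descent(bowl, [0.0, 0.0])
```

As the text warns, nothing here ever climbs a gradient, so a local minimum can still trap the search.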


    7.5 Simulated annealing

    Simulated annealing [3] owes its name to its similarity to the problem of atoms

    rearranging themselves in a cooling metal. In the cooling metal, atoms move to

    form a near-perfect crystal lattice, even though they may have to overcome a

localized energy barrier called the activation energy, Ea, in order to do so. The

atomic rearrangements within the crystal are probabilistic. The probability P of an atom jumping into a neighboring site is given by:

P = exp(-Ea/kT)    (7.1)

where k is Boltzmann's constant and T is temperature. At high temperatures,

the probability approaches 1, while at T = 0 the probability is 0.

    In simulated annealing, a trial solution is chosen and the effects of taking a

    small random step from this position are tested. If the step results in a

reduction in the cost function, it replaces the previous solution as the current trial solution. If it does not result in a cost saving, the solution still has a

probability P of being accepted as the new trial solution, given by:

P = exp(-ΔE/T)    (7.2)

This function is shown in Figure 7.2(a). Here, ΔE is the increase in the cost

    function that would result from the step and is, therefore, analogous to the

    activation energy in the atomic system. There is no need to include

Boltzmann's constant, as ΔE and T no longer represent real energies or temperatures.

The temperature T is simply a numerical value that determines the stability

    of a trial solution. If T is high, new trial solutions will be generated

    continually. If T is low, the trial solution will move to a local or global cost

    minimum if it is not there already and will remain there. The value of T

    is initially set high and is periodically reduced according to a cooling schedule.

    A commonly used simple cooling schedule is:

Tt+1 = α Tt    (7.3)

where Tt is the temperature at step number t and α is a constant close to, but below, 1.

    below, 1. While T is high, the optimization routine is free to accept many

varied solutions, but as it drops, this freedom diminishes. At T = 0, the method

is equivalent to the hill-climbing algorithm, as shown in Figure 7.2(b).

    If the optimization is successful, the final solution will be the global

    minimum. The success of the technique is dependent upon values chosen for


starting temperature, the size and frequency of the temperature decrement, and the size of perturbations applied to the trial solutions. A flowchart for the

simulated annealing algorithm is given in Figure 7.3.
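The algorithm can be sketched as follows. The starting temperature, cooling constant, step size, and test function below are illustrative choices only; as noted above, suitable values are problem-dependent:

```python
import math
import random

def simulated_annealing(cost, start, step=0.5, t_start=10.0, alpha=0.95,
                        steps_per_t=100, t_min=1e-3):
    """Simulated annealing with acceptance probability P = exp(-dE/T)
    and the geometric cooling schedule T <- alpha * T."""
    trial = list(start)
    t = t_start
    while t > t_min:
        for _ in range(steps_per_t):
            candidate = [v + random.uniform(-step, step) for v in trial]
            d_e = cost(candidate) - cost(trial)
            # A cost reduction is always accepted; an increase is accepted
            # with probability exp(-dE/T), which shrinks as T falls.
            if d_e < 0 or random.random() < math.exp(-d_e / t):
                trial = candidate
        t *= alpha  # cooling schedule, as in equation (7.3)
    return trial

# Illustrative one-dimensional cost with several local minima.
bumpy = lambda p: p[0] ** 2 + 3.0 * math.sin(5.0 * p[0]) + 3.0
sa_result = simulated_annealing(bumpy, [4.0])
```

While T is large the inner loop accepts almost any step; once T is small the rule degenerates to hill-climbing, matching the behaviour described in the text.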

    Johnson and Picton [4] have described a variant of simulated annealing in

which the acceptance of a trial solution is always probabilistic, even

Figure 7.2 Acceptance probability P plotted against ΔE/T over the range -4 to 4; P = 1 where the step reduces the cost