The Anticipated Mean Shift and Cluster Registration in...

The Anticipated Mean Shift and Cluster Registration inMixture-based EDAs for Multi-Objective Optimization

Peter A.N. BosmanCentrum Wiskunde & Informatica (CWI)

P.O. Box 94079, 1090 GB Amsterdam, The Netherlands

[email protected]

ABSTRACT

It is known that in real-valued Single-Objective (SO) op-timization with Gaussian Estimation-of-Distribution Algo-rithms (EDAs), it is important to take into account howdistribution parameters change in subsequent generationsto prevent inefficient convergence as a result of overfitting,especially if dependencies are modelled. We illustrate thatin Multi-Objective (MO) optimization the risk of overfittingis even larger and only further increased if clustered varia-tion is used, a technique often employed in Multi-ObjectiveEDAs (MOEDAs) in the form of mixture modelling via clus-tering selected solutions in objective space. We point outthat a technique previously used in EDAs to remove the riskof overfitting for SO optimization, the anticipated mean shift(AMS), can also be used in MO optimization if clusters insubsequent generations are registered. We propose to com-pute this registration explicitly. Although computationallymore intensive than existing approaches, the effectiveness ofAMS is thereby increased. We further propose a new clus-tering technique to improve mixture modelling in EDAs by1) allowing clusters to overlap substantially and 2) assign-ing each cluster the same number of solutions. This allowsany existing EDA to be transformed into a mixture-basedversion straightforwardly. Finally, we point out the benefitof injecting solutions obtained from running equal-capacitySO optimizers in synchronous parallel and investigate ex-perimentally, using 9 well-known benchmark problems, theadvantages of each of the techniques.

Categories and Subject Descriptors

G.1 [Numerical Analysis]: Optimization; I.2 [ArtificialIntelligence]: Problem Solving, Control Methods, and Search

General Terms

Algorithms, Performance, Experimentation

Keywords

Estimation of Distribution Algorithms, Multi-Objective Op-timization, Mixture Distribution, Anticipation

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.GECCO’10, July 7–11, 2010, Portland, Oregon, USA.Copyright 2010 ACM 978-1-4503-0072-8/10/07 ...$10.00.

1. INTRODUCTIONEDAs aim to exploit features of a problem’s structure in

a principled manner via probabilistic modelling. It is of-ten assumed that a higher capacity of the distribution classused in an EDA automatically allows for a larger, and morecomplex, class of optimization problems to be solved effi-ciently. Merely enlarging this capacity, e.g. by allowing moredependencies to be modelled, isn’t necessarily enough how-ever. When using Gaussian (i.e. normal) distributions withmaximum-likelihood estimates for instance, it is known thatmodelling dependencies may actually lead to overfitting theselected solutions, which, in turn, results in an inefficientalignment of the distribution with the direction of improve-ment in the problem landscape [5, 6, 7, 8]. For this reason,the direction in which the distribution has shifted in subse-quent generations must be considered. This is done usingadaptive mechanisms that span multiple generations suchas the Anticipated Mean Shift (AMS) [1] approach in EDAsand the evolution path and the estimation of covariancesusing the mean in the previous generation in CMA-ES [8].

For single-objective (SO) optimization, these algorithmsare highly efficient. Many optimization problems in practicehowever are multi-objective (MO). In MO optimization, theoptimum is no longer a single solution but a set of solutions,called the optimal Pareto front. This is because many solu-tions may be equally good, e.g. solution a may be better inthe first objective than solution b, but worse in the secondobjective. Population-based methods such as evolutionaryalgorithms (EAs) are commonly accepted to be well-suitedfor solving MO problems [4]. Because a set of solutions isused, EAs can spread their search bias along the Pareto frontand thereby prevent many re-computations that are involvedif a single point on the Pareto front is repeatedly targetedusing an approach that only considers a single solution.

Considering EDAs, mixture distributions are of particu-lar interest when solving MO optimization problems becausethey can spread the search intensity along the Pareto front,allowing more focused exploitation of problem structure indifferent regions of the objective space [2, 12]. To obtainhigh-quality solutions, exploiting dependencies in each re-gion may be necessary, but the configuration of these depen-dencies or the values for the problem variables may be verydifferent in each region. Probabilistic dependency modellingmay be less effective if it is the same in each region.

It is the focus of this paper to study more closely the rela-tion between spreading the search distribution in EDAs us-ing mixture distributions and the observed pressure towardsfinding better (i.e. Pareto-dominating) solutions. In MO op-

timization, the number of equally-preferable solutions caneasily be larger than the population size, causing the vari-ance of the estimated distribution to quickly become focusedon the variety within sets of equally-preferable solutions in-stead of the variety between such sets, i.e. the direction of im-provement in the MO fitness landscape. Non-zero variancealong such directions is required for any EDA to have a sub-stantial probability of sampling better solutions. If higher-order dependencies can be modelled, the risk of fitting onlysolutions of equal preference becomes only larger becausea more accurate probabilistic representation of the selectedsolutions is possible, especially in the case of real-valued ob-jectives because then there may be an infinite number ofequally-preferable solutions. Arguably, premature conver-gence and inefficient performance are then much more likely,making this an important topic to study more closely.

2. MULTI-OBJECTIVE OPTIMIZATIONWe assume to have m objective functions fi(x), i ∈ {0, 1,

. . . , m − 1} and, without loss of generality, we assume thatthe goal is to minimize all objectives.

A solution x0 is said to (Pareto) dominate a solution x1

(denoted x0 ≻ x1) if and only if fi(x0) ≤ fi(x

1) holdsfor all i ∈ {0, 1, . . . , m − 1} and fi(x

0) < fi(x1) holds for

at least one i ∈ {0, 1, . . . , m − 1}. A Pareto set of sizen then is a set of solutions xj , j ∈ {0, 1, . . . , n − 1} forwhich no solution dominates any other solution, i.e. thereare no j, k ∈ {0, 1, . . . , n − 1} such that xj ≻ xk holds. APareto front corresponding to a Pareto set is the set of all m-dimensional objective function values corresponding to thesolutions, i.e. the set of all f (xj), j ∈ {0, 1, . . . , n− 1}.

A solution x0 is said to be Pareto optimal if and only ifthere is no other x1 such that x1 ≻ x0 holds. Further, theoptimal Pareto set is the set of all Pareto-optimal solutionsand the optimal Pareto front is the Pareto front that cor-responds to the optimal Pareto set. We denote the optimalPareto set by PS and the optimal Pareto front by PF .

3. CLUSTERED VARIATIONInstead of using one population, multiple populations can

be used. With the exception of selection, a completely sepa-rate EA is run for each subpopulation. Because selection isperformed on all solutions in all populations the generationsare synchronized and the populations can also be thoughtof as subpopulations. This approach is taken in SDR-AVS-MIDEA [3] and in MO-CMA-ES [11].

Not all populations necessarily then get the same num-ber of selected solutions. For some populations, none of thegenerated solutions may even be selected in the next gen-eration. In that case, the population will have to be resetsomehow, for instance by copying solutions from other pop-ulations. Also, all adaptive mechanisms that span multiplegenerations will have to be reset for the disappearing popu-lation. This is the case for SDR-AVS-MIDEA [3]. One wayto overcome this problem is to restrict the population size tobe of size 1. This is the case for in MO-CMA-ES [11] wherea (1, 1) strategy is used. This restriction however doesn’tallow other existing SO population-based methods to be ex-tended to the MO case in a straightforward manner.

Clustered variation can also be performed using only onepopulation. The selected solutions are then first clustered.Subsequently, the actual variation takes place by consider-ing only individuals in the same cluster, i.e. a mating re-

striction is employed. To ensure that the spatial separationof the search bias is obtained in the objective space, clus-tering should be performed on the basis of objective val-ues. In EDAs, this corresponds to using a mixture distribu-tion. A mixture probability distribution is a weighted sumof k probability distributions. Let X be the random vari-able that represents the parameter space of the problem athand. A mixture probability distribution is then defined byPk−1

i=0βiP

i(X), βi > 0, i ∈ {0, . . . , k− 1} andPk−1

i=0βi = 1.

The βi are called the mixing coefficients and each probabilitydistribution P i is called a mixture component.

Using mixture probability distributions instead of subpop-ulations is probabilistically a superior approach because alldata is used each generation to compute the distribution.Obtaining mixture distributions by clustering the selectedsolutions and estimating a probability distribution in eachcluster separately is an approach taken in various MOEDAs,e.g. in MIDEA [2] and in mohBOA [12]. The main differencebetween these two approaches, besides employing Gaussiansfor real-valued solutions versus employing decision graphsin Bayesian factorizations for discrete solutions, is that inMIDEA a different clustering algorithm is used (leader clus-tering) than in mohBOA (k-means). We refer the interestedreader for details on these clustering algorithms to the re-spective literature. Although asymptotically these cluster-ing algorithms have the same computational complexity, thek-means clustering algorithm loops over the data more thanonce, requiring more time, but typically resulting in a supe-rior clustering result with less variation in cluster sizes.

The clustering methods used so far do not necessarily re-sult in a clustering where each cluster has equal size. Thiscan give similar problems as when using multiple popula-tions. For a straightforward extension of existing EAs, it isconvenient to know for sure that cluster sizes are uniformand what this size is. To this end, we propose the followingmix of the leader and the k-means clustering algorithms.

First, a nearest-neighbour heuristic is used to select k lead-ers that are spread as well as possible: the first leader ischosen as a solution with a maximum value for a randomlychosen objective. For all remaining solutions, the nearest-neighbor distance is computed to the single leader and theone with the largest distance is chosen as the next leader.The distances for the remaining solutions are updated bychecking whether the distance to the new leader is smallerthan the currently stored nearest-neighbour distance. Theselast two steps are repeated until k leaders are selected. Sec-ond, these solutions serve as the initial cluster means fork-means clustering. Third, the distance from each selectedsolution to the final cluster means is computed. After sort-ing then, for each cluster the closest c solutions are finallyassigned to that cluster, ensuring that each cluster consistsof exactly c solutions. Because sorting and the final assign-ment is done independently for each cluster, some solutionsmay be assigned to multiple clusters whereas other solutionsare not assigned at all. The probability of this happeningcan be reduced by forcing the clusters to overlap by settingc > 1

k|S | where S is the set of selected solutions. Specifi-

cally, we propose to use c = 2

k|S |, resulting in substantial ex-

pected overlap between neighboring clusters. This increasesthe expected density in the usual void between the bound-aries of clusters in the objective space, thereby increasingthe probability of finding a good, uniform spread of solu-

LeaderLeader k-meansk-means Balancedk-leader-means

Balancedk-leader-means

Figure 1: Three clustering algorithms and densitycontours of the associated Gaussian mixture.

tions faster. Further, twice the number of clusters can beused in this way, given the same population size.

Figure 1 shows results of clustering 105 samples in a trian-gle, reminiscent of a selection result on a 2D slope, i.e. min-imizing x0 + x1. Also shown are the density contours of theassociated Gaussian mixture that is obtained by estimat-ing a Gaussian distribution in each cluster. For the leaderand the k-means clustering algorithm, 5 clusters are com-puted. For the proposed balanced k-leader-means (BKLM)clustering algorithm, 10 clusters are computed. An increasein uniformity of the density estimate can be observed withan increase in clustering effort, with the smoothest densityestimate obtained using BKLM. The problem of unequalcluster sizes also diminishes with increased clustering effort.The most uneven result was found for leader clustering: 29,27, 23, 19 and 8, followed by k-means: 27, 22, 22, 20, 15 andfinally BKLM with all equal cluster sizes of 21.

Finally, we remark that clustering in MOEDAs shouldcompute distances based on normalized objective values toremove the influence of differently scaled objectives. To thisend, first the minimum fmin

i and maximum fmaxi values for

each objective i can be computed from all selected solutions.A point in objective space f(x) then can be scaled linearlyto the observed ranges, i.e. (f(x)− fmin

i )/(fmaxi − fmin

i ).

4. CLUSTER REGISTRATIONAn important part of state-of-the-art variation operators

are adaptive mechanisms that span multiple generations suchas the Anticipated Mean Shift (AMS) [1] approach in EDAsand the evolution path and the estimation of the covari-ance matrix based on the mean in the previous generation inCMA-ES [8]. The contribution of these mechanisms stronglydepends on a correlation to exist between the sets of solu-tions in subsequent generations from which the models arebuilt. By re-applying clustering each generation however,in principle there is no spatial relation between clusters insubsequent generations. Even if the clustering algorithmhas low variation when applied twice to the same data, thefinal enumeration of the clusters does not guarantee at allthat cluster i in generation t − 1 is near cluster i in gen-eration t. Therefore, some form of registration is requiredthat determines the best correspondence between clusters insubsequent generations. An implicit form of registration isachieved by assigning each newly generated solution to thecluster to which it is nearest (i.e. the highest density). Thisapproach is taken in SDR-AVS-MIDEA [3]. Once new so-lutions cannot be assigned to a particular cluster anymorebecause it has become too large already (i.e. larger than thepredefined subpopulation size), suboptimal cluster assign-ments can be made. Over multiple generations, the spatialseparation can then degrade, resulting in clusters movingacross the Pareto front as can for instance be observed inFigure 2. This problem can not be overcome by using a

0

0.6

1.2

1.8

2.4

3

3.6

4.2

4.8

5.4

6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

1

2

3

4

5

1

2

3

4

51

2

3

4

5

0

0.6

1.2

1.8

2.4

3

3.6

4.2

4.8

5.4

6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

1

2

3

4

5

1

2

3

4

51

2

3

4

5

0

0.6

1.2

1.8

2.4

3

3.6

4.2

4.8

5.4

6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

1

2

3

4

5

1

2

3

4

51

2

3

4

5

Implicit registrationImplicit registrationImplicit registration

f0f0f0

f 1f 1f 1

0

0.6

1.2

1.8

2.4

3

3.6

4.2

4.8

5.4

6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

1

2

3

45

67

89 10

123

4 5 6 78

9 10

123 4 5 6 7 8

9 10

0

0.6

1.2

1.8

2.4

3

3.6

4.2

4.8

5.4

6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

1

2

3

45

67

89 10

123

4 5 6 78

9 10

123 4 5 6 7 8

9 10

0

0.6

1.2

1.8

2.4

3

3.6

4.2

4.8

5.4

6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

1

2

3

45

67

89 10

123

4 5 6 78

9 10

123 4 5 6 7 8

9 10

Explicit registrationExplicit registrationExplicit registration

f0f0f0

f 1f 1f 1

Figure 2: Clustering of selected solutions in differentgenerations using implicit and explicit registrationof 5 and 10 clusters respectively and estimating aGaussian with a full covariance matrix per cluster.

population size of 1 as in MO-CMA-ES [11] unless an ex-plicit registration is performed. Therefore, we propose toexplicitly compute a registration between clusters in subse-quent generations. The approach to this end that we proposehere is not specific for Gaussian EDAs and can therefore beapplied to any clustered or multi-population algorithm.

The goal of explicit cluster registration is to re-assign thecluster indices of the current generation t such that clusteri in generation t is the cluster that is closest to cluster i inthe previous generation t − 1. To this end, we propose analgorithm that first computes all distances between clustersin generation t and generation t− 1. The distance betweentwo clusters is taken to be the smallest distance betweenany solution in the one cluster and any solution in the othercluster. Also all cluster distances are computed between theclusters in generation t and between the clusters in gener-ation t − 1. Then, the algorithm repeatedly selects r ≤ kclusters to be registered, that is, r clusters in generation tand r clusters in generation t− 1. To this end, first the twostill-unregistered clusters in generation t are determined thatare the farthest apart. One of these two far-apart clustersis randomly selected as well as the still-unregistered clusterin generation t − 1 that is closest to it. The r − 1 nearestneighbours of these clusters are then determined in the setof still-unregistered clusters of their respective generations,leading to two subsets of r clusters to be registered. To reg-ister subsets of clusters, all possible r! permutations for theset of clusters in generation t are considered and the permu-tation is selected for which the sum of the distances betweenthe matched clusters is minimal. Subset registration is thenrepeated until all clusters are registered.

The reason for using subset registration with r ≤ k insteadof r = k is that subset registration is performed by enumer-ating permutations. As this number grows factorially fast,exact optimization via enumeration of all possible permuta-tions can only be done for small values of r. Still, we foundthat r can be set large enough (we used r = 10) to sub-stantially reduce the risk of suboptimal registration withoutrequiring more time than other parts in model-building.

Figure 2 shows clusters in different generations (1, 30 and60) using the same number of solutions for implicit registra-tion and explicit registration on the well-known benchmarkproblem EC1. For implicit registration, k = 5 subpopula-tions are used. For explicit registration, BKLM is used withk = 10 clusters. In each cluster a Gaussian distributionis estimated using a full covariance matrix without furtheradaptive enhancements. The superiorly smooth front and

stable registration over many generations is clear for explicitregistration, but so is the lack of front progress as a result ofoverfitting the selected solutions with more involved mixtureestimates. Next, we specifically target this issue.

5. GAUSSIANS, AMS, SDR, AVS AND MOEstimating a Gaussian distribution only using the selected

solutions of the current generation, the density contours canbecome aligned with directions in which only solutions ofsimilar quality can be found. Methods that only adaptivelyscale the covariance matrix, such as SDR-AVS, do not helpmuch as they almost solely increases search effort in the fu-tile direction perpendicular to the direction of improvement.In SDR-AVS, a distribution multiplier cMultiplier is main-tained by which the covariance matrix is multiplied eachgeneration. This multiplier is scaled up if improvementsare found that are more than standard-deviation away fromthe mean and scaled down if no improvements are found (formore details, see [1]). This misalignment behavior is alreadyknown to occur in SO optimization with EDAs [1, 8], butthe same issue can occur in MO optimization because it isa direct consequence of selecting solutions of similar quality,regardless of the number of objectives.

This inefficient behavior is illustrated in Figure 3 on a two-dimensional and two-objective minimization problem definedby f0(x) = 1

2(x2

0 + (x1 − 1.0)2) and f1(x) = 1

2((x0 − 1.0)2 +

x21). The optimal Pareto front is convex and defined by

x0 = 1−x1 and f1 = f0−2√

f0 +1. By initializing the pop-ulation in the initialization range (IR) [0.9; 1.0]2, the bettersolutions form a rotated V-shape in the lower-left triangleof the IR. Using a maximum-likelihood estimate, the dis-tribution thereby becomes misaligned with the direction ofimprovement. Although the variance is adaptively scaledup, the misalignment prevents the MOEDA from efficientlylocating solutions closer to the optimal Pareto front.

One way to overcome this problem, is to use the An-ticipated Mean Shift (AMS) [1]. The AMS is computedas the difference between the means of subsequent genera-tions, i.e. µShift(t) = µ(t) − µ(t − 1). A part, specificallyα100%, of the newly sampled solutions is then moved inthe direction of the AMS: x ← x + 2µShift(t). The ratio-nale is that solutions changed by AMS are further downthe slope. Selecting those solutions as well as solutions notchanged by AMS aligns the distribution estimate better withthe direction of improvement. In a population of size nwhere ⌊τn⌋ solutions are selected, nelitist solutions are main-tained and n − nelitist new solutions are generated, propor-tioning the selected solutions perfectly between unalteredand AMS-altered solutions requires α(n − nelitist) = 1

2τn

and thus α = 1

2τ n

n−nelitist . A combination of AMS with SDR

and AVS has been termed AMaLGaM (Adapted Maximum-Likelihood Gaussian Model) [1] in which traversing a slopeis further sped up by multiplying the movement of solutionsin the direction of the AMS by the same multiplier used forthe covariance matrix, i.e. x← x + cMultiplier2µShift(t).

The effect of adding AMS, i.e. using AMaLGaM, is shownfor the example problem in Figure 3. In parameter space,the Gaussian is quickly adaptively re-aligned with the direc-tion of Pareto-improvement. In objective space the variancetowards the optimal Pareto front remains substantial, caus-ing the density to already start spreading along the optimalPareto front within the first 7 generations.

AMS, SDR and AVS can all be applied directly in com-

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Parameter space Objective space

Figure 3: 95%-contours of the estimated distribu-tion in the first 7 generations of typical MOEDAruns with either a single Gaussian (left column) ora mixture (right column) and with either SDR-AVS(top row) or AMaLGaM (bottom row), on the ex-ample problem with IR [0.9; 1.0]2. Subsequent gener-ations alternatingly use solid and dashed lines. Theestimations are shown both in parameter space (redand green) and objective space (blue and pink).

bination with mixture distributions if a correspondence be-tween the clusters in subsequent generations exists. For eachcluster then, a separate AMS, SDR and AVS mechanism canbe used. For the use of SDR-AVS without AMS however,the resulting performance of the MOEDA when going from asingle distribution to a mixture distribution can become evenworse. In Figure 3 it can clearly be seen that the problem ofoverfitting the selected solutions is even more problematic.In each cluster, the selected solutions can be fitted even moreclosely, resulting in an overall better fit, but less progress interms of optimization. The addition of AMS for each clusterseparately changes this behavior completely. Similar to us-ing a single distribution, within a few generations the mostimportant direction of improvement is detected and the den-sity of the estimated distribution is re-aligned to efficientlyfind Pareto-improving solutions. The density in the objec-tive space also shows that in the last few generations a moreuniform density estimate is obtained only in the vicinity ofthe Pareto front, whereas for a single cluster the density inobjective space is spread out even across large parts of theobjective space that are inferior.

6. MAMALGAM-XWe call the composition of the techniques proposed above,

i.e. the MOEDA illustrated in the bottom-right in Figure 3,MAMaLGaM-X (Multi-objective AMaLGaM-miXture) andsummarize its operational description below.

Given a population of size n, ⌊τn⌋, τ ∈ [ 1

n, 1], solutions

are selected and clustered using BKLM, giving cluster sizesof 2

k⌊τn⌋. Selection is performed by computing domination

ranks and then selecting the lowest ranks that fit within the

maximum of ⌊τn⌋. From the rank that crosses this bound-ary, the same nearest neighbour heuristic is used to fill theselected set with as is used to select the k leaders in BKLM.

An elitist archive is maintained, storing all currently non-dominated solutions. Because the objectives are real-valued,there are typically infinitely many non-dominated solutionspossible. To prevent the archive from growing to an ex-treme size, the objective space is discretized into hypercubes.Only one solution per hypercube is allowed in the archive.Newly generated solutions are compared to the solutions inthe archive. If a new solution is dominated by any archivesolution, it is not entered. If a new solution is not domi-nated, it is added to the archive if the hypercube that it re-sides in does not already contain a solution or if it dominatesthat particular solution. When a new solution is entered, allarchive solutions that are dominated by it, are removed.

After clustering and subsequently performing explicit clus-ter registration, a Gaussian distribution is estimated in eachof the clusters and adapted using the combination of AMS,SDR and AVS as in AMaLGaM [1] with two minor differ-ences. 1) The AVS scheme is based upon whether improve-ments are found. After generating new solutions, each newlygenerated solution is re-associated with the cluster to whichit is closest in objective space. An improvement is said tobe obtained for cluster i if any new solution associated withcluster i is added to the archive. 2) The SDR scheme com-putes, for each cluster, the average of all improvements asso-ciated with that cluster and checks whether the average liesbeyond one standard deviation. Because here improvementscan be obtained in different regions, the ratio of the aver-age improvement is less informative. Instead, we thereforecompute the average ratio of the improvements.

Keeping elitist solutions in the population can contributeto improved convergence. Therefore, each solution in theelitist archive is associated with its nearest cluster. For eachcluster, at most 1

k⌊τn⌋ of its associated elitist solutions are

copied to the population. If there are more elitist solutions,the same nearest-neighbour heuristic is used as in selection.Finally, each cluster generates equally many solutions, corre-sponding to uniform mixture coefficients βi = 1

k. Depending

on how many elitist solutions were copied to the population,at least n− ⌊τn⌋ new solutions are thereby generated.

7. SYNCHRONOUS PARALLEL SOEDASAlthough clustered variation spreads the search bias, MO

selection still focuses exploitation on all objectives at thesame time, reducing pressure towards finding Pareto im-provements. It may therefore be beneficial to add expertsearch bias in the form of separate SO optimization of the mobjectives. In SO optimization there are typically less prob-lems with maintaining pressure on finding improvements.

A combination of MO optimization and SO optimizationhas been proposed before [10]. There, m+1 equal-sized pop-ulations are used. Here, we propose to set the populationsize for each of the m SO optimizers equal to the cluster sizein the MO population. For MAMaLGaM-X this amountsto an overall population size of n + 2mn

k. We further pro-

pose to use an EA for SO optimization that is similar tothe MO optimizer being used, i.e. using the same variationoperator and same selection intensity. In this way, givenenough clusters, the rate of convergence in each cluster isexpected to be similar, resulting in better-aligned supportof the SO optimizers in terms of convergence. Furthermore,

in [10] solutions are migrated from the SO populations tothe MO population and vice versa. Assuming competentSO optimizers however, this may only reduce the effective-ness of the SO optimizers. We therefore propose to onlyadd the best solutions found by SO optimizers in each gen-eration to the archive of the MO optimizer. By injectingthe best solutions found by the SO optimizers for the differ-ent objectives into the elitist archive the pressure of the SOoptimizers to find improvements can filter through to theMO optimizer. Also, the search bias of the MO optimizer isspread out towards the edges of the Pareto front, i.e. wherethe SO optimizers are, ensuring that no unnecessary gapappears between solutions found by the SO optimizers andsolutions found by the MO optimizer. In the remainder wewill refer to the SO-extended version of MAMaLGaM-X byMAMaLGaM-X+.

8. EXPERIMENTS

8.1 Test suiteThe definitions of the problems in our multi-objective op-

timization problem test suite are presented in Table 1.The first two problems we use are the easiest. They are

generalizations of the MED (Multiple Euclidean Distances)problems [9]. Each objective is similarly scaled. There arefurthermore no constraints and no local Pareto fronts, mak-ing the problem relatively simple, comparable to the spherefunction in real-valued SO optimization. The initializationrange (IR) of [−1; 1] is not a constraint. The optimal Paretofront for GM1 is convex; for GM2 it is concave.

We also used the well-known problems1 ECi, i ∈ {1, 2, 3,4, 6}. The IRs of the ECi problems are also constraints.These problems differ from the GM problems in that theobjectives are not similarly defined and not similarly scaled.For more details about these functions, see [13].

The final two problems come from more recent literatureon real-valued MO optimization [3] and are labeled BDi,i ∈ {1, 2}. Both problems make use of Rosenbrock’s func-tion. Premature convergence on this function is likely with-out proper induction of the structure of the search space.Function BD2 is harder than BD1 in that the objective func-tions overlap in all variables instead of only in x0. Further,the IR of x0 in function BD1 is also a constraint. Finally, wehave scaled the objectives of BD2 to ensure that the opti-mum of all problems is in approximately the same range. Bydoing so, using the same value-to-reach for the DPF →S indi-cator (which is explained in the next Section) on all problemscorresponds to a similar front quality on all problems.

To avoid artifacts resulting from boundary-repair meth-ods, the sampling procedure in all MOEDAs is constructedsuch that solutions that are out of bounds are rejected.

8.2 Measuring performanceWe consider the elitist archive upon termination to be

the outcome of a MOEDA and refer to it as an approxima-tion set, denoted S. To measure performance the DPF →S

performance indicator is computed. This performance in-dicator computes the average distance over all points inthe optimal Pareto front PF to the nearest point in S:DPF →S(S) = 1

|PF |

P

f1∈PFminf0∈S{d(f 0, f 1)} where f

is a point in objective space and d(·, ·) computes Euclideandistance. A smaller DPF →S value is preferable and a value

1These problems are also known as ZDTi.

NameObjectives IR

GM1

f0 =˛

˛

˛

˛

1

2

`

x − c0´

˛

˛

˛

˛

d, f1 =

˛

˛

˛

˛

1

2

`

x − c1´

˛

˛

˛

˛

d

c0 = (1, 0, 0, . . .), c1 = (0, 1, 0, 0, . . .), d = 2

[−1; 1]10

(l = 10)

GM2

f0 =˛

˛

˛

˛

1

2

`

x − c0´

˛

˛

˛

˛

d, f1 =

˛

˛

˛

˛

1

2

`

x − c1´

˛

˛

˛

˛

d

c0 = (1, 0, 0, . . .), c1 = (0, 1, 0, 0, . . .), d = 1

2

[−1; 1]10

(l = 10)

EC1

f0 = x0, f1 = γ“

1−p

f0/γ”

γ = 1 + 9“

Pl−1

i=1xi/(l − 1)

”

[0; 1]30

(l = 30)

EC2

f0 = x0, f1 = γ`

1− (f0/γ)2´

γ = 1 + 9“

Pl−1

i=1xi/(l − 1)

”

[0; 1]30

(l = 30)

EC3

f0 = x0, f1 = γ“

1−p

f0/γ − (f0/γ)sin(10πf0)”

γ = 1 + 9“

Pl−1

i=1xi/(l − 1)

”

[0; 1]30

(l = 30)

EC4

f0 = x0, f1 = γ“

1−p

f0/γ”

γ = 1 + 10(l − 1) +Pl−1

i=1

`

x2i − 10cos(4πxi)

´

[−1; 1]×[−5; 5]9

(l = 10)

EC6

f0 = 1− e−4x0sin6(6πx0), f1 = γ`

1− (f0/γ)2´

γ = 1 + 9“

Pl−1

i=1xi/(l − 1)

”0.25

[0; 1]10

(l = 10)

BD1

f0 = x0, f1 = 1− x0 + γ

γ =Pl−2

i=1

`

100(xi+1 − x2i )

2 + (1− xi)2)

´

[0; 1]×[−5.12; 5.12]9

(l = 10)

BDs

2

f0 = 1

l

Pl−1

i=0x2

i

f1 = 1

l−1

Pl−2

i=0

`

100(xi+1 − x2i )

2 + (1− xi)2)

´

[−5.12; 5.12]10

(l = 10)

Table 1: The MO problem test suite.

of 0 is obtained if and only if the approximation set and theoptimal Pareto front are identical. This indicator is usefulfor evaluating performance if the optimum is known becauseit describes how well the optimal Pareto front is covered andthereby represents an intuitive trade-off between the diver-sity of S and its proximity (i.e. closeness to the optimalPareto front). Even if all points in the S are on the opti-mal Pareto front the indicator is not minimized unless thesolutions in the approximation set are spread out perfectly.Because the optimal Pareto front may be continuous, thereare infinitely many solutions possible on the optimal Paretofront. Therefore, we computed 5000 uniformly sampled so-lutions along the optimal Pareto front to use as a discretizedversion of PF for a high-quality approximation.

For the problems in our test-suite, given the ranges ofthe objectives for the optimal Pareto front configurations, avalue of 0.01 for the DPF →S indicator corresponds to frontsthat are quite close to the optimal Pareto front. Fronts thathave a DPF →S value of 0.01 can be seen in Figure 4.

8.3 ResultsAll presented results are averaged over 30 runs. The sub-

population or cluster sizes were set according to guidelinesfrom recent literature on SO [1]. For the two different prob-lem sizes in our test suite, i.e. l = 10 and l = 30, thisboils down to cluster sizes of 112 and 510 respectively forthe full-covariance matrix, 52 and 99 for the Bayesian fac-torization and 32 and 55 for the univariate factorization.Both MOEDAs were given the same overall population size,meaning that twice the number of clusters could be used inMAMaLGaM-X (see Section 3), i.e. the population size inMAMaLGaM-X variants is 1

2k times the cluster size whereas

the population size in SDR-AVS-MIDEA variants is k timesthe cluster size. The discretization of the objectives intohypercubes for the elitist archive is set to 10−3. We com-pared SDR-AVS-MIDEA using implicit cluster registrationwith MAMaLGaM-X and MAMaLGaM-X+ using explicitcluster registration. We observe the average convergence ofthe DPF →S metric to study the impact of the various tech-

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

GM1

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

GM2

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

EC1

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

EC2

-1

-0.5

0

0.5

1

0 0.2 0.4 0.6 0.8 1f0

f 1

EC3

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

EC4

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

EC6

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

BD1

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1f0

f 1

BDs2

Default front Approximation set

Figure 4: Default fronts and approximation sets ob-tained with MAMaLGaM-X+ (DPF →S =0.01, k=20).

niques proposed in combination with estimating Gaussiandistributions either modelling all dependencies (i.e. usinga full covariance matrix), a subset of all dependencies viagreedy Bayesian factorization learning (which is a commonapproach in EDAs, see, e.g. [1, 2, 12]), or modelling no de-pendencies at all (i.e. using the univariate factorization). Incase of the Bayesian factorization, we limited the maximumnumber of parents per variable to 5.

In Figure 5 the convergence of successful runs of bothMOEDAs is shown on EC1 and EC6 with and without theuse of AMS. These results were found to be exemplary ofthe results on all problems. If not all runs were success-ful, the average convergence over all unsuccessful runs isalso shown. A run is defined to be successful if a value of0.01 was reached within the limit of 106 function evalua-tions. The convergence results without AMS are inferior.This is especially the case if the full covariance matrix isused, i.e. when overfitting is most likely. Overfitting is alsomore likely if BKLM clustering is used and consequently,without AMS, MAMaLGaM-X performs the worst. How-ever, combined with explicit cluster registration, AMS has atremendous impact on the performance of MAMaLGaM-X.AMS also speeds up SDR-AVS-MIDEA, albeit not as pro-foundly. AMS further can be seen to positively influence theconvergence of both MOEDAs if only a subset of all possi-ble dependencies is estimated. The impact is then smallerthough because the density-misalignment problem associ-ated with overfitting isn’t there. Overall, MAMaLGaM-X(with AMS) has the best convergence behavior for all vari-ants of dependency processing due to the use of the BKLMmethod combined with explicit registration.

A similar positive influence by AMS was observed on allproblems, for which reason we refrain from presenting fur-ther convergence graphs for results obtained without AMS.Instead, results are summarized using success rates (withinthe limit of 106 evaluations) and presented in Table 2. Theseresults confirm that using AMS results in better perfor-

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC1

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC6

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC1

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC6

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC1

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC6

SDR-AVS-MIDEA

SDR-AVS-MIDEAAMS

MAMaLGaM-XNO AMS

MAMaLGaM-X

Figure 5: Average performance of SDR-AVS-MIDEA (k = 10) and MAMaLGaM-X (k = 20) withand without AMS, on two problems, estimating fullcovariance matrices (top row), Bayesian factoriza-tions (center row) and no covariances (bottom row).Horizontal axis: number of evaluations (both objec-tives per evaluation). Vertical axis: DPF →S . Foreach algorithm averages are shown both for succes-ful runs and unsuccesful runs, giving double occur-rences of lines if some runs were unsuccesful.

mance. The table also shows the severity of the impact ofoverfitting. Going from low-order dependency learning tolearning the full covariance matrix, one would expect onlyhigher success rates. Without the use of AMS however, thesuccess rates almost always drop, often to near 0% success.With AMS however, the overfitting problem is relieved andthe intuition of being able to solve a larger class of problemsreliably by estimating dependencies again can be seen, ob-taining high success rates (given enough clusters and explicitcluster registration as in MAMaLGaM-X) when the use ofthe univariate factorization fails (e.g. on problem BDs

2).Figure 6 shows convergence graphs for use of the full-

covariance Gaussian. MAMaLGaM-X either performs sim-ilar to SDR-AVS-MIDEA or outperforms it. This showsthat the BKLM clustering and explicit cluster registrationtechniques are beneficial and promising in general for multi-objective optimization with mixture-based EDAs.

If the Bayesian factorization or univariate factorizationare used, convergence happens faster because a much smallerpopulation size can be used. For our test suite, a similar suc-cess rate is even obtained, with the exception of BDs

2. Onthis problem, slower convergence is obtained using Bayesianfactorizations and the optimum cannot be found using uni-

Full covariance matrix

BD1 BDs

2 GM1 GM2 EC1 EC2 EC3 EC4 EC6

Without AMS

SDR-AVS-MIDEA-05 83 0 100 100 0 0 13 0 40SDR-AVS-MIDEA-10 100 0 100 100 0 0 0 0 6

MAMaLGaM-X-10 93 0 100 100 6 0 0 0 100MAMaLGaM-X-20 83 3 100 100 0 0 0 3 0

With AMS



MAMaLGaM-X+-10 100 100 100 100 100 100 96 0 100MAMaLGaM-X+-20 100 100 100 100 100 100 100 0 100

Bayesian factorization

BD1 BDs


Without AMS



With AMS




Univariate factorization

BD1 BDs


Without AMS



With AMS




Table 2: Success rates, i.e. the percentage of timesa MOEDA variant obtained DPF →S indicator ≤ 0.01.

variate factorizations. Although it concerns only a singletest problem here, this does illustrate the important fact tokeep in mind that not all problems can be solved efficientlywithout taking dependencies into account, which is also inaccordance with findings for discrete MO problems [12]. Ex-amining this importance in the light of more practical oreven real-world problems is however an important topic offuture research. Also, although the problems used here havedependencies between the problem variables, i.e. because ofthe Rosenbrock problem in BD1 and BDs

2, these dependen-cies are of low order. Using Bayesian factorizations ratherthan a full covariance matrix the Rosenbrock function canbe optimized more efficiently. Moreover, using AMS, the op-timum can be found even with the univariate factorization,albeit it less efficiently. For this reason the optimum of BD1

and BDs

2 can be found using MAMaLGaM-X+ for all vari-ants of dependency modelling. On the one hand this demon-strates the potential of the proposed SO-MO combination.On the other hand, this stresses even more the importanceof testing the influence of high-order dependency modellingon more practical or even real-world MO problems.

Overall, MAMaLGaM-X+ performs the best. While re-quiring only marginally more effort in terms of functionevaluations, good approximations can be found in all runson all problems. The exception is problem EC4, where alltested MOEDAs almost always fail. This problem is highlymulti-modal. Furthermore, the optima of the EC problemslie on the boundary of the search space. Finally, we note

0.001

0.01

0.1

1

0 200000 400000 600000 800000 1e+06

GM1

0.001

0.01

0.1

1

0 200000 400000 600000 800000 1e+06

GM2

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC1

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC2

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC3

0.001

0.01

0.1

1

10

100

0 200000 400000 600000 800000 1e+06

EC4

0.001

0.01

0.1

1

10

0 200000 400000 600000 800000 1e+06

EC6

0.001

0.01

0.1

1

10

100

1000

10000

0 200000 400000 600000 800000 1e+06

BD1

0.001

0.01

0.1

1

10

100

1000

10000

0 200000 400000 600000 800000 1e+06

BDs

2

SDR-AVS-MIDEAAMS, k = 5 SDR-AVS-MIDEAAMS, k = 10 MAMaLGaM-X,MAMaLGaM-X,k = 10 k = 20MAMaLGaM-X+, k = 10 MAMaLGaM-X+, k = 20

Figure 6: Average performance of various MOEDAs on all problems, estimating full covariance matrices ineach cluster. Horizontal axis: number of evaluations (both objectives per evaluation). Vertical axis: DPF →S .For each algorithm averages are shown both for succesful runs and unsuccesful runs, giving double occurrencesof lines if some runs were unsuccesful.

that the < 100% successrate of the univariately-factorizedMAMaLGaM-X+ is only due to the limit of 106 evaluations,around which budget the MOEDA is always near the re-quired DPF →S score of 0.01.

9. SUMMARY AND CONCLUSIONSTo find good approximations of the optimal Pareto front,

continued pressure toward finding improvements is required.If the Pareto front spreads fast, this pressure can be hard tomaintain, especially in the real-valued case where infinitelymany solutions are available. As many solutions of a sim-ilar quality are then selected, a MOEDA can easily con-verge prematurely due to overfitting solutions of that qual-ity, i.e. a contour in the fitness landscape. Enlarging thecapacity of the probabilistic model via mixture distributionsand the modelling of dependencies only increases the prob-ability that such overfitting can occur, contrary to what iscommonly expected from EDAs when employing more com-plex distributions. The techniques described in this paperreduce this risk substantially and effectively. Moreover, us-ing the proposed BKLM clustering technique any EDA canbe extended to a mixture-based version straightforwardly.In future work we shall use this approach to further studythe convergence of MOEDAs in discrete search spaces. Weshall also investigate the use of incremental learning meth-ods to reduce the required number of solutions per cluster.Especially in combination with many clusters, this can po-tentially lead to large performance improvements.

10. REFERENCES[1] P. A. N. Bosman. On empirical memory design, faster

selection of Bayesian factorizations and parameter-freeGaussian EDAs. In G. Raidl et al., editors, Proc. of theGenetic and Evolutionary Comp. Conf. - GECCO-2009,pages 389–396, New York, New York, 2009. ACM Press.

[2] P. A. N. Bosman and D. Thierens. Multi-objectiveoptimization with diversity preserving mixture-basediterated density estimation evolutionary algorithms.International J. of Approx. Reasoning, 31(3):259–289, 2002.

[3] P. A. N. Bosman and D. Thierens. Adaptive variancescaling in continuous multi-objective estimation-of-

distribution algorithms. In D. Thierens et al., editors, Proc.of the Genetic and Evol. Comp. Conf. - GECCO-2007,pages 500–507, New York, New York, 2007. ACM Press.

[4] C. A. Coello Coello, G. B. Lamont, and D. A. VanVeldhuizen. Evolutionary Algorithms for SolvingMulti-Objective Problems. Springer-Verlag, Berlin, 2007.

[5] M. Gallagher and M. Frean. Population-based continuousoptimization, probabilistic modelling and mean shift.Evolutionary Computation, 13(1):29–42, 2005.

[6] C. Gonzalez, J. A. Lozano, and P. Larranaga. Mathematicalmodelling of UMDAc algorithm with tournament selection.behaviour on linear and quadratic functions. InternationalJournal of Approximate Reasoning, 31(3):313–340, 2002.

[7] J. Grahl, S. Minner, and F. Rothlauf. Behaviour of UMDAcwith truncation selection on monotonous functions. InD. Corne et al., editors, Proceedings of the IEEE Congresson Evolutionary Computation - CEC-2005, pages2553–2559, Piscataway, New Jersey, 2005. IEEE Press.

[8] N. Hansen. The CMA evolution strategy: a comparingreview. In J. A. Lozano et al., editors, Towards a NewEvolutionary Computation. Advances in Estimation ofDistribution Algorithms. Springer–Verlag, Berlin, 2006.

[9] K. Harada, J. Sakuma, and S. Kobayashi. Local search formultiobjective function optimization: Pareto descendmethod. In M. Keijzer et al., editors, Proc. of the Geneticand Evolutionary Computation Conf. - GECCO-2006,pages 659–666, New York, New York, 2006. ACM Press.

[10] T. Hiroyasu, M. Nishioka, M. Miki, and H. Yokouchi.Discussion of search strategy for multi-objective geneticalgorithm with consideration of accuracy and broadness ofPareto optimal solutions. In X. Li et al., editors, SimulatedEvolution and Learning - SEAL-2008, pages 339–348,Berlin, 2008. Springer–Verlag.

[11] C. Igel, N. Hansen, and S. Roth. Covariance matrixadaptation for multi-objective optimization. EvolutionaryComputation, 15(1):1–28, 2007.

[12] M. Pelikan, K. Sastry, and D. E. Goldberg. MultiobjectivehBOA, clustering and scalability. In H.-G. Beyer et al.,editors, Proceedings of the Genetic and EvolutionaryComputation Conference - GECCO-2005, pages 663–670,New York, New York, 2005. ACM Press.

[13] E. Zitzler, K. Deb, and L. Thiele. Comparison ofmultiobjective evolutionary algorithms: Empirical results.Evolutionary Computation, 8(2):173–195, 2000.

Date post:	27-Jul-2020
Category:	Documents
Upload:	others
View:	0 times
Download:	0 times

The Anticipated Mean Shift and Cluster Registration in...

Documents