Design of Experiments for the Tuning of Optimisation Algorithms
Enda Ridge
PhD Thesis
The University of York
Department of Computer Science
October 2007
Abstract
This thesis presents a set of rigorous methodologies for tuning the performance of
algorithms that solve optimisation problems.
Many optimisation problems are difficult and time-consuming to solve exactly.
An alternative is to use an approximate algorithm that solves the problem to an
acceptable level of quality and provides such a solution in a reasonable time. Us-
ing optimisation algorithms typically requires choosing the settings of tuning pa-
rameters that adjust algorithm performance subject to this compromise between
solution quality and running time. This is the parameter tuning problem.
This thesis demonstrates that the Design Of Experiments (DOE) approach can
be adapted to successfully address the parameter tuning problem for algorithms
that find approximate solutions to optimisation problems. The thesis introduces
experiment designs and analyses for (1) determining the problem characteristics
affecting algorithm performance, (2) screening and ranking the most important tun-
ing parameters and problem characteristics, and (3) tuning algorithm parameters to
maximise algorithm performance for a given problem instance. Desirability func-
tions are introduced for tackling the compromise of achieving satisfactory solution
quality in reasonable running time.
Five case studies apply the thesis methodologies to the Ant Colony System and
the Max-Min Ant System algorithms for the Travelling Salesperson Problem. New
results are reported and open questions are answered regarding the importance
of both existing tuning parameters and proposed new tuning parameters. A new
problem characteristic is identified and shown to have a very strong effect on the
quality of the algorithms’ solutions. The tuning methodologies presented here yield
solution quality that is as good as or better than the general parameter set-
tings from the literature. Furthermore, the associated running times are orders of
magnitude faster than the results obtained with the general parameter settings.
All experiments are performed with publicly available algorithm code, publicly
available problem generators and benchmarked experimental machines.
Contents
Abstract 1
List of Figures 7
List of Tables 11
Acknowledgments 13
Author’s Declaration 15
I Preliminaries 19
1 Introduction and motivation 21
1.1 Hypothesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Background 29
2.1 Combinatorial optimisation . . . . . . . . . . . . . . . . . . . . . . 29
2.2 The Travelling Salesperson Problem (TSP) . . . . . . . . . . . . . . 30
2.3 Approaches to solving combinatorial optimisation problems . . . 31
2.4 Ant Colony Optimisation (ACO) . . . . . . . . . . . . . . . . . . . . 34
2.5 Design Of Experiments (DOE) . . . . . . . . . . . . . . . . . . . . . 47
2.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
II Related Work 51
3 Empirical methods concerns 53
3.1 Is the heuristic even worth researching? . . . . . . . . . . . . . . 54
3.2 Types of experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Life cycle of a heuristic and its problem domain . . . . . . . . . . 56
3.4 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Sound experimental design . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Heuristic instantiation and problem abstraction . . . . . . . . . . 64
3.7 Pilot Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.10 Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.11 Random number generators . . . . . . . . . . . . . . . . . . . . . 71
3.12 Problem instances and libraries . . . . . . . . . . . . . . . . . . . 71
3.13 Stopping criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.14 Interpretive bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.15 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Experimental work 77
4.1 Problem difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Parameter tuning of other metaheuristics . . . . . . . . . . . . . . 79
4.3 Parameter tuning of ACO . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
III Design Of Experiments for Tuning Metaheuristics 93
5 Experimental testbed 95
5.1 Problem generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Algorithm implementation . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Benchmarking the machines . . . . . . . . . . . . . . . . . . . . . 100
5.4 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 Methodology 105
6.1 Sequential experimentation . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Stage 1a: Determining important problem characteristics . . . . 106
6.3 Stage 1b: Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4 Stage 2: Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.5 Stage 3: Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.6 Stage 4: Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.7 Common case study issues . . . . . . . . . . . . . . . . . . . . . . 125
6.8 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
IV Case Studies 131
7 Case study: Determining whether a problem characteristic affects heuristic performance 133
7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 Research question and hypothesis . . . . . . . . . . . . . . . . . . 134
7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8 Case study: Screening Ant Colony System 143
8.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.4 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 150
8.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9 Case study: Tuning Ant Colony System 153
9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.4 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 164
9.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
10 Case study: Screening Max-Min Ant System 169
10.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
10.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
10.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
10.4 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 176
10.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11 Case study: Tuning Max-Min Ant System 179
11.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
11.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
11.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11.4 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 195
11.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
12 Conclusions 197
12.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
12.2 Advantages of DOE . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
12.3 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
12.4 Summary of main thesis contributions . . . . . . . . . . . . . . . . 199
12.5 Thesis strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
12.6 Thesis limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12.8 Closing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
V Appendices 209
A Design Of Experiments (DOE) 211
A.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
A.2 Regions of operability and interest . . . . . . . . . . . . . . . . . . 213
A.3 Experiment Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
A.4 Experiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 220
A.5 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
A.6 Error, Significance, Power and Replicates . . . . . . . . . . . . . . 225
B TSPLIB Statistics 229
C Calculation of Average Lambda Branching Factor 233
D Example OFAT Analysis 235
D.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
D.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
D.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
D.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
D.5 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 238
References 243
List of Figures
2.1 Growth of TSP problem search space. . . . . . . . . . . . . . . . . 30
2.2 Special cases and generalisations of the TSP. . . . . . . . . . . . . 32
2.3 Experiment setup for the double bridge experiment. . . . . . . . . 35
2.4 An example of a graph data structure. . . . . . . . . . . . . . . . . 36
2.5 The ACO Metaheuristic. . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Common tuning parameters and recommended settings for the ACO
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7 Tuning parameters and recommended settings for MMAS . . . . 46
2.8 Tuning parameters and recommended settings for MMAS . . . . 46
5.1 Relative frequencies of normalised edge lengths for several TSP in-
stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Results of the DIMACS benchmarking of the experiment testbed. 101
5.3 Data from the DIMACS benchmarking of the experiment testbed. 101
6.1 The sequential experimentation methodology. . . . . . . . . . . . 107
6.2 Schematic for the Two-Stage Nested Design with r replicates. . . 108
6.3 A sample overlay plot. . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.1 Number of outliers deleted during each problem difficulty experi-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Relative Error response for ACS on problems of size 300, mean 100. 138
7.3 Relative Error response for ACS on problems of size 700, mean 100. 138
7.4 Relative Error response for MMAS on problems of size 300, mean
100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.5 Relative Error response for MMAS on problems of size 700, mean
100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.1 Descriptive statistics for the ACS screening experiment. . . . . . 145
8.2 Descriptive statistics for the confirmation of the ACS screening
ANOVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.3 95% Prediction intervals for the ACS screening of Relative Error. 147
8.4 95% Prediction intervals for the ACS screening of ADA. . . . . . 147
8.5 95% Prediction intervals for the ACS screening of Time. . . . . . . 148
8.6 Summary of ANOVAs for Relative Error, ADA and Time. . . . . . . 148
9.1 Descriptive statistics for the full ACS FCC design. . . . . . . . . . 155
9.2 Descriptive statistics for the screened ACS FCC design. . . . . . . 156
9.3 Descriptive statistics for the confirmation of the ACS tuning. . . . 158
9.4 95% Prediction intervals for the full ACS response surface model of
Relative Error-Time. . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.5 95% Prediction intervals for the screened ACS response surface
model of RelativeError-Time. . . . . . . . . . . . . . . . . . . . . . 159
9.6 RelativeError-Time ranked ANOVA of Relative Error response from
full model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.7 RelativeError-Time ranked ANOVA of time response from full model. 161
9.8 Full RelativeError-Time model results of desirability optimisation. 162
9.9 Screened RelativeError-Time model results of desirability optimi-
sation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.10 Evaluation of Relative Error response in the RelativeError-Time
model of ACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.11 Evaluation of Time response in the RelativeError-Time model of
ACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.12 Evaluation of ADA response in the ADA-Time model of ACS . . . 165
9.13 Evaluation of Time response in the ADA-Time model of ACS . . . 165
10.1 Descriptive statistics for the MMAS screening experiment. . . . . 171
10.2 Descriptive statistics for the confirmation of the MMAS screening
ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
10.3 95% Prediction intervals for the MMAS screening of Relative Error. 173
10.4 95% Prediction intervals for the MMAS screening of ADA. . . . . . 173
10.5 95% Prediction intervals for the MMAS screening of Time. . . . . 174
10.6 Summary of ANOVAs for Relative Error, ADA and Time for MMAS. 174
11.1 Descriptive statistics for the full MMAS experiment design. . . . . 181
11.2 Descriptive statistics for the screened MMAS experiment design. 182
11.3 Descriptive statistics for the MMAS confirmation experiments. . . 184
11.4 95% prediction intervals of Relative Error by the full RelativeError-
Time model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.5 95% prediction intervals of Relative Error by the screened RelativeError-
Time model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.6 Predictions of Time by the full RelativeError-Time model of MMAS. 186
11.7 Predictions of Time by the screened RelativeError-Time model of
MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
11.8 95% prediction intervals of ADA by the full ADA-Time model of
MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.9 95% prediction intervals of ADA by the screened ADA-Time model
of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.10 95% prediction intervals of Time by the full ADA-Time model of
MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.11 95% prediction intervals of Time by the screened ADA-Time model
of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11.12 RelativeError-Time ranked ANOVA of Relative Error response from
full model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
11.13 RelativeError-Time ranked ANOVA of time response from full model. 190
11.14 Full RelativeError-Time model results of desirability optimisation. 191
11.15 Screened RelativeError-Time model results of desirability optimi-
sation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
11.16 Evaluation of Relative Error response in the RelativeError-Time
model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
11.17 Evaluation of the Time response in the relativeError-Time model of
MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
11.18 Evaluation of the Time response in the RelativeError-Time model
of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
11.19 Evaluation of the Time response in the ADA-Time model of MMAS. 194
11.20 Evaluation of Relative Error response in the RelativeError-Time
model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
A.1 Region of operability and region of interest . . . . . . . . . . . . . 213
A.2 Fractional Factorial designs for two to twelve factors. . . . . . . . 216
A.3 Effects and alias chains . . . . . . . . . . . . . . . . . . . . . . . . 217
A.4 Savings in experiment runs when using a fractional factorial design
instead of a full factorial design. . . . . . . . . . . . . . . . . . . . 218
A.5 Central composite designs for building response surface models. 218
A.6 Individual desirability functions. . . . . . . . . . . . . . . . . . . . 221
A.7 Examples of possible main and interaction effects . . . . . . . . . 224
B.1 Some descriptive statistics for the symmetric Euclidean instances
in TSPLIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
B.2 Histogram of the bier127 TSPLIB instance. . . . . . . . . . . . . . 230
B.3 Histogram of the Oliver30 TSPLIB instance. . . . . . . . . . . . . . 231
B.4 Histogram of the pr1002 TSPLIB instance. . . . . . . . . . . . . . 231
C.1 Pseudocode for the calculation of the average lambda branching
factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
D.1 Fixed parameter settings for the OFAT analysis. . . . . . . . . . . 236
D.2 Descriptive statistics for the six OFAT analyses. . . . . . . . . . . 237
D.3 Summary of results from the six OFAT analyses. . . . . . . . . . . 238
D.4 Plot of the effect of alpha on relative error for a problem with size
400 and standard deviation 10. . . . . . . . . . . . . . . . . . . . . 239
D.5 Plot of the effect of alpha on relative error for a problem with size
400 and standard deviation 40. . . . . . . . . . . . . . . . . . . . . 239
D.6 Plot of the effect of alpha on relative error for a problem with size
400 and standard deviation 70. . . . . . . . . . . . . . . . . . . . . 240
D.7 Plot of the effect of alpha on relative error for a problem with size
500 and standard deviation 10. . . . . . . . . . . . . . . . . . . . . 240
D.8 Plot of the effect of alpha on relative error for a problem with size
500 and standard deviation 40. . . . . . . . . . . . . . . . . . . . . 241
D.9 Plot of the effect of alpha on relative error for a problem with size
500 and standard deviation 70. . . . . . . . . . . . . . . . . . . . . 241
List of Tables
2.1 A selection of ant heuristic applications. . . . . . . . . . . . . . . . 36
3.1 The state of the art in nature-inspired heuristics from 10 years ago. 57
4.1 Evolved parameter values for ACS. . . . . . . . . . . . . . . . . . . 86
6.1 A full factorial combination of two problem characteristics. . . . . 123
7.1 Parameter settings for the problem difficulty experiments . . . . . 135
8.1 Design factors for the screening study with ACS. . . . . . . . . . . 144
9.1 Design factors for the tuning study with ACS. . . . . . . . . . . . 154
10.1 Design factors for the screening study with MMAS. . . . . . . . . 170
11.1 Design factors for the tuning study with MMAS. . . . . . . . . . . 180
11.2 Number of outliers removed from MMAS tuning analyses. . . . . 183
A.1 Numbers of each effect estimated by a full factorial design of 10
factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
A.2 Some common response transformations. . . . . . . . . . . . . . 222
Acknowledgments
I am very grateful to my family and friends whose support and encouragement
helped me through the PhD process. I thank my supervisor Daniel Kudenko at
the University of York for his supervision. He promptly reviewed my writing and
was always available when I had questions or doubts. I thank my departmental
assessor, Professor John Clark for his advice and Dimitar Kazakov for reviewing
some of my early papers. The thorough examination of the thesis and constructive
criticisms by Thomas Stützle, at l'Université Libre de Bruxelles, and John Clark
greatly improved the thesis.
I also wish to thank Daniel and my colleagues Leonardo Freitas and Arturo
Servin for allowing me to use their machines for my experimental work. Pierre
Andrews, Rania Hodhod, Silvia Quarteroni, Juan Perna and Sergio Mena also
made their machines available when a deadline required some additional com-
puting power. I am very grateful to my colleague and friend Jovan Cakic who
passed away during the course of my research. Jovan helped me quickly set up
the original C program on which much of this thesis is based.
I am grateful to the anonymous reviewers of my publications whose comments
helped shape my research. The research was also greatly improved by discussions
with Ruben Ruiz, Thomas Bartz-Beielstein, Holger Hoos, David Woodruff, Marco
Chiarandini, and Mike Preuss at various international conferences and with Simon
Poulding at York. Major parts of the thesis were proofread by Leonardo Freitas.
I thank Pauline Greenhough, Filomena Ottaway, Judith Warren, Diane Neville,
Richard Selby, Carol Lock, Nicholas Black, Ian Patrick and all the administrative
and technical support staff at the Department for their help during the PhD.
I am very grateful to Michael Madden, my MSc supervisor, and Colm O’ Riordan
at the National University of Ireland, Galway who encouraged and supported my
application to York. Finally, I gratefully acknowledge the financial support of my
scholarship from the Department of Computer Science at the University of York
and its support of my research travel.
Author’s Declaration
This thesis describes original research carried out by the author Enda Ridge under
the supervision of Dr. Daniel Kudenko at the University of York. This research has
not been previously submitted to the University of York or to any other university
for the award of any degree. Some chapters of the thesis are based on articles that
the author published or submitted for publication in the peer-reviewed scientific
literature during the course of the thesis research. The details of these publications
follow.
The early ideas in this research arose from explorations into parallel and de-
centralised versions of Ant Colony Optimisation (ACO) algorithms.
1. Enda Ridge, Daniel Kudenko, Dimitar Kazakov, Edward Curry. Parallel, Asynchronous and Decentralised Ant Colony System, in Proceedings of
AISB 2006: Adaptation in Artificial and Biological Systems. First Interna-
tional Symposium on Nature-Inspired Systems for Parallel, Asynchronous
and Decentralised Environments, vol. 2, T. Kovacs and J. A. R. Marshall,
Eds. AISB, 2006, pp. 174-177.
2. Enda Ridge, Daniel Kudenko, and Dimitar Kazakov, A Study of Concurrency in the Ant Colony System Algorithm, in Proceedings of the IEEE
Congress on Evolutionary Computation, 2006, pp. 1662-1669.
3. Enda Ridge, Edward Curry, Daniel Kudenko, Dimitar Kazakov, Nature-Inspired Systems for Parallel, Asynchronous and Decentralised Environments,
in Multi-Agent and Grid Systems, vol. 3, H. Tianfield and R. Unland, Eds.
IOS Press, 2007.
It quickly became obvious that these experiments would have a large amount of
experimental noise arising from the parallel and asynchronous nature of the soft-
ware. This prompted a search for how experiments with Ant Colony Optimisation
(ACO) had been conducted in the literature and how the original sequential single
machine versions of the algorithms were set up. An examination of the litera-
ture revealed there were few guidelines and no rigorous approaches to setting up
ACO algorithms. The original research direction changed. ‘Roadmap’ publications
called for, among other things, recommended experiment designs and analyses for
experiments with metaheuristics such as ACO.
4. Enda Ridge and Edward Curry, A Roadmap of Nature-Inspired Systems Research and Development, Multi-Agent and Grid Systems, vol. 3, IOS
Press, 2007.
5. Marco Chiarandini, Luís Paquete, Mike Preuss, Enda Ridge, Experiments on Metaheuristics: Methodological Overview and Open Issues, Institut
for Matematik og Datalogi, University of Southern Denmark, Technical Report
IMADA-PP-2007-04 (http://bib.mathematics.dk/preprint.php?id=IMADA-PP-
2007-04), March 2007, ISSN 0903-3920.
A preliminary version of the screening and tuning methodologies of the thesis
appeared in the following publication.
6. Enda Ridge and Daniel Kudenko, Sequential Experiment Designs for Screening and Tuning Parameters of Stochastic Heuristics, in Workshop on
Empirical Methods for the Analysis of Algorithms at the Ninth International
Conference on Parallel Problem Solving from Nature, L. Paquete, M. Chiaran-
dini, and D. Basso, Eds., 2006, pp. 27-34.
A refined version of this methodology is described in Chapter 6 and is used in
the case studies of Chapters 8 to 11.
Initial attempts to apply this methodology were not performing as well as ex-
pected and so an investigation was conducted into possible unknown problem
characteristics that might be interfering with the methodology’s models. This led
to the following publications, the second of which contains the updated data of
Chapter 7.
7. Enda Ridge and Daniel Kudenko, An Analysis of Problem Difficulty for a Class of Optimisation Heuristics, in Proceedings of the Seventh Euro-
pean Conference on Evolutionary Computation in Combinatorial Optimisa-
tion (EvoCOP), vol. 4446, Lecture Notes in Computer Science, C. Cotta and
J. Van Hemert, Eds. Springer-Verlag, 2007, pp. 198-209. ISBN 978-3-540-
71614-3.
8. Enda Ridge and Daniel Kudenko, Determining whether a problem characteristic affects heuristic performance. A rigorous Design of Experiments approach, in Recent Advances in Evolutionary Computation for Combina-
torial Optimization. Springer, Studies in Computational Intelligence, 2008.
ISBN 1860-949X.
The first application of the methodology was published in the following papers,
updated versions of which appear in Chapters 8 and 9. The third of these was
nominated for best paper in its track at the Genetic and Evolutionary Computation
conference 2007.
9. Enda Ridge and Daniel Kudenko, Screening the Parameters Affecting Heuristic Performance, in Proceedings of the Genetic and Evolutionary Compu-
tation Conference, vol. 1, D. Thierens, H.-G. Beyer, M. Birattari, et al., Eds.
ACM, 2007. ISBN 978-1-59593-697-4.
10. Enda Ridge and Daniel Kudenko, Screening the Parameters Affecting Heuristic Performance. The Department of Computer Science, The University of
York, Technical Report YCS 415 (www.cs.york.ac.uk/ftpdir/reports/index.php),
April 2007.
11. Enda Ridge and Daniel Kudenko, Analyzing Heuristic Performance with Response Surface Models: Prediction, Optimization and Robustness, in
Proceedings of the Genetic and Evolutionary Computation Conference, D.
Thierens, H.-G. Beyer, M. Birattari, et al., Eds. ACM, 2007, pp. 150-157.
ISBN 978-1-59593-697-4.
Finally, the methodology was applied to the MMAS heuristic and published in
the following paper, an updated version of which appears in Chapter 11. The paper
was winner of the best paper award at the Engineering Stochastic Local Search
Algorithms workshop.
12. Enda Ridge and Daniel Kudenko, Tuning the Performance of the MMAS Heuristic, in Engineering Stochastic Local Search Algorithms. Designing, Im-
plementing and Analyzing Effective Heuristics, vol. 4638, Lecture Notes in
Computer Science, T. Stützle and M. Birattari, Eds. Berlin / Heidelberg:
Springer, 2007, pp. 46-60. ISBN 978-3-540-74445-0.
1 Introduction and motivation
This thesis presents rigorous empirical methodologies for modelling and tuning the
performance of algorithms that solve optimisation problems.
Consider the very common problem of efficiently assigning limited indivisible re-
sources to meet some objective. For example, a manufacturing plant must sched-
ule machines to a particular job in the correct order so that machine utilisation is
maximised and a product is manufactured as quickly as possible. Low cost airlines
must assign cabin crew shifts from a minimum size of workforce and to as many
aircraft as possible. Logistics companies need to deliver products to a set of loca-
tions in an order that minimises delivery cost. Many such similar problems occur
in management, finance, engineering and physics. These problems are known as
Combinatorial Optimisation (CO) problems.
CO problems are notoriously difficult to solve because a large number of poten-
tial solutions must be considered. Constraints on the available resources will limit
the feasible alternatives that need to be considered. However, most CO problems
still contain sufficient alternatives to make the best choices of available options
difficult. CO problems typically require exponential time for solution in the worst
case. In plain terms, as the problem gets larger, the difficulty of finding an exact
solution increases extremely quickly. This has led to the use of heuristic solu-
tion methods—methods that sacrifice the guarantee of finding an exact solution
in order to find a satisfactory solution in reasonable time. We term this reduction in solution quality in exchange for a reduction in solution time the heuristic compromise.
Metaheuristics1 are a more recent attempt to combine basic heuristics into a
flexible higher-level framework in order to better solve CO problems. Some of the
most popular metaheuristics for combinatorial optimisation are Ant Colony Opti-
misation (ACO), Evolutionary Computation (EC), Iterated Local Search (ILS), Sim-
1 The terms metaheuristic and heuristic are used interchangeably throughout the thesis.
ulated Annealing (SA) and Tabu Search (TS). Many of these metaheuristics have
achieved notable successes in solving difficult and important problems. Industry is
taking note of this. Several companies incorporate metaheuristics into their solu-
tions of complex optimisation problems [12]. These include ILOG (www.ilog.com),
SAP (www.sap.com), NuTech Solutions (www.nutechsolutions.com),
AntOptima (www.antoptima.com) and EuroBios (www.eurobios.com). Metaheuris-
tics are therefore a research area of growing importance.
The flexibility of the metaheuristic framework comes at a cost. Metaheuristics
typically require a relatively large amount of ‘tuning’ in order to adjust them to the
particular problem at hand. This tuning involves setting values of many tuning parameters, much as one would adjust the dials on an old-fashioned television set
to find a given station. This situation is exacerbated if one considers parameteris-
ing internal components of the metaheuristic and then adding or removing these
parameterised components to modify performance. We term these design parameters. Some metaheuristics can have anything from five to more than twenty-five
of these tuning parameters [33] and the scope for design parameters is effectively
limitless. It quickly becomes very difficult to search through all possible tuning
parameter settings and thus the potential performance of the metaheuristic is not
realised. This is the parameter tuning problem.
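To make the combinatorial explosion concrete, consider a hypothetical metaheuristic with ten tuning parameters, each examined at only five candidate settings; the counts below are illustrative rather than drawn from any particular algorithm.

```python
# Even a coarse grid over a modest, hypothetical parameter space is
# already infeasible to enumerate exhaustively.
settings_per_parameter = 5
parameters = 10
combinations = settings_per_parameter ** parameters
print(combinations)  # 9765625 candidate configurations
```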
The parameter tuning problem is one of the most important research challenges
for any given metaheuristic2. The main elements of this research challenge are as
follows.
1. Screening problem characteristics to determine which problem character-
istics affect metaheuristic performance.
2. Screening tuning parameters to determine which tuning parameters affect
metaheuristic performance.
3. Modelling the relationship between tuning parameters, problem characteris-
tics and performance.
4. Predicting metaheuristic performance for a given problem instance or set of
instances given particular tuning parameter settings.
5. Tuning metaheuristic performance for a given problem instance or set of in-
stances by recommending appropriate tuning parameter settings.
6. Assessing robustness of the tuned metaheuristic performance to variations
in problem instance characteristics. That is, determining whether tuned pa-
rameter settings for a given combination of problem instance characteristics
deteriorate significantly when applied to similar problem instance character-
istics.
The key obstacles to addressing these challenges are as follows:
2 There are, of course, other very important research challenges. Comparing heuristics, for example, is an important challenge that is fraught with its own difficulties.
• Problem space. All of the important problem characteristics are generally
not known and may be difficult to determine.
• Parameter space3. The number of tuning parameters and the possible com-
binations of values they can take on is large or even infinite.
• Multiple performance metrics. Performance must be analysed in terms of
both solution quality and solution time because of the heuristic compromise.
• Application scenario. The emphasis in a particular parameter tuning prob-
lem will depend on the specific application scenario. If problem instances are
likely to be similar in characteristics then it is advantageous to have a general
model of the relationship between parameters, instances and performance. If
a small number of problem instances are likely to be tackled and those in-
stances require significant resources for their solution then a relatively fast
tuning approach is to be preferred.
These challenges and obstacles must be addressed for every new metaheuristic
that is proposed, for every modification to an existing metaheuristic that is pro-
posed, and for every new problem type that is addressed. The parameter tuning
problem is ubiquitous. Without addressing these challenges and overcoming these
obstacles, the metaheuristic is of little use in practice as its user cannot set it up
for maximum performance. So how does one address these challenges?
One can distinguish two broad approaches [28]. Analytical approaches attempt
to analytically prove characteristics of the algorithm such as its worst-case and
average-case behaviour. Empirical analyses implement the algorithm in computer
code and evaluate its behaviour on selected problems. Both of these approaches
have been unsatisfactory to date.
The analytical approach is, in principle, the more desirable of the two because of its
potential generality and pure mathematical foundation. While it is to be expected
that analytical approaches will improve with time and effort, they are far from ideal
at their current level of maturity. The mathematical tools do not yet exist to suc-
cessfully formalise and theorise about the behaviour and performance of existing
cutting-edge metaheuristics. While early attempts at analysis are emerging, they
generally resort to extreme simplifications to the metaheuristic description to ren-
der the analyses tractable. There is also a lack of comparisons of the theoretical
predictions to actual implementations to determine whether the theory predicts
the reality. These simplifications make the majority of conclusions inapplicable for
practical purposes.
An empirical approach would seem an attractive alternative by virtue of its
simplicity—collect enough data and interpret it without bias. The reality is very
different. Which data should be collected? What issues affect the measurement
and collection of the data? How much data is enough data? How should data be
3 We use the term parameter space when considering all the possible combinations of tuning parameters. We use the term design space when considering all the possible combinations of both tuning parameter settings and problem characteristics.
interpreted? Can data interpretation be backed by mathematical precision or must
we be limited to subjective interpretation? How do we ensure that an empirical
analysis is both repeatable and reproducible?4
An examination of the professional research journals shows that while empir-
ical analyses of metaheuristics are often large and broad ranging, they are sel-
dom backed by the scientific rigour that one would expect in more mature dis-
ciplines such as the physical, medical and social sciences. Few of the questions
from the previous paragraph regarding empirical methodology are either recog-
nised or clearly addressed by researchers. Proper experimental designs are seldom
used. Interpretations of results are subjective opinions rather than sound statisti-
cal analyses. Parameters are selected without justification or based on the reports
from other studies without verification of their appropriateness for the current sce-
nario [2]. This leaves the metaheuristic ill-defined, experiments irreproducible and
leads to an underestimation of the time needed to deploy the metaheuristic [69].
The list of failings is long and has often been lamented in the literature of the last
two decades [69, 64, 101, 7, 48, 65].
While these criticisms in the literature are justified, others point out that few
publications go further and explicitly illustrate the application of sound established
scientific methodology to the analysis of metaheuristics [28]. Without research
that sets a good example, the impoverished state of the field’s methodology has
thus persisted. Researchers in the natural sciences have available an extensive
lore of laboratory techniques to guide the development of rigorous and conclusive
experiments. This has not been the case in algorithmic research [79]. Attempts
to improve this situation with illustrative case studies and to educate researchers
with tutorials are emerging [122, 27, 9, 90]. A comprehensive methodology for
addressing the aforementioned research challenges is needed. A comprehensive
illustration of the application of such a methodology is needed. Fortunately, a
good candidate methodology already exists.
The field of Design of Experiments (DOE) is defined as:
. . . a systematic, rigorous approach to engineering problem-solving that
applies principles and techniques at the data collection stage so as to
ensure the generation of valid, defensible, and supportable engineering
conclusions. In addition, all of this is carried out under the constraint
of a minimal expenditure of engineering runs, time, and money. [1]
As well as providing this rigorous and efficient approach to data collection,
DOE also provides statistically designed experiments. A statistically designed ex-
periment offers a number of advantages over a design that does not use statisti-
cal techniques [89]. Attention is focussed on measuring sources of variability in
results. The required number of tests is determined reliably and may often be re-
duced. Detection of effects is more precise and the correctness of conclusions is
4 A repeatable experiment is one which the original experimenter can redo and get very similar results. A reproducible experiment is one which another experimenter can reproduce independently and get similar results that lead to the same conclusions.
known with the mathematical precision of statistics.
DOE is a well-established field that has existed for over eighty years. It evolved
for the manufacturing industry and is now well supported by commercial soft-
ware. The National Institute of Standards and Technology describes four general
engineering problem areas to which DOE may be applied [1]:
• Screening/Characterizing: the engineer is interested in understanding the
process as a whole in the sense that he/she wishes to rank factors that affect
the process in order of importance.
• Modelling: the engineer is interested in modelling the process with the output
being a good-fitting (high predictive power) mathematical relationship.
• Optimizing: the engineer is interested in optimising the process by adjusting
the factors that affect the process.
• Comparative: the engineer is interested in assessing whether a given choice
is preferable to an alternative.
The first three of these application areas map directly to the parameter tuning
research challenges for metaheuristics identified earlier5. The metaheuristic being
studied is the ‘process’ to which DOE is applied. The rigour of DOE provides
the framework to address the concerns regarding the methodology of empirical
analyses. The statistically designed experiments address any concerns about the
subjective nature of the interpretation of results.
1.1 Hypothesis Statement
We can now identify the central hypothesis of this research:
The problem of tuning a metaheuristic can be successfully addressed with a
Design Of Experiments approach.
If the parameter tuning problem is addressed successfully, then we can expect
• to make verifiably accurate predictions of metaheuristic performance with a
given confidence.
• to make verifiably accurate recommendations on the most important tuning
parameters with a given confidence.
• to make verifiably accurate recommendations on tuning parameter settings
with a given confidence.
5 Comparative DOE studies are appropriate for the comparison of heuristics, typically answering questions such as whether one heuristic is better than another. The difficulties of comparative studies are covered in the literature. Comparative studies are appropriate once all other issues regarding design, setup and running have been addressed. This thesis focuses on tuning and so should facilitate fairer comparative studies.
• to make all of these recommendations in terms of solution quality and solu-
tion time.
The specific metaheuristic studied in this thesis is Ant Colony Optimisation
(ACO) [47]. The CO problem domain to which ACO will be applied is the Travel-
ling Salesperson Problem [75]. The importance of and need for this research has
already been highlighted in the ACO field [118].
1.2 Thesis structure
The thesis is divided into four parts. Preliminaries are the necessary topics that
must be covered to place the research in context. The second part, Related Work,
presents a synthesis of the methodological issues that arise in empirical analyses of
metaheuristics and critically reviews the literature on parameter tuning in light of
these issues. The third part, Design Of Experiments for Tuning Metaheuristics, is
concerned with methodology. It introduces the experimental testbed and presents
one of the thesis’ main contributions, a Design of Experiments methodology for
metaheuristic parameter tuning. The final part, Case Studies, contains several
examples of the successful application of the methodology. The specific chapters
are now summarised.
Chapter 2 on page 29 gives a background on combinatorial optimisation and
the Travelling Salesperson Problem, the type of optimisation and problem domain
studied in this thesis. Various approaches to solving combinatorial optimisation
problems are covered. The discussion then focuses on Ant Colony Optimisation
(ACO), the family of metaheuristics used to illustrate the methodology advocated
in the thesis. The chapter concludes with an overview of the Design Of Experiments
field.
Chapter 3 brings together and discusses many of the issues that arise when
performing empirical analyses of heuristics. Some of these issues have often been
raised in the research literature but are scattered across a range of related research
fields. This chapter therefore draws on literature from fields such as Operations
Research, Heuristics, Performance Analysis, Design of Experiments and Statistics.
Chapter 4 is a critical review of the literature on parameter tuning in light
of the methodological issues highlighted in the previous chapter. It begins with
approaches to analysing problem difficulty for heuristics. Parameter tuning is
addressed in terms of metaheuristics other than ACO and in terms of the ACO
metaheuristic. For the treatment of ACO parameter tuning, the chapter reviews
analytical, automated and empirical approaches.
Chapter 5 describes the experimental testbed. It covers the problem generator
and metaheuristic code used. It also details the benchmarking of the experiment
machines. All topics are covered in light of the empirical analysis issues discussed
in Chapter 3. This chapter is key to the reproducibility of the results the thesis
presents.
Chapter 6 is a detailed step-by-step description of the Design Of Experiments
methodology that the thesis introduces. The methodology is crafted in terms of the
empirical analysis concerns of Chapter 3. This chapter serves as a template for all
the case studies reported in the final part of the thesis.
Chapters 7 to 11 on page 179 are the thesis case studies. They illustrate all
aspects of the thesis’ Design Of Experiment methodology of Chapter 6. Case stud-
ies cover the two best performing members of the ACO metaheuristic family, Ant
Colony System and Max-Min Ant System. Many new results for the ACO field are
presented and open questions from the literature are answered. This underscores
the benefits of adopting the thesis’ rigorous Design Of Experiments methodology.
The thesis concludes with Chapter 12 on page 197. Appendix A is an overview
of Design Of Experiments (DOE) terminology and concepts. It is provided for the
convenience of the reader who is unfamiliar with DOE. It should not be taken as a
replacement for comprehensive textbooks on the subject [89, 84, 85]. Appendix B
contains some statistics related to the TSP. Appendix C is an important complexity
calculation related to the MMAS heuristic.
1.3 Chapter summary
This chapter has introduced and motivated the main thesis of this research.
• Problems of combinatorial optimisation were introduced and the difficulty of
solving them was explained.
• Metaheuristics were introduced as a popular emerging approach for solving
CO problems.
• The parameter tuning problem was identified as a key research challenge
that will always be faced when dealing with newly proposed metaheuristics,
proposed changes to existing metaheuristics and new problem types. The
difficulty of the parameter tuning problem was explained and the importance
of solving the problem in terms of the heuristic compromise of solution time
and solution quality was emphasised. Approaches to addressing the parame-
ter tuning problem were categorised as either analytical or empirical and the
current deficiencies in the state-of-the-art of both approaches were explained.
• Design Of Experiments (DOE) was identified as a well-established field that
may be a very good candidate for empirically solving the parameter tuning
problem in a rigorous fashion.
This led to the central hypothesis of this thesis:
The problem of tuning a metaheuristic can be successfully addressed with a
Design Of Experiments approach.
2 Background
The previous chapter introduced combinatorial optimisation problems, discussed
their importance across academia and industry and explained why they are typ-
ically difficult to solve. Metaheuristics were introduced as a general framework
for solving such problems and the parameter tuning problem was presented as one
of the key obstacles to the successful deployment of metaheuristics. The chapter
highlighted the lack of experimental rigour in the field’s attempts to analyse and
understand its heuristics, particularly its lack of a rigorous approach to the parameter tuning problem. This led to the hypothesis that rigorous techniques can
be adapted from the Design Of Experiments (DOE) field to successfully tackle the
parameter tuning problem.
This chapter gives a more detailed background to the areas mentioned in the
previous chapter’s motivation and hypothesis. It begins with a general descrip-
tion of combinatorial optimisation before focussing on the particular combinatorial
optimisation problem addressed in this thesis. The approaches to solving combi-
natorial optimisation problems are reviewed. The chapter then focuses on the
particular family of metaheuristics that is studied in this thesis. The chapter con-
cludes with some background on the Design Of Experiments techniques that will
be adapted to the parameter tuning problem in this thesis.
2.1 Combinatorial optimisation
Optimisation problems in general divide naturally into two classes: those where
solutions are encoded with real-valued variables and those where solutions are
encoded with discrete variables. Combinatorial Optimisation (CO) problems are of
the latter type.
An illustrative example of a CO problem is that of class timetabling. Such
timetabling typically involves assigning a group of teachers and students to class-
rooms. This assignment is subject to the constraints that a teacher cannot teach
all subjects, students are only taking a limited number of all the available classes,
teachers and students cannot be in two classrooms at once and no more than one
class can be taught in a classroom at a given time. The variables are discrete
because we cannot consider some fraction of a student, room or teacher. The diffi-
culty of the problem lies in the large number of possible solutions that have to be
searched and the constraints on keeping all teachers and students satisfied. Some
other popular examples of CO problems are the Travelling Salesperson Problem
(TSP) [75], the Quadratic Assignment Problem (QAP) [55, p. 218] and the Job Shop
Scheduling Problem (JSP) [55, p. 242].
The ubiquity of CO problems and their importance for logistics, manufacture,
scheduling and other industries has resulted in a large body of research devoted
to their understanding, analysis and solution.
This thesis is concerned with a particular type of combinatorial optimisation
problem called the Travelling Salesperson Problem.
2.2 The Travelling Salesperson Problem (TSP)
Informally, the Travelling Salesperson Problem (TSP) can be described in the fol-
lowing way.
Given a number of cities and the costs of travelling from any city to any
other city, what is the cheapest round-trip route that visits each city
exactly once? [121]
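More formally, and as a standard formulation from the optimisation literature rather than part of the quoted definition, the TSP asks for the cheapest cyclic ordering of the cities:

```latex
% Given n cities and travel costs c_{ij}, minimise the cost of a
% round trip over all permutations \pi of the cities:
\min_{\pi \in S_n} \left( \sum_{i=1}^{n-1} c_{\pi(i)\,\pi(i+1)} \; + \; c_{\pi(n)\,\pi(1)} \right)
```

For the symmetric TSP studied in this thesis, the costs additionally satisfy c_{ij} = c_{ji}.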
The most direct solution would be to try all the ordered combinations of cities
and see which combination, or tour, is cheapest. Using such a brute force search rapidly becomes impractical because the number of possible combinations of n
cities to consider is the factorial of n. This rapid growth in problem search space
size is illustrated in Figure 2.1.
Figure 2.1: Growth of TSP problem search space. The horizontal axis is the number of cities in a TSP problem. The vertical axis is the number of combinations of cities that have to be considered. The figure shows the factorial growth of search space size with problem size.
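To see why this direct approach fails, the following sketch exhaustively enumerates the tours of a small symmetric instance; the function name and cost-matrix representation are illustrative choices, not code used in the thesis.

```python
import itertools
import math

def brute_force_tsp(costs):
    """Exhaustively search every distinct tour of a symmetric TSP.

    `costs` is an n-by-n symmetric cost matrix. Fixing the start city
    and skipping each tour's reversal still leaves (n - 1)!/2 tours.
    """
    n = len(costs)
    best_tour, best_cost = None, math.inf
    for perm in itertools.permutations(range(1, n)):
        if perm and perm[0] > perm[-1]:
            continue  # skip the reverse duplicate of each tour
        tour = (0,) + perm
        cost = sum(costs[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost

# The factorial explosion in plain numbers: 20 cities already mean
# 19!/2, roughly 6.1e16, distinct tours to evaluate.
print(math.factorial(19) // 2)
```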
In fact, the TSP has been shown to be Nondeterministic Polynomial-time hard (NP-hard). Informally, this means that it is contended that the TSP cannot be
solved to optimality within polynomially bounded computation time in the worst
case. A detailed examination of the TSP, computational complexity theory and NP-
hardness is beyond the scope of this thesis. The reader is referred to the literature
for a discussion of this important topic [70]. For practical purposes, the difficulty
of the TSP means that a sophisticated approach to its solution is required.
The difficulty of solving the TSP to optimality, despite its conceptually simple
description, has made it a very popular problem for the development and testing
of combinatorial optimisation techniques. The TSP “has served as a testbed for
almost every new algorithmic idea, and was one of the first optimization problems
conjectured to be ‘hard’ in a specific technical sense” [70, p. 37]. This is partic-
ularly so for algorithms in the Ant Colony Optimisation (ACO) field where ‘a good
performance on the TSP is often taken as a proof of their usefulness’ [47, p. 65].
The type of TSP described at the start of this section can be termed the general
asymmetric TSP. It is asymmetric because the cost of travelling between two given
cities can be different depending on the direction of travel. The cost from city
1 to city 2 can be different to the cost from city 2 to city 1. There are several
further categories of TSP problem that we can distinguish [70, p. 58-61]. Their
relationship to one another in terms of generalisations and specialisations of the
general asymmetric TSP is illustrated in Figure 2.2.
This thesis focuses exclusively on symmetric TSP instances. The reader is re-
ferred to the literature for details of the other TSP types [70, p. 58-61]. The sym-
metric TSP specialisation was chosen as a problem domain because the heuristics
researched in this thesis were originally developed for this TSP type.
This thesis follows the usual convention of using the term problem to describe
a general problem such as the Travelling Salesperson Problem and an instance to
be a particular case of a problem.
2.3 Approaches to solving combinatorial optimisation problems
Algorithms to tackle combinatorial optimisation problems can be classified as ei-
ther exact or approximate. Exact methods are guaranteed to find an optimal so-
lution in bounded time. Unfortunately, many problems are NP-hard like the TSP
and so may require exponential time in the worst case. This impracticality of exact
methods has led to the use of approximate (or heuristic) methods—methods that
sacrifice the guarantee of finding an optimal solution in order to find a satisfactory
solution in reasonable time. We term this the heuristic compromise. This com-
promise is even mentioned implicitly in some definitions of CO problems [42, p.
244].
Approximate methods (or heuristics) can be distinguished as being either constructive methods or local search (or improvement) methods. Constructive methods start from scratch and add solution components until a complete solution is found. The nearest neighbour heuristic for the TSP is an example of a constructive heuristic. It begins at some city and repeatedly chooses the nearest unvisited city until a complete tour has been constructed.
K-Salesman TSP Dial-a-ride Stacker Crane
GeneralAsymmetric
TSP
Asymmetric triangle InequalityTSP Symmetric TSP
MixedChinesePostman
DirectedHamiltonian
Cycle
Symmetric TriangleInequality TSP
Hamiltonian Cycle Rectilinear TSP Euclidean TSP
Hamiltonian Cycle forGrid Graphs
Figure 2.2: Special cases and generalisations of the TSP. The TSP type studied in this thesis is high-lighted in bold. Image adapted from [70, p. 59].
until a complete tour has been constructed. Local search (or improvement) heuris-
tics start with an initial solution and iteratively try to improve on this solution by
searching within an appropriately defined neighbourhood of the current solution1.
Others [11] provide an overview of local search approaches. This research focuses
on constructive methods rather than local search.
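To make the constructive approach concrete, the following minimal Python sketch (our own illustration, not the ACOTSP implementation) builds a tour with the nearest neighbour heuristic; dist is assumed to be an n x n matrix of inter-city distances.

    def nearest_neighbour_tour(dist, start=0):
        # Construct a TSP tour by repeatedly visiting the nearest unvisited city.
        n = len(dist)
        unvisited = set(range(n)) - {start}
        tour = [start]
        current = start
        while unvisited:
            # Greedily choose the closest city not yet visited.
            nearest = min(unvisited, key=lambda city: dist[current][city])
            tour.append(nearest)
            unvisited.remove(nearest)
            current = nearest
        return tour

The greedy choice at each step is what makes the method constructive: the tour is assembled component by component and is only complete at the end of the loop.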
It is clear that there is a myriad of possible combinations of constructive heuris-
tics and local search heuristics. Some are more suited to particular types of com-
binatorial optimisation problems and instances than others. Metaheuristics try
to combine these more basic heuristics (both constructive and local search) into
higher-level frameworks in order to better search a solution space. Some examples
of metaheuristics for combinatorial optimisation [16, p. 270] are Ant Colony Opti-
misation [47], Evolutionary Computation [83], Simulated Annealing [73], Iterated
Local Search [66], and Tabu Search [56].
The general high-level framework of the metaheuristic makes it difficult to de-
fine exactly what a metaheuristic is. Many definitions have been summarised in
the literature [16, p. 270].
A metaheuristic is an iterative master process that guides and modi-
fies the operations of subordinate heuristics to efficiently produce high-
quality solutions. It may manipulate a complete (or incomplete) single
solution or a collection of solutions at each iteration. The subordinate
heuristics may be high (or low) level procedures, or simple local search,
or just a construction method. [120]
Metaheuristics are typically high-level strategies which guide an under-
lying, more problem specific heuristic, to increase their performance.
. . . . Many of the metaheuristic approaches rely on probabilistic deci-
sions made during the search. But, the main difference to pure ran-
dom search is that in metaheuristics algorithms randomness is not used
blindly but in an intelligent, biased form. [117, p. 23]
Interestingly, the first definition permits a stand-alone constructive heuristic
without local search to be considered as a metaheuristic.
It can be useful to consider the different dimensions along which metaheuristics
can be classified ([16, p. 272] and [117, p. 33-35]). The metaheuristic studied in
this research can then be classified in relation to other metaheuristics.
• Nature-inspired versus non nature-inspired. There are nature-inspired
algorithms, like Genetic Algorithms and Ant Algorithms, and non nature-
inspired ones such as Tabu Search and Iterated Local Search. This dimen-
sion is of little use as most modern metaheuristics are hybrids that fit in both
classes.
• Population-based versus trajectory methods. This describes whether an
algorithm works on a population of solutions or a single solution at any time.
1 We have used the terms local search and improvement together here. Henceforth, we only use the term local search since this is now the more fashionable term for such heuristics.
Population-based methods evolve a set of points in the search space. Trajec-
tory methods focus on the trajectory of a single solution in the search space.
• Dynamic versus static objective function. Dynamic metaheuristics modify
the fitness landscape, as defined by the objective function, during search to
escape from local minima.
• One versus various neighbourhood structures. Some metaheuristics al-
low swapping between different fitness landscapes to help diversify search.
Others operate on one neighbourhood only.
• Memory usage versus memory-less methods. Some metaheuristics use
adaptive memory. This involves keeping track of recent decisions made and
solutions found or generating synthetic parameters to describe the search.
Metaheuristics without adaptive memory base their next action solely
on the current state of their search process.
Having described the need for heuristic (approximate) methods for tackling com-
binatorial optimisation problems and the concept of a metaheuristic, we can now
describe the metaheuristic family examined in this thesis.
2.4 Ant Colony Optimisation (ACO)
Ant Colony Optimisation (ACO) [47] is a metaheuristic based on the way many
species of real ants forage for food. It is helpful to consider this process in a little
detail before describing the actual ACO heuristics.
Real ants manage to find paths between their nest and food sources over large
distances relative to a single ant’s size. They manage to coordinate this foraging for
large swarms of ants despite individual ants having only rudimentary vision and no
centralised swarm leader. It turns out that real ants communicate between them-
selves by leaving chemical markers in their environment. These markers are called
pheromones. By laying down pheromones and sensing existing pheromones, real
ants can locate and converge on trails leading to food sources. These pheromones
also evaporate over time so that as a food source is exhausted and fewer ants visit
it, the trail eventually disintegrates.
The original ant algorithm was inspired by the so-called ‘double bridge exper-
iment’, an experiment in biology that demonstrated this pheromone-laying be-
haviour for real ants. The experiment is summarised here to help in understanding
the subsequent algorithm descriptions. It is described in more detail in the ACO
literature [47, p. 1-5]. A double bridge was set up to connect a nest of ants to a food
source. One bridge was twice as long as the other (Figure 2.3 on the next page).
Ants leave their nest, encounter the fork in their path at point 1 and randomly
choose one of the two bridges. Ants choosing the shorter bridge will arrive at the
food source and start returning to the nest sooner. Because more ants can make
the journey along the shorter bridge in the same time as ants on the longer bridge,
the pheromone markers build up more quickly on the shorter bridge. Subsequent
ants, leaving the nest and encountering the fork at point 1 in Figure 2.3, sense a
higher level of pheromone on the shorter bridge and therefore favour choosing the
shorter bridge. This positive feedback of attractive pheromone trails enables the
swarm of ants to successfully find the shorter path to the food source without any
centralised leader and without any global vision of the two bridges. Two points
are particularly noteworthy about this experiment. Firstly, when the vast majority
of the ants had converged on the shorter bridge, a small proportion continued to
choose the longer bridge because of the random decision process. This can be con-
sidered as a type of continuous exploration of the environment. Secondly, when
presented with an even shorter bridge after convergence, the ants were unable to
move to the new shortest bridge because pheromone levels were so high on the
original bridge on which they had already converged. The natural evaporation of
the pheromone chemical was too slow to allow ants to ‘forget’ the first bridge.
Figure 2.3: Experiment setup for the double bridge experiment. Ants leave the nest and move towards the food source. One bridge is longer than the other (adapted from [47, p. 3]).
This ability of real ant swarms to find the shortest route along constrained paths
using pheromone markers and random decision processes was the inspiration for
the ant colony heuristic.
2.4.1 The Ant Colony Heuristic
The idea proposed in the original ant heuristic [44] and developed in many ACO
heuristics since then can be summarised in a general sense as follows. A combi-
natorial optimisation problem consists of a set of components. A solution of the
problem is an ordering of these components and a cost is associated with the or-
dering of each solution. This situation is represented by a data structure called a
graph (Figure 2.4 on the following page). Nodes in the graph (the black dots) are
solution components and a directed edge between two nodes is the cost of ordering
those components one after the other in the problem’s solution.
A number of artificial ants construct solutions by moving on the problem’s fully
connected graph representation. A movement from one node to another represents
a given ordering of those nodes in the constructed solution. Movements are gov-
erned by stochastic decisions. The constraints of the problem are built into the
ants’ decision processes. Ants can be constrained to only construct feasible solu-
tions or can also be allowed to construct infeasible solutions when this is beneficial.
The edges of the graph have an associated pheromone value and heuristic value.
Figure 2.4: An example of a graph data structure. Nodes (the black dots) represent solution components and edges (the lines joining the nodes) represent the costs of ordering the connected nodes one after the other in the solution. The graph is fully connected because every node is connected to every other node by an edge (adapted from [44]).
The pheromone value is updated by the ants while the heuristic value comes from
knowledge of the problem or the specific problem instance. Both the pheromone
value and the heuristic value of an edge are components of an ant’s stochastic
decision process when it considers the edges to move along. Stützle and Dorigo
provide a more formal discussion of how the ant heuristic works [47, p. 34-36].
Ant Colony heuristics have been applied to a wide range of problems that can
be represented by such a graph structure (Table 2.1).
Problem References
1 Travelling Salesman Problem [44], [46], [118]
2 Quadratic Assignment Problem [77], [118]
3 Scheduling [39], [51], [80], [17]
4 Vehicle Routing [25]
5 Set Packing [54]
6 Graph Colouring [32]
7 Shortest Supersequence Problem [81]
8 Sequential Ordering [53]
9 Constraint Satisfaction Problems [116]
10 Data Mining [93]
11 Edge disjoint paths problem [88]
12 Bioinformatics [113]
13 Industrial [10], [59]
14 Dynamic [62, 61]
Table 2.1: A selection of ant heuristic applications (from [15]).
Since this thesis focuses on the TSP, we now explain how ant colony
heuristics are applied specifically to the TSP.
2.4.2 Application to the Travelling Salesperson Problem
The application of the general ant colony heuristic of the previous section to the
TSP of Section 2.2 on page 30 is straightforward. A solution is an ordering of all the
graph nodes because the TSP is a tour of all cities. Ants are therefore restricted to
constructing feasible tours only. Pheromone values are associated with each edge.
Higher pheromone values reflect a greater desirability for visiting one node after
another. The heuristic associated with each edge is simply the inverse of the cost
of adding that edge to the constructed solution. This cost is typically the distance
between the two nodes where distances can be calculated in several ways. These
costs are typically stored in a cost matrix.
2.4.3 The ACO Metaheuristic
Since the introduction of the original ant colony heuristic, Ant System [44], a pat-
tern in implementation has emerged that has allowed Ant System and many of
the subsequent ant colony heuristics to be grouped within a metaheuristic frame-
work. This metaheuristic is called Ant Colony Optimisation (ACO) [43]. The ACO
metaheuristic consists of several stages, illustrated in Figure 2.5.
Initialise pheromone trails                                        (1)
While (stopping criterion is not yet met)      -- Main Loop
    For (each ant)
        Construct Solutions using a probabilistic decision rule    (2)
    End For
    For (ants)
        Apply local search
    End For
    For (graph edges)
        Update pheromones                                          (3)
    End For
    Daemon actions                                                 (4)
End While

Figure 2.5: The ACO Metaheuristic. The four main stages described in the text are numbered.
1. Initialise Pheromone trails. An initial pheromone value is applied to each
edge in the problem instance.
2. Construct Solutions. Ants visit adjacent states of the problem by moving
between nodes on the problem graph. Once solutions have been constructed,
local search (Section 2.3 on page 31) can be applied to the solutions.
3. Update Pheromones. Pheromone trails are modified by evaporating from
and by depositing pheromone onto the problem graph’s edges. Evaporation
decreases the pheromone associated with an edge and deposition increases
the pheromone.
4. Daemon Actions. Centralised actions occur that are not part of the individ-
ual ant actions. These typically involve global information such as determin-
ing the best solution found by the construct solutions phase.
All ACO stages are scheduled by a Schedule Activities construct. The meta-
heuristic does not impose any detail on how this scheduling might occur. Solution
construction, for example, might occur asynchronously [111], in sequence or in
parallel.
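As a sketch of how the four stages fit together under a simple sequential schedule, consider the following Python skeleton. This is our own illustration, not the ACOTSP implementation: each stage is supplied as a function, and all names are hypothetical.

    def aco_metaheuristic(num_ants, initialise, construct, local_search,
                          update_pheromones, daemon_actions, stop):
        # Skeleton of the ACO metaheuristic's four stages under a
        # sequential Schedule Activities construct.
        pheromones = initialise()                                 # stage 1
        best = None
        while not stop():
            solutions = [construct(pheromones)
                         for _ in range(num_ants)]                # stage 2
            solutions = [local_search(s) for s in solutions]      # optional (Section 2.3)
            update_pheromones(pheromones, solutions)              # stage 3
            best = daemon_actions(solutions, best)                # stage 4
        return best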
This thesis investigates two ant colony heuristics within the ACO metaheuris-
tic framework. These are Max-Min Ant System [118] and Ant Colony System [46].
Stützle and Dorigo [47, p. 69] state that one may distinguish between those heuris-
tics that descend directly from the original Ant System and those that propose sig-
nificant modifications to the structure of Ant System. Of the heuristics studied in
this thesis, MMAS is of the former type and Ant Colony System is of the latter. The
following sections provide a detailed description of the ACO heuristics studied. The
descriptions follow the 4 stages in the ACO metaheuristic description. Ant System
is described first for completeness.
2.4.4 Ant System (AS)
Ant System (AS) was first introduced to the peer-reviewed literature in 1996 [44]
as a heuristic for the TSP. The details of its four stages within ACO are as follows.
Stage 1: Initialise Pheromone trails
An initial pheromone value τ0 is applied to all edges in the problem. In the ACOTSP
code [47] this initial value was calculated according to the following equation:
τ0 = 1 / (ρ · NNTour)    (2.1)
where ρ is a heuristic tuning parameter related to the update pheromones stage
described in Section 2.4.4 on the next page and NNTour is the length of a single
tour generated using the nearest neighbour heuristic. If local search has been
specified for the algorithm then this local search is applied to the solution from the
nearest neighbour heuristic.
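A minimal sketch of this initialisation, reusing the nearest neighbour sketch of Section 2.3 and assuming our own hypothetical tour_length helper:

    def tour_length(dist, tour):
        # Total length of the closed tour over the distance matrix.
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
                   for i in range(len(tour)))

    def initial_pheromone(dist, rho):
        # tau_0 = 1 / (rho * NNTour), Equation (2.1).
        nn_tour = nearest_neighbour_tour(dist)
        return 1.0 / (rho * tour_length(dist, nn_tour))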
Stage 2: Construct Solutions
AS ants apply the following so-called random proportional rule when choosing the
next TSP city to visit. The probability of an ant at city i choosing next city j is
given by
pij = [τij]^α [ηij]^β / Σl∈Fi [τil]^α [ηil]^β    if j ∈ Fi    (2.2)
where Fi is the set of cities that the ant has not yet visited. τij is the pheromone
level on the graph edge connecting cities i and j and ηij is the heuristic value
for that edge. α and β are heuristic tuning parameters that adjust the relative
influence of pheromone and heuristic values respectively.
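A sketch of this rule as roulette-wheel selection follows; it is our own illustration, assuming feasible is a list of unvisited cities and tau and eta are matrices of pheromone and heuristic values.

    import random

    def choose_next_city(current, feasible, tau, eta, alpha, beta):
        # Random proportional rule, Equation (2.2): choose among the
        # feasible cities with probability proportional to the weighted
        # product of pheromone and heuristic values.
        weights = [(tau[current][j] ** alpha) * (eta[current][j] ** beta)
                   for j in feasible]
        return random.choices(feasible, weights=weights, k=1)[0]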
Stage 3: Update pheromones
Pheromones are updated with both evaporation and deposition once all ants have
constructed a solution.
Evaporation
In general, evaporation of pheromone occurs on all edges in the problem graph. In the original source code, evaporation was limited to edges in the candidate list (see Section 2.4.8 on page 44) if local search was used. For any given edge
connecting nodes i and j, the new pheromone value τij after evaporation is given
by
τij = (1− ρ)τij (2.3)
where ρ is a heuristic tuning parameter controlling the rate of pheromone evapo-
ration.
Deposition
After evaporation, all ants deposit pheromone along the problem graph edges
belonging to their constructed solution. For any given edge in an ant’s solution
connecting nodes i and j, the new pheromone value τij after deposition is given by
τij = τij + 1/C (2.4)
where C is the cost of the solution built by the ant. Since better solutions have
lower costs, Equation (2.4) means that better solutions receive a larger deposition
of pheromone.
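The full AS update can then be sketched as follows; this is our own illustration, with the mirrored update assuming a symmetric TSP.

    def update_pheromones_as(tau, tours, costs, rho):
        # Evaporate on all edges, Equation (2.3) ...
        n = len(tau)
        for i in range(n):
            for j in range(n):
                tau[i][j] *= (1.0 - rho)
        # ... then every ant deposits 1/C on its tour's edges, Equation (2.4).
        for tour, cost in zip(tours, costs):
            deposit = 1.0 / cost          # shorter tours deposit more
            for k in range(len(tour)):
                i, j = tour[k], tour[(k + 1) % len(tour)]
                tau[i][j] += deposit
                tau[j][i] += deposit      # mirror the edge for the symmetric TSP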
Stage 4: Daemon Actions.
There are no actions in this stage of AS.
2.4.5 Max-Min Ant System (MMAS)
Max-Min Ant System [118] makes several modifications within the AS structure.
These modifications involve the use of limits on pheromone values (τmax and τmin) and the occasional reinitialisation of edge pheromone levels.
Stage 1: Initialise Pheromone trails
The maximum pheromone value τmax for all edges is initialised as per Equation
(2.1 on page 38). The initial value of the pheromone minimum τmin is then set
according to

τmin = τmax / (2n)    (2.5)
where n is the TSP problem size (the number of nodes in the graph). All edges are
initialised to the maximum trail value.
Stage 2: Construct Solutions
The Construct Solutions phase is the same as for AS (Section 2.4.4 on page 38).
Stage 3: Update Pheromones
Before any evaporation occurs, the trail limits are updated according to the follow-
ing equations. The trail maximum is always calculated as follows.
τmax = 1 / (ρ · Cbest so far)    (2.6)
where Cbest so far is the tour length of the best so far ant, the ant that produced
the best solution during the course of the heuristic so far. The calculation of the
trail minimum in the original source code [47] on which the thesis experiments are
based was confounded with whether local search was used. The first method, used
when local search was in use, calculated the new trail minimum as in Equation
( 2.5). This method is described in the book [47]. However, when local search was
not in use, the source code accompanying the book used another calculation.
τmin = τmax (1 − e^(log(p)/2)) / ( ((candlist + 1)/2) · e^(log(p)/2) )    (2.7)
where p is another possible heuristic tuning parameter. This was the calculation
used in this thesis research with p fixed at 0.05, a value that was hard-coded in the
source code. Equation ( 2.7) is similar in form to a version used in the literature
[118].
Evaporation
Pheromone evaporation is the same as for AS (Section 2.4.4 on the preceding page and Equation (2.3 on the previous page)). After evaporation, the trail limits
are checked. Any pheromone value less than the trail min value is reset to be equal
to trail min. Any pheromone value greater than trail max is reset to be equal to
trail max.
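A sketch of this limit update and clamping follows, using Equation (2.6) for the maximum and the Equation (2.5) form for the minimum (the variant used with local search); all names are our own.

    def update_trail_limits(tau, rho, cost_best_so_far, n):
        # Recompute the MMAS trail limits and clamp every pheromone
        # value into the range [tau_min, tau_max].
        tau_max = 1.0 / (rho * cost_best_so_far)   # Equation (2.6)
        tau_min = tau_max / (2.0 * n)              # Equation (2.5) form
        for i in range(n):
            for j in range(n):
                tau[i][j] = min(max(tau[i][j], tau_min), tau_max)
        return tau_min, tau_max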
Deposition
Pheromone deposition is also very similar to deposition for AS (Section 2.4.4 on the preceding page and Equation (2.4 on the previous page)) except that only a single ant is allowed to deposit pheromone. The choice of whether this ant is the best
ant so far (best so far) or the best ant from the current iteration (best of iteration)
is rather complicated.
The best so far ant is used every u_gb heuristic iterations. In all other itera-
tions the best of iteration ant is used. This frequency of best so far ant use can
vary. For example, in the original source code and one piece of literature [118], the
frequency is varied according to a schedule. The schedule approach was used when
local search was also used. Alternatively, when local search was not in use, the
best so far ant was used every u_gb iterations and this value was fixed at 25 in the
original code.
Clearly there are many possible schedules that can be applied to the frequency
of pheromone deposition with best so far ant. In this research we take a simpler
approach of having a fixed frequency restart_freq with which best so far ant is
used, as in the case of no local search in the original source code. This fixed
frequency is a heuristic tuning parameter.
Stage 4: Daemon Actions.
In MMAS, the daemon actions involve occasionally reinitialising the pheromone
trail levels. Reinitialisation occurs if both of two conditions are met. In the litera-
ture [47, p. 76], determining the reinitialisation was described as one condition or the other being met. This research uses an and condition to maintain backwards
compatibility with the original source code. The first condition is whether a given
threshold number of iterations since the last solution improvement has been ex-
ceeded. The second condition (which is expensive to calculate relative to the first
condition) is whether the branching factor has dropped below a given threshold.
Branching factor is a measure of the uniformity of pheromone levels on all edges
in the problem’s graph. Its calculation and its expense are discussed in more
detail in Appendix C on page 233. The check is done after a fixed number of iter-
ations because of the expense of calculating branching factor. There are therefore
three tuning parameters controlling pheromone reinitialisation: threshold itera-
tions (reinit_iters), threshold branching factor (reinit_Branch) and check frequency
(reinit_freq). In the original source code, these were hard coded to 250, 1.0001
and 100 respectively. However, at least one case in the literature [118] uses a
reinit_iters of 50.
In the research reported in this thesis, we fix reinit_freq = 1 so that checks on
these conditions are made in every iteration. We made this decision because the
nesting of the other two parameters within the checking frequency made it im-
possible to combine properly all combinations of these tuning parameters in an
experiment design.
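The resulting reinitialisation test can be sketched as follows. This is our own illustration: branching_factor_fn is a hypothetical callable that computes the expensive branching factor on demand, and the defaults are the hard-coded values quoted above.

    def should_reinitialise(iters_since_improvement, branching_factor_fn,
                            reinit_iters=250, reinit_branch=1.0001):
        # Reinitialisation requires BOTH conditions (the 'and' used in
        # this thesis). The cheap iteration test is evaluated first so
        # the expensive branching factor is only computed when
        # stagnation is already suspected.
        if iters_since_improvement <= reinit_iters:
            return False
        return branching_factor_fn() < reinit_branch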
When a reinitialisation is due, trails are reinitialised to the trail max value that
is calculated as:
τmax = 1 / (ρ · Cbest so far)    (2.8)
2.4.6 Ant Colony System (ACS)
Ant Colony System [46] differs significantly from AS in its solution construction
and pheromone evaporation procedures.
Stage 1: Initialise Pheromone Trails
The initial pheromone value for all edges is given by
τ0 = 1 / (n · NNTour)    (2.9)
where n is the problem size and NNTour is the length of a nearest neighbour tour.
This is different from AS and MMAS (Equation ( 2.1 on page 38)) where the n term
has replaced the pheromone evaporation term ρ. As with AS and MMAS, if local
search is in use, it is applied to the solution generated by the nearest neighbour
heuristic.
Stage 2: Construct Solutions
Solution construction is notably different from the previous algorithms. An ant at a
city i chooses a next city j as follows:

j = argmax l∈Fi {[τil]^α [ηil]^β}    if q ≤ q0
j = J    otherwise    (2.10)
where q is a random variable uniformly distributed in the range [0, 1]. q0 is a tuning
parameter that determines the threshold q value below which exploitation occurs
and above which exploration occurs in Equation ( 2.10). Fi is the set of feasible
cities (cities not yet visited by the ant). J is a randomly chosen city using the same
Equation ( 2.2 on page 39) as AS and MMAS, repeated below for convenience.
pij = [τij]^α [ηij]^β / Σl∈Fi [τil]^α [ηil]^β    if j ∈ Fi    (2.11)
ACS was the first ACO algorithm to use a different decision process for explo-
ration and exploitation. The original source code facilitated applying this decision
process to solution construction in all heuristics provided with ACOTSP [47]. This
thesis takes advantage of this detail to apply the exploration/exploitation threshold
option to all heuristics studied.
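A sketch of this pseudo-random proportional decision follows, reusing the choose_next_city roulette sketch from Section 2.4.4 for the exploration branch; all names are our own.

    import random

    def choose_next_city_acs(current, feasible, tau, eta, alpha, beta, q0):
        # Pseudo-random proportional rule, Equation (2.10).
        if random.random() <= q0:
            # Exploitation: deterministically take the most attractive edge.
            return max(feasible,
                       key=lambda j: (tau[current][j] ** alpha)
                                     * (eta[current][j] ** beta))
        # Exploration: fall back to the random proportional rule (Equation 2.11).
        return choose_next_city(current, feasible, tau, eta, alpha, beta)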
When a given ant moves between two nodes, a local pheromone evaporation is
immediately applied to the edge connecting those nodes. After a movement from
node i to node j, the new pheromone level on the connecting edge τij is given by
τij = (1− ρlocal)τij + ρlocalτ0 (2.12)
where ρlocal is a heuristic tuning parameter and τ0 is the initial pheromone value of
Equation ( 2.9). Because of this local pheromone evaporation, the order in which
ants construct solutions in ACS may affect the pheromone levels presented to a
subsequent ant and ultimately the solutions produced by the swarm. There are
two distinct methods to construct solutions given a set of ants. Firstly, one can
iterate through the set of ants, allowing each ant to make one move and associated
local pheromone evaporation. This can be considered parallel solution construc-
tion and was the default implementation in the source code. Secondly, one can
move through the set of ants only once, allowing each ant to build a full tour and
apply all associated local pheromone evaporations. This can be considered as sequential solution construction. It was an open question in the literature [47, p. 78]
whether there was a difference between sequential and parallel solution construc-
tion so this research included solution construction type as a tuning parameter for
ACS. This thesis will answer that open question.
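The two construction orders can be sketched as follows; this is our own illustration, where make_one_move is a hypothetical callback that performs one ant move together with its local evaporation (Equation (2.12)).

    def local_evaporation(tau, i, j, rho_local, tau0):
        # Local pheromone update, Equation (2.12), applied as the
        # edge (i, j) is traversed.
        tau[i][j] = (1.0 - rho_local) * tau[i][j] + rho_local * tau0
        tau[j][i] = tau[i][j]   # symmetric TSP

    def construct_parallel(ants, n_cities, make_one_move):
        # Parallel construction: interleave the ants, one move (and one
        # local evaporation) each per round, until all tours are complete.
        for _ in range(n_cities - 1):
            for ant in ants:
                make_one_move(ant)

    def construct_sequential(ants, n_cities, make_one_move):
        # Sequential construction: each ant finishes its whole tour in turn.
        for ant in ants:
            for _ in range(n_cities - 1):
                make_one_move(ant)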
Stage 3: Update pheromones
Evaporation
There is no evaporation in the update pheromones phase of ACS because pheromone has already been evaporated in the construct solutions phase.
Deposition
Pheromone deposition occurs along the trail of a single ant according to the
following:

τij = (1 − ρ) τij + ρ / Cchosen ant    (2.13)

where Cchosen ant is the tour length of the chosen ant and ρ is a tuning parameter. It
is claimed in the literature that the use of the best so far ant is preferable for
instances greater than size 100 [47, p. 77]. We wished to investigate this claim
methodically and so created a tuning parameter that determines the chosen ant
used in ACS pheromone deposition. The tuning parameter determines whether the
chosen ant is the best so far ant or the best of iteration ant.
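A sketch of this global update along the chosen ant's tour follows (our own illustration, mirroring edges for the symmetric TSP).

    def acs_global_update(tau, tour, cost, rho):
        # Global deposition, Equation (2.13), applied only along the
        # chosen (best so far or best of iteration) ant's tour.
        for k in range(len(tour)):
            i, j = tour[k], tour[(k + 1) % len(tour)]
            tau[i][j] = (1.0 - rho) * tau[i][j] + rho / cost
            tau[j][i] = tau[i][j]   # symmetric TSP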
Stage 4: Daemon Actions.
There are no daemon actions for the ACS algorithm.
2.4.7 Other ACO heuristics
There are of course many other variants within the ACO metaheuristic framework.
Best-Worst Ant System (BWAS) [31], although included in the original source code,
was omitted from our studies. This was because some of the behaviours of BWAS
were triggered by a CPU time measure. This made it impossible to guarantee back-
wards compatibility of our code with the original source code of Stützle and Dorigo.
Ant System (AS), Elitist Ant System and Rank-based Ant System were omitted be-
cause they do not perform as well as MMAS and ACS and so have become less
popular. Applying the methodology, experiment designs and analyses introduced
by this thesis to AS, EAS, RAS and BWAS would be a straightforward matter to
explore as future work.
2.4.8 Additional tuning parameters
There are other possible tuning parameters that have been suggested in the liter-
ature or are implicitly used in the original source code or have been introduced in
our implementation of the original source code. We describe these parameters here
and discuss our decisions on their inclusion in the subsequent thesis research.
Exploration and exploitation
It was mentioned in the description of tour construction for ACS (Section 2.4.6
on page 42) that a random decision is made between exploration and exploita-
tion based on an exploration/exploitation threshold and that the original source
code allowed the application of this decision to all ACO algorithms. We decided
to include the use of the exploration/exploitation threshold in all ACO algorithms
investigated. If the threshold is not important, then a q0 threshold of 0 will be
recommended by the tuning methodology and tour construction will default to the
original case of only using the random proportional rule (see Equation ( 2.10 on
page 42)).
Candidate lists
A speed-up trick known as a candidate list was first used in ACS. A candidate list
restricts the number of available choices at each tour construction step to a list
of choices that are rated according to some heuristic. For the TSP, one possible
candidate list for a given city is a list of some number of neighbouring cities, sorted
into increasing distance from the current city. This number of neighbours, the
candidate list length, is a possible tuning parameter. For a static TSP problem,
candidate lists can be constructed for each TSP city at the start of the heuristic
run. This was the case in the original source code. Candidate lists simplify tour
construction as follows. When an artificial ant makes a decision on the next city to
visit, it first checks its current city’s candidate list. If all cities in the list have been
visited, the ant applies the usual tour construction rules to the remaining cities.
If, however, there are unvisited cities in the candidate list, the ant chooses from its
current city’s candidate list according to the usual rule.
For this research, candidate lists were applied to all ACO heuristics. List length
was expressed as a percentage of the problem size.
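A sketch of candidate list construction and its use in restricting the choice set follows; this is our own illustration, not the ACOTSP implementation.

    def build_candidate_lists(dist, length):
        # For each city, precompute its `length` nearest neighbours sorted
        # by increasing distance. For a static TSP this is done once at the
        # start of the heuristic run.
        n = len(dist)
        return [sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])[:length]
                for i in range(n)]

    def choices_for_next_move(current, unvisited, candidate_lists):
        # Restrict the decision to unvisited candidate-list members; if all
        # have been visited, fall back to all remaining unvisited cities.
        candidates = [j for j in candidate_lists[current] if j in unvisited]
        return candidates if candidates else list(unvisited)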
Computation limit
The previous section described candidate lists and how they are used to limit ant
decisions in the ACS heuristic tour construction. An examination of the origi-
nal source code reveals that candidate lists also influence the update pheromones
stages. Specifically, evaporation of pheromone and subsequent update of phero-
mone levels on edges are limited to the edges in each node’s candidate list. However
the influence of candidate lists was further complicated by its confounding with the
use of local search. Specifically, if local search was applied, then evaporation and
update were limited to the candidate list. If local search was not specified, evapora-
tion and update were applied to all edges. The decision was taken in this research
to introduce a new heuristic tuning parameter, called the computation limit, that
specifies whether pheromone updates should be limited to the candidate lists or
applied to all edges leading from every node. This tuning parameter can be applied
independently of whether local search was specified. Applying computation to all
edges is obviously extremely expensive and so in this research, computation limit
is always set to be limited to the node candidate lists. In Design of Experiments
(DOE) terms it is a held-constant factor (Section A.1.2 on page 211).
For MMAS, the candidate list length was also involved in the calculation of
updated trail minimum (Equation ( 2.7 on page 40) in Section 2.4.5 on page 40)
and computation of branching factor for the trail reinitialisation decision. These
calculations are not affected by the new computation limit parameter and can be
specified independently.
2.4.9 Summary of tuning parameters
Figure 2.6 summarises the various tuning parameters that are common to all the
ACO heuristics in this research and, where available, the recommended tuning
parameter settings from the literature [47, p. 71]. Some of these settings were
hard-coded from the ACOTSP author’s experience with the algorithms. In the ab-
sence of such experience it is useful to parameterise these hard-coded values and
experiment with tuning them.
#   Parameter            Description                                                        AS      MMAS    ACS
1   α                    Exponent of pheromone term in the random proportional rule.        1       1       1
2   β                    Exponent of heuristic term in the random proportional rule.        2 to 5  2 to 5  2 to 5
3   ρ                    Global pheromone evaporation term.                                 0.5     0.02    0.1
4   m                    Number of ants.                                                    n       n       10
5   q0                   Exploration/exploitation threshold.                                None    None    0.9
6   candlist             Length of candidate list.
7   Placement            Type of ant placement on the TSP graph.                            random  random  random
8   Local search type    The type of local search to use.
9   Don't look bits      A parameter related to the local search routine.
10  Neighbourhood size   The number of neighbours examined by the local search routine.
11  Computation limit    Whether certain computations are limited to the candidate list or applied to the whole problem.

Figure 2.6: Common tuning parameters and recommended settings for the ACO algorithms. These tuning parameters are common to all ACO algorithms in this research. n is the size of the problem in terms of number of nodes.
The parameters and recommended settings from the literature for MMAS and
ACS are given in Figure 2.7 on the next page and Figure 2.8 on the following page
respectively. The nested parameter column in the MMAS table is for parameters
that only make sense as part of their parent parameter.
#   Parameter                       Description                                                                                                 Recommended
12  Trail min update type           The calculation used when a new trail minimum is set.                                                       None. Varies in the literature.
13    p (nested in 12)              A term used in one particular type of trail min update calculation.                                         None. Hard-coded to 0.05 in the original source code.
14  restart_freq                    The frequency with which the best-so-far ant is used to deposit pheromone.                                  None. Varies between fixed frequency and more complicated scheduled frequencies.
15  reinit_freq                     The frequency, in terms of iterations, with which a check is done on the need for trail reinitialisation.   None. Hard-coded in original source to 100.
16    reinit_iters (nested in 15)   The threshold iterations without solution improvement after which a trail reinitialisation is considered.   None. Hard-coded in original source to 250.
17    reinit_Branch (nested in 15)  The threshold branching factor after which a trail reinitialisation is considered.                          None. Hard-coded to 1.00001 in the original source code.
18  lambda                          Used to determine the cut-off point for inclusion of an edge in the branching factor calculation.           None. Hard-coded to 0.05 in the original source code.

Figure 2.7: Tuning parameters and recommended settings for the MMAS algorithm. Indented entries are nested parameters.
#   Parameter                 Description                                                                                              Recommended
12  ρlocal                    A term in the local pheromone evaporation equation.                                                      0.1
13  Const                     The solution construction method.                                                                        parallel
14  Pheromone deposition ant  The choice of whether to use the best_so_far ant or the best_of_iteration ant in pheromone deposition.   None

Figure 2.8: Tuning parameters and recommended settings for the ACS algorithm.
It is clear from these summaries that the ACO algorithms have many tuning
parameters2. Looking at the common parameters alone, there are eleven tuning
parameters. When specific heuristics are considered, the number of tuning pa-
rameters increases to potentially 18 for MMAS and 14 for ACS. We say that there
is a large parameter space. Moreover, there are no recommendations for many of
the parameter settings. Sophisticated techniques are required so that it is feasible
to experiment with such large numbers of tuning parameters. Such techniques
exist in a field called Design of Experiments.
2.5 Design Of Experiments (DOE)
This section may contain some terminology unfamiliar to the reader. Further de-
tails on the specific DOE techniques and issues encountered in this thesis are
summarised in Appendix A and are detailed in the literature [1, 89, 84, 85].
The National Institute of Standards and Technology defines Design Of Experi-
ments (DOE or sometimes DEX) as:
. . . a systematic, rigorous approach to engineering problem-solving that
applies principles and techniques at the data collection stage so as to
ensure the generation of valid, defensible, and supportable engineering
conclusions. In addition, all of this is carried out under the constraint
of a minimal expenditure of engineering runs, time, and money. [1]
The systematic approach comes from the clear methodologies and experiment
designs used by DOE. The analysis of the designs is supported with statistical
methods, providing the user with defensible conclusions and mathematically pre-
cise statements about confidence in those conclusions. The DOE principles of data
collection ensure that only sufficient data of a high quality is collected, improving
the efficiency and reducing the cost of the experiment. The main capabilities of DOE are as
follows [112]:
1. Quantify multiple variables simultaneously: Many factors and many re-
sponses can be investigated in a single experiment.
2. Identify variable interactions: the joint effect of factors on a response can
be identified and quantified.
3. Identify high impact variables: the relative importance of all factors on the
responses can be ranked.
2 It is possible to draw a distinction between tuning parameters and what we shall term design parameters. Tuning parameters are known to affect heuristic performance and so must be specified for every deployment of the heuristic. Design parameters, by contrast, are alternative heuristic components that have been parameterised so that they can be plugged into the heuristic. The aim is to determine whether any of the alternative designs have a favourable effect on performance. An example in the current research is the use of sequential or parallel solution construction in ACS (Section 2.4.6 on page 42). If an alternative value of the design parameter is shown not to affect performance then that alternative is removed as a parameter and the improved design is thereby fixed.
4. Predictive capability within design space: performance at new points in
the design space can be predicted.
5. Extrapolation capability outside design space: occasionally and with some
caution, performance outside the design space can be extrapolated.
These capabilities make DOE an essential approach for any research dealing
with large and expensive experiments. In industry, users of DOE include NASA3
and Google4.
The Operations Research community has been aware of these advantages for
some time, acknowledging that the risk of not adopting DOE is that the absence of
a statistically valid, systematic approach can result in the drawing of insupportable
conclusions [3]. Adenso-Díaz and Laguna [2] give a brief list of OR papers that have
used statistical experiment design over the past 30 years. Some discussions are
quite general and offer guidelines on the subject [7, 35, 60]. Experimental design
techniques have been used to compare solution methods [3] and to find effective
parameter values [33, 123]. None of these papers’ techniques or methodologies,
however, have become so widespread that they approach being the standard for
experimental work in OR. None have been applied to ACO heuristics.
More often than not, ACO research uses a trial-and-error approach to answer-
ing its research questions. Birattari [12, p. 34-35] identifies two disadvantages
of the trial-and-error approach to parameter tuning and relates these to an in-
dustrial and academic context. From an industrial perspective, the trial-and-error
approach is time-consuming and requires a very specialised practitioner. From
an academic perspective, the approach does not facilitate a methodical scientific
analysis.
When the need for a more methodical approach to parameter tuning is acknowl-
edged, researchers may attempt a One-Factor-At-A-Time (OFAT) analysis. OFAT
involves tuning a single parameter when all others are held fixed, repeating this
process with each parameter one at a time. However it is well recognised outside
the heuristics field that DOE has many advantages over OFAT. Czitrom [37] illus-
trates these advantages by taking three real-world engineering problems that were
tackled with an OFAT analysis and re-analysing them with designed experiments.
In summary, the following advantages are clearly illustrated:
• Efficiency. Designed experiments require fewer resources, in terms of exper-
iments, time and material, for the amount of information obtained.
• Precision. The estimates of the effects of each factor are more precise. Full
factorial and fractional factorial designs use all observations to estimate the
effects of each factor and each interaction. OFAT experiments typically use
only two treatments at a time to estimate factor effects.
3 Obtained by searching the NASA Technical Reports server (http://ntrs.nasa.gov/search.jsp) with the phrase “Design Of Experiments”.
4 Web page of Peter Norvig, current Director of Research at Google (http://norvig.com/experiment-design.html).
• Interactions. Designed experiments can estimate interactions between fac-
tors but this is not the case with OFAT experiments.
• More information. There is experimental information in a larger region of
the design space. This makes process optimisation more efficient because
the whole factor space can be studied and searched.
Despite all the advantages and capabilities of DOE presented above, we still
encounter several common excuses for not using DOE [112]. We list these here
with our own refutation of those excuses.
• Claim of no interactions: it may indeed be the case that there are no in-
teractions between factors. This claim can only be defended after a rigorous
DOE analysis has shown it to be true.
• OFAT is the standard: we have seen that trial-and-error approaches and
OFAT approaches are the norm. However, the comparison of DOE with OFAT
that we presented from another field [37] shows that if OFAT is the standard
then it is a seriously deficient standard that must be improved.
• Statistics are confusing: it is true that we cannot expect heuristics re-
searchers to become experts in statistics and Design Of Experiments. That
is the job of statisticians. It is also true that becoming an expert is not nec-
essary for leveraging the power and capabilities of DOE. In other fields such
as medicine and engineering, the research questions are often repetitive. Is
this drug effective? Can this manufacturing process be improved? This per-
mits identifying a small set of experiment designs and analyses that serve
to answer those common research questions. We will see in Chapter 3 that
heuristic tuning involves a similarly small set of research questions. This the-
sis will demonstrate the use of the designs and analyses to answer the most
important of those questions. Even the statistical analyses themselves can
be performed in software that shields the user from unnecessary statistical
details and guides the user in interpreting statistical analyses.
• Experiments are too large: it is true that experiments with tuning heuristics
are large. We have already mentioned the prohibitive size of the design space
in Chapter 1’s list of obstacles to the parameter tuning problem. This thesis
will introduce new experiment designs that permit answering the common
research questions with an order of magnitude fewer experiments.
2.6 Chapter summary
This chapter covered the following topics.
• Combinatorial optimisation. Combinatorial optimisation (CO) was intro-
duced and described. The Travelling Salesperson Problem was highlighted as
a particular type of CO problem.
• The Travelling Salesperson Problem. The reasons for the popularity of the
TSP were given along with a summary of the various types of TSP. The Sym-
metric TSP is the focus of this thesis.
• Heuristics. The difficulty of finding exact solutions to CO problems necessi-
tates the use of approximate methods or heuristics.
• Metaheuristics. Metaheuristics were introduced as an attempt to gather
various heuristics into common frameworks.
• Ant Colony Optimisation. Ant Colony Optimisation is a particular meta-
heuristic based on the foraging behaviour of real ants. The ACO heuristics
have been applied to many CO problems that can be represented by a graph
data structure. Several types of ACO heuristic were described in detail.
• Design of Experiments. The field of Design Of Experiments was introduced
and its capabilities highlighted. The advantages of DOE over its alternatives,
trial-and-error and One-Factor-At-A-Time, were described. Some common ex-
cuses for not adopting DOE were refuted.
The next Chapter will review the issues that arise when using DOE and describe
how DOE should be adapted for experiments with tuning metaheuristics.
3 Empirical methods concerns
Thus far, this thesis has motivated rigorous empirical research on the parameter
tuning problem for metaheuristics. It was hypothesised that the parameter tuning
problem could be successfully addressed by adapting techniques from the field
of Design Of Experiments. A background on combinatorial optimisation and the
Travelling Salesperson Problem was detailed. The Ant Colony Optimisation (ACO)
family of metaheuristics was introduced and described.
Criticisms of the lack of experimental rigour in the experimental analysis of
heuristics have been made on several occasions in the operations research field
[64, 65]. Such criticisms and calls for increased rigour have also appeared in
the evolutionary computation field [48, 122, 103]. While there has been much
useful and creative research in the ACO field, the issue of experimental rigour
has never been to the fore. In the following, we bring together the most relevant
criticisms, suggestions and general issues relating to the design and analysis of
experiments with heuristics that have appeared in the heuristics and operations
research literature over the previous three decades. We relate these to the relatively
new ACO field. This will facilitate a critical review of the literature on ACO and the
parameter tuning of ACO in the next chapter. It will also strongly influence the
development of the thesis methodology in subsequent chapters.
The material in this chapter is presented approximately in the order an ex-
perimenter would encounter the issues when working in the field. Some issues,
such as reproducibility and responses for example, have an unavoidable overlap—
a poor choice of response or poor reporting will reduce reproducibility for example.
A familiarity with statistics and Design Of Experiments is assumed. A necessary
background on these topics is given in Appendix A and in the literature [89, 84].
3.1 Is the heuristic even worth researching?
A question that heuristics research often neglects is whether the heuristic is even
worth researching. It is tempting to expend effort on extensions of nature-inspired
metaphors and refinements of algorithm details. In fact, these endeavours were
identified as important goals in the early stages of the ACO field [30]. While much
useful work is done in this direction, it is important not to lose sight of the purpose
of optimisation heuristics which is to address the heuristic compromise and solve
difficult optimisation problems to a satisfactory quality in reasonable time. John-
son [69] lists some questions that should be asked before beginning a heuristics
research project:
• What are the questions you want your experiments to address?
• Is the algorithm implemented correctly and does it generate all the data you
will need? We add that, all else being equal, the analytical tractability of the
algorithm changes the conclusions that can be made from our experiments.
• What is an adequate set of test instances and runs?
• Given current computer specifications, which problem instances are too small
to yield meaningful distinctions and which are too large for feasible running
times?
• Who will care about the answers given the current state of the literature?
This final question of ‘care’ ties in with the analysis of Barr et al [7, p. 12] who
state that a heuristic method makes a contribution if it is:
• Fast: produces higher quality solutions quicker than other approaches.
• Accurate: identifies higher quality solutions than other approaches.
• Robust: less sensitive to differences in problem characteristics, data quality
and tuning parameters than other approaches.
• Simple: easy to implement.
• Generalisable: can be applied to a broad range of problems.
• Innovative: new and creative in its own right.
An examination of the literature reveals that not only do researchers often fail
to ask questions regarding speed, accuracy and robustness, they often fail even to
collect the necessary data that would permit answering these questions.
We can speak of one heuristic dominating another heuristic in terms of one or
more of these qualities when the dominating heuristic scores better on these quali-
ties than the dominated heuristic. For example, we often find that a given heuristic
may dominate another in terms of speed and accuracy but is in turn dominated in
terms of its generalisability. In general, a highly dominated heuristic is not worth
studying given the aforementioned overarching aim of heuristics research. It is
nonetheless worthwhile to study a dominated algorithm in some circumstances
[69]. Firstly, the algorithm may be in widespread use or its dominating rival may
be so complicated that it is unlikely to enter into widespread use. Secondly, the
algorithm may embody a general approach applicable to many problem domains
and studying how best to adapt it to a given domain may be of interest.
ACO certainly does embody a general approach for combinatorial optimisation
problems that can be represented by graphs (Section 2.4 on page 34). The version
of ACO studied in this thesis does not incorporate local search and therefore it
could be argued that the thesis experiments with a dominated algorithm. This ar-
gument is easily countered in several ways. Firstly, much research in ACO is still
conducted without local search. Secondly, and more importantly, the emphasis
of this thesis is on parameter tuning rather than on the design of new ACO ver-
sions that improve the state-of-the-art in TSP solving. ACO is a useful subject of
study because of its large number of tuning parameters. Although the thesis’ DOE
approach to tuning will later be shown to improve ACO performance, no claims
are made about the competitiveness of this performance in relation to state-of-
the-art TSP solution methods. This does not preclude applying the thesis’ DOE
methodologies to such state-of-the-art methods.
Assuming that it is worthwhile to study the heuristic in question, the experi-
menter must then determine what type of study will be conducted.
3.2 Types of experiment
Barr et al [7] distinguish between just two types of computational experiments with
algorithms: (1) comparing the performance of different algorithms for the same
class of problems or (2) characterising an algorithm’s performance in isolation. In
fact, there are several types of experiment identified in the literature.
• Dependency study [79] (or Experimental Average-case study [69]). This
aims to discover a functional relationship between factors and algorithm per-
formance measures. It focuses on average behaviour, generating evidence
about the behaviour of an algorithm for which direct probabilistic analysis is
too difficult. For example, one may investigate whether and how the tuning
parameters α and β (Section 2.4) increase the convergence rate of ACO.
• Robustness study [79]. A robustness study looks at the distributional prop-
erties observed over several random trials. Typical questions that a robust-
ness study addresses are: how much deviation from average is there? What
is the range in performance at a given design point? Are there unusual values
in the measurements?
• Probing study [79] (or Experimental Analysis paper [69]). These studies
‘open up’ an algorithm and measure particular internal features of its oper-
ation, attempting to explain and understand the strengths, weaknesses and
workings of an algorithm. For example, an ACO probing study might in-
vestigate whether different types of trail reinitialisation schedule improve the
performance of MMAS (Section 2.4.5 on page 39).
• Horse race study [69] (or Competitive Testing [65]). A horse race study
attempts to demonstrate the superiority of one algorithm over another by
running the algorithm on benchmark problem instances. This is typical of
the majority of research in the ACO field. The horse race study has its place
towards the latter stages of a heuristic’s life cycle (Section 3.3). However, its
scientific merits have been strongly criticised [69, 65].
• Application study [69]. An application study uses a particular code in a
particular application and describes the impact of the code in that context.
For example, one might report the application of ACS to a scheduling problem
in a manufacturing plant.
We consider the application study as a specific context for the other types of
study. Dependency studies, robustness studies, probing studies and horse race
studies could all conceivably be conducted with a particular code in a particular
application. The choice of experiment type will depend very much on the life cycles
of both the heuristic and the problem domain in question. This thesis is primarily
a dependency study as it studies the relationship between tuning parameters and
performance. It also has some characteristics of a probing study in that design factors, factors that represent parameterised design decisions, are also experimented
with.
3.3 Life cycle of a heuristic and its problem domain
The heuristic life cycle consists of two main phases, (1) research and (2) develop-
ment [101]. Research aims to produce new heuristics for existing problems or to
apply existing heuristics in creative ways to new problems. Development aims to
refine the most efficient heuristic for a specific problem. Software implementation
details and the application domain become more important in this situation.
Birattari [12] breaks development into two phases. There is what he also terms
a development phase in which the algorithm is coded and tuned. This phase relies
on past problem instances. The second phase is the production phase in which
the algorithm is no longer developed and is deployed to the user. This phase is
characterised by the need to cope with new problem instances.
The research phase of the heuristic lifecycle requires dependency, robustness
and probing studies. The development phase requires horse race and application
studies. Although ACO is still in the research phase of its life cycle, the majority
of work reported on it is more appropriate for the development phase as it focuses
on the typical horse race and application study issues.
The problem domain also has a life cycle [101, p. 264] and this impacts heuris-
tic research. For some problems in the early stages of their life cycle, there exist
few if any solution algorithms. In these cases, being able to consistently construct
a feasible solution is a significant achievement. Later in the problem’s life cycle, a
body of consistent algorithms that produces feasible solutions already exists. At
this stage, research must demonstrate either an insight into the algorithm’s be-
haviour (probing study) or must demonstrate that the algorithm performs better
than other existing methods (horse race and application studies). The TSP as used
in this thesis is undoubtedly in the later stages of its life cycle.
Over ten years ago, in 1996, Colorni et al [30] identified four stages of progression in nature-inspired heuristics and used these to compare the state of the art of six types of heuristic. Their stages, in order of progress, are:
1. the presence of practical results,
2. the definition of a theoretical framework,
3. the availability of commercial packages, and
4. the study of computational complexity and related principles.
We disagree with this ordering, although it is often encountered in computer
science research. The meaning of ‘practical results’ is vague. Assuming the au-
thors mean results on real problems rather than small scale abstractions, then
complexity studies and theoretical frameworks can certainly precede such ‘practi-
cal results’. Their stages make no distinction between problem and heuristic life
cycle. Their view on the state-of-the-art with respect to these stages is summarised
in Table 3.1.
                            Results          Theory           Packages     Complexity
  Simulated Annealing       Well developed   Well developed   Developing   Developing
  Tabu Search               Developing       Developing       Developing   Developing
  Neural Nets               Developing       Developing       Developing   Emerging
  Genetic Algorithms        Developing       Developing       Emerging     Emerging
  Sampling and Clustering   Developing       Emerging         Emerging
  Ant Systems               Emerging         Emerging

Table 3.1: The state of the art in nature-inspired heuristics from 10 years ago. Adapted from [30].
With hindsight, we see that this assessment was overly optimistic. A robust
theory and complexity analysis for ant colony algorithms has yet to be established
and research has only recently moved in this direction [42]. There are still no
commercial packages in widespread use, although we have mentioned anecdotal
evidence for the use of ant colony approaches within several companies (Chapter
1). Even some of the supposedly established results that we will review in the next
chapter may have to be revised in light of the experiment design issues we review
in this section and the results this thesis reports.
3.4 Research questions
Having decided on the type of experimental study that is required, based on the
heuristic and problem life cycles, the experimenter can then proceed to refine the
study into one or more specific research questions. There are two main issues in
research with heuristics: how fast can solutions be obtained and how close do the
solutions come to being optimal [101]? These questions cannot be answered in
isolation. Rather, we must consider the trade off between feasibility and solution
quality [7, p. 14], a trade off that this thesis terms the heuristic compromise.
This is of course a simplification and other authors have tried to enumerate the
various research questions that one can investigate within this heuristic compro-
mise of quality and speed. In the following, we have categorised these questions
within the types of experimental study identified previously.
3.4.1 Dependency Study
• What are the effects of type and degree of parametric change on the perfor-
mance of each solution methodology [3, p. 880]?
• What are the effects of problem set and size on the performance of each
method [3, p. 880]?
• What are the interaction effects on the solution techniques when the above
factors are changed singly or in combination [3, p. 880], [30]?
• How does running time scale with instance size and is there any dependence
on instance structure [69, 30]?
3.4.2 Robustness
• How robust is the algorithm [7, p. 14]?
• Does a new class of instances cause significant changes in the behaviour of a
previously studied algorithm [69]?
• For a given machine, how predictable are running times/operation counts for
similar problem instances [69]?
• How is running time affected by machine architecture [69]?
• How far is the best solution from those more easily found [7, p. 14]?
• What are the answers to these questions for other performance measures
[69]?
3.4.3 Probing study
• How do implementation details, heuristics and data structure choices affect
running time [69]?
• What are the computational bottlenecks of the algorithm and how do they
depend on instance size and instance structure [69]?
• What algorithm operation counts best explain running time [69, 30]?
3.4.4 Horse race
• Is there a best overall method for solving the problem? [3, p. 880]
• What is the quality of the best solution found? [7, p. 14]
• How long does it take to determine the best solution? [7, p. 14]
• How quickly does the algorithm find good solutions? [7, p. 14]
• How does an algorithm’s running time compare to those of its top competitors
and are those comparisons affected by instance size and structure? [69]
3.5 Sound experimental design
Once one or more research questions have been identified, an experiment can be
planned and executed. A general procedure for experimentation has three steps
[28].
1. Design. An experimental design is conceived. This is the general plan the
experimenter uses to gather data. Crowder et al [34] quote a definition of
good experimental design.
The requirements for a good experiment are that the treatment com-
parisons should as far as possible be free from systematic error,
that they should be made sufficiently precise, that the conclusions
should have a wide range of validity, that the experimental arrange-
ment should be as simple as possible, and finally that the uncer-
tainty in the conclusions should be assessable. [4]
2. Data gathering and exploration. When all data have been gathered, an exploratory analysis is conducted. This involves looking for patterns and trends
in the data using plots and descriptive statistics. Appropriate transformations
of the data may have to be done.
3. Analysis. Formal statistical analyses are performed.
There is some recognition in the literature that formal statistical analyses are
an integral part of an experimental procedure. Attempts have been made to detail
the specifics of the experimental procedure for heuristics.
Developing a sound experimental design involves identifying the vari-
ables expected to be influential in determining code performance (both
those which are controllable and those which are not), deciding the
appropriate measures of performance and evaluating the variability of
these measures, collecting an appropriate set of test problems, and fi-
nally, deciding exactly what questions are to be answered by the experi-
ment [35].
The identification of a methodical and ordered experimental design procedure
is to be welcomed. However, there is a problem with the ordering presented in that
a decision on the research question is left to the very end of the design process. We
posit that this should be the very first step in any procedure because the nature
of the questions the experimenter wants to ask will determine all subsequent de-
cisions in the design. A research question involving comparisons (Section 3.4.4 on
the preceding page) requires a different design to a research question concerning
relationships (Section 3.4.1 on page 58).
A more comprehensive seven step outline of the design and analysis of an ex-
periment comes from outside the heuristics field [84, p. 14].
1. Recognition of and statement of the problem. Although it seems obvious,
it is often difficult to develop a statement of the problem. If the process is new
then a common initial objective is factor screening, determining which factors
are unimportant and need not be investigated. A better understood process
may require optimisation. A system that has been modified may require confirmation to determine whether it performs the same way as it did in the past.
A discovery objective occurs when we wish to explore new variables such as
an improved local search component. Robustness studies are needed when
there are circumstances in which the responses may seriously degrade.
Clearly, these objectives are reflected in the types of study categorised in
Section 3.4 on page 58.
2. Selection of the response variable and the need for replicates. The re-
sponse variable(s) chosen must provide useful information about the process
under study. Measurement error, or errors in the measuring equipment, must
be considered and may require the use of repeated measurements of the re-
sponse. This is typically the case with measurements of CPU time.
3. Choice of factors, levels, and range. Factors are either potential factors or nuisance factors [84, p. 15]. Because there is often a large number of
potential factors, they are classified as either:
• Design factors: these are the factors that are actually selected for study.
• Held-constant factors: these factors may have an effect on the re-
sponse(s) but because they are not of interest, they are held constant
at a specific level during the entire experiment.
Nuisance factors are not of interest in the study but may have large effects on
the response. They therefore must be accounted for. Nuisance factors can be
classified [84, p. 16] as:
• Controllable: A controllable nuisance factor is one whose levels can be
set by the experimenter. In traditional design of experiments, a batch
of raw material is a common controllable nuisance factor. In heuristics
DOE, the random seed for the heuristic’s random number generator is a
very common one.
• Uncontrollable: These factors are uncontrollable but can be measured.
Techniques such as analysis of covariance can then be used to com-
pensate for the nuisance factor’s effect. In traditional DOE, operating
conditions such as ambient temperature may be an uncontrollable but
measurable nuisance factor. In heuristics DOE, CPU usage by back-
ground processes is a good example.
Once the design factors have been selected, the experimenter chooses both
the ranges over which the factors are varied and the specific factor levels at
which experiments will be conducted. The region of interest (Section A.2 on
page 213) is usually determined using practical experience and theoretical
understanding, when available. When there is no knowledge of the heuris-
tic, a pilot study can give quick and useful guidelines on appropriate factor
ranges.
The specific factor levels are often a function of the experimental design.
4. Choice of experimental design: choice of design depends on the experimen-
tal objectives. Some designs are more appropriate for modelling, some for
optimisation, some for screening. Some designs can better fit into a sequen-
tial experimental procedure and so are more efficient in terms of experimental
resources. Decisions are also made on the number of replicates and the use
of blocking.
An emphasis is clearly being placed on the importance of deciding on the
research question(s) early on in the experiment procedure and not at the end.
5. Performing the experiment: In the traditional DOE environment of manu-
facturing it is often difficult to plan and organise an experiment. The process
must be carefully monitored to ensure that everything is done according to
plan.
This is less of an issue in the majority of experiments with heuristics for
combinatorial optimisation. However, experiments involving both heuristics and humans, say in some visual recognition task, would have to pay very careful attention to this step. It goes without saying that all code should be checked
for correctness and bugs.
6. Analysis of the data: Statistical methods are required so that results and
conclusions are objective rather than judgmental. Statistical methods do not
prove cause but rather provide guidelines to the reliability of a result [84, p.
19]. They should be used in combination with engineering knowledge.
Of particular importance here is the danger of misinterpretation of hypothesis
tests and p values. This is addressed in Appendix V on page 211.
7. Conclusions and recommendations: Graphical methods are most useful to
ensure that results are practically significant as well as statistically signifi-
cant. Conclusions should not be drawn without confirmation testing. That is,
new independent experiment runs must be conducted to confirm the conclu-
sions of the main experiment.
Confirmation testing is rare in the ACO literature. Birattari [12] draws at-
tention to the need for independent confirmation as is typical in machine
learning. This thesis places a strong emphasis on independent confirmation
of all its statistical analyses.
Cohen [29] identifies several tips for performance analysis, of which the most relevant to heuristics are reproduced here.
1. Use bracketing standards. The tested program’s anticipated performance
should exceed at least one standard and fall short of another.
A typical upper standard in heuristics is an optimal solution. Often, however, the optimal solution is not known. The use of optimal solutions is
discussed in Section 3.10 on page 68. A typical lower standard is a randomly
generated solution. Alternative lower standards for heuristics are simple re-
producible heuristics such as a greedy search.
Adjusted Differential Approximation (Section 3.10.2 on page 69), a quality re-
sponse used in this thesis, incorporates a comparison to an expected random
solution to a problem.
2. Many measures. It is not expensive to collect many performance measures.
If one must restrict collection to relatively few measures, a pilot study can once again help by determining which measures are highly correlated. Highly correlated measures are redundant.
3. Conflicting measures. Collect opposing performance measures. Conflicting
measures are unavoidable with heuristics due to the heuristic compromise of
lower solution time and higher solution quality.
These design of experiment steps and tips provide metaheuristics research with
some much-needed procedural rigour. However, there remain many pitfalls for the
experimenter.
3.5.1 Common mistakes
Many of the issues that arise in designed experiments for other fields such as
manufacture are thankfully not an issue for the heuristic engineer. Consequently,
metaheuristics researchers have few excuses for poor methodology. Measurement
errors due to gauge calibration, for example, do not arise. No human data entry
with the possibility of mistakes is generally required. Nonetheless, some of the
lessons from traditional DOE [67] do translate to designed experiments for heuris-
tics.
• Too narrow factor ranges. Running too narrow a range from high to low for
the factors can make it seem that key factors do not affect the process. The
reality is that they do not affect the process in the narrow range examined.
• Too wide factor ranges. Running too wide a range of factors may produce recommendations that are not usable in the real process.
• Sample size and effect size. The sample size must be large enough to detect
the effect size that the experimenter has deemed to be significant and yet not
so large as to detect the tiniest of effects of no practical significance.
The risk of all of these mistakes being made can be greatly mitigated by invest-
ing a small amount of resources in a pilot study. Jain [68, p. 14-25] lists further
common mistakes made in performance evaluation. The most relevant of these are
as follows.
1. Biased Goals. The definition of a performance evaluation project’s goals can
implicitly bias its methodology and conclusions. A goal such as showing that
“OUR system is better than THEIRS” can turn a problem into one of finding
metrics such that OUR system turns out better rather than finding the fair
metrics for comparison [68, p. 15].
This is the danger that others also highlight [65]. The problem of bias is
discussed in further detail in Section 3.14 on page 74.
2. Unsystematic approach [68]. Analysts sometimes select parameter values,
performance metrics and problem instances arbitrarily, making it difficult
to draw any conclusions. This is, unfortunately, very common in the meta-
heuristics field. This thesis provides methodical guidelines and steps from
the Design Of Experiments field that replace this unsystematic approach.
3. Incorrect Performance Metrics. Changing the choice of metrics can change
the conclusions of a study. It is important to conduct a study with several
competing metrics so that any effect of choice of metric can be understood
and accounted for. This is not an issue if the experimenter has followed the
recommendations of recording many performance measures (Section 3.5 on
page 59).
4. Ignoring Significant Factors. Not all parameters have the same effect on
performance and so it is important to identify the most important parameters.
This is dealt with in the screening step mentioned in Section 3.5 on page 59.
5. Inappropriate Experimental Design. Proper selection of parameters and
number of measurements can lead to more information from the same num-
ber of experiments. Jain [68] also highlights the problem with the OFAT (one-factor-at-a-time) approach and his preference for factorial and fractional factorial designs as introduced in this thesis; a sketch of a full factorial design matrix is given at the end of this section.
6. No Sensitivity Analysis. Without a sensitivity analysis, one cannot be sure
whether the conclusions would change if the analysis were done in a slightly
different setting. Furthermore, a sensitivity analysis can help confirm the
relative importance of factors.
7. Omitting Assumptions and Limitations. This can lead a reader of the re-
search to apply an analysis to another context where the assumptions are no
longer valid.
Even within a well-defined experimental framework, the experimenter must be-
ware of these many common pitfalls.
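To make the factorial alternative to OFAT concrete, the following sketch enumerates the design matrix of a two-level full factorial experiment. The code is our own illustration rather than part of the thesis implementation; it uses the standard DOE convention of coding each factor's low and high levels as -1 and +1.

    import java.util.Arrays;

    /**
     * Enumerates the design matrix of a two-level full factorial experiment.
     * Each of the 2^k runs sets every factor to its low (-1) or high (+1)
     * level, so all factor-level combinations are covered.
     */
    public final class FullFactorialDesign {

        /** Returns the 2^k x k design matrix in standard (Yates) order. */
        public static int[][] designMatrix(int k) {
            int runs = 1 << k; // 2^k design points
            int[][] matrix = new int[runs][k];
            for (int run = 0; run < runs; run++) {
                for (int factor = 0; factor < k; factor++) {
                    // Factor j alternates between its levels in blocks of 2^j runs.
                    matrix[run][factor] = ((run >> factor) & 1) == 0 ? -1 : 1;
                }
            }
            return matrix;
        }

        public static void main(String[] args) {
            // Three factors (e.g. alpha, beta and evaporation rate) give 8 runs.
            for (int[] row : designMatrix(3)) {
                System.out.println(Arrays.toString(row));
            }
        }
    }

A 2^(k-p) fractional factorial design keeps only a carefully chosen subset of these rows, trading the ability to estimate some interactions for a large saving in experiment runs.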
3.6 Heuristic instantiation and problem abstraction
A research question has been identified and an experiment design has been se-
lected to answer this question. The experimenter must now think about the imple-
mentation of the algorithm that is the subject of the experiment and the problem
domain to which the algorithm will be applied. We can consider both algorithms
and problems at different levels of instantiation. Several authors [28, 65, 79] dis-
cuss how different levels of algorithm instantiation are appropriate for different
types of analyses. A general description may be enough to determine whether an
algorithm has a running time that is exponential in the length of its input. Hooker
[65] likens this to an astronomer who tests a hypothesis about the behaviour of
galaxies by creating a simulation. This simulation can improve our understanding
even though the running time is much faster than the real phenomenon. Further
algorithm instantiation, such as details of data structures, is needed to count crit-
ical operations as a function of the input. A complete instantiation in a particular
language with a particular compiler and running on a particular machine is needed
to generate CPU times for particular inputs. This thesis uses fully instantiated al-
gorithms.
As instantiation increases, so too does the importance of implementation is-
sues. There are three main advantages to using efficient algorithm implementa-
tions [69]. Such implementations better support claims of practicality and com-
petitiveness. There is less possibility for the distortion of results achieved by al-
gorithms that are significantly slower than those used in practice. Finally, faster
implementations allow one to experiment with more and larger problem instances.
Clearly there is a balance between code that has been implemented efficiently for
research purposes and code that has been fine-tuned for a competitive industrial product.
The problem domain can also be treated at several levels of abstraction [12].
• Lowest level. This is a mathematical model of a well defined practical prob-
lem. This level is most often used in industry and application studies where
it is desired to solve a particular instance rather than make generalisations
across a class of instances.
• Intermediate level. At this level, abstractions such as the Travelling Sales-
person Problem and Quadratic Assignment Problem capture the features and
constraints of a class of problems. This thesis is focussed at this level of
problem instantiation.
• Highest level. The highest level of abstraction includes high level ideas such
as deceptive problems [57] but does not represent a specific real world prob-
lem.
Once the appropriate levels of algorithm instantiation and problem abstraction
have been agreed, the experimenter can begin pilot studies.
3.7 Pilot Studies
The discussions of common mistakes in experiment design already mentioned the
usefulness of pilot studies (Section 3.5.1 on page 62). A pilot study is simply a
small scale set of experiment runs that are used for the exploratory analysis of a
process. Pilot studies help refine a full blown experiment design in several ways
[101].
1. Pilot studies can indicate that some factors initially thought important actu-
ally have little effect or have a single best level that can be fixed in all runs.
They help identify design factors (Section 3.5 on page 59). They also can
indicate where two or more factors can be collapsed into a single one.
2. Pilot studies help determine the number and values of levels to use in deter-
mining whether the factor has practical significance.
3. Pilot studies reveal how much variability we can expect in outcomes. This
influences the number of replicates that will be necessary in a sample in
order to obtain reliable results.
4. Pilot studies can help design the algorithm itself, by highlighting appropriate
output data and stopping criteria.
Pilot studies are therefore an important part of the early stages of an exper-
iment design, reducing the risk of some common design mistakes (Section 3.5.1
on page 62). They can never be a replacement for designed experiments with suf-
ficient sample sizes and correct statistical analyses. Conclusions should not be
drawn from pilot studies.
3.8 Reproducibility
The reproducibility of research results is of course fundamentally important to all
sciences. Computer science and research in metaheuristics should be no different.
Reproducing research with computers in general and metaheuristics in particular
presents some unique challenges.
1. Differences between machines [65, 35]: It is difficult to guarantee that al-
gorithms being tested by different researchers are run on machines with the
exact same specifications. Specifying the processor speed, memory etc. is not
enough. What other CPU processes may have run throughout the experiment
or for periods during the experiment? Even if a researcher goes to all the
trouble of setting up a clean environment, how reproducible is that environ-
ment going to be for other researchers who do not have access to the same
machines? How reproducible will that environment remain as technology ad-
vances with new hardware and operating system versions for example? We
will see in Section 3.9 on the next page that many of these concerns can be
overcome with benchmarking but there is as yet no discussion of appropri-
ate benchmarks for ACO heuristics and their problem domains. This thesis
devotes a whole chapter (Chapter 5) to benchmarking its code.
2. Differences in coding skill [65]: It is often unclear what coding technique
is best for a given algorithm. Even if a given technique could be agreed on,
it is difficult to guarantee that different programmers have applied the same
technique fairly. This can be mitigated by using and sharing code. This the-
sis uses code that was made available online by Stutzle [47]. However, the
porting of this code from C to Java and the associated refactoring into an
object-oriented implementation undoubtedly introduces further implementa-
tion differences.
3. Degree of tuning of algorithm parameters [65]: Given that it is possible to
adjust parameters so that an algorithm performs well on a set of problems, we
must ask how much adjustment should be done and whether this adjustment
has been done in the same way as in the original research. This thesis intro-
duces a methodical approach to tuning metaheuristics and therefore greatly
improves this aspect of reproducibility of research.
Strictly then, the reproducibility of an algorithm means that ‘if you ran the
same code on the same instances on the same machine/compiler/operating sys-
tem/system load combination you would get the same running time, operation
counts, solution quality (or the same averages, in the case of a randomised algo-
rithm)’ [69].
This is impossible in practice. A broader notion of reproducibility [69] is re-
quired that is acceptable in classical scientific studies. This notion recognises that
while the classical scientist will use the same methods, he will typically use dif-
ferent apparatus, similar but distinct materials and possibly different measuring
techniques. The experiment is deemed reproduced if it produces data consistent
with the original experiment and reaches the same conclusions. Such a notion of
reproducibility must be expected from metaheuristics research.
3.8.1 Reporting results for reproducibility
Even if methods are reproduced exactly for a heuristic experiment, the way results
are calculated and reported can reduce reproducibility. Many of the common ap-
proaches to reporting the performance of an algorithm have drawbacks from the
perspective of reproducibility [69].
• Report the solution value: This is not reproducible since we cannot per-
form similar experiments on similar instances and determine if we are getting
similar results. Furthermore, it provides no insight into the quality of the
algorithm.
• Report the percentage excess over best solution currently known: this
is reproducible only if the current best solution is explicitly stated. Unfortu-
nately, current bests are a moving target and so leave us in doubt about the
algorithm’s true quality.
• Report the percentage excess over an estimate of a random problem's expected optimal: this is reproducible if the estimate and the method of computing it are explicitly stated. It is meaningful only if the estimate is con-
sistently close to the expected optimal.
• Report the percentage excess over a well-defined lower bound: this is
reproducible when the lower bound can be feasibly computed or reliably ap-
proximated.
• Report the percentage excess over some other heuristic: this is repro-
ducible so long as the other heuristic is completely specified. This involves
more than naming the heuristic or citing a reference. Johnson [69] recom-
mends using a simple algorithm as the standard. This standard is preferably
easily specified and deterministic.
3.9 Benchmarking
A machine can be fully described when results are originally reported. Over time,
it becomes increasingly difficult to estimate the relative speeds between the ear-
lier system and the current one because of changes in technology. This has two
consequences. Firstly, the existing results cannot be reproduced. Secondly, new
results cannot even be easily related to the existing results. The solution to this
is benchmarking. Benchmarking is the process of running standard tests on stan-
dard problems on a set of machines so that the machines can be fairly compared in
terms of performance. Johnson [69] advocates benchmarking code in the following
way. The benchmark source code is distributed with the experiment code. The
benchmark is compiled and run on the same machine and with the same compiler
as used for the experiment implementations. The run times for a specified set of
problem instances of varying sizes are reported. Future researchers can calibrate
their own machines in the same way and attempt to normalise existing results to
their newer results. Benchmarking is common in scientific computing and was in-
troduced to the heuristics community at the DIMACS challenges1. Benchmarking
for ACO TSP algorithms has never been reported to our knowledge. The bench-
marking process for this thesis is reported in Chapter 5.
1 http://public.research.att.com/~dsj/chtsp/download.html
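The arithmetic behind such calibration is a single scaling. If the benchmark code takes b_ref seconds on the machine of the original study and b_local seconds on the reader's machine, a reported running time can be roughly normalised as in the sketch below (our own illustration; the linear-scaling assumption is only approximate, since relative machine speeds vary with workload).

    /** Rough cross-machine normalisation of a reported running time. */
    public final class BenchmarkScaling {

        /**
         * Estimates the local running time corresponding to a time reported on a
         * reference machine, assuming machine speeds differ by a constant factor.
         *
         * @param reportedSeconds    running time reported in the original study
         * @param benchmarkReference benchmark time on the original study's machine
         * @param benchmarkLocal     benchmark time for the same code run locally
         */
        public static double normalise(double reportedSeconds,
                                       double benchmarkReference,
                                       double benchmarkLocal) {
            return reportedSeconds * (benchmarkLocal / benchmarkReference);
        }
    }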
3.10 Responses
The issue of responses was already touched on in our discussion of reproducibility
(Section 3.8 on page 65). The choice of performance measure depends on the
questions that motivate the research [79]. A study of how growth rate is affected
by problem size (‘big O’ studies) would count the dominant operation identified in
a theoretical analysis. A study to recommend strategies for data structures might
measure the number of data structure updates. The literature offers some general
guidelines for choosing good performance measures [79].
• Data should not be summarised too early. Algorithms should report outputs
from every trial rather than the means over a number of trials. This is espe-
cially important when data have unusual distributional properties.
• A good performance measure will exhibit small variation within a design point
compared to the variation between distinct design points.
Barr et al [7, p. 14] observe that research questions can broadly be categorised
as questions of quality, computational effort and robustness. They advise that
measures from each category should be used in a well-rounded study. The litera-
ture also examines more specific responses.
3.10.1 CPU Time
Johnson [69] advocates always reporting CPU times, even if they are not the sub-
ject of a study. He presents some reasons why running times are not reported and
his counter arguments.
• The main subject of the study is one component of the running time, for example local optimisation. Readers will still want to know how important
this component is relative to the overall running time.
• The main subject of the study is a combinatorial count related to the algorithm's operation. To establish the meaningfulness of this count, readers
will need to study its correlation with running time. For example, an investi-
gation of pruning schemes used by an algorithm could mislead a reader if it
did not report that the better scheme took significantly longer to run.
This issue also arises in research with ACO where there is often a temptation
to extend the ant metaphor without examining the real cost of this added
complexity.
• The main concern of the study is to investigate the quality of solutions produced by an approximate algorithm. The main motivation of using an
approximate algorithm is that it trades quality of solution for reduced running
time. Readers will want to know what the trade off is.
McGeoch [79] acknowledges that it is often difficult to find combinatorial mea-
sures that predict running times well when an algorithm is highly instantiated.
Coffin and Saltzman [28] argue that CPU time is an appropriate comparison crite-
rion for algorithms when the algorithms being compared have significantly different
architectures and no comparable fundamental operations. Barr et al [7, p. 15-16]
advise recording the following times.
• Time to best-found solution: this is the time required by the heuristic to
find the solution the author reports. This should include all pre-processing.
• Total run time: this is the total algorithm execution time until the execution
of its stopping rule.
• Time per phase: the timing and quality of solution at each phase should be
reported.
One should exercise caution with the time to best solution response. It is only
after the experiment has concluded that we know this was the best solution found.
It can be deceptive to report this value in isolation if the reader is not told how long
the algorithm actually ran for. This is related to the issue of best solution from a
number of runs (Section 3.10.5 on the next page).
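For a fully instantiated Java implementation, per-run CPU time (as opposed to wall-clock time) can be recorded with the standard ThreadMXBean, as in the illustrative sketch below. This is not the instrumentation used in this thesis, and such measurements are noisy, so each design point should be replicated (Section 3.5 on page 59).

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    /** Records the CPU time consumed by the current thread around one run of a task. */
    public final class CpuTimer {

        private final ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        /** Returns the CPU seconds used by the current thread while running the task. */
        public double cpuSeconds(Runnable task) {
            if (!bean.isCurrentThreadCpuTimeSupported()) {
                throw new UnsupportedOperationException("CPU timing not supported on this JVM");
            }
            long before = bean.getCurrentThreadCpuTime(); // nanoseconds of CPU time
            task.run();
            long after = bean.getCurrentThreadCpuTime();
            return (after - before) / 1e9;
        }
    }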
3.10.2 Relative Error and Adjusted Differential Approximation
According to Barr et al [7, p. 15], comparison should be made to the known optimal
solution. We have already mentioned some criticisms of the specifics of how this
comparison is made (Section 3.8 on page 65). Birattari [12] discusses measures of
performance in terms of solution quality. He rightly dismisses absolute error as it
is not invariant with a scaling of the cost function. He also dismisses the use of
relative error since it is not invariant under some transformations of the problem,
as first noted by Zemel [124]. An example is given of how an affine transformation2
of the distances between cities in the TSP leaves a problem that is essentially the same but whose solutions have different relative errors. Birattari uses a variant of
Zemel’s differential approximation measure [125] defined as:
\[
c_{de}(c, i) = \frac{c - c_i}{c^{rnd}_i - c_i} \tag{3.1}
\]
where $c_{de}(c, i)$ is the differential error of a solution to instance $i$ with cost $c$, $c_i$ is the cost of the optimal solution and $c^{rnd}_i$ is the expected cost of a random solution to instance $i$. An additional feature of this Adjusted Differential Approximation (ADA) is that its value for a random solution is 1, so the measure indicates how good a method is relative to a trivial method, which in this case is a random solution. It can therefore be considered as incorporating a lower bracketing standard
(Section 3.5 on page 59).
2 An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). Geometric contraction, expansion, dilation, reflection, rotation, and shear are all affine transformations.
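To make Zemel's invariance point concrete, the following short derivation (our own notation, with scale $\alpha > 0$ and offset $\beta$) shows why the differential error survives an affine transformation of the city distances while relative error does not. Every tour in an $n$-city TSP has exactly $n$ edges, so transforming each distance $d$ to $\alpha d + \beta$ transforms every tour cost $c$ to $\alpha c + n\beta$:
\[
c' = \alpha c + n\beta, \qquad c'_i = \alpha c_i + n\beta, \qquad c'^{rnd}_i = \alpha c^{rnd}_i + n\beta,
\]
\[
c_{de}(c', i) = \frac{c' - c'_i}{c'^{rnd}_i - c'_i} = \frac{\alpha (c - c_i)}{\alpha (c^{rnd}_i - c_i)} = c_{de}(c, i),
\]
whereas the relative error $(c' - c'_i)/c'_i = \alpha(c - c_i)/(\alpha c_i + n\beta)$ clearly depends on $\beta$.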
ADA is not yet a widely used solution quality response. This thesis measures
and analyses both relative error and ADA in keeping with Cohen’s [29] recommen-
dation on multiple performance measures (Section 3.5 on page 59).
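In code, both quality responses are one-line computations over the same three quantities; the sketch below uses our own hypothetical names rather than those of the thesis implementation.

    /** Solution quality responses computed from cost, optimal cost and expected random cost. */
    public final class QualityResponses {

        /** Relative error: 0 at the optimum; not invariant under affine cost transformations. */
        public static double relativeError(double cost, double optimalCost) {
            return (cost - optimalCost) / optimalCost;
        }

        /** ADA (Equation 3.1): 0 at the optimum, 1 for an average random solution. */
        public static double ada(double cost, double optimalCost, double expectedRandomCost) {
            return (cost - optimalCost) / (expectedRandomCost - optimalCost);
        }
    }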
3.10.3 Relative Terms
Relative terms are responses expressed as some type of quotient, such as the number of iterations divided by the average number of iterations. Crowder et al [35] are not in
favour of using relative terms when reporting performance. While relative terms
do make comparison more difficult, Johnson [69] argues that relative performance
indicators are often enlightening. It is important that enough information is pro-
vided so that the original components of the relative term can be recovered for
reproducibility (Section 3.8 on page 65).
3.10.4 Frequency of Optimum
While it is of interest to determine the probability with which an algorithm will
find an optimal solution for a given instance, this frequency has limitations when used as a
metric [69]. Firstly, it limits analysis to instances for which optima are actually
known. Secondly, it ignores how near the algorithm gets when it doesn’t find
the optimum. Thirdly, it cannot distinguish between algorithms on larger prob-
lem instances where the probability of finding the optimal solution is usually 0.
Moreover, this response overemphasises finding an optimum when the heuristic
compromise is about finding a good enough solution in reasonable time.
3.10.5 Best Solution from a number of runs
Birattari and Dorigo [13] criticise the use of the best solution from a number of runs
as advocated by others [48]. They dismiss this measure as ‘not of any real interest’
since it is an over-optimistic measure of the performance of a stochastic algorithm. The authors also
counter the reasoning that in a real world scenario one would always use the best
of several runs [48]. Firstly, it leads to an experiment measuring the performance
of a random restart version of the algorithm. Secondly, this random restart version
is so trivial (repeated run of the same algorithm with no improvement or input from
the previous run) that it would not be a sound restart strategy anyway with the
given resources. Johnson [69] levels two further criticisms at the reporting of the
best solution found from multiple runs on a problem instance. Because the best
run is a sample from the tail of a distribution, it is necessarily less reproducible
than the average. Also, if running time is reported, it is generally for that best run
of the algorithm and not for the entire number of runs that yielded the reported
best solution (Section 3.10.1 on page 68). This obscures the time actually required
to find the reported solution. If the number of runs is not stated, there is no way
to determine the real running time. Even when the number of runs is reported,
multiplying the number of runs by the reported run time would overestimate the
time needed. Actions such as setting up data structures need only be done once
when multiple runs are performed.
3.10.6 Use of Averages
Reports of averages should be accompanied by at least a measure of spread.
Any scaling or normalising of averages should be carefully explained so that raw
averages can be recovered if necessary.
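As a minimal illustration of this advice (with our own hypothetical names), a summary of replicate responses should pair the mean with at least the sample standard deviation, so that the spread remains recoverable.

    /** Summarises replicate responses: a mean should be reported with a measure of spread. */
    public final class ReplicateSummary {

        /** Sample mean of the replicate responses. */
        public static double mean(double[] x) {
            double sum = 0.0;
            for (double v : x) sum += v;
            return sum / x.length;
        }

        /** Sample standard deviation (n - 1 denominator); assumes at least two replicates. */
        public static double stdDev(double[] x) {
            double m = mean(x);
            double ss = 0.0;
            for (double v : x) ss += (v - m) * (v - m);
            return Math.sqrt(ss / (x.length - 1));
        }
    }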
3.11 Random number generators
Several problems can occur with the use of pseudo-random number generators
and differences in numerical precision of machines [79]. These problems can be
identified with replication. Firstly, a faulty implementation of a generator can
introduce a bias in the stream of numbers produced and this can interact with the
algorithm. Treatments should be replicated with more than one random number
generator. Secondly, differences in numerical precision of machines can introduce
biases into an algorithm’s behaviour. Treatments should be replicated with the
same generator and seeds on different machines.
It is difficult to implement a good generator correctly [92]. The source code in
this thesis uses the 'minimal standard' generator of Park and Miller [92] described in the
literature [99, p. 279]. This is the generator used in the original source code by
Stutzle and Dorigo [47].
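For reference, the generator's recurrence is x_{n+1} = 16807 x_n mod (2^31 - 1). A direct Java transcription is short because 64-bit arithmetic sidesteps the overflow that Schrage's factorisation works around in C implementations; the sketch below follows the published algorithm [92] and is not a copy of the thesis source.

    /** Park-Miller 'minimal standard' linear congruential generator. */
    public final class MinimalStandard {

        private static final long A = 16807L;      // 7^5, the multiplier
        private static final long M = 2147483647L; // 2^31 - 1, a Mersenne prime

        private long state;

        public MinimalStandard(long seed) {
            // The state must lie in [1, M-1]; zero is a fixed point of the recurrence.
            this.state = Math.floorMod(seed, M);
            if (this.state == 0) {
                this.state = 1;
            }
        }

        /** Advances the generator and returns the next state in [1, M-1]. */
        public long next() {
            state = (A * state) % M; // cannot overflow a 64-bit long
            return state;
        }

        /** Uniform deviate in the open interval (0, 1). */
        public double nextDouble() {
            return next() / (double) M;
        }
    }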
3.12 Problem instances and libraries
There are two basic types of test instance: (1) instances from real-world appli-
cations and (2) randomly generated instances. The former are found in libraries
such as TSPLIB [102] or come from private sources. The latter come from instance
generators. A generator is software that, given some parameters, produces a ran-
dom problem instance consistent with those parameters. Real-world data sets are
desirable because the instances automatically represent many of the patterns and
structures inherent in the real world [101]. However, real-world data sets are often
proprietary and may not span all the problem characteristics of interest.
Randomly-generated test instances offer many conveniences.
• Control of problem characteristics [101]. If the problem generator is prop-
erly designed, then the problem characteristics are explicitly under the re-
searcher’s control. This enables the researcher to cover regions of the design
space that may not be well covered by available real-world data or libraries.
This control can be a necessity with the experiment designs in the Design Of
Experiments approach. When problems can be distinguished by some pa-
rameter, these parameters should be treated as independent variables in the
analysis [28, p. 28].
• Replicates [101]. The problem generator can create an unlimited supply of
problem instances. This is particularly valuable in high variance situations
for which statistical methods demand many replicates.
• Known optimum [101]. Some problem generators can generate problem in-
stances with a known optimal solution. Knowing the optimum is important
both for bracketing standards (Section 3.5 on page 59) and for the calculation
of some response measures (Section 3.10 on page 68). However, knowing an
optimum may bias an experiment.
• Stress testing [69]. Problem generators can be used to determine the largest
problem size that can be feasibly run on a given machine. This is important
when deciding on ranges of problem sizes to experiment with in the pilot study
phase (Section 3.7 on page 65). Barr et al [7, p. 18] also support this argu-
ment. They state that many factors do not show up on small instances but
do appear on larger instances. Experiments with smaller instances therefore
may not lead to accurate predictions for larger, more realistic instances.
A poorly designed generator can lead to misleading unstructured random prob-
lem instances. Johnson [69] refers to Asymmetric TSP papers that report codes
that easily find optimal solutions to generated unstructured problems with sizes
of the order of thousands of cities yet struggle to solve structured instances from
TSPLIB of sizes less than 53 cities.
Online libraries of problem sets, be they real-world or randomly generated,
should be used with caution [101].
• Quality [101]. It is sometimes unclear where a particular instance originated
from and whether the instance actually models a real-world problem. Inclu-
sion in a library generally does not make any guarantees about the quality of
the instance.
• Not Representative [101, 65]. Some instances appearing in publications
may be contrived to illustrate a particular feature of an algorithm or to illus-
trate an algorithm’s pathological behaviour. They are therefore not suitable
as representative instances and may even be misleading.
• Biased [101]. Problem instances are often published precisely because an
algorithm performs well specifically on those instances. The broader issue of
bias is covered in Section 3.14 on page 74.
• Misdirected research focus [101, 65]. The availability of benchmark test
instances can draw researchers into making algorithms perform well on those
instances. As Hooker [65] puts it, ‘the tail wags the dog’ as problems begin to
design algorithms. This changes the context of a study from one of research to
one of development and encourages the premature publication of horse race
studies (Section 3.2 on page 55) before an algorithm is completely understood.
In summary, it would seem that problem generators are a necessity for designed
experiments. It is preferable to have access to a generator rather than relying on
benchmark libraries. Generators that are well-established and tested are prefer-
able to developing one’s own. This thesis uses a generator from a large community
research competition [58].
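As a minimal illustration of the control a generator gives, the sketch below draws city coordinates uniformly at random in a square from an explicit seed, so that instance size is an explicit factor and every instance is replicable. It is a toy stand-in, with our own names, for a full generator such as the competition generator used in this thesis [58].

    import java.util.Random;

    /** Minimal uniform random Euclidean TSP instance generator (illustrative only). */
    public final class UniformTspGenerator {

        /** Returns n city coordinates drawn uniformly at random from a square of the given side. */
        public static double[][] generate(int n, double side, long seed) {
            Random rng = new Random(seed); // a fixed seed makes the instance replicable
            double[][] cities = new double[n][2];
            for (int i = 0; i < n; i++) {
                cities[i][0] = rng.nextDouble() * side;
                cities[i][1] = rng.nextDouble() * side;
            }
            return cities;
        }
    }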
3.13 Stopping criteria
Heuristics can run for impractically long time periods. A stopping criterion is some
condition that causes the heuristic to halt execution. One typically sees several
types of stopping criteria in the heuristics literature. We term these (1) CPU time
stopping criterion, (2) computational count stopping criterion and (3) quality stop-
ping criterion. In the first two types, a heuristic is halted after a given
amount of time or after a given number of computational counts (such as the num-
ber of iterations). In the third type, the heuristic is halted once a given solution
quality (typically the optimum solution) is achieved.
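The three types map naturally onto a single programming interface. The sketch below is our own illustration (names and signatures are hypothetical); each implementation inspects only the quantity its criterion is defined over.

    /** A condition that halts the heuristic; one implementation per type above. */
    interface StoppingCriterion {
        boolean shouldStop(long iterations, double cpuSeconds, double bestQuality);
    }

    /** (1) CPU time criterion: halt after a fixed time budget. */
    final class TimeBudget implements StoppingCriterion {
        private final double budgetSeconds;
        TimeBudget(double budgetSeconds) { this.budgetSeconds = budgetSeconds; }
        public boolean shouldStop(long iterations, double cpuSeconds, double bestQuality) {
            return cpuSeconds >= budgetSeconds;
        }
    }

    /** (2) Computational count criterion: halt after a fixed number of iterations. */
    final class IterationBudget implements StoppingCriterion {
        private final long maxIterations;
        IterationBudget(long maxIterations) { this.maxIterations = maxIterations; }
        public boolean shouldStop(long iterations, double cpuSeconds, double bestQuality) {
            return iterations >= maxIterations;
        }
    }

    /** (3) Quality criterion: halt once a target quality is reached (lower cost is better). */
    final class QualityTarget implements StoppingCriterion {
        private final double targetCost;
        QualityTarget(double targetCost) { this.targetCost = targetCost; }
        public boolean shouldStop(long iterations, double cpuSeconds, double bestQuality) {
            return bestQuality <= targetCost;
        }
    }

Whichever criterion halts a run, reporting the computational count alongside the CPU time helps preserve reproducibility, as the following paragraphs argue.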
Running experiments with a time stopping criterion has been criticised on the
grounds of reproducibility [69]. A run on a different machine or with a different
implementation will have a distinctly different quality because of differences be-
tween the experimental material (Section 3.8 on page 65). Johnson goes so far
as to state that ‘the definition of an algorithm in this way is not acceptable for a
scientific paper’ [69].
Using a computational count as a stopping criterion is preferred by some au-
thors [69] and is generally the most common type of stopping criterion in the liter-
ature. Furthermore, one can report running time alongside computational count.
This permits other researchers to reproduce the work (using the computational
count) and observe differences in run times caused by their machines and imple-
mentations.
Johnson [69] objects to the use of attaining an optimal value as a stopping
criterion on the grounds that in practice one does not typically run an approximate
algorithm on an instance for which an optimal solution is known. In addition, we
argue that this overemphasises the search for optima when this is not the purpose
of a heuristic.
There is some evidence that the choice of stopping criterion could affect the
appropriate choice of tuning parameter settings for a heuristic. Socha [115] inves-
tigated the influence of the variation of a running time stopping criterion on the
best choice of parameters for the Max-Min Ant System (Section 2.4.5 on page 39)
heuristic applied to the University Course Timetabling Problem. Three levels of a
local search component, ten levels of pheromone evaporation rate, eleven levels of
pheromone lower bound and four levels of fixed run-time were investigated. The
local search levels were not varied with the other two parameters so we do not
know whether these interact. Furthermore, the parameter levels used with the
separate local search investigation were not reported and analyses were performed
on only two instances. The remaining three algorithm parameters were compared
in a full factorial type design on a single instance with 10 replicates. The motiva-
tion for this number of replicates was not mentioned. A fractional factorial design
would have been sufficient to determine an effect due to stopping criterion and
this would have offered huge savings in the number of experiment runs. Despite
this, the work does seem to indicate that different parameter settings are more
appropriate for different run-times of MMAS for one instance of the UCTP. This is
intuitive when one realises that the parameters investigated, pheromone evapora-
tion and pheromone lower bound, have an influence on the explore/exploit nature
of the MMAS algorithm. Obviously, exploration is a more sensible strategy when
a greater amount of run-time is available. Pellegrini et al [96] attempt an analysis
of the effect of run-time on solution parameters but this has many flaws that we
discuss in Section 4.3.1 on page 82.
The result of Socha [115] has the following implication for parameter tuning
experiment designs: results are restricted to the specific stopping criterion used.
Either (1) the stopping criterion (and a range of its settings) should be included as a
factor in the experiments or (2) the analyses should be conducted at several levels
of the stopping criterion settings. For example, if a fixed iteration stopping criterion
were used then the number of fixed iterations could be included as a factor or
analyses should be conducted after several different fixed iterations. The former
approach permits the most general conclusions at the cost of greatly increased
experimental resources.
3.14 Interpretive bias
The issue of bias is well recognised in the medical research field [71]. Its dangers
are equally relevant to the heuristics field. Bias is probably unavoidable given the
nature of science.
Good science inevitably embodies a tension between the empiricism of
concrete data and the rationalism of deeply held convictions. Unbiased
interpretation of data is as important as performing rigorous experi-
ments. This evaluative process is never totally objective or completely
independent of scientists’ convictions or theoretical apparatus. [71, p.
1453]
There are several types of bias that can affect the interpretation of results and
we relate these to the heuristics field here.
• Confirmation bias. Researchers evaluate research that supports their prior
beliefs differently from research challenging their convictions. Higher stan-
dards are expected of the research that challenges convictions. This bias is
often unintentional.
• Rescue bias. This bias involves selectively finding faults in an experiment
that contradicts expectations. It is generally a deliberate attempt to evade
evidence.
• Auxiliary hypothesis bias. This is a form of rescue bias in which the original
hypothesis is modified in order to imply that results would have been different
had the experiment been different.
• Mechanism bias. Evidence is more easily accepted when it is supported by
accepted scientific mechanisms.
• ‘Time will tell’ bias. Scientific scepticism necessitates a judicious attitude
of requiring more evidence before accepting a result. This bias affects the
amount of such evidence that is deemed necessary.
“A new scientific truth does not triumph by convincing its opponents
and making them see the light, but rather because its opponents
eventually die, and a new generation grows up that is familiar with
it.” Max Planck [97]
• Orientation bias. This reflects the phenomenon of experimental and recording errors tending to lie in the direction that supports the hypothesis. This arises in the
pharmaceuticals industry, for example, where trials consistently favour the
new pharmaceutical treatments.
Clearly these biases can affect interpretation of results regardless of the atten-
tion paid to the aforementioned issues.
3.15 Chapter summary
This chapter has covered the following topics.
• Concerns regarding many aspects of experiment design for heuristics have
appeared throughout the heuristics literature over the past three decades.
These concerns have not been addressed in the Ant Colony Optimisation lit-
erature.
• It is important to ask whether the heuristic is even worth researching. The
temptation to invent creative extensions to algorithms or explore new nature-
inspired metaphors can distract us from the real task of producing optimisa-
tion heuristics that produce feasible solutions in acceptable time.
• There are several types of experiment one can conduct. The appropriate type
will depend on the life cycle of the heuristic and the problem domain. Each
type of experiment can answer several types of research question.
• There are clearly defined steps to good design and analysis of experiments.
Nonetheless, there are many potential pitfalls for the analyst and many design
and analysis decisions that must be made and justified.
• Different levels of heuristic instantiation and problem domain abstraction
are appropriate for different types of study and research question. This the-
sis studies a highly instantiated metaheuristic applied to a problem type of
medium abstraction.
• Machines should be benchmarked so that results can be correctly interpreted
by other researchers and can be scaled to different types of experiment mate-
rial (machine architecture, compiler, programming language etc).
• A broad notion of reproducibility for empirical research with metaheuristics
states that an experiment is reproducible if others can produce consistent
data that leads to the same conclusions.
• There are many types of performance responses one can measure and report.
• One should exercise caution in the choice of random number generator. It
is difficult to implement a generator well and poor implementations can bias
research results.
• Problem instances can be so-called real-world instances or randomly gener-
ated instances. Both have their advantages and disadvantages. Randomly
generated instances are probably more appropriate when one needs explicit
control of problem instance characteristics. It is difficult to implement a generator well, so it is preferable to use an established and well-tested generator. There are several potential dangers with online libraries of instances, be they real-world instances or randomly generated ones.
• Because heuristics can run for a significant time, continuously improving
their solution, one needs to choose a stopping criterion to halt an experiment.
Stopping criteria are generally based on a computation count, a clock time or
when a predefined solution quality is attained.
• There are several types of interpretive bias that can affect the researcher’s
assessment of results, even from the most rigorously designed experiment.
The next chapter will review experiments on tuning metaheuristics in light of
the concerns summarised in this chapter.
4 Experimental work
The previous chapter summarised the most important experiment design and anal-
ysis concerns that have been raised in the heuristics literature. The discussion of
these concerns was related to Ant Colony Optimisation (ACO) research at a general
level. This chapter examines research that is relevant to parameter tuning of ACO
in the context of the concerns that have been identified. It begins with a review
of the most significant attempts to analyse problem difficulty for algorithms. The
chapter then continues with approaches to tuning heuristics and metaheuristics
other than ACO. This is necessary because the fields of operations research and
heuristics in general have been better than the ACO field at recognising and ad-
dressing the parameter tuning problem. Lessons can be learned from these fields.
Finally, this chapter addresses parameter tuning approaches for the ACO meta-
heuristic, the focus of this thesis. Of course, parameter tuning should be a major
part of any ACO research effort. It is integral to the effective application of the
heuristic. A comprehensive review of parameter tuning would therefore neces-
sitate reviewing almost all ACO literature. A glance through the ACO literature
should convince the reader that methodical and reproducible parameter tuning of
ACO is rarely addressed, despite its identification as an open research topic [47].
This chapter will therefore limit its scope to papers that have explicitly proposed
and investigated methods for the parameter tuning of ACO.
4.1 Problem difficulty
Some problem instances are more difficult for an algorithm (exact or heuristic) to
solve than other instances. It is critically important to understand which instances
can be expected to be more difficult for a given algorithm. Essentially this involves
investigating which levels of one or more problem characteristics (and combina-
tions of levels) have a significant effect on problem difficulty.
Fischer et al [49] investigated the influence of Euclidean TSP structure on the
performance of two algorithms, one exact and one heuristic. The exact algorithm
was branch-and-cut [5] and the heuristic was the iterated Lin-Kernighan algorithm
[63]. In particular, the TSP structural characteristic investigated was the distribu-
tion of cities in Euclidean space. The authors varied this distribution by taking a
structured problem instance and applying a perturbation operator to the city dis-
tribution until the instance resembled a randomly distributed problem. There were
two perturbation operators. A reduction operator removed between 1% to 75% of
the cities in the original instance. A shake operator offset cities from their origi-
nal location. Using 16 original instances, 100 perturbed instances were created for
each of 8 levels of the perturbation factor. Performance on perturbed instances was
compared to 100 instances created by uniformly randomly distributing cities in a
square. Predictably, increased perturbation led to increased solution times that
were closer to the times for a completely random instance of the same size. It was
therefore concluded that random Euclidean TSP instances are relatively hard to
solve compared to structured instances. Unfortunately, it is unavoidable that the
reduction operator confounds the change in problem structure with a reduction in problem size, a
known factor in problem difficulty. Nonetheless, the research of Fischer et al leads
us to suspect that structured instances possess some feature that algorithms can
exploit in their solution whereas completely random instances are lacking that
feature and consequently may be unrealistically difficult. These results tie in with
arguments over the merits of problem instance generators discussed previously
(Section 3.12 on page 71).
Van Hemert [119] evolved TSP instances of a fixed size that were difficult to solve
for two heuristics: Chained Lin-Kernighan and Lin Kernighan with Cluster Com-
pensation. TSP instances of size 100 were created by selecting 100 coordinates uniformly at random from a 400x400 grid. An initial population of such instances was
evolved for each of the algorithms where higher fitness was assigned to instances
that required a greater effort to solve. This effort was a combinatorial count of
the algorithms’ most time-consuming procedure. This is an interesting approach
that side-steps the difficult issues related to CPU time measurement discussed in
Section 3.10.1 on page 68 while still acknowledging the relevance and importance
of CPU time. Van Hemert then analysed the evolved instances using several inter-
esting approaches. His aim was to determine whether the evolutionary procedure
made the instances more difficult to solve and whether that difficulty was specific
to the algorithm. The first approach considered was box plots of the mean, median and the 5th and 95th percentile range. Secondly, the author looked at the frequency with
which each algorithm found an optimal solution in each of the problem sets and
the average discrepancy between the algorithm solution and the known optimum.
The problems with the first of these responses have already been discussed (Sec-
tion 3.10.4 on page 70). The average number of clusters in each set was measured
with a deterministic clustering algorithm. The average distribution of tour seg-
ment lengths was measured for both problem sets as well as the average distance
between pairs of nodes. Finally, to verify whether difficult properties were com-
mon to both algorithms, each algorithm was run on the other algorithm’s evolved
problem set. A set evolved for one algorithm was less difficult for the other al-
gorithm. However, the alternative evolved set still required more effort than the
random set, indicating that some difficult instance properties were shared by both
evolved problem sets. Van Hemert’s conclusions may have been limited by the lack
of a rigorous experiment design. The approach can be summarised as evolving
instances and then looking for characteristics that might explain any observed dif-
ferences in problem hardness. This offers no control over problem characteristics.
Ideally, one should hypothesise a characteristic that affects hardness and then
test that hypothesis while controlling for all other characteristics. This was exactly
the approach taken in the next piece of research and in this thesis.
Cheeseman et al [26] explored the idea of defining an ‘order parameter’ for NP
instances such that critical values of this parameter describe instances that are
particularly hard to solve. The basic idea is that such a critical value divides the
space of problems into two regions. One region is underconstrained and so has a
high density of solutions. This makes it relatively easy to find a solution. The other
region is overconstrained and so has very few solutions. However, these solutions
typically have very distinct local maxima/minima and so again are relatively easy
to find. The difficult problems occur at the boundary between these two regions
where there are many minima/maxima corresponding to almost complete solu-
tions. In essence, the algorithm is forced to investigate many ‘false leads’. In some
ways, this concept of critical values of an order parameter resembles that of phase transitions used in statistical mechanics and physics.
Cheeseman et al [26] investigated the presence of these transitions when vari-
ous algorithms were applied to four problems: finding Hamiltonian circuits, graph
colouring, k-satisfiability and the Travelling Salesperson Problem. In the TSP in-
vestigations, three problem sizes of 16, 32 and 48 were investigated. For each
problem size, many instances were generated such that each instance had the
same mean cost but a varying standard deviation of cost. Mean and standard
deviation of edge lengths were controlled by drawing edge lengths from a Log-
Normal distribution. The computational effort for an exact algorithm to solve each
of these instances was measured and plotted against standard deviation of TSP
edge lengths. The plots showed an increase in the magnitude and sharpness of the
phase transition with increasing problem size.
Although conducted only with an exact algorithm on relatively small instance
sizes, this research leads us to expect that edge length standard deviation may
have a significant influence on problem difficulty for other heuristics. This is an
important research question that this thesis will answer for the ACO algorithms.
4.2 Parameter tuning of other metaheuristics
Adenso-Díaz and Laguna [2] have used a factorial design combined with a local
search procedure to systematically find the best parameter values for a heuristic.
Their method, CALIBRA, was demonstrated on 6 different combinatorial optimi-
sation applications, mostly related to machine scheduling. CALIBRA begins by
finding a set of ‘optimal’ parameter values using the Taguchi methodology [95] in a
2^k factorial design. The Taguchi methodology is based on a linear assumption that
can lead to large differences between the predicted ‘optimal’ values and the true
optimum. CALIBRA therefore uses this analysis only as a guideline to focus the
search through the parameter space. An iterative local search is then used to im-
prove on the parameter values within a refined region of the parameter space. The
parameter values found by CALIBRA led to better algorithm performance than the
values used by some other authors. In all other situations, the CALIBRA parame-
ter values did not perform significantly better or worse than the parameter values
used by the original authors. A main limitation of CALIBRA is that it can only
tune five algorithm parameters. A more serious limitation is that CALIBRA does
not examine interactions between parameters and so cannot be used in situations
where such interactions might be significant. Later chapters will demonstrate that
interactions are present in ACO tuning parameters and so the more sophisticated
experiment designs of this thesis are required.
Coy et al [33] present a systematic procedure for finding good heuristic param-
eter settings on a range of Vehicle Routing Problems (VRP). This methodology was
applied to two local search heuristics with 6 tuning parameters and a total of 34
VRPs. The new parameters found results that were, on average, 1% to 4% better
than the best known solutions. Broadly, Coy et al’s procedure works by finding
high quality parameter settings for a small number of problems in the problem
set and then combining these settings to achieve a good set of parameters for the
complete problem set. The procedure is as follows:
1. A subset of the problem set is chosen for analysis. This subset should be
representative of the key problem characteristics in the entire set. In their
paper’s case, example key VRP characteristics are demand distribution and
customer distribution.
2. A starting value and range for each parameter is determined. This requires
either a judgement based on previous experience with the heuristic or a pilot
study.
3. A factorial or fractional factorial design is used to determine the parameter
settings. Linear regression gives a linear approximation of the response sur-
face. The path of steepest descent along this surface is calculated, beginning at the starting point identified from the parameter study (a sketch of this step follows the list).
4. Steps 2 and 3 are repeated for each problem in the analysis set.
5. The parameter vectors determined for each problem are averaged to obtain the final
parameter settings for the heuristic over all problem instances.
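The path-of-steepest-descent calculation in step 3 is mechanical once the first-order model has been fitted: each coded factor moves in proportion to the negative of its regression coefficient. The following is a minimal Java sketch of that step only; the class name, coefficient values and step size are illustrative assumptions, not Coy et al's code.

```java
/**
 * Minimal sketch of step 3's path of steepest descent for a first-order model
 * y ~ b0 + b1*x1 + ... + bk*xk fitted over coded factors in [-1, +1].
 * Coefficient values, step size and names are illustrative assumptions.
 */
public final class SteepestDescentSketch {

    /** Returns 'steps' points along the path, starting from the design centre. */
    static double[][] path(double[] slopes, double stepSize, int steps) {
        int k = slopes.length;
        double norm = 0.0;                      // normalise the gradient
        for (double b : slopes) norm += b * b;
        norm = Math.sqrt(norm);
        double[][] points = new double[steps][k];
        for (int s = 0; s < steps; s++) {
            for (int j = 0; j < k; j++) {
                // Steepest descent: move against the gradient in equal increments.
                points[s][j] = -(s + 1) * stepSize * slopes[j] / norm;
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // Hypothetical slopes for three tuning parameters, in coded units.
        for (double[] p : path(new double[] {2.1, -0.4, 0.9}, 0.25, 5)) {
            System.out.println(java.util.Arrays.toString(p));
        }
    }
}
```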
Coy et al’s approach does not use higher order regression models (such as
quadratic). The author’s chose the simpler linear approach because different re-
sponse surfaces are averaged over all test instances. The authors believed their
approximate approach would not be significantly enhanced by a more complicated
response surface. However, this comparison was not performed.
In Coy et al’s work, their VRP problems were chosen based on three charac-
teristics, distribution of customers, distribution of demand and problem size. The
decisions were based on a graphical analysis of these characteristics for each in-
stance. In their conclusions, Coy et al acknowledge that the method will perform
poorly in two scenarios:
• if the representative test problems are not chosen correctly or
• if the problem class is so broad that it requires very different parameter set-
tings.
While they recommend creating problem subclasses based on the ‘significant’
problem characteristics, they give no detail of how such significance could be de-
termined. The first of these shortcomings could conceivably be mitigated in an
application scenario by building up a repository of instances to which the tuning
procedure has been applied. The second shortcoming is more troublesome. It is
likely that many problems have instances that require quite different parameter
settings. Coy et al’s method does not build a model of the relationship between
instances, parameter settings and performance. It therefore cannot recommend
parameter settings for varying combinations of problem characteristics. The de-
signs introduced in this thesis can make such recommendations.
Parsons and Johnson [94] used a 2^4 full factorial replicated design to screen
the best parameter settings for four genetic algorithm parameters applied to a data
search problem (specifically DNA sequencing). Their stopping criterion was a fixed
number of trials. They used a sequential experimentation procedure, running first a half fraction, then the other half fraction and finally the replicates. Only two parameters were deemed important and so a steepest ascent
approach with these parameters was used to determine the centre point of a central
composite design for building a response surface. This response surface then
allowed the authors to improve the genetic algorithm performance on the tested
data set. Experiments on larger data sets with these parameter settings showed
improvements in both solution quality and computational effort. Unfortunately,
no analysis of the data set characteristics was made so we cannot determine why
parameters tuned on one data set worked so well for larger data sets. The authors
could have halved the number of experiment runs by using a resolution IV 2^(4-1) fractional factorial (Section A.3.2 on page 215) instead of the 2^4 full factorial.
Analysis of Variance (ANOVA) and response surface models have been used for
parameter tuning on several occasions in the heuristics literature. For example,
Van Breedam [22] attempted to find significant parameters for a genetic algorithm and a simulated annealing algorithm applied to the vehicle routing problem using an analysis of variance technique. Seven GA and eight SA parameters were examined. Park and Kim [91] used a non-linear response surface method to find
parameter settings for a simulated annealing algorithm.
None of these methods have been applied to ACO.
4.3 Parameter tuning of ACO
Existing approaches to understanding the relationship between tuning parame-
ter settings, problem characteristics and performance of ACO fall into three categories: (1) Analytical Approaches, (2) Automated Approaches and (3) Empirical Approaches. Recent analytical approaches attempt to understand parameters and recommend parameter settings based on mathematical proof. Automated approaches attempt to use some other algorithm or heuristic to tune the ACO heuristic. The automated approach may use the heuristic itself in a kind of introspective
manner, in which case we term the approach Automated Self-tuning. Alternatively,
Heuristic Tuning uses some other heuristic to search for good parameter settings
for the tuned heuristic. The final category is Empirical approaches. These gather
data about the heuristic and attempt to analyse it to draw conclusions about the
heuristic. Empirical approaches in ACO are either Trial-and-Error or occasionally
the One-Factor-At-a-Time (OFAT) approach. This thesis uses the Design Of Experiments approach, bringing the experiment designs and analysis procedures from
DOE to bear on the parameter tuning problem.
4.3.1 Analytical Approaches
Dorigo and Blum [42] provide a survey of analytical results relating to ant colony
optimisation. They acknowledge that the convergence proofs they summarise for
ACO are of little use to a practitioner since the proofs often assume the availability
of either infinite time or infinite space [42, p. 246]. Their discussion is preceded by a simplification of the transition probabilities of the Construct Solutions phase
(Section 2.4.3 on page 37) so that heuristic information is omitted. Their motiva-
tion for this simplification is to 'ease the derivations' [42, p. 256]; however, this
simplification ignores the reality that heuristic information is well-established as
a highly significant contributor to algorithm performance. The authors also admit
that none of the convergence proofs discussed make reference to the ‘astronomi-
cally large’ time to find the optimal solution [42, p. 260].
Neumann and Witt attempt an analysis of run time and evaporation rate on the
OneMax problem [86] and the LeadingOnes and BinVal problems [41]. An abstract
ACO algorithm is analysed in both cases. However, no comparison with empirical
data is given and so it is impossible to tell whether any of the proofs’ assumptions
have affected their analyses’ application to instantiated ACO algorithms. Despite
this, the authors claim that ‘It is shown that the so-called evaporation factor ρ, the
probably most important parameter in ACO algorithms, has a crucial impact on
the runtime’ [41, p. 34]. These claims are far too general given that their analyses
are at such an early stage and apply only to a single abstract algorithm with no
test of predictions on real instantiated algorithms. These claims will be tested later
in this thesis.
Pellegrini et al [96] attempt an analytical prediction of suitable parameter set-
tings for MMAS for a given run time. The most important parameters are deemed
to be the number of ants, the pheromone evaporation rate and the exponents of the
pheromone and heuristic terms. This is a subjective judgement as no screening or
other analysis is done to verify it. Dorigo and Stutzle’s MMAS code [47] was used
and so we should expect results to be consistent with results from this thesis. Ex-
periments were performed without local search and assuming pheromone update
is not time consuming. Examination of the code used shows this assumption to be incorrect: when no local search is used, pheromone evaporation takes place on all edges in the problem rather than being limited to the edges in the candidate list. This issue was described in Section 2.4.8 on page 44, where a computation limit parameter was introduced. In general, the reasoning about the parameters is
reported quite vaguely. For example, ‘It is easy to see that the number of iterations
is the leading force, at least until a certain threshold. Nonetheless, the number
of nodes has a remarkable impact as well.' We have no measure of 'easy', 'leading force', 'a certain threshold' or 'remarkable'.
The values recommended by the analysis are then compared to the values rec-
ommended by an automated tuning algorithm, F-Race [14]. Available solution time
was set to six arbitrarily chosen levels varying between 5 and 120 seconds. The
F-Race recommendations were observed to match the predicted trends of the anal-
ysis for the exponent parameters and the number of ants but not the pheromone
decay parameter for just two distinct levels of instance size, 300 and 600. The
failure on prediction of pheromone value may be due to the incorrect assumption
highlighted earlier. Regardless, the results do not confirm the authors' analysis; they merely show that it agrees with some of the F-Race results. Once again, parameters were treated in isolation and the
unstated assumption is that there is no possibility of many different combinations
of parameter settings achieving the same performance under the constraint of a
fixed solution time. Times were not benchmarked properly so we cannot compare
other tuning procedures to Pellegrini et al’s analysis.
Hooker [64] highlights the failings of theoretical analysis of algorithm perfor-
mance on several fronts.
• Not practical. The results do not usually tell us how an algorithm will per-
form on practical problems.
• Not representative. Complexity results are asymptotic or apply to a worst
case that seldom occurs. Worst case analyses, by definition, do not give an
indication of how a heuristic will perform in more representative scenarios
[101]. Average case results presuppose a probability distribution over ran-
domly generated problems that is typically unreflective of reality.
• Too simplified. Results are usually obtained for the simplest kinds of algo-
rithms and so do not apply to the complex algorithms used in practice.
All the analyses mentioned demonstrate some or all of Hooker’s failings. In gen-
eral, results are still too premature to recommend parameter settings for an ACO
algorithm when presented with a given problem instance and the complete heuris-
tic compromise. Analytical approaches are not ready to address the parameter
tuning problem.
4.3.2 Automated Approaches
Self-tuning
Randall [100] has examined how ACS (Section 2.4.6 on page 42) can use its own
mechanisms to tune its parameters at the same time as it is solving TSP and QAP
problems. Four tuning parameters were examined, β, local and global pheromone
update factors ρlocal and ρglobal and the exploration/exploitation threshold q0. The
number of ants m was arbitrarily fixed at 10. Each ant maintained its own pa-
rameter values. A separate pheromone matrix was used to learn new parameter
values. The self-tuning test ACS was compared to a control ACS with fixed param-
eter values taken from another author’s implementation from 7 years previously
on different problem instances [46]. Twelve instances from the TSPLIB (sizes 48 to
442) and QAPLIB (sizes 12 to 64) were used for comparison of the test and con-
trol. Experiments on each instance were repeated 10 times with different random
seeds and were halted after 3000 iterations. The choice of number of replicates
was not justified. Only a single fixed iteration stopping criterion was examined.
Randall claims that this number of iterations ‘should give the ACS solver sufficient
means to adequately explore the search space of these problems.’ [46, p. 378].
No evidence was given to support this claim. Furthermore, because problem size varies, a fixed iteration count gives a less adequate exploration of the search space of larger instances. Two responses were measured: the percentage relative error of
the solution and the CPU time until the best solution was found. Note that there
is some disagreement over the use of this measure as it does not account for the
3000 iterations that were actually run (Section 3.10.1 on page 68). Responses were
listed in a table with a row for each problem instance. No attempt was made to
qualify the significance of differences between control and test responses with a
statistical test.
Although it is difficult to interpret the results in these circumstances, it seems
that the parameter tuning strategy had little practically significant effect on the
quality of solutions found and therefore is not better than a set of parameter val-
ues recommended by other authors in different circumstances. Part of the diffi-
culty in detecting differences between test and control is that many of the problem
instances were so small as to be solved relatively easily in the set-up phase of ACS,
when a nearest neighbour heuristic is applied to the TSP graph (Section 2.4).
Although the self-tuning approach is intuitively appealing, it has two major defi-
ciencies. Firstly, it offers no understanding of the relationship between parameters
and performance—the algorithm tunes itself and it is hoped that performance im-
proves. An understanding of this relationship necessitates building some model
(analytical or empirical) of the relationship. Of course, the particular application
scenario will determine whether modelling this relationship is advantageous (see
Chapter 1). The second deficiency is related to the first. Without a model, there
is no understanding of the relative importance of tuning parameters. This runs
the risk of wasted resources tuning parameters that actually have no effect on
performance.
Heuristic Tuning
Botee and Bonabeau [19] investigated the use of a simple Genetic Algorithm to
tune 12 parameters of a modified version of the Ant Colony System algorithm on
two problems from TSPLIB, Oliver30 and Eil51. The parameters they evolved are
summarised in Table 4.1 on the next page. The ACS equations were modified in
several ways to give the genetic algorithm more flexibility. The trail evaporation
parameter ρ, which was the same for local and global pheromone updates in the
original ACS, was separated into ρlocal and ρglobal for the local and global phe-
romone update equations respectively. A new numerator Q and a new exponent γ
were introduced into the pheromone deposit equation (Equation (2.13) on page 43):

$$\tau_{ij} = (1 - \rho)\,\tau_{ij} + \rho\,\frac{Q}{(C_{\text{chosen ant}})^{\gamma}} \qquad (4.1)$$
The chosen ant was always the best so far ant. The ACS implementation was
augmented with 2-opt local search. The number of repetitions was of the form
σ = a · n^b where a and b determine how σ scales with problem size n. The original
trail value was set to a small value rather than according to the nearest neighbour
approach advocated in previous literature [46]. The candidate list length was fixed
at 1.
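For clarity, the modified global update of Equation (4.1) amounts to a small change to the deposit term. The sketch below is our illustration under assumed names (tau, bestTour, bestCost and so on); it is not Botee and Bonabeau's implementation.

```java
/**
 * Sketch of the modified global pheromone deposit of Equation (4.1).
 * All names (tau, bestTour, bestCost, ...) are assumptions for illustration;
 * the chosen ant is the best-so-far ant, as in Botee and Bonabeau's setup.
 */
static void globalUpdate(double[][] tau, int[] bestTour, double bestCost,
                         double rhoGlobal, double q, double gamma) {
    double deposit = q / Math.pow(bestCost, gamma);   // Q / (C_best)^gamma
    for (int i = 0; i < bestTour.length; i++) {
        int from = bestTour[i];
        int to = bestTour[(i + 1) % bestTour.length]; // the tour is a cycle
        tau[from][to] = (1.0 - rhoGlobal) * tau[from][to] + rhoGlobal * deposit;
        tau[to][from] = tau[from][to];                // symmetric TSP
    }
}
```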
Each colony, characterised by the given parameters, was treated as an indi-
vidual. The population of 40 colonies was randomly generated and run for 100
generations. The GA found a set of parameter values that always found the opti-
mal solution to Oliver30 in fewer ant cycles than ACS. The solution found for Eil51
was comparable to the best known solution.
The modified algorithm, tuned by the GA, found an optimal solution to Oliver30
after 4928 ant cycles (14 ants for 352 iterations averaged over 30 repetitions). The
original algorithm found the optimal solution in 8300 ant cycles (10 ants for 830
iterations). The results for Eil51 are not comparable to the original paper [45]
as this paper used a different but similar problem Eil50. The evolved parameter
values are summarised in Table 4.1 on the next page.
While the reported results are interesting, the general conclusions we can draw
from them are limited. Firstly, the experiments introduced 4 new ACS parameters
at once—the γ exponent and Q numerator in the global pheromone update equa-
tion, and the separation of the pheromone decay parameter into local and global
versions. The use of σ repetitions of a local 2-opt search procedure was also intro-
duced and the candidate list length was removed by fixing it at a value of 1. It is therefore impossible to tell whether the improvements in ant cycles that the
authors report are due to the genetic algorithm’s tuning or the introduced param-
eters or the removed parameter. These factors are confounded (Section A.1.6 on
page 212). The authors then carried all the parameter values from the Oliver30 problem instance over to the Eil51 instance, except for the number of ants, which was changed from 14 to 25. Neither the use of the same parameters nor the arbitrary change in the number of ants was justified. Secondly, while the performance improvement of
40% on Oliver30 seems very large, only two simple problem instances were tested
and the differences in performance between both algorithms were not tested for
Symbol    Parameter                              Value
m         Number of ants                         14
q0        Exploration/exploitation threshold     0.38
α         Influence of pheromone trails          0.37
β         Influence of heuristic                 6.68
ρlocal    Local pheromone decay                  0.30
ρglobal   Global pheromone deposition            0.31
Q         Global pheromone update term           78.04
γ         Global pheromone update term           0.67
τ0        Initial pheromone levels               0.41
a         Local search term                      5
b         Local search term                      0.97

Table 4.1: Evolved parameter values for ACS applied to Oliver30 and Eil51. Results are from Botee and Bonabeau [19, p. 154].
statistical significance. A 40% improvement on such small problems is probably of
little practical significance.
Overall, the approach of tuning ACS with the GA has some other weak points.
The GA is itself a heuristic and so probably needs its own tuning. We have seen that
this is best done with a DOE approach (Section 4.2 on page 79). The authors do not
specify where their GA parameter settings such as population size come from. This
introduces another parameter tuning problem on top of the ACS parameter tuning
problem. Their methodology does not incorporate any screening of parameters.
The GA therefore expends time tuning parameters that may not have any effect on
ACS performance.
Birattari [12] uses algorithms derived from a machine learning technique known
as racing [78] to incrementally tune the parameters of several metaheuristics. Tun-
ing is achieved with a fixed time constraint where the goal is to find the best config-
uration of an algorithm within this time. While the dual problem of finding a given
threshold quality in as short a time as possible is acknowledged, the author does
not pursue the idea of a simultaneous bi-objective optimisation of both time and
quality. Solution times were subsequently investigated by others [38]. Compar-
isons are made between 4 types of racing algorithm and a baseline algorithm that
uses a brute force approach to tuning. These racing algorithms were used to tune
Iterated Local Search for the Quadratic Assignment Problem and Max-Min Ant Sys-
tem for the Travelling Salesperson Problem. Experiments were run on Dorigo and
Stutzle’s code [12, p. 120], the same original source code on which this thesis is
based.
4.3.3 Empirical Approaches
One Factor at a Time
Even if the previous self-tuning (Section 4.3.1 on page 83) and heuristic tuning
(Section 4.3.2 on page 85) approaches were successful, they are of limited use
when attempting to understand the all-important relationship between tuning pa-
rameters, problem characteristics and performance. Understanding this relation-
ship requires a model. Building a model involves sampling various points in the
space of parameter settings and problem instances (the design space) and then
measuring performance at those points. When there are many parameters or prob-
lem characteristics, the researcher must confront a vast high-dimensional design
space. One way to tackle this is to use a One-Factor-At-a-Time (OFAT) approach.
OFAT involves fixing the values of all but one of the tuning parameters. The re-
maining parameter is varied until performance is maximised. Another parameter
is chosen to be varied and all other parameter values are fixed. This process con-
tinues one factor at a time until all parameters have been tuned. While OFAT may
occasionally be quick, there are limitations to the conclusions that one may draw
from an OFAT analysis (Section 2.5 on page 47). However, this approach occurs in the literature and so an illustrative case is reviewed here.
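To make the procedure concrete before the illustrative case, here is a generic OFAT sketch. The evaluate function stands in for running the heuristic at the given settings; all names are illustrative assumptions. Note that, by construction, the loop never examines interactions between parameters.

```java
/**
 * Generic One-Factor-At-a-Time sketch. 'evaluate' stands in for running the
 * heuristic at the given settings and returning a performance measure
 * (higher is better here). Candidate levels per parameter are assumed given.
 */
static double[] ofat(double[] start, double[][] levels,
                     java.util.function.ToDoubleFunction<double[]> evaluate) {
    double[] settings = start.clone();
    for (int p = 0; p < settings.length; p++) {       // one factor at a time
        double bestLevel = settings[p];
        double bestScore = evaluate.applyAsDouble(settings);
        for (double level : levels[p]) {
            settings[p] = level;
            double score = evaluate.applyAsDouble(settings);
            if (score > bestScore) { bestScore = score; bestLevel = level; }
        }
        settings[p] = bestLevel;   // fix this parameter before moving on
    }
    return settings;               // interactions between factors are never seen
}
```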
Stutzle studied the three modifications to Ant System introduced for Max-Min
Ant System [118, p. 18] with an OFAT approach. Default parameter values were
β = 2, α = 1, m = n, ρ = 0.98 and candidate list lengths were 20. ρ was varied
between 0.7 and 0.99 for two small instances from TSPLIB, KroA100 and d198.
The response measured was quality of solution. Stutzle’s recommendation was a
low value of ρ when a low number of tour constructions are performed and a high
value of ρ when a high number of tour constructions are performed.
The trade-off of initialisation to τmin or τmax was also examined along with the
use of the global or solution best ant for pheromone update. An informal examina-
tion of a table of the differences between the trail initialisations showed a negligible
practical difference in solution quality (0.9% maximum) on small instances of size
51 to 318. However, only a single problem characteristic, problem size, was re-
ported. Interestingly, the difference between trail initialisation methods increased
with problem size but this trend was unfortunately not explored further. A similar
result was obtained for the choice of ant for pheromone update. MMAS was then
compared to Ant System [44], Elitist Ant System [44] and Rank-based Ant System
[24] where the parameter settings for these algorithms were listed but no motivation for the choice of these parameter settings was given. Comparisons were made on
three small instances of size 51, 100 and 198 with a fixed tours stopping criterion.
The results were listed in absolute terms without a statistical test for significance.
Expressing that results table in terms of relative solution quality, we see that although MMAS did indeed find the best solutions, the difference from the next best solution, provided by ACS, was only 0.6%. It is claimed that since MMAS out-
performs ACS, and ACS was demonstrated to outperform other nature-inspired
algorithms [46], MMAS is therefore competitive with other algorithms. However,
this claim ignores the resources used in tuning these algorithms before their per-
formance was measured.
The investigation of the benefits of several variants of MMAS with local search
could have been done differently. Firstly, some of the parameters were inexplica-
bly changed. Specifically, the number of ants was now fixed at m = 25 and the
pheromone decay held constant at ρ = 0.8. Solution CPU time was reported but
without any accompanying benchmarking. The instance sizes were varied from
198 to 1577.
Design Of Experiments
The Design of Experiments (DOE) approach is preferable to OFAT. Surprisingly,
before the publication of results from this thesis [104, 106, 109, 105, 107], DOE had been almost completely absent from the ACO literature.
Silva and Ramalho [114] give a small summary of the use of DOE techniques
in ACO. However, their categorisation of the techniques is unusual. They include
One-Factor-At-a-Time analysis as a ‘simple’ type of DOE and the general category
of ‘data analysis’ is seen as separate from DOE. Only one reference [52] is listed
as applying DOE but a reading of the reference shows that it does not actually
use DOE. The authors then illustrate the use of a full 2^k factorial with 7 factors
and 5 replicates on a single instance of the Sequential Ordering Problem. A single
solution quality response that is independent of CPU time is measured. Normal
plots and residual plots are used to check model quality. The authors then use
what they term the ‘observation method’ to recommend tuning parameter values.
It is not clear what this method is. Non-integer values of α = 0.25 and β = 1.5 are
recommended. This recommendation would actually lead to extremely high CPU
times because of non-integer exponentiation in the ant decisions (Equation (2.2) on page 39). The authors did not measure the CPU response and so would have been
unaware of this problem. This dramatically supports the argument for recording
CPU time, regardless of the focus of the experiment (Section 3.10.1 on page 68).
Gaertner and Clark [50] attempted to find optimal parameter settings for three
parameters of a single ACO heuristic, Ant Colony System, using a full factorial de-
sign. While there were many flaws in the execution and analysis of their research,
we include it in this section because of its use of a factorial design. Although
the authors identified 6 tuning parameters, α, β, ρ, q0, m and Q, they immedi-
ately argued that all but 3 of these could be omitted from consideration. Firstly,
they claimed that it is sufficient to fix α and only vary β. They claimed that Q is
a constant despite listing it as a parameter. Finally, they claimed that m could
‘reasonably’ be set to the number of cities in the problem. This left only three
parameters that were actually considered, β, ρ and q0. We know from our review
in Section 2.4.9 on page 45 that the number of tuning parameters is actually far
greater for ACS. The authors then partitioned the three parameters β, ρ and q0
into 14, 9 and 11 values respectively. No reasoning was given for this granularity
of partitioning or why the number of partitions varied between parameters. Each
‘treatment’ was run 10 times with a 1000-iteration or optimum-found stopping cri-
terion on a single 30 city instance, Oliver30. This resulted in 13,860 experiment
runs that took the authors several weeks to execute on a dual 2.2GHz proces-
sor with 2 GB RAM. While the excessive running time may have been due to poor
implementation, the approach was nonetheless incredibly inefficient—a response
surface design (Section A.3.4 on page 218) for 3 factors with a full factorial and
10 replicates would have required approximately 150 runs, 1% of that used by the
authors. It was also expensive although we cannot relate their figures to present
day values because no benchmarking was reported. CPU time was not reported
for the various parameter settings and so the heuristic compromise was ignored.
The authors also make an unfair comparison with other authors’ work, claiming
that they find the optimal solution faster on average without local search. This
claim fails to acknowledge the authors’ significant effort (13,860 runs over several
weeks) to find their conclusion and the use of prior knowledge of an optimum to
stop an experiment once the optimum was found. Section 3.13 on page 73 dis-
cussed the problem with using an optimum as a stopping criterion. The authors
claim that their parameter setting is robust because their empirical search found
the same parameter setting for 3 values of relative error, 0%, 1% and 5%. This is
not a robustness analysis. The authors made no attempt to see how the response
varies when the input parameter is perturbed.
This thesis presents a far more rigorous experimental approach that draws con-
tradictory conclusions to those of Gaertner and Clark, is an order of magnitude
more efficient in terms of experiment runs and deals with all ACS tuning parame-
ters across a space of problem instances rather than a single instance.
4.4 Chapter Summary
This chapter covered the following topics.
• Problem Difficulty. It is critically important to determine the characteris-
tics that affect the difficulty of a problem presented to a heuristic. Without
this, it is impossible to generalise the relationship between instances, tuning
parameters and heuristic performance.
– Some authors have tried manipulating instances and examining the re-
sulting effect on problem difficulty. Others have tried to evolve difficult
instances and then determine why those instances were difficult.
– Some authors have hypothesised that a particular problem characteristic
made an instance difficult and then generated many instances with dif-
ferent levels of that hypothesised characteristic. This approach is prefer-
able since it fits within the more scientific approach of hypothesise and
test. Such an approach has not yet been applied to ACO heuristics for
the TSP.
– Results with exact algorithms for the TSP suggest that the standard de-
viation of edge lengths in the TSP instance has an effect on the difficulty
of the instance.
• Parameter Tuning of other heuristics. Other heuristics have been tuned
using basic Design Of Experiments techniques such as factorial and frac-
tional factorial designs.
• Parameter tuning of ACO heuristics.
– Approaches to tuning ACO can be categorised as either (1) Analytical,
(2) Automated or (3) Empirical. Analytical approaches attempt to prove
properties about the parameter-problem-performance relationship using
mathematical proof. Automated approaches use an algorithm to auto-
matically tune the heuristic. This algorithm may be the heuristic itself.
Empirical approaches gather data from actual algorithm runs and at-
tempt to build a model to reason about the data and draw conclusions.
– Automated Tuning with another heuristic, such as a genetic algorithm,
is inefficient for two reasons. Firstly, there is no ability to screen out
parameters that are not affecting performance and so effort is wasted on
tuning potentially ineffective parameters. Secondly, the tuning heuris-
tic does not build up a model of the relationship between parameters,
problem instances and performance of the tuned heuristic. This severely
limits what can be learned from running the tuning procedure.
– Automated Self-tuning involves applying the heuristic’s own optimisa-
tion mechanisms to the heuristic’s tuning parameters. This is intuitively
a sensible approach to parameter tuning. However, it suffers from the
same lack of screening and modelling as the automated tuning approach.
Examples from the literature have been poorly executed experimentally
and so we cannot determine whether this is a viable approach to param-
eter tuning.
– Researchers often use a One-Factor-At-a-Time (OFAT) approach. While
this does give useful insights into the importance and effects of various
parameter settings, the OFAT approach has many recognised deficiencies
for parameter tuning.
– The Design of Experiments approach has been used on two occasions to
recommend ACS parameter settings. There were several flaws with the
execution of the DOE methods however. Furthermore, the authors did
not examine CPU time and so made recommendations of setting exponent
tuning parameters to non-integer values.
This concludes the first part of the thesis. Chapter 2 on page 29 gave a back-
ground on combinatorial optimisation and the Travelling Salesperson Problem.
Metaheuristics were introduced as an approach to finding approximate solutions
to these difficult and important problems and the most important Ant Colony Op-
timisation (ACO) heuristics were described in detail. A review of related work in
Chapter 3 began by collecting and organising the many experiment design and
analysis issues that arise in empirical research with metaheuristics. Several ap-
proaches to determining the problem characteristics that affect performance and
to parameter tuning were reviewed in this chapter. However, in light of the con-
cerns raised in Chapter 3, the vast majority of these approaches and their execu-
tion have been deficient in several ways.
The next part of this thesis will address these deficiencies, comprehensively de-
scribing an adapted DOE approach for addressing the parameter tuning problem.
5 Experimental testbed
Before detailing the adapted DOE methodology that this thesis introduces, we must
first address the thesis’ ‘apparatus’. This chapter covers all issues related to the experimental testbed. The experimental testbed in metaheuristic research comprises
three items. These are:
1. the code for the problem generator that creates the test instances that the
algorithms then solve,
2. the code for the metaheuristics on which the experimenter is conducting
research, and
3. the machines that run all experiments in the research.
Of course, either the machines or the problem generators could be the subject of
the research (Section 3.4 on page 58). One often asks whether a problem generator
is appropriate for an algorithm. In an industrial context, one may be concerned
about the machine characteristics that best suit the algorithms and problems. This
chapter deals with these three experimental testbed items in order.
5.1 Problem generator
We have already discussed the difficulty in creating a problem generator and the
arguments for and against the use of problem generators (Section 3.12 on page 71).
Problem generators are a large area of research because of these difficulties and so
are beyond the scope of this research. It is therefore desirable to choose a problem
generator that is acceptable for other researchers. Preferably, the generator will
already have been subjected to extensive use so that any peculiarities with the
generator are more likely to have become known. We have chosen to use a problem
generator provided with the 8th DIMACS Implementation Challenge: The Travelling
Salesman Problem¹. The DIMACS challenge was a large competition held within
the TSP research community with the stated goals of:
• creating ‘a reproducible picture of the state of the art in the area of TSP
heuristics (their effectiveness, their robustness, their scalability, etc.), so that
future algorithm designers can quickly tell on their own how their approaches
compare with already existing TSP heuristics’ and
• enabling ‘current researchers to compare their codes with each other, in
hopes of identifying the more effective of the recent algorithmic innovations
that have been proposed...’.
One way the DIMACS challenge facilitated these goals was to provide researchers
with problem generators to generate instances on which their codes could be
tested. Problem generators were provided for several of the possible types of TSP
instance (Section 2.2). In particular, the DIMACS challenge provided a generator
called portmgen to generate symmetric instances with a given number of nodes
where edge lengths were chosen from a uniform random distribution.
Other researchers have used instances with edges drawn from a Log-Normal
distribution (Section 3.1 on page 54) so that the standard deviation of edge lengths
could be treated as a factor and controlled in the experimental sense. This was
shown to have an important effect on problem difficulty for an exact algorithm
[26]. This same factor is investigated later in this thesis (Chapter 7). The Log-
Normal distribution is the probability distribution of any random variable whose
logarithm is normally distributed. There is a good introduction to the Log-Normal distribution online². The distribution has the probability density function:
$$f(x; \mu, \sigma) = \frac{e^{-(\ln x - \mu)^2 / 2\sigma^2}}{x\sigma\sqrt{2\pi}} \qquad (5.1)$$
for x > 0 where µ and σ are the mean and standard deviation of the variable’s
logarithm. For our purposes of controlling edge length standard deviation, we note
that relationships can be derived to solve for the Log-Normal parameters µ and σ of
Equation (5.1) given a desired expected mean E(x) and expected variance Var(x) of the resulting distribution.
$$\mu = \ln(E(x)) - \frac{1}{2}\ln\!\left(1 + \frac{Var(x)}{E(x)^2}\right) \qquad (5.2)$$

$$\sigma^2 = \ln\!\left(1 + \frac{Var(x)}{E(x)^2}\right) \qquad (5.3)$$
For example, if we want a Log-Normal distribution with a certain standard de-
viation and certain mean, these equations will tell us what values of parameters µ
and σ to use when creating our distribution.
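A minimal sketch of this calculation in Java, the language of Jportmgen, follows. The class and method names are ours, not the generator’s, and java.util.Random is used here for brevity only; it is not the generator’s actual random number source.

```java
import java.util.Random;

/** Solves Equations (5.2) and (5.3) and draws Log-Normal edge lengths. */
public final class LogNormalEdges {

    final double mu;     // mean of the underlying normal (Eq. 5.2)
    final double sigma;  // standard deviation of the underlying normal (Eq. 5.3)
    final Random rng;    // for brevity; not the generator's actual source

    /** desiredMean = E(x); desiredStDev = sqrt(Var(x)) of the edge lengths. */
    LogNormalEdges(double desiredMean, double desiredStDev, long seed) {
        double ratio = (desiredStDev * desiredStDev) / (desiredMean * desiredMean);
        this.sigma = Math.sqrt(Math.log(1.0 + ratio));                  // Eq. (5.3)
        this.mu = Math.log(desiredMean) - 0.5 * Math.log(1.0 + ratio);  // Eq. (5.2)
        this.rng = new Random(seed);
    }

    /** exp(mu + sigma * z) is Log-Normal when z is a standard normal draw. */
    double nextEdgeLength() {
        return Math.exp(mu + sigma * rng.nextGaussian());
    }

    public static void main(String[] args) {
        LogNormalEdges edges = new LogNormalEdges(100.0, 30.0, 42L);
        for (int i = 0; i < 5; i++) System.out.println(edges.nextEdgeLength());
    }
}
```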
¹ http://www.research.att.com/∼dsj/chtsp/
² http://en.wikipedia.org/w/index.php?title=Log-normal distribution&oldid=136064053
For this research, the DIMACS portmgen generator [58] was ported to Java
and refactored into an Object-Oriented implementation we call Jportmgen. The
generator’s behaviour was preserved during the porting using unit tests. The DI-
MACS portmgen and the thesis’ Jportmgen produced identical instances for a given
pseudo-random generator seed. The code was then modified such that chosen edge
lengths exhibited a Log-Normal distribution with a desired mean and standard de-
viation, as per Cheeseman et al [26]. This new implementation therefore allows
the experimenter to control problem size, edge length mean and edge length stan-
dard deviation while remaining true to the DIMACS generator accepted for the TSP
community’s largest research project. Different distributions can be plugged into
the generator, including the original DIMACS uniform distribution.
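The plug-in design might be captured by an interface along the following lines. This is an illustrative sketch; the names are ours and not Jportmgen’s actual API.

```java
/** Illustrative plug-in point for edge-length distributions; names are ours. */
interface EdgeLengthDistribution {
    double nextEdgeLength();
}

/** A uniform implementation reproducing the original DIMACS behaviour could
 *  then sit alongside the Log-Normal implementation sketched earlier. */
final class UniformEdges implements EdgeLengthDistribution {
    private final java.util.Random rng;
    private final double maxLength;

    UniformEdges(double maxLength, long seed) {
        this.maxLength = maxLength;
        this.rng = new java.util.Random(seed);
    }

    @Override
    public double nextEdgeLength() {
        return rng.nextDouble() * maxLength;   // uniform on [0, maxLength)
    }
}
```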
Although Cheeseman et al [26] did not state their motivation for using the
Log-Normal distribution, a plot of the relative frequencies of the normalised edge
lengths of Euclidean instances from the online benchmark library, TSPLIB [102],
shows that the majority have a Log-Normal shape (Appendix B). Figure 5.1 shows
relative frequencies of the normalised edge lengths of several instances created by
Jportmgen.
Figure 5.1: Relative frequencies of normalised edge lengths for several TSP instances of the same size and same mean cost. Instances are distinguished by their standard deviation (Mean 100, StDev 70; Mean 100, StDev 30; Mean 100, StDev 10). All instances demonstrate the characteristic Log-Normal shape. [Plot: Normalised Edge Lengths against Normalised Relative Frequency.]
Unless otherwise stated, all future references to the problem generator will refer
to the Jportmgen generator. All generated instances in the thesis are created with
this Log-Normal version of the DIMACS portmgen.
5.2 Algorithm implementation
The next important aspect of the testbed is the algorithm implementation. Repro-
ducible algorithm implementations are both extremely important and yet difficult
to achieve (Section 3.8 on page 65). The best way to overcome these issues is to
provide the source code on which all experiments are conducted. Furthermore, in
the interest of advancing the field, research and its results should be both consis-
tent with previous research and extensible in future research by others. Meeting
these basic demands of a scientific field requires the community’s adoption of a
standard implementation of its ACO algorithms. The closest thing to a standard
implementation for ACO algorithms is the C code written by Stutzle and Dorigo for
their definitive book on the field [47]. This was made available to the community on
the world wide web³ and is recommended for experiments with ACO [42, p. 275].
For the reasons mentioned above (reproducibility, relevance to previous re-
search and extensibility in future research) we made the decision to use the ACOTSP
code of Stutzle and Dorigo. ACOTSP was a procedural C implementation. We
translated ACOTSP into a Java implementation that we will refer to henceforth as
JACOTSP. The Java implementation now benefits from the usual Object-Oriented
advantages⁴, in particular its extensibility. The class hierarchy in JACOTSP en-
sures that algorithm subclasses share the same data structures and differ only
in the implementation details of individual methods. The Template design pat-
tern [72] proves particularly useful in this regard. The Delegator pattern allows
different termination conditions, for example, to be ‘plugged in’ to the algorithms
without disrupting the rest of their structure. JACOTSP runs on symmetric TSP
problems using 6 ACO algorithms, namely Ant System, Ant Colony System, Rank-
based Ant System, Elitist Ant System, Best-Worst Ant System and Max-Min Ant
System. One may question the impact of Java on computation times when com-
pared to the original ACOTSP C implementation. While the early releases of Java
were indeed slow, subsequent releases addressed this issue in the context of sci-
entific computing [23]. Java is now an acceptable choice for scientific computing
according to performance benchmarks⁵ and is used for high performance scientific computing in laboratories such as CERN⁶. To focus too closely on relative running
times would be to miss the aim of the thesis which is to demonstrate effective tun-
ing of a heuristic resulting in improvements in solution quality and solution time.
As discussed in Chapter 3, these solution times will always be dependent on the
particular implementation details, regardless of the programming language used.
The random number generator used in ACOTSP and ported to JACOTSP is the
Minimal Random Number Generator of Park and Miller [92]. Its implementation
and a discussion of its merits can be found in the literature [99, p. 278-279]. The
reimplementation of the random number generator, in particular, ensures that JA-
COTSP produces the same behaviour (and ultimately the same solutions) as its
ACOTSP predecessor. This backwards compatibility was ensured with unit tests
that compare the output files of ACOTSP with those of JACOTSP for a variety of
input parameters. Such compatibility does not make sense when new tuning pa-
rameters are identified or when aspects of the algorithm’s internal design are pa-
rameterised and varied. Breaking such compatibility is inevitable if the ACO algo-
rithms are to evolve. Given that the random number generator is well-established
and that backwards compatibility was important, we did not investigate alternative
generators as is sometimes advisable (Section 3.11 on page 71).
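For reference, the Park and Miller ‘minimal standard’ recurrence is x_{n+1} = (16807 · x_n) mod (2^31 − 1). A Java sketch of the recurrence follows; the JACOTSP port also reproduces ACOTSP’s exact state handling and conversion to doubles, which is omitted here.

```java
/**
 * The Park-Miller 'minimal standard' recurrence. This sketches the recurrence
 * only; the ACOTSP/JACOTSP port also reproduces the original code's state
 * handling and conversion to doubles, which is omitted here.
 */
final class ParkMiller {
    private static final long A = 16807L;
    private static final long M = 2147483647L;  // 2^31 - 1, a Mersenne prime
    private long state;

    ParkMiller(long seed) {
        if (seed <= 0 || seed >= M) {
            throw new IllegalArgumentException("seed must be in (0, M)");
        }
        this.state = seed;
    }

    /** Next state in (0, M); divide by (double) M for a value in (0, 1). */
    long next() {
        state = (A * state) % M;  // 64-bit arithmetic avoids Schrage's trick
        return state;
    }
}
```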
³ ACOTSP, available at http://iridia.ulb.ac.be/∼mdorigo/ACO/aco-code/public-software.html
⁴ In general, the use of an object model leads to systems with the following attributes of well-structured complex systems: abstraction, encapsulation, modularity and hierarchy [18].
⁵ http://shootout.alioth.debian.org/
⁶ http://dsd.lbl.gov/˜hoschek/colt/
The original ACOTSP contained a timer that reported the CPU time for which
ACOTSP was running. A slightly different approach was taken in JACOTSP be-
cause accessing CPU times in Java was problematic in the Java version in which
the JACOTSP project was started. Newer versions have since overcome this. For
this reason, the decision was taken to use a timer supplied with the Colt project7.
Colt is a set of high performance scientific computing libraries used at the CERN
labs. JACOTSP therefore measures elapsed time rather than CPU time. The timer
was paused during the calculation and output of data that is not essential to the
functioning of the JACOTSP ant algorithms. For example, branching factor cal-
culation is not timed for ACS but is timed for MMAS because it is used in trail
reinitialisation. Any concerns over the interruption of the timer by other operating
system processes are easily allayed by randomising experiment running orders.
Unless otherwise stated, the times reported in this thesis’ case studies are elapsed
times rather than CPU times.
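The pausing behaviour described above can be captured by a simple pausable stopwatch. The sketch below illustrates the idea only; it is not the Colt timer’s actual API.

```java
/** Sketch of a pausable elapsed-time stopwatch; not the Colt timer's API. */
final class PausableTimer {
    private long accumulatedNanos = 0L;
    private long startedAt = -1L;   // -1 means currently paused

    void start() {
        if (startedAt < 0) startedAt = System.nanoTime();
    }

    /** Called around non-essential work, e.g. writing result files. */
    void pause() {
        if (startedAt >= 0) {
            accumulatedNanos += System.nanoTime() - startedAt;
            startedAt = -1L;
        }
    }

    double elapsedSeconds() {
        long running = (startedAt >= 0) ? System.nanoTime() - startedAt : 0L;
        return (accumulatedNanos + running) / 1e9;
    }
}
```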
5.2.1 Profiling
Exponentiation is a mathematical operation, written a^n, involving two numbers,
the base a and the exponent n. When n is a whole number (an integer), the expo-
nentiation operation corresponds to repeated multiplication. However, when the
exponent is a real number (say 1.73) a different approach to calculation is re-
quired and this approach is computationally very expensive. In Java, the language
of the JACOTSP implementation, the natural logarithm method is used for real ex-
ponents. The details of this method are beyond the scope of this discussion. A
simple profiling of the JACOTSP code showed that real exponent values caused
the vast majority of computational effort to be expended on exponentiation. Re-
call from the design of ACO (Section 2.4.3 on page 37) that two exponentiations
are involved in every ant movement decision (see Equation (2.2) on page 39, for example).
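The cost difference is easy to observe directly. The following micro-benchmark sketch contrasts Math.pow with a real exponent against plain repeated multiplication for an integer exponent; absolute timings will vary by JVM and machine, but the gap is typically large.

```java
/** Micro-benchmark sketch: real-exponent pow versus integer repeated multiply. */
public final class ExponentCost {
    public static void main(String[] args) {
        final int n = 10_000_000;
        double sink = 0.0;   // accumulate results so the JIT cannot drop the loops

        long t0 = System.nanoTime();
        for (int i = 1; i <= n; i++) sink += Math.pow(1.0 + i * 1e-9, 1.73);
        long realExponentNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 1; i <= n; i++) {
            double a = 1.0 + i * 1e-9;
            sink += a * a;   // integer exponent 2 as repeated multiplication
        }
        long integerExponentNanos = System.nanoTime() - t0;

        System.out.printf("real exponent: %d ms, integer exponent: %d ms (sink=%f)%n",
                realExponentNanos / 1_000_000, integerExponentNanos / 1_000_000, sink);
    }
}
```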
The implication of this and the common knowledge of the expense of exponenti-
ation is that tuning parameters that are exponents should be limited to integer values. However, we have seen at least one case in the literature [114] where
authors looking only at solution quality and not recording CPU time actually rec-
ommend non-integer values of these exponents (Section 4.3.3 on page 88). Any
gain in quality from using a non-integer α and β will most likely be offset by the
huge deterioration in solution time. This is further evidence to support the recom-
mendation of measuring CPU time (Section 3.10.1 on page 68) and for this thesis’
emphasis on the heuristic compromise.
⁷ http://dsd.lbl.gov/˜hoschek/colt/index.html
5.3 Benchmarking the machines
The remaining aspect of our experimental testbed is the physical machines on
which all experiments are conducted. Clearly, all machines can differ widely.
There are differences in processor speeds, memory sizes, chip types, operating
systems, operating system versions and, in the case of Java, different versions of
different virtual machines. Even if machines are identical in terms of all of these
aspects, they may still differ in terms of running background processes such as
virus checkers. This is the unfortunate reality of the majority of computational re-
search environments. Furthermore, such differences will almost certainly occur in
the computational resources of other researchers who attempt to reproduce or ex-
tend previous work of others. Ultimately, such differences mean that experiments
that are identical in terms of the two previous testbed issues of algorithm code
and problem instances will still differ when run on supposedly identical machines.
These differences necessitate the benchmarking of the experimental testbed (Sec-
tion 3.9 on page 67).
Reproducibility of results (Section 3.8 on page 65) is a second important mo-
tivation for benchmarking. Other researchers can reproduce the benchmarking
process on their own experimental machines. They can thus better interpret the
CPU times reported in this research by scaling them in relation to their own bench-
marking results. This mitigates the decline in relevance of reported CPU times with
inevitable improvements in technology. It is hoped that the benchmarking advo-
cated in this thesis becomes commonplace in reported ACO and metaheuristics
research.
5.3.1 Benchmarking method
The clear and simple benchmarking procedure of the DIMACS [58] challenge is
applied here and its results described below.
1. A set of TSP instances is generated with one of the DIMACS problem gener-
ators. These instances range in size from one thousand nodes to one million
nodes.
2. The DIMACS greedy search, a deterministic algorithm, is applied to each in-
stance for a given number of repetitions and the total time for all repetitions
is recorded. The number of repetitions varies inversely with the size of the
instance. For example, the instance of size 1 million is solved only once by
the greedy search while the instance of size 1 thousand is solved 1 thousand
times.
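In the sketch below, greedySolve stands in for the DIMACS greedy search and the names are ours. The repetition counts reported in Figure 5.3 vary inversely with instance size, consistent with roughly 10^6/size; this is an observed pattern in the data, not a rule quoted from DIMACS.

```java
/**
 * Sketch of the DIMACS-style benchmark schedule. 'greedySolve' stands in for
 * the deterministic DIMACS greedy search; repetition counts are those of
 * Figure 5.3 and vary inversely with instance size (roughly 10^6 / size).
 */
static void benchmark(String[] instances, int[] reps,
                      java.util.function.Consumer<String> greedySolve) {
    for (int i = 0; i < instances.length; i++) {
        long t0 = System.nanoTime();
        for (int r = 0; r < reps[i]; r++) greedySolve.accept(instances[i]);
        System.out.printf("%s: %d repetitions, %.2f s total%n",
                instances[i], reps[i], (System.nanoTime() - t0) / 1e9);
    }
}

// Example schedule taken from Figure 5.3:
// benchmark(new String[] {"E1k.0", "E3k.0", "E10k.0", "E31k.0", "E100k.0", "E316k.0", "E1M.0"},
//           new int[]    {1000, 316, 100, 32, 10, 3, 1}, instance -> { /* run greedy search */ });
```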
5.3.2 Results and discussion
The results of the DIMACS benchmarking of our experimental machines are illus-
trated in Figure 5.2 on the facing page and the corresponding data are presented
in Figure 5.3 on the next page.
Figure 5.2: Results of the DIMACS benchmarking of the experiment testbed. [Bar chart: total time (s) against instance, E1k.0 to E1M.0, with one bar per machine ID: 116, 253, 111, 156, 188, 136.]
Instance   Size      Reps    Time (s) by machine ID
                             116     253     111     156     188     136     96
E1k.0      1000      1000    5.45    4.38    5.25    5.00    3.31    4.81    3.37
E3k.0      3000      316     5.67    4.61    7.61    5.25    3.78    5.23    3.75
E10k.0     10000     100     6.91    7.25    8.81    6.44    4.99    6.64    5.32
E31k.0     31000     32      11.77   16.52   13.41   11.20   9.48    11.00   10.84
E100k.0    100000    10      22.86   26.53   26.77   21.03   10.87   19.85   12.82
E316k.0    316000    3       28.61   32.05   34.50   27.61   12.56   25.86   14.70
E1M.0      1000000   1       39.52   44.80   49.23   38.03   16.55   35.44   19.31

Figure 5.3: Data from the DIMACS benchmarking of the experiment testbed.
The horizontal axis represents the different instances for which the benchmark-
ing was conducted where instances are arranged in order of increasing size. The
vertical axis is the total time in seconds for the benchmarking run. Each bar rep-
resents a different machine, identified by a machine ID. It is evident that every
machine’s benchmark time increases with instance size. The differences between
machines become more pronounced as instance size increases. Machine 188 is generally the fastest.
The benchmarking times indicate that despite the similarity of the specification
of most of the machines, there are still differences in CPU times and these differ-
ences seem to amplify in larger instances. The benchmarking has thus identified a
nuisance factor in the experimental testbed and the need to randomise experiment
runs across the experiment testbed. Efforts to use completely identical machines
for all experiments will still encounter this nuisance factor. Any successful perfor-
mance analysis methodology will have to cope with this reality.
5.4 Chapter summary
This chapter covered the following:
• Decision to use a publicly available problem generator. There are concerns
over the difficulties of developing reliable problem generators. This research
uses a publicly available problem generator that was also used in the DIMACS
challenge, a large research competition within the TSP community.
• Modified generator to control problem characteristic. Problem instance
standard deviation of edge lengths may be an important problem character-
istic affecting problem difficulty. To this end, a modified DIMACS problem
generator draws its edge lengths such that they exhibit a Log-Normal dis-
tribution with a desired mean and standard deviation. The choice of this
distribution is in keeping with previous work by Cheeseman et al. [26].
• OOP implementation of publicly available algorithm source code. There
is a suite of publicly available C code of the main ACO algorithms to accom-
pany the field’s main book [47]. We have reimplemented this code in Java
and refactored it into an extensible Object-Oriented (OO) implementation.
Our Java code continues to reproduce the solutions of the original C code
by Dorigo and Stützle. Results from this research are therefore applicable to
other research that has used their publicly available C code.
• Highlighting the computationally expensive exponentiation calculation. We
saw from our overview of the ACO algorithms (Section 2.4.3 on page 37)
that they all involve a very large number of exponentiation calculations in which
the tuning parameters α and β are the exponents. It is well known that ex-
ponentiation with non-integer exponents is computationally very expensive.
This was highlighted in our profiling of the code but is often missed by re-
searchers who ignore the heuristic compromise and do not record CPU times.
Tuning of the α and β parameters will therefore be restricted to integer values.
• Benchmarked experimental machines. The DIMACS benchmarking ap-
proach was applied to all machines used during the course of this research.
The emphasis of the thesis research is not to compare algorithms in a ‘horse-
race’ study (Section 3.4.4 on page 59). Benchmarking the machines benefits
future research in that CPU times reported in the thesis can be interpreted
and scaled by other researchers using the same accepted benchmarking ap-
proach, regardless of the inevitable differences in their experimental testbed.
6 Methodology
Chapter 3 brought together high-level concerns spanning all aspects of Design
Of Experiments (DOE) with particular emphasis on DOE for metaheuristics. The
reader is referred to Appendix A for a background on DOE. This chapter focuses
on a sequential experimentation methodology that deals with these concerns. The
methodology efficiently takes the experimenter from the initial situation of almost
no knowledge of the metaheuristic’s behaviour to the desired situation of a mod-
elled metaheuristic with recommendations on tuning parameter settings for given
problem characteristics. The methodology is based on a well-established proce-
dure from DOE that this thesis modifies for its application to metaheuristics. It
was first introduced to the ACO community only recently by the author [104]. The
design generation and statistical calculations can be performed with most modern
statistical software packages including SPSS, NCSS PASS, Minitab and Microsoft
Excel. Examples of the methodology’s successful application to the parameter tuning problem are presented in the chapters in Part IV of the thesis. This chapter
begins with a relatively high-level overview of the whole sequential experimenta-
tion methodology before detailing all its stages and decisions.
6.1 Sequential experimentation
Experimentation for process modelling and process improvement is inherently it-
erative. This is no different when the studied process is a metaheuristic. Box
[20] gives some sample questions that often arise after an experiment has been
conducted.
“That factor doesn’t seem to be doing anything. Wouldn’t it have been
better if you had included this other variable?”
“You don’t seem to have varied that factor over a wide enough range.”
“The experiments with high factor A and high factor B seem to give the
best results; it’s a pity you didn’t experiment with these factors at even
higher levels.”
In sequential experimentation, there are six directions in which a subsequent
experiment commonly moves [20]. These depend on the results from the first
experiment.
1. Move to a new location in the design space because the initial results suggest
a trend that is worth pursuing.
2. Stay at the current location in the design space and add further treatments
to the design to resolve ambiguities that may exist in the design. Such ambi-
guities are typically due to effects that cannot be separated from one another
due to the nature of the experiment design. Such ‘entanglement’ of effects is
termed aliasing (Section A.3 on page 214).
3. Rescale the design if it appears that certain variables have not been scaled
over wide enough ranges.
4. Remove or add factors to the experiment.
5. Repeat some runs to better estimate the replication error.
6. Augment the design to assess the curvature of the response. This is particu-
larly important when large two-factor interactions occur between factors.
These questions and the decisions for subsequent experiments are part of a
larger sequential experimentation methodology. This ‘bigger picture’ is illustrated
in Figure 6.1 on the next page.
The main advantage of the sequential experimentation methodology is its effi-
ciency of resources. It is rare that the experimenter begins with a full knowledge
of the metaheuristic (hence the point of experimentation). A revision of experiment
design decisions is therefore inevitable as more is learned about the metaheuris-
tic. The sequential methodology avoids the risky and often unsuccessful approach of
running a large, all-encompassing and expensive experiment up front. Instead, many of
the designs and their existing data are incorporated into subsequent experiments
so that no experimental resources go to waste. Factors are carefully examined
before a decision is made on their inclusion in an experiment. Calculations of statistical
power reveal when a sufficient number of replicates have been gathered.
We now describe each of the stages in the sequential experimentation methodology
of this thesis with reference to modelling and tuning the performance of a
metaheuristic.
6.2 Stage 1a: Determining important problem character-
istics
The first stage in the sequential experimentation approach is to determine all prob-
lem characteristics that affect at least one of the responses of interest. Without
[Flow chart: the inputs are the heuristic tuning parameters and the known and suspected problem characteristics. Stage 1, Screening: estimate main effects and interactions, augment the design with foldover where needed, and perform runs to confirm the model. If curvature is detected, augment the design with axial points or a new design space. Stage 2, Modelling: Response Surface Methods, with runs to confirm the model. Stage 3, Tuning: numerical optimisation. Stage 4, Evaluation: overlay plots for optimal parameter settings, with runs to confirm results.]
Figure 6.1: The sequential experimentation methodology. The methodology covers four main stages: screening, modelling, tuning and evaluation of results.
sufficient problem characteristics, the response surface models from later in the
procedure will not make good predictions of performance on new instances.
6.2.1 Experiment Design
The main difficulty encountered in attempting to experiment with problem instance
characteristics is that of the uniqueness of instances. That is, while several in-
stances may have the same characteristic that is hypothesised to affect the re-
sponse, these instances are nonetheless unique. For example, there is a poten-
tially infinite number of possible instances that all have the same characteristic
of problem size. The uniqueness of instances will therefore cause different values
of the response despite the instances having identical levels of the hypothesised
characteristic. The experimenter’s difficulty is one of separating the effect (if any)
due to the hypothesised characteristic from the unavoidable variability between
unique instances.
A given heuristic encounters instances with different levels of some character-
istic. The experimenter wishes to determine whether there is a significant overall
difference in heuristic performance response for different levels of this character-
istic. The experimenter also wishes to determine whether there is a significant
variability in the response when unique instances have the same level of the prob-
lem characteristic.
There is a well-established experiment design to overcome this difficulty. It is
termed a two-stage nested (or hierarchical) design. Figure 6.2 illustrates the two-
stage nested design schematically.
[Schematic: two levels of the parent problem characteristic; instances 1, 2 and 3 are nested under level 1 and instances 4, 5 and 6 under level 2; each instance has observations y_ij1, y_ij2, ..., y_ijr.]
Figure 6.2: Schematic for the Two-Stage Nested Design with r replicates (adapted from [84]). There are only two levels of the parent problem characteristic factor. Note the instance numbering to emphasise the uniqueness of instances within a given level of the problem characteristic.
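For reference, the statistical model underlying the two-stage nested design can be written in the standard form (following Montgomery [84]):

```latex
y_{ijk} = \mu + \tau_i + \beta_{j(i)} + \epsilon_{(ij)k},
\qquad i = 1,\dots,a; \quad j = 1,\dots,b; \quad k = 1,\dots,r
```

where $\tau_i$ is the fixed effect of level $i$ of the problem characteristic, $\beta_{j(i)}$ is the random effect of instance $j$ nested within level $i$, and $\epsilon_{(ij)k}$ is the replication error. The variability between unique instances discussed above is captured by the variance of $\beta_{j(i)}$.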
Note that this thesis applies the two-stage nested design to the heuristics re-
searcher’s aim of determining whether a problem characteristic merits inclusion in
the sequential experimentation methodology. This design cannot capture possible
interactions between more than one problem characteristic. Capturing such inter-
actions would require a more complicated crossed nested design or the factorial
designs encountered later in the sequential experimentation procedure. The goal
here is to quickly determine whether the hypothesised characteristic should be
included in subsequent experiments in the sequential methodology. This design,
first introduced into ACO research by the author [105, 110], is now receiving wider
attention and theoretical backing in the community [6].
6.2.2 Method
This is an overview of the method for determining whether a problem character-
istic should be included in subsequent stages of the methodology. A case study
illustrates this method with real data in Chapter 7.
1. Response variables. Identify the response(s) to be measured. For experi-
ments with metaheuristics, these must reflect some measure of solution qual-
ity and some measure of solution time.
2. Design factors and factor ranges. Choose the problem characteristic hy-
pothesised to affect the response of interest and the range over which that
factor will be varied. The null hypothesis is that changes in this charac-
teristic cause no significant change in the metaheuristic performance. The
alternative hypothesis is that changes in the characteristic do indeed cause
changes in the performance.
3. Held-constant factors. Hold all tuning parameters and all other problem
characteristics at values that remain fixed for the duration of the experiment.
These values may be values commonly encountered in the literature, values
in the middle of the range of interest or values determined to be of interest by
pilot studies. As noted above, this does not permit an examination of interactions
between the hypothesised characteristic and either the other characteristics
or the metaheuristic’s tuning parameters. These interactions are examined
later in the sequential methodology.
4. Experiment design. Generate a two-stage nested design where the hypothe-
sised characteristic is the parent factor (Factor A) and the unique instances
(Factor B) are nested within a given level of this parent. This design is il-
lustrated schematically in Figure 6.2 on the preceding page. Preferably, there
should be at least three levels of the parent factor so that it can be determined
whether the hypothesised effect of the problem characteristic is linearly re-
lated to the performance metric (Section 3.5 on page 59). We can have as
many instances nested within each parent level as we like. The number of in-
stances must be the same within each level of the parent. A treatment is then
a run of the metaheuristic on an instance within a given level of the problem
characteristic.
5. Replicates and randomise. Replicate each treatment and randomise the run
order. Collect the data in the randomised run order.
6. Analysis. The data can now be analysed with the General Linear Model;
a numerical sketch of this analysis is given after this list. The
statistical technicalities behind this analysis have recently been explained in
the context of metaheuristics research [6]. These translate into the following
settings in most statistical software:
• Factor A is entered into the model as the parent. Factor B is nested
within Factor A.
• Factor B is set as a random factor since the unique instances are ran-
domly generated.
• Factor A is a fixed factor because its levels were chosen by the experi-
menter.
7. Diagnostics. The usual diagnostic plots (Section A.4.2 on page 221) are ex-
amined to verify that the assumptions for the application of ANOVA have not
been violated.
8. Response Transformation. A transformation of the response may be needed
(Section A.4.3 on page 222). If so, the experimenter returns to step 5 and
reanalyses the transformed response.
9. Outliers. If outliers are identified in the data, these are deleted and the
experimenter returns to step 5.
10. Power. The gathered data are analysed to determine the statistical power. If
insufficient power has been reached for the study’s level of significance then
further replicates are added and the experimenter returns to the Replicates
and randomise step.
11. Interpretation. When satisfied with the model diagnostics, the ANOVA table
from the analysis can be interpreted.
12. Visualisation. The box plot is also useful for visualising the practical signifi-
cance of any statistically significant effects.
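To make the Analysis step (step 6) concrete, the following sketch computes the balanced two-stage nested ANOVA directly rather than through a statistical package. The array layout is an illustrative assumption: a levels of the characteristic, b instances per level and r replicates per instance.

```python
import numpy as np
from scipy.stats import f

def nested_anova(y):
    """Balanced two-stage nested ANOVA for y of shape (a, b, r):
    Factor A (characteristic, fixed), Factor B (instances, random, nested in A)."""
    a, b, r = y.shape
    grand = y.mean()
    level_means = y.mean(axis=(1, 2))        # mean per characteristic level
    inst_means = y.mean(axis=2)              # mean per instance
    ss_a = b * r * np.sum((level_means - grand) ** 2)
    ss_b = r * np.sum((inst_means - level_means[:, None]) ** 2)
    ss_e = np.sum((y - inst_means[:, :, None]) ** 2)
    ms_a = ss_a / (a - 1)
    ms_b = ss_b / (a * (b - 1))
    ms_e = ss_e / (a * b * (r - 1))
    # Because B is random, A is tested against B(A); B(A) is tested against error.
    f_a, f_b = ms_a / ms_b, ms_b / ms_e
    p_a = f.sf(f_a, a - 1, a * (b - 1))
    p_b = f.sf(f_b, a * (b - 1), a * b * (r - 1))
    return (f_a, p_a), (f_b, p_b)
```

The two F-ratios answer the two questions posed in Section 6.2.1: whether the characteristic has an overall effect, and whether unique instances at the same level of the characteristic differ significantly.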
This approach can be repeated for any problem characteristic that is hypoth-
esised to affect performance. Once satisfied with a set of characteristics, the ex-
perimenter proceeds to do a larger scale screening experiment with all the tuning
parameters and all the relevant problem characteristics. Of course, if the set of
problem characteristics turns out to be insufficient, the experimenter returns to
this stage of the overall sequential experimentation methodology.
6.3 Stage 1b: Screening
The next stage in the sequential experimentation methodology is screening. Screening
aims to determine which factors have a statistically significant and practically
significant effect on each response, as well as the relative size of these effects.
Chapters 8 on page 143 and 10 on page 169 present detailed case studies
fects. Chapters 8 on page 143 and 10 on page 169 present detailed case studies
of screening. The experiment design and methodology presented here were first
introduced to the ACO community by the author [104, 107, 108]. A similar DOE
screening methodology was applied to Particle Swarm Optimisation a year later
[74].
6.3.1 Motivation
A detailed motivation for screening heuristic tuning parameters and problem char-
acteristics has been covered in Chapter 1. It is important to screen heuristic tuning
parameters and problem characteristics for several reasons. We learn which pa-
rameters have no effect on performance and this saves experimental resources in
subsequent performance modelling experiments. It also improves the efficiency of
other heuristic tuning methods (Section 4.3.2 on page 85) by reducing the search
space these methods must examine. This was already discussed on several occa-
sions as one of the major advantages of the DOE approach to parameter tuning
over alternatives like automated tuning (Section 4.3.2 on page 84). Screening ex-
periments also provide a ranking of the importance of the tuning parameters. In a
case of limited resources, this arms the experimenter with knowledge of the most
important factors to examine and those parameters that one may afford to treat
as held-constant factors. Finally, screening is a useful design tool. Alternative
new heuristic features can be ‘parameterised’ into the heuristic and a screening
analysis will reveal whether these new features have any significant effect on per-
formance.
6.3.2 Research questions
The research questions in any heuristic screening study can be phrased as follows.
Screening. Which of the given set of heuristic tuning parameters and
problem characteristics have an effect on heuristic performance in terms
of solution quality and solution time?
Ranking. Of the tuning parameters and problem characteristics that
affect heuristic performance, what is the relative importance of each in
terms of solution quality and solution time?
Adequacy of a Linear Model. Is a linear model of the responses ade-
quate to predict performance or is a higher order model required?
These research questions lead to a potentially large number of hypotheses. It
would not be productive to exhaustively list them all here. Some illustrative
examples follow. These examples can be specified for any tuning parameter
or problem characteristic and must be analysed for all performance metrics in the
study. A screening hypothesis would look like the following.
• Null Hypothesis H0: the tuning parameter A has no significant effect on per-
formance measure X.
• Alternative Hypothesis H1: the tuning parameter A has a significant effect
on performance measure X.
A ranking hypothesis would look like the following.
• Null Hypothesis H0: the tuning parameters A and B have an equally important
effect on performance measure X.
• Alternative Hypothesis H1: the tuning parameter A has a stronger effect than
tuning parameter B on performance measure X.
6.3.3 Experiment Design
Full factorial designs are expensive in terms of experimental resources and provide
more information than is needed in a screening experiment (Section A.3.2
on page 215). For screening purposes it is sufficient to use a fractional factorial (FF)
design. The minimum appropriate fractional factorial design resolution for screening
is resolution IV, since no main effects are aliased, but a resolution V design is
preferable when possible since it provides information on unaliased second-order
effects. Fractional factorials, aliasing and resolution are discussed in more
detail in Appendix A.2.
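To make the construction concrete, the following sketch generates a 2^(5-1) fractional factorial of resolution V by aliasing the fifth factor with the four-factor interaction (defining relation I = ABCDE). The construction is generic and not tied to any particular statistical package.

```python
from itertools import product

def fractional_factorial_2_5_1():
    """2^(5-1) design of resolution V: generator E = ABCD, so I = ABCDE.
    Factor levels are coded -1 (low) and +1 (high)."""
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        e = a * b * c * d                  # the generator E = ABCD
        runs.append((a, b, c, d, e))
    return runs                            # 16 treatments instead of the full 32
```

Because the shortest word of the defining relation has length five, no main effect or two-factor interaction is aliased with another main effect or two-factor interaction.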
6.3.4 Method
The methodology for factor screening is described below.
1. Response variables. Identify the response(s) to be measured. For experi-
ments with metaheuristics, these must reflect some measure of solution qual-
ity and some measure of solution time.
2. Design factors and factor ranges. Choose the algorithm tuning parameters
and problem characteristics that will be screened as well as the ranges over
which these factors and problem characteristics will vary. Sometimes factors
have a restricted range due to their nature. Alternatively, factors may have
an open range. In either case, a pilot study (Section 3.7 on page 65) may be
required to determine sensible factor ranges.
3. Held constant factors and values. Because of limited resources, it may not
be possible to experiment with all potential design factors. A pilot study may
have revealed that some factors have a negligible effect on performance. These
factors must be held at a constant value for the duration of the experiments.
4. Experimental Design. For the given number of factors, choose an appro-
priate fractional factorial design. If the number of factors and resource lim-
itations prevent the use of a resolution V design then examine the available
resolution IV designs for the given number of factors. Where there are sev-
eral available designs of resolution IV, check whether the design requiring a
smaller number of treatments has a satisfactory aliasing structure. Note that
any resolution IV design will have aliased two-factor interactions. However,
knowledge of the system and its most likely interactions may make some of
these aliases negligible. Furthermore, judicious assignment of factors will re-
sult in the most important factors having the least aliasing in the generated
design.
The choice of design will fix the number and value of the factor levels. The
chart provided in Section A.3.2 on page 215 can help in making a decision
between resolution, number of runs and aliasing.
5. Run Order. This is now the minimum design with a single replicate. Generate
a random run order for the design.
6. Gather Data. Collect data according to the treatments and their random run
order.
7. Significance level, effect size and replicates. Choose an appropriate sig-
nificance (alpha) level for the study. Choose an appropriate effect size that
the screening must detect. For the study’s chosen significance level (5% in
this thesis), examine the design’s power to detect the chosen effect size. If the
power is not greater than 80%, add replicates to the design and return to the
Run Order step to gather the extra data. Sufficient data has been gathered
when power reaches 80%.
At this stage, sufficient data has been gathered with which to build a model of
each response. The steps in this model building stage of the methodology are the
subject of the next section.
6.3.5 Build a Model
The following model building steps must be repeated for each of the responses
separately. Before this however, it is necessary to check that none of the responses
are highly correlated. Only one response in a set of highly correlated responses
needs to be analysed. A scatter plot of each response against each other response
visually demonstrates correlation. Recall that two solution quality responses are
measured in this research (Section 3.10.2 on page 69). These responses are of
course highly correlated. However, both responses are always analysed separately.
It is an open question for the ACO field as to which solution quality response is the
more appropriate and we wished to examine whether the conclusions using one
would differ from the conclusions using the other.
1. Find important effects. Various techniques such as half-normal plots can
be used to identify the most important effects that should be included in a
model of the data. This thesis uses backwards regression (Section A.4.1 on
page 220) with an alpha-out value of 0.1; a sketch of this procedure is given after this list.
2. ANOVA test. The result of the backwards regression is an ANOVA on the
model containing these most important effects.
3. Diagnosis. The usual diagnostic tools (Section A.4.2 on page 221) are used
to verify that the ANOVA model is correct and that the ANOVA assumptions
have not been violated.
4. Response Transformation. The diagnostics may reveal that a transforma-
tion of the response is required. In this case, perform the transformation and
return to the Find Important Effects step.
5. Outliers. If the transformed response is still failing the diagnostics, it may be
that there are outliers in the data. These should be identified and removed
before returning to the Find Important Effects step.
6. Model significance. Check that the overall model is significant.
7. Model Fit. Check that the predicted R-Squared value is in reasonable agree-
ment with the Adjusted R-Squared value and that both of these are close to
1. Check that the model has a signal to noise ratio greater than about 4.
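A minimal sketch of the backwards regression of step 1, assuming a pandas DataFrame df holding numerically coded factors and the response, and using ordinary least squares from statsmodels; the alpha-out value of 0.1 matches the value used in this thesis. Model hierarchy (retaining main effects whose interactions remain in the model) is omitted for brevity.

```python
import statsmodels.formula.api as smf

def backward_eliminate(df, response, terms, alpha_out=0.1):
    """Repeatedly drop the least significant term until all p-values < alpha_out."""
    terms = list(terms)                    # e.g. ["A", "B", "A:B"]
    while terms:
        formula = f"{response} ~ {' + '.join(terms)}"
        model = smf.ols(formula, data=df).fit()
        pvalues = model.pvalues.drop("Intercept")
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha_out:
            return model                   # all remaining terms are significant
        terms.remove(worst)                # drop the worst term and refit
    return None
```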
At this stage in the procedure, it may be necessary to augment the design de-
pending on whether a resolution IV or resolution V design had been used and
depending on whether aliased effects are statistically significant. This is one of
the common iterative experimentation situations identified earlier (Section 6.1 on
page 105).
6.3.6 Augment model
In a resolution IV design, some second order effects will be aliased with other ef-
fects. If the experimenter deems these effects to be important then the design must
be augmented with additional treatments so that the aliasing of these effects can
be removed. There is a methodical approach to fractional factorial augmentation
that is termed foldover. The details of foldover are beyond the scope of this thesis
but are covered in the literature [84]. In essence, foldover is a way to add specific
treatments to a design such that a target effect will no longer be aliased in the new
augmented design. If foldover is performed, the new design can be analysed as per
Sections 6.3.4 on page 112 and 6.3.5 on the previous page.
At this point, we have reduced models for each response that pass the diagnos-
tics and in which no important effects are aliased. However, diagnostics involve
some subjective judgements. While the ANOVA procedure can be robust to slight
violations of these diagnostics, it is still good practice to independently confirm the
models’ accuracy on some real data.
6.3.7 Confirmation
Before drawing any conclusions from a model, it is important to confirm that the
model is sufficiently accurate. As in traditional DOE, confirmation is achieved
by running experiments at new randomly chosen points in the design space and
comparing the actual data to the model’s predictions. Confirmation is not a new
rigorous experiment and analysis in itself but rather a quick informal check. In the
case of a heuristic, these randomly chosen points in the design space equate to new
problem instances and new randomly chosen combinations of tuning parameters.
The methodology that this thesis proposes is as follows:
1. Treatments. A number of treatments are chosen where a treatment consists
of a new problem instance and a new set of tuning parameter values with
which the instance will be solved.
2. Generate Instances. The required problem instances are generated.
3. Select Tuning Parameters. It is important to remember that the screening
design uses only high and low values of the factors. It therefore can only pro-
duce a linear model of the response. This will not be an accurate predictor of
the response within the centre of the design space if the response actually ex-
hibits curvature. However, the model should still be accurate near the edges
of the design space (at the high and low values of the factors). The randomly
chosen tuning parameters should therefore be restricted to be within a cer-
tain percentage of the edges of the factors’ ranges. In this research, a limit of
within 10% of the factor high and low values is used.
4. Random run order. A random run order is generated for the treatments
and a given number of replicates. In this research, three replicates are used;
this is enough to give an estimate of how variable the response is for a
given treatment. We are conducting this confirmation to ensure our subjec-
tive decisions in the model building were correct. We are not conducting a
new statistically designed experiment that would introduce further subjective
diagnostic decisions as per the previous section.
5. Prediction Intervals. The collected data is compared to the model’s 95%
high and low prediction intervals (Section A.3.5 on page 219). We identify two
criteria upon which our satisfaction with the model (and thus confidence in
its predictions) can be judged [106].
• Conservative: we should prefer models that provide consistently higher
predictions of relative error and higher solution time than those actually
observed. We typically wish to minimise these responses and so a con-
servative model will predict these responses to be higher than their true
value.
• Matching Trend: we should prefer models that match the trends in
heuristic performance. The model’s predictions of the parameter combi-
nations that give the best and worst performance should match the com-
binations that yield the actual metaheuristic’s observed best and worst
performance.
6. Confirmation. If the model is not a satisfactory predictor of the actual al-
gorithm then the experimenter must return to the model building phase and
attempt to improve the model.
At this stage, we have built models of each response from the gathered data and
the models have been confirmed to be good predictors of the algorithm responses
around the edges of the design space. We can now analyse the models and rank
and screen the factors.
6.3.8 Analysis
The following steps for analysing a screening model must be repeated for each
response independently.
1. Rank most important factors. The terms in the model should be ranked
according to their ANOVA sum of squares values (Section A.4 on page 220).
These ranks can then be studied alongside the corresponding p values for the
model terms. The most important model terms will have large sum of squares
values and a p value that shows they are statistically significant.
2. Screen factors. Factors that are not statistically significant and have a rel-
atively low ranking can be removed immediately from the subsequent experi-
ments as they do not have an important influence on the response. Further-
more, factors that are statistically significant but practically insignificant can
also be considered for removal from subsequent experiments. The extent to
which we screen out factors will depend on the experimental resources avail-
able for subsequent experiments, our knowledge of the metaheuristic’s tuning
parameters and our confidence in the screening experiment’s recommenda-
tions. Note also that if a factor is important for even one response then it must
remain in the subsequent experiments. Subsequent experiments combine all
responses into a single model of performance.
3. Model graphs. Graphs of the response for each factor should be examined.
This serves two purposes. It confirms our decision to screen factors in the
previous step. Furthermore, it shows us whether statistically significant and
highly ranked factors actually have a practically significant effect on the re-
sponse.
At this stage, the relative importance of each term in the model has been as-
sessed. A linear relationship between the factors and the response has been as-
sumed. The screening design has therefore yielded a planar relationship between
the factors and the response. The relationship between the factors and response
is often of a higher order than planar. A higher order relationship will exhibit some
curvature. It is therefore important to determine whether such curvature exists so
that we can establish the need to use a more sophisticated response surface and
associated experiment design in subsequent experiments.
6.3.9 Check for curvature
Adding centre points to a design allows us to determine whether the response
surface is not planar but actually contains some type of curvature. A centre point
is simply a treatment that is a combination of all factors’ values at the centre of
the factors’ ranges1. The average response value from the actual data at the centre
points is compared to the estimated value of the centre point that comes from
averaging all the factorial points. If there is curvature of the response surface in the
region of the design, the actual centre point value will be either higher or lower than
predicted by the factorial design points. If no curvature exists then the screening
experiment’s planar models should be sufficient to predict responses. This should
first be confirmed as per Section 6.3.7 on page 114 but with the difference that
treatments are drawn from throughout the design space rather than just from the
edges.
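The comparison just described is usually formalised as a single-degree-of-freedom sum of squares for pure quadratic curvature (see, e.g., Montgomery [84]). With $n_F$ factorial points averaging $\bar{y}_F$ and $n_C$ centre points averaging $\bar{y}_C$:

```latex
SS_{\mathrm{curvature}} = \frac{n_F \, n_C \, (\bar{y}_F - \bar{y}_C)^2}{n_F + n_C}
```

The corresponding mean square is compared to the error mean square with an F-test; a significant result indicates curvature and motivates the response surface designs of the next stage.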
If the analysis with centre points reveals the possibility of curvature in the ac-
tual data then further experiments are required to predict the responses through-
out the whole design space. These experiments are the subject of the next stage in
the methodology.
6.4 Stage 2: Modelling
The response surface methodology is similar to the screening methodology of Sec-
tion 6.3.4 on page 112 in many regards. It involves a fractional factorial type de-
sign, analysed with ANOVA. Its data are collected following the good practices for
experiments with heuristics developed in Chapter 3 and illustrated in the screen-
ing methodology of Section 6.3 on page 110. The most significant difference in the
response surface methodology is in the experiment design it requires. The screen-
ing design used a simple linear model to determine whether there was a significant
difference between high and low levels of the factors. For this purpose, only the
edges of the design space were of interest and only these were examined when con-
firming the model (see Section 6.3.7 on page 114). A Response Surface Model, by
contrast, requires a more sophisticated design since it attempts to build a model of
the factor-problem-performance relationship across the whole design space. The
experiment designs and methodology presented here were first introduced to the
ACO field by the author [104, 106].
6.4.1 Motivation
A detailed motivation for performance modelling has already been presented in
Chapter 1. Modelling heuristic performance is a sensible way to explore the vast
design space of tuning parameter settings and their relationship to problem in-
stances and heuristic performance. A good model can be used to quickly rec-
ommend tuning parameter settings that maximise performance given an instance
with particular characteristics. Performance models can also provide visualisa-
tions of the robustness of parameter settings in terms of changing problem char-
acteristics.
6.4.2 Research questions
This parameter tuning study addresses the following research questions.
• Screening. Which tuning parameters and which problem characteristics
have no significant effect on the performance of the metaheuristic in terms
of solution quality and solution time? If a screening study has already been
conducted correctly and the tuning study is performed on the screened pa-
rameter set, one would expect all remaining parameters to have a significant
effect on performance. The screening study is a more efficient method of an-
swering screening questions but the tuning study can nonetheless identify
further unimportant parameters that may have been missed in the screening
study.
• Ranking. What is the relative importance of the most important tuning pa-
rameters and problem characteristics?
• Sufficient order model. What is the minimum order model that satisfactorily
models performance? We know from the screening study whether or not a
linear model is sufficient. Tuning studies offer more advanced fit analyses that recommend the minimum order model that is required for the data.
• Relationship between tuning, problems and performance. What is the
relationship between tuning parameters, problem characteristics and the
responses of solution quality and solution time? A tuning study yields a math-
ematical equation describing this relationship for each response.
• Tuned parameter settings. What is a good set of tuning parameter settings
given an instance with certain characteristics? Are these settings better than
what can be achieved with randomly chosen settings? Are these settings
better than alternative settings from the literature?
The first two research questions are identical to questions in the screening
study of Chapter 8. The tuning study does not obviate the need for a screening
study. A screening study is a simpler and more efficient method for answering
these particular research questions.
6.4.3 Experiment Design
Building a response surface requires a particular type of experiment design. There
are several alternatives available and these are discussed in Section A.3.4 on
page 218. In this thesis, the Face-Centred Composite (FCC) design is used for
all models. The FCC is most appropriate for situations where design factors are
restricted to be within a certain range. This is the case with many metaheuristic
tuning parameters. For example, in ACO, the pheromone-related parameter ρ must
be within the range 0 < ρ < 1.
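A minimal sketch of the geometry of the Face-Centred Composite design in coded units: a two-level factorial core, axial (star) points on the faces of the cube (axial distance 1, which is what keeps every factor within its range), and replicated centre points. Statistical packages generate these treatments automatically, and for larger numbers of factors the full factorial core is replaced with a fractional factorial as described in the Method below.

```python
from itertools import product

def face_centred_composite(k, n_centre=4):
    """Treatments of an FCC design for k factors in coded units [-1, +1]."""
    corners = list(product((-1, 1), repeat=k))    # 2^k factorial core
    stars = []
    for i in range(k):                            # 2k face-centred axial points
        for level in (-1, 1):
            point = [0] * k
            point[i] = level
            stars.append(tuple(point))
    centres = [(0,) * k] * n_centre               # replicated centre points
    return corners + stars + centres

print(len(face_centred_composite(3)))             # 8 + 6 + 4 = 18 treatments
```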
6.4.4 Method
The method for response surface modelling is detailed below. As already noted, it
is very similar to the screening methodology of Section 6.3 on page 110.
1. Response variables. Identify the response(s) to be measured. For experi-
ments with metaheuristics, these must reflect some measure of solution qual-
ity and some measure of solution time.
2. Design factors and factor ranges. Choose the algorithm tuning parameters
and problem characteristics whose relationship to performance metrics will
be modelled by the response surfaces. Note that problem characteristics must
be included in the model so that we can correctly model their relationship to
the tuning parameters and the heuristic performance. If the response surface
design has been augmented from a previous screening design by adding star
points then the factors and factor ranges are already determined.
3. Held constant factors and values. Because of limited resources, it may not
be possible to experiment with all potential design factors. A pilot study may
have revealed that some factors have a negligible effect on performance. These
factors must be held at a constant value for the duration of the experiments.
4. Screened out factors. Factors that have been screened out in the previous
screening study can be set to any values with impunity.
5. Experiment design. For the given number of factors, choose an appropriate
fractional factorial design for the factorial part of the Face-Centred Compos-
ite design. Where there are several available designs of resolution IV, check
whether the design requiring a smaller number of treatments has a more
satisfactory aliasing structure. The choice of the FCC design determines the
location of the design’s star points and consequently the levels of the factors.
6. Run Order. This is now the minimum design with a single replicate. Generate
a random run order for the design.
7. Gather Data. Collect data according to the treatments and their run order.
8. Significance level, effect size and replicates. Choose an appropriate sig-
nificance (alpha) level for the study. Choose an appropriate effect size that
the study must detect. For the study’s chosen significance level (5% in
this thesis), examine the design’s power to detect the chosen effect size. If
the power is not greater than 80%, add replicates to the design, return to
the Run Order step and gather the extra data. By convention, sufficient
data has been gathered when power reaches 80%.
At this stage, we have sufficient data to build a response surface model of each
response.
6.4.5 Build a Model
Building the models for a response surface differs in several ways from building
a model for screening. When screening, a model of each response was analysed
separately. This meant that outliers removed from one response’s model would not
necessarily be removed from another response’s model. With the response surface
models however, we will ultimately be combining all the models’ recommendations
into a simultaneous tuning of all the responses. It is therefore more appropriate
that outlier runs deleted from the analysis of one response will remain deleted from
the analysis of other responses.
When screening, it was sufficient to use two levels of each factor and so we were
limited to a linear/planar model of the responses. The response surface model is
more complicated in that it can model higher-order surfaces. The first step is
therefore to determine the most appropriate model for the data.
1. Model Fitting. The highest order model that can be generated from the FCC
design is quadratic. All lower order models (linear and 2-factor interaction)
are generated and then assessed on two aspects: their significance and their
R-squared values.
(a) Begin with the lowest order model, the linear model. If the model is not
significant, it is removed from consideration and the next highest order
model is examined for significance.
(b) Examine the adjusted R-squared and predicted R-squared values. These
should be within 0.2 of one another and as close to 1 as possible. If
the R-squared values are not satisfactory, the model is removed from
consideration and the next highest order model is examined.
If models of a higher order than quadratic are required then an alternative
design to the FCC will have to be used.
2. Find important effects. A stepwise linear regression is performed on the
chosen model to estimate its coefficients. Note that this is different from the
screening stage of Section 6.3 on page 110. Here, terms are being removed
from the model to give the most parsimonious model possible. Screening, by
contrast, determines which factors (and all associated model terms) do not
even take part in the experiment design. Of course, if screening has been
accurate then few main effect terms should be removed from the response
surface model by stepwise linear regression.
3. Diagnosis. The usual diagnostics of the linear regression model are per-
formed. If the model passes these tests then its proposed coefficients can be
accepted.
4. Response Transformation. The diagnostics may reveal that a transforma-
tion of the response is required. In this case, perform the transformation and
return to step 2.
5. Outliers. If the transformed response is still failing the diagnostics, it may be
that there are outliers in the data. These should be identified and removed
before returning to step 2.
6. Model significance. Check that the overall model is significant.
7. Model Fit. Check that the predicted R-Squared value is in reasonable agree-
ment with the Adjusted R-Squared value and that both of these are close to
1. Check that the model has a signal to noise ratio greater than about 4.
As with the screening procedure, it is good practice to independently confirm
the models’ accuracy on some real data as recommended by the DOE approach.
6.4.6 Confirmation
The methodology for confirming our response surface models differs in one im-
portant way from the previous method for confirming our screening models (Sec-
tion 6.3.7 on page 114). The randomly chosen instances and parameter settings
are now drawn from across the entire design space rather than being limited to the
edges of the design space. The methodology is the same as that of Section 6.3.7 on
page 114 in all other aspects and so is not repeated here.
6.4.7 Analysis
The following analysis is performed for each model separately.
1. Rank most important factors. The terms in the model equations should be
ranked according to their ANOVA F values. These ranks can then be stud-
ied alongside the corresponding p values for the model terms. The rankings
should be in approximate agreement with the previous screening study.
2. Model graphs. Graphs of the responses for each factor should be examined.
This shows us whether statistically significant and highly ranked factors ac-
tually have a practically significant effect on the response. It also shows the
likely location of optimal response values. Surface plots of pairs of factors are
particularly insightful.
At this stage, the experimenter has a model of each of the responses over the
entire design space and these models have been confirmed to be accurate predic-
tors of the actual metaheuristic. It is now possible to use this model for tuning the
actual metaheuristic.
6.5 Stage 3: Tuning
Screening has provided us with a reduced set of the most important problem char-
acteristics and the most important algorithm tuning parameters. The Response
Surface Model has given us mathematical functions relating these characteristics
and tuning parameters to each of the responses of interest. The accuracy of these
models’ predictions has been methodically confirmed and we are satisfied that
the models meet our quality criteria (see Section 6.4.6).
Since the Response Surface Models are mathematical functions of the factors,
it is possible to numerically optimise the responses by varying the factors. This
allows us to produce the most efficient process. There are several possible optimi-
sation goals. We may wish to achieve a response with a given value (target value,
maximum or minimum). Alternatively, we may wish that the response always falls
within a given range (e.g. relative error less than 10%). More usually, we may wish to
optimise several responses because of the heuristic compromise. We have seen in
the literature review of Chapter 4 on page 77 that such tuning rarely deals with
both solution quality and solution time simultaneously and so neglects the heuris-
tic compromise. We introduce here a technique from Design Of Experiments that
allows multiple response models to be simultaneously tuned.
6.5.1 Desirability Functions
The multiple responses are expressed in terms of desirability functions [40] (see
also the NIST/SEMATECH e-Handbook of Statistical Methods, available at
http://www.itl.nist.gov/div898/handbook/pri/section5/pri5322.htm). Desirability
functions are described in more detail in Section A.3.6 on page 220. The
overall desirability for all responses is the geometric mean of the individual desir-
abilities.
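As an illustration, one common form of individual desirability (popularised by Derringer and Suich) for a response y that is to be minimised uses a lower value L, at or below which the response is fully desirable, and an upper value U, at or above which it is unacceptable. The overall desirability of m responses is then:

```latex
d(y) =
\begin{cases}
1, & y \le L \\[4pt]
\left( \dfrac{U - y}{U - L} \right)^{w}, & L < y < U \\[4pt]
0, & y \ge U,
\end{cases}
\qquad
D = \left( \prod_{i=1}^{m} d_i \right)^{1/m}
```

The weight w controls how sharply desirability falls off; equal weights across responses correspond to the equal priorities used in this research.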
The well-established Nelder-Mead downhill simplex [98, p. 326] is then applied
to the response surface model’s equations such that the desirability is maximised.
We specify the optimisation in this research with the dual goals of minimising both
solution error and solution time, while allowing all algorithm-related factors to vary
within their design ranges. Equal priority is given to the dual goals. This does not
preclude running an optimisation that favours solution quality or favours solution
time, as determined by the tuning application scenario. Problem characteristics
are also factors in the model since we want to establish the relationship between
these problem characteristics, the algorithm parameters and the performance re-
sponses. It does not make sense to include these problem characteristic factors
in the optimisation: the optimisation process would naturally select the easiest
problems as part of its solution. We therefore need to choose fixed combina-
tions of the problem characteristics and perform the numerical optimisations for
each of these combinations. A sensible choice of such combinations is a three-
level factorial of the characteristics. A more detailed description of these methods
follows.
It is important to note that optimisation of desirability does not necessarily lead
to parameter recommendations that yield optimal metaheuristic performance.
Desirability functions are a geometric mean of the desirability of each individual
response, and a response surface model is an interpolation of the responses from
various points in the design space. There is therefore no guarantee that the
recommended parameters result in optimal performance; they only result in tuned
performance that is better than performance in most of the design space. One
must be careful to distinguish between optimised desirability and tuned parameters.
6.5.2 Method
These are the steps used to simultaneously optimise the desirability of the solution
quality and solution time responses.
1. Combinations of problem characteristics. A three-level factorial combina-
tion of the problem characteristics is created. In the case of two characteris-
tics, this creates 9 combinations of problem characteristics. These combina-
tions are illustrated in Table 6.1 below.

Standard Order   Characteristic A   Characteristic B
1                Level 1 of A       Level 1 of B
2                Level 2 of A       Level 1 of B
3                Level 3 of A       Level 1 of B
4                Level 1 of A       Level 2 of B
5                Level 2 of A       Level 2 of B
6                Level 3 of A       Level 2 of B
7                Level 1 of A       Level 3 of B
8                Level 2 of A       Level 3 of B
9                Level 3 of A       Level 3 of B

Table 6.1: A full factorial combination of two problem characteristics with three levels of each characteristic. This results in nine treatments.

2. Numerical optimisation. For each of these combinations, a numerical opti-
misation of desirability is performed using the Nelder-Mead Simplex (a sketch
of this optimisation is given after this list) with the following settings:
• Cycles per optimisation is 30.
• Simplex fraction is 0.1.
• Design points are used as the starting points.
• The maximum number of solutions is 25.
The optimisation goal is to minimise the solution error response and the so-
lution run time response. These goals can be given different priorities if, for
example, quality is deemed more important than time. The problem charac-
teristics are fixed at the values corresponding to the 3 level factorial combi-
nation.
3. Choose best solution. When the optimisation has completed, the solution
with the highest desirability is taken and the others are discarded. Note that
there may be several solutions of very similar desirability but with differing
factor settings. This is due to the nature of the multiobjective optimisation
and the possibility of many regions of interest (Section A.2 on page 213).
4. Round off integer-valued parameters. If a non-integer value of an exponent
tuning parameter was recommended, this is rounded to the nearest integer
value for the reasons explained in Section 5.2.1 on page 99.
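A sketch of the numerical optimisation of step 2 under stated assumptions: error_model, time_model and desirability are stand-ins for the fitted response surface equations and the desirability mapping of Section 6.5.1, and scipy's general-purpose Nelder-Mead implementation replaces the specific simplex settings listed above. Restarting the search from each design point and keeping the best result mirrors the use of design points as starting points.

```python
import numpy as np
from scipy.optimize import minimize

def tune_for_instance(error_model, time_model, desirability,
                      chars, starts, bounds):
    """Maximise overall desirability over the tuning parameters, with the
    problem characteristics `chars` held fixed."""
    lows = np.array([lo for lo, hi in bounds])
    highs = np.array([hi for lo, hi in bounds])

    def neg_overall_desirability(params):
        p = np.clip(params, lows, highs)   # keep parameters inside the design space
        d_error = desirability(error_model(p, chars))
        d_time = desirability(time_model(p, chars))
        return -np.sqrt(d_error * d_time)  # geometric mean of the two desirabilities

    solutions = [minimize(neg_overall_desirability, np.asarray(x0),
                          method="Nelder-Mead") for x0 in starts]
    best = min(solutions, key=lambda s: s.fun)
    return np.clip(best.x, lows, highs), -best.fun
```

Any integer-valued parameters in the returned settings are then rounded as in step 4.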
This optimisation procedure has given us recommended parameter settings for
9 locations covering the problem space. Of course, a user requiring more refined
parameter recommendations will have to run this optimisation procedure for the
problem characteristics of the scenario to hand. Optimisation of desirability is not
expensive but requires access to appropriate tools. Alternatively, an interpolation
of the recommendations across the design space is possible. The experimenter can
now evaluate the tuning recommendations.
6.6 Stage 4: Evaluation
There are two main aspects to the evaluation of recommended tuning parameter
settings. Firstly, we wish to assess how well the presented method’s recommended
parameter settings perform in comparison to alternative recommended settings.
Secondly, we wish to determine how robust the recommended settings are when
used with other problem characteristics.
6.6.1 Comparison
We wish to determine by how much the results obtained with the tuned parame-
ter settings are better than the results obtained with randomly chosen parameter
values and the results obtained with alternative parameter settings. We wish to
compare with randomly chosen parameters to demonstrate that the effort involved
in the methodology does indeed offer an improvement in performance over not
using any method at all. We compare with alternative parameter settings to see
whether the methodology is competitive with the literature or supports values rec-
ommended in the literature. Such alternative settings may include those recom-
mended by others or those determined using some other tuning methodology. In
this research, all results using DOE tuned parameters are compared to the results
obtained with the parameter settings recommended in the literature [47].
The methodology is described below. It is similar in terms of set-up to the previ-
ous confirmation experiments for the screening models (Section 6.3.7 on page 114)
and response surface models (Section 6.4.6 on page 121). It must be repeated for
every combination of problem characteristics used to cover the design space (see
Section 6.5.2 on page 122).
1. Generate problem instances. A number of problem instances are created at
randomly chosen locations in the problem space.
2. Randomly choose combinations of tuning parameters. For each instance,
a number of sets of parameter settings are chosen randomly within the design
space.
3. Use models to choose parameter settings. For each instance, a parameter
setting is chosen using the desirability optimisation of the response surface
model. In this research, two models were built: a relative error vs time model
and an ADA vs time model. There are therefore two sets of parameter recom-
mendations, one for each model.
4. Solve instances. The instances are solved using the various parameter set-
tings.
5. Plot. A scatter plot is used to illustrate the differences, if any, between the
solutions obtained with the various parameter settings.
It is important once again to draw the reader’s attention to the heuristic compro-
mise. We may find that recommended parameter settings offer a similar solution
quality to the DOE method’s recommendations. However, the DOE methodology is
constrained to also find settings that offer good solution times. Only when both
responses are examined do we have a realistic comparison of the performance of
the parameter settings.
6.6.2 Robustness
At this point, we have tuned parameter settings for each member of a set of combi-
nations of problem characteristics. This set was chosen so that it covers the whole
space of possible problems in some sensible fashion. We also have the model
equations of the responses that allow us to calculate tuned parameter settings for
new combinations of problem instance characteristics. It may not always be con-
venient or necessary to perform these calculations. A technique from Design Of
Experiments called overlay plots is adapted in this thesis so that statements can be
made about the robustness of tuned parameter settings across a range of different
problem instance characteristics [106].
Overlay plots are a visual representation of regions of the design space in which
the responses fall within certain bounds. They can be considered as a more relaxed
optimisation than that of tuning where the experimenter was looking for maxima
and minima. For example, the experimenter might query the response models for
ranges of some tuning parameters within which the solution time response is less
than a certain value. Overlay plots are very useful in the context of robustness of
tuned parameters. Recall that the researcher has several sets of tuned parameter
values for various combinations of problem characteristics. For any one of these
combinations of problem characteristics, the associated tuned parameter settings
should result in maxima or minima of the responses. Now one may also wonder
whether adjacent combinations of problem characteristics in the problem space
could also be quite well solved with the same tuned parameter settings. Overlay
plots allow one to visualise the answer to this question. Figure 6.3 on the next
page illustrates an overlay plot.
The horizontal and vertical axes represent values of two problem instance char-
acteristics. The tuned parameter settings are listed to the left. The white area
represents all instance combinations that are solvable with these parameter set-
tings within given relaxed bounds on the performance responses. Clearly, this area
will be larger for more robust tuned parameter settings.
Overlay plots are a powerful tool that is only available when one has a model of
metaheuristic performance across the whole design space. They are backed by the
same rigour and statistical confidence as all DOE methods.
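Such a plot can be produced directly from the fitted response models. A minimal sketch under illustrative assumptions (error_model and time_model stand in for the fitted equations, assumed to accept numpy arrays, with the tuning parameters frozen at their tuned values); the bounds match those of Figure 6.3 below.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_plot(error_model, time_model, tuned_params,
                 char_a_range, char_b_range, max_error=1.0, max_time=5.0):
    """Shade the region of the problem space solvable within the given bounds."""
    a = np.linspace(char_a_range[0], char_a_range[1], 200)
    b = np.linspace(char_b_range[0], char_b_range[1], 200)
    A, B = np.meshgrid(a, b)
    feasible = ((error_model(tuned_params, A, B) < max_error) &
                (time_model(tuned_params, A, B) < max_time))
    # Infeasible region in grey, feasible region in white.
    plt.contourf(A, B, feasible, levels=[-0.5, 0.5, 1.5], colors=["grey", "white"])
    plt.xlabel("Problem characteristic A")
    plt.ylabel("Problem characteristic B")
    plt.show()
```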
Figure 6.3: A sample overlay plot. The horizontal and vertical axes represent two problem characteristics. The tuning parameter settings are listed on the left. Two constraints on solution have been specified: the time must be less than 5 seconds and the relative error must be less than 1%. The white area is the region of the problem space that can be solved within these constraints on time and solution quality.

6.7 Common case study issues
Thus far, this chapter has detailed the methodologies that the thesis has adapted
from DOE and will apply to the parameter tuning problem. Chapter 3 highlighted
how even within a good methodology there are many experimental design decisions
that must be taken. This section summarises those common issues and decisions
taken across all subsequent case studies.
etition in the case studies but a stand-alone case study should report all of the
following for completeness and to aid reproducibility.
6.7.1 Instances
All TSP instances were of the symmetric type. In the Euclidean TSP, cities are
points with integer coordinates in the two-dimensional plane. For two cities at
coordinates $(x_1, y_1)$ and $(x_2, y_2)$, the distance between the cities is computed
according to the Euclidean distance $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. However this definition
of distance must be modified slightly. In the given form, this equation produces
irrational numbers that can require infinite precision to be described correctly.
This causes problems when comparing tour lengths produced by solution tech-
niques. In all problems encountered in this thesis, distance is calculated using
$(\mathrm{int})\left[\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} + 0.5\right]$. This is the so-called EUC_2D distance type as
specified in the online TSP benchmark library TSPLIB [102] and as used in the
original ACOTSP code (Section 5.2 on page 97).
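A minimal sketch of this rounding, matching the TSPLIB EUC_2D definition:

    import math

    def euc2d(c1, c2):
        """TSPLIB EUC_2D distance: the Euclidean distance rounded to the
        nearest integer by truncating (d + 0.5)."""
        xd, yd = c1[0] - c2[0], c1[1] - c2[1]
        return int(math.sqrt(xd * xd + yd * yd) + 0.5)

    print(euc2d((0, 0), (3, 4)))   # exact distance 5.0 -> 5
    print(euc2d((0, 0), (1, 1)))   # sqrt(2) = 1.414... -> 1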
The TSP problem instances ranged in size from 300 cities to 500 cities with cost
matrix standard deviation ranging from 10 to 70. All instances had a mean of 100.
The same instances were used for each replicate of a design point.
Instances were generated with a version of the publicly available portmgen
problem generator from the DIMACS challenge [58] as described in Section 5.1
on page 95.
6.7.2 Stopping criterion
All experiments except those in Chapter 7 were halted using a stagnation stopping criterion. Stagnation was defined as a fixed number of consecutive iterations in which no improvement in solution value had been obtained. Responses were measured at
several levels of stagnation during an experiment run: 50, 100, 150, 200 and 250
iterations. This facilitated examining the data at alternative stagnation levels to
ensure that conclusions were the same regardless of stagnation level.
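A minimal sketch of this criterion, assuming a hypothetical run_iteration() callback that returns the best tour length found in one iteration of the heuristic (the actual ACOTSP implementation differs in its details):

    def run_until_stagnation(run_iteration, stagnation_limit=250,
                             checkpoints=(50, 100, 150, 200, 250)):
        best = float("inf")
        stagnant = 0          # iterations since the last improvement
        responses = {}
        while stagnant < stagnation_limit:
            value = run_iteration()
            if value < best:
                best, stagnant = value, 0
            else:
                stagnant += 1
            if stagnant in checkpoints:
                # Record the response at each intermediate stagnation level.
                responses[stagnant] = best
        return best, responses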
6.7.3 Response variables
Three response variables were measured. The time in seconds to the end of an ex-
periment reflects the solution time. The adjusted differential approximation (ADA)
and relative error from a known optimum reflect the solution quality. These were
described in Section 3.10.2 on page 69.
Concorde [5] was used to calculate the optima of the instances. Expected ran-
dom solution values of the instances, as used in the ADA calculation, were gener-
ated by randomly permuting the order of cities in a TSP instance 200 times and
taking the average tour length from these permutations.
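A minimal sketch of that estimate, assuming a hypothetical dist(i, j) function giving the cost-matrix entry for cities i and j:

    import random

    def expected_random_tour(dist, n_cities, n_samples=200, seed=1):
        """Estimate the expected random solution value used in the ADA
        calculation by averaging the tour length of random permutations."""
        rng = random.Random(seed)
        cities = list(range(n_cities))
        total = 0.0
        for _ in range(n_samples):
            rng.shuffle(cities)
            total += sum(dist(cities[i], cities[(i + 1) % n_cities])
                         for i in range(n_cities))
        return total / n_samples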
6.7.4 Replicates
The design points were replicated in a work-up procedure (Section A.6.1 on page 227)
until sufficient power of 80% was reached for detecting a given effect size with an
alpha level of 5% for all responses. The 80% power and 5% significance level were
chosen by convention. The size of effect that could feasibly be detected depended
on the particular response and the particular experiment design.
6.7.5 Benchmarks
All experimental machines were benchmarked as per the DIMACS benchmarking
procedure described in Chapter 5. The results of this benchmarking are presented
in Section 5.3 on page 100. These benchmarks should be used when scaling CPU
times in future research.
6.7.6 Factors, levels and ranges
Held-Constant Factors
There are several held constant factors common to all case studies. Local search, a
technique typically used in combination with ACO heuristics, was omitted. There
are two reasons for this omission. Firstly, there is a large number of local search
alternatives from which to choose and choosing one would have restricted the
thesis’s conclusions to a particular local search implementation. Secondly, the
overwhelming contribution to ACO solution quality comes from local search, which would have masked the effects of the tuning parameters and so defeated the thesis aim of evaluating and demonstrating the tuning of a heuristic with a large number of parameters. All instances had a cost matrix mean of 100.
The computation limit parameter (Section 2.4.8 on page 44) was fixed at the candidate list length, as this resulted in significantly lower solution times.
Nuisance Factors
A limitation on the available computational resources necessitated running exper-
iments across a variety of machines with slightly different specifications. There
was no control over background processes running on these machines. Runs
were executed in a randomised order across these machines to counteract any
uncontrollable nuisance factors due to the background processes and differences
in machine specification.
6.7.7 Outliers
This research used the approach of deleting outliers from an analysis until the
analysis passed the usual ANOVA diagnostics (Section A.4.2 on page 221).
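A minimal sketch of that loop, assuming a design matrix X and response y as NumPy arrays and using standardised residuals as a stand-in for the fuller set of diagnostics in Section A.4.2:

    import numpy as np

    def delete_outliers(X, y, cutoff=3.0):
        """Refit a least-squares model, dropping the worst point whose
        standardised residual exceeds the cut-off, until none remain."""
        keep = np.ones(len(y), dtype=bool)
        while True:
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            resid = y[keep] - X[keep] @ beta
            standardised = resid / resid.std(ddof=X.shape[1])
            worst = np.abs(standardised).argmax()
            if abs(standardised[worst]) <= cutoff:
                return keep      # boolean mask of retained data points
            keep[np.flatnonzero(keep)[worst]] = False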
6.8 Chapter summary
This chapter has detailed the sequential experimentation methodology that is used
in the rest of the thesis. The following topics were covered.
• Iterative experimentation. Experimentation for modelling any process is
inevitably iterative. Metaheuristics are no different. The experimenter often
knows little about the heuristic at the start of experimentation. Once data
is gathered and understanding of the heuristic deepens, the original experi-
ment design decisions may have to be revised. Any procedure for algorithm
modelling must efficiently incorporate the iterative nature of experimentation.
• Sequential experimentation procedure. There is a well-established sequen-
tial experimentation procedure in traditional DOE. This thesis modifies that
procedure so that it can be applied to metaheuristics.
• Choosing problem characteristics. Experimentation with metaheuristics is
different from traditional DOE in many regards. In particular, both problem
characteristics and algorithm tuning parameters must be incorporated into
the model so that parameter settings for new instances can be selected. In
the spirit of sequential experimentation, we would like to determine quickly
which problem characteristics should be included in subsequent experiment
designs rather than building a large all-encompassing and expensive design.
The two-stage nested design was introduced in Section 6.2 on page 106 as a
methodical way to generalise performance effects due to a problem charac-
teristic despite the individual uniqueness of each problem instance.
• Performing confirmation runs. Confirmation runs are important when val-
idating conclusions in traditional DOE. Many DOE analyses involve some
subjective decisions. It is important then to confirm conclusions from the
DOE procedures. This increases our confidence in our analysis—if the meth-
ods give good predictions of actual performance then they are sufficient for
our engineering purposes. There are two major types of confirmation runs
encountered here.
1. Model confirmation runs are used to confirm that the ANOVA equation
is a good prediction of the actual response.
2. Tuning confirmation runs are used to confirm that the recommended
tuning parameter settings are indeed as good as or better than alterna-
tives.
• Factor Screening. Screening experiments allow us to rank the importance
of each tuning parameter and problem characteristic, determining those that
matter most to performance and those that have no statistically significant
or practically significant effect. Screening experiment designs do not have
to be of a high enough resolution to make statements about higher order
interactions. They are a quick way to reduce the size of subsequent response
surface designs.
• Response Surface Modelling. Response surface modelling determines the
relationship between tuning parameters, problem characteristics and perfor-
mance responses. A surface is built for each response separately.
• Desirability functions. Desirability functions are a DOE technique for com-
bining multiple responses into a single response. We have introduced desir-
ability functions to the ACO community [104, 106, 109]. This permits easy
tuning of factors while observing the heuristic compromise of high solution quality in reasonable solution time (a minimal sketch follows this list).
• Tuning. Once all responses are expressed in a single desirability function,
the multi-objective optimisation of all the responses in terms of the tuning
parameters can be performed using well-established numerical optimisation
methods. Optimisation can only be performed for fixed combinations of prob-
lem characteristics. Including problem characteristics in the optimisation
would not make sense as the optimisation would select the easiest combi-
nation of problem characteristics. We therefore perform the optimisations at
combinations of problem characteristics that span the design space. The rec-
ommendations from tuning for these combinations of problem characteristics
can be interpolated across the design space. Alternatively, tuning can be per-
formed for every specific combination of problem characteristics that the user
is presented with.
• Overlay plots. Since tuned parameter settings relate to specific combina-
tions of problem characteristics, it is useful to determine how robust those
parameter settings are to changes in the original problem characteristic val-
ues. Overlay plots provide a useful visual tool for doing this. Given a bound
on the responses, we can plot all the combinations of problem characteris-
tics that the algorithm will solve within these bounds using the given tuned
parameter settings.
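As promised above, the following is a minimal sketch of a desirability calculation for the two performance responses. The bounds and weights are illustrative, not values used in the thesis.

    import numpy as np

    def desirability_smaller_is_better(y, target, upper, weight=1.0):
        """Derringer-Suich style: 1 at or below the target value, falling
        to 0 at or above the upper acceptable bound."""
        return np.clip((upper - y) / (upper - target), 0.0, 1.0) ** weight

    # Individual desirabilities for solution quality and solution time.
    d_quality = desirability_smaller_is_better(1.2, target=0.5, upper=5.0)
    d_time = desirability_smaller_is_better(30.0, target=10.0, upper=120.0)

    # The overall desirability is the geometric mean of the individual
    # desirabilities; maximising this single value captures the compromise.
    print(round(float((d_quality * d_time) ** 0.5), 3))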
The chapters in the next part of this thesis will apply this sequential experimen-
tation methodology and its procedures to several ACO algorithms to test the thesis
hypothesis.
7 Case study: Determining whether a problem characteristic affects heuristic performance
This Chapter reports a case study on how to determine whether a problem charac-
teristic affects the performance of a metaheuristic. The methodology for this case
study was proposed and described in Chapter 6.
The Chapter reports a new result for ACO. The standard deviation of TSP edge
lengths has a significant effect on the difficulty of TSP instances for two ACO algo-
rithms, ACS and MMAS1. The results reported in this Chapter have been published
[105, 110]. The results support conclusions from similar experiments with exact
algorithms [26] and provide a detailed illustration of the application of techniques
of which the research community is becoming increasingly aware [6].
7.1 Motivation
An integral component of the construct solutions phase (Section 2.4.3 on page 37)
of ACO algorithms depends on the relative lengths of edges in the TSP. These edge
lengths are often stored in a TSP cost matrix. The probability with which an ant
chooses the next node in its solution depends, among other things, on the rela-
tive length of edges connecting to the nodes being considered (Equation ( 2.2 on
page 39)). Intuitively, it would seem that a high variance in the distribution of edge
lengths would result in a different problem to a low variance in the distribution of
edge lengths. This has already been investigated for exact algorithms for the TSP
1 Recall from Section 6.7.6 on page 127 that edge length mean was held constant during all experiments. The results in terms of standard deviation of edge lengths can therefore be interpreted in terms of the scale-free ratio of standard deviation to mean edge length.
(Section 4.1 on page 77, [26]). It was shown that the standard deviation of edge
lengths in a TSP instance has a significant effect on the problem difficulty for an
exact algorithm. This leads us to suspect that standard deviation of edge lengths
may also have a significant effect on problem difficulty for the ACO heuristics.
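A minimal sketch of the kind of construction step involved, in the style of the ACS pseudo-random proportional rule (the thesis's Equation 2.2 together with the q0 threshold); tau, dist and the allowed set are assumed inputs:

    import random

    def choose_next_city(current, allowed, tau, dist,
                         alpha=1.0, beta=2.0, q0=0.9, rng=random):
        """With probability q0 exploit the best edge; otherwise sample a
        city with probability proportional to tau^alpha * eta^beta, where
        eta = 1 / edge length, so relative edge lengths drive the choice."""
        scores = {j: tau[current][j] ** alpha * (1.0 / dist(current, j)) ** beta
                  for j in allowed}
        if rng.random() < q0:
            return max(scores, key=scores.get)        # exploitation
        r, acc = rng.random() * sum(scores.values()), 0.0
        for j, s in scores.items():                   # biased exploration
            acc += s
            if acc >= r:
                return j
        return j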
This research is worthwhile for several reasons. Current research on ACO algo-
rithms for the TSP does not report the problem characteristic of standard deviation
of edge lengths. Assuming that such a problem characteristic affects performance,
this means that for instances of the same or similar sizes, differences in perfor-
mance are confounded (Section A.1.6 on page 212) with possible differences in
standard deviation of edge lengths. Consequently, too much variation in perfor-
mance is attributed to problem size and none to problem edge length standard
deviation. Furthermore, in attempts to model ACO performance, all important
problem characteristics must be incorporated into the model so that the relation-
ship between problems, tuning parameters and performance can be understood.
With this understanding, performance on a new instance can be satisfactorily pre-
dicted given the salient characteristics of the instance.
7.2 Research question and hypothesis
The research question of this case study can be phrased as follows:
Does the variability of edge lengths in the Travelling Salesperson Prob-
lem affect the difficulty of the problem for the ACO metaheuristic?
This can be refined to the following research hypotheses, phrased in terms of
either MMAS or ACS:
• Null Hypothesis H0: the standard deviation of edge lengths in TSP instances’
cost matrices has no effect on the average quality of solutions produced by
the algorithm.
• Alternative Hypothesis H1: the standard deviation of edge lengths in TSP
instances’ cost matrices affects the average quality of solutions produced by
the algorithm.
7.3 Method
7.3.1 Response Variable
The response variable was solution quality, measured as per Section 6.7.3 on page 127. The response was measured after 1000, 2000, 3000, 4000 and 5000
iterations.
7.3.2 Instances
Instances were generated as per Section 6.7.1 on page 126. Standard deviation of the TSP cost matrix was varied across five levels: 10, 30, 50, 70 and 100. Three problem sizes, 300, 500 and 700, were used in the experiments. The same instances
were used for the two algorithms and the same instance was used for replicates of
a design point.
7.3.3 Factors, Levels and Ranges
Design Factors
There were two design factors. The first was the standard deviation of edge lengths
in an instance. This was a fixed factor, since its levels were set by the experimenter.
Five levels: 10, 30, 50, 70 and 100 were used. The second factor was the individual
instances with a given level of standard deviation of edge lengths. This was a
random factor since instance uniqueness was caused by the problem generator
and so was not under the experimenter’s direct control. Ten instances were created
within each level of edge length standard deviation.
Held-constant Factors
There were many common held-constant factors as per Section 6.7.6 on page 127.
This study also contained further held-constant factors. Problem size was fixed
for a given experiment. Sizes of 300, 500 and 700 were investigated. Two ACO
algorithms were investigated: MMAS (Section 2.4.6 on page 42) and ACS (Sec-
tion 2.4.5 on page 39). These algorithms were chosen because they are claimed to
be the best performing of the ACO heuristics and because they are representative
of the two main types of ACO heuristic: Ant System descendants and non-Ant System descendants respectively. The held-constant tuning parameter settings for the
heuristics are listed in the following table.
Parameter Symbol ACS MMAS
Ants m 10 25
Pheromone emphasis α 1 1
Heuristic emphasis β 2 2
Candidate List length 15 20
Exploration threshold q0 0.9 N/A
Pheromone decay ρglobal 0.1 0.8
Pheromone decay ρlocal 0.1 N/A
Solution construction Sequential Sequential
Table 7.1: Parameter settings for the problem difficulty experiments with the ACS and MMAS algorithms. Values are taken from the original publications [118, 47]. See Section 2.4 on page 34 for a description of these tuning parameters and the MMAS and ACS heuristics.
It is important to stress that this research’s use of parameter values from the
literature by no means implies support for such a ‘folk’ approach to parameter
selection in general. Selecting parameter values as done here strengthens any
conclusions in two ways. It shows that results were not contrived by searching for
a unique set of tuning parameter values that would demonstrate the hypothesised
effect. Furthermore, it makes the research conclusions applicable to all other
research that has used these tuning parameter settings without the justification
of a methodical tuning procedure, as proposed in this thesis. Recall from the
motivation (Section 7.1 on page 133) that demonstrating an effect of edge length
standard deviation on performance with even one set of tuning parameter values
is sufficient to merit the factor’s consideration in parameter tuning studies. The
results from this research will therefore directly affect the results on parameter
tuning in later chapters.
7.3.4 Experiment design, power and replicates
The experiment design is a two-stage nested design (Section 6.2.1 on page 108).
The standard deviation of edge lengths is the parent factor. The individual in-
stances factor is nested within this.
The heuristics are probabilistic (Section 2.4) and so repeated runs with identi-
cal inputs (instances, parameter settings etc.) will produce different results. All
treatments were thus replicated in a work-up procedure (Section A.6.1 on page 227) until a sufficient power of 80% was reached to detect an effect at the study's significance level of 1%. Power was calculated with Lenth's power calculator [76]. For all
experiments, 10 replicates were sufficient to meet these requirements.
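A minimal sketch of the work-up idea, assuming statsmodels is available; FTestAnovaPower approximates power for a one-way fixed-effects ANOVA with k groups (here the five standard-deviation levels), whereas Lenth's calculator handles the nested design directly:

    from statsmodels.stats.power import FTestAnovaPower

    def replicates_needed(effect_size, k_groups=5, alpha=0.01, target=0.80):
        """Increase replication until the given effect size is detectable
        with the target power at the chosen significance level."""
        calc = FTestAnovaPower()
        replicates = 2
        while True:
            power = calc.power(effect_size=effect_size,
                               nobs=replicates * k_groups,
                               alpha=alpha, k_groups=k_groups)
            if power >= target:
                return replicates, power
            replicates += 1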
7.3.5 Performing the Experiment
Randomised run order
Available computational resources necessitated running experiments across a va-
riety of similar machines. Runs were executed in a randomised order across
these machines to counteract any uncontrollable nuisance factors. While such
randomising is strictly not necessary when measuring a machine-independent re-
sponse, it is good practice nonetheless.
Stopping Criterion
Experiments were halted using a fixed-iteration stopping criterion (Section 3.13 on page 73). The number of fixed iterations was 5000. A potential problem with this
approach is that the choice of combinatorial count can bias the results. Should
we stop after 1000 iterations or 1001? Taking response measurements after 1000,
2000, 3000, 4000 and 5000 iterations mitigated this concern. The data were
separately analysed at the 1000 and 5000 measurement points.
7.4 Analysis
7.4.1 ANOVA
The two-stage nested designs were analysed with the General Linear Model. Stan-
dard deviation was treated as a fixed factor since we explicitly chose its levels and
instance was treated as a random factor. The technical reasons for this decision
in the context of experiments with heuristics have recently been well explained in
the metaheuristics literature [6].
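A minimal sketch of the nested analysis with NumPy, assuming a response array y[a, b, n] of a parent levels, b nested instances per level and n replicates per instance; because instances are random, the parent factor is tested against the nested mean square rather than the pure error:

    import numpy as np

    def nested_anova(y):
        a, b, n = y.shape
        grand = y.mean()
        inst_means = y.mean(axis=2)             # one mean per instance
        level_means = y.mean(axis=(1, 2))       # one mean per parent level
        ss_parent = b * n * ((level_means - grand) ** 2).sum()
        ss_nested = n * ((inst_means - level_means[:, None]) ** 2).sum()
        ss_error = ((y - inst_means[..., None]) ** 2).sum()
        ms_parent = ss_parent / (a - 1)
        ms_nested = ss_nested / (a * (b - 1))
        ms_error = ss_error / (a * b * (n - 1))
        return ms_parent / ms_nested, ms_nested / ms_error   # F statistics

    rng = np.random.default_rng(0)
    f_parent, f_nested = nested_anova(rng.normal(size=(5, 10, 10)))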
To make the data amenable to statistical analysis, a transformation (as per
Section A.4.3 on page 222) of the responses was required for each analysis. The
transformations were either a log10, an inverse square root or a square root transformation.
Outliers were deleted and the model building repeated until the models passed
the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality,
constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221).
Figure 7.1 lists the number of data points deleted in each analysis.
Algorithm   Problem size   ADA   Relative Error
ACS         300            3     3
            500            0     2
            700            5     4
MMAS        300            3     7
            500            1     2
            700            4     2
Figure 7.1: Number of outliers deleted during the analysis of each problem difficulty experiment. Each experiment had a total of 500 data points.
The number of outliers deleted in each case is very small in comparison to the
total number of 500 data points. Further details on these analyses and diagnostics
are available in many textbooks [84] and in Appendix A.
7.5 Results
In all cases, the effect of standard deviation of edge length on solution quality
was deemed statistically significant at the p < 0.01 level. The effect of individual
instance was also deemed statistically significant at the p < 0.01 level; however, an examination of the data shows that this effect was not of practical significance.
The following figures illustrate box plots of the data for the problem sizes of
300 and 700 for ACS and MMAS. The same trends were observed for problems
of size 500 and so these plots are omitted. In each box-plot, the horizontal axis
shows the five levels of standard deviation of the instances’ edge lengths at the five
measurement points, 1000, 2000, 3000, 4000 and 5000 iterations. The vertical
axis shows the solution quality response in its original scale. There is a separate
plot for each algorithm and each problem size. Vertical axes have not been set to
the same scale in the various plots. This is to discourage performance comparisons
between plots because parameters had not been tuned to the different problem
sizes. Outliers have been included in these plots.
An examination of the plots shows that only standard deviation had a practically
significant effect on solution quality.
At each measurement point, there was a slight improvement in the response.
Figure 7.2: Relative Error response for ACS on problems of size 300, mean 100.
Figure 7.3: Relative Error response for ACS on problems of size 700, mean 100.
Figure 7.4: Relative Error response for MMAS on problems of size 300, mean 100.
Figure 7.5: Relative Error response for MMAS on problems of size 700, mean 100.
This is to be expected since the metaheuristic has a larger number of iterations in
which to solve the problems. In all cases, problem instances with a lower standard
deviation had a significantly lower relative error value than instances with a higher
standard deviation.
In all cases, there was a higher variability in the relative error response between
instances with a higher standard deviation. The same conclusions were drawn
from an analysis of the ADA quality response.
7.6 Conclusions
For ACS and MMAS, applied to TSP instances generated with log-normally dis-
tributed edge lengths such that all instances have a fixed cost matrix mean of 100
and a cost matrix standard deviation varying from 10 to 100:
1. a change in cost matrix standard deviation leads to a statistically and prac-
tically significant change in the difficulty of the problem instances for these
algorithms.
2. there is no practically significant difference in difficulty between instances
that have the same size, cost matrix mean and cost matrix standard deviation.
3. there is no practically significant difference between the difficulty measured
after 1000 algorithm iterations and 5000 algorithm iterations.
Difficulty here means relative error from an optimum and the adjusted differ-
ential approximation. We therefore reject the null hypothesis of Section 7.2 on
page 134.
7.6.1 Implications
These results are important for the ACO community for the following reasons:
• They demonstrate in a rigorous, designed experiment fashion, that quality of
solution of an ACO TSP algorithm is affected by the standard deviation of the
cost matrix.
• They demonstrate that cost matrix standard deviation must be considered as
a factor when building predictive models of ACO TSP algorithm performance.
• They clearly show that performance analysis papers using ACO TSP algo-
rithms must report instance cost matrix standard deviation as well as in-
stance size since two instances with the same size can differ significantly in
difficulty.
• They motivate an improvement in benchmark libraries so that they provide a
wider crossing of both instance size and instance cost matrix standard devia-
tion. Plots of instances in TSPLIB show that they generally have the same shaped distribution of edge costs (Appendix B).
7.6.2 Assumptions and restrictions
For completeness and for clarity, we state that this case study did not examine the
following issues.
• It did not examine clustered problem instances or grid problem instances.
These are other common forms of TSP in which nodes appear in clusters and
in a very structured grid pattern respectively. The conclusions should not be
applied to other TSP types without a repetition of this case study.
• Algorithm performance was not being examined since no claim was made
about the suitability of the parameter values for the instances encountered.
Rather, the aim was to demonstrate an effect for standard deviation and so
argue that it should be included as a factor in experiments that do examine
algorithm performance. These experiments are the subject of subsequent
case studies in the thesis.
• We cannot make a direct comparison between algorithms since algorithms
were not tuned methodically. That is, we are not entitled to say that ACS did
better than MMAS on, say, instance X with a standard deviation of Y.
• We cannot make a direct comparison of the response values for different sized
instances. Clearly, 3000 iterations explores a bigger fraction of the search space for 300-city problems than for 500-city problems. Such a comparison
could be made if it was clear how to scale iterations with problem size. Such
scaling is an open question.
7.7 Chapter summary
This Chapter presented a case study on determining whether a problem charac-
teristic has an effect on the difficulty of problems for a given heuristic. Specifically,
it investigated whether the standard deviation of edge lengths in TSP instances
affects the quality of solutions produced by the MMAS and ACS heuristics.
The result, that symmetric TSP edge length standard deviation affects problem difficulty, is a new result for the ACO community. This result is particularly im-
portant for approaches to modelling and analysing ACO performance. This will be
illustrated in subsequent chapters in the thesis.
8 Case study: Screening Ant Colony System
This Chapter reports a case study on screening the factors affecting the perfor-
mance of a heuristic. The methodology for this case study was described in Chap-
ter 6. The particular heuristic studied is Ant Colony System for the Travelling
Salesperson Problem (Section 2.4.5 on page 39).
This chapter reports many new results for ACS. Established tuning parameters
previously thought to affect performance are actually shown to have no effect at
all. New tuning parameters that were thought to affect performance are investi-
gated. A new TSP problem characteristic is shown to have a very strong effect on
performance, confirming the results of Chapter 7. All analyses are conducted for
two performance measures, quality of solution and solution time. This provides an
accurate measure of the heuristic compromise that is rarely seen in the literature.
Finally, it is shown that models of ACS performance must be of a higher order than
linear. The results reported in this Chapter have been published in the literature
[108, 107].
8.1 Method
8.1.1 Response Variables
Three responses were measured as per Section 6.7.3 on page 127. These responses
were percentage relative error from a known optimum (henceforth referred to as
Relative Error), adjusted differential approximation (henceforth referred to as ADA)
and solution time (henceforth referred to as Time).
8.1.2 Factors, levels and ranges
Design Factors
There were 12 design factors, 10 representing the ACS tuning parameters and 2
representing the TSP problem characteristics being investigated. The design fac-
tors and their high and low levels are summarised in the following table. A descrip-
tion of the ACS tuning parameter factors was given in Section 2.4.9 on page 45.
H:solutionConstruction and J:antPlacement could be considered as parameterised
design features as mentioned in Section 6.3.1 on page 111. The antsFraction and
nnFraction are expressed as a percentage of problem size.
The factor ranges were chosen to encompass common parameter values en-
countered in the literature. The experience of other researchers using ACS has been that 10 ants is a good parameter setting. If this is the case, our experi-
ments will confirm this recommendation methodically. Recall that this is a se-
quential experimentation methodology and does not preclude incorporating the
experimenter’s prior knowledge into the chosen factor ranges.
Factor Name Type Low High
A alpha Numeric 1 13
B beta Numeric 1 13
C antsFraction Numeric 1.00 110.00
D nnFraction Numeric 2.00 20.00
E q0 Numeric 0.01 0.99
F rho Numeric 0.01 0.99
G rhoLocal Numeric 0.01 0.99
H solutionConstruction Categoric parallel sequential
J antPlacement Categoric random same
K pheromoneUpdate Categoric bestSoFar bestOfIteration
L problemSize Numeric 300 500
M problemStDev Numeric 10.00 70.00
Table 8.1: Design factors for the screening study with ACS. There are two problem characteristic factors (L-problemSize and M-problemStDev). The remaining 10 factors are ACS tuning parameters.
Held-Constant Factors
The held constant factors are as per Section 6.7.6 on page 127.
8.1.3 Instances
All TSP instances were of the symmetric type and were created as per Section 6.7.1
on page 126. The TSP problem instances ranged in size from 300 cities to 500
cities with cost matrix standard deviation ranging from 10 to 70. All instances had
a mean of 100. The same instances were used for each replicate of a design point.
8.1.4 Experiment design, power and replicates
The experiment design was a Resolution IV 2^(12-5) fractional factorial with 24 centre points.
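A minimal sketch of constructing such a design in coded units; the five generator words below are illustrative (chosen so that all defining words have length at least four, giving Resolution IV) and are not necessarily the generators used in the thesis:

    import numpy as np
    from itertools import product

    # Full 2^7 factorial in the seven base factors (128 runs).
    base = np.array(list(product([-1, 1], repeat=7)))
    # Five generated columns as products of base columns (2^(12-5) design).
    generators = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [0, 4, 5, 6]]
    extra = np.column_stack([base[:, g].prod(axis=1) for g in generators])
    design = np.vstack([np.hstack([base, extra]),
                        np.zeros((24, 12))])     # append 24 centre points
    print(design.shape)   # (152, 12); 8 replicates of the 128 factorial
                          # runs plus the 24 centre points give 1048 runs.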
The number of replicates was 8, determined using the work-up procedure (Sec-
tion A.6.1 on page 227) until a power of 80% was achieved for a significance level
of 5% when detecting an effect size of 0.18 standard deviations. This yielded a total
of 1048 runs. Figure 8.1 gives the descriptive statistics for the collected data and
the actual effect size for each response.
Iterations                  Time      Relative Error   ADA
50    Mean                  51.79     7.77             4.15
      StDev                 103.90    10.36            4.06
      Max                   1173.42   56.63            20.21
      Min                   0.13      0.45             0.64
      Effect size*          18.70     1.87             0.73
100   Mean                  106.48    7.55             4.06
      StDev                 241.33    10.23            4.04
      Max                   3211.93   56.37            20.21
      Min                   0.20      0.45             0.64
      Effect size*          43.44     1.84             0.73
150   Mean                  161.81    7.42             4.00
      StDev                 391.29    10.13            4.01
      Max                   5813.52   56.37            20.21
      Min                   0.29      0.40             0.43
      Effect size*          70.43     1.82             0.72
200   Mean                  238.37    7.35             3.95
      StDev                 673.29    10.08            3.97
      Max                   7828.67   56.37            20.21
      Min                   0.37      0.40             0.43
      Effect size*          121.19    1.81             0.71
250   Mean                  285.87    7.30             3.91
      StDev                 786.47    10.02            3.94
      Max                   9898.04   56.37            20.21
      Min                   0.44      0.37             0.43
      Effect size*          141.56    1.80             0.71
* Actual effect size of 0.18 standard deviations.
Figure 8.1: Descriptive statistics for the ACS screening experiment. Statistics are given at five stagnation points ranging from 50 to 250 iterations. The actual effect size equivalent to 0.18 standard deviations is also listed for each response variable.
8.1.5 Performing the experiment
Responses were measured at five stagnation levels. An examination of the de-
scriptive statistics in Figure 8.1 verifies that the stagnation level did not have a
large effect on the response values and therefore the conclusions after a 250 iter-
ation stagnation should be the same as after lower iteration stagnations. The two
solution quality responses show a small but practically insignificant decrease in solution error as the number of stagnation iterations is increased.
The Time response increases with increasing stagnation iterations because the
experiments take longer to run. In all cases, the level of stagnation iterations has no practically significant effect on the three responses. ACS did not make
any large improvements when allowed to run for longer. It is therefore sufficient to
perform analyses at the 250 iterations stagnation level.
8.2 Analysis
8.2.1 ANOVA
Effects for each response model were selected using stepwise regression (Sec-
tion A.4.1 on page 220) applied to a full 2 factor interaction model with an alpha
out threshold of 0.10. Some terms removed by stepwise regression were added
back into the final model to preserve hierarchy.
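A minimal sketch of the backward-elimination part of that selection, assuming statsmodels and a pandas DataFrame X whose columns are the candidate effects (the hierarchy-preserving step is omitted):

    import statsmodels.api as sm

    def backward_eliminate(X, y, alpha_out=0.10):
        """Repeatedly drop the effect with the largest p-value until all
        remaining effects are below the alpha-out threshold."""
        cols = list(X.columns)
        while cols:
            model = sm.OLS(y, sm.add_constant(X[cols])).fit()
            pvals = model.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] <= alpha_out:
                return model, cols
            cols.remove(worst)
        return None, []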
To make the data amenable to statistical analysis, a transformation of the re-
sponses was required for each analysis. The transformation was a log10 for all three
responses.
Outliers were deleted and the model building repeated until the models passed
the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality,
constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221).
35 data points (3% of total data) were removed when analysing Relative Error. 36
data points (3% of total data) were removed when analysing ADA. 32 data points
(3% of total data) were removed when analysing Time.
8.2.2 Confirmation
Confirmation experiments were run according to the methodology detailed in Sec-
tion 6.3.7 on page 114 in order to confirm the accuracy of the ANOVA models. Re-
call that the general idea is to run the algorithm on new randomly chosen combina-
tions of parameter settings and problem characteristics and compare performance
to the ANOVA models’ predictions. The randomly chosen treatments produced
actual algorithm responses with the descriptives listed in the following figure.
                     Max     Min    Mean    StDev
Relative Error 250   85.59   0.91   11.78   17.60
ADA 250              39.37   0.93   7.57    9.69
Time 250             16141   1      477     1869
Figure 8.2: Descriptive statistics for the confirmation of the ACS screening ANOVA. The response data is from runs of the actual algorithm on the randomly generated confirmation treatments with a stagnation stopping criterion of 250 iterations.
The large ranges of each response reinforce the motivation for correct parameter
tuning as there is clearly a high cost in incorrectly tuned parameters.
The next three figures illustrate the 95% prediction intervals (Section A.3.5 on
page 219) and actual data for the three response models, Relative Error, ADA and
Time respectively.
Figure 8.3: 95% Prediction intervals for the ACS screening of Relative Error.
Figure 8.4: 95% Prediction intervals for the ACS screening of ADA.
The two solution quality responses are well predicted by their models. The
models match the trends of the actual data, successfully picking up the extremely
low and extremely high response values which vary over a range of 85% for relative
error and 39 for ADA. Both quality models tend to overestimate high values.
The time response is also well predicted by its model. Time is subject to the
nuisance factor of different experiment machines and is a more variable response
due to the nature of the ACS algorithm. The extremely high times and extremely
low times are predicted well for all 50 treatments. This was achieved over a range of over 16000s (see Figure 8.2 on page 146).
The three ANOVA models are therefore satisfactory predictors of the three ACS
performance responses for factor values within 10% of the factor range limits listed
in Section 8.1 on page 144.
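A minimal sketch of the interval check underlying these confirmation plots, assuming a fitted statsmodels OLS model on the log10-transformed response and a matrix of confirmation treatments X_new:

    import statsmodels.api as sm

    def fraction_inside_pi(model, X_new, y_actual_log10, alpha=0.05):
        """Fraction of confirmation runs falling inside the model's
        95% prediction intervals."""
        exog = sm.add_constant(X_new, has_constant="add")
        frame = model.get_prediction(exog).summary_frame(alpha=alpha)
        inside = ((y_actual_log10 >= frame["obs_ci_lower"]) &
                  (y_actual_log10 <= frame["obs_ci_upper"]))
        return inside.mean()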
8.3 Results
Figure 8.6 on the next page gives a summary of the Sum of Squares ranking and
the ANOVA F and p values for the three responses. Only the main effects are listed.
Those main effects that rank in the top 12 are highlighted in bold. Rankings
are based on the full two factor interaction model of 78 terms, before stepwise
regression was applied.
Figure 8.5: 95% Prediction intervals for the ACS screening of Time.
                        Relative Error            ADA                       Time
Main Effect             Rank  F value   p value   Rank  F value   p value   Rank  F value    p value
A-alpha                 62    1.15      0.2840    61    1.38      0.2405    16    12.01      0.0006
B-beta                  3     2103.73   < 0.0001  3     2128.02   < 0.0001  13    16.70      < 0.0001
C-antsFraction          11    305.13    < 0.0001  10    310.74    < 0.0001  1     37566.17   < 0.0001
D-nnFraction            4     1739.78   < 0.0001  4     1744.51   < 0.0001  3     1680.96    < 0.0001
E-q0                    2     6920.35   < 0.0001  2     6955.10   < 0.0001  7     41.76      < 0.0001
F-rho                   45    4.63      0.0317    44    5.08      0.0244    56    0.94       0.3313
G-rhoLocal              13    227.57    < 0.0001  12    226.37    < 0.0001  41    2.88       0.0901
H-solutionConstruction  14    151.18    < 0.0001  13    154.72    < 0.0001  4     80.62      < 0.0001
J-antPlacement          75    0.01      0.9386    75    0.00      0.9831    54    1.28       0.2576
K-pheromoneUpdate       47    4.41      0.0360    46    4.85      0.0278    5     49.39      < 0.0001
L-problemSize           12    289.62    < 0.0001  18    52.52     < 0.0001  2     3334.25    < 0.0001
M-problemStDev          1     43076.29  < 0.0001  1     9426.17   < 0.0001  43    2.57       0.1096
Figure 8.6: Summary of ANOVAs for Relative Error, ADA and Time. Only the main effects are shown. Rankings are based on the full two factor interaction model of 78 terms, before backward selection was applied.
This table contains several results and answers to some open questions in the
ACO literature.
8.3.1 Screened factors
J-AntPlacement is statistically insignificant at the 0.05 level for both quality re-
sponses and the time response. J-AntPlacement can therefore be directly screened
out.
Factor F-rho is statistically insignificant at the 0.05 level for the time response
and statistically significant for both quality responses. Despite its significance, the
rankings of F-Rho are very low (45, 44 and 56 out of 78 effects) across all three
responses. F-rho is therefore screened out.
A-Alpha is statistically insignificant at the 0.05 level for both quality responses
but statistically significant at the 0.05 level for the time response. This is reflected
in the low rankings for the quality responses and the high ranking (almost within
the top 12) for the time response. A-alpha should be set to 1 since this requires
marginally less time to compute the ACS decision equation than a higher value of
A-alpha.
K-pheromoneUpdate is statistically significant for all three responses but has
a very low ranking for both quality responses. An examination of the plot of time
for the K-pheromoneUpdate factor shows that the effect on time is not practically
significant. K-pheromoneUpdate can therefore be screened out.
In summary, 4 factors are screened out, A-Alpha, F-Rho, J-AntPlacement and
K-PheromoneUpdate.
8.3.2 Relative Importance of Factors
Of the two problem characteristics, the factor with the larger effect on solution
quality is the problem standard deviation M-problemStDev. The problem size L-
problemSize has a stronger effect on solution time than M-problemStDev. ACS
takes longer to reach stagnation on larger problem instances than smaller in-
stances.
Of the remaining unscreened tuning parameters, the heuristic exponent B-beta,
the number of ants C-antsFraction, the length of candidate lists D-nnFraction and
the exploration/exploitation threshold E-q0 have the strongest effects on solution
quality. The same is true for solution time.
These results are important because they highlight the most important tuning
parameters and problem characteristics in terms of both heuristic performance
dimensions. These factors are the minimal set of design factors one should exper-
iment with when modelling and tuning ACS.
8.3.3 Adequacy of a Linear Model
A test for curvature as per Section 6.3.9 on page 116 shows there is a significant
amount of curvature in all three responses. This means that the linear model for
screening is not adequate to explore the whole design space. A higher order model
of all three responses is required. This is an important result because it confirms
that a One-Factor-At-a-Time (Section 4.3.3 on page 87) approach is insufficient for
investigating the performance of ACS.
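A minimal sketch of the centre-point curvature test referred to here: the single-degree-of-freedom curvature sum of squares compares the mean of the factorial runs against the mean of the centre points, tested against the pure error from the replicated centre points.

    import numpy as np
    from scipy import stats

    def curvature_test(y_factorial, y_centre):
        nf, nc = len(y_factorial), len(y_centre)
        diff = np.mean(y_factorial) - np.mean(y_centre)
        ss_curvature = nf * nc * diff ** 2 / (nf + nc)
        ms_pure_error = np.var(y_centre, ddof=1)   # from centre replicates
        f = ss_curvature / ms_pure_error           # 1 and nc - 1 d.o.f.
        return f, stats.f.sf(f, 1, nc - 1)         # small p => curvature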
8.3.4 Comparison of Solution Quality Measures
It was already mentioned that two solution quality responses were measured to
investigate whether either response led to different conclusions. An examination of the
ANOVA summaries reveals that both Relative Error and ADA have almost the same
rankings and same statistical significance or insignificance for all factors except L-
problemSize. While L-problemSize has a significant effect on both responses, the
ADA response has a lower ranking of 18 compared to the Relative Error response
ranking of 12. This is due to the nature of how the two responses are calculated
(Section 3.10.2 on page 69).
This result shows that for screening of ACS, the choice of solution quality re-
sponse will not affect the conclusions. However, ADA may be preferable as it ex-
hibits a lower range and variability than Relative Error (Figure 8.1 on page 145).
The advantage of a lower variability was discussed in the context of statistical
power (Section A.6 on page 225).
8.4 Conclusions and discussion
The following conclusions are drawn from the ACS screening study. These conclu-
sions apply for a significance level of 5% and a power of 80% to detect the effect
sizes listed in Figure 8.1 on page 145. These effect sizes are a change in solution
time of 140s, a change in Relative Error of 1.8% and a change in ADA of 0.71.
Issues of power and effect size are discussed in Section A.6 on page 225.
• Tuning Ant placement not important. The type of ant placement has no
significant effect on ACS performance in terms of solution quality or solution
time. This was an open question in the literature. It is remarkable because
intuitively one would expect a random scatter of ants across the problem
graph to explore a wider variety of possible solutions. This result shows that
this is not the case.
• Tuning alpha not important. Alpha has no significant effect on ACS per-
formance in terms of solution quality or solution time. This confirms the
common recommendation in the literature of setting alpha equal to 1. Alpha
has also been analysed with an OFAT approach in Appendix D on page 235.
• Tuning Rho not important. Rho has no significant effect on ACS perfor-
mance in terms of solution quality or solution time. This is a new result for
ACS. It is a surprising result since Rho is a term in the ACS update phero-
mone equations and other analytical results in a much simplified scenario
suggested it was important [?].
• Tuning Pheromone Update Ant not important. The choice of ant used for pheromone updates has no practically significant effect on any of the three responses, so K-pheromoneUpdate can be screened out.
• Most important tuning parameters. The most important ACS tuning pa-
rameters are the heuristic exponent B-beta, the number of ants as a fraction
of problem size C-antsFraction, the length of candidate lists as a fraction of
problem size D-nnFraction and the exploration/exploitation threshold E-q0.
These are the factors one should focus on as design factors when experimen-
tal resources are limited.
• Problem standard deviation is important. This confirms the main result of
Chapter 7 in identifying a new TSP problem characteristic that has a signif-
icant effect on the difficulty of a problem for ACS. ACO research should be
reporting this characteristic in the literature.
• Higher order model needed. A higher order model, greater than linear, is
required to model ACS solution quality and ACS solution time. This is an
important result because it demonstrates for the first time that simple OFAT
approaches seen in the literature are insufficient for accurately modelling and
tuning ACS performance.
• Comparison of solution quality responses. There is no difference in the con-
clusions of the screening study using either the ADA or Relative Error solu-
tion quality responses. ADA has a slightly smaller variability and so results
in more powerful experiments than Relative Error.
8.5 Chapter summary
This chapter has presented a case study on screening the tuning parameters and
problem characteristics that affect the performance of ACS. This illustrated the
application of the methodology in Section 6.3 on page 110 with a fully instantiated
ACO heuristic, Ant Colony System. Many new results were presented and exist-
ing recommendations in the literature were confirmed in a rigorous fashion. In
the next chapter, these results will be used to efficiently build an accurate model
of ACS performance. The full model and the reduced model using the screening
results will be compared. This will confirm that the screening decisions recom-
mended in this study were correct.
9 Case study: Tuning Ant Colony System
This Chapter reports a case study on tuning the factors affecting the performance
of a heuristic. The methodology for this case study was described in Chapter
6. The particular heuristic studied is Ant Colony System (ACS) for the Travelling
Salesperson Problem (Section 2.4.5 on page 39).
This chapter reports many new results for ACS. All analyses are conducted for
two performance measures, quality of solution and solution time. This provides an
accurate measure of the heuristic compromise that is rarely seen in the literature.
It is shown that models of ACS performance must be at least quadratic. ACS is
tuned using a full parameter set and a screened parameter set resulting from the
case study of the previous chapter. This verifies that screening decisions from the
previous chapter are correct.
The results reported in this Chapter have been published in the literature [106].
9.1 Method
9.1.1 Response Variables
Three responses were measured as per Section 6.7.3 on page 127. These responses
were percentage relative error from a known optimum (henceforth referred to as
Relative Error), adjusted differential approximation (henceforth referred to as ADA)
and solution time (henceforth referred to as Time).
9.1.2 Factors, levels and ranges
Design factors
In the full parameter set RSM, there were 12 design factors as per the screening
study of Chapter 8. The factors and their high and low levels are repeated in the
following table for convenience.
Factor Name Type Low Level High Level
A alpha Numeric 1 13
B beta Numeric 1 13
C antsFraction Numeric 1.00 110.00
D nnFraction Numeric 2.00 20.00
E q0 Numeric 0.01 0.99
F rho Numeric 0.01 0.99
G rhoLocal Numeric 0.01 0.99
H solutionConstruction Categoric parallel sequential
J antPlacement Categoric random same
K pheromoneUpdate Categoric bestSoFar bestOfIteration
L problemSize Numeric 300 500
M problemStDev Numeric 10.00 70.00
Table 9.1: Design factors for the tuning study with ACS. The factor ranges are also given.
Tuning parameters for the Screened model, based on the results of the previous
Chapter, did not include A-alpha, F-rho, J-antPlacement and K-pheromoneUpdate.
These four screened factors took on randomly chosen values within their ranges for
each experiment run.
Held-Constant Factors
The held constant factors are as per Section 6.7.6 on page 127.
9.1.3 Instances
All TSP instances were of the symmetric type and were created as per Section 6.7.1
on page 126. The TSP problem instances ranged in size from 300 cities to 500
cities with cost matrix standard deviation ranging from 10 to 70. All instances had
a mean of 100. The same instances were used for each replicate of a design point.
9.1.4 Experiment design, power and replicates
The experiment design was a Minimum Run Resolution V Face-Centred Composite
(Section A.3.4 on page 218) with six centre points.
The number of replicates was increased in a work-up procedure (Section A.6.1
on page 227) until a power of 80% was achieved for a significance level of 5% when
detecting a given effect size. The next two figures give the descriptive statistics
for the collected data and the actual effect size for each response in the full and
screened experiments with the FCC design.
Iterations                  Time       Relative Error   ADA
50    Mean                  65.33      11.01            5.60
      StDev                 194.96     19.49            7.45
      Max                   3131.77    125.84           41.75
      Min                   0.17       0.55             0.66
      Effect size*          38.99      3.90             1.49
100   Mean                  136.38     10.73            5.44
      StDev                 469.81     19.09            7.24
      Max                   7825.44    124.20           41.51
      Min                   0.28       0.55             0.66
      Effect size*          93.96      3.82             1.45
150   Mean                  204.39     10.57            5.36
      StDev                 681.54     18.95            7.18
      Max                   12075.97   124.20           41.51
      Min                   0.38       0.47             0.66
      Effect size*          136.31     3.79             1.44
200   Mean                  270.25     10.47            5.31
      StDev                 906.16     18.85            7.15
      Max                   15423.77   124.20           41.42
      Min                   0.50       0.46             0.60
      Effect size*          181.23     3.77             1.43
250   Mean                  341.36     10.40            5.27
      StDev                 1121.20    18.81            7.13
      Max                   15573.66   123.74           41.42
      Min                   0.61       0.46             0.60
      Effect size*          224.24     3.76             1.43
* Actual effect size of 0.2 standard deviations.
Figure 9.1: Descriptive statistics for the full ACS FCC design. The actual detectable effect size of 0.2 standard deviations is shown for each response and for each stagnation point. There is little practical difference in effect size of the solution quality responses for an increase in the stagnation point.
The full design could achieve sufficient power with 5 replicates while detecting
an effect of size 0.2 standard deviations.
The screened design could achieve sufficient power with 10 replicates while
detecting an effect of size 0.29 standard deviations. Unfortunately, experimental
resources did not permit using a larger number of replicates. This is further mo-
tivation for the use of DOE and fractional factorials. Without the vast savings of
fractional factorials (Section A.3.3 on page 218) this experiment would have been
completely infeasible.
9.1.5 Performing the experiment
Responses were measured at five stagnation levels. An examination of the de-
scriptive statistics verifies that the stagnation level did not have a large effect on
the response values and therefore the conclusions after a 250 iteration stagnation
should be the same as after lower iteration stagnations. The two solution quality
Iterations                  Time      Relative Error   ADA
50    Mean                  53.70     10.70            6.80
      StDev                 73.01     16.85            9.89
      Max                   718.67    92.85            42.59
      Min                   0.18      0.65             0.90
      Effect size*          21.17     4.89             2.87
100   Mean                  106.44    10.48            6.69
      StDev                 143.81    16.69            9.82
      Max                   1274.12   92.85            42.59
      Min                   0.36      0.62             0.81
      Effect size*          41.70     4.84             2.85
150   Mean                  160.90    10.35            6.63
      StDev                 225.24    16.60            9.79
      Max                   1653.75   92.85            42.46
      Min                   0.50      0.62             0.68
      Effect size*          65.32     4.82             2.84
200   Mean                  208.49    10.28            6.57
      StDev                 278.26    16.57            9.74
      Max                   1801.84   92.45            42.46
      Min                   0.60      0.55             0.64
      Effect size*          80.70     4.81             2.83
250   Mean                  262.77    10.22            6.54
      StDev                 363.71    16.52            9.72
      Max                   3721.72   92.45            42.15
      Min                   0.69      0.55             0.64
      Effect size*          105.47    4.79             2.82
* Actual effect size of 0.29 standard deviations.
Figure 9.2: Descriptive statistics for the screened ACS FCC design. The actual detectable effect size of 0.29 standard deviations is shown for each response and for each stagnation point. There is little practical difference in effect size of the solution quality responses for an increase in the stagnation point.
responses show a small but practically insignificant decrease in solution error as
the number of stagnation iterations is increased.
The Time response increases with increasing stagnation iterations because the
experiments take longer to run. In all cases, the level of stagnation iterations has no practically significant effect on the three responses. ACS did not make
any large improvements when allowed to run for longer. It is therefore sufficient to
perform analyses at the 250 iterations stagnation level.
9.2 Analysis
9.2.1 Fitting
A fit analysis was conducted for each response in the full experiments and the
screened experiments. For both the full and screened cases, at least a quadratic model was required to model the responses. For the Minimum Run Resolution V
Face-Centred Composite, cubic models are aliased and so were not considered.
9.2.2 ANOVA
Effects for each response model were selected using stepwise regression (Sec-
tion A.4.1 on page 220) applied to a full quadratic model with an alpha out thresh-
old of 0.10. Some terms removed by stepwise regression were added back into the
final model to preserve hierarchy.
To make the data amenable to statistical analysis, a transformation of the re-
sponses was required for each analysis. The transformation was a log10 for all three
responses.
Outliers were deleted and the model building repeated until the models passed
the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality,
constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221).
138 data points (∼5% of total data) were removed when analysing the full model of
ADA-Time. 122 data points (∼5% of total data) were removed when analysing the
full model of RelativeError-Time. 47 data points (∼5% of total data) were removed
when analysing the screened model of ADA-Time. 15 data points (∼2% of total
data) were removed when analysing the screened model of RelativeError-Time.
9.2.3 Confirmation
Confirmation experiments were run according to the methodology detailed in Sec-
tion 6.4.6 on page 121 in order to confirm the accuracy of the ANOVA models.
The randomly chosen treatments produced actual algorithm responses with the
descriptive statistics listed in the following figure.
The large ranges of each response reinforce the motivation for correct parameter
tuning as there is clearly a high cost in incorrectly tuned parameters.
The next two figures illustrate the 95% prediction intervals (Section A.3.5 on
page 219) and actual confirmation data for full and screened response surface
models of Relative Error and Time.
Iterations         Time      Relative Error   ADA
100   Mean         70.14     7.01             3.83
      StDev        84.14     3.48             3.27
      Max          528.28    17.65            16.15
      Min          1.55      3.12             1.00
150   Mean         109.80    6.88             3.77
      StDev        130.37    3.41             3.24
      Max          774.39    17.65            16.15
      Min          2.17      2.77             1.00
200   Mean         169.66    6.75             3.71
      StDev        220.76    3.38             3.22
      Max          1084.58   16.92            15.48
      Min          2.83      2.77             1.00
250   Mean         220.19    6.69             3.67
      StDev        287.57    3.35             3.18
      Max          1652.54   16.84            15.41
      Min          3.45      2.77             1.00
Figure 9.3: Descriptive statistics for the confirmation of the ACS tuning. The response data is from runs of the actual algorithm on the randomly generated confirmation treatments.
Figure 9.4: 95% Prediction intervals for the full ACS response surface model of RelativeError-Time. The horizontal axis is the randomly generated treatment. The vertical axis is the Relative Error or Time response.
Figure 9.5: 95% Prediction intervals for the screened ACS response surface model of RelativeError-Time. Screening was conducted in the previous Chapter. The horizontal axis is the randomly generated treatment. The vertical axis is the Relative Error or Time response.
For both the screened and full models, the predictions are very similar for both
the Relative Error and Time responses. This shows that the screening decisions
from the previous case study were correct. Looking at the predictions in general,
we see that Time was better predicted than Relative Error. The Time models match
all the trends in the actual data. The Relative Error models, however, exhibit some
false peaks and miss some actual peaks.

Similar results were observed for the full and screened models of ADA-Time.
The ADA model failed to predict one more peak than the Relative Error model did.

The RelativeError-Time and ADA-Time models are therefore deemed good
predictors of the ACS responses.
9.3 Results
9.3.1 Screening and relative importance of factors
The following two figures give the ranked ANOVAs of the Relative Error and Time
models from the RelativeError-Time analysis. The terms have been rearranged in
order of decreasing sum of squares so that the largest contributor to the models
comes first.
Rank  Term                    Sum of squares  F value    p value  |  Rank  Term            Sum of squares  F value  p value
 1    J-problemStDev          182.23          121362.13  <0.0001  |  36    EM              0.21            137.62   <0.0001
 2    E-q0                     64.87           43205.00  <0.0001  |  37    F-rho           0.20            134.68   <0.0001
 3    D-nnFraction             22.01           14656.74  <0.0001  |  38    BH              0.20            132.31   <0.0001
 4    DE                       17.71           11794.89  <0.0001  |  39    DF              0.20            131.48   <0.0001
 5    BD                        8.86            5901.77  <0.0001  |  40    GH              0.19            124.68   <0.0001
 6    EJ                        4.61            3068.80  <0.0001  |  41    F^2             0.18            122.13   <0.0001
 7    B-beta                    4.46            2967.32  <0.0001  |  42    J^2             0.15            100.38   <0.0001
 8    AD                        4.36            2904.26  <0.0001  |  43    AH              0.14             94.24   <0.0001
 9    BE                        3.84            2554.84  <0.0001  |  44    A-alpha         0.12             80.01   <0.0001
10    H-problemSize             3.66            2439.84  <0.0001  |  45    FJ              0.12             77.53   <0.0001
11    AJ                        3.38            2250.48  <0.0001  |  46    EK              0.10             63.99   <0.0001
12    AB                        3.26            2174.28  <0.0001  |  47    HJ              0.09             58.21   <0.0001
13    CE                        2.76            1836.57  <0.0001  |  48    DK              0.06             42.16   <0.0001
14    G-rhoLocal                2.57            1714.47  <0.0001  |  49    JK              0.06             42.03   <0.0001
15    D^2                       2.45            1631.98  <0.0001  |  50    DH              0.06             41.72   <0.0001
16    DJ                        2.31            1540.95  <0.0001  |  51    AE              0.06             37.78   <0.0001
17    AF                        2.26            1504.79  <0.0001  |  52    FM              0.05             33.60   <0.0001
18    E^2                       2.24            1492.39  <0.0001  |  53    HK              0.05             32.45   <0.0001
19    BJ                        2.21            1468.81  <0.0001  |  54    CG              0.04             28.05   <0.0001
20    B^2                       2.13            1417.47  <0.0001  |  55    BM              0.04             25.03   <0.0001
21    C-antsFraction            1.93            1286.26  <0.0001  |  56    DM              0.04             25.02   <0.0001
22    EG                        1.88            1253.72  <0.0001  |  57    GK              0.04             24.44   <0.0001
23    CD                        1.69            1125.00  <0.0001  |  58    JM              0.03             22.86   <0.0001
24    CJ                        1.50            1001.96  <0.0001  |  59    BC              0.03             22.76   <0.0001
25    EF                        1.30             862.88  <0.0001  |  60    GJ              0.03             19.37   <0.0001
26    EH                        1.23             820.42  <0.0001  |  61    CM              0.03             17.58   <0.0001
27    DG                        0.98             649.57  <0.0001  |  62    BG              0.03             17.42   <0.0001
28    K-solutionConstruction    0.84             562.08  <0.0001  |  63    CH              0.02             16.61   <0.0001
29    AG                        0.42             279.16  <0.0001  |  64    FG              0.02              9.99    0.0016
30    CF                        0.38             254.96  <0.0001  |  65    KM              0.01              7.98    0.0048
31    H^2                       0.32             215.21  <0.0001  |  66    GL              0.01              7.51    0.0062
32    M-pheromoneUpdate         0.28             188.89  <0.0001  |  67    BK              0.01              7.49    0.0062
33    G^2                       0.26             172.66  <0.0001  |  68    GM              0.01              6.05    0.0140
34    A^2                       0.24             162.14  <0.0001  |  69    EL              0.01              5.58    0.0183
35    FH                        0.21             141.53  <0.0001  |  70    L-antPlacement  0.01              4.39    0.0363
71    FL                        0.00               2.92   0.0878  |  72    BF              0.00              2.75    0.0972

Figure 9.6: RelativeError-Time ranked ANOVA of Relative Error response from the full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.
Looking first at the Relative Error rankings, we see that the least important
Rank  Term                    Sum of squares  F value   p value  |  Rank  Term            Sum of squares  F value  p value
 1    C-antsFraction          1214.31         35326.80  <0.0001  |  23    FJ              0.43            12.41    0.0004
 2    C^2                      248.18          7219.98  <0.0001  |  24    AE              0.42            12.36    0.0004
 3    H-problemSize            122.02          3549.80  <0.0001  |  25    G-rhoLocal      0.35            10.30    0.0013
 4    D-nnFraction              31.19           907.44  <0.0001  |  26    AB              0.33             9.68    0.0019
 5    E-q0                       4.88           141.91  <0.0001  |  27    B-beta          0.31             9.09    0.0026
 6    DH                         4.57           132.87  <0.0001  |  28    AG              0.30             8.64    0.0033
 7    EM                         3.37            98.14  <0.0001  |  29    HK              0.27             7.79    0.0053
 8    HJ                         2.75            80.00  <0.0001  |  30    BE              0.25             7.34    0.0068
 9    K-solutionConstruction     2.59            75.23  <0.0001  |  31    BC              0.24             7.08    0.0078
10    M-pheromoneUpdate          2.44            71.03  <0.0001  |  32    FM              0.24             6.88    0.0088
11    EF                         1.95            56.76  <0.0001  |  33    DK              0.22             6.29    0.0122
12    BD                         1.44            41.91  <0.0001  |  34    FH              0.19             5.39    0.0203
13    DE                         1.43            41.68  <0.0001  |  35    AH              0.18             5.31    0.0213
14    GM                         1.31            38.21  <0.0001  |  36    AJ              0.18             5.20    0.0227
15    DM                         1.23            35.79  <0.0001  |  37    GH              0.17             4.91    0.0268
16    CD                         1.23            35.73  <0.0001  |  38    BJ              0.17             4.86    0.0276
17    EG                         1.14            33.10  <0.0001  |  39    AC              0.15             4.29    0.0385
18    CG                         1.04            30.19  <0.0001  |  40    EJ              0.14             4.19    0.0408
19    BM                         0.63            18.19  <0.0001  |  41    J-problemStDev  0.13             3.69    0.0547
20    CH                         0.59            17.20  <0.0001  |  42    BG              0.11             3.26    0.0712
21    CF                         0.50            14.58   0.0001  |  43    AF              0.11             3.06    0.0804
22    JM                         0.45            13.21   0.0003  |  44    HM              0.09             2.76    0.0966
45    A-alpha                    0.01             0.28   0.5963  |  46    F-rho           0.01             0.22    0.6386

Figure 9.7: RelativeError-Time ranked ANOVA of Time response from the full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.
main effects are L-antPlacement, A-alpha, F-rho and M-pheromoneUpdate. These
are exactly the terms that were deemed unimportant in the screening study of the
previous chapter.
By far the most important terms are the main effects of the candidate list length
tuning parameter and the exploration/exploitation tuning parameter, as well as
their interaction. This is a very important result because it shows that candidate
list length, a parameter that we have often seen set at a fixed value or not used at
all, is actually one of the most important parameters to set correctly.
Looking at the Time rankings, we see that L-antPlacement was completely re-
moved from the model. The least important main effects were then F-rho and
A-alpha. These results also confirm the screening decisions. However, the M-
pheromoneUpdate term has now risen in importance in its effect on time.
By far the most important tuning parameters are the amount of ants and the
lengths of their candidate lists. This is quite intuitive, as the amount of processing
is directly related to these parameters. The result regarding the cost of the amount
of ants is particularly important because the amount of ants does not have a
relatively strong effect on solution quality. The extra time cost of using more ants
will not result in gains in solution quality. This is an important result because
it methodically confirms the often recommended practice of setting the number of
ants to a small value (usually 10).
An examination of the ranked ANOVAs from the ADA-Time model gives the same
ranking of tuning parameter contributions to time and ADA. As with the previous
ACS screening study, the choice of solution quality response does not change the
conclusions of the relative importance of the tuning parameters.
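As an aside, this kind of sum-of-squares ranking is easy to reproduce outside Design-Expert. The sketch below is purely illustrative: it fits a small two-factor-interaction model to synthetic data with statsmodels and sorts the ANOVA table by Sum of Squares, largest contributor first.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"q0": rng.uniform(0.01, 0.99, 400),
                   "nnFraction": rng.uniform(1, 20, 400),
                   "beta": rng.uniform(1, 13, 400)})
df["rel_err"] = (6 - 4 * df.q0 + 0.3 * df.nnFraction
                 + 0.5 * df.q0 * df.nnFraction + rng.normal(0, 0.4, 400))

# All main effects and two-factor interactions of the three factors.
model = smf.ols("rel_err ~ (q0 + nnFraction + beta) ** 2", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)          # Type II sums of squares
ranked = anova.drop(index="Residual").sort_values("sum_sq", ascending=False)
print(ranked[["sum_sq", "F", "PR(>F)"]])         # largest contributor first
```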
9.3.2 Tuning
A desirability optimisation is performed as per Section 6.5.2 on page 122. Recall
that equal preference is given to the minimisation of relative error and solution
time. The results from the full and screened RelativeError-Time models are pre-
sented in the following two figures.
Size  StDev  alpha  beta  antsFraction  nnFraction  q0    rho   rhoLocal  solutionConstruction  antPlacement  pheromoneUpdate  Time05  Relative Error  Desirability
300   10     8      2     1.00           1.00       0.99  0.69  0.96      parallel              random        bestSoFar        1.15    0.46            0.96
300   40     13     5     1.00           1.00       0.98  0.95  0.28      sequential            random        bestSoFar        1.46    1.24            0.86
300   70     1      11    1.00          20.00       0.98  0.05  0.70      parallel              random        bestSoFar        1.77    2.18            0.80
400   10     8      4     1.00           1.00       0.99  0.11  0.81      parallel              random        bestSoFar        2.42    0.46            0.92
400   40     13     6     2.19           1.16       0.97  0.99  0.03      parallel              random        bestOfIteration  2.83    1.33            0.82
400   70     1      11    1.61          20.00       0.98  0.01  0.07      parallel              random        bestOfIteration  4.92    2.59            0.73
500   10     7      3     1.13           1.00       0.99  0.86  0.01      parallel              same          bestOfIteration  4.88    0.38            0.88
500   40     13     7     1.00           1.00       0.99  0.99  0.48      parallel              random        bestSoFar        4.25    1.35            0.80
500   70     1      10    1.04          19.78       0.99  0.05  0.01      parallel              same          bestOfIteration  9.24    2.54            0.70

Figure 9.8: Full RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.
Size  StDev  beta  antsFraction  nnFraction  q0    rhoLocal  solutionConstruction  Time05  RelError05  Desirability
300   10     1     1.00           1.00       0.99  0.99      parallel              0.93    0.51        0.98
300   40     1     1.00           1.00       0.99  0.99      sequential            1.13    2.32        0.82
300   70     12    1.00          20.00       0.99  0.01      parallel              2.70    4.45        0.71
400   10     1     1.00           1.00       0.99  0.99      parallel              1.91    0.59        0.93
400   40     1     1.00           1.00       0.99  0.99      sequential            2.21    2.71        0.77
400   70     13    1.00          20.00       0.99  0.01      parallel              5.19    3.68        0.69
500   10     1     1.00           1.00       0.99  0.99      parallel              3.92    0.69        0.87
500   40     5     1.03           1.13       0.99  0.04      sequential            3.71    3.16        0.73
500   70     13    1.00          20.00       0.99  0.01      parallel              9.98    3.01        0.68

Figure 9.9: Screened RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.
The rankings of the ANOVA terms have already highlighted the factors that have
little effect on the responses. These screened factors can take on any values in the
desirability optimisations. This is confirmed by examining the desirability recom-
mendations from the full and screened models. The most important factors, com-
prising the screening model, have recommended settings that strongly agree with
the recommended settings from the full model. For example, beta is always low,
except when the problem standard deviation is high. The exploration/exploitation
threshold q0 is always at a maximum of 0.99, implying that exploitation is always
preferred to exploration. AntsFraction is always low. The remaining unimportant
factors take on a variety of values in the full model desirability optimisation.
The predicted values of time from both models agree with one another to within
a second. The predictions of relative error are higher from the screened model. The
quality of the desirability optimisation recommendations can now be evaluated.
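To make the desirability machinery concrete, the sketch below implements a minimal two-response, Derringer-style desirability with equal weights, applied to predicted Relative Error and Time values. The response ranges and candidate predictions are hypothetical placeholders, not the ones used in the thesis.

```python
import numpy as np

def d_minimise(y, best, worst, weight=1.0):
    """Desirability for a response to be minimised: 1 at or below `best`,
    0 at or above `worst`, and a power-law ramp in between."""
    d = (worst - np.asarray(y, dtype=float)) / (worst - best)
    return np.clip(d, 0.0, 1.0) ** weight

def overall_desirability(rel_err, time_s):
    # Equal weights encode the equal preference for quality and running time.
    d_quality = d_minimise(rel_err, best=0.0, worst=30.0)
    d_time = d_minimise(time_s, best=0.0, worst=100.0)
    return np.sqrt(d_quality * d_time)   # geometric mean of the two

# Hypothetical usage: pick the candidate setting with the highest desirability.
pred_rel_err = np.array([0.46, 1.24, 2.18])   # model predictions per candidate
pred_time = np.array([1.15, 1.46, 1.77])
best = int(np.argmax(overall_desirability(pred_rel_err, pred_time)))
print("most desirable candidate:", best)
```

Design-Expert performs the equivalent optimisation over the fitted response surfaces rather than over a fixed candidate list, but the geometric-mean construction is the same.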
9.3.3 Evaluation of tuned settings
The tuned parameter recommendations from the desirability optimisation are eval-
uated as per the methodology of Section 6.6 on page 124. Some illustrative plots
are given in the following two figures. On each plot, the horizontal axis lists the
randomly generated treatments and the vertical axis lists the response value. Each
plot contains the data for the response recorded using the settings from the de-
sirability optimisation of the full and screened experiments. The responses pro-
duced by using parameter settings recommended in the literature (Section 2.4.9
on page 45) and some randomly chosen parameter settings are also listed.
[Figure: Relative Error per treatment for the Full, Screened, Book and Random parameter settings, titled 'Relative Error vs Time model after 250 iteration stagnation'.]

Figure 9.10: Evaluation of Relative Error response in the RelativeError-Time model of ACS on problems of size 400 and standard deviation 70. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full RelativeError-Time model, the settings from a desirability optimisation of the screened RelativeError-Time model, the settings recommended in the literature, and randomly generated settings.
For both Relative Error and Time, the parameter settings from the full and
screened models perform about the same as the parameter settings from the litera-
ture. Interestingly, on a small number of occasions, randomly chosen settings per-
form better than all other settings. Similar results were found with all eight other
combinations of problem characteristics for the full and screened RelativeError-
Time models. This result confirms the recommendation of generally good ACS set-
tings in the literature summarised in Section 2.4.9 on page 45. In particular, both
the literature and the desirability optimisation agree on the recommended settings
[Figure: Time (log scale) per treatment for the Full, Screened, Book and Random parameter settings, titled 'Relative Error vs Time model after 250 iteration stagnation'.]

Figure 9.11: Evaluation of Time response in the RelativeError-Time model of ACS on problems of size 400 and standard deviation 70. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full RelativeError-Time model, the settings from a desirability optimisation of the screened RelativeError-Time model, the settings recommended in the literature, and randomly generated settings.
for the most important factors according to the screening study. Both recommend
low values of Beta and AntsFraction and high values of the exploration/exploitation
threshold q0.
Results from the ADA-Time desirability optimisation were different as the rec-
ommended parameter settings from the literature were chosen with a relative error
response in mind rather than an ADA response. The following two figures illustrate
representative results for ADA and Time. Again, on a few occasions, the randomly
chosen settings perform better than all alternatives. There is little difference in
solution times between the desirability settings and the literature settings. How-
ever, there is a very large difference when one considers ADA. This shows that one
should not use the literature recommended parameter settings if one is measuring
an ADA solution quality response.
9.4 Conclusions and discussion
The following conclusions are drawn from the ACS tuning study. The first of these
relate to screening and ranking and serve to confirm the conclusions from the
screening study of the previous chapter (Section 8.4 on page 150). These screening
and tuning conclusions apply for a significance level of 5% and a power of 80% to
detect the effect sizes listed in Figure 9.1 on page 155. These effect sizes are a
change in solution time of 224s, a change in Relative Error of 3.76% and a change
in ADA of 1.43. Issues of power and effect size are discussed in Section A.6 on
page 225.
• Tuning Ant placement not important. The type of ant placement has no
significant effect on ACS performance in terms of solution quality or solution
time.
[Figure: ADA per treatment for the Full, Screened, Book and Random parameter settings, titled 'ADA vs Time model after 250 iteration stagnation'.]

Figure 9.12: Evaluation of ADA response in the ADA-Time model of ACS on problems of size 500 and standard deviation 10. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full ADA-Time model, the settings from a desirability optimisation of the screened ADA-Time model, the settings recommended in the literature, and randomly generated settings.
[Figure: Time (log scale) per treatment for the Full, Screened, Book and Random parameter settings, titled 'ADA vs Time model after 250 iteration stagnation'.]

Figure 9.13: Evaluation of Time response in the ADA-Time model of ACS on problems of size 500 and standard deviation 10. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full ADA-Time model, the settings from a desirability optimisation of the screened ADA-Time model, the settings recommended in the literature, and randomly generated settings.
• Tuning Alpha not important. Alpha has no significant effect on ACS per-
formance in terms of solution quality or solution time. This confirms the
common recommendation in the literature of setting alpha equal to 1. Alpha
has also been analysed with an OFAT approach in Appendix D on page 235.
• Tuning Rho not important. Rho has no significant effect on ACS perfor-
mance in terms of solution quality or solution time. This is a new result for
ACS.
• Tuning Pheromone Update Ant not important. The ant used for pheromone
updates is ranked highly for solution time. However, omitting this factor from
the screened model did not affect the performance of the screened model.
• Most important tuning parameters. The most important ACS tuning pa-
rameters are the heuristic exponent B-beta, the amount of ants C-antsFraction,
the length of candidate lists D-nnFraction, the exploration/exploitation thresh-
old E-q0 and G-rhoLocal.
The tuning study also provides further results that the screening study of the
previous case study could not.
• Minimum order model. A model that is of at least quadratic order is required
to model ACS solution quality and ACS solution time. This is a new result for
ACS and shows that an OFAT approach is not an appropriate way to tune the
performance of ACS.
• Relationship between tuning, problems and performance. Both the mod-
els of RelativeError-Time and ADA-Time were good predictors of ACS perfor-
mance across the entire design space. The prediction intervals for full and
screened models were very similar, confirming that the decisions from the
screening study in the previous chapter were correct.
• Tuned parameter settings. There was much similarity between the rec-
ommended tuned parameter settings from the full and screened models and
both settings resulted in similar ACS performance. Recommended settings
from desirability optimisation resulted in similar solution quality and solu-
tion time as settings in the literature. There are immense performance gains
to be achieved as evidenced by the relatively poor performance of many ran-
domly chosen parameter settings. The reader may have intuitively expected
randomly chosen values to perform poorly but we emphasise that their evalu-
ation is nonetheless an important control for the test of the DOE methodology.
• Comparison of solution quality responses. There is no difference in screening
and ranking conclusions from using the ADA or Relative Error solution quality
responses for ACS.
9.5 Chapter summary
This chapter presented a case study applying the methodology of Chapter 6 to the
tuning of the Ant Colony System (ACS) heuristic. Many new results were presented
and existing recommendations in the literature were confirmed in a rigorous fash-
ion. The conclusions of the screening study in the previous chapter were also
confirmed.
10 Case study: Screening Max-Min Ant System
This chapter presents a case study on the screening of the Max-Min Ant System
(MMAS) heuristic. Several new results for MMAS are presented. These results
have not yet been published in the literature. The chapter follows the sequential
experimentation procedure of Chapter 6, beginning with a screening study. The
tuning study will follow in the next Chapter.
Established tuning parameters previously thought to affect performance are
actually shown to have no effect at all. New tuning parameters that were thought
to affect performance are investigated. A new TSP problem characteristic is shown
to have a very strong effect on performance, confirming the results of Chapter 7.
All analyses are conducted for two performance measures, quality of solution and
solution time. This provides an accurate measure of the heuristic compromise
that is rarely seen in the literature. Finally, it is shown that models of MMAS
performance must be of a higher order than linear.
10.1 Method
10.1.1 Response variables
Three responses were measured as per Section 6.7.2 on page 127. These responses
were percentage relative error from a known optimum (henceforth referred to as
Relative Error), adjusted differential approximation (henceforth referred to as ADA)
and solution time (henceforth referred to as Time).
10.1.2 Factors, Levels and Ranges
Design Factors
There were 12 design factors, 10 representing the MMAS tuning parameters and 2
representing the TSP problem characteristics being investigated. The design fac-
tors and their high and low levels are summarised in Table 10.1. A description of
the MMAS tuning parameters was given in Section 2.4.5 on page 39. The parame-
ter M-antPlacement could be considered a parameterised design feature, as men-
tioned in Section 6.3.1 on page 111. As with the ACS case studies, we acknowledge
that an experimenter’s prior experience with MMAS may suggest narrower ranges
for these factors. When this experience was not available in this thesis, we chose
ranges around the values hard-coded in the original source code. For example,
restart freq was hard-coded to 25 and so a range of 2 to 40 was experimented with
here. It is a simple matter to rerun this case study with different ranges if desired.
Factor Name Type Low High
A alpha Numeric 1 13
B beta Numeric 1 13
C antsFraction Numeric 1.00 110.00
D nnFraction Numeric 1.00 20.00
E q0 Numeric 0.01 0.99
F rho Numeric 0.01 0.99
G reinitBranchFac Numeric 0.50 2.00
H reinitIters Numeric 2 80
J problemStDev Numeric 10 70
K problemSize Numeric 300 500
L restartFreq Numeric 2 40
M antPlacement Categoric random same
Table 10.1: Design factors for the screening study with MMAS. There are two problem characteristic factors (J-problemStDev and K-problemSize). The remaining 10 factors are MMAS tuning parameters.
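These ranges define the design space, so it may be worth recalling the standard mapping between natural units and the coded [-1, +1] units in which two-level factorial designs are built and analysed. A minimal sketch (the helper names are hypothetical; the example values are taken from Table 10.1):

```python
import numpy as np

def to_coded(x, low, high):
    """Map a factor value in natural units onto the coded [-1, +1] scale."""
    return 2.0 * (np.asarray(x, dtype=float) - low) / (high - low) - 1.0

def to_natural(z, low, high):
    """Inverse mapping from coded units back to natural units."""
    return low + (np.asarray(z, dtype=float) + 1.0) * (high - low) / 2.0

# restartFreq is explored over [2, 40]; its hard-coded default of 25 sits at
print(to_coded(25, 2, 40))        # ~0.21, slightly above the centre point
print(to_natural(0.0, 300, 500))  # centre of the problemSize range: 400.0
```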
Held-Constant Factors
The held constant factors are as per Section 6.7.6 on page 127.
10.1.3 Instances
All TSP instances were of the symmetric type and were created as per Section 6.7.1
on page 126. The TSP problem instances ranged in size from 300 cities to 500
cities with cost matrix standard deviation ranging from 10 to 70. All instances had
a mean of 100. The same instances were used for each replicate of a design point.
10.1.4 Experiment design, power and replicates
The experiment design was a Resolution IV 2^(12-5) fractional factorial (Section A.3.2
on page 215) with 6 centre points. The number of replicates was 8, determined
using the work-up procedure of Section 6.7.4 on page 127 for a power of about
80%, a significance level of 5% and an effect size of 0.18 standard deviations.
This yielded a total of 1030 runs. The following figure summarises the descriptive
statistics of the three response variables across all treatments at the 5 stagnation
measuring points and the actual effect size that is equivalent to 0.18 standard
deviations.
Iterations                                    Time   Relative Error    ADA
50    Mean                                  102.29             8.15   6.08
      StDev                                 378.70            13.34  10.36
      Max                                  4549.08           118.75  43.46
      Min                                     0.18             0.41   0.34
      Actual effect size of 0.18 StDev       68.17             2.40   1.86
100   Mean                                  181.28             7.66   5.59
      StDev                                 591.26            12.77   9.87
      Max                                  6799.15           116.22  43.46
      Min                                     0.29             0.41   0.28
      Actual effect size of 0.18 StDev      106.43             2.30   1.78
150   Mean                                  231.11             7.46   5.37
      StDev                                 654.76            12.57   9.45
      Max                                  6933.65           116.22  43.19
      Min                                     0.43             0.41   0.28
      Actual effect size of 0.18 StDev      117.86             2.26   1.70
200   Mean                                  307.51             7.29   5.09
      StDev                                 865.00            12.46   8.75
      Max                                  8144.02           116.22  43.19
      Min                                     0.54             0.41   0.25
      Actual effect size of 0.18 StDev      155.70             2.24   1.58
250   Mean                                  376.37             7.19   4.93
      StDev                                1038.89            12.39   8.31
      Max                                  9312.89           116.22  42.97
      Min                                     0.65             0.41   0.21
      Actual effect size of 0.18 StDev      187.00             2.23   1.50

Figure 10.1: Descriptive statistics for the MMAS screening experiment. Statistics are given at five stagnation points ranging from 50 to 250 iterations. The actual effect size equivalent to 0.18 standard deviations is also listed for each response variable.
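As a side note on the work-up arithmetic: in a two-level factorial, a main-effect contrast compares the runs at a factor's high level with those at its low level, so the power to detect a 0.18 standard deviation effect can be approximated with a two-sample t-test calculation. A rough sketch using statsmodels:

```python
from statsmodels.stats.power import tt_ind_solve_power

# Runs needed per factor level to detect a 0.18 StDev main effect at a 5%
# significance level with 80% power (two-sided test).
n_per_level = tt_ind_solve_power(effect_size=0.18, alpha=0.05, power=0.80)
print(round(n_per_level))   # roughly 486 per level, ~970 runs in total

# The 2^(12-5) design with 8 replicates gives 128 * 8 = 1024 factorial runs,
# i.e. 512 per level, which comfortably exceeds this requirement.
```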
10.1.5 Performing the experiment
Responses were measured at five stagnation levels. An examination of the de-
scriptive statistics verifies that the stagnation level did not have a large effect on
the response values and therefore the conclusions after a 250 iteration stagnation
should be the same as after lower iteration stagnations. The two solution quality
responses show a small but practically insignificant decrease in solution error
as the number of stagnation iterations is increased. The Time response increases
with increasing stagnation iterations because the experiments take longer to run. For all
cases, the level of stagnation iterations has little practically significant effect on
the three responses. MMAS did not make any large improvements when allowed
to run for longer. It is therefore sufficient to perform analyses at the 250 iterations
stagnation level.
10.2 Analysis
10.2.1 ANOVA
Effects for each response model were selected using stepwise regression (Sec-
tion A.4.1 on page 220) applied to a full two-factor-interaction model with an alpha-
out threshold of 0.10. Some terms removed by backward selection were added
back into the final model to preserve hierarchy.
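For illustration, a bare-bones backward elimination with an alpha-out threshold can be written as follows. The helper is hypothetical and deliberately simplified (real stepwise tools also handle categorical factors and forward steps); it refuses to drop a main effect while any surviving interaction contains it, which is one way of preserving hierarchy:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, response, terms, alpha_out=0.10):
    """Repeatedly drop the least significant removable term until every
    removable term has p <= alpha_out. A main effect is only removable
    once no surviving interaction contains it (hierarchy is preserved)."""
    terms = list(terms)
    while True:
        model = smf.ols(response + " ~ " + " + ".join(terms), data=df).fit()
        pvals = model.pvalues.drop("Intercept")
        interactions = [t for t in terms if ":" in t]
        removable = [t for t in terms
                     if ":" in t or not any(t in u.split(":") for u in interactions)]
        cand = pvals[[t for t in pvals.index if t in removable]]
        if cand.empty or cand.max() <= alpha_out:
            return model, terms
        terms.remove(cand.idxmax())

# Hypothetical usage: only beta, q0 and their interaction actually matter.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(-1, 1, (160, 3)), columns=["beta", "q0", "rho"])
df["y"] = 2 * df.beta - df.q0 + 1.5 * df.beta * df.q0 + rng.normal(0, 0.2, 160)
model, kept = backward_eliminate(
    df, "y", ["beta", "q0", "rho", "beta:q0", "beta:rho", "q0:rho"])
print(kept)    # expected: ['beta', 'q0', 'beta:q0'] (rho terms eliminated)
```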
To make the data amenable to statistical analysis, a transformation of the
responses was required for each analysis. The transformation was a log10 (Sec-
tion A.4.3 on page 222) for all three responses.
Outliers were deleted and the model building repeated until the models passed
the usual ANOVA diagnostics of model fit, normality, constant variance,
time-dependent effects, and leverage (Section A.4.2 on page 221).
24 data points (2% of total data) were removed when analysing Relative Error. 24
data points (2% of total data) were removed when analysing ADA. 10 data points
(1% of total data) were removed when analysing Time.
10.2.2 Confirmation
Confirmation experiments were run according to the methodology detailed in Sec-
tion 6.3.7 on page 114 in order to confirm the accuracy of the ANOVA models.
The randomly chosen treatments produced actual algorithm responses with the
descriptives listed in the following figure.
                      Max     Min   Mean  StDev
Relative Error 250  107.25   0.27   7.21  15.35
ADA 250              37.62   0.79   5.25   8.00
Time 250              4954      1    239    599

Figure 10.2: Descriptive statistics for the confirmation of the MMAS screening ANOVA. The response data is from runs of the actual algorithm on the randomly generated confirmation treatments.
The large ranges of each response reinforce the motivation for correct parameter
tuning as there is clearly a high cost in incorrectly tuned parameters.
The following three figures illustrate the 95% prediction intervals and actual
data for the three response models, Relative Error, ADA and Time respectively.
The two solution quality responses are well predicted by their models. The mod-
els match the trends of the actual data, successfully picking up the extremely low
[Figure: Relative Error per treatment with the 95% PI low and high bounds.]

Figure 10.3: 95% Prediction intervals for the MMAS screening of Relative Error. The horizontal axis shows the randomly generated treatment number.
[Figure: ADA per treatment with the 95% PI low and high bounds.]

Figure 10.4: 95% Prediction intervals for the MMAS screening of ADA. The horizontal axis shows the randomly generated treatment number.
and extremely high response values which vary over a range of 107% for relative
error and 37 for ADA.
[Figure: Time (log scale) per treatment with the 95% PI low and high bounds.]

Figure 10.5: 95% Prediction intervals for the MMAS screening of Time. The horizontal axis shows the randomly generated treatment number.
The time response is also well predicted by its model. This was achieved over
a range of 5000s. The three ANOVA models are therefore satisfactory predictors of
the three MMAS performance responses for factor values within 10% of the factor
range limits listed in Section 10.1.2 on page 170.
10.3 Results
The next figure gives a summary of the Sum of Squares ranking and the ANOVA
F and p values for the three responses. Only the main effects are listed. Those
main effects that rank in the top 12 are highlighted in bold. Rankings are based
on the full two factor interaction model of 78 terms, before backward selection was
applied.
                     Relative Error                ADA                           Time
                     Rank   F value    p value     Rank   F value    p value     Rank   F value     p value
A-alpha              23       59.80    <0.0001     24       59.80    <0.0001      6*     342.47     <0.0001
B-beta                3*    1270.75    <0.0001      3*    1270.75    <0.0001     40        5.79      0.0163
C-antsFraction        8*     760.90    <0.0001      8*     760.90    <0.0001      1*   28012.34     <0.0001
D-nnFraction          5*    1006.33    <0.0001      5*    1006.33    <0.0001      3*    2941.57     <0.0001
E-q0                  2*    2378.04    <0.0001      2*    2378.04    <0.0001      5*     352.99     <0.0001
F-rho                13      547.03    <0.0001     13      547.03    <0.0001      8*     172.10     <0.0001
G-reinitBranchFac    22       65.12    <0.0001     23       65.12    <0.0001     10*     114.19     <0.0001
H-reinitIters        16      111.65    <0.0001     17      111.65    <0.0001     63        0.97      0.3255
J-problemStDev        1*   21133.33    <0.0001      1*    6690.14    <0.0001     72        0.17      0.6833
K-problemSize        44        7.56     0.0061     16      120.56    <0.0001      2*    3489.94     <0.0001
L-restartFreq        63        0.44     0.5078     63        0.44     0.5078     35        7.88      0.0051
M-antPlacement       53        2.97     0.0853     54        2.97     0.0853     50        2.61      0.1066

Figure 10.6: Summary of ANOVAs for Relative Error, ADA and Time for MMAS. Only the main effects are shown. Effects with a top-12 ranking are marked with an asterisk. Rankings are based on the full two-factor-interaction model of 78 terms, before stepwise regression was applied.
The screening study of MMAS and the associated ANOVAs yield several impor-
tant results and answers to open questions from the ACO literature.
10.3.1 Screened factors
Factor L-RestartFreq is statistically insignificant at the 0.05 level for the two quality
responses but significant for the time response. The factor has low rankings across
all three responses.
M-AntPlacement is statistically insignificant at the 0.05 level for all three re-
sponses. It has a low ranking for all three responses, leading us to expect this
factor to have little effect on performance.
In summary, 2 factors are screened out, M-AntPlacement and L-RestartFreq.
10.3.2 Relative Importance of Factors
Of the two problem characteristics, the factor with the larger effect on solution
quality is the problem standard deviation J-problemStDev. The problem size K-
problemSize has a stronger effect on solution time than J-problemStDev. This is
because MMAS takes longer to reach stagnation on larger problem instances than
smaller instances.
Of the remaining unscreened tuning parameters, the heuristic exponent B-beta,
the amount of ants C-antsFraction, the length of candidate lists D-nnFraction, the
exploration/exploitation threshold E-q0 and F-Rho have the strongest effects on
solution quality. The same is true for solution time except for B-beta which has a
low ranking for solution time.
G-ReinitBranchFac is statistically significant for all three responses but is only
ranked in the top third for the quality responses. It has a high ranking for Time.
This highlights that G-ReinitBranchFac should be considered as a tuning param-
eter rather than being hard-coded as is typically the case.
Although statistically significant for all responses, A-Alpha only has a high
ranking for solution Time. Alpha could possibly be considered for screening.
These results are important because they highlight the most important tuning
parameters and problem characteristics in terms of both performance dimensions
of the heuristic compromise. These factors are the minimal set of design factors
one should experiment with when modelling and tuning MMAS performance.
10.3.3 Adequacy of a Linear Model.
A test for curvature as per Section 6.3.9 on page 116 shows there is a statistically
significant amount of curvature in all three responses. This means that the linear
model from the screening study is not adequate to explore the whole design space.
A higher order model of all three responses is required. This is an important result
because it confirms that a One-Factor-At-a-Time (OFAT) approach is insufficient
for investigating the performance of MMAS.
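The curvature test itself is simple to state: with replicated centre points, the gap between the centre-point mean and the factorial-point mean is tested against the pure-error variance of the centre points. A minimal sketch of this common form of the test, on synthetic data rather than the thesis data:

```python
import numpy as np
from scipy import stats

def curvature_test(y_factorial, y_centre):
    """One-degree-of-freedom test for curvature: compares the mean response
    at the centre points against the mean over the factorial points, using
    pure error estimated from the replicated centre points."""
    nf, nc = len(y_factorial), len(y_centre)
    ss_curv = nf * nc * (np.mean(y_factorial) - np.mean(y_centre)) ** 2 / (nf + nc)
    ms_pure_error = np.var(y_centre, ddof=1)
    f_stat = ss_curv / ms_pure_error
    p = stats.f.sf(f_stat, 1, nc - 1)
    return f_stat, p

# Synthetic example: a response with genuine curvature (a bowl shape).
rng = np.random.default_rng(11)
x = rng.choice([-1.0, 1.0], size=(128, 3))            # factorial points
y_fact = 5 + (x ** 2).sum(axis=1) + rng.normal(0, 0.3, 128)
y_cent = 5 + rng.normal(0, 0.3, 6)                    # centre points at x = 0
f_stat, p = curvature_test(y_fact, y_cent)
print(f"F = {f_stat:.1f}, p = {p:.4g}")   # small p: a linear model is inadequate
```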
10.3.4 Comparison of Solution Quality Measures
It was already mentioned that two solution quality responses were measured to
investigate whether the choice of one response over the other led to different conclu-
sions. An examination of the ANOVA summaries reveals that both Relative Error
and ADA have the same rankings (±1 places) and same statistical significance or
insignificance for all factors except K-problemSize. K-problemSize has a statisti-
cally significant effect on Relative Error only. This is due to the nature of how the
two quality responses are calculated (Section 3.10.2 on page 69).
This result shows that for screening of MMAS, the choice of solution quality re-
sponse will not affect the conclusions. However, ADA may be a preferable response
as it exhibits a lower range and variability than Relative Error. The advantage of a
lower variability was discussed in the context of statistical power (Section A.6 on
page 225).
10.4 Conclusions and discussion
The following conclusions are drawn from the MMAS screening study. The first
results concern tuning and design parameters that have no impact on MMAS per-
formance. These screening and tuning conclusions apply for a significance level of
5% and a power of 80% to detect the effect sizes listed in Figure 10.1 on page 171.
These effect sizes are a change in solution time of 187s, a change in Relative Error
of 2.2% and a change in ADA of 1.5. Issues of power and effect size are discussed
in Section A.6 on page 225.
1. Tuning Restart frequency tuning parameter not important. The number
of iterations used in the restart frequency has no significant effect on MMAS
performance in terms of solution quality or solution time. This is a highly
unexpected result, as the restart frequency is a fundamental feature of MMAS
(Section 2.4.5 on page 39).
2. Tuning Ant placement design parameter not important. The type of ant
placement has no significant effect on MMAS performance in terms of either
solution quality or solution time. This was an open question in the litera-
ture. The result is remarkable because intuitively one would expect a random
scatter of ants across the problem graph to explore a wider variety of possi-
ble solutions. This result shows that this is not the case. MMAS design can
be fixed with either a random scatter method or single node method of ant
placement.
Other results were as follows.
3. Alpha only important for solution time. The choice of Alpha only signifi-
cantly affects solution time. Although statistically significant for the quality
responses, it has a low ranking for them.
4. Problem standard deviation is important. This confirms the main result of
Chapter 7 in identifying a new TSP problem characteristic that has a signifi-
cant effect on the difficulty of a problem for MMAS. ACO research should be
reporting this characteristic in the literature.
5. Most important parameters. The rankings show that the most important
tuning parameters affecting solution quality or solution time or both are beta,
antsFraction, length of candidate list, exploration/exploitation threshold and
the pheromone update term rho.
6. Beta not important for solution time. The choice of Beta only affects so-
lution quality and not solution time. It has always been known that Beta
strongly affects solution quality.
7. New tuning parameter. The Reinitialisation Branching Factor tuning parameter
has a strong effect on time and a moderate but statistically significant effect
on quality.
8. Higher order model of MMAS behaviour needed. A higher order model,
greater than linear, is required to model MMAS solution quality and MMAS
solution time. This is an important result because it demonstrates for the
first time that simple OFAT approaches seen in the literature are insufficient
for accurately modelling and tuning MMAS performance.
9. Comparison of solution quality responses. There is no difference in conclu-
sions from the ADA and Relative Error solution quality responses for MMAS.
The ADA response is therefore preferable for screening because it exhibits a
lower variability than Relative Error and therefore results in more powerful
experiments.
The result regarding alpha confirms the literature’s general recommendation
that Alpha be set to 1 [47, p. 71]. It also contradicts Pellegrini et al.'s [96] analysis
of the most important MMAS parameters, where alpha was one of these parame-
ters and the length of candidate list and exploration/exploitation threshold were
omitted. This result is all the more remarkable given that Pellegrini et al.'s research
used the ACOTSP code and this study used the backwards compatible JACOTSP
code. It highlights the danger of using only intuitive reasoning to rank the impor-
tance of tuning parameters rather than a rigorous DOE approach with the fully
instantiated algorithm.
The ranking of the tuning parameters contradicts the claim of Doerr and Neu-
mann [41, p. 38] about rho being the most important parameter affecting ACO
solution run time. This thesis’ screening study shows that, excluding problem
characteristics, rho is in fact the 7th most important factor to affect solution time
to stagnation for MMAS.
The Reinitialisation Branching Factor is usually held constant (Section 2.4.5
on page 39) in reported research but this study has shown that it is an important
tuning parameter to be considered.
10.5 Chapter summary
This chapter presented a case study applying the methodology of Chapter 6 to the
screening of the Max-Min Ant System (MMAS) heuristic. Many new results were
presented. Existing recommendations in the literature were confirmed and other
claims were refuted in a rigorous fashion. The next chapter will use the results of
this screening to tune the performance of MMAS.
11 Case study: Tuning Max-Min Ant System
This Chapter reports a case study on tuning the factors affecting the performance
of a heuristic. The methodology for this case study was described in Chapter 6.
The particular heuristic studied is Max-Min Ant System (MMAS) for the Travelling
Salesperson Problem (Section 2.4.5 on page 39).
This chapter reports many new results for MMAS. All analyses are conducted for
two performance measures, quality of solution and solution time. This provides an
accurate measure of the heuristic compromise that is rarely seen in the literature.
It is shown that models of MMAS performance must be at least quadratic. MMAS
is tuned using a full parameter set and a screened parameter set resulting from
the case study of the previous chapter. This verifies that screening decisions from
the previous chapter are correct.
The results reported in this Chapter have been published in the literature [109].
11.1 Method
11.1.1 Response Variables
Three responses were measured as per Section 6.7.2 on page 127. These responses
were percentage relative error from a known optimum (henceforth referred to as
Relative Error), adjusted differential approximation (henceforth referred to as ADA)
and solution time (henceforth referred to as Time).
11.1.2 Factors, levels and ranges
Design Factors
In the full RSM, there were 12 design factors as per the screening study of the
previous chapter. The factors and their high and low levels are repeated in the
following table for convenience.
Factor Name Type Low High
A alpha Numeric 1 13
B beta Numeric 1 13
C antsFraction Numeric 1.00 110.00
D nnFraction Numeric 1.00 20.00
E q0 Numeric 0.01 0.99
F rho Numeric 0.01 0.99
G reinitBranchFac Numeric 0.50 2.00
H reinitIters Numeric 2 80
J problemStDev Numeric 10 70
K problemSize Numeric 300 500
L restartFreq Numeric 2 40
M antPlacement Categoric random same
Table 11.1: Design factors for the tuning study with MMAS. The factor ranges are also given.
Tuning parameters for the Screened model, based on the results of Section 10.3
on page 174, did not include L-RestartFreq and M-antPlacement. These two fac-
tors took on randomly chosen values within their range for each experiment run.
Held-Constant Factors
The held constant factors are as per Section 6.7.6 on page 127. There were ad-
ditional held-constant factors for MMAS. The p term in the trail minimum update
(Section 2.4.5 on page 39) was fixed at 0.05, as hard-coded in the original ACOTSP.
The lambda value used in the Trail Reinitialisation of the daemon actions calcula-
tion (Section 2.4.5 on page 39) was fixed at 0.05.
11.1.3 Instances
All TSP instances were of the symmetric type and were created as per Section 6.7.1
on page 126. The TSP problem instances ranged in size from 300 cities to 500
cities with cost matrix standard deviation ranging from 10 to 70. All instances had
a mean of 100. The same instances were used for each replicate of a design point.
11.1.4 Experiment design, power and replicates
The experiment design for both models was a Minimum Run Resolution V Face-
Centred Composite (Section A.3.4 on page 218) with six centre points.
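For orientation, a standard full-size face-centred composite design can be generated with the pyDOE2 package. The call below is illustrative only and is shown for 6 factors: the thesis design is a minimum-run Resolution V variant, which is much smaller than a full composite would be for 12 factors (over 4000 factorial runs).

```python
from pyDOE2 import ccdesign

# Face-centred central composite design in coded units: factorial points at
# +/-1, star points on the faces (also at +/-1), plus six centre points.
design = ccdesign(6, center=(0, 6), face='faced')
print(design.shape)          # (82, 6): 64 factorial + 12 star + 6 centre runs
```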
The work-up procedure was slightly different to that specified in Section 6.7.4
on page 127 due to a limitation on experimental resources. In this study, a target
power of 80% and significance level of 5% were fixed by convention. The number
of replicates was fixed at 8, according to the number of feasible experiment runs
that could be conducted with the given resources. The collected data was then
examined to determine the smallest effect size that could be detected given these
constraints. Fortunately, the variability of the data was low enough to permit a
reasonable effect size to be detected with this power, significance level and num-
ber of replicates. The descriptive statistics for the full experiment and screened
experiment data are given in the following two figures.
Iterations                                    Time   Relative Error    ADA
50    Mean                                  110.68             9.16   4.73
      StDev                                 455.98            16.75   6.62
      Max                                  6493.67           105.23  43.27
      Min                                     0.19             0.41   0.25
      Actual effect size of 0.25 StDev      113.99             4.19   1.66
100   Mean                                  205.88             8.63   4.50
      StDev                                 885.52            15.64   6.41
      Max                                 11166.51           104.18  40.42
      Min                                     0.32             0.39   0.16
      Actual effect size of 0.25 StDev      221.38             3.91   1.60
150   Mean                                  274.53             8.17   4.38
      StDev                                1161.26            14.33   6.32
      Max                                 16378.81           101.68  40.42
      Min                                     0.44             0.37   0.16
      Actual effect size of 0.25 StDev      290.32             3.58   1.58
200   Mean                                  355.73             8.00   4.32
      StDev                                1562.46            13.88   6.28
      Max                                 20094.57           101.68  40.42
      Min                                     0.58             0.37   0.16
      Actual effect size of 0.25 StDev      390.62             3.47   1.57
250   Mean                                  426.54             7.84   4.27
      StDev                                1898.49            13.44   6.24
      Max                                 25757.98            99.00  40.42
      Min                                     0.71             0.37   0.12
      Actual effect size of 0.25 StDev      474.62             3.36   1.56

Figure 11.1: Descriptive statistics for the full MMAS experiment design. The actual detectable effect size of 0.25 standard deviations is shown for each response and for each stagnation point. There is little practical difference in the effect size of the solution quality responses for an increase in the stagnation point.
The full design could achieve sufficient power while detecting an effect of size
0.25 standard deviations.
The screened design could only achieve sufficient power while detecting an ef-
fect of size 0.41 standard deviations.
Iterations                                    Time   Relative Error    ADA
50    Mean                                   88.71             7.97   5.20
      StDev                                 334.78            10.45   7.74
      Max                                  3983.14            96.23  42.54
      Min                                     0.23             0.41   0.46
      Actual effect size of 0.41 StDev      137.26             4.29   3.17
100   Mean                                  179.99             7.22   4.71
      StDev                                 658.07             8.30   6.92
      Max                                  6673.04            93.47  42.54
      Min                                     0.41             0.41   0.46
      Actual effect size of 0.41 StDev      269.81             3.40   2.84
150   Mean                                  284.49             6.81   4.48
      StDev                                1024.61             6.78   6.43
      Max                                 10113.55            93.47  42.54
      Min                                     0.56             0.41   0.46
      Actual effect size of 0.41 StDev      420.09             2.78   2.64
200   Mean                                  348.11             6.45   4.35
      StDev                                1170.99             4.58   6.22
      Max                                 11535.46            32.22  42.54
      Min                                     0.69             0.41   0.46
      Actual effect size of 0.41 StDev      480.11             1.88   2.55
250   Mean                                  398.42             6.40   4.33
      StDev                                1280.86             4.55   6.22
      Max                                 11656.30            32.22  42.54
      Min                                     0.83             0.41   0.46
      Actual effect size of 0.41 StDev      525.15             1.87   2.55

Figure 11.2: Descriptive statistics for the screened MMAS experiment design. The actual detectable effect size of 0.41 standard deviations is shown for each response and at each stagnation point. There is little practical difference in the detectable effect size of the solution quality responses for different stagnation iterations levels.
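This inverted work-up can be sketched in the same way as the replicate work-up of the previous chapter: with the run budget fixed, solve for the smallest detectable effect rather than the sample size. The run count below is hypothetical and chosen only to show the shape of the calculation:

```python
from statsmodels.stats.power import tt_ind_solve_power

# With ~252 runs at each level of a two-level factor, a 5% significance
# level and 80% power, the smallest detectable effect is about 0.25 StDev.
d = tt_ind_solve_power(nobs1=252, alpha=0.05, power=0.80)
print(f"smallest detectable effect = {d:.2f} standard deviations")
```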
11.1.5 Performing the experiment
Responses were measured at five stagnation levels. An examination of the de-
scriptive statistics verifies that the stagnation level did not have a large effect on
the response values and therefore the conclusions after a 250 iteration stagnation
should be the same as after lower iteration stagnations. The two solution quality
responses show a small but practically insignificant decrease as the number of
stagnation iterations is increased. The Time response increases with increasing stagnation
iterations because the experiments take longer to run. For all cases, the level of
stagnation iterations has little practically significant effect on the three responses.
MMAS did not make any large improvements when allowed to run for longer. It is
therefore sufficient to perform analyses of the full and screened models at the 250
iterations stagnation level.
11.2 Analysis
11.2.1 Fitting
A fit analysis was conducted for each response in the full experiment and the
screened experiment. In both cases, at least a quadratic model was required to
model the responses. For the Minimum Run Resolution V Face-Centred Compos-
ite, cubic models are aliased and so were not considered.
11.2.2 ANOVA
Effects for each response model were selected using backward selection applied to
a full quadratic model with an alpha threshold of 0.10. Some terms removed by
backward selection were added back into the final model to preserve hierarchy. To
make the data amenable to statistical analysis, a transformation of the responses
was required for each analysis (Section A.4.3 on page 222). The transformations
were a log10 in all but one case. For ADA in the ADA-Time model, a
square root transformation with d = 0.5 was used.
Outliers were deleted and the model building repeated until the models passed
the usual ANOVA diagnostics of model fit, normality, constant variance,
time-dependent effects, and leverage (Section A.4.2 on page 221).
The next table summarises the outliers deleted from each analysis.
Experiment Model Number % of runs
Full ADA-Time 56 4
Screened ADA-Time 50 8
Full RelativeError-Time 41 3
Screened RelativeError-Time 53 9
Table 11.2: Number of outliers removed from the MMAS tuning analyses. The percentage of the total runs deleted is greater in the screened experiments than in the full experiments.
For both models, the screened experiments required more outliers to be deleted
than the full experiments. The amount of outliers (8% and 9%) for the two screened
experiments may be cause for concern. This concern is addressed with some con-
firmation experiments.
11.2.3 Confirmation
Confirmation experiments were run according to the methodology detailed in Sec-
tion 6.3.7 on page 114. The randomly chosen treatments produced actual algo-
rithm responses with the descriptives listed in the following figure.
Iterations           Time   Relative Error    ADA
100   Mean          65.30             5.89   3.62
      StDev         81.26             4.35   3.92
      Max          380.82            25.68  25.31
      Min            3.27             0.69   0.43
150   Mean          88.83             5.87   3.61
      StDev        103.21             4.35   3.91
      Max          474.20            25.68  25.31
      Min            4.64             0.69   0.43
200   Mean         112.13             5.84   3.59
      StDev        122.51             4.35   3.90
      Max          568.09            25.68  25.31
      Min            6.00             0.69   0.43
250   Mean         137.01             5.83   3.58
      StDev        149.95             4.35   3.90
      Max          662.35            25.68  25.31
      Min            7.37             0.69   0.43

Figure 11.3: Descriptive statistics for the MMAS confirmation experiments. This is the actual data produced from the randomly generated confirmation treatments. All three responses vary over wide ranges, highlighting the cost of incorrectly chosen tuning parameter settings.
The same confirmation runs were used for the screened and full experiments.
This allows a direct comparison between the predictive capabilities of the models
from the screened and full experiments.
The following two figures compare the predictive capabilities of the full and
screened RelativeError-Time models for the Relative Error response over fifty ran-
domly generated treatments. Both models predict the relative error response well
for almost all of the 50 treatments extending over a range of 25%. There are about
three poorly performing predictions, two where the models overpredicted and one
where the models underpredicted the actual MMAS performance. Both the full and
screened models have the same shape, confirming the accuracy of the screening
study in the previous chapter.
[Figure: Full Response Surface Model: Relative Error per treatment with the 95% PI low and high bounds.]

Figure 11.4: 95% prediction intervals of Relative Error by the full RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
[Figure: Screened Response Surface Model: Relative Error per treatment with the 95% PI low and high bounds.]

Figure 11.5: 95% prediction intervals of Relative Error by the screened RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
The following two figures compare the predictive capabilities of the full and
screened RelativeError-Time models for the Time response. Both the full and
screened models of RelativeError-Time are excellent predictors of the Time re-
sponse. All extreme points are well predicted. Both models have the same shape,
confirming the accuracy of the screening study from the previous chapter.
[Figure: Full Response Surface Model: Time (log scale) per treatment with the 95% PI low and high bounds.]

Figure 11.6: Predictions of Time by the full RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
[Figure: Screened Response Surface Model: Time (log scale) per treatment with the 95% PI low and high bounds.]

Figure 11.7: Predictions of Time by the screened RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
The next two figures compare the predictive capabilities of the full and screened
ADA-Time models for the ADA response. Both models are good predictors of ADA
across the majority of confirmation treatments. The full model fails to predict
the exact values of two peaks but does nonetheless identify them as peaks. The
screened model has a slightly different shape to the full model. This may be due
either to decisions in the modelling process or to an incorrect screening decision.
The next two figures compare the predictive capabilities of the full and screened
ADA-Time models for the Time response. Both models are excellent predictors of
the Time response over a range of thousands of seconds and there is little differ-
ence in the models’ predictions.
[Figure: Full Response Surface Model: ADA per treatment with the 95% PI low and high bounds.]

Figure 11.8: 95% prediction intervals of ADA by the full ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
[Figure: Screened Response Surface Model: ADA per treatment with the 95% PI low and high bounds.]

Figure 11.9: 95% prediction intervals of ADA by the screened ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
[Figure: Full Response Surface Model: Time (log scale) per treatment with the 95% PI low and high bounds.]

Figure 11.10: 95% prediction intervals of Time by the full ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
[Figure: Screened Response Surface Model: Time (log scale) per treatment with the 95% PI low and high bounds.]

Figure 11.11: 95% prediction intervals of Time by the screened ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.
In general, we conclude that both models are good predictors of the responses
and that the screening decisions from the screening study were correct. The con-
cerns raised in the previous section regarding effect size and outlier deletion have
been mitigated. The models can therefore be used to make recommendations on
good tuning parameter settings for a given instance.
11.3 Results
11.3.1 Screening and relative importance of factors
The following two figures give the ranked ANOVAs of the Relative Error and Time
models from the RelativeError-Time analysis. The terms have been rearranged in
order of decreasing sum of squares so that the largest contributor to the models
comes first.
Looking first at the Relative Error rankings, we see that the least important
main effects are A-alpha, M-antPlacement and H-reinitIters. The first two of these
are the terms that were deemed unimportant in the screening study of the previous
chapter. However, in the screening study it was L-restartFreq that was deemed
unimportant rather than H-reinitIters. This may adversely affect the desirability
optimisation of the next section.
By far the most important terms are the main effects of the exploration/exploit-
ation tuning parameter, the candidate list length tuning parameter and the exponent
B-beta, as well as their interactions. This is a very important result because it
shows that candidate list length, a parameter that we have often seen set at a fixed
value or not used at all, is actually one of the most important parameters to set
correctly. G-reinitBranchFac, which is not normally considered a parameter at all,
is also very important for solution quality.
Looking at the Time rankings, we see that antPlacement was completely re-
moved from the model. The least important main effects were then L-restartFreq
and H-reinitIters. These results also confirm the screening decisions.
Rank  Term                 Sum of squares  F value   p value  |  Rank  Term            Sum of squares  F value  p value
 1    J-problemStDev       96.79           27278.14  <0.0001  |  35    BK              0.59            165.63   <0.0001
 2    B-beta               12.95            3648.58  <0.0001  |  36    HL              0.59            164.91   <0.0001
 3    E-q0                 12.69            3576.14  <0.0001  |  37    BL              0.55            154.09   <0.0001
 4    BE                   12.21            3440.86  <0.0001  |  38    CF              0.54            151.67   <0.0001
 5    D-nnFraction          5.84            1646.78  <0.0001  |  39    FL              0.50            140.41   <0.0001
 6    DE                    4.47            1260.23  <0.0001  |  40    JK              0.44            124.25   <0.0001
 7    BD                    3.89            1096.49  <0.0001  |  41    BJ              0.44            122.94   <0.0001
 8    G-reinitBranchFac     3.89            1095.75  <0.0001  |  42    A^2             0.43            120.87   <0.0001
 9    GJ                    2.79             785.30  <0.0001  |  43    EH              0.43            120.76   <0.0001
10    FH                    2.68             756.69  <0.0001  |  44    CK              0.42            118.41   <0.0001
11    BC                    2.61             735.55  <0.0001  |  45    FG              0.42            117.05   <0.0001
12    CD                    2.57             725.08  <0.0001  |  46    CH              0.41            115.01   <0.0001
13    AB                    2.50             704.56  <0.0001  |  47    J^2             0.38            106.55   <0.0001
14    AK                    2.41             680.27  <0.0001  |  48    DG              0.37            104.57   <0.0001
15    AJ                    2.39             673.34  <0.0001  |  49    CE              0.36            101.04   <0.0001
16    HJ                    2.34             658.60  <0.0001  |  50    H-reinitIters   0.34             96.05   <0.0001
17    HK                    2.28             642.08  <0.0001  |  51    AL              0.29             81.91   <0.0001
18    EL                    2.00             565.03  <0.0001  |  52    AE              0.27             76.12   <0.0001
19    GK                    1.90             535.92  <0.0001  |  53    FK              0.25             70.12   <0.0001
20    EJ                    1.84             519.39  <0.0001  |  54    AG              0.22             62.70   <0.0001
21    C-antsFraction        1.82             513.19  <0.0001  |  55    BF              0.20             57.43   <0.0001
22    GL                    1.75             492.62  <0.0001  |  56    EK              0.17             49.28   <0.0001
23    DF                    1.68             474.43  <0.0001  |  57    F-rho           0.17             47.94   <0.0001
24    DK                    1.51             426.61  <0.0001  |  58    BH              0.14             40.61   <0.0001
25    EF                    1.40             395.95  <0.0001  |  59    AD              0.14             38.56   <0.0001
26    CL                    1.20             338.79  <0.0001  |  60    F^2             0.13             36.25   <0.0001
27    K-problemSize         1.14             322.31  <0.0001  |  61    DJ              0.12             34.29   <0.0001
28    KL                    1.01             284.32  <0.0001  |  62    E^2             0.11             30.90   <0.0001
29    L-restartFreq         0.85             238.83  <0.0001  |  63    M-antPlacement  0.08             21.47   <0.0001
30    D^2                   0.69             194.48  <0.0001  |  64    AC              0.08             21.37   <0.0001
31    DH                    0.66             185.14  <0.0001  |  65    BG              0.06             17.64   <0.0001
32    FJ                    0.65             182.44  <0.0001  |  66    AF              0.06             15.97   <0.0001
33    AH                    0.62             174.99  <0.0001  |  67    EG              0.05             15.35   <0.0001
34    CJ                    0.62             174.03  <0.0001  |  68    B^2             0.05             14.48    0.0001
69    C^2                   0.05              13.28   0.0003  |
70    DL                    0.04              11.10   0.0009  |
71    A-alpha               0.03               9.67   0.0019  |
72    EM                    0.03               7.05   0.0080  |
73    JL                    0.02               6.44   0.0113  |

Figure 11.12: RelativeError-Time ranked ANOVA of Relative Error response from the full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.
Rank  Term               Sum of squares  F value   p value     Rank  Term            Sum of squares  F value  p value
1     C-antsFraction     447.72          28026.84  < 0.0001    31    AK              0.66            41.11    < 0.0001
2     K-problemSize      47.69           2985.59   < 0.0001    32    DK              0.63            39.22    < 0.0001
3     D-nnFraction       42.94           2688.11   < 0.0001    33    CF              0.55            34.54    < 0.0001
4     E-q0               8.06            504.59    < 0.0001    34    B-beta          0.45            28.07    < 0.0001
5     CG                 5.58            349.35    < 0.0001    35    BJ              0.39            24.28    < 0.0001
6     DG                 5.08            318.31    < 0.0001    36    DE              0.39            24.14    < 0.0001
7     A-alpha            4.95            309.83    < 0.0001    37    DJ              0.35            22.18    < 0.0001
8     EG                 4.61            288.38    < 0.0001    38    BK              0.35            21.84    < 0.0001
9     AG                 3.57            223.53    < 0.0001    39    EL              0.33            20.92    < 0.0001
10    C^2                3.46            216.87    < 0.0001    40    BH              0.32            19.86    < 0.0001
11    BC                 3.41            213.56    < 0.0001    41    JL              0.29            17.87    < 0.0001
12    AE                 3.02            189.01    < 0.0001    42    BE              0.26            16.46    < 0.0001
13    AC                 2.64            165.50    < 0.0001    43    HK              0.24            14.71    0.0001
14    CJ                 2.39            149.68    < 0.0001    44    HJ              0.23            14.39    0.0002
15    G-reinitBranchFac  2.31            144.33    < 0.0001    45    EJ              0.20            12.65    0.0004
16    F-rho              1.66            104.20    < 0.0001    46    D^2             0.20            12.48    0.0004
17    CE                 1.52            94.85     < 0.0001    47    CK              0.18            11.16    0.0009
18    FJ                 1.50            94.14     < 0.0001    48    EK              0.17            10.89    0.0010
19    DF                 1.50            93.69     < 0.0001    49    GJ              0.14            8.65     0.0033
20    AD                 1.48            92.87     < 0.0001    50    GL              0.13            8.45     0.0037
21    BG                 1.30            81.17     < 0.0001    51    G^2             0.10            6.34     0.0119
22    J-problemStDev     1.25            78.56     < 0.0001    52    FL              0.10            6.17     0.0131
23    HL                 1.20            75.24     < 0.0001    53    FH              0.09            5.39     0.0204
24    BF                 1.19            74.42     < 0.0001    54    DL              0.06            3.93     0.0476
25    AB                 0.98            61.19     < 0.0001    55    BL              0.05            3.07     0.0799
26    CD                 0.95            59.67     < 0.0001    56    CH              0.04            2.75     0.0977
27    KL                 0.84            52.89     < 0.0001    57    EF              0.04            2.68     0.1017
28    AH                 0.73            45.43     < 0.0001    58    FG              0.04            2.52     0.1130
29    JK                 0.72            44.89     < 0.0001    59    H-reinitIters   0.01            0.49     0.4839
30    GK                 0.66            41.51     < 0.0001    60    L-restartFreq   0.01            0.36     0.5483

Figure 11.13: RelativeError-Time ranked ANOVA of the Time response from the full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.
By far the most important tuning parameters are the number of ants and the
lengths of their candidate lists. This is quite intuitive, as the amount of processing
is directly related to these parameters. The result regarding the cost of the number
of ants is particularly important because the number of ants does not have a
relatively strong effect on solution quality. The extra time cost of using more ants
will not result in gains in solution quality. This is an important result because it
contradicts the often-recommended parameter setting of letting the number of ants
equal the problem size.
An examination of the ranked ANOVAs from the ADA-Time model gives the
same ranking of tuning parameter contributions to time and ADA. As with the
previous screening study, the choice of solution quality response does not change
the conclusions about the relative importance of the tuning parameters.
11.3.2 Tuning
A desirability optimisation is performed as per Section 6.5.2 on page 122. The
results from the full and screened RelativeError-Time models are presented in the
following two figures.
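For concreteness, the following is a minimal sketch of how such an optimisation scores a candidate treatment, assuming smaller-is-better Derringer-Suich style desirability functions for both responses. The class name, bounds and weights are illustrative assumptions, not the exact settings used in this case study.

public final class Desirability {

    // Smaller-is-better desirability: 1 at or below the target, 0 at or
    // above the upper bound, and a power curve in between.
    static double smallerIsBetter(double y, double target, double upper, double weight) {
        if (y <= target) return 1.0;
        if (y >= upper) return 0.0;
        return Math.pow((upper - y) / (upper - target), weight);
    }

    // Overall desirability is the geometric mean of the individual
    // desirabilities, so failure on either response vetoes the treatment.
    // The bounds (5% relative error, 60 s) and unit weights are invented
    // for illustration only.
    static double overall(double relativeError, double timeSeconds) {
        double dQuality = smallerIsBetter(relativeError, 0.0, 5.0, 1.0);
        double dTime = smallerIsBetter(timeSeconds, 0.0, 60.0, 1.0);
        return Math.sqrt(dQuality * dTime);
    }

    public static void main(String[] args) {
        System.out.println(overall(0.15, 0.72)); // responses from a row of Figure 11.14
    }
}

The desirability optimisation then searches the response surface model for the treatment that maximises this overall desirability; the geometric mean ensures that a treatment that completely fails on either response is never recommended.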
The rankings of the ANOVA terms have already highlighted the factors that have
little effect on the responses. These screened factors can take on any values in
the desirability optimisations. This is confirmed by examining the desirability rec-
ommendations from the full and screened models. The most important factors,
comprising the screening model, have recommended settings that usually agree
Desirability results for the Full MMAS algorithm

Size  StDev  alpha  beta  antsFraction  nnFraction  q0    rho   reinitBranchFac  reinitIters  restartFreq  antPlacement  Time  Relative Error  Desirability
300   10     1      1     1.00          1.00        0.99  0.01  0.50             80           2            random        0.72  0.15            1.00
300   40     1      1     1.00          1.00        0.99  0.01  0.50             80           2            random        0.94  0.56            0.95
300   70     13     13    1.13          1.00        0.04  0.68  0.57             80           39           same          1.01  0.54            0.95
400   10     1      1     1.00          1.50        0.96  0.22  0.51             80           39           same          1.53  0.19            0.96
400   40     13     6     1.00          1.00        0.95  0.70  0.50             80           20           random        2.18  0.84            0.87
400   70     13     12    1.00          1.00        0.01  0.64  0.62             35           30           random        2.55  1.16            0.84
500   10     13     2     1.00          1.01        0.99  0.49  0.51             31           23           same          4.25  0.37            0.91
500   40     13     13    1.01          1.32        0.01  0.59  0.50             5            30           random        4.45  0.99            0.82
500   70     13     13    1.00          1.40        0.01  0.48  0.51             38           34           same          4.27  1.14            0.81

Figure 11.14: Full RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.
Desirability results for the Screened MMAS algorithm

Size  StDev  alpha  beta  antsFraction  nnFraction  q0    rho   reinitBranchFac  reinitIters  Time  Relative Error  Desirability
300   10     3      11    1.00          1.10        0.98  0.25  0.50             80           1     0.40            0.97
300   40     1      7     8.83          1.00        0.98  0.56  0.50             77           3     0.62            0.89
300   70     13     6     1.00          2.34        0.99  0.52  0.50             77           1     2.29            0.73
400   10     13     5     1.00          1.00        0.89  0.53  0.50             3            3     0.41            0.94
400   40     1      3     2.79          1.00        0.94  0.69  0.60             77           3     1.11            0.80
400   70     13     9     1.00          1.00        0.99  0.53  1.12             76           5     2.82            0.64
500   10     13     8     1.00          1.02        0.85  0.24  0.50             50           5     0.41            0.90
500   40     13     11    1.20          1.04        0.99  0.57  0.85             80           8     1.18            0.74
500   70     13     10    1.11          1.00        0.14  0.46  0.61             75           4     3.04            0.63

Figure 11.15: Screened RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.
with the recommended settings from the full model. For example, AntsFraction
and nnFraction are always low in both models. The remaining unimportant factors
take on a variety of values in the full model desirability optimisation, for example
AntPlacement.
The predicted values of both relative error and time from both models agree
closely with one another. The quality of the desirability optimisation recommenda-
tions can now be evaluated.
11.3.3 Evaluation of tuned settings
The tuned parameter recommendations from the desirability optimisation are eval-
uated as per the methodology of Section 6.6 on page 124. Some illustrative plots
are given in the following figures. On each plot, the horizontal axis lists the ran-
domly generated treatments and the vertical axis lists the response value. Each
plot contains the data for the response recorded using the settings from the desir-
ability optimisation of the full and screened experiments. The responses produced
by using parameter settings recommended in the literature and some randomly
chosen parameter settings are also listed.
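For reference, the following is a minimal sketch of the Relative Error response plotted below, assuming the usual definition for TSP studies of the percentage excess of the found tour length over the optimal (or best known) tour length; the thesis' precise definition may be normalised slightly differently.

public final class RelativeError {

    // Percentage excess of the found tour over the reference tour length.
    static double relativeError(double tourLength, double optimalLength) {
        return 100.0 * (tourLength - optimalLength) / optimalLength;
    }

    public static void main(String[] args) {
        System.out.println(relativeError(103.0, 100.0)); // prints 3.0 (per cent)
    }
}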
[Plot: 'Relative Error vs Time model after 250 iteration stagnation'. Horizontal axis: Treatment (0 to 10); vertical axis: Relative Error at 250-iteration stagnation (0 to 10). Series: Full Desirability, Screened Desirability, Book, Random.]

Figure 11.16: Evaluation of the Relative Error response in the RelativeError-Time model of MMAS. Problems are of size 500 and standard deviation 10.
For Relative Error, the parameter settings from the full and screened models
perform slightly better than the parameter settings from the literature. Interest-
ingly, on a small number of occasions, randomly chosen settings perform better
than all other settings. It is not until we examine the other side of the heuristic
compromise that we see the advantage of the DOE approach to parameter tun-
ing. Here, the results from the DOE desirability optimisation are two orders of
magnitude better than the results from the settings recommended in the literature
(Section 2.4.9 on page 45).
For both models, there is little difference between the performance in terms of
time or quality, supporting the conclusions made in the screening experiment of
the previous chapter.
As with ACS, results were quite different when evaluating the ADA-Time model.
[Plot: 'Relative Error vs Time model after 250 iteration stagnation'. Horizontal axis: Treatment (0 to 10); vertical axis: Time at 250-iteration stagnation (log scale, 1 to 10000). Series: Full Desirability, Screened Desirability, Book, Random.]

Figure 11.17: Evaluation of the Time response in the RelativeError-Time model of MMAS. Problems are of size 500 and standard deviation 10.
In the next two figures we see that the parameter settings from the full and
screened models outperform the settings from the literature by an order of magni-
tude in solution quality and three orders of magnitude in solution time. Again, the
comparison of the performance of the DOE settings with the literature settings is
not appropriate for ADA as the literature settings were recommended in the context
of relative error. However, this highlights the importance of not applying the
literature-recommended settings without an understanding of the solution quality response.
There is no practically significant difference between parameter setting recommen-
dations from the full and screened models. This indicates that the decisions from
the screening study were correct.
[Plot: 'ADA vs Time model after 250 iteration stagnation'. Horizontal axis: Treatment (0 to 10); vertical axis: ADA at 250-iteration stagnation (0 to 40). Series: Full Desirability, Screened Desirability, Book, Random.]

Figure 11.18: Evaluation of the ADA response in the ADA-Time model of MMAS. Problems are of size 300 and standard deviation 10.
[Plot: 'ADA vs Time model after 250 iteration stagnation'. Horizontal axis: Treatment (0 to 10); vertical axis: Time at 250-iteration stagnation (log scale, 1 to 1000). Series: Full Desirability, Screened Desirability, Book, Random.]

Figure 11.19: Evaluation of the Time response in the ADA-Time model of MMAS. Problems are of size 300 and standard deviation 10.
Not all parameter recommendations performed so well. The next figure shows
the evaluation of the Relative Error performance from the RelativeError-Time model
for problems of size 400 and standard deviation 70. The parameter settings from
the literature perform as well as or better than those obtained with the DOE ap-
proach. Randomly chosen parameter settings perform better than all others on
many occasions.
[Plot: 'Relative Error vs Time model after 250 iteration stagnation'. Horizontal axis: Treatment (0 to 10); vertical axis: Relative Error at 250-iteration stagnation (0 to 15). Series: Full Desirability, Screened Desirability, Book, Random.]

Figure 11.20: Evaluation of the Relative Error response in the RelativeError-Time model of MMAS. Problems are of size 400 and standard deviation 70.
Similar results of poor relative error performance despite excellent time perfor-
mance were also observed for other combinations of problem size and standard
deviation. This was not an issue with ACS and may be due to the more compli-
cated nature of MMAS or an interaction between the stagnation stopping criterion
and the less aggressive behaviour of MMAS relative to ACS.
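The stagnation criterion referred to in these plots can be pictured with a small sketch; here runOneIteration is a hypothetical stand-in for a full MMAS iteration and is not part of the actual JACOTSP code.

public final class StagnationStop {

    static double runOneIteration() {
        return Math.random() * 1000.0; // placeholder for the iteration-best tour length
    }

    public static void main(String[] args) {
        // Stop once the best tour length has not improved for 250
        // consecutive iterations.
        final int stagnationLimit = 250;
        int iterationsSinceImprovement = 0;
        double bestLength = Double.POSITIVE_INFINITY;
        while (iterationsSinceImprovement < stagnationLimit) {
            double iterationBest = runOneIteration();
            if (iterationBest < bestLength) {
                bestLength = iterationBest;   // new best found: reset the counter
                iterationsSinceImprovement = 0;
            } else {
                iterationsSinceImprovement++; // no improvement this iteration
            }
        }
        System.out.println("Stopped with best length " + bestLength);
    }
}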
11.4 Conclusions and discussion
The following conclusions are drawn from the MMAS tuning study. These screen-
ing and tuning conclusions apply for a significance level of 5% and a power of
80% to detect the effect sizes listed in Figure 11.1 on page 181. These effect sizes
are a change in solution time of 474s, a change in Relative Error of 3.36% and a
change in ADA of 1.56. Issues of power and effect size are discussed in Section A.6
on page 225.
• Tuning Ant Placement not important. This factor had a very low rank-
ing in both full models. The screening study correctly identified the lack of
importance of Ant Placement.
• Tuning Restart frequency not important. The number of iterations used
in the restart frequency has no significant effect on MMAS performance in
terms of solution quality or solution time. This is a highly unexpected result
as the restart frequency is a fundamental feature of MMAS (Section 2.4.5 on
page 39).
• Alpha only important for solution time. The choice of Alpha only affects
solution time. Although statistically significant for the quality responses, it
has a low ranking. This confirms the literature’s general recommendation
that Alpha be set to 1.
• Beta not important for solution time. The choice of Beta only affects solu-
tion quality and not solution time. This is a new result in the ACO literature.
• Reinitialisation Branching Factor is important. This has a strong effect
on Time and a moderate but statistically significant effect on quality. This
highlights the importance of tuning this parameter which is usually held con-
stant.
• Sufficient order model. A model of at least quadratic order is required to
model MMAS solution quality and MMAS solution time. This confirms the re-
sult of the screening study that simple OFAT approaches seen in the literature
are insufficient for accurately modelling and tuning MMAS performance.
• Relationship between tuning, problems and performance. Both the mod-
els of RelativeError-Time and ADA-Time were good predictors of MMAS per-
formance across the entire design space. The prediction intervals for full and
screened models were very similar, confirming that the decisions from the
screening study were correct. However, a ranking of the full model terms from
the tuning study suggests it may not have been appropriate to screen out
restart frequency.
• Tuned parameter settings. There was little similarity between the recom-
mended tuned parameter settings from the full and screened models but both
settings resulted in similar MMAS performance. This indicates that there may
be many combinations of parameter settings that give similar performance for
MMAS. Finding one of these settings is important, however, as the recommended
settings resulted in similar or better solution quality than random settings and
solution times two orders of magnitude better than random settings. There are
immense performance gains to be achieved.
For some combinations of problem size and problem standard deviation, the
RelativeError-Time recommendations resulted in worse solution quality than
some randomly chosen parameter settings. Solution time was nonetheless
improved. These poor recommendations may be due to the complex nature
of MMAS, which may be difficult to model with an interpolative DOE approach.
This complexity is particularly evident in the daemon actions phase of MMAS
(Section 2.4.5 on page 39) where reinitialisations occur. This ‘restart’ behaviour
of MMAS may be particularly hard to capture with the DOE approach.
11.5 Chapter summary
This chapter presented a case study applying the methodology of Chapter 6 to
the tuning of the Max-Min Ant System (MMAS) heuristic. Many new results were
presented and existing recommendations in the literature were confirmed in a rig-
orous fashion. The conclusions of the screening study in the previous chapter were
also confirmed.
12 Conclusions
This thesis presents a rigorous Design Of Experiments (DOE) approach for the
tuning of heuristic algorithms for Combinatorial Optimisation (CO). The thesis
therefore draws on well-established fields such as Operations Research, Design
Of Experiments, Empirical Analysis and Empirical Methodology and contributes a
much needed rigour [64, 103, 48, 65] to fields such as Heuristics, Metaheuristics,
and Ant Colony Optimisation.
This chapter summarises the thesis. It begins with a brief overview of the main
problem that the thesis addresses. It then examines the advantages of the ap-
proach taken in the thesis, leading to the main hypothesis of the work. The contri-
butions from this thesis are listed along with the thesis strengths and limitations.
The chapter closes with a discussion of possibilities for future work.
12.1 Overview
CO is a ubiquitous and extremely important type of optimisation problem that oc-
curs in scheduling, planning, timetabling and routing. However, these are typically
very difficult problems to solve exactly and so are generally tackled with approxi-
mate (heuristic) approaches and more general frameworks of heuristic approaches
called metaheuristics. Heuristics sacrifice finding an exact solution to a problem
and instead find an approximate solution in reasonable time. They therefore in-
herently involve a tradeoff between solution quality and solution time called the
heuristic compromise. Popular and successful metaheuristics include Evolutionary
Computation, Tabu Search, Iterated Local Search and Ant Colony Optimisation.
The flexibility of metaheuristics to deal with a range of problems comes at a cost—
their application to a particular problem typically requires setting the values of a
large number of tuning parameters. These tuning parameters are inputs to the
metaheuristic that govern its behaviour. Exploring this large parameter space and
relating it to problem characteristics and performance in terms of the heuristic
compromise is called the parameter tuning problem. Methodically solving the pa-
rameter tuning problem for metaheuristics is probably the single most important
obstacle to producing metaheuristics that are well understood scientifically and
easily applied in practical scenarios.
12.2 Advantages of DOE
This thesis takes an empirical approach to parameter tuning, adapting method-
ologies, experiment designs and analyses from the field of Design Of Experiments
[84, 85]. It therefore fits between analytical and automated tuning approaches. It
efficiently provides the raw data and trends to which the analytical camp should fit
their models. It recommends good quality parameter tuning settings against which
the automated tuning camp can compare their tuners’ recommendations.
DOE was chosen for several reasons. It is well-established theoretically and
well supported in terms of software tools. This means it should be relatively easy
to convince metaheuristics researchers of the thesis’ methodology and to encour-
age them to adopt it. While a familiarity with statistical analyses and their in-
terpretation is of course required, many of the more involved aspects can be left
to the statistical analysis software. This is certainly the case in other fields such
as psychology and medicine where researchers have a tradition of following good
practice in experiment design, data collection and data analysis without neces-
sarily being expert statisticians. The traditional areas to which DOE is applied in
engineering map almost directly to the common research questions that one asks
in metaheuristics research so DOE’s power and maturity can be transferred di-
rectly to metaheuristics research. DOE offers efficiency in terms of the amount
of data that needs to be gathered. This is critical when attempting to understand
immense design spaces. All DOE conclusions are based on statistical analyses and
so are supported with mathematical precision. This allays any concerns regarding
subjective interpretation of results.
The main advantage of DOE over automated approaches is that it produces
a model of the data. If the model is verified to be a good quality model then it
can be used to explore, understand and ask questions of the actual algorithm
performance. Automated approaches [12, 8] provide relatively fast solutions but
do not provide this power to explore the parameter-performance relationship. They
generally provide only a tuned solution. Understanding where this solution came
from requires understanding the complex tuning algorithm code and its dynamics.
Understanding the results from a DOE analysis requires only understanding a
clear and well-established methodology. While some application scenarios will not
require this understanding, scientific reproducibility certainly demands it.
These advantages of DOE over alternatives for tuning metaheuristics lead to the
thesis hypothesis.
12.3 Hypothesis
The main hypothesis of this thesis is:
The problem of tuning a metaheuristic can be successfully addressed with a
Design Of Experiments approach.
This hypothesis was tested in a multitude of ways. DOE nested designs were
used to investigate the importance of problem characteristics on heuristic per-
formance. DOE screening designs were used to rank the importance of tuning
parameters and problem characteristics. DOE response surface models were used
to model the relationship between tuning parameters, problem characteristics and
performance. Desirability functions were used to tune performance while simulta-
neously addressing the two aspects of the heuristic compromise.
The key finding was that the Design of Experiments approach is indeed an
excellent method for modelling and tuning metaheuristics like Ant Colony Op-
timisation. This was demonstrated with independent confirmation experiments
and with comparisons to tuning parameter settings taken from the literature.
This was recognised by the community through peer-reviewed publications at
the field’s main conferences with best paper nominations and best paper awards
[105, 106, 110, 108, 104, 109]. This result is now covered in more detail.
12.4 Summary of main thesis contributions
The following is a summary of the main contributions from this thesis.
1. Synthesis. The poor quality of methodology in the metaheuristics field is
probably due to a lack of awareness of the issues involved. Chapter 3 sur-
veyed the literature to gather together these issues as they have been dis-
cussed over the past 30 years in fields such as heuristics and operations
research. Issues covered included:
(a) the types of research question and study,
(b) the stages in sound experiment design and some common mistakes,
(c) the issues of heuristic instantiation and problem abstraction,
(d) the importance of pilot studies,
(e) reproducibility of results,
(f) benchmarking of machines,
(g) the advantages and disadvantages of various performance responses,
(h) random number generators,
(i) problem instances,
(j) stopping criteria, and
(k) interpretative bias
This thesis strived to apply best practice regarding these issues and should
serve as a much needed illustration of good experiment design and analysis.
2. Heuristic code JACOTSP. All experiments were run with our Java version
(JACOTSP) of the original C source code (ACOTSP) accompanying the litera-
ture [47]. JACOTSP was informally verified to produce the same behaviour as
ACOTSP by comparing their outputs on a variety of problem instances and a
variety of tuning parameter settings. JACOTSP also offers the usual advan-
tages of an Object-Oriented (OO) design, namely extensibility and reuse. It is
intended to make JACOTSP available to the community, creating a focal point
for more reproducible research with ACO.
3. Problem generator code Jportmgen. All problem instances were generated
with our Java port (Jportmgen) of a generator used in a large open competition
in the optimisation community (portmgen) [58]. Jportmgen was informally
verified to produce the same instances as the original portmgen. Again, the
OO design permitted instances to be generated with edge lengths that follow
a plugged-in distribution. This was important for experiments with problem
characteristics.
4. Methodology for investigating if a problem characteristic affects performance. A detailed methodology was presented for determining whether
a given problem characteristic affects heuristic performance. This involved
the introduction of the nested design and its analysis to the ACO field. The
methodology was illustrated with a published case study of ACS and MMAS
[105, 110]. This method and experiment design are now attracting attention
in the stochastic local search field [6].
5. Methodology for screening tuning parameters and problem characteristics. A detailed methodology and efficient experiment designs were presented
for ranking the most important tuning parameters and problem character-
istics that affect performance and for screening out those that do not affect
performance. This methodology was published along with illustrative case
studies of its application [108, 104]. A further methodology was presented
for independently confirming the accuracy of the screening model and its rec-
ommendations. The thesis is the first use of fractional factorial designs (Sec-
tion A.3.2 on page 215) for screening ACO tuning parameters and problem
characteristics.
6. Methodology for modelling the relationship between tuning parameters, problem characteristics and performance. A detailed methodology and ef-
ficient experiment designs were presented for modelling the relationship be-
tween tuning parameters, problem characteristics and performance. This
methodology was published along with illustrative case studies of its appli-
cation [106, 109]. The thesis is the first use of response surface models and
fractional factorial designs for modelling ACO.
7. Desirability functions and emphasis on the heuristic compromise.
Throughout the thesis, the heuristic compromise has been emphasised. The mul-
tiobjective problem of reducing running time while improving solution quality
was tackled by the introduction of desirability functions. Using desirabil-
ity functions and tuning heuristic desirability does not exclude the analysis
of heuristic solution time and heuristic solution quality separately. It is a
convenient approach to deal with the heuristic compromise. Confirmation
experiments demonstrated that tuning parameter settings found with the de-
sirability approach offered orders of magnitude savings in solution time over
parameters taken from the literature.
8. A new important problem characteristic for ACO. The analysis of problem
difficulty from the case study in Chapter 7 showed that the standard deviation
of edge lengths in a TSP instance has a significant effect on problem difficulty
for ACS and MMAS. This means that research should report the standard de-
viation of instances. This result was confirmed in subsequent screening and
modelling case studies in which it was shown that problem instance standard
deviation had a very large effect on solution quality. This may extend to other
metaheuristics.
9. New results from screening ACS and MMAS. The screening experiments
answered many open questions regarding the importance of various tuning
parameters and problem characteristics. From screening ACS, it was shown
that:
• Tuning Ant placement not important. The type of ant placement has
no significant effect on ACS performance in terms of solution quality or
solution time. This was an open question in the literature. It is remark-
able because intuitively one would expect a random scatter of ants across
the problem graph to explore a wider variety of possible solutions. This
result shows that this is not the case.
• Tuning Alpha not important. Alpha has no significant effect on ACS
performance in terms of solution quality or solution time. This confirms
the common recommendation in the literature of setting alpha equal to
1. An OFAT analysis of alpha for ACS is reported in Appendix D.
• Tuning Rho not important. Rho has no significant effect on ACS per-
formance in terms of solution quality or solution time. This is a new
result for ACS. It is a surprising result since Rho is a term in the ACS
update pheromone equations and analytical approaches in very simpli-
fied scenarios have concluded that rho is important [41].
• Tuning Pheromone Update Ant not important. The ant used for phe-
romone updates is practically insignificant for all three responses. An
examination of the plot of time for the K-pheromoneUpdate factor shows
that the effect on time is not practically significant. K-pheromoneUpdate
can therefore be screened out.
• Most important tuning parameters. The most important ACS tuning
parameters are the heuristic exponent B-beta, the number of ants C-
antsFraction, the length of candidate lists D-nnFraction and the explo-
ration/exploitation threshold E-q0.
• Problem standard deviation is important. This confirms the main re-
sult of Chapter 7 in identifying a new TSP problem characteristic that has
a significant effect on the difficulty of a problem for ACS. ACO research
should be reporting this characteristic in the literature.
• Higher order model needed. A higher order model, greater than linear,
is required to model ACS solution quality and ACS solution time. This is
an important result because it demonstrates for the first time that simple
OFAT approaches seen in the literature are insufficient for accurately
tuning ACS performance.
• Comparison of solution quality responses. There is no difference in con-
clusions from the ADA and Relative Error solution quality responses.
ADA has a slightly smaller variability and so results in more powerful
experiments than Relative Error.
From screening MMAS, it was shown that:
(a) Tuning Restart Frequency not important. The tuning parameter Restart
Frequency is statistically insignificant for solution quality and solution
time in the factor ranges experimented with. It may, however, become
important when very high solution quality is required.
(b) Tuning AntPlacement not important. As with ACS, the design param-
eter AntPlacement does not have a significant effect on solution quality
or solution time. Either random scatter or single random node placement
can be used when placing ants on the TSP graph.
(c) Tuning Alpha only important for solution time. The choice of the
Alpha tuning parameter value only effects solution time. Although sta-
tistically significant for the quality responses, it has a low ranking. This
confirms the literature’s general recommendation that Alpha be set to 1
[47, p. 71].
(d) Problem difficulty results confirmed. The result of the study of prob-
lem characteristics affecting performance was confirmed with problem
edge length standard deviation having a very strong effect on solution
quality for MMAS.
(e) Important tuning parameters. Of the remaining unscreened tuning pa-
rameters, the heuristic exponent Beta, the number of ants antsFraction,
the length of candidate lists nnFraction, the exploration/exploitation thres-
hold q0 and the pheromone decay term Rho have the strongest effects
on solution quality. The same is true for solution time except for beta
which has a low ranking for solution time.
(f) New parameter. Reinitialisation Branching Factor (ReinitBranchFac) is
statistically significant for all three responses but is only ranked in the
top third for the quality responses. It has a high ranking for Time. This
highlights that ReinitBranchFac should be considered as a tuning pa-
rameter rather than being hard-coded as is typically the case.
(g) Higher order model of MMAS behaviour needed. A higher order model,
greater than linear, is required to model MMAS solution quality and
MMAS solution time. This is an important result because it demonstrates
for the first time that simple OFAT approaches seen in the literature are
insufficient for accurately tuning MMAS performance.
(h) Comparison of solution quality responses. There is no difference in con-
clusions from the ADA and Relative Error solution quality responses for
MMAS. The ADA response is therefore preferable for screening because
it exhibits a lower variability than Relative Error and therefore results in
more powerful experiments.
The modelling case studies in Chapters 9 and 11 both confirmed results from
the other screening case studies and yielded new results in their own right.
• Confirmation of screening study results. The models of ACS and
MMAS performance were built using the full set of tuning parameters
and the reduced set resulting from the previous screening experiments.
Both the full and reduced models were good predictors of performance in
terms of both solution quality and solution time. This confirmed the ac-
curacy of the previous screening studies’ recommendations. In general,
the ranking of the importance of the tuning parameters was in broad
agreement with the ranking from the screening study. Some small differ-
ences are to be expected because the screening analyses are conducted
on each response separately while the modelling analyses are conducted
on the responses simultaneously.
These results confirm that screening studies for ACO can be trusted for
screening out parameters that do not affect performance and so reducing
the parameter space to explore in more expensive modelling studies.
• Possibility of multiple regions of interest. The recommended param-
eter settings from the desirability optimisation of full and screened mod-
els were not the same for MMAS. However, when the recommendations
were independently evaluated, both gave similarly competitive solution
qualities and similarly huge savings in solution time on new problem
instances. This highlights the possibility of multiple regions of interest
in the parameter space of MMAS. This is important because it confirms
the futility of attempting to recommend ‘optimal’ parameter settings. It
also illustrates the more complicated parameter-performance relation-
ship that emerges when one considers the two aspects of the heuristic
compromise simultaneously. There is probably more than one region
in the parameter space where a similar compromise in solution quality
and solution time can be found.
• Quadratic models needed. Fit analyses for the ACS and MMAS response
surface models showed that a surface of order at least quadratic is re-
quired to model these metaheuristics. This rules out the use of OFAT
approaches (Section 2.5 on page 47) to tuning these heuristics. The
quadratic models were independently confirmed to be good predictors of
performance across the parameter space. It is an open question whether
higher order models and the associated increase in experiment expense
would yield even better predictions of performance.
Note that all the aforementioned results were obtained at a significance level of
5% and the largest effect size that could be detected with a power of 80%. Please
refer to the individual case studies for the details of these effect sizes. Note that
these effect sizes were limited by the number of replicates that could be run with
the available experimental resources. In an ideal situation, the experimenter would
determine the effect size from the experimental objectives and then increase the
replicates until sufficient power was achieved.
12.5 Thesis strengths
12.5.1 Rigour and efficiency
The strengths of this thesis come from the strengths of DOE (Section 2.5 on
page 47). The thesis’ methodologies are adapted from well-established and tested
methodologies used in other fields such as manufacturing. They are therefore proven
on decades of scientific and industrial experience. The experiment designs allow
for a very efficient use of experiment resources while still obtaining all of the most
important information from the data. Until this thesis, there has been little aware-
ness of the potential of these designs in the ACO literature. Their efficiency is crit-
ically important when experiments are expensive due to large parameter spaces
and difficult problem instances. In particular, the designs provide a vast saving in
experiment runs (Section A.3.3 on page 218). Because DOE and Response Sur-
face Models build a model of performance across the whole design space, many
research questions can be explored. Numerical optimisation of this surface can
recommend tuning parameter settings for different weightings of the responses of
interest. One may obtain settings appropriate for long run times and high quality
or short run times and lower levels of solution quality. All of these questions are
answered on the same model without need to rerun experiments.
12.5.2 Generalisability
The methodology and results from this thesis are of interest to both those who
design heuristics and engineers who wish to deploy a heuristic. Designers can use
the thesis methodology to rank the contribution of new additions to a heuristic
(design parameters) as well as to understand and model the contribution of tuning
parameters to changes in performance. DOE provides a rigorous approach for
testing hypotheses about a new heuristic, categorically determining whether new
techniques/components make a significant impact on performance. For engineers,
DOE provides a verifiably accurate model of behaviour, allowing the heuristic to
be quickly retuned to new problem instances without running a large set of new
experiments.
Although all case studies illustrate the application of the thesis’ methodolo-
gies to ACO, there is no reason why the methodologies cannot be applied to other
heuristics.
12.5.3 Reproducibility and empirical best practice
An effort has been made throughout the thesis to address the methodological is-
sues raised and discussed in Chapter 3. The thesis uses algorithm and problem
generator code that is backwards compatible with codes commonly used in the
field. This strengthens the reproducibility of its results and makes its conclusions
applicable to all previous work that has used these codes. Experiment machines
were properly benchmarked according to an established procedure [58]. This im-
proves the thesis’ reproducibility and applicability for all subsequent research work
that may refer to this thesis.
12.6 Thesis limitations
DOE is not a panacea for the myriad difficulties that arise in the empirical analysis
of heuristics. It goes a long way towards overcoming many of those difficulties.
However, the conclusions and contributions of this thesis are necessarily limited
in a few ways.
• Computational expense. Despite the efficiency of the DOE designs that this
thesis introduced, running enough experiments to gather sufficient data
is still computationally expensive. Of course, the experiments would have
been orders of magnitude more expensive had a less sophisticated approach
been used. This expense is increased when there are many categorical tun-
ing parameters due to the nature of how the designs are built. However,
any expense is mitigated by the amount of useful and structured informa-
tion obtained. DOE yields a full model of the data that can be explored in
many ways and used to make new predictions about the heuristic perfor-
mance across the entire design space. The DOE designs and methods in this
thesis are the state-of-the-art approach for building such models. If a user
is concerned about quickly tuning parameters in a one-off scenario, an au-
tomated approach may be a preferable alternative. Recall however that use
of an automated method implies that the user is content to trust its black-
box approach and requires no understanding of the parameter, problem and
performance relationship.
• Categorical factors. The previous point mentioned how categorical tuning
parameters increase the size and consequently the expense of the experi-
ments. It must be pointed out that the 2-level fractional factorial designs
used in screening can only take on two values for each factor. This is not
a severe limitation when one considers the main motivation of a screening
design—to determine whether a factor should be included in the more expen-
sive Response Surface Model design.
• Nested parameters. A parameter type that we term a nested parameter arose
in the analysis of MMAS parameters in Section 2.4.9 on page 45. These are
parameters that only make sense within their parent parameter. Factorial ex-
periment designs cannot analyse these types of parameters directly. Values of
the nested parameter and its parent could be lumped into a single categorical
parameter. Our summary of tuning parameters in Section 2.4.9 on page 45
suggests that these types of parameter might not be so common anyway.
12.7 Future work
The research presented in this thesis should be developed and extended along the
following lines.
• Further ACO algorithms. The original ACOTSP and our JACOTSP contain
further ACO algorithms, Rank-based Ant System [24], Elitist Ant System and
Best-Worst Ant System [31]. These could easily be investigated with the thesis
methodologies to see whether similar results are obtained regarding the im-
portance of various tuning parameters. It would also be of interest to extend
these algorithms with local search.
• Comparison to OFAT. It would strengthen the argument for the use of DOE
in favour of OFAT if a comprehensive comparison of the two methods were
conducted as has been done in other fields [37].
• Comparison to other tuning methods. It would be interesting to compare
DOE to the results from other tuning approaches such as automated tuning.
• Further heuristics. The methodology could also be applied to other heuris-
tics. Screening studies, for example, have already been independently applied
to Particle Swarm Optimisation [74].
• Useful tools. A strength of DOE is its support in software. There is no
excuse for the metaheuristics practitioner to claim that statistics are too time-
consuming or complicated to use (Section 2.5 on page 47). Modern statistical
analysis software shields the user from much of this complexity. A greater
awareness of this software and tutorials on how to use it with metaheuristics
is urgently needed.
12.8 Closing
It is hoped that this thesis has convinced the reader of the merits of the DOE
approach when applied to the problem of tuning metaheuristics. The parameter
tuning problem is ubiquitous in the field and must be tackled in every new piece
of metaheuristics research. The methodologies of this thesis, or some appropri-
ate adaptation of them, should be used when setting up ACO heuristics. There
is no longer any excuse for inheriting values from other publications or for fuzzy
reasoning with words and intuition about the parameters that need to be tuned.
Adopting this thesis’ methodologies will add to the expense of metaheuristics ex-
periments. However, this is the unavoidable reality of dealing with metaheuristics
with a large number of tuning parameters. The researcher who embraces this the-
sis’ methodologies will have at their disposal an established, efficient, rigorous,
reproducible approach for making strong conclusions about the relationship be-
tween metaheuristic tuning parameters, problem characteristics and performance.
A Design Of Experiments (DOE)
This appendix provides a basic background on the main Design Of Experiments
(DOE) and statistics concepts used in this thesis and introduced in Section 2.5 on
page 47. The material in this appendix is adapted and compiled from the literature
[84, 1, 89, 85] for the reader’s convenience and is not intended to replace a detailed
study of those texts.
The appendix begins with an introduction to the terminology for the DOE field.
It then provides a short explanation of the various DOE topics referred to in the
thesis.
A.1 Terminology
The following terms are encountered frequently in the design and analysis of ex-
periments.
A.1.1 Response variable
The response variable is the measured variable of interest. In the analysis of meta-
heuristics, one typically measures the solution quality and solution time required
by a heuristic as these are reflections of the heuristic compromise. The DOE ap-
proach can be used in heuristic design as well as heuristic performance analysis
and so the choice of response variable is limited only by the experimenter’s imag-
ination. In some cases, it may be appropriate to measure the frequency of some
internal heuristic operation for example.
A.1.2 Factors and Levels
A factor is an independent variable manipulated in an experiment because it is
thought to affect one or more of the response variables. The various values at
which the factor is set are known as its levels. In heuristic performance analysis,
the factors include both the heuristic tuning parameters and the most important
problem characteristics. These are sometimes distinguished by referring to them
as tuning factors and problem factors respectively. Factors can also be new heuris-
tic components that are hypothesised to improve performance. Sometimes factors
are distinguished as being either design factors (or primary factors) or held-constant factors (or secondary factors). Design factors are those factors that are being stud-
ied because we are interested in their effects on the responses. Held-constant
factors are those factors that are known to affect the responses but are not of in-
terest in the present study. They should be held at a constant value throughout
all experiments.
A.1.3 Treatments
A treatment is a specific combination of factor levels. The particular treatments will
depend on the particular experiment design and on the ranges over which factors
are varied.
A.1.4 Replication
Replicates are repeated runs of a treatment. Replicates are needed when a studied
process produces different response measurements for identical runs of a treat-
ment. This is always the case with stochastic heuristics. The number of replicates
required in an experiment is linked to the statistical concept of power discussed
later.
A.1.5 Effects
An effect is a change in the response variable due to a change in one or more
factors. We can define main effects as follows:
The main effect of a factor is a measure of the change in the response
variable to changes in the level of the factor averaged over all levels of all
the other factors. [89]
Higher order effects (or interactions) are effects that occur when the combined
change in two or more factors produces an effect greater (or less) than the sum
of the effects expected from each factor alone. An interaction occurs when the
effect of one factor depends on the level of another factor. A second order effect
is due to two factors, a third order effect to three, and so on.
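As a small illustration of these definitions, the following is a minimal sketch of effect estimation for a 2^2 factorial with factors A and B; the response values in main are invented for the example.

public final class Effects {

    // y[a][b] is the (replicate-averaged) response with factor A at level a
    // and factor B at level b, where 0 = low and 1 = high.

    // Main effect of A: average response at A high minus average at A low,
    // averaged over both levels of B.
    static double mainEffectA(double[][] y) {
        return ((y[1][0] + y[1][1]) - (y[0][0] + y[0][1])) / 2.0;
    }

    // AB interaction: half the difference between the effect of A at B high
    // and the effect of A at B low; zero when the factors act additively.
    static double interactionAB(double[][] y) {
        return ((y[1][1] - y[0][1]) - (y[1][0] - y[0][0])) / 2.0;
    }

    public static void main(String[] args) {
        double[][] y = { { 10.0, 12.0 }, { 14.0, 20.0 } }; // illustrative responses
        System.out.println(mainEffectA(y));   // (14 + 20 - 10 - 12) / 2 = 6.0
        System.out.println(interactionAB(y)); // ((20 - 12) - (14 - 10)) / 2 = 2.0
    }
}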
A.1.6 Confounding
Two or more effects are said to be confounded if it is impossible to separate the ef-
fects when the subsequent statistical analysis is performed. This is best described
with an example.
A computer scientist has developed a new algorithm and wishes to compare
it with an established algorithm. He has two machines available to him. The
established algorithm will be run on one machine and the experimental algorithm
on the other. The characteristic to be measured as an index of performance will
be the run time of the algorithm to solve a particular problem. However, when the
two run times are compared, it is impossible to say how much of the difference is
due to the algorithms and how much is due to inherent differences (age, operating
system version, memory) between the two machines.
The effects of algorithm and machine are thus confounded. Confounding is due
to poor experimental planning and execution, particularly to poor control of factors.
It is important to stress the difference between confounding and aliasing. Aliasing
is an inability to distinguish several effects due to the nature of the experiment
design rather than poor execution. It is a deliberate and known price that we pay
for using more efficient designs such as fractional factorials, as discussed later.
A.2 Regions of operability and interest
There are two regions within an experimental design space [85]. The region of operability is the region in which the equipment, process, etc. works and it is the-
oretically possible to conduct an experiment and measure responses. In ACO, the
region of operability is sometimes bounded, as with the pheromone decay
parameter ρ, which must lie in the range 0 < ρ < 1. With other tuning parameters
such as α, the region of operability is, in theory, unbounded. Within this region of
operability, there may be one or more regions of interest. A region of interest is a
region to which an experimental design is confined. The region of interest is typ-
ically chosen because we believe it contains the optimal process settings. These
regions are illustrated schematically in the following figure.
[Diagram: a region of operability O containing regions of interest R and R’.]

Figure A.1: Region of operability and region of interest (adapted from [85]).
The region of operability is often not known until the process has been well
studied and may change depending on circumstances. Myers and Montgomery
[85] offer the following comments on the difficulty of choosing an experimental
region of interest.
. . . in many situations the region of interest (or perhaps even the re-
gion of operability) is not clear cut. Mistakes are often made and adjustments
adopted in future experiments. Confusion regarding type of design should never
be an excuse for not using designed experiments. Using [a
design for the wrong type of region] for example, will still provide impor-
tant information that will, among other things, lead to more educated
selection of regions for future experiments. [85, p. 317]
A.3 Experiment Designs
There are many experiment designs to choose from. The design an experimenter uses
will depend on many factors including the particular research question, whether
experiments are in the early stages of research and the experimental resources
available. This section focuses on the advanced designs that appear in this the-
sis. It begins with a simpler more common design as this provides the necessary
background for understanding the subsequent designs.
A.3.1 Full and 2^k Factorial Designs
A full factorial design consists of a crossing of all levels of all factors. The number
of levels of each factor can be two or more and need not be the same for each
factor. These levels may be quantitative (scalar), such as values of pheromone de-
cay constant; or they may be qualitative, such as types of algorithm. This is an
extremely powerful but expensive design. A more useful type of factorial for DOE
uses k factors, each at only 2 levels. The so-called 2^k factorial design provides
the smallest number of runs with which k factors can be studied in a full factorial
design. Factorials have some particular advantages and disadvantages [89]. These
are worth noting given the importance that factorials play in experimental design.
The advantages are that:
• greater efficiency is achieved in the use of available experimental resources
in comparison to what could be learned from the same number of experiment
runs in a less structured context such as an OFAT analysis [37],
• information is obtained about the interactions, if any, of factors because the
factor levels are all crossed with one another, and
• results are more comprehensive over a wider range of conditions due to the
combining of factor levels in one experiment.
Of course, these advantages come at a price. As the number of factors grows,
the number of treatments in a 2^k design rapidly overwhelms the experiment re-
sources. Consider the case of 10 continuous factors. A naïve full factorial design
for these ten factors will require a prohibitive 2^10 = 1024 treatments. The full facto-
rial experiment is the ideal design for many of the research questions in this thesis
but the size of metaheuristic design spaces limits its applicability. A more efficient
design is required.
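A minimal sketch of how the treatments of a 2^k design can be enumerated, in the usual -1/+1 coding, makes this exponential growth concrete:

public final class TwoLevelFactorial {

    // Enumerate all 2^k treatments by reading the bits of a counter as
    // low (-1) / high (+1) factor levels.
    static int[][] treatments(int k) {
        int runs = 1 << k; // 2^k treatments
        int[][] design = new int[runs][k];
        for (int run = 0; run < runs; run++)
            for (int factor = 0; factor < k; factor++)
                design[run][factor] = ((run >> factor) & 1) == 1 ? +1 : -1;
        return design;
    }

    public static void main(String[] args) {
        System.out.println(treatments(10).length); // prints 1024
    }
}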
A.3.2 Fractional Factorial Design
The previous section mentioned the exponential increase in expense of factorial
designs with an increase in the design factors. There are benefits to this expense.
A 2^10 full factorial will provide data to evaluate all the effects listed in the next table.
For screening, however, the experimenter is interested only in the main effects (the
design factors) and perhaps the two-factor effects. This makes the full factorial
inefficient for screening purposes.
Effect        Number estimated
Main          10
Two-factor    45
Three-factor  120
Four-factor   210
Five-factor   252
Six-factor    210
Seven-factor  120
Eight-factor  45
Nine-factor   10
Ten-factor    1

Table A.1: Numbers of each effect estimated by a full factorial design of 10 factors.
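These counts are binomial coefficients: among k factors there are k!/(j!(k − j)!) distinct j-factor effects, so for k = 10 there are 10!/(2!8!) = 45 two-factor effects and 10!/(5!5!) = 252 five-factor effects. The counts over all orders sum to 2^10 − 1 = 1023 estimable effects, excluding the overall mean.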
If it is assumed that higher-order interactions are insignificant, information on
the main effects and lower-order interactions can be obtained by running a fraction
of the complete factorial design. This assumption is based on the sparsity of effects principle. This states that a system or process is likely to be most influenced by
some main effects and low-order interactions and less influenced by higher order
interactions.
A judiciously chosen fraction of the treatments in a full factorial will yield in-
sights into only the lower order effects. This is termed a fractional factorial. The
price we pay for the fractional factorial’s reduction in number of experimental
treatments is that some effects are indistinguishable from one another. They are
aliased. Additional treatments, if necessary, can disentangle these aliased effects
should an alias group be statistically significant. The advantage of the fractional
factorial is that it facilitates sequential experimentation. The additional treatments
and associated experiment runs need only be performed if aliased effects are sta-
tistically significant. Depending on the number of factors, and consequently the
design size, a range of fractional factorials can be produced from a full factorial.
The extent to which higher order effects are aliased is described by the design’s
resolution. In Resolution III designs, main effects are aliased with second-order
effects. Resolution IV designs have main effects unaliased with second-order effects,
but second-order effects are aliased with one another. Resolution V designs estimate
main effects and second-order effects without aliasing them with each other.
The details of how to choose a fractional factorial’s treatments are beyond the
scope of this thesis. It is an established algorithmic procedure that is well covered
in the literature [84] and is provided in all modern statistical analysis software.
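To give a flavour of the construction, the following is a minimal sketch of the principal half fraction of a 2^k design, using the common generator that sets the last factor to the product of all the others; for k = 4 this gives D = ABC, i.e. the defining relation I = ABCD and a resolution IV design.

public final class HalfFraction {

    // Principal half fraction of a 2^k design: a full factorial in the
    // first k-1 factors, with the last factor set from the generator
    // (the product of all the other columns).
    static int[][] design(int k) {
        int runs = 1 << (k - 1); // half the 2^k treatments
        int[][] d = new int[runs][k];
        for (int run = 0; run < runs; run++) {
            int product = 1;
            for (int f = 0; f < k - 1; f++) {
                d[run][f] = ((run >> f) & 1) == 1 ? +1 : -1;
                product *= d[run][f];
            }
            d[run][k - 1] = product; // e.g. D = ABC when k = 4
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(design(4).length); // prints 8 (a 2^(4-1) design)
    }
}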
The fractional factorials used in this research are summarised in the next figure
which shows the relationship between number of factors, design resolution and
associated number of experiment treatments.
[Chart: for each number of factors (2 to 12, horizontal axis) and number of treatments (4 to 512, left axis), the available 2^(k-p) fractional factorial design and its resolution, e.g. the 2^(9-4) resolution IV design at 9 factors and 32 treatments.]

Figure A.2: Fractional Factorial designs for two to twelve factors. The required number of treatments is listed on the left. Resolution III designs are coloured darkest, followed by Resolution IV designs (estimate main effects only), followed by Resolution V and higher (estimate main effects and second order interactions).
The minimum appropriate fractional factorial design resolution for screening is
therefore resolution IV since screening aims to remove factors (main effects) that
do not affect the responses. A resolution V design is preferable when resources
allow because it also tells us what second order effects are present without the
need for additional treatments and experiment runs.
It is informative to consider the two available resolution IV designs for 9 factors
in the next figure as examples of the importance of examining alias structure.
The 2^(9-4) design requires 32 treatments while the 2^(9-3) is more expensive with
64 treatments. The cheaper 2^(9-4) design has 8 of its 9 main effects aliased, each
with 3 third-order interactions. The 2^(9-3) design has only 4 of its 9 main effects
aliased, each with a single third-order interaction. The second-order interactions
are almost all aliased in the more expensive 2^(9-3) design, but its aliasing is more
favourable than that of the cheaper 2^(9-4) design. Resources permitting, the more expensive 2^(9-3)
design is therefore more desirable for screening main effects.
Screening designs based on 2^k factorials and fractional factorials can only pro-
duce linear models of a response because each factor appears at only two levels.
For a more complicated relationship between factors and response, a more
sophisticated design is required. These are called Response Surface designs.

Figure A.3: Effects and alias chains for a 2^(9-3) resolution IV design and a 2^(9-4) resolution IV design. The main-effect alias chains are:

2^(9-3):  [A] = A + DHJ    [B] = B    [C] = C    [D] = D + AHJ    [E] = E
          [F] = F    [G] = G    [H] = H + ADJ    [J] = J + ADH

2^(9-4):  [A] = A
          [B] = B + CHJ + DGJ + EFJ    [C] = C + BHJ + DGH + EFH
          [D] = D + BGJ + CGH + EFG    [E] = E + BFJ + CFH + DFG
          [F] = F + BEJ + CEH + DEG    [G] = G + BDJ + CDH + DEF
          [H] = H + BCJ + CDG + CEF    [J] = J + BCH + BDG + BEF

(The second- and third-order alias chains of the original figure are omitted here.)
A.3.3 Efficiency of Fractional Factorial Designs
The following figure makes explicit the huge savings in experiment runs when
using a fractional factorial design instead of a full factorial design.
Screening designs:

Design         Treatments   % saving of treatments
Full           512          -
2^(9-5) III    16           97
2^(9-4) IV     32           94
2^(9-3) IV     64           88
2^(9-2) VI     128          75

Response surface designs*:

Design         Treatments   % saving of treatments
Full           531          -
Half           275          50
Quarter        147          75
Min Run        65           91

* FCC with 1 centre point

Figure A.4: Savings in experiment runs when using a fractional factorial design instead of a full factorial design. The savings for screening designs are given first and the savings for response surface designs second. In both cases, fractional factorial designs offer enormous savings in number of treatments over the full factorial alternative.
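The percentage savings follow directly from the treatment counts: a 2^(9-p)
fractional factorial needs 2^(9-p) treatments against the 2^9 = 512 of the full
factorial, so for example

\[ \mathrm{saving}\left(2^{9-4}_{\mathrm{IV}}\right) = 1 - \frac{2^{9-4}}{2^{9}} = 1 - \frac{32}{512} \approx 94\%. \]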
A.3.4 Response Surface Designs
There are several types of experiment design for building response surface models.
This research uses Central Composite Designs (CCD). A CCD contains an embedded
factorial (or fractional factorial design). This is augmented with both centre points
and a group of so-called ‘star points’ that allow estimation of curvature. Let the
distance from the centre of the design space to a factorial point be ±1 unit for each
factor. Then, the distance from the centre of the design space to a star point is ±α, where |α| > 1. The value of α depends on certain properties desired for the design
and on the number of factors involved. The number of centre point runs the design
is to contain also depends on certain properties required for the design. There are
three types of central composite design, illustrated in Figure A.5.
Figure A.5: Central composite designs for building response surface models. From left to right these designs are the Circumscribed Central Composite (CCC), the Face-Centred Composite (FCC) and the Inscribed Central Composite (ICC). The design space is represented by the shaded area. The factorial points are black circles and the star points are grey squares.
The designs differ in the location of their axial points. The choice of design
depends on the nature of the factors being experimented with.
• Circumscribed Central Composite (CCC). In this design, the star points
establish new extremes for the low and high settings for all factors. These
designs require 5 levels for each factor. Augmenting an existing factorial or
resolution V fractional factorial design with star points can produce this de-
sign.
• Inscribed Central Composite (ICC). For those situations in which the limits
specified for factor settings are truly limits, the ICC design uses the factor
settings as the star points and creates a factorial or fractional factorial design
within those limits. This design also requires 5 levels of each factor.
• Face-Centred Composite (FCC). In this design, the star points are at the
centre of each face of the factorial space, so α = ±1. This design requires
just 3 levels of each factor.
An existing factorial or resolution V design from the screening stage can be aug-
mented with appropriate star points to produce the CCC and FCC designs. This is
not the case with the ICC and so it is less useful in the sequential experimentation
scenario (see Section 6.1).
Many of the tuning parameters encountered in ACO algorithms have a restric-
tive range of values they can take on. For example, the exploration/exploitation
threshold q0 (Section 2.4.6 on page 42) must be greater than or equal to 0 and less
than or equal to 1. The problem instance characteristics also have a restrictive
range imposed by the user. We cannot hope to model all possible instances and so
must restrict our instance characteristic ranges to those that will be encountered
in the application of the algorithm. The FCC is designed for scenarios where such
restrictions on factor ranges are enforced. Clearly, it is the most appropriate design
in the current ACO parameter tuning scenario. The FCC is used in all response
surface modelling in this thesis.
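To make the construction of these designs concrete, the following sketch
(illustrative class and method names, not taken from the thesis or the JACOTSP
code base) enumerates the points of an FCC design in coded units, where each
factor's restricted range is mapped to [-1, +1]:

import java.util.ArrayList;
import java.util.List;

/** Generates the design points of a Face-Centred Composite (FCC) design
 *  in coded units, where every factor ranges over [-1, +1]. */
public final class FccDesign {

    /** Returns the 2^k factorial corners, 2k face-centred star points
     *  (alpha = 1) and a single centre point, for k factors. */
    public static List<double[]> points(int k) {
        List<double[]> design = new ArrayList<>();
        // Factorial corners: every combination of -1/+1 across k factors.
        for (int mask = 0; mask < (1 << k); mask++) {
            double[] p = new double[k];
            for (int j = 0; j < k; j++) {
                p[j] = ((mask >> j) & 1) == 0 ? -1.0 : +1.0;
            }
            design.add(p);
        }
        // Star points: +/- alpha on one axis, 0 elsewhere. The FCC uses
        // alpha = 1, so they sit at the centre of each face of the cube.
        for (int j = 0; j < k; j++) {
            for (double alpha : new double[] {-1.0, +1.0}) {
                double[] p = new double[k]; // all zeros
                p[j] = alpha;
                design.add(p);
            }
        }
        // One centre point; replicated centre runs are added as required.
        design.add(new double[k]);
        return design;
    }

    public static void main(String[] args) {
        // For k = 2 this yields 4 + 4 + 1 = 9 points, a 3 x 3 grid with
        // the required 3 levels per factor.
        for (double[] p : points(2)) {
            System.out.println(java.util.Arrays.toString(p));
        }
    }
}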
A.3.5 Prediction Intervals
A regression model from the response surface design is used to predict new values
of the response given values of the tuning parameter and problem characteristic
input variables. The model’s p% prediction interval is the range into which you can
expect any individual value from the actual heuristic to fall p% of the time. The
prediction interval will be larger (a wider spread) than a confidence interval since
there is more scatter in individual values than in averages. Montgomery [84, p.
394-396] describes the mathematical formulation of prediction intervals and their
applications. In particular, prediction intervals should be used in confirmation
experiments to verify that models of the heuristic behaviour are correct. This thesis
is the first use in the heuristics literature of prediction intervals and independent
confirmation runs to verify conclusions.
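For reference, the prediction interval on a single future observation at a point
x_0 takes the standard regression form

\[ \hat{y}(x_0) \pm t_{\alpha/2,\,n-p} \sqrt{\hat{\sigma}^2 \left( 1 + x_0' (X'X)^{-1} x_0 \right)} \]

where X is the design matrix, n the number of runs and p the number of model
terms. The extra 1 under the square root, absent from the corresponding
confidence interval on the mean response, is what makes the prediction interval
wider.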
A.3.6 Desirability functions
The concept of a desirability function can be briefly described as follows [84, 1].
The desirability function approach is a widely used industrial method for optimis-
ing multiple responses. The basic idea is that a process with many quality char-
acteristics is completely unacceptable if any of those characteristics are outside
some desired limits. For each response, Yi, a desirability function di(Yi) assigns a
number between 0 and 1 to the possible values of the response Yi. di(Yi) = 0 is a
completely undesirable value and di(Yi) = 1 is an ideal response value. These
k individual desirabilities are combined into an overall desirability D using
a geometric mean:

\[ D = \left( d_1(Y_1) \times d_2(Y_2) \times \cdots \times d_k(Y_k) \right)^{1/k} \tag{A.1} \]
A particular class of desirability function was proposed by Derringer and Suich
[40]. Let Li and Ui be the lower and upper limits respectively of response i, and
let Ti be the target value. If the target value is a maximum then

\[ d_i = \begin{cases} 0 & y_i < L_i \\ \left( \dfrac{y_i - L_i}{T_i - L_i} \right)^{r} & L_i \le y_i \le T_i \\ 1 & y_i > T_i \end{cases} \tag{A.2} \]

If the target is a minimum value then

\[ d_i = \begin{cases} 1 & y_i < T_i \\ \left( \dfrac{U_i - y_i}{U_i - T_i} \right)^{r} & T_i \le y_i \le U_i \\ 0 & y_i > U_i \end{cases} \tag{A.3} \]
The value r adjusts the shape of the desirability function. A value of r = 1
gives a linear function. A value of r > 1 increases the emphasis on being close
to the target value. A value of 0 < r < 1 decreases this emphasis. These cases
are illustrated in Figure A.6.

Figure A.6: Individual desirability functions. On the left is a maximise function and on the right is a minimise function. Figure adapted from [84, p. 426].
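As a concrete illustration, the following sketch (illustrative names; the limits
used in main are invented for the example, not taken from the thesis) implements
the individual desirability functions of Equations A.2 and A.3 and the overall
desirability of Equation A.1:

/** A sketch of the Derringer-Suich desirability functions. */
public final class Desirability {

    /** Equation A.2: desirability when the target Ti is a maximum. */
    static double maximise(double y, double L, double T, double r) {
        if (y < L) return 0.0;                 // completely undesirable
        if (y > T) return 1.0;                 // ideal
        return Math.pow((y - L) / (T - L), r); // ramp between L and T
    }

    /** Equation A.3: desirability when the target Ti is a minimum. */
    static double minimise(double y, double T, double U, double r) {
        if (y < T) return 1.0;                 // ideal
        if (y > U) return 0.0;                 // completely undesirable
        return Math.pow((U - y) / (U - T), r); // ramp between T and U
    }

    /** Equation A.1: overall desirability as the geometric mean. */
    static double overall(double... d) {
        double product = 1.0;
        for (double di : d) product *= di;
        return Math.pow(product, 1.0 / d.length);
    }

    public static void main(String[] args) {
        // E.g. maximise solution quality with r = 1 (linear) and minimise
        // running time; the numeric limits here are for illustration only.
        double dQuality = maximise(0.8, 0.0, 1.0, 1.0);  // = 0.8
        double dTime = minimise(30.0, 10.0, 60.0, 1.0);  // = 0.6
        System.out.println(overall(dQuality, dTime));    // ~= 0.693
    }
}

Note how the geometric mean makes D zero whenever any single response is
completely undesirable, capturing the idea that the process is then
unacceptable overall.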
A.4 Experiment analysis
Experiment analysis comprises the steps one takes after designing an experiment
and gathering data. The analysis steps in this thesis are listed in Chapter 6 on methodology.
Some of these steps are covered in more detail here.
A.4.1 Stepwise regression
Various techniques can be used to identify the most important terms that should
be included in a regression model. This thesis uses an automated approach called
stepwise regression. Usually, this takes the form of a sequence of F-tests, but
other techniques are possible. The two main stepwise regression approaches are:
1. Forward selection. This involves starting with no variables in the model,
trying out the variables one by one and including them if they are ’statistically
significant’.
2. Backward selection. This involves starting with all candidate variables and
testing them one by one for statistical significance, deleting any that are not
significant according to an alpha-out value (a minimal sketch of this loop
follows the list).
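A minimal sketch of the backward selection loop follows; the pValueOf helper is
a hypothetical stand-in for the partial F-test p-value that a statistics package
would compute for a term given the rest of the current model.

import java.util.ArrayList;
import java.util.List;

public final class BackwardSelection {

    static final double ALPHA_OUT = 0.1; // the alpha-out value used in this thesis

    /** Repeatedly drops the least significant term until every remaining
     *  term has a p-value at or below the alpha-out threshold. */
    static List<String> select(List<String> candidateTerms) {
        List<String> model = new ArrayList<>(candidateTerms);
        while (true) {
            String worst = null;
            double worstP = ALPHA_OUT;
            for (String term : model) {
                double p = pValueOf(term, model);
                if (p > worstP) {
                    worstP = p;
                    worst = term;
                }
            }
            if (worst == null) {
                return model;        // all remaining terms are significant
            }
            model.remove(worst);     // drop the worst term and re-fit
        }
    }

    /** Hypothetical: p-value of a term given the rest of the model. A random
     *  placeholder so that the sketch compiles and runs. */
    static double pValueOf(String term, List<String> model) {
        return Math.random();
    }

    public static void main(String[] args) {
        System.out.println(select(List.of("A", "B", "AB")));
    }
}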
This thesis uses backward selection for the choice of terms in all its analyses
with an alpha out value of 0.1. There are several criticisms of stepwise regression
methods worth noting.
1. A sequence of F-tests is often used to control the inclusion or exclusion of
variables, but these are carried out on the same data and so there will be
problems of multiple comparisons for which many correction criteria have
been developed.
2. It is difficult to interpret the p-values associated with these tests, since each
is conditional on the previous tests for inclusion and exclusion.
Nonetheless, the accuracy of all models in this thesis is independently analysed
with confirmation runs. This should allay any engineer’s concerns over the use of
stepwise regression for metaheuristic screening and modelling.
A.4.2 ANOVA diagnostics
Once an Analysis of Variance (ANOVA) has been calculated, some diagnostics must
be examined to ensure that the assumptions on which ANOVA depends have not
been violated.
• Normality. A Normal Plot of Studentised Residuals should be approximately
a straight line. Deviations from this may indicate that a transformation of the
response is appropriate.
• Constant Variance. A plot of Studentised Residuals against predicted re-
sponse values should be a random scatter. Patterns such as a ‘megaphone’
may indicate the need for a transformation of the response.
• Time-dependent effects. A plot of Studentised Residuals against run order
should be a random scatter. Any trend indicates the influence of some time-
dependent nuisance factor that was not countered with randomisation.
• Model Fit. A plot of predicted values against actual response values will
identify particular treatment combinations that are not well predicted by the
model. Points should align along the 45° line.
• Leverage and Influence. Leverage measures the influence of an individual
design point on the overall model. A plot of leverage for each treatment indi-
cates any problem data points.
• A plot of Cook’s distance against treatment measures how much the regres-
sion changes if a given case is removed from the model.
It is an open question how much a violation of these diagnostics invalidates
the conclusions from an ANOVA. Coffin and Saltzman [28] believe that the ANOVA
F-test is extremely robust to unequal variances provided that there is approxi-
mately the same number of observations in each treatment group. Diagnostics
were always examined and passed in the analyses of this thesis. Furthermore,
any concerns regarding these diagnostics are allayed with the use of independent
confirmation runs.
A.4.3 Response Transformation
If a model is correct and the assumptions are satisfied, the residuals should be
unrelated to any other variable, including the predicted response. This can be
verified by plotting the residuals against the fitted values. The plot should be un-
structured. However, sometimes nonconstant variance is observed. This is where
the variance of the observations increases as the magnitude of the observations in-
creases. The usual approach to dealing with this problem is to transform the data
and run the ANOVA on the transformed data. The next table gives some popular
transformations.
Name                  Equation
Logarithmic           Y′ = log10(Y + d)
Square root           Y′ = √(Y + d)
Inverse square root   Y′ = 1/√(Y + d)
Table A.2: Some common response transformations.
The appropriate transformation is chosen based on the shape of the data or, in
the case of this thesis, using an automated technique called a Box-Cox plot [21].
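For reference, the Box-Cox procedure considers, in its simplest form, the power
family of transformations

\[ Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln Y & \lambda = 0 \end{cases} \]

and plots the residual sum of squares of the fitted model against λ; the value
of λ at the minimum indicates the appropriate transformation [21].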
A.4.4 Outliers
An outlier is a data value that is much larger or smaller than the majority of the
experiment data. Outliers are important because they affect the ANOVA assump-
tions and can render conclusions from a statistical analysis invalid. Outliers are
easily identified with the ANOVA diagnostics. In this thesis, we take the approach
of deleting outliers. Some would disagree with this approach because outliers in
responses such as solution quality are not due to any random noise but are instead
actual repeatable data values. This is true but we must still somehow deal with the
outliers and make the data amenable to statistical analysis. If the proportion of
outliers deleted is reported then the reader can be assured that the outliers repre-
sented a very small proportion of the total data. If confirmation runs are reported
then the reader is reassured that the model was accurate despite the deletion of
outliers.
A.4.5 Dealing with Aliasing
Once a model has been successfully built and analysed, allowing for necessary
transformations and outliers, there may still be obstacles to interpreting the ANOVA
results. Some designs such as the fractional factorial reduce the number of ex-
periment runs required at the cost of some of the model’s effects being aliased.
Aliased effects are those effects that cannot be distinguished from one another. We
say that these effects form an alias chain. For example, if the main effect A and a
second order effect AB are aliased then we cannot tell whether it is A or AB that
contributes to the model. When a significant effect is aliased, several approaches
are available to the experimenter to determine the correct model term to which the
effect should be attributed [84, p. 289].
1. Engineering judgement. A first attempt is to use engineering judgement to
justify ignoring some terms in the alias. It may be known from experience
that one of the aliased effects is not important and can be discarded.
2. Ockham’s razor. Consider a 2^(4-1)_IV experiment. It has four main effects that
we shall call A, B, C and D. Suppose that the significant main effects are A,
C and D and the significant aliased interactions are [AC] and [AD], aliased as
follows: [AC] → AC + BD and [AD] → AD + BC. (These chains follow from the
design's defining relation I = ABCD: multiplying AC by ABCD leaves BD, and
multiplying AD by ABCD leaves BC.) Since AC and AD are the interactions
composed only of significant main effects, it is more likely that these are
the significant interactions in the alias chains. Montgomery [84] cites this
as an application of Ockham’s razor: the simplest explanation of the effect is
most likely the correct one.
Failing the application of either of these two approaches, one must augment the
design to de-alias the significant effects.
3. Augment design. A foldover procedure is a methodical and efficient way to
introduce more treatments into a fractional design so that a particular effect
can be de-aliased. The foldover procedure produces double the number of
new treatments for which data must be gathered (once for each replicate).
The augmented design should fold over on those most significant model terms
that the experimenter wishes to de-alias; a sketch of a single-factor foldover
is given below.
Once de-aliasing has been completed, the experimenter can interpret significant
main effects and interactions.
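Where augmentation is needed, the foldover can be expressed directly on the
coded design matrix. The following sketch (illustrative names, not from the
thesis) appends, for a chosen factor, a copy of every existing treatment with
that factor's signs reversed, which for a resolution IV design de-aliases the
two-factor interactions involving that factor:

/** A sketch of a single-factor foldover on a coded design matrix, where
 *  rows are treatments, columns are factors and entries are -1/+1. */
public final class Foldover {

    static double[][] foldOverOn(double[][] design, int factor) {
        int runs = design.length;
        int k = design[0].length;
        double[][] augmented = new double[2 * runs][k];
        for (int i = 0; i < runs; i++) {
            for (int j = 0; j < k; j++) {
                augmented[i][j] = design[i][j];
                // New treatments: same settings, chosen factor reversed.
                augmented[runs + i][j] =
                        (j == factor) ? -design[i][j] : design[i][j];
            }
        }
        return augmented; // double the treatments, as noted above
    }
}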
A.4.6 Interpreting interactions
There are several possibilities when plotting two-factor interactions from a two-way
analysis of variance [29]. These possibilities are illustrated in the following figure.
There are two factors denoted by A and B and these factors were tested at two
levels A1, A2 and three levels B1, B2 and B3 respectively.
Figure A.7: Examples of possible main and interaction effects [29]. The possibilities are numbered 1 to 6.
The interpretation of these possibilities is as follows.
• Example 1: There is a main effect for A represented by the increasing slope.
There is a main effect for B, represented by the vertical distance between
lines. There is no interaction AB since the lines are always parallel.
• Example 2: There is a main effect for A but now there is no main effect for
B since the lines are no longer separated by a vertical distance. There is no
interaction AB either.
• Example 3: There are no effects whatsoever.
• Example 4: There is a main effect for A represented by the increasing slopes.
There is no main effect for B because the vertical distances between levels of B
are reversed at the two levels of A. There is an interaction between A and B.
• Example 5: There is no main effect for A as the slopes at different levels of
B cancel one another out. There is a main effect of B. There is an interaction
AB.
• Example 6: There is a main effect of A. There is a main effect of B. There is
also an interaction AB.
The presence of interactions clearly complicates an analysis because it means
that a main effect cannot be interpreted in isolation. The inability to detect interac-
tions is one of the most important shortcomings of the OFAT approach (Section 2.5
on page 47) and one of the main strengths of the DOE approach.
A.5 Hypothesis testing
Hypothesis testing (sometimes called significance testing) is an objective method
of making comparisons with a knowledge of the risks associated with reaching
the wrong conclusions. A statistical hypothesis is a conjecture about the problem
situation. One may conjecture, for example, that the mean heuristic performances
at two levels 1 and 2 of a factor are equal. This is written as:
\[ H_0 : \mu_1 = \mu_2 \qquad H_1 : \mu_1 \neq \mu_2 \]
The first statement is the null hypothesis and the second statement is the
alternative hypothesis. A random sample of data is taken. A test statistic is
computed. The null hypothesis is rejected if the test statistic falls within a
certain rejection region for the test. It is extremely important to emphasise that hypothesis testing
does not permit us to conclude that we accept the null hypothesis. The correct
conclusion is always either a rejection of the null hypothesis or a failure to reject
the null hypothesis.
The p-value is the probability of obtaining a test statistic at least as extreme
as the one actually observed, computed under the assumption that the null
hypothesis is true. Smaller p-values indicate that the data are inconsistent
with the null hypothesis.
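For the two-mean conjecture above, for example, the test statistic would
typically be the standard two-sample t-statistic

\[ t_0 = \frac{\bar{y}_1 - \bar{y}_2}{S_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}} \]

where S_p is the pooled sample standard deviation and n_1 and n_2 are the sample
sizes; H_0 is rejected when |t_0| exceeds the critical value
t_{\alpha/2, n_1 + n_2 - 2}.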
A.6 Error, Significance, Power and Replicates
Two types of error can be committed when testing hypotheses [84, p. 35]. If
the null hypothesis is rejected when it is actually true, then a Type I Error has
occurred. If the null hypothesis is not rejected when it is false then a Type II Error has occurred. These error probabilities are given special symbols:
• α = P (Type I error) = P (reject H0|H0 true)
• β = P (Type II error) = P (fail to reject H0|H0 false)
In the context of Type II errors, it is more convenient to use the power of a test
where
Power = 1− β = P (reject H0| H0 false)
It is therefore desirable to have a test with a low α and a high power. The
probability of a Type I Error is often called the significance level of a test. The
particular significance level depends on the requirements of the experimenter and,
in a research context, on the conventional acceptable level. Unfortunately, with so
little adaptation of statistical methods to the analysis of heuristics, there are few
guidelines on what value to choose. Norvig cites a value as low as 0.0000001% in
research work at Google [87]. All experiments in this thesis use a level of either 1%
or 5%. The power of a test is usually set to 80% by convention. The reason for this
choice is diminishing returns: it requires an exponentially increasing num-
ber of replicates to increase power beyond about 80% and there is little advantage
to the additional power this confers.
Miles [82] describes the relationship between significance level, effect size, sam-
ple size and power using an analogy with searching.
• Significance Level: This is the probability of thinking we have found some-
thing when it is not really there. It is a measure of how willing we are to risk
a Type I error.
• Effect Size: The size of the effect in the population. The bigger it is, the
easier it will be to find. This is a measure of the practical significance of a
result, preventing us claiming a statistically significant result that has little
consequence [101].
• Sample size: A larger sample size leads to a greater ability to find what we
were looking for. The harder we look, the more likely we are to find it.
The critical point regarding this relationship is that what we are looking for
is always going to be there—it might just be there in such small quantities that
we are not bothered about finding it. Conversely, if we look hard enough, we are
guaranteed to find what we are looking for. Power analysis allows us to make sure
that we have looked reasonably hard enough to find it. A typical experiment design
approach is to agree the significance level and choose an effect size based on prac-
tical experience and experiment goals. Given these constraints, the sample size
is increased until sufficient power is reached. If a response has a high variability
then a larger sample size will be required.
Different statistical tests and different experiment designs involve different power
calculations. These calculations can become quite involved and the details of their
calculation are beyond the scope of this thesis. Power calculations are supplied
with most good quality statistical analysis software. Some are even provided on-
line [76]. Power considerations have had limited exposure in the heuristics field
[28] but play a strong role in this thesis.
A.6.1 Power work up procedure
Sufficient power is achieved with a so-called work-up procedure [36]. This is an
iterative procedure whereby data is collected for a design with a given number of
replicates, power is calculated, and replicates are added if sufficient power was
not achieved. This process repeats until sufficient power is reached. The work-up
procedure is an efficient way to ensure the experiment has enough power without
wasting resources on unnecessary replicates.
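A minimal sketch of the work-up loop follows; the estimatePower helper is a
hypothetical stand-in for the design-specific power calculation that a
statistics package would supply.

public final class WorkUp {

    /** Adds replicates until the design achieves the target power. */
    static int replicatesFor(double targetPower, double alpha,
                             double effectSize) {
        int replicates = 2; // at least two are needed to estimate pure error
        while (estimatePower(replicates, alpha, effectSize) < targetPower) {
            replicates++;   // gather one more replicate and re-check
        }
        return replicates;
    }

    /** Hypothetical placeholder: monotone in replicates and bounded by 1.
     *  A real calculation depends on the design, the significance level
     *  (ignored here) and the response variability. */
    static double estimatePower(int replicates, double alpha,
                                double effectSize) {
        return 1.0 - Math.exp(-replicates * effectSize);
    }

    public static void main(String[] args) {
        System.out.println(replicatesFor(0.80, 0.05, 0.2)); // e.g. 9
    }
}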
B TSPLIB Statistics
This appendix reports statistics and plots of some TSPLIB [102] instances. The in-
stances are symmetric Euclidean instances as described in Section 2.2 on page 30.
These statistics and plots are referenced in the conclusions of Chapter 7.
The table gives some descriptive statistics of the symmetric Euclidean instances.
All instances have approximately the same ratio of standard deviation to mean.
Figure B.2 on the next page to Figure B.4 on page 231 are histograms illustrating
the normalised frequency of normalised edge lengths in several of the symmetric
Euclidean TSP instances. All histograms have a shape that can be represented by
a Log-Normal Distribution.
Instance   Standard Deviation   Mean      Coefficient
Oliver30   21.08                43.93     0.48
kroA100    916.04               1710.70   0.54
kroB100    912.90               1687.54   0.54
kroC100    910.74               1700.55   0.54
kroD100    867.21               1631.10   0.53
kroE100    933.55               1732.15   0.54
eil101     16.35                33.92     0.48
lin105     670.85               1177.35   0.57
pr107      3105.26              5404.24   0.57
pr124      2848.45              5623.35   0.51
bier127    3082.05              4952.47   0.62
ch130      169.98               356.22    0.48
pr136      2945.43              6073.99   0.48
pr144      2813.44              5639.51   0.50
ch150      169.35               359.31    0.47
kroA150    919.03               1717.35   0.54
kroB150    922.40               1711.61   0.54
pr152      3668.43              6914.83   0.53
kroA200    917.36               1701.17   0.54
kroB200    892.36               1664.18   0.54
ts225      3321.76              7080.03   0.47
tsp225     95.21                183.58    0.52
pr226      3708.92              7503.01   0.49
gil262     48.25                101.92    0.47
pr264      2557.95              4248.45   0.60
pr1002     3160.87              6435.61   0.49
vm1084     4149.34              7907.79   0.52
rl1304     3670.93              7190.12   0.51
rl1323     3724.98              7403.58   0.50
nrw1379    543.67               1032.34   0.53
vm1748     4235.17              8548.22   0.50
rl1889     4012.50              7834.80   0.51
pr2392     3125.49              6374.92   0.49

Figure B.1: Some descriptive statistics for the symmetric Euclidean instances in TSPLIB. Instances are presented in order of increasing size. The columns are the standard deviations of edge lengths, the mean edge lengths and the ratio of standard deviation to mean.
Figure B.2: Histogram of normalised frequency of normalised edge lengths of the bier127 TSPLIB instance.
Figure B.3: Histogram of normalised frequency of normalised edge lengths of the Oliver30 TSPLIB instance.
Figure B.4: Histogram of normalised frequency of normalised edge lengths of the pr1002 TSPLIB instance.
C Calculation of Average Lambda Branching Factor
Average lambda branching factor was first introduced as a descriptive statistic for
ACS performance [46] but is now an integral part of trail reinitialisation in the
MMAS daemon actions (Section 2.4.5 on page 39). It is discussed in detail in the
literature [47, p. 87]. The following figure represents its calculation in pseudocode
adapted from JACOTSP.
Broadly, the calculation can be described as follows. For every city in the TSP,
go through the city’s candidate list calculating a branching factor cut-off. Then
count the edges in the city’s candidate list that are above this cut-off value.
Finally, average the counts over every city in the TSP.
If we let the TSP size be n and we let the candidate list length for all cities
be cl, then the complexity of the calculation is given by:

\[ O(n \cdot 2\,cl + n) \approx O(2n) \quad \text{if } cl \ll n \tag{C.1} \]
Clearly the calculation is expensive and this is exacerbated if the candidate list
length approaches the problem size. However, this expense is hidden when CPU
time is not recorded.
The expense of the calculation is mitigated in three ways:
• The branching factor is often not computed at each iteration.
• The complexity is the same as that of construction of one solution by an ant.
• The candidate list length cl is typically very small.
/**
 * Method to calculate the average lambda branching factor.
 */
public double computeAverageBranchingFactor()
{
    /* An array to store the number of branches from each TSP city */
    double[] numBranches = new double[tspSize];
    /* O(tspSize) iterations over the cities */
    for (int aCity = 0; aCity < tspSize; aCity++)
    {
        /* O(cl): one pass over the candidate list for the cut-off */
        final double cutoff = calculateCutOffForLambdaBranchingFactorForACity(aCity);
        /* O(cl): a second pass to count edges above the cut-off */
        numBranches[aCity] = countEdgesAboveCutOffFromCity(aCity, cutoff);
    }
    /* O(tspSize): average over the whole array, not a single element */
    final double averageNumberOfBranches = calculateAverageOf(numBranches);
    return averageNumberOfBranches / (tspSize * 2);
}
Figure C.1: Pseudocode for the calculation of the average lambda branching factor.
D Example OFAT Analysis
This appendix reports a One-Factor-at-A-Time (OFAT) approach to tuning a single
parameter from the ACS heuristic (Section 2.4 on page 34). The particular param-
eter is alpha, which plays a role in an artificial ant’s solution building decisions.
D.1 Motivation
The ACS screening study of Chapter 8 and the ACS tuning study of Chapter 9 both
reported that alpha did not have a statistically significant effect on either solution
quality or solution time. This is an interesting and important result because it is
intuitively unexpected and contradicts the accepted view of the importance of alpha
[96]. The methods and experiment designs introduced in this thesis are new to the
ACO field and the metaheuristics field in general. It is of interest, therefore, to
attempt the same analysis, in so far as is possible, using a more familiar empirical
technique called the One-Factor-at-A-Time (OFAT) approach. An OFAT approach
involves taking one of the algorithm tuning parameters and allowing it to vary
while all other tuning parameters are held fixed at some other values. The free
parameter is tuned until performance is maximised. The procedure then moves
on to another of the tuning parameters, allowing it to vary while all others are
held fixed. This study applies an OFAT analysis to the alpha tuning parameter.
The aim of the study is to determine whether an OFAT approach will lead to a
different conclusion from the DOE approach. It is not an endorsement of the OFAT
approach, the demerits of which were highlighted in Section 2.5 on page 47. In
keeping with the thesis’ strong emphasis on experimental rigour, the OFAT analysis
is conducted with a designed experiment and supporting statistical analyses.
D.2 Method
D.2.1 Response Variables
Two responses were measured, relative error from a known optimum and elapsed
solution time, as per Section 6.7 on page 125.
D.2.2 Factors, Levels and ranges
Design Factors
Being an OFAT analysis, there was one design factor. This factor was the alpha
tuning parameter for the ACS algorithm, described in Section 2.4 on page 34.
Alpha was set at the following five levels: 1, 3, 5, 7, 12.
Held-Constant Factors
The held constant factors are as per Section 6.7.6 on page 127. There were ad-
ditional held-constant factors required of the OFAT approach. All other tuning
parameters were fixed at 6 different settings. These settings came from the desir-
ability optimisation results from the full ACS response surface model, given in the
following figure.
Size  StDev  beta  antsFraction  nnFraction  q0    rho   rhoLocal  solutionConstruction  antPlacement  pheromoneUpdate  Time  Relative Error  Desirability
400   10     4     1.00          1.00        0.99  0.11  0.81      parallel              random        bestSoFar        2.42  0.46            0.92
400   40     6     2.19          1.16        0.97  0.99  0.03      parallel              random        bestOfItera      2.83  1.33            0.82
400   70     11    1.61          20.00       0.98  0.01  0.07      parallel              random        bestOfItera      4.92  2.59            0.73
500   10     3     1.13          1.00        0.99  0.86  0.01      parallel              same          bestOfItera      4.88  0.38            0.88
500   40     7     1.00          1.00        0.99  0.99  0.48      parallel              random        bestSoFar        4.25  1.35            0.80
500   70     10    1.04          19.78       0.99  0.05  0.01      parallel              same          bestOfItera      9.24  2.54            0.70

Figure D.1: Fixed parameter settings for the OFAT analysis. These are reproduced from the results of the desirability optimisation of the ACS full response surface model. The response predictions from the tuning have also been included.
Note that in practice, one may not have access to these tuned parameter set-
tings. A researcher conducting an OFAT analysis without any prior knowledge
would have no guidelines on the values at which the other parameters should be
fixed.
D.2.3 Instances
All TSP instances were of the symmetric type and were created as per Section 6.7.1
on page 126. The TSP problem instances ranged in size from 400 cities to 500
cities with cost matrix standard deviation ranging from 10 to 70. All instances had
a mean of 100. The same instances were used for each replicate of a design point.
For each OFAT analysis, a single problem instance was used. These were the same
instances used in the ACS tuning case study.
D.2.4 Experiment design, power and replicates
The experiment design for each of the OFAT analyses is a single factor 5-level
factorial. All 5 treatments were replicated 10 times. A work-up procedure was not
needed in this case.
The next figure gives the descriptive statistics for the collected data and the
actual detectable effect size for the quality and time responses with a significance
level of 5% and a power of 80%.
Relative Error:

Problem size  Problem StDev  Range  Min   Max    Mean  Std. Dev.
400           10             0.30   0.74  1.04   0.89  0.07
400           40             1.72   3.29  5.01   4.14  0.43
400           70             6.30   4.17  10.47  6.73  1.73
500           10             0.35   0.62  0.97   0.80  0.09
500           40             1.55   3.25  4.79   3.91  0.36
500           70             4.52   6.03  10.55  8.11  1.25

Time:

Problem size  Problem StDev  Range  Min   Max    Mean   Std. Dev.
400           10             3.67   1.52  5.19   2.95   0.96
400           40             6.05   2.50  8.55   4.13   1.30
400           70             13.15  4.38  17.53  8.20   3.32
500           10             4.03   1.97  6.00   3.15   0.89
500           40             7.97   2.60  10.56  4.97   1.58
500           70             31.83  4.63  36.45  14.08  7.67

Figure D.2: Descriptive statistics for the six OFAT analyses. Relative Error is reported above and Time below. There are six combinations of problem size and problem standard deviation.
D.2.5 Performing the experiment
Responses were measured at a 250 iteration stagnation stopping criterion.
Available computational resources necessitated running experiments across a
variety of similar machines. Runs were executed in a randomised order across
these machines to counteract any uncontrollable nuisance factors. The experi-
mental machines are benchmarked as per Section 5.3 on page 100.
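The randomisation of the run order can be illustrated with a short sketch
(illustrative code, not the thesis’ experimental harness), pairing the five
alpha levels with ten replicates each and shuffling the resulting fifty runs:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class RunOrder {
    public static void main(String[] args) {
        double[] alphaLevels = {1, 3, 5, 7, 12};
        int replicates = 10;
        List<Double> runs = new ArrayList<>();
        for (double alpha : alphaLevels) {
            for (int r = 0; r < replicates; r++) {
                runs.add(alpha);
            }
        }
        // Executing runs in a random order across machines counteracts
        // uncontrollable, time-dependent nuisance factors.
        Collections.shuffle(runs);
        System.out.println(runs); // the randomised execution order
    }
}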
D.3 Analysis
D.3.1 ANOVA
To make the data amenable to statistical analysis, a transformation of the re-
sponses was required for some of the analyses. These transformations were a
log10, inverse or inverse square root.
No outliers were detected. The models passed the usual ANOVA diagnostics for
the ANOVA assumptions of model fit, normality, constant variance, time-dependent
effects, and leverage (Section A.4.2 on page 221).
D.4 Results
The next figure summarises each of the OFAT analyses for relative error and time.
It reports the detectable effect size for a significance threshold of 5% and a power
of 80% and the statistical significance result.
                                                   Relative Error            Time
Problem size  Problem StDev  St Devs at 80% power  Effect size  ANOVA sig?   Effect size  ANOVA sig?
400           10             2                     0.14         No           1.91         No
400           40             0.35                  0.15         Yes          0.45         Yes
400           70             0.35                  0.61         Yes          1.16         Yes
500           10             0.4                   0.04         Yes          0.35         Yes
500           40             2                     0.71         No           3.16         Yes
500           70             2                     2.50         Yes          15.34        No

Figure D.3: Summary of results from the six OFAT analyses. The figure gives the detectable effect size in terms of both the number of standard deviations and the actual response units for the relative error response and the time response.
Some of the analyses showed a statistically significant effect for alpha on the
responses of solution quality and solution time. Note that the detectable effect
sizes are small relative to those of the screening and tuning case studies. This
is due to the lower variability in the responses when varying only a single tuning
parameter.
The following figures show the plots of the relative error response for the OFAT
analyses. Each plot shows the five levels of alpha on the horizontal axis and in-
cludes a 95% Fisher’s Least Significant Difference interval [84, p. 96].
Alpha had a statistically significant effect on solution quality in four out of
the six experiments. An examination of the range over which average relative error
varied in these significant cases shows that the largest difference was approximately
3.9% for problems of size 400 and standard deviation 70 and approximately 0.1%
for problems of size 500 and standard deviation 10.
All analyses except that of size 400 and standard deviation of 70 recommended
an alpha value of 1 to minimise relative error.
D.5 Conclusions and discussion
We draw the following conclusion from these results. For ACS, with all tuning
parameters except alpha set to the values in Figure D.1 on page 236:
• alpha has a statistically significant effect on solution quality for instances
with a size and standard deviation combination of 400-40, 400-70, 500-10
and 500-70.
Figure D.4: Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 10. Alpha was not statistically significant in this case.

Figure D.5: Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 40. Alpha was statistically significant in this case.
Figure D.6: Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 70. Alpha was statistically significant in this case.

Figure D.7: Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 10. Alpha was statistically significant in this case.
Figure D.8: Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 40. Alpha was not statistically significant in this case.

Figure D.9: Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 70. Alpha was statistically significant in this case.
• Apart from one anomalous result, an alpha value of 1 is recommended to
minimise relative error.
The first of these conclusions appears to contradict the results of Chapters 8
and 9. These concluded that alpha had a relatively unimportant effect on solution
quality. However, there are several important differences between the previous
experiments and the current OFAT analysis.
Firstly, the fractional factorial and response surface designs experimented with
many more factors. This resulted in a larger variability in the response, as listed in
the descriptive statistics of Figure 9.1 on page 155, for example. The OFAT anal-
ysis, varying only alpha and conducted on a single instance, had a much smaller
variance in its response measurements, as listed in Figure D.2 on page 237. The
consequence is that the OFAT analysis could detect much smaller effects for a
given significance level and power than the fractional factorial screening and the
response surface. This does not mean that OFAT is a better approach than DOE.
The OFAT conclusions are more accurate in their context. This context is the par-
ticular fixed values of the other parameter settings and a single problem instance.
As discussed in Section 2.5, the OFAT analysis tells us nothing about interac-
tions and is inefficient in comparison to DOE approaches in terms of the infor-
mation gained from a given number of experiments. Most importantly, for some
response surface shapes, an incorrect OFAT starting point can lead to incorrect
tuning recommendations. Unfortunately, since the response surface shape cannot
be deduced with OFAT, the experimenter does not know if these incorrect tuning
recommendations are being made. The only safe option in this case is to use a
DOE approach.
References
[1] NIST/SEMATECH Engineering Statistics Handbook, 2006.
[2] ADENSO-DIAZ, B., AND LAGUNA, M. Fine-Tuning of Algorithms Using Frac-tional Experimental Designs and Local Search. Operations Research 54, 1(2006), 99–114.
[3] AMINI, M. M., AND RACER, M. A Rigorous Computational Comparison ofAlternative Solution Methods for the Generalized Assignment Problem. Man-agement Science 40, 7 (1994), 868–890.
[4] ANDERSON, V. L., AND MCLEAN, R. A. Design of experiments: a realisticapproach. M. Dekker Inc., New York, 1974.
[5] APPLEGATE, D., BIXBY, R., CHVATAL, V., AND COOK, W. Implementingthe Dantzig-Fulkerson-Johnson algorithm for large traveling salesman prob-lems. Mathematical Programming Series B 97, 1-2 (2003), 91–153.
[6] BANG-JENSEN, J., CHIARANDINI, M., GOEGEBEUR, Y., AND JØRGENSEN, B.Mixed Models for the Analysis of Local Search Components. In EngineeringStochastic Local Search Algorithms. Designing, Implementing and AnalyzingEffective Heuristics, T. Stutzle, M. Birattari, and H. Hoos, Eds., vol. 4638.Springer, Berlin / Heidelberg, 2007, pp. 91–105.
[7] BARR, R. S., GOLDEN, B. L., KELLY, J. P., RESENDE, M. G. C., AND STEW-ART, W. R. Designing and Reporting on Computational Experiments withHeuristic Methods. Journal of Heuristics 1 (1995), 9–32.
[8] BARTZ-BEIELSTEIN, T. Experimental Research in Evolutionary Computation.The New Experimentalism. Natural Computing Series. Springer, 2006.
[9] BARTZ-BEIELSTEIN, T., AND PREUSS, M. Experimental Research in Evolu-tionary Computation. Tutorial at the genetic and evolutionary computationconference, June 2005.
[10] BAUTISTA, J., AND PEREIRA, J. Ant Algorithms for Assembly Line Balanc-ing. In Proceedings of the Third International Workshop on Ant Algorithms,M. Dorigo, G. D. Caro, and M. Sampels, Eds., vol. 2463 of Lecture Notes inComputer Science. Springer, 2002, p. 65.
[11] BENTLEY, J. L. Fast algorithms for the geometric traveling salesman prob-lem. ORSA Journal on Computing 4 (1992), 387–411.
[12] BIRATTARI, M. The Problem of Tuning Metaheuristics. Phd, Universite Librede Bruxelles, 2006.
[13] BIRATTARI, M., AND DORIGO, M. How to assess and report the performanceof a stochastic algorithm on a benchmark problem: Mean or best result on anumber of runs? Optimization Letters (2006).
[14] BIRATTARI, M., STUTZLE, T., PAQUETE, L., AND VARRENTRAPP, K. A RacingAlgorithm for Configuring Metaheuristics. In GECCO 2002: Proceedings ofthe Genetic and Evolutionary Computation Conference, New York, USA, W. B.Langdon, E. Cant-Paz, K. E. Mathias, R. Roy, D. Davis, R. Poli, K. Balakr-ishnan, V. Honavar, G. Rudolph, J. Wegener, L. Bull, M. A. Potter, A. C.
Schultz, J. F. Miller, E. K. Burke, and N. Jonoska, Eds. Morgan Kaufmann,2002, pp. 11–18.
[15] BLUM, C. Ant colony optimization: Introduction and recent trends. Physicsof Life Reviews 2, 4 (2005), 353–373.
[16] BLUM, C., AND ROLI, A. Metaheuristics in combinatorial optimization:Overview and conceptual comparison. ACM Computing Surveys 35, 3 (2003),268–308.
[17] BLUM, C., AND SAMPELS, M. An ant colony optimization algorithm for shopscheduling problems. Journal of Mathematical Modelling and Algorithms 3, 3(2004), 285–308.
[18] BOOCH, G. Object-oriented Analysis and Design with Applications, second ed.The Benjamin/Cummings Publishing Company, Inc., 1994.
[19] BOTEE, H. M., AND BONABEAU, E. Evolving Ant Colony Optimization. Ad-vances in Complex Systems 1 (1998), 149–159.
[20] BOX, G. E. P. Sequential experimentation and sequential assembly of de-signs. Quality Engineering 5, 2 (1992), 321–330.
[21] BOX, G. E. P., AND COX, D. R. An Analysis of Transformations. Journal ofthe Royal Statistical Society Series B (Methodological) 26, 2 (1964), 211–252.
[22] BREEDAM, A. V. Improvement Heuristics for the Vehicle Routing ProblemBased on Simulated Annealing. European Journal of Operations Research86, 3 (1995), 480–490.
[23] BULL, J. M., SMITH, L. A., BALL, C., POTTAGE, L., AND FREEMAN, R. Bench-marking Java against C and Fortran for scientific applications. Concurrencyand Computation: Practice and Experience 15, 3-5 (2003), 417–430.
[24] BULLNHEIMER, B., HARTL, R. F., AND STRAUSS, C. A New Rank Based Ver-sion of the Ant System: A Computational Study. Central European Journalfor Operations Research and Economics 7, 1 (1999), 25–38.
[25] BULLNHEIMER, B., HARTL, R. F., AND STRAUSS, C. An Improved Ant SystemAlgorithm for the Vehicle Routing Problem. Annals of Operations Research89 (1999), 319–328.
[26] CHEESEMAN, P., KANEFSKY, B., AND TAYLOR, W. M. Where the Really HardProblems Are. In Proceedings of the Twelfth International Conference on Ar-tificial Intelligence, vol. 1. Morgan Kaufmann Publishers, Inc., USA, 1991,pp. 331–337.
[27] CHIARANDINI, M., PAQUETE, L., PREUSS, M., AND RIDGE, E. Experimentson Metaheuristics: Methodological Overview and Open Issues. Tech. Rep.IMADA-PP-2007-04, Institut for Matematik og Datalogi, University of South-ern Denmark, 20 March.
[28] COFFIN, M., AND SALTZMAN, M. J. Statistical Analysis of ComputationalTests of Algorithms and Heuristics. INFORMS Journal on Computing 12, 1(2000), 24–44.
[29] COHEN, P. R. Empirical Methods for Artificial Intelligence. The MIT Press,Cambridge, Massachusetts, 1995.
[30] COLORNI, A., DORIGO, M., MAFFIOLI, F., MANIEZZO, V., RIGHINI, G., ANDTRUBIAN, M. Heuristics from Nature for Hard Combinatorial Problems. In-ternational Transactions in Operational Research 3, 1 (1996), 1–21.
[31] CORDÓN, O., FERNANDEZ, I., HERRERA, F., AND MORENO, L. A New ACO Model Integrating Evolutionary Computation Concepts: The Best-Worst Ant System. In Proceedings of ANTS'2000. From Ant Colonies to Artificial Ants: Second International Workshop on Ant Algorithms, Brussels, Belgium, September 7-9, 2000. 2000, pp. 22–29.
[32] COSTA, D., AND HERTZ, A. Ants Can Colour Graphs. The Journal of theOperational Research Society 48, 3 (1997), 295–305.
[33] COY, S., GOLDEN, B., RUNGER, G., AND WASIL, E. Using Experimental De-sign to Find Effective Parameter Settings for Heuristics. Journal of Heuristics7, 1 (2001), 77–97.
[34] CROWDER, H. P., DEMBO, R. S., AND MULVEY, J. M. Reporting Computa-tional Experiments in Mathematical Programming. Mathematical Program-ming 15 (1978), 316–329.
[35] CROWDER, H. P., DEMBO, R. S., AND MULVEY, J. M. On Reporting Com-putational Experiments with Mathematical Software. ACM Transactions onMathematical Software 5, 2 (1979), 193–203.
[36] CZARN, A., MACNISH, C., VIJAYAN, K., TURLACH, B., AND GUPTA, R. Sta-tistical Exploratory Analysis of Genetic Algorithms. IEEE Transactions onEvolutionary Computation 8, 4 (2004), 405–421.
[37] CZITROM, V. One-Factor-at-a-Time versus Designed Experiments. The Amer-ican Statistician 53, 2 (1999), 126–131.
[38] DEN BESTEN, M. L. Simple Metaheuristics for Scheduling: An empirical inves-tigation into the application of iterated local search to deterministic schedulingproblems with tardiness penalties. Phd, Germany.
[39] DEN BESTEN, M. L., STUTZLE, T., AND DORIGO, M. Ant colony optimization for the total weighted tardiness problem. In Proceedings of PPSN-VI, sixth international conference on parallel problem solving from nature, M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J. J. Merelo, and H.-P. Schwefel, Eds., vol. 1917 of Lecture Notes in Computer Science. Springer, Berlin, 2000, pp. 611–620.
[40] DERRINGER, G., AND SUICH, R. Simultaneous Optimization of Several Re-sponse Variables. Journal of Quality Technology 12, 4 (1980), 214–219.
[41] DOERR, B., NEUMANN, F., SUDHOLT, D., AND WITT, C. On the RuntimeAnalysis of the 1-ANT ACO Algorithm. In Proceedings of the Genetic andEvolutionary Computation Conference, vol. 1. ACM, 2007, pp. 33–40.
[42] DORIGO, M., AND BLUM, C. Ant colony optimization theory: A survey. Theo-retical Computer Science 344, 2-3 (2005), 243–278.
[43] DORIGO, M., AND CARO, G. D. The Ant Colony Optimization Meta-Heuristic.In New Ideas in Optimization, D. Corne, M. Dorigo, F. Glover, D. Dasgupta,P. Moscato, R. Poli, and K. V. Price, Eds., Mcgraw-Hill’S Advanced Topics InComputer Science. McGraw-Hill, 1999, pp. 11–32.
[44] DORIGO, M., AND COLORNI, A. The Ant System: Optimization by a colonyof cooperating agents. IEEE Transactions on Systems, Man, and CyberneticsPart B 26, 1 (1996), 1–13.
[45] DORIGO, M., AND GAMBARDELLA, L. M. Ant Colonies for the Travelling Sales-man Problem. BioSystems 43, 2 (1997), 73–81.
[46] DORIGO, M., AND GAMBARDELLA, L. M. Ant Colony System: A CooperativeLearning Approach to the Traveling Salesman Problem. IEEE Transactionson Evolutionary Computation 1, 1 (1997), 53–66.
[47] DORIGO, M., AND STUTZLE, T. Ant Colony Optimization. The MIT Press,Massachusetts, USA, 2004.
[48] EIBEN, A., AND JELASITY, M. A critical note on experimental researchmethodology in EC. In Proceedings of the 2002 IEEE Congress on Evolu-tionary Computation. IEEE, 2002, pp. 582–587.
[49] FISCHER, T., STUTZLE, T., HOOS, H., AND MERZ, P. An Analysis Of TheHardness Of TSP Instances For Two High Performance Algorithms. In Pro-ceedings of the Sixth Metaheuristics International Conference. 2005, pp. 361–367.
[50] GAERTNER, D., AND CLARK, K. L. On Optimal Parameters for Ant ColonyOptimization Algorithms. In Proceedings of the 2005 International Conferenceon Artificial Intelligence, vol. 1. CSREA Press, 2005, pp. 83–89.
[51] GAGNE, C., PRICE, W. L., AND GRAVEL, M. Comparing an ACO algo-rithm with other heuristics for the single machine scheduling problem withsequence-dependent setup times. Journal of the Operational Research Soci-ety 53 (2002), 895–906.
[52] GAMBARDELLA, L. M., AND DORIGO, M. HAS-SOP: hybrid Ant System for theSequential Ordering Problem. Tech. Rep. IDSIA-11-97, IDSIA, 19 April.
[53] GAMBARDELLA, L. M., AND DORIGO, M. HAS-SOP: An Ant Colony SystemHybridized with a New Local Search for the Sequential Ordering Problem.INFORMS Journal on Computing 12, 3 (2000), 237–255.
[54] GANDIBLEUX, X., DELORME, X., AND T’KINDT, V. An Ant Colony Optimi-sation Algorithm for the Set Packing Problem. In Proceedings of the FourthInternational Workshop on Ant Colony Optimization and Swarm Intelligence,M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, andT. Stutzle, Eds., vol. 3172 of Lecture Notes in Computer Science. 2004, pp. 49–60.
[55] GAREY, M. R., AND JOHNSON, D. S. Computers and Intractability : A Guideto the Theory of NP-Completeness. Books in the Mathematical Sciences. W.H. Freeman, 1979.
[56] GLOVER, F. Tabu Search - Part I. ORSA Journal on Computing 1, 3 (1989),190–206.
[57] GOLDBERG, D. E. Genetic algorithms in search, optimization, and machinelearning. Addison-Wesley Publishing Company, Inc., 1989.
[58] GOLDWASSER, M., JOHNSON, D. S., AND MCGEOCH, C. C., Eds. Proceedingsof the Fifth and Sixth DIMACS Implementation Challenges. American Mathe-matical Society, 2002.
[59] GOTTLIEB, J., PUCHTA, M., AND SOLNON, C. A Study of Greedy, LocalSearch, and Ant Colony Optimization Approaches for Car Sequencing Prob-lems. In Proceedings of EvoWorkshops 2003: Applications of EvolutionaryComputing, S. Cagnoni, J. R. Cardalda, D. Corne, J. Gottlieb, A. Guillot,E. Hart, C. Johnson, E. Marchiori, J.-A. Meyer, M. Middendorf, and G. Raidl,Eds., vol. 2611 of Lecture Notes in Computer Science. Springer, Berlin, 2003,pp. 246–257.
[60] GREENBERG, H. Computational Testing: Why, how and how much? ORSAJournal on Computing 2, 1 (1990), 94–97.
[61] GUNTSCH, M., AND MIDDENDORF, M. Pheromone Modification Strategies forAnt Algorithms Applied to Dynamic TSP. In Proceedings of EvoWorkshops2001: Applications of Evolutionary Computing, E. J. W. Boers, J. Gottlieb,P. L. Lanzi, R. E. Smith, S. Cagnoni, E. Hart, G. R. Raidl, and H. Tijink,Eds., vol. 2037 of Lecture Notes in Computer Science. Springer, Berlin, 2001,p. 213.
[62] GUNTSCH, M., AND MIDDENDORF, M. Applying Population Based ACO to Dynamic Optimization Problems. In Ant Algorithms: Third International Workshop, M. Dorigo, G. D. Caro, and M. Sampels, Eds., vol. 2463 of Lecture Notes in Computer Science. Springer, 2002, p. 111.
[63] HELSGAUN, K. An effective implementation of the Lin-Kernighan travelingsalesman heuristic. European Journal of Operational Research 126, 1 (2000),106–130.
[64] HOOKER, J. N. Needed: An Empirical Science of Algorithms. OperationsResearch 42, 2 (1994), 201–212.
[65] HOOKER, J. N. Testing heuristics: We have it all wrong. Journal of Heuristics1 (1996), 33–42.
[66] HOOS, H., AND STUTZLE, T. Stochastic Local Search, Foundations and Appli-cations. Morgan Kaufmann, 2004.
[67] HYBARGER, J. The Ten Most Common Designed Experiment Mistakes. StatTeaser (December 2006).
[68] JAIN, R. The art of computer systems performance analysis: techniques forexperimental design, measurement, simulation and modeling. John Wiley andSons Inc., 1991.
[69] JOHNSON, D. S. A Theoretician’s Guide to the Experimental Analysis of Al-gorithms. In Proceedings of the Fifth and Sixth DIMACS Implementation Chal-lenges, M. Goldwasser, D. S. Johnson, and C. C. McGeoch, Eds. AmericanMathematical Society, 2002, pp. 215–250.
[70] JOHNSON, D. S., AND PAPADIMITRIOU, C. H. Computational Complexity. InThe Traveling Salesman Problem, E. L. Lawler, J. K. Lenstra, A. H. G. R.Kan, and D. B. Shmoys, Eds., Wiley Series in Discrete Mathematics andOptimization. John Wiley and Sons, 1995, pp. 37–85.
[71] KAPTCHUK, T. J. Effect of interpretive bias on research evidence. BritishMedical Journal 326 (2003), 1453–1455.
[72] KERIEVSKY, J. Refactoring to Patterns. The Addison-Wesley Signature Series.Addison-Wesley, 2005.
[73] KIRKPATRICK, S., GELATT, C. D., AND VECCHI, M. P. Optimization by Simu-lated Annealing. Science 220, 4598 (1983), 671–680.
[74] KRAMER, O., GLOGER, B., AND GOEBELS, A. An experimental analysis ofevolution strategies and particle swarm optimisers using design of experi-ments. In Proceedings of the Genetic and Evolutionary Computation Confer-ence. ACM, 2007, pp. 674–681.
[75] LAWLER, E. L., LENSTRA, J. K., KAN, A. H. G. R., AND SHMOYS, D. B., Eds.The Traveling Salesman Problem - A Guided Tour of Combinatorial Optimiza-tion. Wiley Series in Discrete Mathematics and Optimization. John Wiley andSons, New York, USA.
[76] LENTH, R. V. Java Applets for Power and Sample Size. 2006.
[77] MANIEZZO, V., AND COLORNI, A. The Ant System Applied to the QuadraticAssignment Problem. IEEE Transactions on Knowledge and Data Engineering11, 5 (1999), 769–778.
[78] MARON, O., AND MOORE, A. Hoeffding races: Accelerating model selectionsearch for classification and function approximation. Advances in NeuralInformation Processing Systems 6 (1994), 59–66.
[79] MCGEOCH, C. C. Toward an experimental method for algorithm simulation.INFORMS Journal on Computing 8, 1 (1996), 1–15.
[80] MERKLE, D., MIDDENDORF, M., AND SCHMECK, H. Ant colony optimizationfor resource-constrained project scheduling. IEEE Transactions on Evolution-ary Computation 6, 4 (2002), 333–346.
[81] MICHELS, R., AND MIDDENDORF, M. An Ant System for the Shortest CommonSupersequence Problem. In New Ideas in Optimization, D. Corne, M. Dorigo,and F. Glover, Eds. McGraw-Hill, 1999, pp. 51–61.
[82] MILES, J. Getting the Sample Size Right: A Brief Introduction to PowerAnalysis, 2007.
[83] MITCHELL, M., AND TAYLOR, C. E. Evolutionary Computation: An Overview.Annual Review of Ecology and Systematics 20 (1999), 593–616.
[84] MONTGOMERY, D. C. Design and Analysis of Experiments, 6 ed. John Wileyand Sons Inc, 2005.
[85] MYERS, R. H., AND MONTGOMERY, D. C. Response Surface Methodology.Process and Product Optimization Using Designed Experiments. Wiley Seriesin Probability and Statistics. John Wiley and Sons Inc., 1995.
[86] NEUMANN, F., AND WITT, C. Runtime Analysis of a Simple Ant Colony Op-timization Algorithm. In Theory of Evolutionary Algorithms, D. V. Arnold,T. Jansen, M. D. Vose, and J. E. Rowe, Eds., Dagstuhl Seminar Proceed-ings. Internationales Begegnungs- und Forschungszentrum fuer Informatik(IBFI), Schloss Dagstuhl, Germany, Dagstuhl, Germany, 2006.
[87] NORVIG, P. Mistakes in Experimental Design and Interpretation, 2007.
[88] NOWE, A., VERBEECK, K., AND VRANCX, P. Multi-type Ant Colony: The EdgeDisjoint Paths Problem. In Proceedings of the Fourth International Workshopon Ant Colony, Optimization and Swarm Intelligence, M. Dorigo, M. Birattari,C. Blum, L. M. Gambardella, F. Mondada, and T. Stutzle, Eds., vol. 3172 ofLecture Notes in Computer Science. Springer, 2004, pp. 202–213.
[89] OSTLE, B. Statistics in Research, 2nd ed. Iowa State University Press, 1963.
[90] PAQUETE, L., CHIARANDINI, M., AND BASSO, D. Proceedings of the Workshopon Empirical Methods for the Analysis of Algorithms. In International Con-ference on Parallel Problem Solving From Nature (Reykjavik, Iceland, 2006).
[91] PARK, M.-W., AND KIM, Y.-D. A systematic procedure for setting parametersin simulated annealing algorithms. Computers and Operations Research 25,3 (1998), 207–217.
[92] PARK, S. K., AND MILLER, K. W. Random number generators: good ones arehard to find. Communications of the ACM 31, 10 (1988), 1192–1201.
[93] PARPINELLI, R., LOPES, H., AND FREITAS, A. Data mining with an ant colonyoptimization algorithm. IEEE Transactions on Evolutionary Computation 6(2002), 321–332.
[94] PARSONS, R., AND JOHNSON, M. A Case Study in Experimental Design Ap-plied to Genetic Algorithms with Applications to DNA Sequence Assembly.American Journal of Mathematical and Management Sciences 17, 3 (1997),369–396.
[95] PEACE, G. S. Taguchi Methods: A Hands-On Approach. Addison-Wesley,1993.
[96] PELLEGRINI, P., FAVARETTO, D., AND MORETTI, E. On Max-Min Ant System’sparameters. In Fifth International Workshop on Ant Colony Optimization andSwarm Intelligence, vol. 4150 of Lecture Notes in Computer Science. SpringerBerlin, 2006, pp. 203–214.
[97] PLANCK, M. Scientific autobiography and other papers. Williams and Norgate,London, 1950.
[98] PRESS, W. H., FLANNERY, B. P., TEUKOLSKY, S. A., AND VETTERLING, W. T.Numerical Recipes in Pascal: the art of scientific computing. Cambridge Uni-versity Press, 1989.
[99] PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T., AND FLANNERY, B. P.Numerical Recipes in C: the art of scientific computing. Cambridge UniversityPress, Cambridge, 1992.
[100] RANDALL, M. Near Parameter Free Ant Colony Optimisation. In Proceedingsof the Fourth International Workshop on Ant Colony, Optimization and SwarmIntelligence, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mon-dada, and T. Stutzle, Eds., vol. 3172 of Lecture Notes in Computer Science.Springer, Berlin, 2004, pp. 374–381.
[101] RARDIN, R. L., AND UZSOY, R. Experimental Evaluation of Heuristic Opti-mization Algorithms: A Tutorial. Journal of Heuristics 7 (2001), 261–304.
[102] REINELT, G. TSPLIB - A traveling salesman problem library. ORSA Journalof Computing 3 (1991), 376–384.
[103] RIDGE, E., AND CURRY, E. A Roadmap of Nature-Inspired Systems Researchand Development. Multi-Agent and Grid Systems 3, 1 (2007).
[104] RIDGE, E., AND KUDENKO, D. Sequential Experiment Designs for Screening and Tuning Parameters of Stochastic Heuristics. In Workshop on Empirical Methods for the Analysis of Algorithms at the Ninth International Conference on Parallel Problem Solving from Nature, L. Paquete, M. Chiarandini, and D. Basso, Eds. 2006, pp. 27–34.
[105] RIDGE, E., AND KUDENKO, D. An Analysis of Problem Difficulty for a Class of Optimisation Heuristics. In Proceedings of the Seventh European Conference on Evolutionary Computation in Combinatorial Optimisation, C. Cotta and J. van Hemert, Eds., vol. 4446 of Lecture Notes in Computer Science. Springer-Verlag, 2007, pp. 198–209.
[106] RIDGE, E., AND KUDENKO, D. Analyzing Heuristic Performance with Response Surface Models: Prediction, Optimization and Robustness. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2007, pp. 150–157.
[107] RIDGE, E., AND KUDENKO, D. Screening the Parameters Affecting Heuristic Performance. Technical Report YCS 415 (www.cs.york.ac.uk/ftpdir/reports/index.php), The Department of Computer Science, The University of York, April 2007.
[108] RIDGE, E., AND KUDENKO, D. Screening the Parameters Affecting Heuristic Performance. In Proceedings of the Genetic and Evolutionary Computation Conference, D. Thierens, H.-G. Beyer, M. Birattari, J. Bongard, J. Branke, J. A. Clark, D. Cliff, C. B. Congdon, K. Deb, B. Doerr, T. Kovacs, S. Kumar, J. F. Miller, J. Moore, F. Neumann, M. Pelikan, R. Poli, K. Sastry, K. O. Stanley, T. Stutzle, R. A. Watson, and I. Wegener, Eds., vol. 1. ACM, 2007.
[109] RIDGE, E., AND KUDENKO, D. Tuning the Performance of the MMAS Heuristic. In Engineering Stochastic Local Search Algorithms: Designing, Implementing and Analyzing Effective Heuristics, T. Stutzle and M. Birattari, Eds., vol. 4638 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, 2007, pp. 46–60.
[110] RIDGE, E., AND KUDENKO, D. Determining whether a problem characteristic affects heuristic performance: a rigorous Design of Experiments approach. In Recent Advances in Evolutionary Computation for Combinatorial Optimization, Studies in Computational Intelligence. Springer, 2008.
[111] RIDGE, E., KUDENKO, D., AND KAZAKOV, D. A Study of Concurrency in the Ant Colony System Algorithm. In Proceedings of the IEEE Congress on Evolutionary Computation. 2006, pp. 1662–1669.
[112] SCOTT, L. DOE Strategies: An Overview of the Methodology and Concepts,2006.
[113] SHMYGELSKA, A., AND HOOS, H. An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics 6, 1 (2005), 30.
[114] SILVA, R. M. A., AND RAMALHO, G. L. Going the Extra Mile in Ant Colony Optimization. In Proceedings of the Fourth Metaheuristics International Conference. 2001, pp. 361–366.
[115] SOCHA, K. The Influence Of Run-time Limits On Choosing Ant System Parameters. In Proceedings of the Genetic and Evolutionary Computation Conference, E. Cantu-Paz, J. A. Foster, K. Deb, L. Davis, R. Roy, U.-M. O'Reilly, H.-G. Beyer, R. K. Standish, G. Kendall, S. W. Wilson, M. Harman, J. Wegener, D. Dasgupta, M. A. Potter, A. C. Schultz, K. A. Dowsland, N. Jonoska, and J. F. Miller, Eds., vol. 2723. Springer, 2003, pp. 49–60.
[116] SOLNON, C. Ants can solve constraint satisfaction problems. IEEE Transactions on Evolutionary Computation 6, 4 (2002), 347–357.
[117] STUTZLE, T. Local Search Algorithms for Combinatorial Problems: Analysis, Algorithms and New Applications. PhD thesis, TU Darmstadt, 1998.
[118] STUTZLE, T., AND HOOS, H. H. Max-Min Ant System. Future Generation Computer Systems 16, 8 (2000), 889–914.
[119] VAN HEMERT, J. I. Property Analysis of Symmetric Travelling Salesman Problem Instances Acquired Through Evolution. In Proceedings of the Fifth Conference on Evolutionary Computation in Combinatorial Optimization, G. R. Raidl and J. Gottlieb, Eds., vol. 3448. Springer-Verlag, Berlin, 2005, pp. 122–131.
[120] VOSS, S., MARTELLO, S., OSMAN, I. H., AND ROUCAIROL, C., Eds. Meta-Heuristics: Advances and Trends in Local Search Paradigms for Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.
[121] WIKIPEDIA, THE FREE ENCYCLOPEDIA. Travelling salesman problem, 2007.
[122] WINEBERG, M., AND CHRISTENSEN, S. An Introduction to Statistics for EC Experimental Analysis. Tutorial at the IEEE Congress on Evolutionary Computation, 2004.
[123] XU, J., CHIU, S., AND GLOVER, F. Fine-tuning a tabu search algorithm with statistical tests. International Transactions in Operational Research 5, 3 (1998), 233–244.
[124] ZEMEL, E. Measuring the quality of approximate solutions to zero-one programming problems. Mathematics of Operations Research 6 (1981), 319–332.
[125] ZLOCHIN, M., AND DORIGO, M. Model based search for combinatorial optimization: a comparative study. In Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature, J. J. M. Guervos, P. Adamidis, and H.-G. Beyer, Eds., vol. 2439. Springer-Verlag, Berlin, Germany, 2002, pp. 651–661.