Predicting Good Compiler Transformations
Using Machine Learning
Edwin V. Bonilla
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2004
Abstract
This dissertation presents a machine learning solution to the compiler optimisation problem
focused on a particular program transformation: loop unrolling. Loop unrolling is a very
straightforward but powerful code transformation mainly used to improve Instruction Level
Parallelism and to reduce the overhead due to loop control. However, loop unrolling can also
be detrimental, for example, when the instruction cache is degraded due to the size of the
loop body. Additionally, the effect of the interactions between loop unrolling and other
program transformations is unknown. Consequently, determining when and how unrolling
should be applied remains a challenge for compiler writers and researchers. This project
works under the assumption that the effect of loop unrolling on the execution times of
programs can be learnt based on past examples. Therefore, a regression approach able to
learn the improvement in performance of loops under unrolling is presented. This novel
approach differs from previous work ([Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004]) because it does not formulate the problem as a classification task but as a regression task. Great effort has been invested in the generation of clean and reliable
data in order to make it suitable for learning. Two different regression algorithms have been
used: Multiple Linear Regression and Classification and Regression Trees (CART).
Although the accuracy of the methods is questionable, the realisation of final speed-ups on seven out of twelve benchmarks indicates that something has been gained through this learning process. A maximum re-substitution improvement of 18% has been achieved, along with overall performance improvements of 2.5% for Linear Regression and 2.3% for the CART algorithm. The present work is the beginning of an ambitious project that attempts to build a compiler that can learn to optimise programs; it can undoubtedly be improved in the near future.
Acknowledgements
Special thanks to my supervisor Dr. Chris Williams for his invaluable advice and
comprehensive revision of my progress throughout this project.
Thanks to Dr. Michael O'Boyle and Dr. Grigori Fursin for the discussions held during our
meetings that made possible the creation of the dataset that has been used in this project.
Thanks to Catalina Voroneanu and Bonny Quick for their patience when revising some
drafts of this dissertation.
Supported by the Programme Alβan, European Union Programme of High Level
Scholarships for Latin America, identification number (E03M14650CO).
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own
except where explicitly stated otherwise in the text, and that this work has not been
submitted for any other degree or professional qualification except as specified.
(Edwin V. Bonilla)
Table of Contents
Introduction
    Overview and Motivation
    Project Objectives
    Organisation
Chapter One: Literature Review
    1.1 Introduction
    1.2 Tuning heuristics and recommending program transformations
    1.3 Learning in a particular program transformation: loop unrolling
    1.4 Summary
Chapter 2: Background on Compiler Optimisation
    2.1 Introduction
    2.2 Definition of compilation
    2.3 Compiler organisation
    2.4 The purpose of a compiler
    2.5 An Optimising Compiler
        2.5.1 Goals of Compiler Optimisation
        2.5.2 Considerations for program transformations
        2.5.3 The process of transforming a program for optimisation
        2.5.4 The problem of interaction
        2.5.5 Types of program transformations
        2.5.6 The scope of optimisation
        2.5.7 Some common transformations
    2.6 Loop Unrolling
        2.6.1 Definition
        2.6.2 Implementation
        2.6.3 Advantages of loop unrolling
        2.6.4 Disadvantages of loop unrolling
        2.6.5 Interactions, again
        2.6.6 Candidates for unrolling
    2.7 Summary
Chapter 3: Data Collection
    3.1 Introduction
    3.2 The Benchmarks
    3.3 Implementation of loop unrolling
        3.3.1 Which loops should be unrolled?
        3.3.2 Initial experiments
        3.3.3 Loop level profiling
    3.4 Generating the targets
        3.4.1 Preparing the benchmarks
        3.4.2 Selecting loops
        3.4.3 Profiling
        3.4.4 Filtering
        3.4.5 Running the search strategy
    3.5 Technical Details
        3.5.1 The platform
        3.5.2 The compiler
        3.5.3 The timer's precision
    3.6 The results in summary
    3.7 Feature extraction
    3.8 The representation of a loop
    3.9 Summary
Chapter 4: Data Preparation and Exploratory Data Analysis
    4.1 Introduction
    4.2 The general framework for data integration
    4.3 Formal representation of the data
    4.4 Is this data valid?
        4.4.1 Statistical analysis
    4.5 Pre-processing the targets
        4.5.1 Filtering
        4.5.2 Dealing with outliers
        4.5.3 Target transformation
    4.6 Pre-processing the features
        4.6.1 Rescaling
        4.6.2 Feature selection and feature transformation
    4.7 Summary
Chapter 5: Modelling and Results
    5.1 Introduction
    5.2 The regression approach
    5.3 Learning methods used
        5.3.1 Multiple Linear Regression
        5.3.2 Classification and Regression Trees
    5.4 Parameters setting
    5.5 Measure of performance used
    5.6 Experimental Design
        5.6.1 Complete dataset
        5.6.2 K-fold cross-validation
        5.6.3 Leave One Benchmark Out cross-validation
        5.6.4 Realising speed-ups
    5.7 Results and Evaluation
        5.7.1 Complete dataset
        5.7.2 K-fold cross-validation
        5.7.3 Leave One Benchmark Out Cross-validation
        5.7.4 Realising speed-ups
        5.7.5 Feature construction
        5.7.6 Comparison to related work
    5.8 Summary and Discussion
Conclusions
Bibliography
Introduction
Overview and Motivation
The continuously increasing demand for high-performance computational resources has required great effort from hardware designers to develop the advanced components that make the construction of current applications possible. Thus, microprocessors' architectures have become more complex and have come to provide processing rates that until a few years ago were affordable only by a select group of users.
However, the resources offered by current processors are significantly underexploited, which
limits the possibility of running applications at maximum speed.
This 'bottleneck' is mainly a consequence of the limitations of the compiler, the very complex application used to transform the source code of programs written in a high-level programming language into machine-dependent code. Indeed, the main concern of compiler
writers centres not upon the well-studied tasks at the front end of the compilation process (such
as parsing and lexical analysis) but upon the discovery of ways in which to optimise the
execution time of programs. Hence, optimisation is not considered an additional feature of
the compilation process but has become a crucial component of existing compilers.
However, achieving high performance on modern processors is an extremely difficult task
and compiler writers have to deal with NP-complete problems.
Numerous program transformations have been created in order to guarantee optimal use of
the resources of existing architectures. Nonetheless, all these transformations fail in the sense
that the compiler optimisation problem is far from being optimally solved. In fact, although
these transformations should, in theory, lead to a significant improvement in the execution time of programs, they have instead become new problems to be solved.
Certainly, due to the great variety of transformations that are currently available, the number
of parameters each transformation involves and the unknown and almost enigmatic effect of
the interactions among these transformations, the compiler optimisation problem has been
transferred to the task of discovering when and how program transformations should be
applied.
Considering the huge search space in compiler optimisation, compiler writers commonly rely
on heuristics that suggest when the application of a particular transformation can be
profitable. The usual approach adopted is running a set of benchmarks and setting the
parameters of the heuristics on these benchmarks. Although valid, this approach has been shown to exploit the power of the transformations insufficiently, leading to suboptimal solutions or sometimes even to slower programs. This is understandable given that the heuristics adopted may be specific to
particular types of programs and target architectures.
This project tackles the compiler optimisation problem from a different point of view. It does
not attempt to tune heuristics previously constructed in order to optimise programs, but
investigates the use of machine learning techniques with the aim of discovering when a
particular transformation can be beneficial. Machine Learning techniques have been
successfully applied to different areas, from business applications to star/galaxy
classification. The type of learning adopted in this project is the one based on past examples
(training data) from which a model can be constructed. This model must have predictive power, i.e. although it is built on a set of training examples, it must be able to perform successfully on novel data.
Project Objectives
Thus, the principal goal of this project is to use supervised machine learning techniques to
predict good compiler transformations for programs. Therefore, if a number of examples
of good transformations for programs are given, the aim is to use machine learning methods
to predict these transformations based on features that characterise the programs. Given the
great number of program transformations that are available, this project will focus on a
relatively simplified version of the problem by applying machine learning techniques to
predict the effect of a particular transformation: loop unrolling.
Loop unrolling is a very straightforward but powerful program transformation mainly used
to improve Instruction Level Parallelism, or ILP (the execution of several instructions at the
same time) and to reduce the overhead due to loop control. Studies on loop unrolling date
from about 1979 when Dongarra and Hinds (1979) investigated the effect of this
transformation on programs written in Fortran. However, understanding the interaction of the
parameters that may affect it as well as its effects on the execution time of the ultimate code
generated by the compiler is still a challenge. The main problem to be solved in loop
unrolling is deciding whether this transformation is beneficial or not and, yet further, how
much a loop should be unrolled in order to optimise the execution time of programs.
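To make the transformation concrete, the sketch below unrolls a simple accumulation loop by a factor of four. Python is used here purely for readability (the programs studied in this project are compiled Fortran benchmarks, where the transformation actually pays off), and a real implementation must also include an epilogue that handles trip counts that are not a multiple of the unroll factor.

```python
def dot_rolled(a, b):
    # Original loop: one loop-control check and branch per element.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_unrolled4(a, b):
    # Body replicated four times: one loop-control check per four
    # elements, and four independent multiply-adds per iteration
    # that the hardware can overlap (Instruction Level Parallelism).
    total = 0.0
    n = len(a)
    i = 0
    while i + 4 <= n:
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
        i += 4
    while i < n:  # epilogue: leftover iterations when n % 4 != 0
        total += a[i] * b[i]
        i += 1
    return total
```

Both functions compute the same value; unrolling changes only how the work is expressed, which is precisely why its profitability depends on the interaction with the hardware and with other transformations rather than on the result.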
Previous work has focused on building classifiers in order to predict the profitability of loop
unrolling [Monsifrot et al., 2002] and how much this transformation should be applied
[Stephenson and Amarasinghe, 2004]. Although well-founded, these approaches may
encounter some difficulties when noisy measurements are present and when the data used for
learning is limited. This project goes beyond the classification problem and proposes a
regression approach to predict the improvement in performance that unrolling can
produce on a particular loop. The regression approach is a more general formulation of the
classification problem and previous solutions can be obtained with it. Furthermore, the
machine learning solution to loop unrolling based on a regression method is smoother than
the one obtained by classification methods given the degree of noisiness the measurements
may have and the limitations in terms of the number of training examples that are available.
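The difference between the two formulations can be sketched as follows: a classifier predicts only a label (unroll or do not unroll), whereas a regressor predicts the expected improvement itself, from which the label is recovered by thresholding. The sketch below fits a one-feature linear model by ordinary least squares; the feature (loop body size) and the training data are purely hypothetical.

```python
def fit_least_squares(xs, ys):
    # Ordinary least squares for y ~ w * x + c with a single feature.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    c = mean_y - w * mean_x
    return w, c

# Hypothetical training data: loop body size against the measured
# speed-up (%) obtained by unrolling that loop.
body_sizes = [2.0, 4.0, 6.0, 8.0, 10.0]
speedups = [9.0, 7.0, 5.0, 3.0, 1.0]
w, c = fit_least_squares(body_sizes, speedups)

def predict_speedup(body_size):
    return w * body_size + c

def should_unroll(body_size):
    # The classification decision falls out of the regression:
    # unroll only when the predicted speed-up is positive.
    return predict_speedup(body_size) > 0.0
```

On this toy data the fit is exact (w = -1, c = 11), so a loop with body size 12 is predicted to slow down under unrolling and is left alone, while a loop with body size 5 is unrolled.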
Bearing in mind that the process of applying machine learning techniques requires great
effort in the preparation and analysis of the data to make it suitable for learning and the
correct use of appropriate techniques for the regression task, the following specific
objectives have been stated for the project:
• To develop a systematic approach that enables the application of machine learning
techniques to compiler optimisation focused on loop unrolling.
• To investigate the appropriateness of learning based on past examples in compiler
optimisation.
• To propose a regression solution able to determine the most influential parameters
involved in loop unrolling and able to predict the effect of this transformation on the
execution time of programs.
• To assess the quality and the impact of the results obtained by two different
regression methods used to predict the effect of loop unrolling.
Organisation
This dissertation assumes that the reader does not have a background in compilers and,
therefore, an entire chapter will be dedicated to explaining the most relevant issues involved
in compilation and compiler optimisation. Furthermore, the most important terminology
related to compiler optimisation needed to understand a specific concept or procedure will be
progressively explained. Similarly, although no separate section has been devoted to
providing a background on machine learning techniques, those concepts that are believed to
be crucial to understanding what is stated will be described as each chapter is developed. The
organisation of this dissertation is presented below.
Chapter One provides the Literature Review, describing previous work that involves
machine learning and/or artificial intelligence techniques applied to compiler optimisation.
Each approach is briefly explained, giving some details about the implementation and the
learning technique used. A discussion concerning their advantages and drawbacks is
presented and an explanation about how this previous work has influenced and motivated the
present project is given.
Chapter Two presents a Background on Compiler Optimisation. It aims to familiarise the
reader with the terminology involved in compiler optimisation, which will be used
throughout this dissertation, and give the background necessary to understand subsequent
chapters. Initially, the definition of compilation is provided, followed by an explanation of
the most important issues in compiler optimisation. The concept of program transformation
is briefly described and some code transformations are mentioned. A special section is
devoted to loop unrolling, which is the program transformation of interest in the present
dissertation.
Chapter Three describes the Data Collection process that has been carried out in this project.
It presents the most relevant factors that were taken into account to execute the experiments
and to generate the data available for learning. In this chapter, the benchmarks that were used
to construct the dataset are described; the implementation used for loop unrolling is
explained; some technical issues regarding the platform, the compiler and the optimisation
level used for the experiments are provided; the results of the data collection are summarised
and the features extracted from the loops are given.
Chapter Four presents the Data Preparation and Exploratory Data Analysis that was
carried out after the data collection process. It describes the process of cleaning, analysing
and preparing the data in order to make it suitable for learning. This chapter provides an
insight into the data that will be used for modelling, explaining how the data is processed
throughout different stages and is made available for analysis and learning.
Chapter Five presents the section of Modelling and Results. This chapter explains the
machine learning solution to loop unrolling based on the regression approach and describes
the results obtained with the techniques used. It details how the regression problem is
formulated and why it is considered more general as compared to previous approaches that
applied machine learning techniques to loop unrolling. It provides an overview of the
regression methods that have been used in this project and defines the measure of
performance that has been used to evaluate the methods. It presents and evaluates the results
obtained and compares the solution found with previous work.
Finally, the conclusions drawn on completion of this project are presented, together with a
summary of the work that has been done, the achievements obtained and the future work that
is proposed for continuing research in the area of compiler optimisation with machine
learning.
An additional clarification needs to be provided before reading the following chapters. At
some stages the term data mining is used to describe the work that has been developed in this
project; sometimes this term seems to be equated with machine learning. However, the
reader should bear in mind that although there is no clearly defined line separating these fields,
it is assumed that data mining is the complete process of discovering knowledge from data
and that machine learning is only one important step in this process. Thus, even though the
main goal of this project is to apply machine learning techniques to compiler optimisation,
the work that has been developed is essentially a data mining application.
Chapter One
Literature Review
1.1 Introduction
This section describes the most relevant work involving machine learning and/or artificial
intelligence techniques applied to compiler optimisation. Although there are other
approaches that have used machine learning for specific tasks in compiler optimisation, they
are not mentioned in this section as they are distant from the central idea and the aims of this
project. As described throughout this chapter, the attractive idea of applying machine
learning to compiler optimisation is relatively new. Some authors have proposed solutions
that are easy to implement and automate and others have preferred methods that are difficult
to deal with and can take a long time to obtain results.
The previous work that is presented in this chapter has been divided into two groups: the use
of machine learning as a general approach to compiler optimisation and the application of
machine learning methods to a specific program transformation. Thus, while section 1.2
presents previous work that has attempted to tune several code transformations by using
Evolutionary Computing and Case-Based Reasoning, Section 1.3 explains how Decision
Trees and Nearest Neighbours have been applied to learning in a particular program
transformation: loop unrolling. Each approach will be briefly described, giving some details
of the implementation and the learning technique used; a discussion concerning their
advantages and drawbacks will be presented; finally, a summary and an explanation about
how this previous work has influenced and motivated the present project will be given in
section 1.4.
1.2 Tuning heuristics and recommending program transformations
Stephenson et al. (2003) used evolutionary computing to build up priority functions. Priority
functions are, in some way, ubiquitous in constructing heuristics for compiler optimisation.
In other words, compiler writers commonly rely on the assumption that a specific
optimisation technique is strongly tied to a certain function, called priority function. This
function involves some of the parameters that possibly affect a particular heuristic. Under
this assumption, they used Genetic Programming [Koza, 1992] to search the priority function
solution space. Their work was focused on three different heuristics: hyperblock formation,
register allocation and data prefetching, using Trimaran [Trimaran] and the Open Research
Compiler [ORC] for their experiments.
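To make the idea concrete, a priority function is simply a scoring rule over the properties of a candidate. The function below is a hypothetical hand-written example of the general shape of expression such a search explores, here scoring loops as unrolling candidates; the features, weights and form are invented for illustration and are not taken from Stephenson et al. (2003).

```python
def priority(trip_count, body_size, has_call):
    # Hypothetical priority function: favour loops with high trip
    # counts and small bodies, and rule out loops containing calls.
    # Genetic Programming searches the space of expressions of this
    # general shape instead of fixing one expression by hand.
    if has_call:
        return 0.0
    return trip_count / (1.0 + body_size)

# Candidate loops, ranked by priority (highest first).
loops = [
    {"name": "loop_a", "trip_count": 100, "body_size": 4, "has_call": False},
    {"name": "loop_b", "trip_count": 100, "body_size": 40, "has_call": False},
    {"name": "loop_c", "trip_count": 500, "body_size": 8, "has_call": True},
]
ranked = sorted(
    loops,
    key=lambda l: priority(l["trip_count"], l["body_size"], l["has_call"]),
    reverse=True,
)
```

Evolving such a function means mutating and recombining expressions like the one above, with the fitness of each candidate measured by the execution time of the compiled benchmarks — which is exactly why the search is so expensive.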
Although it may be appealing for many computer scientists to construct programs based on
evolution and natural/sexual selection, this approach has more drawbacks than advantages.
To avoid extending this section more than necessary, I will mention only four aspects that may discourage the use of Genetic Programming and evolutionary computing for compiler optimisation. The first reason is related to a very popular term in machine learning: overfitting. Plainly explained, overfitting is the effect of fitting a model to the training data in such a way that it is unable to generalise and perform successfully on novel data. In fact, the generalisation of Genetic Programming for this type of problem has not been clearly demonstrated, and the results of the work of Stephenson et al. (2003) reflect this effect.
Secondly, given that the so-called 'fitness function' used to evaluate candidate solutions is strongly tied to the execution time of the programs, carrying out the task of selecting and 'evolving' the solution may take a very long time. Furthermore, and this is the third justification for avoiding Genetic Programming for this problem, different runs of the technique on the same input can lead to different results, which is referred to as instability of the solution. Finally, but no less important, the adjustment of the parameters of the heuristic may simply be transferred to the tuning of the parameters of the technique itself.
In an attempt to build an interactive tool to provide the user with a guide to performance tuning, Monsifrot and Bodin (2001) developed a framework based on Case-Based Reasoning (CBR). The purpose was to advise the user on possible code transformations in order to reduce the execution time of programs. They adapted the general idea of Case-Based Reasoning, which consists of learning from past cases (see [Shen and Smaill, 2003] pages 72-78 for an introduction or [Kolodner, 1993] for a thorough description), to the compiler optimisation problem by detecting fragments of code (i.e. loops) that are candidates for optimisation, checking their similarities with past cases and reusing the solutions of those cases. The
system was implemented using TSF [Bodin et al., 1998] and codes written in Fortran 77. The
features used to characterise the loops were selected according to four different categories:
loop structure, arithmetic expressions, array references, and data dependences. Loop
transformations such as unrolling innermost and outer loops, unroll-and-jam and loop
blocking were considered.
Like Nearest Neighbours, Case-Based Reasoning belongs to the family of Instance-Based Learning
methods, which classify new instances according to their similarities with a set of training
examples ([Mitchell, 1997] pages 230-245). Although Case-Based Reasoning can be
considered a more sophisticated version of Nearest Neighbour methods, the work done in
[Monsifrot and Bodin, 2001] does not detail important stages in CBR such as the
modification of the prior solution or the repair of the proposed solution when it is not
successful.
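The retrieval step common to both families of methods can be sketched as follows: each loop is reduced to a numeric feature vector, and the stored case closest to the new loop (here under Euclidean distance) supplies the candidate solution. The feature values and solutions below are hypothetical; a full CBR system would additionally adapt and repair the retrieved solution, which is exactly the stage left undetailed in [Monsifrot and Bodin, 2001].

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_nearest(case_base, query):
    # Instance-based learning: reuse the stored solution of the most
    # similar past case.
    return min(case_base, key=lambda case: euclidean(case["features"], query))

# Each case pairs loop features (e.g. nesting depth, body size,
# number of array references) with the transformation that worked
# for that loop in the past.
case_base = [
    {"features": [1.0, 4.0, 2.0], "solution": "unroll by 4"},
    {"features": [2.0, 30.0, 8.0], "solution": "do not unroll"},
]
best = retrieve_nearest(case_base, [1.0, 6.0, 2.0])
```

The quality of such retrieval depends entirely on the features (the indices) used to describe the loops, which is why the feature set proposed in this work is its most valuable contribution.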
An aspect of this approach worth highlighting is that it presents a wide variety of sensible and important features for characterising programs and, more specifically, for describing loops, which can be crucial when working with machine learning techniques. These features, also called
indices in the CBR terminology, make possible the identification of past cases that can be
reused and modified to provide the solution for a given problem. However, there are several
caveats to mention about this approach. In the first place, an insufficient number of loops were used in the experiments, and they cannot be considered representative of programs in general. In fact, the initial experiments were performed on a single benchmark containing 64 loops, which clearly limits the capacity of the method to generalise to new problems. Furthermore, only one specific program was used to test the performance of the system, which reinforces the suspicion of biased results. Nevertheless, this is understandable given the difficulty of collecting data for this kind of application. Finally, and possibly the greatest drawback of this solution, it contributes little to the main goal of reducing the effort compiler writers spend on optimisation. Certainly, unlike traditional compilers, the system suggests modifications without checking them for legality, so a great deal of work is still left to the user, who is responsible for this task.
1.3 Learning in a particular program transformation: loop unrolling
In a more pragmatic approach, which was indeed the motivation for the present dissertation, Monsifrot et al. (2002) concentrated their efforts on a particular transformation for optimising the execution time of programs: loop unrolling. Based on a characterisation of loops from different programs written in Fortran 77, they investigated whether it was possible to learn a rule that could predict when unrolling is beneficial or detrimental. In
this case, unrolling was implemented at the source code level using TSF [Bodin et al., 1998]
and the experiments were performed with the GNU Fortran compiler [g77]. For learning, they applied decision trees, an appropriate technique when the readability of the results is required. Briefly, their aim was to build a binary classifier able to decide whether or not to perform loop unrolling.
Even though their results were not very encouraging, which is explainable given that only one code transformation was taken into account, the methodology itself must be highlighted, as well as the effort invested in obtaining the loop abstraction. Nevertheless, several limitations of their work must be mentioned. Firstly, there is no reference to how much a loop should be unrolled or how this decision should be taken. This is in fact an important issue in loop unrolling, because the technique can be advantageous for one unroll factor but detrimental for another. Secondly, given the methods they used to carry out feature extraction, two loops with the same representation (loop abstraction) can belong to different classes, positive or negative, where positive refers to the case when unrolling improves the execution time and negative to the opposite situation. This "noisy" training data seems to have severely affected their results. Finally, although it is explicitly acknowledged in the original paper, the large number of negative examples and the small number of positive examples also affected the performance of their technique. However, no attempt was made to deal with this unbalanced data.
Following the work of Monsifrot et al. (2002), Stephenson and Amarasinghe (2004) have recently published a technical report that describes how to go beyond a binary classification of the suitability of loop unrolling. Besides predicting whether unrolling is beneficial or not, they also tried to predict the best unroll factor for loops. Unlike Monsifrot et al. (2002), who applied unrolling at the source code level, they implemented unrolling in the back-end of the compiler and used the Open Research Compiler [ORC] for their experiments with codes written in C, Fortran and Fortran 90. As in [Monsifrot and Bodin, 2001], their technique was based on Instance-Based Learning methods; in fact, they used Nearest Neighbours, a very simple machine learning method commonly used for classification, although also applicable to regression. In summary, their work focused on solving the multi-class problem of predicting the unroll factor that guarantees a loop's minimum execution time.
Although not directly comparable, given the differences in experimental methodology and in the machine learning algorithms used, their results were relatively better than those obtained by Monsifrot et al. (2002). Aware of the variability in the execution times of the loops, they invested considerable effort in instrumenting the loops and ran each experiment 30 times, taking the median as the representative value for each training data point. Although this approach is sensible and valid, much can be gained if information about the behaviour of the execution time of the loops under unrolling is retained. It then becomes possible to formulate a more general solution that directly models the improvement in performance of the loops when they are unrolled. This is the solution proposed in the present dissertation, and it will be referred to as the regression approach.
1.4 Summary

This chapter has presented the relevant previous work that has applied machine learning to compiler optimisation. In general, machine learning has been used to provide solutions applicable to a range of optimisation problems, but it has also been applied to a specific code transformation: loop unrolling, which is the interest of the present project. Certainly, it is necessary to remark that this dissertation has been strongly motivated by the last two pieces of work described above, and many of their ideas were analysed in order to formulate a more general approach that can easily deal with variability and noise and can generalise more appropriately to novel data.
Chapter 2
Background on Compiler Optimisation
2.1 Introduction

This chapter presents an overview of compilation and describes the most important concepts related to compiler optimisation. It is not the aim of this chapter to provide an in-depth study of compilers. Rather, this section attempts to familiarise the reader with the terminology of compiler optimisation, which will be used throughout this dissertation, and to give the background necessary to understand subsequent chapters. A general-to-specific methodology is used to relate the different topics. Thus, the definition of compilation is provided in section 2.2 and the organisation of a compiler is given in section 2.3. The aim of a compiler is described in section 2.4 and the most important issues in compiler optimisation are explained in section 2.5, where the concept of program transformation is briefly described and some code transformations are mentioned. Section 2.6 is devoted to loop unrolling, which is the program transformation of interest in the present dissertation. Finally, a summary of the chapter is presented in section 2.7.
2.2 Definition of compilation

The term compilation can be defined as the process of transforming the source code of a program (in a high-level language) into object code (in machine language) executable on a target machine. Although this definition can be seen as more restrictive and less general than others (see [Cooper and Torczon, 2004] for an alternative definition), it is sufficient to understand the following concepts related to compiler optimisation. Hence, in its most basic form a compiler can be understood as a black box whose input is the source code of a program written in a high-level language and whose output is executable code for a specific machine, as shown in Figure 2.1.
This mysterious black box is actually a very complex program responsible for making the job easier for many programmers. Indeed, it hides the complexity of translating an easy-to-use high-level language into a less understandable machine-dependent code. Commercial compilers usually provide features beyond the definition given above, such as debugging support or even an Integrated Development Environment (IDE), and these features are commonly included in the functionality of the compiler itself. However, the interest of this section is in understanding the principal function of translation: how it is performed, which issues may affect it and how the generated code can be improved. To achieve this goal, it is necessary to start by describing the internal structure of a compiler.
2.3 Compiler organisation

In a very simple form, a modern compiler can be thought of as a three-layer structure where the output of one layer is the input of the next. These three layers, namely the front-end, the optimiser and the back-end, perform different tasks that influence the ultimate code generated. A possible structure for a modern compiler is shown in Figure 2.2. The functions performed by each layer are:
The Front-end is responsible for the lexical analysis and parsing. It checks if the source
code satisfies the static constraints of the language in which it is implemented. Finally, it
converts the code into a more convenient form called Intermediate Representation (IR).
The Optimiser takes the intermediate representation of the program and applies a series of
transformations that can possibly improve the object code.
The Back-end receives the representation of the transformed code and converts it into a
language that the specific machine can understand. This language explicitly deals with the
management of physical resources available in the target architecture.
Figure 2.1: Basic form of a compiler. (Diagram: Source code (Program) → Compiler → Object Code (Executable).)
Although Figure 2.2 presents compilation as a sequential process, this division of labour between the layers is sometimes not clearly delimited. For example, after applying some transformations it may be necessary to perform additional analyses on the modified code. However, it is instructive to bear in mind that there are three essential tasks performed by a compiler: analysis and translation of the code into an intermediate representation, improvement of the intermediate representation with the help of several program transformations, and translation of the code into a machine-understandable language.
2.4 The purpose of a compiler

Having explained the organisation of a compiler as a three-layer structure with well-defined functions, two questions can be raised about the final output of the compilation process. Firstly, what has been gained by applying this process? Secondly, does the compilation of the program make sense if the meaning of the initial code is changed?
Figure 2.2: A possible structure of a modern compiler. (Diagram: Source code → Front-end (convert the code into IR) → Optimiser (apply transformations) → Back-end (convert into assembly code) → Object code.)
The first question goes to the core of the compilation process. Even when the optimisation phase is not applied, a lot has been gained, because it is the compilation process that makes the program executable on the target machine. Certainly, if compilers were not available, programmers would have to write their applications directly in assembly code, which would clearly take much more time than using a high-level programming language1. Hence, it can be concluded that a compiler improves the initial code.
Now it is possible to answer the second question with a simple statement: if the compilation process changes the meaning of the initial program, all the effort invested by the compiler is useless. Unquestionably, a compiler "must preserve the meaning of the program being compiled" [Cooper and Torczon, 2004]. This preservation is usually referred to in the literature as the correctness of the compiler.
2.5 An Optimising Compiler

So far, the idea of a modern compiler has been introduced as a structure composed of three layers: the front-end, the optimiser and the back-end. The functions of the front-end and the back-end have been clearly explained, but the role of the optimiser has been purposely described in a general way by using the term improvement. The following subsections discuss what optimisation means and why it is important in the compilation process.
2.5.1 Goals of Compiler Optimisation

Compiler optimisation is mainly concerned with the ability of the compiler to optimise or improve the generated code rather than with enhancing the compilation process itself. Therefore, although it may be very important to reduce the compilation time, and indeed this is an issue to take into account in compiler optimisation, the first goal of an optimising compiler is to discover opportunities in the program being compiled to apply transformations that improve the code. Improvement, in this case, can refer to several objectives depending on what the user actually requires. For example, the user might be interested in reducing the
execution time of a program. However, the goal could also be to generate code that occupies the least possible space, or even to trade off speeding up the program against reducing the size of the generated code. Additionally, it might also be desirable to guarantee an efficient use of the resources of the target machine, for example memory, registers and cache. In general, there is no single objective in compiler optimisation. Nevertheless, compiler optimisation is commonly associated only with the purpose of speeding up programs, and indeed this is the goal the present dissertation pursues. Therefore, the term optimisation will henceforth denote the effect of reducing the execution time of programs.

1 A person with a basic knowledge of computer science might argue that programmers could use interpreters. However, interpreters are also programs that transform source code into machine-understandable code. The difference is that they translate and execute line by line rather than the whole program at once, as compilers do. Additionally, it is widely accepted that compiled code is much more efficient than interpreted code.
Compilers can improve the execution time of programs by carrying out subtasks such as minimising the number of operations executed by the program, managing computational resources (cache, registers and functional units) efficiently and minimising the number of accesses to main memory. These tasks, jointly executed by the compiler, can greatly benefit not only the end-user of the program but also programmers and even hardware designers. The end-user logically profits from optimisation because the application executes faster. Programmers can ignore the details of producing appropriate code for a particular machine and concentrate on high-level structures and good application design. Finally, hardware designers can be confident that compilers will appropriately exploit the capabilities of their products.
2.5.2 Considerations for program transformations

As expressed above and depicted in Figure 2.2, a compiler can optimise programs by applying code transformations, which will hopefully improve the generated code by reducing its execution time. However, there are several issues to consider when applying a particular transformation: correctness, profitability and compilation time.
2.5.2.1 Correctness: As for the general compilation process, the transformation applied must be correct. The principle is basically the same: the code produced by a transformation must preserve the meaning of the input code. In other words, if the meaning of the program changes, the transformation should not be applied. Correctness is also referred to in the literature as the legality of the transformation. Bacon et al. (1994) provide a more formal definition of legality:
"A transformation is legal if the original and the transformed programs produce exactly the same output for identical executions."
However, in this case, the legality of the transformation is not sufficient. Since we are considering transforming the code in order to optimise the program, a transformation must also be profitable.
2.5.2.2 Profitability: A transformation applied to a particular fragment of a program must improve the ultimate code generated. In other words, applying a particular transformation is expected to produce an improvement in the execution time of the program. This is in fact the purpose of applying a transformation for optimisation but, in many cases, the effect of transforming a program is not noticeable or, even worse, it degrades the performance of the final code generated2.
Finally, although sometimes neglected, compilation time is also an important factor in determining whether a transformation is actually beneficial. If the transformation considerably increases the compilation time of the program, there should be serious doubts about applying it. Ideally, there should be a trade-off between the improvement in execution time and the increase in compilation time.
2.5.3 The process of transforming a program for optimisation

The transformation of a program for optimisation can be divided into three steps as follows:

2.5.3.1 Identification: The compiler has to identify which parts of the code can be optimised and which transformations may be applied to them.

2.5.3.2 Verification: The legality of each transformation must be ensured, i.e. it must be checked that the transformation does not change the meaning of the program.

2.5.3.3 Conversion: This refers to the process of actually applying a particular transformation.
2 Certainly, if all transformations guaranteed an improvement in the performance of the final code, there would be no reason for research in this area or for the present dissertation. Therefore, the questions of how and when to apply a particular transformation constitute the major problem in compiler optimisation.
The transformation process is depicted in Figure 2.3.
Of these three steps, the identification phase, i.e. the process of recognising fragments of the program susceptible to optimisation and finding transformations that can potentially improve the code, is the most complicated. Compiler writers commonly rely on heuristics to determine these transformations and the parameters that define them. However, given that many parameters may be involved and that one transformation may affect the applicability of subsequent transformations, these heuristics commonly lead to suboptimal solutions.
2.5.4 The problem of interaction

As explained above, compiler optimisation is a very complex problem, not only because many parameters may be involved in each transformation but also because one transformation can impede or enable the applicability of another. The latter is known as the effect of interaction among the different transformations. Certainly, the actual performance of the compiled code depends on the outcome of the interactions among all the transformations applied during the compilation process. In general, while some transformations are demonstrably beneficial for a particular program when applied independently, they can degrade the execution time of the program when executed in sequence.
Thus, considering that optimisation problems such as register allocation and instruction
scheduling are NP-complete in themselves and that different transformations can interact
with each other, the compiler optimisation problem is far from being optimally solved.
Figure 2.3: Program transformations for optimisation. (Diagram: Code (IR) → Identification (which part of the code?) → Verification (is it legal?) → Conversion (apply) → Transformed code.)
2.5.5 Types of program transformations

Program transformations can be split into two classes depending on the part of the compiler in which they are applied. Thus, they can be classified as:

2.5.5.1 Machine-independent transformations: These convert an intermediate representation (IR) of the program into another intermediate representation. Consequently, the code generated does not depend on a specific machine or architecture. However, since the code produced is not the final code and may be subject to further changes, the profitability of these transformations cannot be ensured. High-level transformations that eliminate redundancies, remove unreachable or useless code or enable other transformations can be considered to belong to this group. Loop unrolling is, in general, an example of a machine-independent transformation.
2.5.5.2 Machine-dependent transformations: Machine-dependent transformations are
also called machine-level transformations. They convert the intermediate representation of
the program directly into assembly code. Thus, the code generated is tied to a specific
architecture. Those transformations that consider particularities of the target architecture
belong to this group, for example, instruction scheduling, instruction selection and register
allocation.
2.5.6 The scope of optimisation

Program transformations can be applied at different levels of the code. For example, they can be applied to statements, basic blocks3, innermost loops, general loops, procedures (intra-procedural) and the whole program (inter-procedural). As the level of granularity grows, the complexity of the analysis increases and applying a particular transformation becomes more costly, because it increases the compilation time of the program. Loop-level transformations are very important, as loops are considered to be the places where programs spend most of their time.
3 A basic block can be defined as straight-line code; in other words, a block of code with no branches.
Table 2.1: Some common transformations for compiler optimisation (taken from [Bacon et al., 1994])

Loop Transformations: Loop Interchange, Loop Skewing, Loop Reversal, Loop Blocking (Tiling), Loop Pushing, Loop Fusion, Loop Peeling, Loop Code Motion, Loop Normalisation, Loop Unrolling.
Memory Access Transformations: Memory Alignment, Array Expansion, Array Contraction, Scalar Replacement, Code Co-location, Array Padding.
Redundancy Elimination: Unreachable Code Elimination, Useless Code Elimination, Dead Variable Elimination, Common Subexpression Elimination, Short-Circuiting.
Procedure Call Transformations: Frame Collapsing, Procedure Inlining, Parameter Promotion.
Partial Evaluation: Constant Propagation, Constant Folding, Algebraic Simplification.
2.5.7 Some common transformations

To give an idea of the type and number of program transformations available for compiler optimisation, some of them are shown in Table 2.1 (taken from [Bacon et al., 1994]). Many of these transformations are very well known and studied in the literature; for a good review of their meaning and implementation, see [Bacon et al., 1994].

As it is the focus of the present dissertation, loop unrolling is described in more detail in section 2.6.
2.6 Loop Unrolling

Loop unrolling is a very straightforward but powerful program transformation mainly used to improve Instruction Level Parallelism, ILP (the execution of several instructions at the same time), and to reduce the overhead due to loop control. Although extensively studied in the literature ([Dongarra and Hinds, 1979] and [Davidson and Jinturkar, 2001]), it remains of interest to the compiler community. Understanding the interaction of the parameters that may affect it, as well as its effects on the execution time of the ultimate code, is still a challenge.
2.6.1 Definition

Loop unrolling is the replication of the loop body a certain number of times u, called the unroll factor. As the loop body is replicated, the loop control code must be adjusted to guarantee that the loop body is executed exactly the same number of times as in the rolled (original) version. Additionally, to handle the leftover iterations, a prologue or epilogue is added before or after the unrolled loop. To illustrate how loop unrolling works, the example shown in Figure 2.4, taken from [Bacon et al., 1994], depicts a loop that has been unrolled twice. The notation used does not belong to a specific language, although it is very similar to Fortran 77 except for the way elements of arrays are accessed.
The left side of Figure 2.4 shows the original loop, composed of only one statement (the loop body), with an iteration step of 1 (the default). The right side shows the loop unrolled using a factor u=2. Thus, the loop body is replicated twice4, the array accesses are modified accordingly and the iteration step is changed to 2. Since the value of the trip count (n-2) is not known at compile time, an epilogue has been added to guarantee that the unrolled loop performs the same iterations as the original loop.
2.6.2 Implementation

Loop unrolling can be implemented by hand (manually) or by the compiler (automatically). It is manually implemented, as in Figure 2.4, either by the programmer or by a software transformation tool that works on top of the compiler. It can be automatically implemented by the compiler on the source code, on an intermediate representation of the program or on the assembly code (back-end), i.e. on an optimised version of the program. Implementing loop unrolling at the source code level or on an intermediate representation can be more profitable than at the back-end of the compiler; for example, it might make the code more susceptible to the application of other program transformations. However, it can also be unfavourable, because it may impede the application of other transformations. Arguably, one of the reasons to implement loop unrolling at the back-end of the compiler is that its profitability can be almost ensured when it is applied to one of the latest representations of the program. An important point to note is that, if judiciously applied, unrolling is an always-legal transformation.

4 There is no general agreement in the literature about whether the rolled version of the loop corresponds to u=1 or u=0. The former notation, which is believed to be more understandable, will be used throughout this dissertation.

Original loop:
    do i=2, n-1
      a[i] = a[i] + a[i-1] * a[i+1]
    end do

Loop unrolled twice:
    do i=2, n-2, 2
      a[i] = a[i] + a[i-1] * a[i+1]
      a[i+1] = a[i+1] + a[i] * a[i+2]
    end do
    ! epilogue
    if (mod(n-2,2) = 1) then
      a[n-1] = a[n-1] + a[n-2] * a[n]
    end if

Figure 2.4: Original loop (left) and loop unrolled by a factor u=2 (right) (taken from [Bacon et al., 1994])
In general, loop unrolling can offer several advantages that may improve the execution time
of programs. However, these benefits can be diminished by some side effects of the
transformation.
2.6.3 Advantages of loop unrolling

As mentioned above, loop unrolling can be considered a beneficial transformation because it may:

o Improve Instruction Level Parallelism (ILP). Instruction Level Parallelism refers to the ability of the hardware, aided by the compiler's scheduling, to execute multiple instructions simultaneously. Hence, if the size of the loop body is increased, the number of instructions that can be scheduled out of order5 also increases, so that more instructions can be executed in parallel [Monsifrot et al., 2002].

o Reduce the overhead due to loop control. Loop overhead is caused by the increments of the loop variable, the tests applied to this variable and the branch operations. All these operations are reduced because the smaller number of iterations is compensated by the replication of the loop body; for example, in Figure 2.4 the loop overhead is halved. Therefore, if the loop executes a considerable number of iterations, the improvement in execution time due to the reduced loop overhead is appreciable.
Additionally, loop unrolling can also:

o Enable other transformations. Loop unrolling applied at an early stage can give the code the appropriate shape for other transformations, for example common subexpression elimination.

o Eliminate loop copy operations. Copy operations are sometimes necessary when the loop calculates a value that is needed in a subsequent iteration, which is known in compiler jargon as a loop-carried data dependency. These copy operations can be eliminated by unrolling.

o Improve memory locality (registers, data cache or TLB6). Memory locality is improved when local memory resources are accessed efficiently. For example, on the right side of Figure 2.4 the values a[i] and a[i+1] are each used twice, so the number of loads per original iteration is reduced from 3 to 2.

5 This term refers to the situation in which instructions are not executed in the specific order given by the program.
2.6.4 Disadvantages of loop unrolling

By far the major drawback of loop unrolling is that it can degrade the performance of the instruction cache. Parameters such as the size of the loop body after unrolling (which depends on the unroll factor used) and the size, organisation and replacement policy of the instruction cache may mean that some instructions cannot be kept in the instruction cache and must be fetched from main memory, which can be thousands of times slower than the cache. When this happens, cache misses are said to have occurred. The number of cache misses can certainly affect the final execution time of the program and diminish any benefit from loop unrolling; it may even deteriorate the execution time completely and slow down the program.
Besides degrading the instruction cache, loop unrolling can have other negative effects. For example, as the loop body becomes bigger, the number of instructions increases, which may augment the number of address calculations and make the instruction scheduling problem more complex. Moreover, additional loads and stores may be needed, causing a greater demand for registers: the register pressure is said to have increased. Register pressure is a measure of the ratio between the registers demanded by a program and the actual number of registers available on a particular machine. It is bad news if the register pressure becomes much greater than 1, because some registers' values then have to be saved into main memory and the registers freed for other purposes; in this case, the registers are said to have been spilled [Bacon et al., 1994].
6 TLB stands for Translation Lookaside Buffer. It is a table in the processor that maps virtual addresses into real addresses of memory pages that have been recently referenced.
Finally, another disadvantage of loop unrolling is that it may prevent other optimisation techniques: after loop unrolling, some transformations may no longer be profitable or simply no longer applicable.
2.6.5 Interactions, again

It might seem rather odd that one of the advantages of loop unrolling is that it enables some transformations while one of its drawbacks is that it limits the applicability of others. The explanation of this apparent contradiction lies in the already familiar hurdle of interactions. The interactions between most compiler optimisation techniques are still poorly understood, and the results can vary depending on the input program, the target architecture and the very high-dimensional space of transformations and their parameters.
2.6.6 Candidates for unrolling

Having explained the most important features of loop unrolling and described its positive and negative effects, one can form a general idea of which types of loops are candidates for unrolling in order to improve the execution time of programs. However, it is easier to state which loops are not good candidates. In general, loops with a very low trip count, a large body, procedure calls or branches inside them are not very suitable for unrolling [Nielsen, 2004]. Nevertheless, these features are rather ambiguous: it is difficult to specify precisely what a low trip count or a large loop body means, or how many procedure calls and branches matter. Thus, as expressed above, loop unrolling remains an area of great interest for the compiler community.
2.7 Summary

This chapter has presented the most important issues in compilation and compiler optimisation necessary to understand later chapters and the purpose of this dissertation. The organisation of a compiler and the process of applying program transformations have been described, emphasising the principles of correctness and profitability. As it is the focus of the present dissertation, loop unrolling has been studied in detail, explaining why it is an important program transformation, what its advantages and disadvantages are, and why the problem of determining when and how to apply this transformation is a challenge.
Additionally, important terminology about compilers has been introduced throughout this chapter. Expressions such as legality, intermediate representation, front-end, back-end, basic blocks, cache misses and register spilling were explicitly defined. Finally, the problem of interactions among the different optimisation techniques has also been mentioned and described as a hurdle for the compiler optimisation problem.
Chapter 3
Data Collection
3.1 Introduction

One of the most difficult obstacles to applying machine learning to compiler optimisation is
the process of generating clean, reliable and sufficient data. As explained in chapter two,
compiler optimisation is related to the improvement of the ultimate code generated by the
compiler in a specific way. In this case, improvement refers to the reduction in the execution
time of programs. Therefore, for any approach that attempts to apply machine learning
techniques in order to reduce the execution time of programs, the process of creating or
evaluating the data is strongly dependent on this execution time.
In fact, regardless of whether the solution proposed is based on supervised or unsupervised
learning, the data that is utilised to build a specific model must arise in some way from the
execution time of the programs or from the execution time of parts of them. For example, it
was explained in Chapter One that Stephenson et al. (2003) used Evolutionary Computing to
construct priority functions. For this approach, the process of evaluating the candidate
solutions may require a very long time. Indeed, it is necessary to evaluate the whole population of functions on different programs in order to ascertain which of them are beneficial and actually lead to an improvement in performance. Similarly, it was also mentioned in Chapter
One that Stephenson and Amarasinghe (2004) used Nearest Neighbours with the aim of
predicting the best unroll factor for loops. The label for each loop was constructed by finding
the unroll factor that guaranteed its minimal execution time. Each experiment was executed
thirty times due to the variability of the measurements. Although not specified in the original
paper, let us consider the ideal case where the interactions between loops are negligible and
the programs are executed by using the same unroll factor for all the loops on each run.
Using a maximum unroll factor of eight, each program should be executed at least 30 x 8 =
240 times. As in the traditional approach used by compiler writers when building heuristics
for optimisation, machine learning techniques for compiler optimisation should also work on
a set of benchmarks that can be considered representative for specific tasks and challenging
for current computers. Normally, these benchmarks have at least hundreds of lines and can
take in general about 1 or 2 minutes for a normal execution. Hence, if a program must be run
240 times and its execution time is one minute, obtaining the data for the loops within the
program will take about 4 hours. Now, say for simplicity that on average a program contains
about 20 loops that can be considered for unrolling. If one wanted to include 1000 loops in
the training data, it would take about (1000 loops) x (1 program / 20 loops) x (4h) = 200
hours. Obviously, this number can significantly increase with the preparation of the
programs for a particular transformation, the compilation time, the effect of the
instrumentation and the process of analysing and cleaning the data. Consequently, generating
sufficient data for this type of application can be a time-consuming activity.
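The cost arithmetic above can be made explicit with a short sketch. The figures (30 repetitions per configuration, a maximum unroll factor of 8, one minute per run, 20 candidate loops per program, 1000 target loops) are the illustrative assumptions stated in the text, not measurements.

```python
# Estimated cost of generating training data, using the figures
# assumed in the text (illustrative, not measured values).
runs_per_factor = 30     # repetitions to smooth measurement noise
max_unroll = 8           # unroll factors 1..8
minutes_per_run = 1      # assumed typical benchmark execution time

runs_per_program = runs_per_factor * max_unroll               # 240 runs
hours_per_program = runs_per_program * minutes_per_run / 60   # 4.0 hours

loops_per_program = 20   # assumed average number of candidate loops
target_loops = 1000      # desired size of the training set

programs_needed = target_loops / loops_per_program            # 50 programs
total_hours = programs_needed * hours_per_program             # 200.0 hours
print(runs_per_program, hours_per_program, total_hours)
```

Note that this estimate excludes preparation, compilation, instrumentation overhead and data cleaning, so it is a lower bound on the real cost.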
However, there are still questions to be asked about the meaning of sufficient. Are 1000 loops sufficient for learning? Should there be more, or could there be fewer? These are very
difficult questions and, unfortunately, the present dissertation does not attempt to answer
them as they are beyond the scope of this project. Nevertheless, other issues involved in the
data collection process are also important and need to be mentioned. Particularly, the details
about how the experiments were carried out in order to generate the data must be explained.
This not only contributes to a better understanding of the present dissertation but also allows
other researchers to replicate the results obtained here.
As clearly recognised by the data mining community, at least 60% of the time in a data
mining application is devoted to understanding, analysing, pre-processing and cleaning the
data. This project is not an exception and great effort has been invested in generating clean
and appropriate data as well as in performing its analysis and pre-processing. Whilst the
following chapter deals with the data analysis phase, this chapter presents the most relevant
factors that were taken into account in carrying out the experiments and generating the data
available for learning. Initially, the benchmarks that were used to construct the dataset are
described in section 3.2. The implementation used for loop unrolling is explained in section
3.3, providing useful information about the granularity of the instrumentation and the
assumptions involved in this data generation process. The general process that has been
followed is explained in section 3.4. Subsequently, some technical issues regarding the
platform, the compiler and the optimisation level used for the experiments are provided in
section 3.5. The results of the data collection are summarised in section 3.6 and the features
extracted from the loops are described in section 3.7. The final representation of a loop
composed of its features and execution times is presented in section 3.8. Finally, a summary
of the data collection process and the experiments performed is given in section 3.9.
3.2 The Benchmarks

Three important factors may influence the selection of the programs that are used to generate the data for compiler optimisation with machine learning: programming language, type of application and execution time.
Programming language: Several programming languages can be considered for applying
machine learning to compiler optimisation. In fact, other researchers have included programs
written in Java [Long, 2004] and Fortran 77 [Monsifrot et al., 2002], or have used
benchmarks from a mixture of sources such as Fortran 77, Fortran 90 and C [Stephenson and
Amarasinghe, 2004]. Although it is possible to have different programming languages in a
set of benchmarks and to include the language in which the programs are written as an
additional feature of each loop, it is worth focusing only on Fortran 77 for at least two
reasons. Firstly, there is a great deal to be said for optimising programs that are written in
Fortran, as it is considered the scientific programming language and most high-performance
computing applications have been developed in this language. Secondly, Fortran 77 lacks pointers, whose presence can hinder the application of some program transformations. Other issues, such as the portability of the code to different platforms, can also affect the choice of the programming language.
Type of application: Benchmarks are designed to investigate the capabilities of different platforms and architectures on particular tasks. For example, some benchmarks are
demanding on floating point operations, others are challenging on integer operations or
target specific applications such as graphics or digital signal processing. In principle, it
would be valuable if one could consider a wide variety of applications and include them in
the set of programs to analyse. However, as explained above, time limitations generally
place constraints upon the types of benchmarks that may be used. For this project, since we
are not considering a specific target such as network applications or multimedia programs, it
is important to focus on numerical applications that are demanding enough for current
computers. Therefore, the benchmarks should be as realistic as possible, given the final goal
of building a solution that can generalise over real programs.
Execution time: Execution time plays an important role in choosing the benchmarks, given the time constraints one faces when generating the data for learning. There is certainly a trade-off between how challenging a program is for a particular machine and how long it takes to execute on that machine. Programs for which the execution time is very low will
probably not be rich in the features necessary to build a model that can perform properly on novel data. On the other hand, programs rich in features and demanding for a particular machine will probably take a long time to execute, and one could not afford to include them in the dataset. However, caution must be taken with programs that have a long execution time: it may be attributable to the size of their input rather than to the richness of any demanding features they may have. In other words, sometimes it is the input of a benchmark that makes the execution time long, rather than the complexity of the program itself. Therefore, such programs might not be important to analyse either.
An additional consideration for choosing the programs when building the dataset for
compiler optimisation is the bias introduced by selecting specific types of applications. For
example, one could include some very simple computational kernels such as matrix
multiplication for which a transformation like loop unrolling is known to be beneficial to
some extent. It would then be possible to create more examples by varying some of the parameters of the kernels, such as the trip count and the array sizes. Since the final goal of the
learning technique is finding a solution that provides an improvement in performance for the
programs, the results will be biased towards these simple kernels. However, it is also true that numerical applications may include such computational kernels, and a lot can be gained by having them in the training data. Therefore, it is profitable for the learning technique to have these simple kernels along with other, more complex programs.
Having considered the issues that may affect the choice of the benchmarks, the programs
used for the experiments belong to the suites SPEC CFP95 [SPEC95] and VECTORD
[Levine et al., 1991]. The benchmarks taken from [SPEC95] are scientific applications
written in Fortran. These numerical programs, intensive in floating point operations,
represent a variety of real applications and are still challenging for current computers. One of
these benchmarks, namely 110.applu, was significantly affected by the instrumentation, and another benchmark, 145.fpppp, did not have appropriate loops to be unrolled. Therefore, they were discarded, and only eight of the ten possible programs from this suite have
been used. The suite VECTORD [Levine et al., 1991] contains a variety of subroutines
written in Fortran intended to test the analysis capabilities of a vectorising compiler. These
subroutines include different types of loops whose features may be encountered in other
applications.
The description of each benchmark, specifying the name, the number of lines, the number of subroutines, the application area and the specific task performed, is shown in Table 3.1 (for the SPEC CFP95 benchmarks this information was obtained from the official web site [SPEC95]).

Benchmark   | # Lines / # Subroutines | Application Area                       | Specific Task
101.tomcatv | 190/1    | Fluid Dynamics / Geometric Translation | Generation of a two-dimensional boundary-fitted coordinate system around general geometric domains
102.swim    | 429/6    | Weather Prediction                     | Solves shallow water equations using finite difference approximations
103.su2cor  | 2332/35  | Quantum Physics                        | Masses of elementary particles are computed in the Quark-Gluon theory
104.hydro2d | 4292/42  | Astrophysics                           | Hydrodynamical Navier-Stokes equations are used to compute galactic jets
107.mgrid   | 484/12   | Electromagnetism                       | Calculation of a 3D potential field
125.turb3d  | 2101/23  | Simulation                             | Simulates turbulence in a cubic area
141.apsi    | 7361/96  | Weather Prediction                     | Calculates statistics on temperature and pollutants in a grid
146.wave5   | 7764/105 | Electromagnetics                       | Solves Maxwell's equations on a Cartesian mesh
vector      | 5302/135 | Variety of vectorial routines          | Tests the analysis capabilities of a vectorising compiler

Table 3.1: Description of the Benchmarks
3.3 Implementation of loop unrolling

As explained in section 2.6.2, loop unrolling can be implemented at the source code level, on an intermediate representation of the program or at the back-end of the compiler. For the experiments that generated the data in this project, a software framework that works on top of the compiler has been used; in other words, loop unrolling has been implemented at the source code level. This framework, developed in [Fursin, 2004], is mainly written in Java and provides a platform-independent tool to assist the user in compiler optimisation. The software, based on "feed-back directed program restructuring" ([Fursin, 2004], page 63), searches for the best possible code transformations and their parameters in order to minimise the execution time of programs.
The unrolling algorithm used is a generalised version of the algorithm described in section
2.6.1 and it is shown in Figure 3.1 (taken from [Fursin, 2004]). In this generalised version,
the loop body is replicated u times in the first loop and an additional loop is introduced to
control the leftover operations.
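The generalised transformation can also be illustrated with a small executable sketch (written here in Python rather than Fortran, with a hypothetical callable S standing for the loop body): for any unroll factor, the replicated main loop plus the additional clean-up loop must visit exactly the same iterations as the original loop.

```python
def original(n, S):
    # original loop: apply the body S for i = 1..n
    for i in range(1, n + 1):
        S(i)

def unrolled(n, S, u):
    # main loop: body replicated u times, stepping by the unroll factor
    i = 1
    while i + u - 1 <= n:
        for k in range(u):
            S(i + k)
        i += u
    # additional clean-up loop: leftover iterations i..n
    for j in range(i, n + 1):
        S(j)

# check equivalence for an arbitrary trip count and unroll factor
a, b = [], []
original(10, a.append)
unrolled(10, b.append, 3)
assert a == b  # same iterations, in the same order
```

The clean-up loop is what makes the transformation legal when the trip count is not a multiple of the unroll factor.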
3.3.1 Which loops should be unrolled?

As in other approaches ([Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004]), only innermost loops were chosen to be unrolled. Although this may be the most common choice throughout the literature, one should bear in mind that there are some cases where unrolling outer loops can be beneficial, as explained in [Nielsen, 2004]. However, in order to keep the complexity of the transformer low and to guarantee the legality of the transformations, outer loops were not considered. This restriction is not severe; many innermost loops can be significantly improved by unrolling.
3.3.2 Initial experiments

Due to the characteristics of the transformation framework used to perform unrolling, the initial experiments executed to generate the data were based on program-level timing. Firstly, innermost loops are chosen from a particular program and the maximum unroll factor is set to eight (U = 8). The framework runs a program U times, corresponding to the different unroll factors for a loop, recording the execution time of the whole program for each run. Subsequently, the best unroll factor found for that loop is fixed and the software executes the program for each unrolled version of the following loop. This process is repeated until all the
loops are unrolled. In this case, it is said that the framework follows a systematic search
strategy. This strategy works under the assumption that there is no interaction between loops.
In other words, it means that fixing an unroll factor for a specific loop will not severely
affect the performance of the execution of another loop. Furthermore, there is one advantage
of having program-level profiling for measuring the execution times: the intrusion caused by
the instrumentation is negligible. Indeed, since only the execution time for the whole
program is measured, the performance of each loop is minimally affected by this
instrumentation. However, there is a major drawback to carrying out the experiments in this
way: the time needed to obtain the data for a specific program grows with the maximum
unroll factor used and with the number of loops considered for unrolling. Thus, the number
of times a program must be executed in order to be analysed is U x L, where U is the
maximum unroll factor and L is the number of loops to be analysed. If, for example, a
program contains forty loops and the maximum unroll factor used is eight, the program will
need to be executed 40 x 8 = 320 times. If one wished to include a considerable number of programs, the process of generating the data would be extremely time-consuming. Despite this fact, the initial experiments followed this approach. Unfortunately, an additional problem was discovered after obtaining the data for all the benchmarks: for most of the programs, the improvement found by the search strategy was only comparable to the variability of their execution times. In signal-processing terms, the signal of improvement was swamped by the noise. This fact ruled out any criterion for selecting the loops and including them in the dataset. Therefore, it was necessary to move to a loop-level granularity.
3.3.3 Loop level profiling

Unlike program-level profiling, which measures the execution time of the whole program, loop-level profiling measures the execution time of each loop within the program. In the case of determining the execution time after unrolling, timers are inserted around each loop and the program is run U times (once for each unroll factor), setting the same unroll factor for all the loops during each run. As in the case of program-level profiling, there is an assumption of independence in this process: there are no runs that involve different unroll factors for different loops. Indeed, analysing all the possible combinations of unroll factors for all the loops within a program would not be feasible. Additionally, there is a severe reduction in the number of times a program must be executed: in this case the number of runs is equal to the maximum unroll factor used. This dramatically reduces the time invested in obtaining
the data, given that it does not depend on the number of loops a program may have. There is, however, a great shortcoming in this process: the effect of the instrumentation. Since timing functions, with their own variables and instructions, are inserted around the loops in order to determine their execution time, the performance of a loop may be affected by this instrumentation. For example, some loop instructions cannot be kept in the cache because instructions belonging to the instrumentation code already occupy that space. The effect of the instrumentation is especially noticeable for loops that are called many times within the program, although it is compensated for when the loops take a considerable amount of time to execute. Therefore, caution must be taken when selecting the loops to be included in the dataset, avoiding those loops for which the execution time and/or trip count is very low.
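The difference in cost between the two profiling granularities is easy to quantify: program-level profiling needs U x L runs, whereas loop-level profiling needs only U, independently of the number of loops. The functions below merely restate this counting argument.

```python
def runs_program_level(max_unroll, num_loops):
    # systematic search: one sweep of unroll factors per loop
    return max_unroll * num_loops

def runs_loop_level(max_unroll, num_loops):
    # one run per unroll factor; all loops are timed simultaneously,
    # so the count is independent of num_loops
    return max_unroll

# the example from the text: 40 loops, maximum unroll factor 8
print(runs_program_level(8, 40))  # 320
print(runs_loop_level(8, 40))     # 8
```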
Original loop:

do i = 1, n
  S1[i]
  S2[i]
  ...
end do

Unrolled loop (unroll factor = u):

! loop body replicated u times
do i = 1, n, u
  S1[i]
  S2[i]
  ...
  S1[i+1]
  S2[i+1]
  ...
  S1[i+u-1]
  S2[i+u-1]
  ...
end do

! processing the remaining elements
do j = i, n
  S1[j]
  S2[j]
  ...
end do

Figure 3.1: Generalised version of loop unrolling (taken from [Fursin, 2004])
3.4 Generating the targets

As explained above, the process of generating the targets, i.e. the execution times, followed a loop-level granularity. The process starts by preparing the benchmarks for the transformation tool and ends after running each program with the different unroll factors. The steps involved in this process are briefly explained as follows.
3.4.1 Preparing the benchmarks
Some of the benchmarks may not be appropriately handled by the transformation tool.
Specifically, there may be some loop constructions that are problematic for the transformer.
They must be converted into a form that the framework can properly manage.
3.4.2 Selecting loops

Having prepared a specific program for the framework, the next step is selecting the loops that are believed to be appropriate for unrolling. In general, to avoid introducing bias into the data, the only restriction is that they must be innermost loops. Additionally, loops with calls to subroutines were not considered, given the difficulty of determining the actual effect of unrolling when other loops may be involved within the called subroutine.
3.4.3 Profiling

The loops must be profiled in order to calculate their execution time and to identify which loops are insignificant. Those loops for which the execution time is very low are considered insignificant because no real improvement can be measured for them.
3.4.4 Filtering

After the profiling step, it is possible to determine which loops should not be included in the dataset. As mentioned above, the instrumentation causes an intrusive effect that may severely affect those loops with a low trip count or low execution time. In general, if the execution time of a loop is less than a threshold T, the loop is discarded.
3.4.5 Running the search strategy

This step implies running the benchmarks for each unroll factor. There must be a maximum unroll factor U common to all the loops and benchmarks.

The process of generating the targets followed the steps explained above, with a filtering threshold T = 0.4 seconds and a maximum unroll factor U = 8. Additionally, each benchmark (for all unroll factors) was executed ten times, henceforth referred to as the number of runs R = 10, in order to account for the variability of the execution times.
In order to completely describe the data collection process it is necessary to mention some
technical issues regarding the hardware and software resources that were utilised to execute
the experiments.
3.5 Technical Details
3.5.1 The platform

A dual Intel(R) XEON(TM) machine running at 2.00 GHz with 512 KB of level-2 cache and 4 GB of RAM has been used for the experiments. The operating system installed on this machine is Red Hat Linux (kernel 2.4.20-24.9).
3.5.2 The compiler

The GNU Fortran compiler [G77], gcc-3.2.2-5, has been used. As remarked in section 2.6.5, the interactions between loop unrolling and other transformations constitute a potential problem. Hence, the optimisation level chosen for the experiments was -O2 with no additional flags. Unrolling at the compiler level was switched off by avoiding the options -funroll-loops and -funroll-all-loops; otherwise the compiler could apply unrolling to the already unrolled code.
3.5.3 The timer's precision

The transformation framework utilises the C function clock() in order to compute the execution time at the loop level. This function was found to have a precision of 0.01 seconds on the machine used for the experiments; that is, the minimum elapsed time this function can detect is 0.01 seconds. In other words, it would not be
possible to detect a difference between two runs smaller than this precision. Considering that the threshold used for filtering the loops is 40 times the precision on this particular machine, the timer's precision does not represent a problem for this project.
3.6 The results in summary

The results of the data collection process are shown in Table 3.2. In order to facilitate its execution, the suite VECTORD [Levine et al., 1991] has been divided into four different programs, given the independence among its subroutines. The execution times for the original code and for the instrumented code (with timers) are given, as well as the contribution of each benchmark to the dataset in terms of the number of loops. An important fact to notice from Table 3.2 is that for most of the benchmarks the execution time is not severely affected by the instrumentation of the code. In fact, only three benchmarks, namely 107.mgrid, 125.turb3d and 141.apsi, experienced a notable increase in their execution time. The benchmark most influenced by the instrumentation was 125.turb3d, due to the great number of times some of its loops are called within the program. However, unlike 110.applu, which was discarded, 125.turb3d was retained, as it could affordably be executed several times and most of its loops have an acceptable execution time.
3.7 Feature extraction

It has been explained so far how the execution times of the selected loops have been generated for different unroll factors and multiple runs. Explicitly, the results of the process described above correspond to the execution time of each loop after unrolling using u = 1...U. Each execution was repeated R times in order to account for its variability for a specific unroll factor. Hence, the data collection process generated L x U x R execution times, where L is the number of loops, U is the maximum unroll factor considered and R is the number of repetitions of each run. Bearing in mind that the approach followed in this project is to create a regression model able to learn the improvement in performance of the execution time of loops, these magnitudes have been called the targets. Therefore, a regression model must be able to learn a function7 for which the output is a value based on the targets for a specific loop and

7 Actually, as will be explained in chapter 5, for each unroll factor a different model will be built, i.e. there will be U functions, each corresponding to a different unroll factor.
for which the input is a characterisation of this loop. This section describes the features
extracted from the programs that were used to characterise the loops.
Benchmark   | Original (sec.) | Instrumented (sec.) | # Loops
101.tomcatv | 40.8  | 45.2  | 5
102.swim    | 39.9  | 43.0  | 3
103.su2cor  | 51.3  | 53.5  | 15
104.hydro2d | 61.4  | 61.6  | 35
107.mgrid   | 44.5  | 102.5 | 15
125.turb3d  | 73.3  | 594.7 | 12
141.apsi    | 54.6  | 161.2 | 23
146.wave5   | 42.5  | 59.4  | 25
Vectord_1   | 146.3 | 146.4 | 51
Vectord_2   | 148.3 | 148.7 | 49
Vectord_3   | 163.9 | 167.9 | 13
Vectord_4   | 160.0 | 160.2 | 2
Total number of loops: 248

Table 3.2: Results of the data collection process
One of the most important issues in a data mining application is selecting the right features for learning. In fact, the characterisation of a problem should be carried out by selecting those features that are believed to influence the targets to be learnt. As explained in Chapter Two, there are many factors that may influence loop unrolling, such as hardware components of the target architecture, other code transformations applied after unrolling and characteristics of the program itself. It is not unrealistic to characterise a loop for unrolling based on its static features, i.e. those that can be determined at compilation time. However, it must be emphasised that dynamic features, i.e. those that are determined at execution time, are also important and may not be captured by the static representation; one example is the number of cache misses. The characterisation of loops in this project is mainly based on
static features and only two of them, namely the trip count and the total number of times the
loop is called, were determined during the execution of the programs. This loop abstraction
is mostly based on the description presented by [Monsifrot and Bodin, 2001] and [Monsifrot
et al., 2002]. The loop characterisation presented by [Stephenson and Amarasinghe, 2004] is
not applicable to the present approach given the differences in the implementation of loop
unrolling.
The features extracted to characterise the loops are shown in Table 3.3. Feature 1 (called) is
the total number of times the loop is called within the program. Therefore, it represents the
number of times the outer loops are executed and the subroutine containing the loop is
called. The size (feature 2) of the loop refers to the number of statements within the loop. A
statement may contain one or more lines. The trip count (feature 3) of the loop determines
the number of times the body of the loop is executed. Given that for most of the loops this
feature is unknown at compilation time, it was determined during the execution of the
programs. For some loops it was found that the trip count was variable depending on some
parameters of a subroutine. In these cases a weighted average was calculated. Feature 4
considers the number of calls to proper functions of the language. Feature 5 (Branches)
refers to the number of if statements within the loop. Feature 6 is the nested level of the loop.
Features 7 and 8 represent the number of array accesses within the loop depending upon
whether an array element is loaded or stored. A less straightforward feature is the number of
array element reuses (feature 9). It attempts to measure dependency among different
iterations. Although dependency analysis is a very complicated topic in compilers, a simple
approach has been taken. The number of reuses of a particular array is computed as the total
number of elements involved in its update when it is controlled by the iteration variable. An
example is given in Figure 3.2 where the number of reuses is three. Finally, feature 10
(Floating) considers the number of floating-point operations and feature 11 (indAcc)
represents the number of indirect array accesses. An indirect array access occurs when an
array is used as an index of another array.
do i = 1, N
  a[i] = a[i] + a[i-1] * a[i+1]
end do
Figure 3.2: An example of a loop containing three array element reuses
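This counting rule can be approximated mechanically. The sketch below is a simplification for illustration, not the actual implementation used by the feature extractor: it counts the distinct elements of the updated array that are indexed by the loop variable (possibly with a constant offset).

```python
import re

def count_reuses(statement, array, loop_var):
    # count the distinct elements of `array`, indexed by the loop
    # variable plus an optional constant offset (e.g. a[i-1]), that
    # take part in the array's own update
    pattern = rf"{array}\[({loop_var}(?:[+-]\d+)?)\]"
    return len(set(re.findall(pattern, statement)))

# the example of Figure 3.2: a[i], a[i-1] and a[i+1] are involved
print(count_reuses("a[i] = a[i] + a[i-1] * a[i+1]", "a", "i"))  # 3
```

A full dependency analysis would be far more involved; this simple count is in the spirit of the approximation described in the text.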
Index Name Description
1 Called The number of times the loop is called
2 Size The number of statements within the loop
3 Trip The trip count of the loop
4 Sys The number of calls to proper functions of the language
5 Branches The number of if statements
6 Nested The nested level of the loop
7 Loads The number of loads
8 Stores The number of stores
9 Reuses The number of array element reuses
10 Floating The number of floating point operations
11 IndAcc The number of indirect array accesses
Table 3.3: Features extracted to characterise loops
3.8 The representation of a loop

The features described above, together with the execution times over R repetitions for each unroll factor, compose the dataset constructed by the data collection process. Figure 3.3 shows the representation of a datapoint, where the specific unroll factor used has been omitted for simplicity. In other words, for each unroll factor a loop has a representation of the form shown in Figure 3.3.

x1 x2 ... xN | t1 t2 ... tR
(features)     (execution times over R repetitions)

Figure 3.3: The representation of a datapoint
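A datapoint of this form can be sketched as a simple structure: for each unroll factor, a loop contributes a feature vector of length N together with its R repeated timings. The feature values and timings below are purely illustrative.

```python
# One datapoint per (loop, unroll factor): N features plus R timings.
N_FEATURES = 11   # the features of Table 3.3
R_RUNS = 10       # repetitions per configuration

def make_datapoint(features, timings):
    assert len(features) == N_FEATURES and len(timings) == R_RUNS
    return {"x": list(features), "t": list(timings)}

# a hypothetical loop: feature vector and ten noisy timings (seconds)
x = [120, 8, 400, 0, 1, 2, 5, 2, 3, 6, 0]
t = [1.52, 1.50, 1.53, 1.49, 1.51, 1.52, 1.50, 1.54, 1.51, 1.50]
dp = make_datapoint(x, t)
print(len(dp["x"]), len(dp["t"]))  # 11 10
```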
3.9 Summary

This chapter has described the data collection process and has remarked upon the difficulties that appear during the generation of data when applying machine learning to compiler optimisation. In general, constructing a dataset for this purpose can be a time-demanding activity, and great effort has been invested in this project in order to generate clean and reliable data. Hence, competitive benchmarks have been chosen in order to keep the bias towards simple applications as small as possible. Loop unrolling has been applied to innermost loops within these benchmarks using a framework developed in [Fursin, 2004]. Only loops with sufficient execution time have been taken into account, in order to reduce the effect of the instrumentation. The general process utilised for generating the targets (execution times for loops) consists of preparing the benchmark, selecting loops, profiling and filtering the loops, and running the programs using different unroll factors. Unroll factors from 1 to 8 were used and each run was repeated 10 times in order to account for variability. After determining which loops should be included, a feature extraction process was performed over each loop using a set of 11 features selected mainly on the basis of static characteristics of the loops. Finally, the representation of a datapoint (loop), composed of features and execution times, has been summarised.
Chapter 4
Data Preparation and Exploratory Data Analysis
4.1 Introduction
The last chapter described the data collection process and provided technical and
methodological details about how the experiments were carried out in this project. These
experiments have generated the data that in principle will be available for learning. As has
been highlighted, the construction of a dataset that can be used by learning techniques with
the aim of optimising the execution time of programs is a time-demanding activity.
Consequently, great effort has been invested in producing a considerable amount of data that
can be reliably used by machine learning techniques. This effort has mainly been focused on
determining the criteria used for choosing the benchmarks; the preparation of these
benchmarks; the selection of appropriate loops to be unrolled; the generation of the
execution times (also called the targets) and the selection of correct features that describe the
loops and constitute the characterisation of the problem. Unfortunately, the raw data
produced by the data collection process is not suitable to be directly used by any machine
learning technique and it needs to be pre-processed and refined. There are several reasons for
this. Firstly, it must be recalled that the ultimate goal of loop unrolling is optimising a
program by determining whether a particular unroll factor is beneficial or not, i.e. if
unrolling a loop a specific number of times represents a significant improvement with
respect to the case of maintaining the original (rolled) version of the loop. Hence, it is
necessary to apply a validation process to the data in order to establish if there is an actual
improvement in performance to be learnt. In other words, if unrolling does not yield a
potential reduction in the execution times of the loops that are included in the dataset, the
process of modelling the data and trying to learn from it is useless. The second reason to
avoid using the unprocessed data that has been collected is that it can be very difficult to
learn from the pure execution times. Indeed, since the interest of this project is mainly
focused on predicting the effectiveness of loop unrolling, the execution times do not directly
reflect the improvement that needs to be learnt. Therefore, these execution times must be
transformed into another more convenient representation that explicitly indicates how good
or bad the transformation is for a particular loop. Furthermore, some loops that were not
removed after being profiled may still need to be discarded from the dataset, as their mean
execution time is low. Similarly, some measurements deviate strangely from the distribution
of the execution times for a particular loop under a specific unroll factor. These outliers
should also be removed from the dataset as they may deteriorate the learning process. The
final reason for which pre-processing the data is essential relates to the transformation of the
features that constitute the representation of the loops. In fact, this representation must be
rescaled in order to prevent some features being considered more important than others.
This chapter tackles the problems mentioned above and provides an insight into the data that
will be used for modelling. The rest of this chapter is organised as follows. Section 4.2
explains how the data is processed throughout different stages (by using several software
resources) and is made available for analysis and learning. Section 4.3 provides a formal
representation of the data and introduces the notation that will be used in subsequent
sections. Section 4.4 analyses the data and validates its suitability in order to be used by
machine learning techniques. Section 4.5 presents the importance of pre-processing the
execution times, giving details about the elimination of some loops from the dataset, the
transformation of the targets and the detection and treatment of outliers. Section 4.6 explains
how the features are also pre-processed in order to facilitate the application of learning
techniques. Finally, section 4.7 summarises the most important aspects of this chapter.
4.2 The general framework for data integration
It was explained in chapter three how the benchmarks selected for this project were used to
generate the execution times at a loop-level profile with the aid of a software tool that works
on top of the compiler. From a more general perspective, the generation of the targets and
the selection of the features for loops are only two steps during the whole process of making
the data available for analysis and learning. In fact, the framework used for applying loop
unrolling and obtaining the execution times, which will be referenced in this section as EOS
[Fursin, 2004], produces a set of files that must be parsed and from which the targets must be
extracted. In order to automate this task, a program written in Java, referenced in this section
as Java Parser, was developed. This program reads the files produced by EOS and transforms
the results into easy-to-load text files. The files produced by the Java Parser are loadable by
Matlab® subroutines that integrate them with the features extracted from the programs.
Furthermore, these subroutines make possible the analysis, exploration and pre-processing of
the data and communicate with modelling subroutines also written in Matlab®. This process
is shown in Figure 4.1.
4.3 Formal representation of the data
The data that has been generated by repeatedly executing the programs using the unrolled
version of the loops can be denoted as t(u)j,k, representing the execution time of the jth run
(j = 1, 2, …, R) of loop k (k = 1, 2, …, L), which has been unrolled u times (u = 1, 2, …, U).
Here, the notation used in a great part of the literature about loop unrolling has been adopted,
where an unroll factor of one (u = 1) corresponds to the original version of the loop, i.e. with
no unrolling at all.
Similarly, the features explained in section 3.7 describe a particular loop. Hence, the
characterisation of loop k can be denoted as a vector xk for which the elements xi,k
(i = 1, 2, …, N) represent the features that compose the loop.
Figure 4.1: The general framework for data integration (benchmarks are processed by EOS to produce execution times, which the Java Parser converts into raw data; Matlab subroutines integrate this raw data with the features and perform the pre-processing and modelling that yield the results)
4.4 Is this data valid?
The first question to be answered before starting to pre-process the data that has been
collected is whether this data can actually be useful and suitable for applying learning
techniques. Therefore, the aim of this section is to ascertain the validity and suitability of the
data by finding out if the improvement in performance of the loops for unroll factors
different from one (u=1) represents an actual reduction of the execution times and is not only
a consequence of the variability of the data. In section 3.3.2 it was explained that the data
resulting from the initial experiments using program-level profiling was discarded. In fact,
the improvement obtained by loop unrolling in most of the benchmarks was only comparable
to the variability of the execution times under no unrolling. Therefore, there were no criteria
to select the loops that should be included in the dataset. The problem to be explained in this
section is essentially the same but the aim now is to determine if the final data obtained
represents a potential improvement that could be predicted by machine learning techniques.
In other words, considering that the final goal in compiler optimisation is related to the
realisation of speed-ups on a set of programs, the focus here is on the maximum possible
improvement that may be reached by any learning technique. Although the detrimental effect
of loop unrolling is also important to this research, the data will prove inappropriate should
no loops be found to be improved by unrolling.
4.4.1 Statistical analysis
To enable the assessment of the suitability of the data, the measurements have been repeated
ten times in this project. A sensible approach to establishing the appropriateness of the data
aims to validate the significance of the improvement with the aid of statistics. The objective
is to determine how many loops in the dataset are improved by unrolling and when this
improvement is statistically significant. Using the notation introduced in section 4.3, a
particular loop k has associated R execution times (due to repetitions) for each unroll factor
u. Therefore, it is reasonable to compare the R measurements for each unroll factor with the
execution times for the rolled (original) version of the loop. Although it may be possible to
apply a t-test to validate the hypothesis of different means for u = 1 and u > 1, that is, to
test H0: t̄(1)k = t̄(u)k for each u > 1, it would be necessary to apply U − 1 tests, which can
considerably increase the probability of drawing at least one incorrect conclusion. Therefore,
a one-way analysis of variance (one-way ANOVA) followed by a multiple comparison
(multi-comparison) procedure has been applied. Additionally, only improvements in performance
but not detriments have been considered.
In general, the purpose of ANOVA is to test if the means of several groups are significantly
different. In our case, the groups are the unroll factors considered. Thus, ANOVA tests the
null hypothesis that the means do not differ by using the p-values of the F-test. Essentially,
the F-test measures the between-groups variance compared to the within-groups variance.
Therefore, if the variance between groups is considerably greater than the variance within
groups, there are serious doubts about the null hypothesis. Since we are interested in
establishing if there is a significant difference between the mean execution times of the
original loops and the unrolled versions of the loops, a multi-comparison procedure is
needed. Multi-comparison procedures also determine when this difference is positive (i.e. an
improvement in performance) or negative (i.e. a detriment in performance). All the tests
were executed at the 5% level of significance and Tukey's honestly significant difference
criterion was used as the critical value for the multi-comparison procedure (for a complete
description of ANOVA and multiple comparison procedures see [Neter et al., 1996] pages
663-701 and pages 725-738).
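Although the project performed this analysis in Matlab, the core of the procedure can be illustrated with a minimal pure-Python sketch of the one-way ANOVA F-statistic (the Tukey multi-comparison step, which requires studentized-range critical values, is omitted here):

```python
from statistics import mean

def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group variance divided by
    within-group variance, for a list of groups of measurements
    (here, one group of execution times per unroll factor)."""
    k = len(groups)                            # number of groups
    n = sum(len(g) for g in groups)            # total measurements
    grand = mean(x for g in groups for x in g)
    # Between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares (n - k degrees of freedom)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two unroll factors with clearly separated execution times: the variance
# between groups dwarfs the variance within them, so F is very large and
# the null hypothesis of equal means is in serious doubt.
F = one_way_anova_F([[1.00, 1.01, 0.99], [0.80, 0.81, 0.79]])
assert F > 100
```

In practice F is compared against the critical value of the F distribution at the chosen significance level (5% in this project) rather than a fixed constant.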
In order to provide a better understanding of whether unrolling is considered significantly
beneficial or not regarding the procedure explained above, Figure 4.2 shows the box plots of
the execution times for two different loops. The lower and upper lines of the boxes represent
the 25th and the 75th percentile of the execution times for each unroll factor. The horizontal
lines in the middle of the boxes are the medians. The dashed lines (also called whiskers)
represent the variability of the execution times throughout R=10 repetitions for each unroll
factor, where the top is the maximum execution time and the bottom is the minimum. Thus,
the upper part of Figure 4.2 shows the case of a loop for which unrolling does not correspond
to a significant improvement in performance. In fact, although the execution times for u=3,
u=5 and u=6 are less than the minimal for u=1, they are not considered significant due to the
variability of the measurements. The opposite case is shown in the lower part of Figure 4.2.
For this loop, the variability of the execution times is low and the improvement in
performance due to loop unrolling, e.g. for u=4 is significant. It is clear for this case that loop
unrolling is beneficial as there is no overlapping between the execution times for u=1 and for
u=4.
The procedure explained above was applied to all the loops in the dataset and the results are
summarised in Table 4.1. The contribution to the dataset of each benchmark in terms of the
total number of loops and in terms of the number of loops for which unrolling causes a
statistically significant improvement of performance is shown. It can be seen in Table 4.1
that the number of loops that can be improved by unrolling in the SPEC CFP95 [SPEC95]
suite is considerably less than in the VECTORD [Levine et al., 1991] benchmarks.
Furthermore, 42% of the loops may be significantly improved by loop unrolling in the whole
set of benchmarks. It is necessary to emphasise that the loops included in Table 4.1 can also
be negatively affected by unrolling because some unroll factors may be detrimental to the
performance of a specific loop. However, since we are analysing the case of how many loops
in the dataset can be improved by loop unrolling, it is possible to conclude that the reduction
in the execution time of the loops included in the dataset is not only caused by the variability
of the measurements but also by the effect of the transformation. Thus, the data that has been
collected can be used by machine learning techniques in order to model its improvement in
performance.
Until now, it has been considered whether there is a positive impact of loop unrolling on the
set of benchmarks. However, it may be of interest to formulate two additional questions
regarding the behaviour of the execution times that have been collected for this project.
Primarily, is there any negative impact of the transformation on these benchmarks?
Secondly, how much can unrolling affect the execution time of these loops? To answer these
questions, it is necessary to explain the pre-processing stage of the execution times, i.e. how
the targets have been transformed in order to facilitate their analysis and to appropriately
apply the modelling techniques.
4.5 Pre-processing the targets
As indicated by [Han and Kamber, 2001], pre-processing the data can significantly improve
and ease the application of learning techniques. In this case, we are interested in eliminating
some execution times that should not be considered because of their low magnitude;
transforming these execution times into other values that may be more appropriate for
learning; and detecting some strange data-points (outliers) that are somehow anomalous and
do not follow the general behaviour of similar data-points.
Figure 4.2: Insignificant (top) and significant (bottom) effect of loop unrolling
Benchmark   | # Loops | # Loops with improvement for u > 1
101.tomcatv | 5       | 2
102.swim    | 3       | 1
103.su2cor  | 15      | 5
104.hydro2d | 35      | 3
107.mgrid   | 15      | 1
125.turb3d  | 12      | 4
141.apsi    | 23      | 6
146.wave5   | 25      | 3
Vectord1    | 51      | 38
Vectord2    | 49      | 39
Vectord3    | 13      | 1
Vectord4    | 2       | 2
Total       | 248     | 105
Table 4.1: Number of loops that may benefit from unrolling
4.5.1 Filtering
In section 3.4.4, profiling permitted the recognition of loops with a low execution time and
their removal from the set of loops that were considered for unrolling. However, after the
data collection process, some loops with very low execution time remained, having gone
undetected during that phase. Since all the repetitions are now available at this stage, it is
possible to determine the behaviour of the execution time for the original (rolled) version of
a specific loop. Therefore, a similar criterion may be established to remove loops with low
execution time from the dataset.
Let us consider the R repetitions of the execution time for the original version (with no
unrolling) of loop k, t(1)j,k. A straightforward criterion to eliminate or maintain this loop can be
ascertained by deciding whether its mean is greater than a threshold T or not. Thus, the
mean can be computed by:

    t̄(1)k = (1/R) · Σ(j=1..R) t(1)j,k    (4.1)
Therefore, the criterion to remove or maintain a loop is:

    discard loop k if t̄(1)k < T; maintain it otherwise
Setting T to 0.4 seconds, all the loops were considered for this filtering. Table 3.2 and Table
4.1 already reflect these results.
Finally, some loops were found to have zero execution time for u≠1, probably due to an
inappropriate unrolling applied by the transformation tool. Therefore, they were also
removed from the dataset.
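Both checks described above can be sketched in a few lines of Python (the project used Matlab; `keep_loop` and its signature are illustrative):

```python
from statistics import mean

T = 0.4  # threshold in seconds, as used in the dissertation

def keep_loop(rolled_times, unrolled_times):
    """Filtering criterion: discard the loop if the mean execution time of its
    original (u = 1) version falls below T, or if any unrolled version reports
    a zero time (indicating an inappropriate transformation)."""
    if mean(rolled_times) < T:
        return False
    return all(t > 0 for times in unrolled_times for t in times)

assert keep_loop([0.5, 0.52, 0.48], [[0.45, 0.46]]) is True   # mean 0.5 >= T
assert keep_loop([0.1, 0.12, 0.11], [[0.09, 0.10]]) is False  # mean below T
assert keep_loop([0.5, 0.5, 0.5], [[0.0, 0.4]]) is False      # zero time found
```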
4.5.2 Dealing with outliers
An outlier is a data-point that is somehow anomalous and does not follow the general
behaviour of similar data-points. Detecting and dealing with outliers is a challenge and may
be dependent on the characteristics of the application. For example, the most common
approach is eliminating the outliers from the dataset, although they may be of interest in
other applications such as fraud detection. Strictly speaking, an outlier refers to a point in the
dataset, which in the case of this project represents a loop. However, the treatment of
outliers in this project does not aim to identify and remove loops from the dataset, but to
identify and discard those repetitions of the execution times for a particular loop that
considerably deviate from the general tendency.
Although visualisation techniques may help with the detection of outliers, a statistical and
automated approach has been followed. Let us consider loop xk and its R execution times for
a specific unroll factor u: t(u)j,k, with j = 1, 2, …, R. An execution time for this particular loop
is considered an outlier if its value is greater than the 75th percentile plus 1.5 times the
inter-quartile range. Similarly, an execution time less than the 25th percentile minus 1.5 times the
inter-quartile range is also considered an outlier. In other words, an execution time that is
notably greater or smaller than the median is identified as an outlier. These outliers may be a
consequence of the effect of the instrumentation or a result of other factors such as the
activation of low-priority processes in the operating system. Figure 4.3 shows a loop that
contains two outliers. For the unroll factor u=6 an outlier is discovered (indicated by the sign
'+'), as it is considered to be significantly greater than the median execution time for that
unroll factor. Similarly, for u=7 a low execution time that is significantly less than the
median is also considered an outlier.
Approximately 26% of the loops for each unroll factor were determined to have outliers.
Finally, 4.3% of the execution times were detected as outliers and they were removed from
the dataset.
4.5.3 Target transformation
As will be explained in Chapter Five, the ultimate goal of applying a learning technique to
the dataset constructed is predicting whether unrolling is a beneficial transformation for a
particular loop or not and how much it can affect the execution time of the loop. Therefore, it
stands to reason that if the raw execution times are directly used for modelling, it will be
Figure 4.3: An example of outliers
difficult to learn the parameters involved in loop unrolling and successfully predict its effect
on a particular loop. In fact, the absolute execution times for a loop considered throughout all
unroll factors do not explicitly reflect how good or bad the transformation is for a specific
loop. Therefore, these execution times need to be transformed into a set of values that
directly represent their relative performance with respect to the execution times under no
unrolling.
Thus, let us consider the term t̄(1)k as a statistic that correctly represents the execution times
for loop k under no unrolling (u = 1) over all the runs, i.e. the mean given by equation 4.1 or,
if preferable, the median. Therefore, the percentage of improvement with respect to this
statistic can be obtained as follows:

    y(u)j,k = −100 · (t(u)j,k − t̄(1)k) / t̄(1)k    (4.2)
where j = 1, 2, …, R and u = 1, 2, …, U. Each of these magnitudes can be negative or positive,
depending on whether the execution time for a particular loop under a specific unroll
factor u and run j is greater or less than its corresponding statistic for no unrolling. The
negative sign in equation 4.2 has been used with the purpose of establishing a convention
that will be adopted throughout this document. A positive value for these magnitudes will
indicate an improvement in performance and a negative value will reflect a detrimental
effect. Hence, this equation not only provides a way to ascertain when unrolling a specific
loop is beneficial and how large this benefit is, but also when the effect of unrolling is
detrimental and how disadvantageous this transformation can be. Both cases are
of special interest for the compiler community.
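Equation 4.2 translates directly into code; a minimal Python sketch (using the mean of the rolled times as the reference statistic, as in equation 4.1):

```python
from statistics import mean

def improvement(t_ujk, rolled_times):
    """Equation 4.2: percentage improvement of one measurement t(u)j,k
    relative to the mean rolled (u = 1) execution time.
    Positive values indicate a speed-up, negative values a detriment."""
    t1_bar = mean(rolled_times)
    return -100.0 * (t_ujk - t1_bar) / t1_bar

rolled = [10.0, 11.0, 9.0]                 # u = 1 measurements, mean 10.0
assert improvement(8.0, rolled) == 20.0    # 20% faster: improvement
assert improvement(12.0, rolled) == -20.0  # 20% slower: detriment
```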
With the transformation denoted by equation 4.2 and applied to all the execution times
available from the benchmarks, it is possible to provide a mechanism that enables us to
answer the questions raised in section 4.4.1. Specifically, in order to analyse the negative
impact of loop unrolling on the data obtained and to measure the magnitude of this impact,
the data has been divided into U groups (one for each unroll factor), the mean improvement
in performance has been calculated and a histogram has been computed for each group. The
results are shown in figure 4.3. The y-axis represents the number of loops and the x-axis is
the mean percentage of improvement in performance. As highlighted above, a positive value
of x refers to an actual improvement and a negative value is a detrimental effect. Since the
unroll factor u=1 (when a loop is not unrolled) is the point of reference, no improvement is
found in this case. For the other unroll factors, it is clear that about 40% to 50% of the loops
are not noticeably affected by unrolling and their improvement is nearly zero. The number
of these loops decreases as the unroll factor becomes bigger and is compensated by the
number of loops for which the improvement is greater than 5%. In fact, while about 30% of
the loops have an improvement greater than 5% for u=2, about 40% of the loops show this
improvement for u=8. Additionally, only very few loops are dramatically improved by
unrolling. The number of these loops also increases with the unroll factor. Explicitly, about
1% of the loops have an improvement greater than 40% for u=2 and 6% of the loops show
this improvement for u=8. However, it is also necessary to mention the detrimental effect of
loop unrolling. Unrolling has a negative impact in about 20% of the loops. Furthermore, for
approximately 2% of the loops unrolling worsens the performance by more than 50% (when
u>2). In the worst case, though, the performance of some loops is deteriorated by more than
100%.
In conclusion, although a great number of loops are not appreciably affected by unrolling, it
is worth predicting when they are improved or deteriorated by the transformation and how
much the execution time of these loops may be influenced. The predictive modelling used to
tackle this problem will be explained in the following chapter. However, additional pre-
processing needs to be applied in order to make the data suitable for the prediction
techniques.
4.6 Pre-processing the features
4.6.1 Rescaling
The values of the features selected to characterise the loops may significantly influence the
learning techniques. For example, some features may outweigh others. A common approach
to diminish this effect and make the variables have similar magnitudes is rescaling each
feature to have zero mean and unit variance.
Thus, the update of feature i for loop k is obtained as follows:

    xi,k ← (xi,k − x̄i) / si    (4.3)

where i = 1, 2, …, N; x̄i is the mean and si is the standard deviation of the variable in the
training data. Namely:

    x̄i = (1/K) · Σ(k=1..K) xi,k    (4.4)

    si = sqrt( (1/(K−1)) · Σ(k=1..K) (xi,k − x̄i)² )    (4.5)

where K is the number of loops considered in the training data.
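Equations 4.3 to 4.5 amount to standard z-score standardisation; a brief Python sketch (the sample standard deviation in `statistics.stdev` uses the K − 1 denominator of equation 4.5):

```python
from statistics import mean, stdev

def rescale_feature(values):
    """Equations 4.3-4.5: standardise one feature over the K training loops
    so that it has zero mean and unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# e.g. one feature measured on K = 3 loops
scaled = rescale_feature([1.0, 2.0, 3.0])
assert scaled == [-1.0, 0.0, 1.0]
```

Note that in practice the mean and standard deviation must be estimated on the training data only and then reused to rescale any test data.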
4.6.2 Feature selection and feature transformation
Feature selection is the choice of a subset of features from all the variables available for
learning. Working with a great number of features can demand more training data that may
be difficult to obtain. Additionally, some features can be found to be irrelevant and the
variables that mainly influence the learning can be a reduced subset of them. Domain
knowledge or very well known techniques such as forward selection and backward
elimination help to identify those features that can be discarded without worsening the
performance of the learning algorithm. For example, from the set of 11 features presented in
Table 3.3, it may be suggested that only five variables are determinant for loop unrolling,
namely the number of floating point operations, the number of array element reuses, the
number of loads per iteration, the number of stores and the number of if statements.
Although the amount of features selected for this project is acceptable compared to the size
of the dataset, it is important to find which features are found to be the most important for
the learning techniques used. This will provide a better understanding of loop unrolling. The
best features for each technique and the method used to obtain them will be explained in the
following chapter.
Feature transformation is the process of creating new features in order to considerably
improve the performance of the learning techniques. Again, domain knowledge can help to
construct the right ones. For example, an intelligent variable to be constructed for applying
code transformations is the ratio of memory references to floating point operations. This feature
tells us how the memory is affected in relation to the number of operations performed per
iteration, and may not be easily found by a specific learning technique. How this feature is
explicitly constructed and its effect on the learning process will also be explained in the
following chapter.
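One plausible construction of such a ratio from the Table 3.3 features might look as follows (a hypothetical sketch only; the exact definition used in this project is deferred to the next chapter):

```python
def mem_to_flop_ratio(loads, stores, floating):
    """Hypothetical derived feature: memory references (loads + stores)
    per floating point operation in one iteration of the loop body.
    Guards against loops with no floating point operations."""
    return (loads + stores) / max(floating, 1)

# A loop with 8 loads, 3 stores and 6 floating point operations
assert abs(mem_to_flop_ratio(8, 3, 6) - 11 / 6) < 1e-12
```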
4.7 Summary
This chapter has provided an insight into the execution times resulting from the data
collection process and has described the need of the pre-processing stage in order to make
the data suitable for learning. The general framework in which the data has been prepared
and made available to the learning algorithms has been described. The execution times have
been analysed in terms of the improvement obtained by applying loop unrolling and it has
been found that about 42% of the loops that constitute the dataset can be, statistically,
significantly improved. This fact is important because it validates the appropriateness of the
benchmarks in order to build models that can potentially improve the performance of
programs. The pre-processing stage has been divided into two parts: the pre-processing of
the targets and the preparation of the features. The targets (the execution times) have been
filtered in order to guarantee that only loops with a sufficient execution time are maintained.
After discarding some loops due to their low execution times, the detection of outliers has
taken place and has been focused on identifying those loops for which some execution times
are considerably greater or smaller than the median for a specific unroll factor. These outliers
have also been eliminated from the dataset. As the final phase in the target pre-processing
stage, the execution times have been transformed into more suitable magnitudes that directly
indicate the improvement or detriment in performance of unrolling with respect to the mean
execution time of the original loop. These magnitudes are the actual targets for the learning
techniques explained in the following chapter. The new values obtained made possible a
deeper study of the effect of loop unrolling on the benchmarks used. Specifically, it was
found that approximately from 40% to 50% of the loops included in the dataset were not
considerably affected by unrolling and consequently, their improvement was nearly zero. It
was also discovered that a great number of loops were positively affected by the
transformation. Roughly, from 30% to 40% of the loops experienced a reduction in their
execution time when they were unrolled. Nevertheless, some loops (20%) were found to be
negatively affected by the transformation. These results provide a good understanding of the
appearance of the data and demonstrate how difficult it is to build a decision rule for unrolling
loops.
Figure 4.3: Number of Loops vs. Mean Improvement in Performance (%)
Finally, in order to avoid some features being considered more important than others for the
learning algorithms, each variable that characterises the loops was rescaled to have zero
mean and unit variance.
To conclude, with the results obtained by the analysis phase and the pre-processing stage
explained in this chapter, the data is ready to be used by learning algorithms in order to build
a model capable of predicting how beneficial or detrimental unrolling can be for a particular
loop.
Chapter 5
Modelling and Results
5.1 Introduction
Chapter Four has emphasised the importance of cleaning, analysing and preparing the data
that has been collected in order to make it suitable for learning. This suitability relates to the
maximum improvement in performance that may be achieved when applying machine
learning techniques. However, the term learning has not been explicitly defined within the
context of this project. Here, we are referring to learning from data. Thus, the aim is to build
a model able to learn from past examples by discovering the underlying structure of the data
in order to accurately predict on novel data. Therefore, in this project, learning is focused on
predictive modelling, i.e. given a dataset called the training set, the goal is to build a model
that can generalise and successfully perform on data that has not been seen before.
In general, predictive modelling can be thought of as the process of constructing a function
maps a vector of input variables (predictor variables) onto one or more output variables
(response variables) [Hand et al., 2001]. This function is constructed based on a training
dataset but is expected to successfully perform on new data. Predictive modelling is
commonly divided into two different approaches: classification and regression. In a
classification problem, the output variable is categorical. In other words, the targets, i.e. the
values of the output variable are discrete and are commonly referred as classes or labels. The
work done by [Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004], described
in Chapter One, are examples of classification problems. The former attempted to solve the
binary classification problem of predicting whether loop unrolling is beneficial or not
(targets as classes 1 or 0) for a particular loop given its static representation (values of the
input variables). The latter was also focused on loop unrolling but stated the classification
task as a multi-class problem where the possible labels for the output variable are the unroll
factors that were considered from one to eight. Clearly, both cases aim to predict a
categorical variable based on a set of features of the loops. Thus, they used machine learning
techniques to achieve that goal. This project goes beyond the classification problem and
proposes a regression approach to predict the improvement in performance that unrolling can
produce on a particular loop. Undoubtedly, the regression approach is a more general
formulation of the problem and previous solutions represent particular cases. Furthermore,
the machine-learning solution to loop unrolling based on a regression method is smoother
than the one obtained by classification methods, given the degree of noisiness the
measurements may have. Indeed, the likely event of two loops with the same characterisation
(the same values of the set of features that represent them) having different best unroll
factors constitutes a great difficulty for the classification task. However, the case of two
loops with a slightly different improvement in performance but the same representation is
not a problem for the regression approach. Although it is not the main goal of predictive
modelling, it is advantageous if the results of the techniques used for regression are
readable and understandable, as this provides better insight into the problem. This chapter
presents the machine learning solution to loop unrolling based on the regression approach
and describes the results obtained with the techniques used. The organisation of this chapter
is presented below.
Section 5.2 explains in detail how the regression problem is formulated and why it is
considered a more general view compared to previous approaches that apply machine
learning techniques to loop unrolling. Section 5.3 presents an overview of the regression
methods that have been used in this project and section 5.4 provides information about the
parameters that are involved in each method. Section 5.5 describes the measure of
performance that has been used to evaluate the methods. Section 5.6 provides the
experimental methodology that was adopted to obtain the results. Section 5.7 presents and
evaluates these results and compares the solution found with previous work. Finally, section
5.8 summarises the most important theoretical concepts and experimental results presented
in this chapter.
5.2 The regression approach

The regression approach to applying machine learning to loop unrolling proposed in this
project attempts to build a function for a particular unroll factor u. This function aims to map
a representation of a loop onto an expected improvement in performance. In other words, the
input of this function is a vector of features for a specific loop and the output is the expected
improvement in performance that the loop can achieve after being unrolled u times. Two
important further clarifications need to be mentioned. Firstly, a different function is
constructed for each unroll factor, i.e. there will be U functions that represent the behaviour
of the performance of a loop when it is unrolled. Secondly, the output has been called an
improvement, but it can also represent a detriment in performance.
More formally, let us consider the magnitudes obtained by the transformation applied
to the execution times of loop k (described by equation 4.2), y^u_{kj} (j = 1, 2, ..., R). These values
represent the improvement in performance of loop k under a specific unroll factor u
throughout R repetitions. This particular loop is characterised by the N-dimensional vector of
features x_k that has been scaled to have zero mean and unit variance (see Equation 4.3).
Although it would be possible for learning algorithms to predict a vector of targets for a
specific loop, let us collapse the repetitions indexed by j into a single magnitude by
calculating their mean:
y^u_k = \frac{1}{R} \sum_{j=1}^{R} y^u_{kj}    (5.1)
where, for the sake of clarity, the bar over y^u_k has been omitted. However, it should
always be borne in mind that this magnitude represents the mean improvement in performance
throughout R repetitions. Thus, the data available for learning can be represented
as (x_k, y^u_k), k = 1, ..., L, where L is the number of loops that are considered in the training
data. Hence, a regression approach for this data attempts to construct a function of the
form f^u = \Phi(x, \beta^u), where \beta^u is the parameter vector of the model8. Thus, given a loop k
described by its feature vector x_k, it is possible to obtain f^u_k, which predicts its mean
improvement in performance when it is unrolled using the factor u, i.e. y^u_k.
Two advantages of this approach over previous work need to be highlighted: robustness and
generality, and they are described below.
Robustness: The classification solution may be severely affected by the noisiness of the
measurements and the limited number of examples. Indeed, as explained in Chapter One, the
binary classification approach described in [Monsifrot et al., 2002] is likely to have two
8 This notation assumes that a set of parameters is involved in the model. However, for non-parametric models, these parameters may not exist. For simplicity, we consider that the notation used can refer to both parametric and non-parametric models.
loops with the same feature vector but assigned to different classes. This represents a
potential problem for any classifier if it is not appropriately treated. Additionally, even in the
case of the multi-class approach presented in [Stephenson and Amarasinghe, 2004], the
learning process may become very difficult. Certainly, there may not be sufficient examples
that indicate the correct unroll factor for a type of loop and, therefore, the best unroll
factor will be erroneously predicted. These complications are not present in the
regression approach where the aim is to build a function that smoothly predicts the
improvement in performance. Consequently, the solution provided is more robust to the
intrinsic variability of the data.
Generality: The multi-class problem of establishing the best unroll factor for loops is a
particular case of the regression approach. Clearly, the most appropriate unroll factor to be
used for a specific loop can be easily determined by calculating the greatest improvement in
performance obtained by the different models constructed (u = 1, ..., U). The binary decision
of whether unrolling is beneficial or not is also straightforward.
5.3 Learning methods used

A wide variety of modelling techniques is available to solve regression problems. In general,
these techniques can be divided into two groups: parametric and non-parametric models.
Whilst the former makes assumptions about the underlying distribution that generates the
data, the latter is more data-driven and does not place constraints on the form of the function
that constitutes a good representation of the data. It is necessary to remark that in our context
a "good representation" does not necessarily imply the best fit for the training data but
relates to the predictive power the algorithm can have when novel examples are presented.
Parametric models may assume, for example, that the underlying function that generates the
data is linear on the parameters of the model. This function can be a linear combination of
the input variables or a combination of other functions such as polynomials, exponentials or
Gaussians. Important non-parametric methods include kernel
machines (Support Vector Regression), Decision Trees and Artificial Neural Networks. Other
techniques such as Instance-Based Learning (e.g. Nearest Neighbours methods) can also be
used for regression problems, although they are more popular for classification.
Although there is no perfect method that can be used for regression, some techniques may
be more appropriate for the problem this project deals with. Considering that the number of
examples in the training data is limited, very complicated methods such as Neural Networks
should be avoided. In fact, there is a latent danger of overfitting the data and, although this can
be diminished in Neural Networks by regularisation, this technique may demand more
training examples than those currently available. Another important issue for the present
research is to give an interpretation to the results obtained by a specific method. Thus,
methods for which interpretability of the results is very difficult or even impossible, such as
Neural Networks, are not convenient for this problem. Hence, it seems that Occam's razor is
applicable to this problem: "try the simplest hypothesis first". Certainly, the complexity of
the problem is unknown and a lot of effort can be saved if a simple model is found to be
sufficient to solve it. In other words, if it is possible to find a simple method with a good
performance on data that is not included in the training set, it should be preferable over other
more complex models with comparable performance.
Therefore, the first method used to carry out the predictive modelling in this project is Linear
Regression. Since the input for the regression problem in this project is multivariate, i.e. a
vector instead of a simple scalar magnitude, this method will be referred to as Multiple
Linear Regression. Multiple Linear Regression is a simple but extensively used technique to
predict a variable based on a linear combination of the input variables. Thus, this method
works under the assumption that there is no interaction among the predictor variables, i.e.
one variable does not have an effect on the others. Furthermore, it is a linear additive model,
i.e. the effects of the variables are independent and they are linearly added to produce the
final prediction. Although these assumptions seem very restrictive, Linear Regression has
been shown to perform well even in some cases where the relation between the predictor
variables and the response variable is known to be non-linear. Another reason for which Linear
Regression represents a good approach to be attempted is that the parameter vector obtained
by the model may indicate the relevance of each variable. Thus, with this linear model, it is
possible to ascertain the variables that most influence loop unrolling.
However, in many situations the relation between the input variables and the output variable
is very complex and cannot be properly modelled by linear techniques. In these cases, it is
necessary to apply a non-linear approach that may have better predictive power. To consider
the general case in which no assumptions about the underlying process that generates the
data are made, a non-parametric model has been adopted; namely, Decision Tree Learning
has been used as the second choice of the regression approach in this project. Being more
specific, the Classification and Regression Trees (CART) algorithm described in [Breiman et
al., 1993] has been applied. Besides modelling non-linear relations between the predictor
variables and the response variable, Decision Trees can provide good insight into the features
that are most important for loop unrolling, given that their results are produced in a
tree-structured fashion. However, it is also recognised that if the number of training examples is
not sufficient to provide a representative sample of the true population, Decision Trees can
also overfit the training data (see [Mitchell, 1997], pages 66-69). In general, Decision Trees
may suffer from fragmentation, repetition and replication [Han and Kamber, 2001], which
can deteriorate the accuracy and comprehensibility of the tree. Fragmentation occurs when
the rules created by the tree (its branches) are based on only a small subset of the
training examples and therefore tend to be statistically insignificant. Repetition takes place when one
attribute is evaluated several times along the same branch of the tree. Replication occurs
when some portions of the tree (subtrees) are duplicated. These drawbacks appear to be
serious when applying Decision Trees Learning to solve the present regression approach.
Nevertheless, several improvements have been developed in recent years in order to
overcome these problems. Essentially, if a good pruning algorithm is applied to the primary
tree created by the method and the performance remains roughly the same, these problems can be
diminished. Therefore, a lot can be gained with a non-linear and non-parametric technique
to solve the regression problem, provided one is especially cautious to avoid overfitting the data.
The following sections will present a brief overview of each method in order to provide a
better understanding of how they work and to describe the most important issues to take into
consideration for each of them.
5.3.1 Multiple Linear Regression

Multiple Linear Regression models the response variable based on a linear combination of
the predictor variables. In an N-dimensional input space, a linear regression model tries to fit
a hyperplane to the training data. Hence, a linear model for the regression approach can be
stated as:
f^u = \beta^u \cdot x + \beta^u_0    (5.2)

where the upper index u has been used to emphasise that one model must be created for each
unroll factor. The parameters of the model, also called regression coefficients, are \beta^u
and \beta^u_0: \beta^u is an N-dimensional vector corresponding to the coefficients of each variable,
and \beta^u_0 is the free parameter or bias.
Therefore, given a loop described by its N-dimensional vector of features xk, it is possible to
find the predicted mean improvement in performance under the unroll factor u by
calculating:
f^u_k = \beta^u \cdot x_k + \beta^u_0    (5.3)
In order to fit this model to the data given by (x_k, y^u_k), where k = 1, ..., K, and K is the
number of loops considered in the training set, it is possible to find the parameters of the
model that minimise the mean square error (MSE) between the predictions f^u_k and the
actual values y^u_k. These parameters can be found by:

\beta^u = (X^{uT} X^u)^{-1} X^{uT} y^u    (5.4)

where \beta^u is now an (N+1)-dimensional vector (its last component is \beta^u_0); X^u is a matrix of
dimensions K x (N+1) in which each row is a different data point and whose last column is
composed of ones; and y^u is a vector containing the actual target for each data point.
In general, in order to find the solution given by equation (5.4), the calculation of the inverse
is not necessary and numerical techniques of linear algebra are used instead.
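As a minimal numerical check of equation (5.4) on synthetic data, the closed-form normal-equations solution and the solver-based route coincide; the feature values and targets below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 30, 11                              # loops in the training set, features
X = np.hstack([rng.standard_normal((K, N)),
               np.ones((K, 1))])           # last column composed of ones
y = rng.standard_normal(K)                 # actual mean improvements y^u_k

# Equation (5.4): the closed-form least-squares solution.
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice the explicit inverse is avoided in favour of a
# QR/SVD-based solver, which is numerically more stable.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_lstsq)
```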
5.3.1.1 Understanding the regression coefficients

It is clear that the components of
the parameter vector \beta^u are the coefficients of the input variables. Recalling that the features
have been scaled to have zero mean and unit variance, the value of \beta^u_i indicates the effect of
the variable x_i when all the other variables are held constant. Thus, these coefficients measure
the importance of the variables in order to predict the mean improvement in performance. If,
for example, one coefficient of a variable is nearly zero, that variable can be considered an
irrelevant predictor.
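For illustration, the ranking this implies can be computed directly from a coefficient vector; here the u = 2 column of Table 5.2 (presented later in this chapter) is used, with the bias excluded.

```python
import numpy as np

# Coefficients of the fitted linear model for u = 2 (Table 5.2),
# bias excluded; feature names as used in Chapter 4.
names = ["Called", "Size", "Trip", "Sys", "Branches", "Nested",
         "Loads", "Stores", "Reuses", "Floating", "IndAcc"]
beta = np.array([-0.958, 3.318, -1.000, -0.838, 0.911, -0.716,
                 0.925, -3.911, 1.724, -1.113, -0.173])

# With standardised inputs, |beta_i| measures the conditional effect
# of feature i, so sorting by magnitude ranks the features.
order = np.argsort(-np.abs(beta))
ranking = [names[i] for i in order]
print(ranking[:5])   # → ['Stores', 'Size', 'Reuses', 'Floating', 'Trip']
```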
5.3.2 Classification and Regression Trees

Decision Trees are a learning technique commonly used for classification problems
although, as in the case of the CART algorithm, they are also applicable to regression problems.
The CART algorithm, comprehensively described in [Breiman et al., 1993], does not make any
assumptions about the distributions of the independent or dependent variables. The aim of
the algorithm is to produce a set of rules able to accurately predict the dependent variable
based on the values of the independent variables. The resulting rules can be seen in a tree-
structured fashion. It works by recursively partitioning the data into smaller subsets, testing
the value of one variable at each node. The criterion used to split the data each time is based
on an impurity measure and CART exhaustively tests all the variables at each node. At each
leaf of the tree (terminal node) the predicted value is calculated as the mean of the dependent
variable in the subset that meets the conditions of the respective branch. Given that the size
of the tree can grow considerably, at the risk of overfitting, CART computes the final tree
by determining the subtree with minimal cost using cross-validation. It is worth noting
that although computing the mean at the leaf nodes as the possible predicted values may
sound unsophisticated, the subsets corresponding to these terminal nodes are believed to be
as homogeneous as possible. Furthermore, after pruning, the final tree may provide
understandable rules that potentially give a better insight into the interactions among the
variables.
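The recursive procedure just described can be sketched in a few lines. This is a minimal illustrative regression tree, not the MATLAB Statistics Toolbox implementation the project actually used, and it omits CART's cost-complexity pruning step (depth and leaf-size limits stand in for it); the demonstration data at the end is synthetic.

```python
import numpy as np

def grow(X, y, depth=0, max_depth=3, min_leaf=5):
    """Minimal CART-style regression tree: recursively choose the
    (variable, threshold) split that most reduces the squared error;
    each leaf predicts the mean of its subset."""
    if depth == max_depth or len(y) < 2 * min_leaf:
        return {"leaf": y.mean()}
    best, base = None, ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):                 # exhaustively test variables
        for t in np.unique(X[:, j])[:-1]:       # candidate thresholds
            m = X[:, j] <= t
            if m.sum() < min_leaf or (~m).sum() < min_leaf:
                continue
            sse = ((y[m] - y[m].mean()) ** 2).sum() + \
                  ((y[~m] - y[~m].mean()) ** 2).sum()
            if sse < base:
                base, best = sse, (j, t, m)
    if best is None:                            # no split improves the fit
        return {"leaf": y.mean()}
    j, t, m = best
    return {"var": j, "thr": t,
            "lo": grow(X[m], y[m], depth + 1, max_depth, min_leaf),
            "hi": grow(X[~m], y[~m], depth + 1, max_depth, min_leaf)}

def predict(node, x):
    while "leaf" not in node:
        node = node["lo"] if x[node["var"]] <= node["thr"] else node["hi"]
    return node["leaf"]

# Synthetic demonstration: the target depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
tree = grow(X, y)
```

Each path from the root to a leaf corresponds to one of the readable IF-THEN rules discussed in section 5.7.1.2.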
5.4 Parameter setting

Unlike other machine learning methods, Multiple Linear Regression and CART require
minimal tweaking in order to determine the best parameters of the algorithms for which the
greatest performance may be obtained. Certainly, there are no parameters to tune in Multiple
Linear Regression. Since the implementation of CART provided in MATLAB's
Statistics Toolbox was used, the splitting criterion or measure of impurity was the
least-squares function. Additionally, pruning was performed by means of cross-validation.
5.5 Measure of performance used

In order to evaluate the performance of the algorithms used and to carry out comparisons
between them, the Standardised Mean Square Error (SMSE) has been used. It can be
computed by:
SMSE^u = \frac{\sum_{k=1}^{K} (f^u_k - y^u_k)^2}{\sum_{k=1}^{K} (y^u_k - \bar{y}^u)^2}    (5.5)
where y^u_k are the actual improvements in performance given by (5.1), f^u_k are the values
predicted by the model and \bar{y}^u is the mean improvement in performance over all the
K loops considered within the validation or test set. The SMSE describes how good the
predictions of the model are compared to the case of always predicting the mean, i.e. without
knowledge of the problem. Therefore, small values of SMSE indicate a good performance of
the algorithm and they are expected to be less than 1.
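Equation (5.5) translates directly into a few lines; this is a sketch only, since the project's experiments were run in MATLAB.

```python
import numpy as np

def smse(f, y):
    """Standardised Mean Square Error (equation 5.5): the squared error
    of the predictions f, normalised by the error of always predicting
    the mean of the targets y.  Values below 1 beat the mean predictor."""
    return float(np.sum((f - y) ** 2) / np.sum((y - y.mean()) ** 2))
```

By construction, perfect predictions give an SMSE of 0 and predicting the mean everywhere gives exactly 1.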
5.6 Experimental Design

This section describes the design of the experiments that were carried out in order to evaluate
the performance of the learning algorithms used. There are three objectives with the
realisation of these experiments:
1. To provide an interpretation of the parameters and the results obtained by the
methods used regarding the features that were considered for loop unrolling.
2. To compare the accuracy of the methods and determine the technique that obtained
the best performance.
3. To ascertain the quality of the predictions obtained and the possible impact of these
predictions on the set of benchmarks used.
5.6.1 Complete dataset

The first experiments were performed on the whole set of loops that were collected, i.e. all
loops were used for training and validation. Although the results in terms of
the performance of each method do not represent an expected performance on novel data,
these experiments were carried out in order to analyse the results obtained by each method
and to interpret the importance of the variables involved in loop unrolling.
5.6.2 K-fold cross-validation

Given that testing a model on the same dataset that was used for training does not represent a
real measure of performance of the machine learning methods used, the second set of
experiments were carried out using K-fold cross-validation. This procedure has proved
to be a good methodology for evaluating the performance of a learning algorithm and does not
require a dataset with a great number of examples. The procedure is as follows:
1. Divide the training data into K randomly chosen, non-overlapping
subsets (folds) S1, S2, ..., SK of approximately the same size.
2. Repeat for i varying from 1 to K:
a. Take the subset Si as the test set and the remaining subsets (S1, ..., Si-1, Si+1, ..., SK)
as the training set.
b. Train the methods (Linear Regression and CART) with the training set
previously built.
c. Evaluate the models constructed in the test set, i.e. determine the predicted
values for each model.
d. Assess the performance of each algorithm by using the SMSE.
The procedure explained above was executed for each unroll factor u (u = 1, 2, ..., 8) and the
number of folds used was K=10.
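The steps above can be sketched as follows; `fit` and `predict` are caller-supplied stand-ins for the learning method (hypothetical names), and the usage in the test below runs the procedure on synthetic noise-free linear data.

```python
import numpy as np

def kfold_smse(fit, predict, X, y, K=10, seed=0):
    """Sketch of the K-fold procedure: shuffle the examples, split them
    into K disjoint folds, train on K-1 folds, evaluate the SMSE on the
    held-out fold, and average over the K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        model = fit(X[train], y[train])
        f = predict(model, X[test])
        scores.append(np.sum((f - y[test]) ** 2) /
                      np.sum((y[test] - y[test].mean()) ** 2))
    return float(np.mean(scores))
```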
5.6.3 Leave One Benchmark Out cross-validation

Although the K-fold cross-validation procedure is widely accepted by the machine learning
community to evaluate the performance of learning algorithms, the real goal in compiler
optimisation with machine learning is to optimise complete programs that have not been seen
before. Therefore, an alternative procedure called Leave One Benchmark Out (LOBO)
cross-validation has been followed. This time, given a set of B benchmarks, the models are
trained on B-1 benchmarks and the performance of the algorithms is evaluated on the
benchmark that was not used for training. Therefore, it is the same cross-validation
procedure explained above, except that the subsets are not randomly chosen but are selected
to be the individual benchmarks S1, ..., SB. Hence, there are B folds instead of K, where B is the number
of benchmarks used (B=12).
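Generating the LOBO folds can be sketched as below, assuming each loop is tagged with the benchmark it came from; `benchmark_ids` is a hypothetical label array, not a structure from the project's actual pipeline.

```python
import numpy as np

def lobo_folds(benchmark_ids):
    """Leave One Benchmark Out: yield (benchmark, train, test) index
    arrays, where each test set is exactly the loops belonging to one
    benchmark and the training set is everything else."""
    ids = np.asarray(benchmark_ids)
    for b in np.unique(ids):
        test = np.flatnonzero(ids == b)
        train = np.flatnonzero(ids != b)
        yield b, train, test
```

With 12 benchmarks this produces 12 folds, regardless of how many loops each benchmark contributes.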
5.6.4 Realising speed-ups

The accuracy of the predictions calculated for each learning technique in terms of the SMSE
provides a measure that indicates the performance of the learning algorithms used for this
specific problem. However, it does not present the actual improvement obtained with the
values predicted on each loop and consequently, on each benchmark. The valid procedure to
be followed is to determine the best unroll factor for each loop based on the predictions
obtained and setting up the programs with the unrolled version of the loops in order to realise
the final speed-ups. However, time-limitations have constrained the present project to carry
out this task. Alternatively, with the initial data already collected and the results of LOBO
cross-validation, it is possible to obtain the expected improvement in performance that each
loop could experience under the assumptions of no interactions between loops and a
negligible effect of the instrumentation. This procedure has been adopted and its results are
described in section 5.7.4.
5.7 Results and Evaluation
5.7.1 Complete dataset

Multiple Linear Regression and the CART algorithm have been applied separately to the set of
benchmarks. The Standardised Mean Square Error (SMSE) for each model that has been
constructed is shown in Table 5.1. In the case of u=1 the SMSE by definition is 1 given that
this factor is considered the point of reference and there is no improvement over itself.
Recalling that the SMSE measures how good the predictions obtained by the model are
compared to always predicting the mean, Multiple Linear Regression shows
very little improvement over that baseline. The CART algorithm considerably
outperforms the linear regression model. Since the same dataset for training is used to
validate the model, this good performance is rather unrealistic and it may indicate that the
algorithm is overfitting the data. However, the application of the models to the same data
that was used for their construction is intended to be explanatory rather than to provide a measure
of accuracy. In both cases, we are trying to determine those variables that are considered
more important for each model.
5.7.1.1 Analysing the regression coefficients

As explained above, in the linear
regression model the parameter vector may provide an indication of the effect of each
predictor variable on the output variable. The parameter vector for the Linear Model
constructed using the whole dataset is shown in Table 5.2. For each model, the coefficients
of the five most important features are shown shaded, i.e. those features with the greatest
conditional effect (negative or positive).
The absolute conditional effect of each feature throughout all the models has been calculated
by:
\beta_i = \sum_{u=2}^{U} (\beta^u_i)^2    (5.6)
This has been used to determine the best five features and they are presented below.
SPEC CFP95
Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
MLR 1.000 0.898 0.929 0.911 0.890 0.905 0.898 0.909
CART 1.000 0.287 0.318 0.385 0.384 0.346 0.427 0.431
VECTORD
Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
MLR 1.000 0.898 0.971 0.932 0.970 0.965 0.968 0.943
CART 1.000 0.306 0.649 0.463 0.595 0.565 0.586 0.517
SPEC CFP95 + VECTORD
Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
MLR 1.000 0.959 0.979 0.971 0.986 0.987 0.986 0.980
CART 1.000 0.310 0.649 0.419 0.606 0.585 0.629 0.460
Table 5.1: SMSE when applying Multiple Linear Regression (MLR) and CART to
the complete set of benchmarks
1. Stores: The number of array stores
2. Size: The number of statements within the loop
3. Reuses: The number of array element reuses
4. Nested: The nested level of the loop
5. Floating: The number of floating-point operations
By far, the number of array stores and the size of the loop body are the features that the
linear model has found to most influence the improvement in performance of the loops
under unrolling. This is understandable as the former provides an indication of the memory
references and the demand on registers the code may have and the latter is crucial to
determine whether the instructions in the loop body may be kept in the cache. However,
other features such as the number of if statements, the number of array loads or even the trip
count, are found not to be so important for the linear models. Indeed, the trip count does not
seem to provide useful information and is ranked among the best five features only when
u=2. Other variables such as the number of array element reuses that attempts to represent
data dependency among iterations and the number of floating point operations are also found
to be relevant for these models. Surprisingly, the nested level of the loop is ranked fourth
among all the features. However, a cautionary note is in order, given that the
linear models that have been constructed do not seem to accurately represent the data, as the
values of SMSE are approximately one. Certainly, the predictions are only slightly better
than always predicting the mean and the findings explained above may only be a
consequence of the poor performance of the linear model.
5.7.1.2 Analysing the trees

With the aim of obtaining an explanatory model of the data,
CART was applied to the whole dataset with no pruning, i.e. allowing overfitting. The
results shown in Table 5.1 reflect this fact. The most informative feature, i.e. the one placed
at the root of the tree, was the trip count for u = 2, 4, 8 and the number of floating-point operations
for u=3, 5, 6, 7. This is in contrast with the results obtained with linear regression where the
trip count was found not to be relevant for the predictions.
Some rules that show how the algorithm fits the data may be of interest. For example, for
u=2 it is found that IF the loop is called between 349800 and 650000 times within the
program AND the loop body ranges between 4 and 8 statements AND the trip count is
greater than 755 AND the loop has at most 1 branch AND the number of loads is greater
than 2 AND the loop has only 1 store AND at most 1 floating-point operation THEN the
expected improvement in performance is approximately 38%. In the opposite case, IF the
number of times the loop is called, the trip count and the number of branches are the same
as before, but the number of loads per iteration is 2 or 3, AND the loop has 2 or more
stores AND 2 or more floating-point operations THEN the expected detriment in
performance is roughly 23%.
Variable u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
Called -0.958 -1.205 -1.023 -1.042 -0.849 -1.206 -1.049
Size 3.318 5.387 3.967 4.779 2.850 4.742 3.827
Trip -1.000 -0.434 -0.492 -0.505 -0.427 -0.534 -0.828
Sys -0.838 -0.993 -0.782 -0.695 -0.433 -1.516 -1.038
Branches 0.911 0.539 0.288 0.178 0.532 1.147 0.380
Nested -0.716 -0.559 -1.504 -1.918 -1.783 -1.521 -2.008
Loads 0.925 1.249 0.236 1.205 1.588 1.448 1.630
Stores -3.911 -4.616 -5.179 -5.044 -4.315 -4.976 -4.987
Reuses 1.724 1.514 2.566 2.419 2.647 1.602 2.418
Floating -1.113 -2.137 -1.044 -1.237 -1.233 -1.208 -1.615
IndAcc -0.173 0.268 -0.009 -0.214 -0.201 -0.177 -0.443
Bo (bias) 1.804 1.668 4.587 2.370 2.872 3.772 5.143
Table 5.2: Parameter vectors of the linear models using the whole dataset
Although these prediction rules may seem unsophisticated, they represent how the loops in
the set of benchmarks selected are affected by unrolling. If it is recalled that the subsets in
which the data is partitioned by CART are as homogeneous as possible, they may provide a
better understanding of how unrolling works on this set of benchmarks.
5.7.2 K-fold cross-validation

Given that training and testing the models in the same dataset does not represent the actual
predictive power of the algorithms, K-fold cross-validation has been used to test the
accuracy of the predictions. The results for the set of benchmarks used are shown in Table
5.3, Table 5.4 and Table 5.5. Each table shows the average (in bold) of the SMSE throughout
all the folds for the algorithms applied to each set of benchmarks. Although there are several
cases when the SMSE is found to be less than one (highlighted cells), in general, the
accuracy of the predictions for linear regression and CART algorithm is not good. The
predictions obtained do not outperform the simple mean predictor used in the calculation of
the SMSE. However, it is worth noting that some SMSE calculations indicate a good
performance of the algorithm for a specific fold (e.g. in Table 5.3 for CART algorithm when
using k=9 and u=6) but they are compensated by other folds in which the algorithm performs
poorly. Comparatively, linear regression and CART algorithm have roughly the same
performance, although as shown in Table 5.3 and Table 5.5, Linear Regression consistently
outperforms CART.
5.7.3 Leave One Benchmark Out cross-validation

As explained in section 5.6.3, a more realistic approach to test the algorithms is to use Leave
One Benchmark Out (LOBO) cross-validation. The SMSE values for Linear Regression and
CART algorithm are shown in Table 5.6 and Table 5.7. Unfortunately, the overall results are,
again, an indication of poor performance.
For the case of Multiple Linear Regression (Table 5.6) the loops within the benchmarks
102.swim, 141.apsi and vectord3 are predicted better than baseline, but these low values of
SMSE are compensated by the poor performance of the algorithm in other benchmarks such
as 104.hydro2d and 125.turb3d. For benchmarks with a considerable amount of loops that
can be improved by unrolling, such as vectord1 and vectord2 (see table 4.1 in chapter 4), the
SMSE values are only slightly different from 1, and these benchmarks could be expected to
be predicted correctly when determining the best unroll factor to be used.
As with Multiple Linear Regression, CART algorithm (Table 5.7) shows quite good
performance when predicting on the benchmark 102.swim (u = 3, 4, 7 and 8). Additionally,
unlike Linear Regression, CART shows good SMSE values on the benchmark 125.turb3d
that has 5 loops for which unrolling is significantly beneficial (see table 4.1 in chapter 4).
However, CART also performs badly on 101.tomcatv, vectord3 and vectord4. For this
last benchmark, composed of two loops with very little improvement, the algorithm makes
predictions greater than 50%, causing the SMSE to reach values greater than 300.
Linear Regression
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K = 1 0.960 0.912 0.887 0.899 0.839 0.876 0.864
K = 2 1.105 1.058 1.021 0.917 1.094 1.096 0.992
K = 3 0.942 0.886 1.050 1.433 1.349 0.781 0.682
K = 4 0.975 0.967 1.150 1.090 1.025 1.116 1.079
K = 5 0.961 0.986 0.925 1.024 1.072 0.918 0.941
K = 6 1.916 2.473 1.473 1.257 1.494 1.986 1.365
K = 7 1.315 1.488 1.334 1.350 1.750 1.001 1.197
K = 8 1.315 1.452 1.284 1.023 0.904 1.296 1.283
K = 9 1.072 1.361 1.524 1.330 1.021 1.329 1.273
K = 10 0.948 0.962 0.888 0.813 0.835 1.069 0.868
AVG. 1.151 1.254 1.154 1.114 1.138 1.147 1.054
CART
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K=1 1.400 1.164 1.319 1.078 0.905 1.228 0.825
K=2 3.336 1.707 4.385 1.616 6.883 3.440 3.507
K=3 1.097 2.658 1.452 2.220 2.380 2.079 2.387
K=4 1.542 1.018 1.165 1.571 1.208 2.015 1.085
K=5 1.550 1.268 0.909 0.807 1.106 0.565 0.756
K=6 2.596 1.206 1.191 1.189 1.218 1.579 0.969
K=7 1.060 2.247 1.240 1.331 0.757 0.718 0.763
K=8 2.744 4.179 3.193 9.976 1.151 2.861 2.867
K=9 1.976 1.769 0.940 1.359 0.244 1.299 1.155
K=10 0.650 0.650 0.747 1.006 1.184 1.027 0.994
AVG. 1.795 1.787 1.654 2.215 1.704 1.681 1.531
Table 5.3: SMSE K-fold cross-validation for SPEC CFP95
Linear Regression
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K = 1 0.963 0.740 1.297 1.086 1.043 0.897 1.084
K = 2 1.016 1.429 1.153 1.033 1.075 1.129 1.158
K = 3 0.934 0.996 0.826 0.994 0.994 0.996 0.940
K = 4 17.42 13.75 3.618 4.624 5.308 4.107 3.242
K = 5 1.053 1.312 0.900 1.163 1.010 1.122 0.959
K = 6 1.731 1.656 1.445 1.625 1.554 1.239 1.258
K = 7 1.093 1.192 1.057 1.201 1.201 1.268 1.109
K = 8 0.797 0.850 1.039 1.088 1.421 1.182 1.048
K = 9 1.078 1.569 1.016 1.529 1.534 1.640 1.039
K = 10 1.090 1.800 2.011 2.723 2.809 2.477 1.676
AVG. 2.717 2.529 1.436 1.707 1.795 1.606 1.351
CART
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K=1 0.899 2.007 1.810 1.863 2.353 2.529 1.483
K=2 1.844 6.094 1.570 6.769 7.186 7.719 1.081
K=3 0.869 0.827 0.728 0.784 0.743 0.831 2.922
K=4 1.345 1.070 0.771 0.742 0.713 0.889 0.963
K=5 1.595 4.826 1.677 3.197 3.073 4.198 2.964
K=6 1.599 1.610 1.050 0.793 0.746 0.682 0.575
K=7 1.794 1.581 1.874 3.552 3.475 3.432 1.876
K=8 0.938 1.348 1.173 1.278 0.964 1.343 0.995
K=9 0.990 0.625 1.046 1.512 1.340 1.374 1.713
K=10 1.269 1.379 1.307 1.162 1.412 1.329 1.366
AVG. 1.314 2.137 1.301 2.165 2.200 2.433 1.594
Table 5.4: SMSE K-fold cross-validation for VECTORD
Linear Regression
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K = 1 1.076 1.034 0.991 1.037 1.020 1.052 0.978
K = 2 1.022 1.000 1.044 1.006 1.022 1.004 1.055
K = 3 0.972 0.960 0.942 0.960 0.986 0.903 0.946
K = 4 0.919 1.016 1.052 1.077 1.051 1.090 1.051
K = 5 1.000 1.047 0.960 0.945 0.971 1.014 1.037
K = 6 1.192 1.149 1.052 1.062 1.034 1.038 1.030
K = 7 1.032 0.981 1.057 1.028 1.030 1.034 1.032
K = 8 0.931 0.852 0.914 0.927 0.993 0.871 0.954
K = 9 0.935 0.943 0.975 0.999 0.988 0.963 0.970
K = 10 1.303 1.353 2.247 1.332 1.621 1.876 1.691
AVG. 1.038 1.034 1.123 1.037 1.072 1.084 1.074
CART
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K=1 1.054 0.678 0.887 1.818 1.094 0.566 1.185
K=2 1.180 0.835 0.529 0.766 0.752 0.769 0.717
K=3 0.638 1.898 1.544 2.420 2.459 1.724 2.371
K=4 1.119 1.678 0.978 1.561 1.267 1.717 0.996
K=5 1.661 1.531 1.825 1.833 1.551 1.139 1.117
K=6 1.116 1.171 1.721 1.677 1.541 1.694 1.830
K=7 0.907 1.332 0.781 1.105 1.070 1.291 0.645
K=8 1.270 1.172 0.884 1.593 1.219 1.359 0.720
K=9 1.511 8.758 1.068 3.444 1.009 2.384 1.089
K=10 1.513 1.533 1.255 2.091 2.764 2.027 1.138
AVG. 1.197 2.058 1.147 1.831 1.473 1.467 1.181
Table 5.5: SMSE K-fold cross-validation for SPEC CFP95 + VECTORD
Benchmark u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
101.tomcatv 5.917 0.869 0.985 0.573 2.691 0.612 1.838
102.swim 0.279 0.625 0.636 0.739 0.717 0.756 0.705
103.su2cor 1.832 1.079 1.793 1.352 1.573 1.479 1.602
104.hydro2d 8.086 38.60 19.28 42.64 29.03 23.60 26.76
107.mgrid 0.889 1.681 2.096 2.997 4.354 2.505 2.775
125.turb3d 2.085 9.685 20.75 17.63 6.313 10.96 9.547
141.apsi 1.031 1.025 0.913 0.911 0.900 0.898 0.987
146.wave5 1.343 1.616 1.856 2.715 2.365 2.464 2.381
Vectord1 0.976 0.996 0.987 1.010 1.018 1.003 1.003
Vectord2 1.064 1.102 1.042 1.057 1.048 1.068 1.037
Vectord3 0.958 0.976 0.943 0.727 0.802 0.838 0.895
Vectord4 1.487 1.485 4.748 0.527 1.869 2.274 0.030
AVG. 2.162 4.978 4.670 6.073 4.390 4.038 4.130
Table 5.6: SMSE LOBO cross-validation for Multiple Linear Regression
5.7.4 Realising speed-ups
Considering that the ultimate task in loop unrolling is to determine the unroll factor to be
used for each loop, an additional procedure has been followed in order to measure the impact
of the predictions described above. Given that the actual improvement in performance of the
loops is available for each unroll factor, it is possible to establish the best improvement in
performance that a loop can achieve. Similarly, the predictions provided by the algorithms
(obtained with LOBO cross-validation) can be used to ascertain the best predicted unroll
factor for each loop. Therefore, the expected improvement in performance a loop may
experience under this unroll factor can be found. Let us call this magnitude the predicted
improvement, although it should be clear that it is the actual improvement of each loop based
on the unroll factor suggested by the predictions of the algorithms.
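The procedure just described can be sketched as follows, with hypothetical per-loop tables of predicted and measured improvements indexed by unroll factor:

```python
def predicted_improvement(predicted, actual):
    """Pick the unroll factor with the highest *predicted* improvement,
    then report the *actual* measured improvement for that factor."""
    best_u = max(predicted, key=predicted.get)
    return best_u, actual[best_u]

# Hypothetical loop: the model's predictions favour u = 4, but u = 2 was truly best.
pred = {2: 1.0, 3: 2.5, 4: 6.0}   # predicted improvement (%) per unroll factor
meas = {2: 8.0, 3: 0.5, 4: 5.0}   # measured improvement (%) per unroll factor
u, gain = predicted_improvement(pred, meas)
```

Because `gain` is a measured value, it can never exceed the best measured improvement, which is why no point in Figure 5.1 can lie above the diagonal.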
Benchmark u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
101.tomcatv 4.861 6.862 6.244 4.185 13.68 8.461 3.907
102.swim 22.76 0.739 0.575 1.499 1.284 0.646 0.795
103.su2cor 6.600 4.158 1.753 1.754 1.560 1.317 0.949
104.hydro2d 1.953 2.274 1.218 3.075 3.181 2.042 1.896
107.mgrid 1.656 2.401 0.674 2.903 2.187 2.069 2.731
125.turb3d 1.468 2.425 0.658 1.392 0.787 1.252 0.514
141.apsi 1.253 2.020 1.455 1.635 3.770 2.328 2.539
146.wave5 0.672 1.579 0.611 5.272 3.133 1.262 0.789
Vectord1 1.169 1.122 1.239 1.002 1.242 1.145 1.296
Vectord2 0.882 4.690 1.009 2.469 2.678 3.126 1.309
Vectord3 5.459 1.891 6.883 4.882 4.799 5.201 3.859
Vectord4 386.5 328.3 389.4 210.8 95.33 344.6 481.3
AVG. 36.27 29.87 34.31 20.07 11.13 31.12 41.82
Table 5.7: SMSE LOBO cross-validation for CART algorithm
Figure 5.1 shows plots of the predicted improvement vs. the best improvement for
Multiple Linear Regression and the CART algorithm. Clearly, the ideal place for each
point in the plots is on the diagonal of the first quadrant, i.e. where the predicted improvement
equals the best improvement. However, it may also be acceptable to be below the
diagonal (note that being above the diagonal is impossible) but above the x-axis.
This indicates that, although the exact improvement is not predicted, much can still be
gained by using the unroll factor suggested by the algorithms. Both Linear Regression and
CART suggest unroll factors for which a notable improvement in performance is achieved.
However, they also make suggestions that degrade performance for loops that could
potentially be improved, as can be seen from the points below the x-axis.
Negative values lying on the y-axis indicate that the best improvement is zero but the
predicted unrolling hurts performance.
Because overprinting in Figure 5.1 may make it difficult to analyse the actual effect of the
predictions, the best improvement and the predicted improvement have been divided into
ten classes, and a histogram counting the number of loops that belong to each pair of classes
has been computed. The results are shown in Figure 5.2. Given that there are
only minor differences between Linear Regression and CART, the subsequent analysis refers to
the results of both algorithms (unless explicitly stated otherwise). As before, it is desirable to be
on the main diagonal of the plot, because this indicates that the effect of the predictions is the
best possible improvement. Approximately 57% of the loops are correctly affected by
the predictions. Furthermore, a large number of loops remain unaffected by the predictions
because their maximum improvement is nearly zero: the bar corresponding to the fifth class
of both the predicted improvement and the best improvement represents roughly 28% of the
loops. It is also interesting to calculate the percentage of loops for which unrolling cannot
produce an improvement in performance but which are negatively affected by the predictions.
These loops are represented by the bars for which the predicted improvement class is less
than five and the best improvement class is five; roughly 17% of the loops belong to this
group. The bars for which the best improvement class is greater than five and the predicted
improvement class is five correspond to loops that could potentially be improved by
unrolling but for which the predictions do not produce a significant effect; 13% (for Linear
Regression) and 17% (for CART) of the loops fall into this group.
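The ten improvement classes used in this histogram (their bin edges are listed alongside Figure 5.2) can be reproduced with a small helper; a loop whose improvement is near zero falls into class 5, the (-5, 5] band:

```python
# Upper edges of the ten half-open improvement bins (%), classes 1..10:
# (-150,-100], (-100,-50], (-50,-20], (-20,-5], (-5,5],
# (5,10], (10,20], (20,40], (40,60], (60,80]
UPPER_EDGES = [-100, -50, -20, -5, 5, 10, 20, 40, 60, 80]

def improvement_class(pct):
    """Map an improvement percentage to its class number (1-10)."""
    for i, upper in enumerate(UPPER_EDGES, start=1):
        if pct <= upper:
            return i
    raise ValueError("improvement out of range")
```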
Besides analysing the effect of the predictions on the complete set of loops, the unroll factors
suggested by the algorithms have been used to ascertain the reduction (or increase) in the
execution time of each loop, using the data that was initially collected. These
reductions in execution time have been used to determine the impact of the predictions on
the total execution time of the benchmarks. Let us denote this improvement in performance
as the hypothetical or re-substitution improvement, because it has not been obtained by
additional executions of the programs. It relies on the assumptions that there is no interaction
between loops and that the effect of the instrumentation is negligible. The results for all the
benchmarks used are shown in Figure 5.3. The results obtained by the Linear Regression
model led to an improvement in performance on seven benchmarks, whilst one
benchmark, namely 125.turb3d, remained nearly unaffected, and four programs experienced
an increase in their execution times. The improvements obtained by using the results of the
CART algorithm are similar, except that in this case 125.turb3d was negatively affected and
its performance worsened. It can be seen that the SPEC CFP95 benchmarks are more
difficult to optimise than the VECTORD benchmarks: the maximum improvement reached
for the former was roughly 6%, while the latter were improved by a maximum of
approximately 18%. The mean improvement in performance across all the benchmarks
achieved by the Linear Regression model was about 2.5%, while the CART algorithm gave a
2.3% improvement on average. Finally, the benchmark most negatively affected by
erroneous predictions of the algorithms was vectord3, which experienced a slowdown of
approximately 7%.
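The re-substitution improvement can be sketched as follows (with hypothetical per-loop times and improvements; the no-interaction assumption is what allows the per-loop savings simply to be summed):

```python
def resubstitution_improvement(total_time, loop_times, loop_gain_pct):
    """Estimated benchmark speed-up (%) if each loop's execution time
    shrinks by its predicted improvement, assuming loops do not interact
    and instrumentation overhead is negligible."""
    saved = sum(loop_times[l] * loop_gain_pct[l] / 100.0 for l in loop_times)
    new_total = total_time - saved
    return 100.0 * (total_time - new_total) / total_time

# Hypothetical benchmark: 100 s total, two instrumented loops.
gain = resubstitution_improvement(
    total_time=100.0,
    loop_times={"loop1": 40.0, "loop2": 10.0},
    loop_gain_pct={"loop1": 10.0, "loop2": -5.0},  # loop2 is hurt by unrolling
)
```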
If Figure 5.3 is compared with the maximum possible improvement achievable by unrolling,
shown in Figure 5.4 for all the benchmarks, the negative effect of the predictions on some
benchmarks can be understood given the insignificant maximum improvement that can be
reached on those benchmarks. Furthermore, the good effect obtained for vectord1 and
vectord2 can also be explained on the basis of this maximum improvement.
5.7.5 Feature construction
As explained in section 4.6.2, some features may be irrelevant to the learning techniques;
others can be transformed into a more suitable form, for example binary features; and new
features can be constructed in order to improve the accuracy of the predictions, for example
the ratio of memory references to floating-point operations. Binary features were created for
sparse variables, such as the number of proper functions of the language or the number of
branches within the loop, but no improvement was found. Similarly, the ratio of memory
references to floating-point operations was created as a feature, but unfortunately this new
variable did not affect the performance of the algorithms.
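The kinds of constructed features mentioned above can be sketched like this (the feature names are hypothetical; shown are a binary indicator derived from a sparse count and the memory-reference-to-FLOP ratio):

```python
def add_constructed_features(loop):
    """Augment a loop's feature dictionary with a binary indicator for a
    sparse count and the ratio of memory references to FP operations."""
    out = dict(loop)
    out["has_branches"] = 1 if loop["n_branches"] > 0 else 0
    fp = loop["n_fp_ops"]
    out["mem_to_fp_ratio"] = loop["n_mem_refs"] / fp if fp else 0.0
    return out

features = add_constructed_features(
    {"n_branches": 0, "n_mem_refs": 12, "n_fp_ops": 4})
```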
5.7.6 Comparison to related work
In principle, there is no direct point of comparison with the work in [Monsifrot et al., 2002]
or that in [Stephenson and Amarasinghe, 2004], owing to the differences in the benchmarks
used and in the formulation of the problem. Certainly, their goal was to build a classifier,
while the approach adopted in this project is the construction of a regression model.
Although in [Monsifrot et al., 2002] unrolling was also implemented at the source-code level,
a classification task was formulated and, consequently, the results were presented in a
different way. Additionally, the decision on the specific unroll factor to be used was
left to the compiler and was not determined by the classifier. This considerably
simplifies the problem, given that a loop may be improved by one unroll factor but negatively
affected by another.
For illustrative purposes only: if a classifier were constructed from the results obtained with
Linear Regression or the CART algorithm, it would have an accuracy of approximately 75%,
which is comparable to the total accuracy obtained in [Monsifrot et al., 2002] without boosting
(79.4%). Their speed-ups were on average 6.2% and 4% on two different machines.
However, there is no reference to the maximum speed-up reachable in their data.
In the case of the work by [Stephenson and Amarasinghe, 2004], it is even more
difficult to establish comparisons because they implemented unrolling in the back-end of the
compiler. They also adopted a classification approach and obtained speed-ups of 6% on the
SPEC benchmarks. However, they included programs from SPEC INT95 that were not
considered in this project. Furthermore, their dataset included programs for which an
improvement of 30% to 50% was possible, which is greater than the maximum obtainable
speed-ups for the programs used in this project.
5.8 Summary and Discussion
This chapter has presented the regression approach for predicting the expected improvement
in performance of loops under unrolling. This approach has been argued to be more robust
and more general than a classification approach: more robust because it deals more easily
with noisy measurements, and more general because both the binary decision of whether
unrolling should be applied and the multi-class prediction of the best unroll factor can easily
be derived from the results of the regression.
Two different modelling techniques have been used to tackle this regression problem:
Multiple Linear Regression and Classification and Regression Trees (CART). Multiple
Linear Regression is a straightforward technique that assumes the output variable is a
linear combination of the input variables; despite its simplicity, it has proved successful
for many types of problems. The CART algorithm is a decision-tree learning method used
for classification and regression; it attempts to model non-linear dependencies between the
output variable and the predictor variables without making assumptions about the
distribution of the data. To measure the predictive power of the techniques, the Standardised
Mean Square Error (SMSE) has been used, which compares the accuracy of the predictions
against the very simple baseline of always predicting the mean. Three different types of
experiments have been designed in order to test the models constructed for the regression
approach: using the complete dataset, using K-fold cross-validation, and using Leave One
Benchmark Out (LOBO) cross-validation. When working with the complete dataset, the
algorithms were trained and tested on the same data in order to obtain an explanatory model
that gave a better insight into the problem and identified the variables relevant to loop
unrolling. The regression coefficients indicated that the most influential variables were the
number of array stores, the number of statements within the loop, the number of array
element reuses, the nesting level of the loop and the number of floating-point operations.
Apart from the nesting level of the loop, which was unexpectedly ranked fourth among the
most important variables, all the other features found have a demonstrated impact on loop
unrolling. The analysis of the trees built by the CART algorithm determined that the trip
count and the number of floating-point operations were the most relevant features.
Additionally, interesting rules involving the size of the loop body, the number of memory
references, the number of floating-point operations and the trip count emerged from the
trees, providing an explanation of how the regression task was carried out on the data.
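A minimal pure-Python sketch of the Multiple Linear Regression fit (ordinary least squares via the normal equations; purely illustrative, not the statistical package used in this project):

```python
def fit_linear_regression(X, y):
    """Ordinary least squares: solve (X'X)b = X'y for the coefficients b.
    An intercept column of ones is prepended to X; the linear system is
    solved by Gaussian elimination with partial pivoting."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c]
                              for c in range(r + 1, k))) / A[r][r]
    return coef  # [intercept, weight_1, ..., weight_n]
```

For example, on data generated exactly by y = 1 + 2*x1 + 3*x2, the fit recovers those coefficients.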
Considering that the predictive power of the techniques is what determines the effectiveness
of the regression approach, the results obtained with the cross-validation methodology and
the Leave One Benchmark Out strategy are, in general, disappointing. Although for some
cases the SMSE values indicated good performance, they were strongly overshadowed by
other cases where the performance of the algorithms was poor.
Without being discouraged by the low accuracy of the predictions, the models constructed
by the Linear Regression approach and the CART algorithm were used to suggest an
unroll factor for each loop. These unroll factors were then used to ascertain the possible
improvement in performance that the loops could obtain. It was found that in most
cases the unroll factors suggested by the algorithms produced an improvement in
performance that, to some extent, agreed with the best obtainable improvements. In other
words, many of the unroll factors suggested by the predictions of the algorithms led to an
improvement in the performance of the loops. However, some loops were negatively affected
by incorrect predictions.
Additionally, given the time constraints placed on the present project, the benchmarks were
not executed with the predicted unrolled versions of the loops. Instead, an alternative
procedure based on the collected data was performed in order to realise speed-ups on each
benchmark. This procedure reused the execution times in the data and worked under the
assumptions of no interaction between loops and a negligible effect of the instrumentation.
Hence, the new execution times for the benchmarks were calculated from the predictions
suggested by the algorithms with the LOBO cross-validation results. It was found that
seven of the twelve benchmarks might experience an improvement in performance, with a
maximum speed-up of approximately 18%. However, other benchmarks might
experience an increase in their execution times if the predictions were adopted; the worst
degradation in performance was roughly 7%. The mean improvement in
performance across all the benchmarks was about 2.5% for the linear model and 2.3%
for the results obtained by CART.
It may seem rather contradictory that the accuracy of the regression techniques was found to
be poor, yet the unroll factors suggested by their results led to a notable improvement in
performance, considering the maximum improvement obtainable by unrolling
on this set of benchmarks. Two reasons may explain this fact. Firstly, some
loops may always be improved by unrolling regardless of the unroll factor used; that is, the
execution time of some loops may be effectively reduced by any unroll factor greater than
one. However, there is still an indication of a good effect of the predictions given that, as
shown in Chapter Four, most of the loops do not experience an improvement in performance
greater than 5%. Indeed, and this is the second reason for the speed-ups obtained despite the
poor performance of the algorithms, the models constructed may somehow have learnt when
loops benefit from unrolling, so that, although the accuracy of their predictions is low, the
best unroll factor suggested by their results still yields a good improvement for the loops.
The explanation given above does not attempt to soften the poor predictive power, in terms
of SMSE, achieved by the techniques used, but rather to raise a question regarding the
appropriateness of the regression approach for the problem tackled in this project. I still
believe it is a more general and suitable approach than the construction of a classifier that
directly predicts the best unroll factor; the classification approach is more susceptible to
erroneous decisions because it does not smoothly model the behaviour of the execution
times of the loops under unrolling. However, it may also be true that the regression approach
needs more training data, different features to capture the variations in the execution
times of the loops under unrolling, and even different types of techniques to solve the
problem. These considerations can be adopted by future work in order to improve the results
obtained in this project.
[Two scatter plots: Predicted improvement (%) on the y-axis, ranging from -160 to 80, against Best improvement (%) on the x-axis, ranging from 0 to 70, one per model.]
Figure 5.1: Predicted Improvement (%) vs. Best Improvement in performance
for Linear Regression model (top) and CART (bottom)
Class Improvement (%)
1 (-150,-100]
2 (-100,-50]
3 (-50,-20]
4 (-20,-5]
5 (-5,5]
6 (5,10]
7 (10,20]
8 (20,40]
9 (40,60]
10 (60,80]
Figure 5.2: Histogram of Best Improvement vs. Predicted Improvement for Multiple
Linear Regression (top) and CART (bottom)
Figure 5.3: Re-substitution Improvement found by Multiple Linear Regression
(top) and CART (bottom)
Figure 5.4: Maximum possible improvement obtainable by unrolling
Conclusions
This dissertation has addressed the problem of compiler optimisation with machine learning.
It has been shown that, even when only one transformation such as loop unrolling is
considered, the problem of determining when and how the transformation should be applied
remains a challenge for compiler writers and researchers. This is evident because loop
unrolling may be beneficial for some loops but detrimental for others; furthermore, the
transformation may have a negative effect on a loop under one unroll factor and a positive
effect under another. One of the most important reasons to apply loop unrolling is that it
exposes Instruction Level Parallelism. Unrolling can also eliminate copy operations and
improve memory locality. However, the transformation may degrade the instruction cache
depending on factors such as the size of the loop body, the size of the cache and the unroll
factor used. Additionally, it may increase the number of address calculations and make the
instruction-scheduling problem more complex. Finally, due to the effect of interactions,
loop unrolling may enable or prevent other transformations. This is probably the most
complicated issue when studying loop unrolling, given that analysing the interaction with
another transformation in isolation may be unrealistic, while evaluating its effect in the
presence of a great number of program transformations becomes infeasible.
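As a purely illustrative sketch (in Python rather than the Fortran sources used in this project, and with the real benefits arising at the machine-code level), unrolling a summation loop by a factor of four replaces four loop-control steps with one:

```python
def sum_rolled(a):
    total = 0
    for i in range(len(a)):          # one loop-control step per element
        total += a[i]
    return total

def sum_unrolled4(a):
    """Unrolled by a factor of 4; assumes len(a) is a multiple of 4."""
    total = 0
    for i in range(0, len(a), 4):    # one loop-control step per four elements
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
    return total
```

Both functions compute the same result; the unrolled body simply exposes more independent work per iteration, which a compiler or processor can exploit.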
These factors motivate the use of machine learning techniques to predict when unrolling will
be detrimental or beneficial for a particular loop. This approach assumes that program
transformations can be learnt from past examples; more specifically, that the parameters
involved in loop unrolling can be learnt from the behaviour of the execution times of loops
described by a set of features.
Although previous work has formulated the optimisation problem for loop unrolling as a
classification task, throughout this project the belief has been held that a regression
approach can more appropriately model the behaviour of a loop under unrolling. This is
because, for the problem this project has dealt with, classifiers could be severely affected by
noisy measurements and by the limited number of examples in the training set. Furthermore,
the results of a classification approach, regarding both the decision of whether to apply
unrolling to a specific loop and the determination of the best unroll factor, are easily
obtained from the results of the regression approach.
The belief has also been held that, as in most data-mining applications, pre-processing the
data to make it suitable for learning is the phase to which most time must be dedicated,
given that without reliable data the learning process is meaningless. Thus, much effort was
focused on the process of collecting, cleaning, analysing and preparing the data to be used
by the machine learning techniques. Following this process judiciously, the features and the
execution times of the loops were extracted and transformed for use by the regression
approach.
Two different algorithms were used for regression in order to predict the improvement in
performance a loop can experience when it is unrolled a specific number of times: Multiple
Linear Regression and Classification and Regression Trees (CART). The former was
selected not only for its simplicity but also for its successful performance across different
types of applications. The latter was chosen because it considers non-linear relationships
and makes no assumptions about the underlying distribution of the data. The results, in
terms of accuracy measured by the Standardised Mean Square Error (SMSE), were
disappointing, as in most cases the predictions obtained with the algorithms could not
outperform the very simple mean predictor used as the SMSE baseline. Nonetheless, these
predictions were used to ascertain the best unroll factor for each loop and to determine the
expected improvement in performance that the benchmarks considered could experience.
Due to the time constraints placed on this project, the programs could not be executed with
the predictions suggested by the algorithms; the initially collected data was used instead to
compute the final results. Thus, the realisation of speed-ups was carried out under the
assumptions that there is no interaction between loops and that the effect of the
instrumentation is negligible.
It was found that most of the loops could experience an improvement in performance by
using the unroll factor suggested by the predictions of the algorithms. Certainly, seven out
of twelve benchmarks could be improved with the predictions, with a maximum speed-up of
approximately 18%. Four programs in the case of the linear model, and five in the case of
the CART algorithm, were negatively affected by unrolling with the predicted factors. On
average, an overall improvement in performance of 2.5% (for Linear Regression) and 2.3%
(for CART) could be achieved.
Two possible explanations were considered after analysing the results: firstly, that some
loops are positively influenced by unrolling regardless of the unroll factor used; secondly,
that although the accuracy of the predictions was poor, the models enabled the discovery of
circumstances in which unrolling could effectively improve the execution time of loops. In
conclusion, both factors have influenced the improvements obtained. However, even though
improvements in performance were achieved, it would be naïve to state that all the results
in the project are exciting: other approaches, albeit using different data, have obtained
superior speed-ups. It would also be extreme to assume that the improvement in
performance of loops under unrolling is unpredictable. It is difficult to identify a single
reason for the poor performance of the algorithms used. It should be noted, however, that
the number of loops utilised in this project is considerably smaller than in similar
approaches: whilst only 248 loops were included in the dataset of this project, previous
work accumulated data for more than 1000 loops. Additionally, the set of mostly static loop
features utilised may not be sufficient to accurately predict the improvement in performance
of loops under unrolling. Finally, regression methods other than Linear Regression and the
CART algorithm could perform better on the dataset that has been created.
Thus, immediate future work following on from the present project should focus on adding
further examples to the existing dataset, taking care that any new data added is suitable for
use by machine learning techniques. The dataset may be complemented not only with more
examples but also with a greater number of features; dynamic features can be obtained from
frameworks specialised in program analysis and program transformation. Furthermore, it is
worth trying regression methods other than Multiple Linear Regression and the CART
algorithm.
Going beyond learning for loop unrolling, there is a great number of applications in compiler
optimisation that could be addressed: for example, deriving better heuristics for other
program transformations, or tackling more complex problems such as instruction scheduling
or register allocation. However, much work remains to be done on the problem of
interactions among different transformations. Given that evaluating all the possible
optimisation options with which a program could be compiled is infeasible, one could record
these optimisations for a set of training examples, and it would be worthwhile to predict the
optimisation levels at which a program may achieve maximum speed. Nonetheless, it should
be recalled that the creation of a good
dataset usable by machine learning techniques is a time-consuming activity. Therefore, the
use or creation of software tools that could automate this process would provide a great
benefit for research in this area.
Bibliography
[Bacon et al., 1994] Bacon, D., Graham, S. and Sharp, O. (1994). Compiler Transformations
for High-Performance Computing. In ACM Computing Surveys, Vol. 26, Issue 4, Pages 345-420.
[Bodin et al., 1998] Bodin, F., Mével Y. and Quiniou, R. (1998). A User Level Program
Transformation Tool. In Proceedings of the International Conference on Supercomputing.
[Breiman et al., 1993] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1993).
Classification and Regression Trees. Chapman and Hall, Boca Raton.
[Calder et al., 1997] Calder, B., Grunwald, D., Jones, M., Lindsay, D., Martin, J., Mozer, M.
and Zorn, B. (1997). Evidence-Based Static Branch Prediction Using Machine Learning. In
ACM Transactions on Programming Languages and Systems, Vol. 19, No. 1, Pages 188-222.
[Cooper and Torczon, 2004] Cooper, K. and Torczon, L. (2004). Engineering a Compiler.
Morgan Kaufmann.
[Davidson and Jinturkar, 2001] Davidson, J. and Jinturkar, S. (2001). An Aggressive
Approach to Loop Unrolling. Technical Report: CS-95-26. University of Virginia, USA.
[Dongarra and Hinds, 1979] Dongarra, J. and Hinds, A. (1979). Unrolling Loops in
FORTRAN. Software-Practice and Experience, Vol. 9, Pages 219-226.
[Fursin, 2004] Fursin, G. (2004). Iterative Compilation and Performance Prediction for
Numerical Applications. PhD thesis, University of Edinburgh, Edinburgh, UK.
[G77] GNU Fortran Compiler. http://gcc.gnu.org/.
[Han and Kamber, 2001] Han, J. and Kamber, M. (2001). Data Mining: Concepts and
Techniques. Morgan Kaufmann.
[Hand et al., 2001] Hand, D., Mannila, H. and Smyth, P. (2001). Principles of Data Mining.
MIT Press.
[Hays and Winkler, 1971] Hays, W. and Winkler, R. (1971). Statistics: Probability, Inference
and Decision. Volume I. Holt, Rinehart and Winston, Inc.
[Kolodner, 1993] Kolodner, J. (1993). Case-Based Reasoning. Morgan Kaufmann.
[Koza, 1992] Koza J. (1992). Genetic Programming: On the Programming of Computers by
Means of Natural Selection. MIT Press.
[Levine et al., 1991] Levine, D., Callahan, D. and Dongarra, J. (1991). A Test Suite for
Vectorizing Fortran Compilers. Double precision.
[Long, 2004] Long, S. (2004). Adaptive Java Optimisation Using Machine Learning
Techniques. PhD thesis, University of Edinburgh, Edinburgh, UK.
[Mitchell, 1997] Mitchell, Tom. (1997). Machine Learning. McGraw Hill.
[Monsifrot and Bodin, 2001] Monsifrot, A. and Bodin, F. (2001). Computer Aided Hand
Tuning (CAHT): "Applying Case-Based Reasoning to Performance Tuning". In Proceedings
of the 15th ACM International Conference on Supercomputing (ICS-01), Pages 196-203.
ACM Press, Sorrento, Italy.
[Monsifrot et al., 2002] Monsifrot, A., Bodin, F. and Quiniou R. (2002). A Machine
Learning Approach to Automatic Production of Compiler Heuristics. In Artificial
Intelligence: Methodology, Systems, Applications, pages 41-50.
[Neter et al., 1996] Neter, J., Kutner, M., Nachtsheim, C. and Wasserman, W. (1996).
Applied Linear Statistical Models, 4th Edition. Irwin.
[Nielsen, 2004] Nielsen, P. (2004). Lecture notes on Advanced Computer Systems. The
University of Auckland, New Zealand. Downloadable from:
http://www.esc.auckland.ac.nz/teaching/Engsci453SC/.
[ORC] Open Research Compiler for Itanium™ Processor Family.
http://ipf-orc.sourceforge.net/.
[Shen and Smaill, 2003] Shen, Q. and Smaill, A. (2003). Lecture Notes on Knowledge
Representation. The University of Edinburgh, UK. Downloadable from:
http://www.inf.ed.ac.uk/teaching/modules/kr/notes/index.html.
[SPEC95] The Standard Performance Evaluation Corporation.
http://www.specbench.org/.
[Stephenson and Amarasinghe, 2004] Stephenson, M. and Amarasinghe, S. (2004).
Predicting Unroll Factors Using Nearest Neighbors. MIT-TM-938.
[Stephenson et al., 2003] Stephenson M., Martin M., O'Reilly U.-M. and Amarasinghe, S.
(2003). Meta Optimization: Improving Compiler Heuristics with Machine Learning. In
Proceedings of the SIGPLAN '03 Conference on Programming Language Design and
Implementation, San Diego, CA.
[Trimaran] Trimaran, an Infrastructure for Research in Instruction-Level Parallelism.
http://www.trimaran.org.