Predicting Good Compiler Transformations
Using Machine Learning
Edwin V. Bonilla
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2004
Abstract
This dissertation presents a machine learning solution to the compiler optimisation problem
focused on a particular program transformation: loop unrolling. Loop unrolling is a very
straightforward but powerful code transformation mainly used to improve Instruction Level
Parallelism and to reduce the overhead due to loop control. However, loop unrolling can also
be detrimental, for example, when the instruction cache is degraded due to the size of the
loop body. Additionally, the effect of the interactions between loop unrolling and other
program transformations is unknown. Consequently, determining when and how unrolling
should be applied remains a challenge for compiler writers and researchers. This project
works under the assumption that the effect of loop unrolling on the execution times of
programs can be learnt based on past examples. Therefore, a regression approach able to
learn the improvement in performance of loops under unrolling is presented. This novel
approach differs from previous work ([Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004]) because it does not formulate the problem as a classification task but as a regression task. Great effort has been invested in the generation of clean and reliable
data in order to make it suitable for learning. Two different regression algorithms have been
used: Multiple Linear Regression and Classification and Regression Trees (CART).
Although the accuracy of the methods is questionable, the realisation of final speed-ups on seven out of twelve benchmarks indicates that something has been gained through this learning process. A maximum re-substitution improvement of 18% has been achieved, along with overall performance improvements of 2.5% for Linear Regression and 2.3% for the CART algorithm. The present work is the beginning of an ambitious project that attempts to build a compiler that can learn to optimise programs; it can undoubtedly be improved in the near future.
Acknowledgements
Special thanks to my supervisor Dr. Chris Williams for his invaluable advice and
comprehensive revision of my progress throughout this project.
Thanks to Dr. Michael O'Boyle and Dr. Grigori Fursin for the discussions held during our
meetings that made possible the creation of the dataset that has been used in this project.
Thanks to Catalina Voroneanu and Bonny Quick for their patience when revising some
drafts of this dissertation.
Supported by the Programme Alβan, European Union Programme of High Level
Scholarships for Latin America, identification number (E03M14650CO).
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own
except where explicitly stated otherwise in the text, and that this work has not been
submitted for any other degree or professional qualification except as specified.
(Edwin V. Bonilla)
Table of Contents
Introduction
    Overview and Motivation
    Project Objectives
    Organisation
Chapter One: Literature Review
    1.1 Introduction
    1.2 Tuning heuristics and recommending program transformations
    1.3 Learning in a particular program transformation: loop unrolling
    1.4 Summary
Chapter 2: Background on Compiler Optimisation
    2.1 Introduction
    2.2 Definition of compilation
    2.3 Compiler organisation
    2.4 The purpose of a compiler
    2.5 An Optimising Compiler
        2.5.1 Goals of Compiler Optimisation
        2.5.2 Considerations for program transformations
        2.5.3 The process of transforming a program for optimisation
        2.5.4 The problem of interaction
        2.5.5 Types of program transformations
        2.5.6 The scope of optimisation
        2.5.7 Some common transformations
    2.6 Loop Unrolling
        2.6.1 Definition
        2.6.2 Implementation
        2.6.3 Advantages of loop unrolling
        2.6.4 Disadvantages of loop unrolling
        2.6.5 Interactions, again
        2.6.6 Candidates for unrolling
    2.7 Summary
Chapter 3: Data Collection
    3.1 Introduction
    3.2 The Benchmarks
    3.3 Implementation of loop unrolling
        3.3.1 Which loops should be unrolled?
        3.3.2 Initial experiments
        3.3.3 Loop level profiling
    3.4 Generating the targets
        3.4.1 Preparing the benchmarks
        3.4.2 Selecting loops
        3.4.3 Profiling
        3.4.4 Filtering
        3.4.5 Running the search strategy
    3.5 Technical Details
        3.5.1 The platform
        3.5.2 The compiler
        3.5.3 The timer's precision
    3.6 The results in summary
    3.7 Feature extraction
    3.8 The representation of a loop
    3.9 Summary
Chapter 4: Data Preparation and Exploratory Data Analysis
    4.1 Introduction
    4.2 The general framework for data integration
    4.3 Formal representation of the data
    4.4 Is this data valid?
        4.4.1 Statistical analysis
    4.5 Pre-processing the targets
        4.5.1 Filtering
        4.5.2 Dealing with outliers
        4.5.3 Target transformation
    4.6 Pre-processing the features
        4.6.1 Rescaling
        4.6.2 Feature selection and feature transformation
    4.7 Summary
Chapter 5: Modelling and Results
    5.1 Introduction
    5.2 The regression approach
    5.3 Learning methods used
        5.3.1 Multiple Linear Regression
        5.3.2 Classification and Regression Trees
    5.4 Parameters setting
    5.5 Measure of performance used
    5.6 Experimental Design
        5.6.1 Complete dataset
        5.6.2 K-fold cross-validation
        5.6.3 Leave One Benchmark Out cross-validation
        5.6.4 Realising speed-ups
    5.7 Results and Evaluation
        5.7.1 Complete dataset
        5.7.2 K-fold cross-validation
        5.7.3 Leave One Benchmark Out Cross-validation
        5.7.4 Realising speed-ups
        5.7.5 Feature construction
        5.7.6 Comparison to related work
    5.8 Summary and Discussion
Conclusions
Bibliography
Introduction
Overview and Motivation
The continuously increasing demand for high-performance computational resources has required great effort from hardware designers to develop the advanced components that make the construction of current applications possible. Thus, microprocessors' architectures have become more complex and have come to provide processing rates that until a few years ago were affordable only by a select group of users.
However, the resources offered by current processors are significantly underexploited, which
limits the possibility of running applications at maximum speed.
This 'bottleneck' is mainly a consequence of the limitations of the compiler, the very complex application used to transform the source code of programs written in a high-level programming language into machine-dependent code. Indeed, the main concern of compiler
writers centres not upon the well-studied tasks at the front end of the compilation process (such
as parsing and lexical analysis) but upon the discovery of ways in which to optimise the
execution time of programs. Hence, optimisation is not considered an additional feature of
the compilation process but has become a crucial component of existing compilers.
However, achieving high performance on modern processors is an extremely difficult task
and compiler writers have to deal with NP-complete problems.
Numerous program transformations have been created in order to guarantee optimal use of
the resources of existing architectures. Nonetheless, all these transformations fail in the sense
that the compiler optimisation problem is far from being optimally solved. In fact, although
these transformations should, in theory, lead to a significant improvement in the execution time of programs, they have instead become new problems to be solved.
Certainly, due to the great variety of transformations that are currently available, the number
of parameters each transformation involves and the unknown and almost enigmatic effect of
the interactions among these transformations, the compiler optimisation problem has been
transferred to the task of discovering when and how program transformations should be
applied.
Considering the huge search space in compiler optimisation, compiler writers commonly rely
on heuristics that suggest when the application of a particular transformation can be
profitable. The usual approach adopted is running a set of benchmarks and setting the
parameters of the heuristics on these benchmarks. Although valid, this approach has been shown to exploit the power of the transformations insufficiently, leading to suboptimal solutions or sometimes even to slower programs. This is understandable given that the heuristics adopted may be specific to
particular types of programs and target architectures.
This project tackles the compiler optimisation problem from a different point of view. It does
not attempt to tune heuristics previously constructed in order to optimise programs, but
investigates the use of machine learning techniques with the aim of discovering when a
particular transformation can be beneficial. Machine Learning techniques have been
successfully applied to different areas, from business applications to star/galaxy
classification. The type of learning adopted in this project is the one based on past examples
(training data) from which a model can be constructed. This model must have predictive power, i.e. although it is built on a set of training examples, it must be able to perform successfully on novel data.
Project Objectives
Thus, the principal goal of this project is to use supervised machine learning techniques to
predict good compiler transformations for programs. Therefore, if a number of examples
of good transformations for programs are given, the aim is to use machine learning methods
to predict these transformations based on features that characterise the programs. Given the
great number of program transformations that are available, this project will focus on a
relatively simplified version of the problem by applying machine learning techniques to
predict the effect of a particular transformation: loop unrolling.
Loop unrolling is a very straightforward but powerful program transformation mainly used
to improve Instruction Level Parallelism, or ILP (the execution of several instructions at the
same time) and to reduce the overhead due to loop control. Studies on loop unrolling date
from about 1979 when Dongarra and Hinds (1979) investigated the effect of this
transformation on programs written in Fortran. However, understanding the interaction of the
parameters that may affect it as well as its effects on the execution time of the ultimate code
generated by the compiler is still a challenge. The main problem to be solved in loop
unrolling is deciding whether this transformation is beneficial or not and, yet further, how
much a loop should be unrolled in order to optimise the execution time of programs.
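To make the transformation concrete, the sketch below unrolls a simple accumulation loop by a factor of four. Python is used here purely for readability (the programs studied in this project are compiled Fortran benchmarks, where the transformation actually pays off), and a real implementation must also include an epilogue that handles trip counts that are not a multiple of the unroll factor.

```python
def dot_rolled(a, b):
    # Original loop: one loop-control check and branch per element.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_unrolled4(a, b):
    # Body replicated four times: one loop-control check per four
    # elements, and four independent multiply-adds per iteration
    # that the hardware can overlap (Instruction Level Parallelism).
    total = 0.0
    n = len(a)
    i = 0
    while i + 4 <= n:
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
        i += 4
    while i < n:  # epilogue: leftover iterations when n % 4 != 0
        total += a[i] * b[i]
        i += 1
    return total
```

Both functions compute the same value; unrolling changes only how the work is expressed, which is precisely why its profitability depends on the interaction with the hardware and with other transformations rather than on the result.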
Previous work has focused on building classifiers in order to predict the profitability of loop
unrolling [Monsifrot et al., 2002] and how much this transformation should be applied
[Stephenson and Amarasinghe, 2004]. Although well-founded, these approaches may
encounter some difficulties when noisy measurements are present and when the data used for
learning is limited. This project goes beyond the classification problem and proposes a
regression approach to predict the improvement in performance that unrolling can
produce on a particular loop. The regression approach is a more general formulation of the
classification problem and previous solutions can be obtained with it. Furthermore, the
machine learning solution to loop unrolling based on a regression method is smoother than
the one obtained by classification methods given the degree of noisiness the measurements
may have and the limitations in terms of the number of training examples that are available.
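The difference between the two formulations can be sketched as follows: a classifier predicts only a label (unroll or do not unroll), whereas a regressor predicts the expected improvement itself, from which the label is recovered by thresholding. The sketch below fits a one-feature linear model by ordinary least squares; the feature (loop body size) and the training data are purely hypothetical.

```python
def fit_least_squares(xs, ys):
    # Ordinary least squares for y ~ w * x + c with a single feature.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    c = mean_y - w * mean_x
    return w, c

# Hypothetical training data: loop body size against the measured
# speed-up (%) obtained by unrolling that loop.
body_sizes = [2.0, 4.0, 6.0, 8.0, 10.0]
speedups = [9.0, 7.0, 5.0, 3.0, 1.0]
w, c = fit_least_squares(body_sizes, speedups)

def predict_speedup(body_size):
    return w * body_size + c

def should_unroll(body_size):
    # The classification decision falls out of the regression:
    # unroll only when the predicted speed-up is positive.
    return predict_speedup(body_size) > 0.0
```

On this toy data the fit is exact (w = -1, c = 11), so a loop with body size 12 is predicted to slow down under unrolling and is left alone, while a loop with body size 5 is unrolled.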
Bearing in mind that the process of applying machine learning techniques requires great
effort in the preparation and analysis of the data to make it suitable for learning and the
correct use of appropriate techniques for the regression task, the following specific
objectives have been stated for the project:
• To develop a systematic approach that enables the application of machine learning
techniques to compiler optimisation focused on loop unrolling.
• To investigate the appropriateness of learning based on past examples in compiler
optimisation.
• To propose a regression solution able to determine the most influential parameters
involved in loop unrolling and able to predict the effect of this transformation on the
execution time of programs.
• To assess the quality and the impact of the results obtained by two different
regression methods used to predict the effect of loop unrolling.
Organisation
This dissertation assumes that the reader does not have a background in compilers and,
therefore, an entire chapter will be dedicated to explaining the most relevant issues involved
in compilation and compiler optimisation. Furthermore, the most important terminology
related to compiler optimisation needed to understand a specific concept or procedure will be
progressively explained. Similarly, although no separate section has been devoted to
providing a background on machine learning techniques, those concepts that are believed to
be crucial to understanding what is stated will be described as each chapter is developed. The
organisation of this dissertation is presented below.
Chapter One provides the Literature Review, describing previous work that involves
machine learning and/or artificial intelligence techniques applied to compiler optimisation.
Each approach is briefly explained, giving some details about the implementation and the
learning technique used. A discussion concerning their advantages and drawbacks is
presented and an explanation about how this previous work has influenced and motivated the
present project is given.
Chapter Two presents a Background on Compiler Optimisation. It aims to familiarise the
reader with the terminology involved in compiler optimisation, which will be used
throughout this dissertation, and give the background necessary to understand subsequent
chapters. Initially, the definition of compilation is provided, followed by an explanation of
the most important issues in compiler optimisation. The concept of program transformation
is briefly described and some code transformations are mentioned. A special section is
devoted to loop unrolling, which is the program transformation of interest in the present
dissertation.
Chapter Three describes the Data Collection process that has been carried out in this project.
It presents the most relevant factors that were taken into account to execute the experiments
and to generate the data available for learning. In this chapter, the benchmarks that were used
to construct the dataset are described; the implementation used for loop unrolling is
explained; some technical issues regarding the platform, the compiler and the optimisation
level used for the experiments are provided; the results of the data collection are summarised
and the features extracted from the loops are given.
Chapter Four presents the Data Preparation and Exploratory Data Analysis that was
carried out after the data collection process. It describes the process of cleaning, analysing
and preparing the data in order to make it suitable for learning. This chapter provides an
insight into the data that will be used for modelling, explaining how the data is processed
throughout different stages and is made available for analysis and learning.
Chapter Five presents the section of Modelling and Results. This chapter explains the
machine learning solution to loop unrolling based on the regression approach and describes
the results obtained with the techniques used. It details how the regression problem is
formulated and why it is considered more general as compared to previous approaches that
applied machine learning techniques to loop unrolling. It provides an overview of the
regression methods that have been used in this project and defines the measure of
performance that has been used to evaluate the methods. It presents and evaluates the results
obtained and compares the solution found with previous work.
Finally, the conclusions drawn on completion of this project are presented, together with a
summary of the work that has been done, the achievements obtained and the future work that
is proposed for continuing research in the area of compiler optimisation with machine
learning.
An additional clarification needs to be provided before reading the following chapters. At
some stages the term data mining is used to describe the work that has been developed in this
project; sometimes this term seems to be equated with machine learning. However, the
reader should bear in mind that although there is no clearly defined line separating these fields,
it is assumed that data mining is the complete process of discovering knowledge from data
and that machine learning is only one important step in this process. Thus, even though the
main goal of this project is to apply machine learning techniques to compiler optimisation,
the work that has been developed is essentially a data mining application.
Chapter One
Literature Review
1.1 Introduction
This section describes the most relevant work involving machine learning and/or artificial
intelligence techniques applied to compiler optimisation. Although there are other
approaches that have used machine learning for specific tasks in compiler optimisation, they
are not mentioned in this section as they are distant from the central idea and the aims of this
project. As described throughout this chapter, the attractive idea of applying machine
learning to compiler optimisation is relatively new. Some authors have proposed solutions
that are easy to implement and automate and others have preferred methods that are difficult
to deal with and can take a long time to obtain results.
The previous work that is presented in this chapter has been divided into two groups: the use
of machine learning as a general approach to compiler optimisation and the application of
machine learning methods to a specific program transformation. Thus, while section 1.2
presents previous work that has attempted to tune several code transformations by using
Evolutionary Computing and Case-Based Reasoning, Section 1.3 explains how Decision
Trees and Nearest Neighbours have been applied to learning in a particular program
transformation: loop unrolling. Each approach will be briefly described, giving some details
of the implementation and the learning technique used; a discussion concerning their
advantages and drawbacks will be presented; finally, a summary and an explanation about
how this previous work has influenced and motivated the present project will be given in
section 1.4.
1.2 Tuning heuristics and recommending program transformations
Stephenson et al. (2003) used evolutionary computing to build up priority functions. Priority
functions are, in some way, ubiquitous in constructing heuristics for compiler optimisation.
In other words, compiler writers commonly rely on the assumption that a specific
optimisation technique is strongly tied to a certain function, called priority function. This
function involves some of the parameters that possibly affect a particular heuristic. Under
this assumption, they used Genetic Programming [Koza, 1992] to search the priority function
solution space. Their work was focused on three different heuristics: hyperblock formation,
register allocation and data prefetching, using Trimaran [Trimaran] and the Open Research
Compiler [ORC] for their experiments.
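To make the idea concrete, a priority function is simply a scoring rule over the properties of a candidate. The function below is a hypothetical hand-written example of the general shape of expression such a search explores, here scoring loops as unrolling candidates; the features, weights and form are invented for illustration and are not taken from Stephenson et al. (2003).

```python
def priority(trip_count, body_size, has_call):
    # Hypothetical priority function: favour loops with high trip
    # counts and small bodies, and rule out loops containing calls.
    # Genetic Programming searches the space of expressions of this
    # general shape instead of fixing one expression by hand.
    if has_call:
        return 0.0
    return trip_count / (1.0 + body_size)

# Candidate loops, ranked by priority (highest first).
loops = [
    {"name": "loop_a", "trip_count": 100, "body_size": 4, "has_call": False},
    {"name": "loop_b", "trip_count": 100, "body_size": 40, "has_call": False},
    {"name": "loop_c", "trip_count": 500, "body_size": 8, "has_call": True},
]
ranked = sorted(
    loops,
    key=lambda l: priority(l["trip_count"], l["body_size"], l["has_call"]),
    reverse=True,
)
```

Evolving such a function means mutating and recombining expressions like the one above, with the fitness of each candidate measured by the execution time of the compiled benchmarks — which is exactly why the search is so expensive.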
Although it may be appealing for many computer scientists to construct programs based on
evolution and natural/sexual selection, this approach has more drawbacks than advantages.
To avoid extending this section more than necessary, I will mention only four aspects that may discourage the use of Genetic Programming and evolutionary computing for compiler optimisation. The first reason is related to a very popular term in machine learning: overfitting. Plainly explained, overfitting is the effect of fitting a model to the training data in such a way that it is unable to generalise and perform successfully on novel data. In fact, the generalisation of Genetic Programming for this type of problem has not been clearly demonstrated, and the results of the work of Stephenson et al. (2003) reflect this effect.
Secondly, given that the so-called 'fitness function' used to evaluate candidate solutions is strongly tied to the execution time of the programs, carrying out the task of selecting and 'evolving' the solution may take a very long time. Furthermore, and this is the third justification for avoiding Genetic Programming for this problem, different runs of the technique on the same input can lead to different results, which is referred to as instability of the solution. Finally, but no less important, the adjustment of the parameters of the heuristic may simply be transferred to the tuning of the parameters of the technique itself.
In an attempt to build an interactive tool to provide the user with a guide to performance tuning, Monsifrot and Bodin (2001) developed a framework based on Case-Based Reasoning (CBR). The purpose was to advise the user on possible code transformations in order to reduce the execution time of programs. They adapted the general idea of Case-Based Reasoning, which consists of learning from past cases (see [Shen and Smaill, 2003] pages 72-78 for an introduction or [Kolodner, 1993] for a thorough description), to the compiler optimisation problem by detecting fragments of code (i.e. loops) that are candidates for optimisation, checking their similarities with past cases and reusing the solutions of those cases. The
system was implemented using TSF [Bodin et al., 1998] and codes written in Fortran 77. The
features used to characterise the loops were selected according to four different categories:
loop structure, arithmetic expressions, array references, and data dependences. Loop
transformations such as unrolling innermost and outer loops, unroll-and-jam and loop
blocking were considered.
Like Nearest Neighbours, Case-Based Reasoning belongs to the family of Instance-Based Learning
methods, which classify new instances according to their similarities with a set of training
examples ([Mitchell, 1997] pages 230-245). Although Case-Based Reasoning can be
considered a more sophisticated version of Nearest Neighbour methods, the work done in
[Monsifrot and Bodin, 2001] does not detail important stages in CBR such as the
modification of the prior solution or the repair of the proposed solution when it is not
successful.
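The retrieval step common to both families of methods can be sketched as follows: each loop is reduced to a numeric feature vector, and the stored case closest to the new loop (here under Euclidean distance) supplies the candidate solution. The feature values and solutions below are hypothetical; a full CBR system would additionally adapt and repair the retrieved solution, which is exactly the stage left undetailed in [Monsifrot and Bodin, 2001].

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_nearest(case_base, query):
    # Instance-based learning: reuse the stored solution of the most
    # similar past case.
    return min(case_base, key=lambda case: euclidean(case["features"], query))

# Each case pairs loop features (e.g. nesting depth, body size,
# number of array references) with the transformation that worked
# for that loop in the past.
case_base = [
    {"features": [1.0, 4.0, 2.0], "solution": "unroll by 4"},
    {"features": [2.0, 30.0, 8.0], "solution": "do not unroll"},
]
best = retrieve_nearest(case_base, [1.0, 6.0, 2.0])
```

The quality of such retrieval depends entirely on the features (the indices) used to describe the loops, which is why the feature set proposed in this work is its most valuable contribution.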
An aspect of this approach worth highlighting is that it presents a wide variety of sensible and important features for characterising programs and, more specifically, for describing loops, which can be crucial when working with machine learning techniques. These features, also called
indices in the CBR terminology, make possible the identification of past cases that can be
reused and modified to provide the solution for a given problem. However, there are several
caveats to mention about this approach. In the first place, an insufficient number of loops were used in the experiments, and they cannot be considered representative of programs in general. In fact, the initial experiments were performed on a single benchmark containing 64 loops, which clearly limits the capacity of the method to generalise to new problems. Furthermore, only one specific program was used to test the performance of the system, which reinforces the suspicion of biased results. Nevertheless, this is understandable given the difficulty of collecting data for this kind of application. Finally, and possibly the greatest drawback of this solution, it contributes little to the main goal of reducing the effort compiler writers spend on optimisation. Certainly, unlike traditional compilers, the system suggests modifications without checking them for legality, so a great deal of work is still left to the user, who is responsible for this task.
1.3 Learning in a particular program transformation: loop unrolling
In a more pragmatic approach, which was indeed the motivation for the present dissertation, Monsifrot et al. (2002) concentrated their efforts on a particular transformation for optimising the execution time of programs: loop unrolling. Based on a characterisation of loops from different programs written in Fortran 77, they investigated whether it was possible to learn a rule that could predict when unrolling is beneficial or detrimental. In
this case, unrolling was implemented at the source code level using TSF [Bodin et al., 1998]
and the experiments were performed with the GNU Fortran compiler [g77]. For learning, they applied decision trees, an appropriate technique when the readability of the results is required. Briefly, their aim was to build a binary classifier able to decide whether or not to perform loop unrolling.
Even though their results were not very encouraging, which is explainable given that only one code transformation was taken into account, the methodology itself must be highlighted, as well as the effort invested in obtaining the loop abstraction. Nevertheless, several limitations of their work must be mentioned. Firstly, there is no reference to how much a loop should be unrolled or how this decision should be taken. This is in fact an important issue in loop unrolling, because the technique can be advantageous for one unroll factor but detrimental for another. Secondly, given the methods they used to carry out feature extraction, two loops with the same representation (loop abstraction) can belong to different classes, positive or negative, where positive refers to the case when unrolling improves the execution time and negative to the opposite situation. This "noisy" training data seems to have severely affected their results. Finally, although it is explicitly acknowledged in the original paper, the large number of negative examples and the small number of positive examples also affected the performance of their technique. However, no attempt was made to deal with this unbalanced data.
Following the work of Monsifrot et al. (2002), Stephenson and Amarasinghe (2004) have recently published a technical report that describes how to go beyond a binary classification of the suitability of loop unrolling. Besides predicting whether unrolling is beneficial or not, they also tried to predict the best unroll factor for loops. Unlike Monsifrot et al. (2002), who applied unrolling at the source code level, they implemented unrolling in the back-end of the compiler and used the Open Research Compiler [ORC] for their experiments with codes written in C, Fortran and Fortran 90. As in [Monsifrot and Bodin, 2001], their technique was based on Instance-Based Learning methods; in fact, they used Nearest Neighbours, a very simple machine learning method commonly used for classification, although also applicable to regression. In summary, their work focused on solving the multi-class problem of predicting the unroll factor that guarantees a loop's minimum execution time.
Although not directly comparable, given the differences in experimental methodology and in the machine learning algorithms used, their results were relatively better than those obtained by Monsifrot et al. (2002). Aware of the variability in the execution times of the loops, they invested considerable effort in instrumenting the loops and ran each experiment 30 times, taking the median as the representative value for each training data point. Although this approach is sensible and valid, much can be gained if information about the behaviour of the execution time of the loops under unrolling is retained. It then becomes possible to formulate a more general solution that directly models the improvement in performance of the loops when they are unrolled. This is the solution proposed in the present dissertation, and it will be referred to as the regression approach.
1.4 Summary

This chapter has presented the relevant previous work that has applied machine learning to compiler optimisation. In general, machine learning has been used to provide solutions applicable to a range of optimisation problems, but it has also been applied to a specific code transformation: loop unrolling, which is the interest of the present project. Certainly, it is necessary to remark that this dissertation has been strongly motivated by the last two pieces of work described above, and many of their ideas were analysed in order to formulate a more general approach that can easily deal with variability and noise and can generalise more appropriately to novel data.
Chapter 2
Background on Compiler Optimisation
2.1 Introduction

This chapter presents an overview of compilation and describes the most important concepts related to compiler optimisation. It is not the aim of this chapter to provide an in-depth study of compilers. Rather, this section attempts to familiarise the reader with the terminology of compiler optimisation, which will be used throughout this dissertation, and to give the background necessary to understand subsequent chapters. A general-to-specific methodology is used to relate the different topics. Thus, the definition of compilation is provided in section 2.2 and the organisation of a compiler is given in section 2.3. The aim of a compiler is described in section 2.4 and the most important issues in compiler optimisation are explained in section 2.5, where the concept of program transformation is briefly described and some code transformations are mentioned. Section 2.6 is devoted to loop unrolling, which is the program transformation of interest in the present dissertation. Finally, a summary of the chapter is presented in section 2.7.
2.2 Definition of compilation

The term compilation can be defined as the process of transforming the source code of a program (in a high-level language) into object code (in machine language) executable on a target machine. Although this definition can be seen as more restrictive and less general than others (see [Cooper and Torczon, 2004] for an alternative definition), it is sufficient to understand the following concepts related to compiler optimisation. Hence, in its most basic form a compiler can be understood as a black box whose input is the source code of a program written in a high-level language and whose output is executable code for a specific machine, as shown in Figure 2.1.
This mysterious black box is actually a very complex program responsible for making the job easier for many programmers. Indeed, it hides the complexity of translating an easy-to-use high-level language into a less understandable machine-dependent code. Commercial compilers usually provide features beyond the definition given above, such as debugging support or even an Integrated Development Environment (IDE), and these features are commonly included in the functionality of the compiler itself. However, the interest of this section is in understanding the principal function of translation: how it is performed, which issues may affect it and how the generated code can be improved. To achieve this goal, it is necessary to start by describing the internal structure of a compiler.
2.3 Compiler organisation

In a very simple form, a modern compiler can be thought of as a three-layer structure where the output of one layer is the input of the next. These three layers, namely the front-end, the optimiser and the back-end, perform different tasks that influence the ultimate code generated. A possible structure for a modern compiler is shown in Figure 2.2. The functions performed by each layer are:
The Front-end is responsible for the lexical analysis and parsing. It checks if the source
code satisfies the static constraints of the language in which it is implemented. Finally, it
converts the code into a more convenient form called Intermediate Representation (IR).
The Optimiser takes the intermediate representation of the program and applies a series of
transformations that can possibly improve the object code.
The Back-end receives the representation of the transformed code and converts it into a
language that the specific machine can understand. This language explicitly deals with the
management of physical resources available in the target architecture.
Figure 2.1: Basic form of a compiler. (Diagram: Source code (Program) → Compiler → Object Code (Executable).)
Although Figure 2.2 presents compilation as a sequential process, this division of labour between the layers is sometimes not clearly delimited. For example, after applying some transformations it may be necessary to perform additional analyses on the modified code. However, it is instructive to bear in mind that there are three essential tasks performed by a compiler: analysis and translation of the code into an intermediate representation, improvement of the intermediate representation with the help of several program transformations, and translation of the code into a machine-understandable language.
2.4 The purpose of a compiler

Having explained the organisation of a compiler as a three-layer structure with well-defined functions, two questions can be raised about the final output of the compilation process. Firstly, what has been gained by applying this process? Secondly, does the compilation of the program make sense if the meaning of the initial code is changed?
Figure 2.2: A possible structure of a modern compiler. (Diagram: Source code → Front-end (convert the code into IR) → Optimiser (apply transformations) → Back-end (convert into assembly code) → Object code.)
The first question goes to the core of the compilation process. Even when the optimisation phase is not applied, a lot has been gained, because it is the compilation process that makes the program executable on the target machine. Certainly, if compilers were not available, programmers would have to write their applications directly in assembly code, which would clearly take much more time than using a high-level programming language1. Hence, it can be concluded that a compiler improves the initial code.
Now it is possible to answer the second question with a simple statement: if the compilation process changes the meaning of the initial program, all the effort invested by the compiler is useless. Unquestionably, a compiler "must preserve the meaning of the program being compiled" [Cooper and Torczon, 2004]. This preservation is usually referred to in the literature as the correctness of the compiler.
2.5 An Optimising Compiler

So far, the idea of a modern compiler has been introduced as a structure composed of three layers: the front-end, the optimiser and the back-end. The functions of the front-end and the back-end have been clearly explained, but the role of the optimiser has been purposely described in a general way by using the term improvement. The following subsections discuss what optimisation means and why it is important in the compilation process.
2.5.1 Goals of Compiler Optimisation

Compiler optimisation is mainly concerned with the ability of the compiler to optimise or improve the generated code rather than with enhancing the compilation process itself. Therefore, although it may be very important to reduce the compilation time, and indeed this is an issue to take into account in compiler optimisation, the first goal of an optimising compiler is to discover opportunities in the program being compiled to apply transformations that improve the code. Improvement, in this case, can refer to several objectives depending on what the user actually requires. For example, the user might be interested in reducing the
execution time of a program. However, the goal could also be to generate code that occupies the least possible space, or even to trade off speeding up the program against reducing the size of the generated code. Additionally, it might also be desirable to guarantee an efficient use of the resources of the target machine, for example memory, registers and cache. In general, there is no single objective in compiler optimisation. Nevertheless, compiler optimisation is commonly associated only with the purpose of speeding up programs, and indeed this is the goal the present dissertation pursues. Therefore, the term optimisation will henceforth denote the effect of reducing the execution time of programs.

1 A person with a basic knowledge of computer science might argue that programmers could use interpreters. However, interpreters are also programs that transform source code into machine-understandable code. The difference is that they translate and execute line by line rather than the whole program at once, as compilers do. Additionally, it is widely accepted that compiled code is much more efficient than interpreted code.
Compilers can improve the execution time of programs by carrying out subtasks such as minimising the number of operations executed by the program, managing computational resources (cache, registers and functional units) efficiently and minimising the number of accesses to main memory. These tasks, jointly executed by the compiler, can greatly benefit not only the end-user of the program but also programmers and even hardware designers. The end-user logically profits from optimisation because the application executes faster. Programmers can ignore the details of producing appropriate code for a particular machine and concentrate on high-level structures and good application design. Finally, hardware designers can be confident that compilers will appropriately exploit the capabilities of their products.
2.5.2 Considerations for program transformations

As expressed above and depicted in Figure 2.2, a compiler can optimise programs by applying code transformations, which will hopefully improve the generated code by reducing its execution time. However, there are several issues to consider when applying a particular transformation: correctness, profitability and compilation time.
2.5.2.1 Correctness: As for the general compilation process, the transformation applied must be correct. The principle is basically the same: the code produced by a transformation must preserve the meaning of the input code. In other words, if the meaning of the program changes, the transformation should not be applied. Correctness is also referred to in the literature as the legality of the transformation. Bacon et al. (1994) provide a more formal definition of legality:
"A transformation is legal if the original and the transformed programs produce exactly the same output for identical executions."
However, in this case, the legality of the transformation is not sufficient. Since we are considering transforming the code in order to optimise the program, a transformation must also be profitable.
2.5.2.2 Profitability: A transformation applied to a particular fragment of a program must improve the ultimate code generated. In other words, applying a particular transformation is expected to produce an improvement in the execution time of the program. This is in fact the purpose of applying a transformation for optimisation but, in many cases, the effect of transforming a program is not noticeable or, even worse, it degrades the performance of the final code generated2.
Finally, although sometimes neglected, compilation time is also an important factor in determining whether a transformation is actually beneficial. If the transformation considerably increases the compilation time of the program, there should be serious doubts about applying it. Ideally, there should be a trade-off between the improvement in execution time and the increase in compilation time.
2.5.3 The process of transforming a program for optimisation

The transformation of a program for optimisation can be divided into three steps as follows:

2.5.3.1 Identification: The compiler has to identify which parts of the code can be optimised and which transformations may be applied to them.

2.5.3.2 Verification: The legality of each transformation must be ensured, i.e. it must be checked that the transformation does not change the meaning of the program.

2.5.3.3 Conversion: This refers to the process of actually applying a particular transformation.
2 Certainly, if all transformations guaranteed an improvement in the performance of the final code, there would be no reason for research in this area or for the present dissertation. Therefore, the questions of how and when to apply a particular transformation constitute the major problem in compiler optimisation.
The transformation process is depicted in Figure 2.3.
Of these three steps, the identification phase, i.e. the process of recognising fragments of the program susceptible to optimisation and finding transformations that can potentially improve the code, is the most complicated. Compiler writers commonly rely on heuristics to determine these transformations and the parameters that define them. However, given that many parameters may be involved and that one transformation may affect the applicability of subsequent transformations, these heuristics commonly lead to suboptimal solutions.
2.5.4 The problem of interaction

As explained above, compiler optimisation is a very complex problem, not only because many parameters may be involved in each transformation but also because one transformation can impede or enable the applicability of another. The latter is known as the effect of interaction among the different transformations. Certainly, the actual performance of the compiled code depends on the outcome of the interactions among all the transformations applied during the compilation process. In general, while some transformations are demonstrably beneficial for a particular program when applied independently, they can degrade the execution time of the program when executed in sequence.
Thus, considering that optimisation problems such as register allocation and instruction
scheduling are NP-complete in themselves and that different transformations can interact
with each other, the compiler optimisation problem is far from being optimally solved.
Figure 2.3: Program transformations for optimisation. (Diagram: Code (IR) → Identification (which part of the code?) → Verification (is it legal?) → Conversion (apply) → Transformed code.)
2.5.5 Types of program transformations

Program transformations can be split into two classes depending on the part of the compiler in which they are applied. Thus, they can be classified as:

2.5.5.1 Machine-independent transformations: These convert an intermediate representation (IR) of the program into another intermediate representation. Consequently, the code generated does not depend on a specific machine or architecture. However, since the code produced is not the final code and may be subject to further changes, the profitability of these transformations cannot be ensured. High-level transformations that eliminate redundancies, remove unreachable or useless code or enable other transformations can be considered to belong to this group. Loop unrolling is, in general, an example of a machine-independent transformation.
2.5.5.2 Machine-dependent transformations: Machine-dependent transformations are
also called machine-level transformations. They convert the intermediate representation of
the program directly into assembly code. Thus, the code generated is tied to a specific
architecture. Those transformations that consider particularities of the target architecture
belong to this group, for example, instruction scheduling, instruction selection and register
allocation.
2.5.6 The scope of optimisation

Program transformations can be applied at different levels of the code. For example, they can be applied to statements, basic blocks3, innermost loops, general loops, procedures (intra-procedural) and the whole program (inter-procedural). As the level of granularity grows, the complexity of the analysis increases and applying a particular transformation becomes more costly, because it increases the compilation time of the program. Loop-level transformations are very important, as loops are considered to be the places where programs spend most of their time.
3 A basic block can be defined as straight-line code; in other words, a block of code with no branches.
Table 2.1: Some common transformations for compiler optimisation (taken from [Bacon et al., 1994])

Loop Transformations: Loop Interchange, Loop Skewing, Loop Reversal, Loop Blocking (Tiling), Loop Pushing, Loop Fusion, Loop Peeling, Loop Code Motion, Loop Normalisation, Loop Unrolling.
Memory Access Transformations: Memory Alignment, Array Expansion, Array Contraction, Scalar Replacement, Code Co-location, Array Padding.
Redundancy Elimination: Unreachable Code Elimination, Useless Code Elimination, Dead Variable Elimination, Common Subexpression Elimination, Short-Circuiting.
Procedure Call Transformations: Frame Collapsing, Procedure Inlining, Parameter Promotion.
Partial Evaluation: Constant Propagation, Constant Folding, Algebraic Simplification.
2.5.7 Some common transformations

To give an idea of the type and number of program transformations available for compiler optimisation, some of them are shown in Table 2.1 (taken from [Bacon et al., 1994]). Many of these transformations are very well known and studied in the literature; for a good review of their meaning and implementation, see [Bacon et al., 1994].

As it is the focus of the present dissertation, loop unrolling is described in more detail in section 2.6.
2.6 Loop Unrolling

Loop unrolling is a very straightforward but powerful program transformation mainly used to improve Instruction Level Parallelism, ILP (the execution of several instructions at the same time), and to reduce the overhead due to loop control. Although extensively studied in the literature ([Dongarra and Hinds, 1979] and [Davidson and Jinturkar, 2001]), it remains of interest to the compiler community. Understanding the interaction of the parameters that may affect it, as well as its effects on the execution time of the ultimate code, is still a challenge.
2.6.1 Definition

Loop unrolling is the replication of the loop body a certain number of times u, called the unroll factor. As the loop body is replicated, the loop control code must be adjusted to guarantee that the loop body is executed exactly the same number of times as in the rolled (original) version. Additionally, to handle the leftover iterations, a prologue or epilogue is added before or after the unrolled loop. To illustrate how loop unrolling works, the example shown in Figure 2.4, taken from [Bacon et al., 1994], depicts a loop that has been unrolled twice. The notation used does not belong to a specific language, although it is very similar to Fortran 77 except for the way elements of arrays are accessed.
The left side of Figure 2.4 shows the original loop, composed of only one statement (the loop body), with an iteration step of 1 (the default). The right side shows the loop unrolled using a factor u=2. Thus, the loop body is replicated twice4, the array accesses are modified accordingly and the iteration step is changed to 2. Since the value of the trip count (n-2) is not known at compile time, an epilogue has been added to guarantee that the unrolled loop performs the same iterations as the original loop.
2.6.2 Implementation

Loop unrolling can be implemented by hand (manually) or by the compiler (automatically). It is manually implemented, as in Figure 2.4, either by the programmer or by a software transformation tool that works on top of the compiler. It can be automatically implemented by the compiler on the source code, on an intermediate representation of the program or on the assembly code (back-end), i.e. on an optimised version of the program. Implementing loop unrolling at the source code level or on an intermediate representation can be more profitable than at the back-end of the compiler; for example, it might make the code more susceptible to the application of other program transformations. However, it can also be unfavourable, because it may impede the application of other transformations. Arguably, one of the reasons to implement loop unrolling at the back-end of the compiler is that its profitability can be almost ensured when it is applied to one of the latest representations of the program. An important point to note is that, if judiciously applied, unrolling is an always-legal transformation.

4 There is no general agreement in the literature about whether the rolled version of the loop corresponds to u=1 or u=0. The former notation, which is believed to be more understandable, will be used throughout this dissertation.

Original loop:
    do i=2, n-1
      a[i] = a[i] + a[i-1] * a[i+1]
    end do

Loop unrolled twice:
    do i=2, n-2, 2
      a[i] = a[i] + a[i-1] * a[i+1]
      a[i+1] = a[i+1] + a[i] * a[i+2]
    end do
    ! epilogue
    if (mod(n-2,2) = 1) then
      a[n-1] = a[n-1] + a[n-2] * a[n]
    end if

Figure 2.4: Original loop (left) and loop unrolled by a factor u=2 (right) (taken from [Bacon et al., 1994])
In general, loop unrolling can offer several advantages that may improve the execution time
of programs. However, these benefits can be diminished by some side effects of the
transformation.
2.6.3 Advantages of loop unrolling

As mentioned above, loop unrolling can be considered a beneficial transformation because it may:

o Improve Instruction Level Parallelism (ILP). Instruction Level Parallelism refers to the ability of the hardware, aided by the compiler's scheduling, to execute multiple instructions simultaneously. Hence, if the size of the loop body is increased, the number of instructions that can be scheduled out of order5 also increases, so that more instructions can be executed in parallel [Monsifrot et al., 2002].

o Reduce the overhead due to loop control. Loop overhead is caused by the increments of the loop variable, the tests applied to this variable and the branch operations. All these operations are reduced because the smaller number of iterations is compensated by the replication of the loop body; for example, in Figure 2.4 the loop overhead is halved. Therefore, if the loop executes a considerable number of iterations, the improvement in execution time due to the reduced loop overhead is appreciable.
Additionally, loop unrolling can also:

o Enable other transformations. Loop unrolling applied at an early stage can give the code the appropriate shape for other transformations, for example common subexpression elimination.

o Eliminate loop copy operations. Copy operations are sometimes necessary when the loop calculates a value that is needed in a subsequent iteration, which is known in compiler jargon as a loop-carried data dependency. These copy operations can be eliminated by unrolling.

o Improve memory locality (registers, data cache or TLB6). Memory locality is improved when local memory resources are accessed efficiently. For example, on the right side of Figure 2.4 the values a[i] and a[i+1] are each used twice, so the number of loads per original iteration is reduced from 3 to 2.

5 This term refers to the situation in which instructions are not executed in the specific order given by the program.
2.6.4 Disadvantages of loop unrolling

By far the major drawback of loop unrolling is that it can degrade the performance of the instruction cache. Parameters such as the size of the loop body after unrolling (which depends on the unroll factor used) and the size, organisation and replacement policy of the instruction cache may mean that some instructions cannot be kept in the instruction cache and must be fetched from main memory, which can be thousands of times slower than the cache. When this happens, cache misses are said to have occurred. The number of cache misses can certainly affect the final execution time of the program and diminish any benefit from loop unrolling; it may even deteriorate the execution time completely and slow down the program.
Besides degrading the instruction cache, loop unrolling can have other negative effects. For example, as the loop body becomes bigger, the number of instructions increases, which may augment the number of address calculations and make the instruction scheduling problem more complex. Moreover, additional loads and stores may be needed, causing a greater demand for registers: the register pressure is said to have increased. Register pressure is a measure of the ratio between the registers demanded by a program and the actual number of registers available on a particular machine. It is bad news if the register pressure becomes much greater than 1, because some registers' values then have to be saved into main memory and the registers freed for other purposes; in this case, the registers are said to have been spilled [Bacon et al., 1994].
6 TLB stands for Translation Lookaside Buffer. It is a table in the processor that maps virtual addresses into real addresses of memory pages that have been recently referenced.
Finally, another disadvantage of loop unrolling is that it may prevent other optimisation techniques: after loop unrolling, some transformations may no longer be profitable or simply no longer applicable.
2.6.5 Interactions, again

It might seem rather odd that one of the advantages of loop unrolling is that it enables some transformations while one of its drawbacks is that it limits the applicability of others. The explanation of this apparent contradiction lies in the already familiar hurdle of interactions. The interactions between most compiler optimisation techniques are still poorly understood, and the results can vary depending on the input program, the target architecture and the very high-dimensional space of transformations and their parameters.
2.6.6 Candidates for unrolling

Having explained the most important features of loop unrolling and described its positive and negative effects, one can form a general idea of which types of loops are candidates for unrolling in order to improve the execution time of programs. However, it is easier to state which loops are not good candidates. In general, loops with a very low trip count, a large body, procedure calls or branches inside them are not very suitable for unrolling [Nielsen, 2004]. Nevertheless, these features are rather ambiguous: it is difficult to specify precisely what a low trip count or a large loop body means, or how many procedure calls and branches matter. Thus, as expressed above, loop unrolling remains an area of great interest for the compiler community.
2.7 Summary

This chapter has presented the most important issues in compilation and compiler optimisation necessary to understand later chapters and the purpose of this dissertation. The organisation of a compiler and the process of applying program transformations have been described, emphasising the principles of correctness and profitability. As it is the focus of the present dissertation, loop unrolling has been studied in detail, explaining why it is an important program transformation, what its advantages and disadvantages are, and why the problem of determining when and how to apply this transformation is a challenge.
Additionally, important terminology about compilers has been introduced throughout this chapter. Expressions such as legality, intermediate representation, front-end, back-end, basic blocks, cache misses and register spilling were explicitly defined. Finally, the problem of interactions among the different optimisation techniques has also been mentioned and described as a hurdle for the compiler optimisation problem.
Chapter 3
Data Collection
3.1 Introduction

One of the most difficult obstacles to applying machine learning to compiler optimisation is
the process of generating clean, reliable and sufficient data. As explained in chapter two,
compiler optimisation is related to the improvement of the ultimate code generated by the
compiler in a specific way. In this case, improvement refers to the reduction in the execution
time of programs. Therefore, for any approach that attempts to apply machine learning
techniques in order to reduce the execution time of programs, the process of creating or
evaluating the data is strongly dependent on this execution time.
In fact, regardless of whether the solution proposed is based on supervised or unsupervised
learning, the data that is utilised to build a specific model must arise in some way from the
execution time of the programs or from the execution time of parts of them. For example, it
was explained in Chapter One that Stephenson et al. (2003) used Evolutionary Computing to
construct priority functions. For this approach, the process of evaluating the candidate
solutions may require a very long time. Indeed, it is necessary to evaluate the whole population of functions on different programs in order to ascertain which of them are beneficial and actually lead to an improvement in performance. Similarly, it was also mentioned in Chapter
One that Stephenson and Amarasinghe (2004) used Nearest Neighbours with the aim of
predicting the best unroll factor for loops. The label for each loop was constructed by finding
the unroll factor that guaranteed its minimal execution time. Each experiment was executed
thirty times due to the variability of the measurements. Although not specified in the original
paper, let us consider the ideal case where the interactions between loops are negligible and
the programs are executed by using the same unroll factor for all the loops on each run.
Using a maximum unroll factor of eight, each program should be executed at least 30 x 8 =
240 times. As in the traditional approach used by compiler writers when building heuristics
for optimisation, machine learning techniques for compiler optimisation should also work on
a set of benchmarks that can be considered representative for specific tasks and challenging
for current computers. Normally, these benchmarks have at least hundreds of lines and can
take in general about 1 or 2 minutes for a normal execution. Hence, if a program must be run
240 times and its execution time is one minute, obtaining the data for the loops within the
program will take about 4 hours. Now, say for simplicity that on average a program contains
about 20 loops that can be considered for unrolling. If one wanted to include 1000 loops in
the training data, it would take about (1000 loops) x (1 program / 20 loops) x (4h) = 200
hours. Obviously, this number can significantly increase with the preparation of the
programs for a particular transformation, the compilation time, the effect of the
instrumentation and the process of analysing and cleaning the data. Consequently, generating
sufficient data for this type of application can be a time-consuming activity.
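The cost arithmetic above can be made explicit with a short sketch. The figures (30 repetitions per configuration, a maximum unroll factor of 8, one minute per run, 20 candidate loops per program, 1000 target loops) are the illustrative assumptions stated in the text, not measurements.

```python
# Estimated cost of generating training data, using the figures
# assumed in the text (illustrative, not measured values).
runs_per_factor = 30     # repetitions to smooth measurement noise
max_unroll = 8           # unroll factors 1..8
minutes_per_run = 1      # assumed typical benchmark execution time

runs_per_program = runs_per_factor * max_unroll               # 240 runs
hours_per_program = runs_per_program * minutes_per_run / 60   # 4.0 hours

loops_per_program = 20   # assumed average number of candidate loops
target_loops = 1000      # desired size of the training set

programs_needed = target_loops / loops_per_program            # 50 programs
total_hours = programs_needed * hours_per_program             # 200.0 hours
print(runs_per_program, hours_per_program, total_hours)
```

Note that this estimate excludes preparation, compilation, instrumentation overhead and data cleaning, so it is a lower bound on the real cost.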
However, there are still questions to be asked about the meaning of sufficient. Are 1000 loops sufficient for learning? Should there be more, or could there be fewer? These are very
difficult questions and, unfortunately, the present dissertation does not attempt to answer
them as they are beyond the scope of this project. Nevertheless, other issues involved in the
data collection process are also important and need to be mentioned. Particularly, the details
about how the experiments were carried out in order to generate the data must be explained.
This not only contributes to a better understanding of the present dissertation but also allows
other researchers to replicate the results obtained here.
As clearly recognised by the data mining community, at least 60% of the time in a data
mining application is devoted to understanding, analysing, pre-processing and cleaning the
data. This project is not an exception and great effort has been invested in generating clean
and appropriate data as well as in performing its analysis and pre-processing. Whilst the
following chapter deals with the data analysis phase, this chapter presents the most relevant
factors that were taken into account in carrying out the experiments and generating the data
available for learning. Initially, the benchmarks that were used to construct the dataset are
described in section 3.2. The implementation used for loop unrolling is explained in section
3.3, providing useful information about the granularity of the instrumentation and the
assumptions involved in this data generation process. The general process that has been
followed is explained in section 3.4. Subsequently, some technical issues regarding the
platform, the compiler and the optimisation level used for the experiments are provided in
section 3.5. The results of the data collection are summarised in section 3.6 and the features
extracted from the loops are described in section 3.7. The final representation of a loop
composed of its features and execution times is presented in section 3.8. Finally, a summary
of the data collection process and the experiments performed is given in section 3.9.
3.2 The Benchmarks

Three important factors may influence the selection of the programs that are used to generate the data for compiler optimisation with machine learning: programming language, type of application and execution time.
Programming language: Several programming languages can be considered for applying
machine learning to compiler optimisation. In fact, other researchers have included programs
written in Java [Long, 2004] and Fortran 77 [Monsifrot et al., 2002], or have used
benchmarks from a mixture of sources such as Fortran 77, Fortran 90 and C [Stephenson and
Amarasinghe, 2004]. Although it is possible to have different programming languages in a
set of benchmarks and to include the language in which the programs are written as an
additional feature of each loop, it is worth focusing only on Fortran 77 for at least two
reasons. Firstly, there is a great deal to be said for optimising programs that are written in
Fortran, as it is considered the scientific programming language and most high-performance
computing applications have been developed in this language. Secondly, Fortran 77 lacks pointers, whose presence can hinder the application of some program transformations. Other issues, such as the portability of the code to different platforms, can also affect the choice of the programming language.
Type of application: Benchmarks are designed to investigate the capabilities of different platforms and architectures on particular tasks. For example, some benchmarks are
demanding on floating point operations, others are challenging on integer operations or
target specific applications such as graphics or digital signal processing. In principle, it
would be valuable if one could consider a wide variety of applications and include them in
the set of programs to analyse. However, as explained above, time limitations generally
place constraints upon the types of benchmarks that may be used. For this project, since we
are not considering a specific target such as network applications or multimedia programs, it
is important to focus on numerical applications that are demanding enough for current
computers. Therefore, the benchmarks should be as realistic as possible, given the final goal
of building a solution that can generalise over real programs.
Execution time: Execution time plays an important role in choosing the benchmarks, given the time constraints one faces when generating the data for learning. There is certainly a trade-off between how challenging a program is for a particular machine and how long it takes to execute on that machine. Programs for which the execution time is very low will
probably not be rich in the features necessary to build a model that can perform properly on novel data. On the other hand, programs rich in features and demanding for a particular machine will probably take a long time to execute, and one could not afford to include them in the dataset. However, caution must be taken with programs that have a long execution time: it may be attributable to the size of their input rather than to the richness of any demanding features they may have. In other words, sometimes it is the input of a benchmark that makes the execution time long, rather than the complexity of the program itself. Therefore, such programs might not be important to analyse either.
An additional consideration for choosing the programs when building the dataset for
compiler optimisation is the bias introduced by selecting specific types of applications. For
example, one could include some very simple computational kernels such as matrix
multiplication for which a transformation like loop unrolling is known to be beneficial to
some extent. It would then be possible to create more examples by varying some of the parameters of the kernels, such as the trip count and the array sizes. Since the final goal of the
learning technique is finding a solution that provides an improvement in performance for the
programs, the results will be biased towards these simple kernels. However, it is also true that numerical applications may include such computational kernels, and a lot can be gained by having them in the training data. Therefore, it is profitable for the learning technique to have these simple kernels along with other, more complex programs.
Having considered the issues that may affect the choice of the benchmarks, the programs
used for the experiments belong to the suites SPEC CFP95 [SPEC95] and VECTORD
[Levine et al., 1991]. The benchmarks taken from [SPEC95] are scientific applications
written in Fortran. These numerical programs, intensive in floating point operations,
represent a variety of real applications and are still challenging for current computers. One of
these benchmarks, namely 110.applu, was significantly affected by the instrumentation, and another benchmark, 145.fpppp, did not have appropriate loops to be unrolled. Therefore, they were discarded, and only eight of the ten possible programs from this suite have
been used. The suite VECTORD [Levine et al., 1991] contains a variety of subroutines
written in Fortran intended to test the analysis capabilities of a vectorising compiler. These
subroutines include different types of loops whose features may be encountered in other
applications.
The description of each benchmark, specifying the name, the number of lines, the number of subroutines, the application area and the specific task performed, is shown in Table 3.1 (for the SPEC CFP95 benchmarks this information was obtained from the official web site [SPEC95]).

Benchmark   | # Lines / # Subroutines | Application Area                       | Specific Task
101.tomcatv | 190/1    | Fluid Dynamics / Geometric Translation | Generation of a two-dimensional boundary-fitted coordinate system around general geometric domains
102.swim    | 429/6    | Weather Prediction                     | Solves shallow water equations using finite difference approximations
103.su2cor  | 2332/35  | Quantum Physics                        | Masses of elementary particles are computed in the Quark-Gluon theory
104.hydro2d | 4292/42  | Astrophysics                           | Hydrodynamical Navier-Stokes equations are used to compute galactic jets
107.mgrid   | 484/12   | Electromagnetism                       | Calculation of a 3D potential field
125.turb3d  | 2101/23  | Simulation                             | Simulates turbulence in a cubic area
141.apsi    | 7361/96  | Weather Prediction                     | Calculates statistics on temperature and pollutants in a grid
146.wave5   | 7764/105 | Electromagnetics                       | Solves Maxwell's equations on a Cartesian mesh
vector      | 5302/135 | Variety of vectorial routines          | Tests the analysis capabilities of a vectorising compiler

Table 3.1: Description of the Benchmarks
3.3 Implementation of loop unrolling

As explained in section 2.6.2, loop unrolling can be implemented at the source code level, on an intermediate representation of the program or at the back-end of the compiler. For the experiments that generated the data in this project, a software framework that works on top of the compiler has been used; in other words, loop unrolling has been implemented at the source code level. This framework, developed in [Fursin, 2004], is mainly written in Java and provides a platform-independent tool to assist the user in compiler optimisation. The software, based on "feed-back directed program restructuring" ([Fursin, 2004], page 63), searches for the best possible code transformations and their parameters in order to minimise the execution time of programs.
The unrolling algorithm used is a generalised version of the algorithm described in section
2.6.1 and it is shown in Figure 3.1 (taken from [Fursin, 2004]). In this generalised version,
the loop body is replicated u times in the first loop and an additional loop is introduced to
control the leftover operations.
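The generalised transformation can also be illustrated with a small executable sketch (written here in Python rather than Fortran, with a hypothetical callable S standing for the loop body): for any unroll factor, the replicated main loop plus the additional clean-up loop must visit exactly the same iterations as the original loop.

```python
def original(n, S):
    # original loop: apply the body S for i = 1..n
    for i in range(1, n + 1):
        S(i)

def unrolled(n, S, u):
    # main loop: body replicated u times, stepping by the unroll factor
    i = 1
    while i + u - 1 <= n:
        for k in range(u):
            S(i + k)
        i += u
    # additional clean-up loop: leftover iterations i..n
    for j in range(i, n + 1):
        S(j)

# check equivalence for an arbitrary trip count and unroll factor
a, b = [], []
original(10, a.append)
unrolled(10, b.append, 3)
assert a == b  # same iterations, in the same order
```

The clean-up loop is what makes the transformation legal when the trip count is not a multiple of the unroll factor.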
3.3.1 Which loops should be unrolled?

As in other approaches ([Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004]), only innermost loops were chosen to be unrolled. Although this may be the most common choice throughout the literature, one should bear in mind that there are some cases where unrolling outer loops can be beneficial, as explained in [Nielsen, 2004]. However, in order to keep the complexity of the transformer low and to guarantee the legality of the transformations, outer loops were not considered. This restriction is not severe; many innermost loops can be significantly improved by unrolling.
3.3.2 Initial experiments

Due to the characteristics of the transformation framework used to perform unrolling, the initial experiments executed to generate the data were based on program-level timing. Firstly, innermost loops are chosen from a particular program and the maximum unroll factor is set to eight (U = 8). The framework runs a program U times, corresponding to the different unroll factors for a loop, recording the execution time of the whole program for each run. Subsequently, the best unroll factor found for that loop is fixed and the software executes the program for each unrolled version of the following loop. This process is repeated until all the
loops are unrolled. In this case, it is said that the framework follows a systematic search
strategy. This strategy works under the assumption that there is no interaction between loops.
In other words, it means that fixing an unroll factor for a specific loop will not severely
affect the performance of the execution of another loop. Furthermore, there is one advantage
of having program-level profiling for measuring the execution times: the intrusion caused by
the instrumentation is negligible. Indeed, since only the execution time for the whole
program is measured, the performance of each loop is minimally affected by this
instrumentation. However, there is a major drawback to carrying out the experiments in this
way: the time needed to obtain the data for a specific program grows with the maximum
unroll factor used and with the number of loops considered for unrolling. Thus, the number
of times a program must be executed in order to be analysed is U x L, where U is the
maximum unroll factor and L is the number of loops to be analysed. If, for example, a
program contains forty loops and the maximum unroll factor used is eight, the program will
need to be executed 40 x 8 = 320 times. If one wished to include a considerable number of programs, the process of generating the data would be extremely time-consuming. Despite this fact, the initial experiments followed this approach. Unfortunately, an additional problem was discovered after obtaining the data for all the benchmarks: for most of the programs, the improvement found by the search strategy was only comparable to the variability of their execution times. In signal-processing terms, the signal of improvement was swamped by the noise. This fact ruled out any criterion for selecting the loops and including them in the dataset. Therefore, it was necessary to move to a loop-level granularity.
3.3.3 Loop level profiling

Unlike program-level profiling, which measures the execution time of the whole program, loop-level profiling measures the execution time of each loop within the program. In the case of determining the execution time after unrolling, timers are inserted around each loop and the program is run U times (once for each unroll factor), setting the same unroll factor for all the loops during each run. As in the case of program-level profiling, there is an assumption of independence in this process: there are no runs that involve different unroll factors for different loops. Indeed, analysing all the possible combinations of unroll factors for all the loops within a program would not be feasible. Additionally, there is a severe reduction in the number of times a program must be executed: in this case the number of runs is equal to the maximum unroll factor used. This dramatically reduces the time invested in obtaining
the data, given that it does not depend on the number of loops a program may have. There is, however, a great shortcoming in this process: the effect of the instrumentation. Since timing functions, with their own variables and instructions, are inserted around the loops in order to determine their execution time, the performance of a loop may be affected by this instrumentation. For example, some loop instructions cannot be kept in the cache because instructions belonging to the instrumentation code already occupy that space. The effect of the instrumentation is especially noticeable for loops that are called many times within the program, although it is compensated for when the loops take a considerable amount of time to execute. Therefore, caution must be taken when selecting the loops to be included in the dataset, avoiding those loops for which the execution time and/or trip count is very low.
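The difference in cost between the two profiling granularities is easy to quantify: program-level profiling needs U x L runs, whereas loop-level profiling needs only U, independently of the number of loops. The functions below merely restate this counting argument.

```python
def runs_program_level(max_unroll, num_loops):
    # systematic search: one sweep of unroll factors per loop
    return max_unroll * num_loops

def runs_loop_level(max_unroll, num_loops):
    # one run per unroll factor; all loops are timed simultaneously,
    # so the count is independent of num_loops
    return max_unroll

# the example from the text: 40 loops, maximum unroll factor 8
print(runs_program_level(8, 40))  # 320
print(runs_loop_level(8, 40))     # 8
```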
Original loop:

do i = 1, n
  S1[i]
  S2[i]
  ...
end do

Unrolled loop (unroll factor = u):

! loop body replicated u times
do i = 1, n, u
  S1[i]
  S2[i]
  ...
  S1[i+1]
  S2[i+1]
  ...
  S1[i+u-1]
  S2[i+u-1]
  ...
end do

! processing the remaining elements
do j = i, n
  S1[j]
  S2[j]
  ...
end do

Figure 3.1: Generalised version of loop unrolling (taken from [Fursin, 2004])
3.4 Generating the targets

As explained above, the process of generating the targets, i.e. the execution times, followed a loop-level granularity. The process starts by preparing the benchmarks for the transformation tool and ends after running each program with the different unroll factors. The steps involved in this process are briefly explained as follows.
3.4.1 Preparing the benchmarks
Some of the benchmarks may not be appropriately handled by the transformation tool.
Specifically, there may be some loop constructions that are problematic for the transformer.
They must be converted into a form that the framework can properly manage.
3.4.2 Selecting loops

Having prepared a specific program for the framework, the next step is selecting the loops that are believed to be appropriate for unrolling. In general, to avoid introducing bias into the data, the only restriction is that they must be innermost loops. Additionally, loops with calls to subroutines were not considered, given the difficulty of determining the actual effect of unrolling when other loops may be involved within the called subroutine.
3.4.3 Profiling

The loops must be profiled in order to calculate their execution time and to identify which loops are insignificant. Those loops for which the execution time is very low are considered insignificant because no real improvement can be measured for them.
3.4.4 Filtering

After the profiling step, it is possible to determine which loops should not be included in the dataset. As mentioned above, the instrumentation causes an intrusive effect that may severely affect those loops with a low trip count or low execution time. In general, if the execution time of a loop is less than a threshold T, the loop is discarded.
3.4.5 Running the search strategy

This step implies running the benchmarks for each unroll factor. There must be a maximum unroll factor U common to all the loops and benchmarks.

The process of generating the targets followed the steps explained above, with a filtering threshold T = 0.4 seconds and a maximum unroll factor U = 8. Additionally, each benchmark (for all unroll factors) was executed ten times, henceforth referred to as the number of runs R = 10, in order to account for the variability of the execution times.
In order to completely describe the data collection process it is necessary to mention some
technical issues regarding the hardware and software resources that were utilised to execute
the experiments.
3.5 Technical Details
3.5.1 The platform

A dual Intel(R) XEON(TM) machine running at 2.00 GHz with 512 KB of level-2 cache and 4 GB of RAM has been used for the experiments. The operating system installed on this machine is Red Hat Linux (kernel 2.4.20-24.9).
3.5.2 The compiler

The GNU Fortran compiler [G77], gcc-3.2.2-5, has been used. As remarked in section 2.6.5, the interactions between loop unrolling and other transformations constitute a potential problem. Hence, the optimisation level chosen for the experiments was -O2 with no additional flags. Unrolling at the compiler level was switched off by avoiding the options -funroll-loops and -funroll-all-loops; otherwise the compiler could apply unrolling to the already unrolled code.
3.5.3 The timer's precision

The transformation framework utilises the C function clock() in order to compute the execution time at the loop level. This function was found to have a precision of 0.01 seconds on the machine used for the experiments; that is, the minimum elapsed time this function can detect is 0.01 seconds. In other words, it would not be
possible to detect a difference between two runs smaller than this precision. Considering that the threshold used for filtering the loops is 40 times the precision on this particular machine, the timer's precision does not represent a problem for this project.
3.6 The results in summary

The results of the data collection process are shown in Table 3.2. In order to facilitate its execution, the suite VECTORD [Levine et al., 1991] has been divided into four different programs, given the independence among its subroutines. The execution times for the original code and for the instrumented code (with timers) are given, as well as the contribution of each benchmark to the dataset in terms of the number of loops. An important fact to notice from Table 3.2 is that for most of the benchmarks the execution time is not severely affected by the instrumentation of the code. In fact, only three benchmarks, namely 107.mgrid, 125.turb3d and 141.apsi, experienced a notable increase in their execution time. The benchmark most influenced by the instrumentation was 125.turb3d, due to the great number of times some of its loops are called within the program. However, unlike 110.applu, which was discarded, 125.turb3d was retained, as it could affordably be executed several times and most of its loops have an acceptable execution time.
3.7 Feature extraction

It has been explained so far how the execution times of the selected loops have been generated for different unroll factors and multiple runs. Explicitly, the results of the process described above correspond to the execution time of each loop after unrolling using u = 1...U. Each execution was repeated R times in order to account for its variability for a specific unroll factor. Hence, the data collection process generated L x U x R execution times, where L is the number of loops, U is the maximum unroll factor considered and R is the number of repetitions of each run. Bearing in mind that the approach followed in this project is to create a regression model able to learn the improvement in performance of the execution time of loops, these magnitudes have been called the targets. Therefore, a regression model must be able to learn a function7 for which the output is a value based on the targets for a specific loop and

7 Actually, as will be explained in chapter 5, for each unroll factor a different model will be built, i.e. there will be U functions, each corresponding to a different unroll factor.
for which the input is a characterisation of this loop. This section describes the features
extracted from the programs that were used to characterise the loops.
Benchmark   | Original (sec.) | Instrumented (sec.) | # Loops
101.tomcatv | 40.8  | 45.2  | 5
102.swim    | 39.9  | 43.0  | 3
103.su2cor  | 51.3  | 53.5  | 15
104.hydro2d | 61.4  | 61.6  | 35
107.mgrid   | 44.5  | 102.5 | 15
125.turb3d  | 73.3  | 594.7 | 12
141.apsi    | 54.6  | 161.2 | 23
146.wave5   | 42.5  | 59.4  | 25
Vectord_1   | 146.3 | 146.4 | 51
Vectord_2   | 148.3 | 148.7 | 49
Vectord_3   | 163.9 | 167.9 | 13
Vectord_4   | 160.0 | 160.2 | 2
Total number of loops: 248

Table 3.2: Results of the data collection process
One of the most important issues in a data mining application is selecting the right features for learning. In fact, the characterisation of a problem should be carried out by selecting those features that are believed to influence the targets to be learnt. As explained in Chapter Two, there are many factors that may influence loop unrolling, such as hardware components of the target architecture, other code transformations applied after unrolling and characteristics of the program itself. It is not unrealistic to characterise a loop for unrolling based on its static features, i.e. those that can be determined at compilation time. However, it must be emphasised that dynamic features, i.e. those that are determined at execution time, are also important and may not be captured by the static representation; one example is the number of cache misses. The characterisation of loops in this project is mainly based on
static features and only two of them, namely the trip count and the total number of times the
loop is called, were determined during the execution of the programs. This loop abstraction
is mostly based on the description presented by [Monsifrot and Bodin, 2001] and [Monsifrot
et al., 2002]. The loop characterisation presented by [Stephenson and Amarasinghe, 2004] is
not applicable to the present approach given the differences in the implementation of loop
unrolling.
The features extracted to characterise the loops are shown in Table 3.3. Feature 1 (called) is
the total number of times the loop is called within the program. Therefore, it represents the
number of times the outer loops are executed and the subroutine containing the loop is
called. The size (feature 2) of the loop refers to the number of statements within the loop. A
statement may contain one or more lines. The trip count (feature 3) of the loop determines
the number of times the body of the loop is executed. Given that for most of the loops this
feature is unknown at compilation time, it was determined during the execution of the
programs. For some loops it was found that the trip count was variable depending on some
parameters of a subroutine. In these cases a weighted average was calculated. Feature 4
considers the number of calls to proper functions of the language. Feature 5 (Branches)
refers to the number of if statements within the loop. Feature 6 is the nested level of the loop.
Features 7 and 8 represent the number of array accesses within the loop depending upon
whether an array element is loaded or stored. A less straightforward feature is the number of
array element reuses (feature 9). It attempts to measure dependency among different
iterations. Although dependency analysis is a very complicated topic in compilers, a simple
approach has been taken. The number of reuses of a particular array is computed as the total
number of elements involved in its update when it is controlled by the iteration variable. An
example is given in Figure 3.2 where the number of reuses is three. Finally, feature 10
(Floating) considers the number of floating-point operations and feature 11 (indAcc)
represents the number of indirect array accesses. An indirect array access occurs when an
array is used as an index of another array.
do i = 1, N
  a[i] = a[i] + a[i-1] * a[i+1]
end do
Figure 3.2: An example of a loop containing three array element reuses
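This counting rule can be approximated mechanically. The sketch below is a simplification for illustration, not the actual implementation used by the feature extractor: it counts the distinct elements of the updated array that are indexed by the loop variable (possibly with a constant offset).

```python
import re

def count_reuses(statement, array, loop_var):
    # count the distinct elements of `array`, indexed by the loop
    # variable plus an optional constant offset (e.g. a[i-1]), that
    # take part in the array's own update
    pattern = rf"{array}\[({loop_var}(?:[+-]\d+)?)\]"
    return len(set(re.findall(pattern, statement)))

# the example of Figure 3.2: a[i], a[i-1] and a[i+1] are involved
print(count_reuses("a[i] = a[i] + a[i-1] * a[i+1]", "a", "i"))  # 3
```

A full dependency analysis would be far more involved; this simple count is in the spirit of the approximation described in the text.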
Index Name Description
1 Called The number of times the loop is called
2 Size The number of statements within the loop
3 Trip The trip count of the loop
4 Sys The number of calls to proper functions of the language
5 Branches The number of if statements
6 Nested The nested level of the loop
7 Loads The number of loads
8 Stores The number of stores
9 Reuses The number of array element reuses
10 Floating The number of floating point operations
11 IndAcc The number of indirect array accesses
Table 3.3: Features extracted to characterise loops
3.8 The representation of a loop

The features described above, together with the execution times over R repetitions for each unroll factor, compose the dataset constructed by the data collection process. Figure 3.3 shows the representation of a datapoint, where the specific unroll factor used has been omitted for simplicity. In other words, for each unroll factor a loop has a representation of the form shown in Figure 3.3.

x1 x2 ... xN | t1 t2 ... tR
(features)     (execution times over R repetitions)

Figure 3.3: The representation of a datapoint
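A datapoint of this form can be sketched as a simple structure: for each unroll factor, a loop contributes a feature vector of length N together with its R repeated timings. The feature values and timings below are purely illustrative.

```python
# One datapoint per (loop, unroll factor): N features plus R timings.
N_FEATURES = 11   # the features of Table 3.3
R_RUNS = 10       # repetitions per configuration

def make_datapoint(features, timings):
    assert len(features) == N_FEATURES and len(timings) == R_RUNS
    return {"x": list(features), "t": list(timings)}

# a hypothetical loop: feature vector and ten noisy timings (seconds)
x = [120, 8, 400, 0, 1, 2, 5, 2, 3, 6, 0]
t = [1.52, 1.50, 1.53, 1.49, 1.51, 1.52, 1.50, 1.54, 1.51, 1.50]
dp = make_datapoint(x, t)
print(len(dp["x"]), len(dp["t"]))  # 11 10
```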
3.9 Summary

This chapter has described the data collection process and has remarked upon the difficulties that appear during the generation of data when applying machine learning to compiler optimisation. In general, constructing a dataset for this purpose can be a time-demanding activity, and great effort has been invested in this project in order to generate clean and reliable data. Hence, competitive benchmarks have been chosen in order to keep the bias towards simple applications as small as possible. Loop unrolling has been applied to innermost loops within these benchmarks using a framework developed in [Fursin, 2004]. Only loops with sufficient execution time have been taken into account, in order to reduce the effect of the instrumentation. The general process utilised for generating the targets (execution times for loops) consists of preparing the benchmark, selecting loops, profiling and filtering the loops, and running the programs using different unroll factors. Unroll factors from 1 to 8 were used and each run was repeated 10 times in order to account for variability. After determining which loops should be included, a feature extraction process was performed over each loop using a set of 11 features selected mainly on the basis of static characteristics of the loops. Finally, the representation of a datapoint (loop), composed of features and execution times, has been summarised.
Chapter 4
Data Preparation and Exploratory Data Analysis
4.1 Introduction
The last chapter described the data collection process and provided technical and
methodological details about how the experiments were carried out in this project. These
experiments have generated the data that in principle will be available for learning. As has
been highlighted, the construction of a dataset that can be used by learning techniques with
the aim of optimising the execution time of programs is a time-demanding activity.
Consequently, great effort has been invested in producing a considerable amount of data that
can be reliably used by machine learning techniques. This effort has mainly been focused on
determining the criteria used for choosing the benchmarks; the preparation of these
benchmarks; the selection of appropriate loops to be unrolled; the generation of the
execution times (also called the targets) and the selection of correct features that describe the
loops and constitute the characterisation of the problem. Unfortunately, the raw data
produced by the data collection process is not suitable to be directly used by any machine
learning technique and it needs to be pre-processed and refined. There are several reasons for
this. Firstly, it must be recalled that the ultimate goal of loop unrolling is optimising a
program by determining whether a particular unroll factor is beneficial or not, i.e. if
unrolling a loop a specific number of times represents a significant improvement with
respect to the case of maintaining the original (rolled) version of the loop. Hence, it is
necessary to apply a validation process to the data in order to establish if there is an actual
improvement in performance to be learnt. In other words, if unrolling does not yield a
potential reduction in the execution times of the loops that are included in the dataset, the
process of modelling the data and trying to learn from it is useless. The second reason to
avoid using the unprocessed data that has been collected is that it can be very difficult to
learn from the pure execution times. Indeed, since the interest of this project is mainly
focused on predicting the effectiveness of loop unrolling, the execution times do not directly
reflect the improvement that needs to be learnt. Therefore, these execution times must be
transformed into another more convenient representation that explicitly indicates how good
or bad the transformation is for a particular loop. Furthermore, some loops that were not
removed after being profiled may still need to be discarded from the dataset, as their mean
execution time is low. Similarly, some measurements deviate strangely from the distribution
of the execution times for a particular loop under a specific unroll factor. These outliers
should also be removed from the dataset as they may deteriorate the learning process. The
final reason for which pre-processing the data is essential relates to the transformation of the
features that constitute the representation of the loops. In fact, this representation must be
rescaled in order to prevent some features being considered more important than others.
This chapter tackles the problems mentioned above and provides an insight into the data that
will be used for modelling. The rest of this chapter is organised as follows. Section 4.2
explains how the data is processed throughout different stages (by using several software
resources) and is made available for analysis and learning. Section 4.3 provides a formal
representation of the data and introduces the notation that will be used in subsequent
sections. Section 4.4 analyses the data and validates its suitability in order to be used by
machine learning techniques. Section 4.5 presents the importance of pre-processing the
execution times, giving details about the elimination of some loops from the dataset, the
transformation of the targets and the detection and treatment of outliers. Section 4.6 explains
how the features are also pre-processed in order to facilitate the application of learning
techniques. Finally, section 4.7 summarises the most important aspects of this chapter.
4.2 The general framework for data integration
It was explained in chapter three how the benchmarks selected for this project were used to
generate the execution times at a loop-level profile with the aid of a software tool that works
on top of the compiler. From a more general perspective, the generation of the targets and
the selection of the features for loops are only two steps during the whole process of making
the data available for analysis and learning. In fact, the framework used for applying loop
unrolling and obtaining the execution times, which will be referenced in this section as EOS
[Fursin, 2004], produces a set of files that must be parsed and from which the targets must be
extracted. In order to automate this task, a program written in Java, referenced in this section
as Java Parser, was developed. This program reads the files produced by EOS and transforms
the results into easy-to-load text files. The files produced by the Java Parser are loadable by
Matlab® subroutines that integrate them with the features extracted from the programs.
Furthermore, these subroutines make possible the analysis, exploration and pre-processing of
the data and communicate with modelling subroutines also written in Matlab®. This process
is shown in Figure 4.1.
4.3 Formal representation of the data
The data that has been generated by repeatedly executing the programs using the unrolled
version of the loops can be denoted as t(u)j,k, representing the execution time of the jth run
(j = 1, 2, …, R) of loop k (k = 1, 2, …, L), which has been unrolled u times (u = 1, 2, …, U).
Here, the notation used in a great part of the literature about loop unrolling has been adopted,
where an unroll factor of one (u = 1) corresponds to the original version of the loop, i.e. with
no unrolling at all.
Similarly, the features explained in section 3.7 describe a particular loop. Hence, the
characterisation of loop k can be denoted as a vector xk for which the elements xi,k
(i = 1, 2, …, N) represent the features that compose the loop.
Figure 4.1: The general framework for data integration (benchmarks are processed by EOS to produce execution times, which the Java Parser converts into raw data; Matlab subroutines integrate this raw data with the features and perform the pre-processing and modelling that yield the results)
4.4 Is this data valid?
The first question to be answered before starting to pre-process the data that has been
collected is whether this data can actually be useful and suitable for applying learning
techniques. Therefore, the aim of this section is to ascertain the validity and suitability of the
data by finding out if the improvement in performance of the loops for unroll factors
different from one (u=1) represents an actual reduction of the execution times and is not only
a consequence of the variability of the data. In section 3.3.2 it was explained that the data
resulting from the initial experiments using program-level profiling was discarded. In fact,
the improvement obtained by loop unrolling in most of the benchmarks was only comparable
to the variability of the execution times under no unrolling. Therefore, there were no criteria
to select the loops that should be included in the dataset. The problem to be explained in this
section is essentially the same but the aim now is to determine if the final data obtained
represents a potential improvement that could be predicted by machine learning techniques.
In other words, considering that the final goal in compiler optimisation is related to the
realisation of speed-ups on a set of programs, the focus here is on the maximum possible
improvement that may be reached by any learning technique. Although the detrimental effect
of loop unrolling is also important to this research, the data will prove inappropriate should
no loops be found to be improved by unrolling.
4.4.1 Statistical analysis
To enable the assessment of the suitability of the data, the measurements have been repeated
ten times in this project. A sensible approach to establishing the appropriateness of the data
aims to validate the significance of the improvement with the aid of statistics. The objective
is to determine how many loops in the dataset are improved by unrolling and when this
improvement is statistically significant. Using the notation introduced in section 4.3, a
particular loop k has associated R execution times (due to repetitions) for each unroll factor
u. Therefore, it is reasonable to compare the R measurements for each unroll factor with the
execution times for the rolled (original) version of the loop. Although it may be possible to
apply a t-test to validate the hypothesis of different means for u = 1 and u > 1, that is, to
test H0: t̄(1)k = t̄(u)k for each u > 1, it would be necessary to apply U − 1 tests, which can
considerably increase the probability of drawing at least one incorrect conclusion. Therefore,
a one-way analysis of variance (one-way ANOVA) followed by a multiple comparison
(multi-comparison) procedure has been applied. Additionally, only improvements in performance
but not detriments have been considered.
In general, the purpose of ANOVA is to test if the means of several groups are significantly
different. In our case, the groups are the unroll factors considered. Thus, ANOVA tests the
null hypothesis that the means do not differ by using the p-values of the F-test. Essentially,
the F-test measures the between-groups variance compared to the within-groups variance.
Therefore, if the variance between groups is considerably greater than the variance within
groups, there are serious doubts about the null hypothesis. Since we are interested in
establishing if there is a significant difference between the mean execution times of the
original loops and the unrolled versions of the loops, a multi-comparison procedure is
needed. Multi-comparison procedures also determine when this difference is positive (i.e. an
improvement in performance) or negative (i.e. a detriment in performance). All the tests
were executed at the 5% level of significance and Tukey's honestly significant difference
criterion was used as the critical value for the multi-comparison procedure (for a complete
description of ANOVA and multiple comparison procedures see [Neter et al., 1996] pages
663-701 and pages 725-738).
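Although the project performed this analysis in Matlab, the core of the procedure can be illustrated with a minimal pure-Python sketch of the one-way ANOVA F-statistic (the Tukey multi-comparison step, which requires studentized-range critical values, is omitted here):

```python
from statistics import mean

def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group variance divided by
    within-group variance, for a list of groups of measurements
    (here, one group of execution times per unroll factor)."""
    k = len(groups)                            # number of groups
    n = sum(len(g) for g in groups)            # total measurements
    grand = mean(x for g in groups for x in g)
    # Between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares (n - k degrees of freedom)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two unroll factors with clearly separated execution times: the variance
# between groups dwarfs the variance within them, so F is very large and
# the null hypothesis of equal means is in serious doubt.
F = one_way_anova_F([[1.00, 1.01, 0.99], [0.80, 0.81, 0.79]])
assert F > 100
```

In practice F is compared against the critical value of the F distribution at the chosen significance level (5% in this project) rather than a fixed constant.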
In order to provide a better understanding of whether unrolling is considered significantly
beneficial or not regarding the procedure explained above, Figure 4.2 shows the box plots of
the execution times for two different loops. The lower and upper lines of the boxes represent
the 25th and the 75th percentile of the execution times for each unroll factor. The horizontal
lines in the middle of the boxes are the medians. The dashed lines (also called whiskers)
represent the variability of the execution times throughout R=10 repetitions for each unroll
factor, where the top is the maximum execution time and the bottom is the minimum. Thus,
the upper part of Figure 4.2 shows the case of a loop for which unrolling does not correspond
to a significant improvement in performance. In fact, although the execution times for u=3,
u=5 and u=6 are less than the minimal for u=1, they are not considered significant due to the
variability of the measurements. The opposite case is shown in the lower part of Figure 4.2.
For this loop, the variability of the execution times is low and the improvement in
performance due to loop unrolling, e.g. for u=4 is significant. It is clear for this case that loop
unrolling is beneficial as there is no overlapping between the execution times for u=1 and for
u=4.
The procedure explained above was applied to all the loops in the dataset and the results are
summarised in Table 4.1. The contribution to the dataset of each benchmark in terms of the
total number of loops and in terms of the number of loops for which unrolling causes a
statistically significant improvement of performance is shown. It can be seen in Table 4.1
that the number of loops that can be improved by unrolling in the SPEC CFP95 [SPEC95]
suite is considerably less than in the VECTORD [Levine et al., 1991] benchmarks.
Furthermore, 42% of the loops may be significantly improved by loop unrolling in the whole
set of benchmarks. It is necessary to emphasise that the loops included in Table 4.1 can also
be negatively affected by unrolling because some unroll factors may be detrimental to the
performance of a specific loop. However, since we are analysing the case of how many loops
in the dataset can be improved by loop unrolling, it is possible to conclude that the reduction
in the execution time of the loops included in the dataset is not only caused by the variability
of the measurements but also by the effect of the transformation. Thus, the data that has been
collected can be used by machine learning techniques in order to model its improvement in
performance.
Until now, it has been considered whether there is a positive impact of loop unrolling on the
set of benchmarks. However, it may be of interest to formulate two additional questions
regarding the behaviour of the execution times that have been collected for this project.
Primarily, is there any negative impact of the transformation on these benchmarks?
Secondly, how much can unrolling affect the execution time of these loops? To answer these
questions, it is necessary to explain the pre-processing stage of the execution times, i.e. how
the targets have been transformed in order to facilitate their analysis and to appropriately
apply the modelling techniques.
4.5 Pre-processing the targets
As indicated by [Han and Kamber, 2001], pre-processing the data can significantly improve
and ease the application of learning techniques. In this case, we are interested in eliminating
some execution times that should not be considered because of their low magnitude;
transforming these execution times into other values that may be more appropriate for
learning; and detecting some strange data-points (outliers) that are somehow anomalous and
do not follow the general behaviour of similar data-points.
Figure 4.2: Insignificant (top) and significant (bottom) effect of loop unrolling
Benchmark   | # Loops | # Loops with improvement for u > 1
101.tomcatv | 5       | 2
102.swim    | 3       | 1
103.su2cor  | 15      | 5
104.hydro2d | 35      | 3
107.mgrid   | 15      | 1
125.turb3d  | 12      | 4
141.apsi    | 23      | 6
146.wave5   | 25      | 3
Vectord1    | 51      | 38
Vectord2    | 49      | 39
Vectord3    | 13      | 1
Vectord4    | 2       | 2
Total       | 248     | 105
Table 4.1: Number of loops that may benefit from unrolling
4.5.1 Filtering
In section 3.4.4, profiling permitted the recognition of loops with a low execution time and
their removal from the set of loops that were considered for unrolling. However, after the
data collection process, some loops with very low execution time remained, having gone
undetected during that phase. Since all the repetitions are now available at this stage, it is
possible to determine the behaviour of the execution time for the original (rolled) version of
a specific loop. Therefore, a similar criterion may be established to remove loops with low
execution time from the dataset.
Let us consider the R repetitions of the execution time for the original version (with no
unrolling) of loop k, t(1)j,k. A straightforward criterion to eliminate or maintain this loop can be
ascertained by deciding whether its mean is greater than a threshold T or not. Thus, the
mean can be computed by:

    t̄(1)k = (1/R) · Σ(j=1..R) t(1)j,k    (4.1)
Therefore, the criterion to remove or maintain a loop is:

    discard loop k if t̄(1)k < T; maintain it otherwise
Setting T to 0.4 seconds, all the loops were considered for this filtering. Table 3.2 and Table
4.1 already reflect these results.
Finally, some loops were found to have zero execution time for u≠1, probably due to an
inappropriate unrolling applied by the transformation tool. Therefore, they were also
removed from the dataset.
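Both checks described above can be sketched in a few lines of Python (the project used Matlab; `keep_loop` and its signature are illustrative):

```python
from statistics import mean

T = 0.4  # threshold in seconds, as used in the dissertation

def keep_loop(rolled_times, unrolled_times):
    """Filtering criterion: discard the loop if the mean execution time of its
    original (u = 1) version falls below T, or if any unrolled version reports
    a zero time (indicating an inappropriate transformation)."""
    if mean(rolled_times) < T:
        return False
    return all(t > 0 for times in unrolled_times for t in times)

assert keep_loop([0.5, 0.52, 0.48], [[0.45, 0.46]]) is True   # mean 0.5 >= T
assert keep_loop([0.1, 0.12, 0.11], [[0.09, 0.10]]) is False  # mean below T
assert keep_loop([0.5, 0.5, 0.5], [[0.0, 0.4]]) is False      # zero time found
```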
4.5.2 Dealing with outliers
An outlier is a data-point that is somehow anomalous and does not follow the general
behaviour of similar data-points. Detecting and dealing with outliers is a challenge and may
be dependent on the characteristics of the application. For example, the most common
approach is eliminating the outliers from the dataset, although they may be of interest in
other applications such as fraud detection. Strictly speaking, an outlier refers to a point in the
dataset, which in the case of this project represents a loop. However, the treatment of
outliers in this project does not aim to identify and remove loops from the dataset, but to
identify and discard those repetitions of the execution times for a particular loop that
considerably deviate from the general tendency.
Although visualisation techniques may help with the detection of outliers, a statistical and
automated approach has been followed. Let us consider loop xk and its R execution times for
a specific unroll factor u: t(u)j,k, with j = 1, 2, …, R. An execution time for this particular loop
is considered an outlier if its value is greater than the 75th percentile plus 1.5 times the
inter-quartile range. Similarly, an execution time less than the 25th percentile minus 1.5 times the
inter-quartile range is also considered an outlier. In other words, an execution time that is
notably greater or smaller than the median is identified as an outlier. These outliers may be a
consequence of the effect of the instrumentation or a result of other factors such as the
activation of low-priority processes in the operating system. Figure 4.3 shows a loop that
contains two outliers. For the unroll factor u=6 an outlier is discovered (indicated by the sign
'+'), as it is considered to be significantly greater than the median execution time for that
unroll factor. Similarly, for u=7 a low execution time that is significantly less than the
median is also considered an outlier.
Approximately 26% of the loops for each unroll factor were determined to have outliers.
Finally, 4.3% of the execution times were detected as outliers and they were removed from
the dataset.
4.5.3 Target transformation
As will be explained in Chapter Five, the ultimate goal of applying a learning technique to
the dataset constructed is predicting whether unrolling is a beneficial transformation for a
particular loop or not and how much it can affect the execution time of the loop. Therefore, it
stands to reason that if the raw execution times are directly used for modelling, it will be
Figure 4.3: An example of outliers
difficult to learn the parameters involved in loop unrolling and successfully predict its effect
on a particular loop. In fact, the absolute execution times for a loop considered throughout all
unroll factors do not explicitly reflect how good or bad the transformation is for a specific
loop. Therefore, these execution times need to be transformed into a set of values that
directly represent their relative performance with respect to the execution times under no
unrolling.
Thus, let us consider the term t̄(1)k as a statistic that correctly represents the execution times
for loop k under no unrolling (u = 1) over all the runs, i.e. the mean given by equation 4.1 or,
if preferable, the median. Therefore, the percentage of improvement with respect to this
statistic can be obtained as follows:

    y(u)j,k = −100 · (t(u)j,k − t̄(1)k) / t̄(1)k    (4.2)
where j = 1, 2, …, R and u = 1, 2, …, U. Each of these magnitudes can be negative or positive,
depending on whether the execution time for a particular loop under a specific unroll
factor u and run j is greater or less than its corresponding statistic for no unrolling. The
negative sign in equation 4.2 has been used with the purpose of establishing a convention
that will be adopted throughout this document. A positive value for these magnitudes will
indicate an improvement in performance and a negative value will reflect a detrimental
effect. Hence, this equation not only provides a way to ascertain when unrolling a specific
loop is beneficial and how large this benefit is, but also when the effect of unrolling is
detrimental and how disadvantageous this transformation can be. Both cases are
of special interest for the compiler community.
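Equation 4.2 translates directly into code; a minimal Python sketch (using the mean of the rolled times as the reference statistic, as in equation 4.1):

```python
from statistics import mean

def improvement(t_ujk, rolled_times):
    """Equation 4.2: percentage improvement of one measurement t(u)j,k
    relative to the mean rolled (u = 1) execution time.
    Positive values indicate a speed-up, negative values a detriment."""
    t1_bar = mean(rolled_times)
    return -100.0 * (t_ujk - t1_bar) / t1_bar

rolled = [10.0, 11.0, 9.0]                 # u = 1 measurements, mean 10.0
assert improvement(8.0, rolled) == 20.0    # 20% faster: improvement
assert improvement(12.0, rolled) == -20.0  # 20% slower: detriment
```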
With the transformation denoted by equation 4.2 and applied to all the execution times
available from the benchmarks, it is possible to provide a mechanism that enables us to
answer the questions raised in section 4.4.1. Specifically, in order to analyse the negative
impact of loop unrolling on the data obtained and to measure the magnitude of this impact,
the data has been divided into U groups (one for each unroll factor), the mean improvement
in performance has been calculated and a histogram has been computed for each group. The
results are shown in figure 4.3. The y-axis represents the number of loops and the x-axis is
the mean percentage of improvement in performance. As highlighted above, a positive value
of x refers to an actual improvement and a negative value is a detrimental effect. Since the
unroll factor u=1 (when a loop is not unrolled) is the point of reference, no improvement is
found in this case. For the other unroll factors, it is clear that about 40% to 50% of the loops
are not noticeably affected by unrolling and their improvement is nearly zero. The number
of these loops decreases as the unroll factor becomes bigger and is compensated by the
number of loops for which the improvement is greater than 5%. In fact, while about 30% of
the loops have an improvement greater than 5% for u=2, about 40% of the loops show this
improvement for u=8. Additionally, only very few loops are dramatically improved by
unrolling. The number of these loops also increases with the unroll factor. Explicitly, about
1% of the loops have an improvement greater than 40% for u=2 and 6% of the loops show
this improvement for u=8. However, it is also necessary to mention the detrimental effect of
loop unrolling. Unrolling has a negative impact in about 20% of the loops. Furthermore, for
approximately 2% of the loops unrolling worsens the performance by more than 50% (when
u>2). In the worst case, though, the performance of some loops is deteriorated by more than
100%.
In conclusion, although a great number of loops are not appreciably affected by unrolling, it
is worth predicting when they are improved or deteriorated by the transformation and how
much the execution time of these loops may be influenced. The predictive modelling used to
tackle this problem will be explained in the following chapter. However, additional pre-
processing needs to be applied in order to make the data suitable for the prediction
techniques.
4.6 Pre-processing the features
4.6.1 Rescaling
The values of the features selected to characterise the loops may significantly influence the
learning techniques. For example, some features may outweigh others. A common approach
to diminish this effect and make the variables have similar magnitudes is rescaling each
feature to have zero mean and unit variance.
Thus, the update of feature i for loop k is obtained as follows:

    xi,k ← (xi,k − x̄i) / si    (4.3)

where i = 1, 2, …, N; x̄i is the mean and si is the standard deviation of the variable in the
training data. Namely:

    x̄i = (1/K) · Σ(k=1..K) xi,k    (4.4)

    si = sqrt( (1/(K−1)) · Σ(k=1..K) (xi,k − x̄i)² )    (4.5)

where K is the number of loops considered in the training data.
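Equations 4.3 to 4.5 amount to standard z-score standardisation; a brief Python sketch (the sample standard deviation in `statistics.stdev` uses the K − 1 denominator of equation 4.5):

```python
from statistics import mean, stdev

def rescale_feature(values):
    """Equations 4.3-4.5: standardise one feature over the K training loops
    so that it has zero mean and unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# e.g. one feature measured on K = 3 loops
scaled = rescale_feature([1.0, 2.0, 3.0])
assert scaled == [-1.0, 0.0, 1.0]
```

Note that in practice the mean and standard deviation must be estimated on the training data only and then reused to rescale any test data.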
4.6.2 Feature selection and feature transformation
Feature selection is the choice of a subset of features from all the variables available for
learning. Working with a great number of features can demand more training data that may
be difficult to obtain. Additionally, some features can be found to be irrelevant and the
variables that mainly influence the learning can be a reduced subset of them. Domain
knowledge or very well known techniques such as forward selection and backward
elimination help to identify those features that can be discarded without worsening the
performance of the learning algorithm. For example, from the set of 11 features presented in
Table 3.3, it may be suggested that only five variables are determinant for loop unrolling,
namely the number of floating point operations, the number of array element reuses, the
number of loads per iteration, the number of stores and the number of if statements.
Although the amount of features selected for this project is acceptable compared to the size
of the dataset, it is important to find which features are found to be the most important for
the learning techniques used. This will provide a better understanding of loop unrolling. The
best features for each technique and the method used to obtain them will be explained in the
following chapter.
Feature transformation is the process of creating new features in order to considerably
improve the performance of the learning techniques. Again, domain knowledge can help to
construct the right ones. For example, an intelligent variable to be constructed for applying
code transformations is the ratio of memory references to floating point operations. This feature
tells us how the memory is affected in relation to the number of operations performed per
iteration, and may not be easily found by a specific learning technique. How this feature is
explicitly constructed and its effect on the learning process will also be explained in the
following chapter.
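One plausible construction of such a ratio from the Table 3.3 features might look as follows (a hypothetical sketch only; the exact definition used in this project is deferred to the next chapter):

```python
def mem_to_flop_ratio(loads, stores, floating):
    """Hypothetical derived feature: memory references (loads + stores)
    per floating point operation in one iteration of the loop body.
    Guards against loops with no floating point operations."""
    return (loads + stores) / max(floating, 1)

# A loop with 8 loads, 3 stores and 6 floating point operations
assert abs(mem_to_flop_ratio(8, 3, 6) - 11 / 6) < 1e-12
```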
4.7 Summary
This chapter has provided an insight into the execution times resulting from the data
collection process and has described the need of the pre-processing stage in order to make
the data suitable for learning. The general framework in which the data has been prepared
and made available to the learning algorithms has been described. The execution times have
been analysed in terms of the improvement obtained by applying loop unrolling and it has
been found that about 42% of the loops that constitute the dataset can be, statistically,
significantly improved. This fact is important because it validates the appropriateness of the
benchmarks in order to build models that can potentially improve the performance of
programs. The pre-processing stage has been divided into two parts: the pre-processing of
the targets and the preparation of the features. The targets (the execution times) have been
filtered in order to guarantee that only loops with a sufficient execution time are maintained.
After discarding some loops due to their low execution times, the detection of outliers has
taken place and has been focused on identifying those loops for which some execution times
are considerably greater or smaller than the median for a specific unroll factor. These outliers
have also been eliminated from the dataset. As the final phase in the target pre-processing
stage, the execution times have been transformed into more suitable magnitudes that directly
indicate the improvement or detriment in performance of unrolling with respect to the mean
execution time of the original loop. These magnitudes are the actual targets for the learning
techniques explained in the following chapter. The new values obtained made possible a
deeper study of the effect of loop unrolling on the benchmarks used. Specifically, it was
found that approximately from 40% to 50% of the loops included in the dataset were not
considerably affected by unrolling and consequently, their improvement was nearly zero. It
was also discovered that a great number of loops were positively affected by the
transformation. Roughly, from 30% to 40% of the loops experienced a reduction in their
execution time when they were unrolled. Nevertheless, some loops (20%) were found to be
negatively affected by the transformation. These results provide a good understanding of the
appearance of the data and demonstrate how difficult it is to build a decision rule for unrolling
loops.
Figure 4.3: Number of Loops vs. Mean Improvement in Performance (%)
Finally, in order to avoid some features being considered more important than others for the
learning algorithms, each variable that characterises the loops was rescaled to have zero
mean and unit variance.
To conclude, with the results obtained by the analysis phase and the pre-processing stage
explained in this chapter, the data is ready to be used by learning algorithms in order to build
a model capable of predicting how beneficial or detrimental unrolling can be for a particular
loop.
Chapter 5
Modelling and Results
5.1 Introduction
Chapter Four has emphasised the importance of cleaning, analysing and preparing the data
that has been collected in order to make it suitable for learning. This suitability relates to the
maximum improvement in performance that may be achieved when applying machine
learning techniques. However, the term learning has not been explicitly defined within the
context of this project. Here, we are referring to learning from data. Thus, the aim is to build
a model able to learn from past examples by discovering the underlying structure of the data
in order to accurately predict on novel data. Therefore, in this project, learning is focused on
predictive modelling, i.e. given a dataset called the training set, the goal is to build a model
that can generalise and successfully perform on data that has not been seen before.
In general, predictive modelling can be thought of as the process of constructing a function
maps a vector of input variables (predictor variables) onto one or more output variables
(response variables) [Hand et al., 2001]. This function is constructed based on a training
dataset but is expected to successfully perform on new data. Predictive modelling is
commonly divided into two different approaches: classification and regression. In a
classification problem, the output variable is categorical. In other words, the targets, i.e. the
values of the output variable are discrete and are commonly referred as classes or labels. The
work done by [Monsifrot et al., 2002] and [Stephenson and Amarasinghe, 2004], described
in Chapter One, are examples of classification problems. The former attempted to solve the
binary classification problem of predicting whether loop unrolling is beneficial or not
(targets as classes 1 or 0) for a particular loop given its static representation (values of the
input variables). The latter was also focused on loop unrolling but stated the classification
task as a multi-class problem where the possible labels for the output variable are the unroll
factors that were considered from one to eight. Clearly, both cases aim to predict a
categorical variable based on a set of features of the loops. Thus, they used machine learning
techniques to achieve that goal. This project goes beyond the classification problem and
proposes a regression approach to predict the improvement in performance that unrolling can
produce on a particular loop. Undoubtedly, the regression approach is a more general
formulation of the problem and previous solutions represent particular cases. Furthermore,
the machine-learning solution to loop unrolling based on a regression method is smoother
than the one obtained by classification methods, given the degree of noisiness the
measurements may have. Indeed, the likely event of two loops with the same characterisation
(the same values of the set of features that represent them) having different best unroll
factors constitutes a great difficulty for the classification task. However, the case of two
loops with a slightly different improvement in performance but the same representation is
not a problem for the regression approach. Although it is not the main goal of predictive
modelling, it is advantageous if the results of the techniques used for regression are
readable and understandable, as this provides better insight into the problem. This chapter
presents the machine learning solution to loop unrolling based on the regression approach
and describes the results obtained with the techniques used. The organisation of this chapter
is presented below.
Section 5.2 explains in detail how the regression problem is formulated and why it is
considered a more general view compared to previous approaches that apply machine
learning techniques to loop unrolling. Section 5.3 presents an overview of the regression
methods that have been used in this project and section 5.4 provides information about the
parameters that are involved in each method. Section 5.5 describes the measure of
performance that has been used to evaluate the methods. Section 5.6 provides the
experimental methodology that was adopted to obtain the results. Section 5.7 presents and
evaluates these results and compares the solution found with previous work. Finally, section
5.8 summarises the most important theoretical concepts and experimental results presented
in this chapter.
5.2 The regression approach

The regression approach to applying machine learning to loop unrolling proposed in this
project attempts to build a function for a particular unroll factor u. This function aims to map
a representation of a loop onto an expected improvement in performance. In other words, the
input of this function is a vector of features for a specific loop and the output is the expected
improvement in performance that the loop can achieve after being unrolled u times. Two
important further clarifications need to be mentioned. Firstly, a different function is
constructed for each unroll factor, i.e. there will be U functions that represent the behaviour
of the performance of a loop when it is unrolled. Secondly, the output has been called an
improvement, but it can also represent a detriment in performance.
More formally, let us consider the magnitudes obtained by the transformation applied
to the execution times of loop k (described by equation 4.2), y^u_{kj} (j = 1, 2, ..., R). These values
represent the improvement in performance of loop k under a specific unroll factor u
throughout R repetitions. This particular loop is characterised by the N-dimensional vector of
features x_k that has been scaled to have zero mean and unit variance (see Equation 4.3).
Although it would be possible for learning algorithms to predict a vector of targets for a
specific loop, let us collapse the repetitions indexed by j into a single magnitude by
calculating their mean:
y^u_k = \frac{1}{R} \sum_{j=1}^{R} y^u_{kj}    (5.1)
where, for the sake of clarity, the bar over y^u_k has been omitted. However, it should
always be borne in mind that this magnitude represents the mean improvement in performance
throughout R repetitions. Thus, the data available for learning can be represented
as (x_k, y^u_k), k = 1, ..., L, where L is the number of loops that are considered in the training
data. Hence, a regression approach for this data attempts to construct a function of the
form f^u = \Phi(x, \beta^u), where \beta^u is the parameter vector of the model8. Thus, given a loop k
described by its feature vector x_k, it is possible to obtain f^u_k, which predicts its mean
improvement in performance when it is unrolled using the factor u, i.e. y^u_k.
Two advantages of this approach over previous work need to be highlighted: robustness and
generality, and they are described below.
Robustness: The classification solution may be severely affected by the noisiness of the
measurements and the limited number of examples. Indeed, as explained in Chapter One, the
binary classification approach described in [Monsifrot et al., 2002] is likely to have two
8 This notation assumes that a set of parameters is involved in the model. However, for non-parametric models, these parameters may not exist. For simplicity, we consider that the notation used can refer to both parametric and non-parametric models.
loops with the same feature vector but assigned to different classes. This represents a
potential problem for any classifier if it is not appropriately treated. Additionally, even in the
case of the multi-class approach presented in [Stephenson and Amarasinghe, 2004], the
learning process may become very difficult. Certainly, there may not be sufficient examples
that indicate the correct unroll factor for a type of loop and, therefore, the best unroll
factor will be erroneously predicted. These complications are not present in the
regression approach where the aim is to build a function that smoothly predicts the
improvement in performance. Consequently, the solution provided is more robust to the
intrinsic variability of the data.
Generality: The multi-class problem of establishing the best unroll factor for loops is a
particular case of the regression approach. Clearly, the most appropriate unroll factor to be
used for a specific loop can be easily determined by calculating the greatest improvement in
performance obtained by the different models constructed (u = 1, ..., U). The binary decision
of whether unrolling is beneficial or not is also straightforward.
5.3 Learning methods used

A wide variety of modelling techniques is available to solve regression problems. In general,
these techniques can be divided into two groups: parametric and non-parametric models.
Whilst the former makes assumptions about the underlying distribution that generates the
data, the latter is more data-driven and does not place constraints on the form of the function
that constitutes a good representation of the data. It is necessary to remark that in our context
a "good representation" does not necessarily imply the best fit for the training data but
relates to the predictive power the algorithm can have when novel examples are presented.
Parametric models may assume, for example, that the underlying function that generates the
data is linear on the parameters of the model. This function can be a linear combination of
the input variables or a combination of other functions such as polynomials, exponentials or
Gaussians. Important non-parametric methods include kernel
machines (Support Vector Regression), Decision Trees and Artificial Neural Networks. Other
techniques such as Instance-Based Learning (e.g. Nearest Neighbours methods) can also be
used for regression problems, although they are more popular for classification.
Although there is no perfect method that can be used for regression, some techniques may
be more appropriate for the problem this project deals with. Considering that the number of
examples in the training data is limited, very complicated methods such as Neural Networks
should be avoided. In fact, there is a latent danger of overfitting the data and, although this can
be diminished in Neural Networks by regularisation, this technique may demand more
training examples than those currently available. Another important issue for the present
research is to give an interpretation to the results obtained by a specific method. Thus,
methods for which interpretability of the results is very difficult or even impossible, such as
Neural Networks, are not convenient for this problem. Hence, it seems that Occam's razor is
applicable to this problem: "try the simplest hypothesis first". Certainly, the complexity of
the problem is unknown and a lot of effort can be saved if a simple model is found to be
sufficient to solve it. In other words, if it is possible to find a simple method with a good
performance on data that is not included in the training set, it should be preferable over other
more complex models with comparable performance.
Therefore, the first method used to carry out the predictive modelling in this project is Linear
Regression. Since the input for the regression problem in this project is multivariate, i.e. a
vector instead of a simple scalar magnitude, this method will be referred to as Multiple
Linear Regression. Multiple Linear Regression is a simple but extensively used technique to
predict a variable based on a linear combination of the input variables. Thus, this method
works under the assumption that there is no interaction among the predictor variables, i.e.
one variable does not have an effect on the others. Furthermore, it is a linear additive model,
i.e. the effects of the variables are independent and they are linearly added to produce the
final prediction. Although these assumptions seem very restrictive, Linear Regression has
been shown to perform well even in some cases where the relation between the predictor
variables and the response variable is known to be non-linear. Another reason for which Linear
Regression represents a good approach to be attempted is that the parameter vector obtained
by the model may indicate the relevance of each variable. Thus, with this linear model, it is
possible to ascertain the variables that most influence loop unrolling.
However, in many situations the relation between the input variables and the output variable
is very complex and cannot be properly modelled by linear techniques. In these cases, it is
necessary to apply a non-linear approach that may have better predictive power. To consider
the general case in which no assumptions about the underlying process that generates the
data are made, a non-parametric model has been adopted; namely, Decision Tree Learning
has been used as the second choice of the regression approach in this project. Being more
specific, the Classification and Regression Trees (CART) algorithm described in [Breiman et
al., 1993] has been applied. Besides modelling non-linear relations between the predictor
variables and the response variable, Decision Trees can provide good insight into the features
that are most important for loop unrolling, given that their results are produced in a
tree-structured fashion. However, it is also recognised that if the number of training examples is
not sufficient to provide a representative sample of the true population, Decision Trees can
also overfit the training data (see [Mitchell, 1997], pages 66-69). In general, Decision Trees
may suffer from fragmentation, repetition and replication [Han and Kamber, 2001], which
can deteriorate the accuracy and comprehensibility of the tree. Fragmentation occurs when
the rules created by the tree (its branches) are based on only a small subset of the
training examples and therefore tend to be statistically insignificant. Repetition takes place when one
attribute is evaluated several times along the same branch of the tree. Replication occurs
when some portions of the tree (subtrees) are duplicated. These drawbacks appear to be
serious when applying Decision Trees Learning to solve the present regression approach.
Nevertheless, several improvements have been developed in recent years in order to
overcome these problems. Essentially, if a good pruning algorithm is applied to the primary
tree created by the method and the performance remains roughly the same, these problems can be
diminished. Therefore, a lot can be gained with a non-linear and non-parametric technique
to solve the regression problem, provided one is especially cautious to avoid overfitting the data.
The following sections will present a brief overview of each method in order to provide a
better understanding of how they work and to describe the most important issues to take into
consideration for each of them.
5.3.1 Multiple Linear Regression

Multiple Linear Regression models the response variable based on a linear combination of
the predictor variables. In an N-dimensional input space, a linear regression model tries to fit
a hyperplane to the training data. Hence, a linear model for the regression approach can be
stated as:
f^u = \beta^u \cdot x + \beta^u_0    (5.2)

where the upper index u has been used to emphasise that one model must be created for each
unroll factor. The parameters of the model, also called regression coefficients, are \beta^u
and \beta^u_0: \beta^u is an N-dimensional vector corresponding to the coefficients of each variable,
and \beta^u_0 is the free parameter or bias.
Therefore, given a loop described by its N-dimensional vector of features xk, it is possible to
find the predicted mean improvement in performance under the unroll factor u by
calculating:
f^u_k = \beta^u \cdot x_k + \beta^u_0    (5.3)
In order to fit this model to the data given by (x_k, y^u_k), where k = 1, ..., K, and K is the
number of loops considered in the training set, it is possible to find the parameters of the
model that minimise the mean square error (MSE) between the predictions f^u_k and the
actual values y^u_k. These parameters can be found by:

\beta^u = (X^{uT} X^u)^{-1} X^{uT} y^u    (5.4)

where \beta^u is now an (N+1)-dimensional vector (its last component is \beta^u_0); X^u is a matrix of
dimensions K x (N+1) in which each row is a different data point and whose last column is
composed of ones; and y^u is a vector containing the actual target for each data point.
In general, in order to find the solution given by equation (5.4), the calculation of the inverse
is not necessary and numerical techniques of linear algebra are used instead.
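As a minimal numerical check of equation (5.4) on synthetic data, the closed-form normal-equations solution and the solver-based route coincide; the feature values and targets below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 30, 11                              # loops in the training set, features
X = np.hstack([rng.standard_normal((K, N)),
               np.ones((K, 1))])           # last column composed of ones
y = rng.standard_normal(K)                 # actual mean improvements y^u_k

# Equation (5.4): the closed-form least-squares solution.
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice the explicit inverse is avoided in favour of a
# QR/SVD-based solver, which is numerically more stable.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_lstsq)
```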
5.3.1.1 Understanding the regression coefficients

It is clear that the components of
the parameter vector \beta^u are the coefficients of the input variables. Recalling that the features
have been scaled to have zero mean and unit variance, the value of \beta^u_i indicates the effect of
the variable x_i when all the other variables are held constant. Thus, these coefficients measure
the importance of the variables in order to predict the mean improvement in performance. If,
for example, one coefficient of a variable is nearly zero, that variable can be considered an
irrelevant predictor.
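For illustration, the ranking this implies can be computed directly from a coefficient vector; here the u = 2 column of Table 5.2 (presented later in this chapter) is used, with the bias excluded.

```python
import numpy as np

# Coefficients of the fitted linear model for u = 2 (Table 5.2),
# bias excluded; feature names as used in Chapter 4.
names = ["Called", "Size", "Trip", "Sys", "Branches", "Nested",
         "Loads", "Stores", "Reuses", "Floating", "IndAcc"]
beta = np.array([-0.958, 3.318, -1.000, -0.838, 0.911, -0.716,
                 0.925, -3.911, 1.724, -1.113, -0.173])

# With standardised inputs, |beta_i| measures the conditional effect
# of feature i, so sorting by magnitude ranks the features.
order = np.argsort(-np.abs(beta))
ranking = [names[i] for i in order]
print(ranking[:5])   # → ['Stores', 'Size', 'Reuses', 'Floating', 'Trip']
```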
5.3.2 Classification and Regression Trees

Decision Trees are a learning technique commonly used for classification problems
although, as in the case of the CART algorithm, they are also applicable to regression problems.
The CART algorithm, comprehensively described in [Breiman et al., 1993], does not make any
assumptions about the distributions of the independent or dependent variables. The aim of
the algorithm is to produce a set of rules able to accurately predict the dependent variable
based on the values of the independent variables. The resulting rules can be seen in a tree-
structured fashion. It works by recursively partitioning the data into smaller subsets, testing
the value of one variable at each node. The criterion used to split the data each time is based
on an impurity measure and CART exhaustively tests all the variables at each node. At each
leaf of the tree (terminal node) the predicted value is calculated as the mean of the dependent
variable in the subset that meets the conditions of the respective branch. Given that the size
of the tree can grow considerably, at the risk of overfitting, CART computes the final tree
by determining the subtree with minimal cost using cross-validation. It is worth noting
that although computing the mean at the leaf nodes as the possible predicted values may
sound unsophisticated, the subsets corresponding to these terminal nodes are believed to be
as homogeneous as possible. Furthermore, after pruning, the final tree may provide
understandable rules that potentially give a better insight into the interactions among the
variables.
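The recursive procedure just described can be sketched in a few lines. This is a minimal illustrative regression tree, not the MATLAB Statistics Toolbox implementation the project actually used, and it omits CART's cost-complexity pruning step (depth and leaf-size limits stand in for it); the demonstration data at the end is synthetic.

```python
import numpy as np

def grow(X, y, depth=0, max_depth=3, min_leaf=5):
    """Minimal CART-style regression tree: recursively choose the
    (variable, threshold) split that most reduces the squared error;
    each leaf predicts the mean of its subset."""
    if depth == max_depth or len(y) < 2 * min_leaf:
        return {"leaf": y.mean()}
    best, base = None, ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):                 # exhaustively test variables
        for t in np.unique(X[:, j])[:-1]:       # candidate thresholds
            m = X[:, j] <= t
            if m.sum() < min_leaf or (~m).sum() < min_leaf:
                continue
            sse = ((y[m] - y[m].mean()) ** 2).sum() + \
                  ((y[~m] - y[~m].mean()) ** 2).sum()
            if sse < base:
                base, best = sse, (j, t, m)
    if best is None:                            # no split improves the fit
        return {"leaf": y.mean()}
    j, t, m = best
    return {"var": j, "thr": t,
            "lo": grow(X[m], y[m], depth + 1, max_depth, min_leaf),
            "hi": grow(X[~m], y[~m], depth + 1, max_depth, min_leaf)}

def predict(node, x):
    while "leaf" not in node:
        node = node["lo"] if x[node["var"]] <= node["thr"] else node["hi"]
    return node["leaf"]

# Synthetic demonstration: the target depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
tree = grow(X, y)
```

Each path from the root to a leaf corresponds to one of the readable IF-THEN rules discussed in section 5.7.1.2.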
5.4 Parameter setting

Unlike other machine learning methods, Multiple Linear Regression and CART require
minimal tweaking in order to determine the best parameters of the algorithms for which the
greatest performance may be obtained. Certainly, there are no parameters to tune in Multiple
Linear Regression. Since the implementation of CART provided in MATLAB's
Statistics Toolbox was used, the splitting criterion or measure of impurity was the
least-squares function. Additionally, pruning was performed by means of cross-validation.
5.5 Measure of performance used

In order to evaluate the performance of the algorithms used and to carry out comparisons
between them, the Standardised Mean Square Error (SMSE) has been used. It can be
computed by:
SMSE^u = \frac{\sum_{k=1}^{K} (f^u_k - y^u_k)^2}{\sum_{k=1}^{K} (y^u_k - \bar{y}^u)^2}    (5.5)
where y^u_k are the actual improvements in performance given by (5.1), f^u_k are the values
predicted by the model and \bar{y}^u is the mean improvement in performance over all the
K loops considered within the validation or test set. The SMSE describes how good the
predictions of the model are compared to the case of always predicting the mean, i.e. without
knowledge of the problem. Therefore, small values of SMSE indicate a good performance of
the algorithm and they are expected to be less than 1.
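Equation (5.5) translates directly into a few lines; this is a sketch only, since the project's experiments were run in MATLAB.

```python
import numpy as np

def smse(f, y):
    """Standardised Mean Square Error (equation 5.5): the squared error
    of the predictions f, normalised by the error of always predicting
    the mean of the targets y.  Values below 1 beat the mean predictor."""
    return float(np.sum((f - y) ** 2) / np.sum((y - y.mean()) ** 2))
```

By construction, perfect predictions give an SMSE of 0 and predicting the mean everywhere gives exactly 1.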
5.6 Experimental Design

This section describes the design of the experiments that were carried out in order to evaluate
the performance of the learning algorithms used. There are three objectives with the
realisation of these experiments:
1. To provide an interpretation of the parameters and the results obtained by the
methods used regarding the features that were considered for loop unrolling.
2. To compare the accuracy of the methods and determine the technique that obtained
the best performance.
3. To ascertain the quality of the predictions obtained and the possible impact of these
predictions on the set of benchmarks used.
5.6.1 Complete dataset

The first experiments were performed on the whole set of loops that were collected, i.e. all
loops were used for training and validation. Although the results in terms of
the performance of each method do not represent an expected performance on novel data,
these experiments were carried out in order to analyse the results obtained by each method
and to interpret the importance of the variables involved in loop unrolling.
5.6.2 K-fold cross-validation

Given that testing a model on the same dataset that was used for training does not represent a
real measure of performance of the machine learning methods used, the second set of
experiments were carried out using K-fold cross-validation. This procedure has proved
to be a good methodology for evaluating the performance of a learning algorithm and does not
require a dataset with a great number of examples. The procedure is as follows:
1. Divide the training data into K randomly chosen, non-overlapping
subsets (folds) S1, S2, ..., SK of approximately the same size.
2. Repeat for i varying from 1 to K:
a. Take the subset Si as the test set and the remaining subsets (S1, ..., Si-1, Si+1, ..., SK)
as the training set.
b. Train the methods (Linear Regression and CART) with the training set
previously built.
c. Evaluate the models constructed in the test set, i.e. determine the predicted
values for each model.
d. Assess the performance of each algorithm by using the SMSE.
The procedure explained above was executed for each unroll factor u (u = 1, 2, ..., 8) and the
number of folds used was K=10.
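The steps above can be sketched as follows; `fit` and `predict` are caller-supplied stand-ins for the learning method (hypothetical names), and the usage in the test below runs the procedure on synthetic noise-free linear data.

```python
import numpy as np

def kfold_smse(fit, predict, X, y, K=10, seed=0):
    """Sketch of the K-fold procedure: shuffle the examples, split them
    into K disjoint folds, train on K-1 folds, evaluate the SMSE on the
    held-out fold, and average over the K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        model = fit(X[train], y[train])
        f = predict(model, X[test])
        scores.append(np.sum((f - y[test]) ** 2) /
                      np.sum((y[test] - y[test].mean()) ** 2))
    return float(np.mean(scores))
```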
5.6.3 Leave One Benchmark Out cross-validation

Although the K-fold cross-validation procedure is widely accepted by the machine learning
community to evaluate the performance of learning algorithms, the real goal in compiler
optimisation with machine learning is to optimise complete programs that have not been seen
before. Therefore, an alternative procedure called Leave One Benchmark Out (LOBO)
cross-validation has been followed. This time, given a set of B benchmarks, the models are
trained on B-1 benchmarks and the performance of the algorithms is evaluated on the
benchmark that was not used for training. Therefore, it is the same cross-validation
procedure explained above, except that the subsets are not randomly chosen but are selected
to be the individual benchmarks S1, ..., SB. Hence, there are B folds instead of K, where B is the number
of benchmarks used (B=12).
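Generating the LOBO folds can be sketched as below, assuming each loop is tagged with the benchmark it came from; `benchmark_ids` is a hypothetical label array, not a structure from the project's actual pipeline.

```python
import numpy as np

def lobo_folds(benchmark_ids):
    """Leave One Benchmark Out: yield (benchmark, train, test) index
    arrays, where each test set is exactly the loops belonging to one
    benchmark and the training set is everything else."""
    ids = np.asarray(benchmark_ids)
    for b in np.unique(ids):
        test = np.flatnonzero(ids == b)
        train = np.flatnonzero(ids != b)
        yield b, train, test
```

With 12 benchmarks this produces 12 folds, regardless of how many loops each benchmark contributes.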
5.6.4 Realising speed-ups

The accuracy of the predictions calculated for each learning technique in terms of the SMSE
provides a measure that indicates the performance of the learning algorithms used for this
specific problem. However, it does not present the actual improvement obtained with the
values predicted on each loop and consequently, on each benchmark. The valid procedure to
be followed is to determine the best unroll factor for each loop based on the predictions
obtained and setting up the programs with the unrolled version of the loops in order to realise
the final speed-ups. However, time-limitations have constrained the present project to carry
out this task. Alternatively, with the initial data already collected and the results of LOBO
cross-validation, it is possible to obtain the expected improvement in performance that each
loop could experience under the assumptions of no interactions between loops and a
negligible effect of the instrumentation. This procedure has been adopted and its results are
described in section 5.7.4.
5.7 Results and Evaluation
5.7.1 Complete dataset

Multiple Linear Regression and the CART algorithm have been applied separately to the set of
benchmarks. The Standardised Mean Square Error (SMSE) for each model that has been
constructed is shown in Table 5.1. In the case of u=1 the SMSE by definition is 1 given that
this factor is considered the point of reference and there is no improvement over itself.
Recalling that the SMSE measures how good the predictions obtained by the model are
compared to always predicting the mean, Multiple Linear Regression shows
very little improvement over that baseline. The CART algorithm considerably
outperforms the linear regression model. Since the same dataset for training is used to
validate the model, this good performance is rather unrealistic and it may indicate that the
algorithm is overfitting the data. However, the application of the models to the same data
that was used for their construction is intended to be explanatory rather than to provide a measure
of accuracy. In both cases, we are trying to determine those variables that are considered
more important for each model.
5.7.1.1 Analysing the regression coefficients

As explained above, in the linear
regression model the parameter vector may provide an indication of the effect of each
predictor variable on the output variable. The parameter vector for the Linear Model
constructed using the whole dataset is shown in Table 5.2. For each model, the coefficients
of the five most important features are shown shaded, i.e. those features with the greatest
conditional effect (negative or positive).
The absolute conditional effect of each feature throughout all the models has been calculated
by:
\beta_i = \sum_{u=2}^{U} (\beta^u_i)^2    (5.6)
This has been used to determine the best five features and they are presented below.
SPEC CFP95
Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
MLR 1.000 0.898 0.929 0.911 0.890 0.905 0.898 0.909
CART 1.000 0.287 0.318 0.385 0.384 0.346 0.427 0.431
VECTORD
Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
MLR 1.000 0.898 0.971 0.932 0.970 0.965 0.968 0.943
CART 1.000 0.306 0.649 0.463 0.595 0.565 0.586 0.517
SPEC CFP95 + VECTORD
Method u = 1 u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
MLR 1.000 0.959 0.979 0.971 0.986 0.987 0.986 0.980
CART 1.000 0.310 0.649 0.419 0.606 0.585 0.629 0.460
Table 5.1: SMSE when applying Multiple Linear Regression (MLR) and CART to
the complete set of benchmarks
1. Stores: The number of array stores
2. Size: The number of statements within the loop
3. Reuses: The number of array element reuses
4. Nested: The nested level of the loop
5. Floating: The number of floating-point operations
By far, the number of array stores and the size of the loop body are the features that the
linear model has found to most influence the improvement in performance of the loops
under unrolling. This is understandable as the former provides an indication of the memory
references and the demand on registers the code may have and the latter is crucial to
determine whether the instructions in the loop body may be kept in the cache. However,
other features such as the number of if statements, the number of array loads or even the trip
count, are found not to be so important for the linear models. Indeed, the trip count does not
seem to provide useful information and is ranked among the best five features only when
u=2. Other variables such as the number of array element reuses that attempts to represent
data dependency among iterations and the number of floating point operations are also found
to be relevant for these models. Surprisingly, the nested level of the loop is ranked fourth
among all the features. However, a cautionary note is in order, given that the
linear models that have been constructed do not seem to accurately represent the data, as the
values of SMSE are approximately one. Certainly, the predictions are only slightly better
than always predicting the mean and the findings explained above may only be a
consequence of the poor performance of the linear model.
5.7.1.2 Analysing the trees

With the aim of obtaining an explanatory model of the data,
CART was applied to the whole dataset with no pruning, i.e. allowing overfitting. The
results shown in Table 5.1 reflect this fact. The most informative feature, i.e. the one placed
at the root of the tree, was the trip count for u = 2, 4, 8 and the number of floating-point operations
for u=3, 5, 6, 7. This is in contrast with the results obtained with linear regression where the
trip count was found not to be relevant for the predictions.
Some rules that show how the algorithm fits the data may be of interest. For example, for
u=2 it is found that IF the loop is called between 349800 and 650000 times within the
program AND the loop body ranges between 4 and 8 statements AND the trip count is
greater than 755 AND the loop has at most 1 branch AND the number of loads is greater
than 2 AND the loop has only 1 store AND at most 1 floating-point operation THEN the
expected improvement in performance is approximately 38%. In the opposite case, IF the
number of times the loop is called, the trip count and the number of branches are the same
as before, but the number of loads per iteration is 2 or 3, AND the loop has 2 or more
stores AND 2 or more floating-point operations THEN the expected detriment in
performance is roughly 23%.
Variable u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
Called -0.958 -1.205 -1.023 -1.042 -0.849 -1.206 -1.049
Size 3.318 5.387 3.967 4.779 2.850 4.742 3.827
Trip -1.000 -0.434 -0.492 -0.505 -0.427 -0.534 -0.828
Sys -0.838 -0.993 -0.782 -0.695 -0.433 -1.516 -1.038
Branches 0.911 0.539 0.288 0.178 0.532 1.147 0.380
Nested -0.716 -0.559 -1.504 -1.918 -1.783 -1.521 -2.008
Loads 0.925 1.249 0.236 1.205 1.588 1.448 1.630
Stores -3.911 -4.616 -5.179 -5.044 -4.315 -4.976 -4.987
Reuses 1.724 1.514 2.566 2.419 2.647 1.602 2.418
Floating -1.113 -2.137 -1.044 -1.237 -1.233 -1.208 -1.615
IndAcc -0.173 0.268 -0.009 -0.214 -0.201 -0.177 -0.443
Bo (bias) 1.804 1.668 4.587 2.370 2.872 3.772 5.143
Table 5.2: Parameter vectors of the linear models using the whole dataset
Although these prediction rules may seem unsophisticated, they represent how the loops in
the set of benchmarks selected are affected by unrolling. If it is recalled that the subsets in
which the data is partitioned by CART are as homogeneous as possible, they may provide a
better understanding of how unrolling works on this set of benchmarks.
5.7.2 K-fold cross-validation

Given that training and testing the models in the same dataset does not represent the actual
predictive power of the algorithms, K-fold cross-validation has been used to test the
accuracy of the predictions. The results for the set of benchmarks used are shown in Table
5.3, Table 5.4 and Table 5.5. Each table shows the average (in bold) of the SMSE throughout
all the folds for the algorithms applied to each set of benchmarks. Although there are several
cases when the SMSE is found to be less than one (highlighted cells), in general, the
accuracy of the predictions for linear regression and CART algorithm is not good. The
predictions obtained do not outperform the simple mean predictor used in the calculation of
the SMSE. However, it is worth noting that some SMSE calculations indicate a good
performance of the algorithm for a specific fold (e.g. in Table 5.3 for CART algorithm when
using k=9 and u=6) but they are compensated by other folds in which the algorithm performs
poorly. Comparatively, linear regression and CART algorithm have roughly the same
performance, although as shown in Table 5.3 and Table 5.5, Linear Regression consistently
outperforms CART.
5.7.3 Leave One Benchmark Out cross-validation

As explained in section 5.6.3, a more realistic approach to test the algorithms is to use Leave
One Benchmark Out (LOBO) cross-validation. The SMSE values for Linear Regression and
CART algorithm are shown in Table 5.6 and Table 5.7. Unfortunately, the overall results are,
again, an indication of poor performance.
For the case of Multiple Linear Regression (Table 5.6) the loops within the benchmarks
102.swim, 141.apsi and vectord3 are predicted better than baseline, but these low values of
SMSE are compensated by the poor performance of the algorithm in other benchmarks such
as 104.hydro2d and 125.turb3d. For benchmarks with a considerable amount of loops that
can be improved by unrolling, such as vectord1 and vectord2 (see table 4.1 in chapter 4), the
SMSE values are only slightly different from 1, and these benchmarks could be expected to
be predicted correctly when determining the best unroll factor to be used.
As with Multiple Linear Regression, CART algorithm (Table 5.7) shows quite good
performance when predicting on the benchmark 102.swim (u = 3, 4, 7 and 8). Additionally,
unlike Linear Regression, CART shows good SMSE values on the benchmark 125.turb3d
that has 5 loops for which unrolling is significantly beneficial (see table 4.1 in chapter 4).
However, CART also performs badly on 101.tomcatv, vectord3 and vectord4. For this
last benchmark, composed of two loops with very little improvement, the algorithm makes
predictions greater than 50%, causing the SMSE to reach values greater than 300.
Linear Regression
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K = 1 0.960 0.912 0.887 0.899 0.839 0.876 0.864
K = 2 1.105 1.058 1.021 0.917 1.094 1.096 0.992
K = 3 0.942 0.886 1.050 1.433 1.349 0.781 0.682
K = 4 0.975 0.967 1.150 1.090 1.025 1.116 1.079
K = 5 0.961 0.986 0.925 1.024 1.072 0.918 0.941
K = 6 1.916 2.473 1.473 1.257 1.494 1.986 1.365
K = 7 1.315 1.488 1.334 1.350 1.750 1.001 1.197
K = 8 1.315 1.452 1.284 1.023 0.904 1.296 1.283
K = 9 1.072 1.361 1.524 1.330 1.021 1.329 1.273
K = 10 0.948 0.962 0.888 0.813 0.835 1.069 0.868
AVG. 1.151 1.254 1.154 1.114 1.138 1.147 1.054
CART
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K=1 1.400 1.164 1.319 1.078 0.905 1.228 0.825
K=2 3.336 1.707 4.385 1.616 6.883 3.440 3.507
K=3 1.097 2.658 1.452 2.220 2.380 2.079 2.387
K=4 1.542 1.018 1.165 1.571 1.208 2.015 1.085
K=5 1.550 1.268 0.909 0.807 1.106 0.565 0.756
K=6 2.596 1.206 1.191 1.189 1.218 1.579 0.969
K=7 1.060 2.247 1.240 1.331 0.757 0.718 0.763
K=8 2.744 4.179 3.193 9.976 1.151 2.861 2.867
K=9 1.976 1.769 0.940 1.359 0.244 1.299 1.155
K=10 0.650 0.650 0.747 1.006 1.184 1.027 0.994
AVG. 1.795 1.787 1.654 2.215 1.704 1.681 1.531
Table 5.3: SMSE K-fold cross-validation for SPEC CFP95
Linear Regression
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K = 1 0.963 0.740 1.297 1.086 1.043 0.897 1.084
K = 2 1.016 1.429 1.153 1.033 1.075 1.129 1.158
K = 3 0.934 0.996 0.826 0.994 0.994 0.996 0.940
K = 4 17.42 13.75 3.618 4.624 5.308 4.107 3.242
K = 5 1.053 1.312 0.900 1.163 1.010 1.122 0.959
K = 6 1.731 1.656 1.445 1.625 1.554 1.239 1.258
K = 7 1.093 1.192 1.057 1.201 1.201 1.268 1.109
K = 8 0.797 0.850 1.039 1.088 1.421 1.182 1.048
K = 9 1.078 1.569 1.016 1.529 1.534 1.640 1.039
K = 10 1.090 1.800 2.011 2.723 2.809 2.477 1.676
AVG. 2.717 2.529 1.436 1.707 1.795 1.606 1.351
CART
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K=1 0.899 2.007 1.810 1.863 2.353 2.529 1.483
K=2 1.844 6.094 1.570 6.769 7.186 7.719 1.081
K=3 0.869 0.827 0.728 0.784 0.743 0.831 2.922
K=4 1.345 1.070 0.771 0.742 0.713 0.889 0.963
K=5 1.595 4.826 1.677 3.197 3.073 4.198 2.964
K=6 1.599 1.610 1.050 0.793 0.746 0.682 0.575
K=7 1.794 1.581 1.874 3.552 3.475 3.432 1.876
K=8 0.938 1.348 1.173 1.278 0.964 1.343 0.995
K=9 0.990 0.625 1.046 1.512 1.340 1.374 1.713
K=10 1.269 1.379 1.307 1.162 1.412 1.329 1.366
AVG. 1.314 2.137 1.301 2.165 2.200 2.433 1.594
Table 5.4: SMSE K-fold cross-validation for VECTORD
Linear Regression
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K = 1 1.076 1.034 0.991 1.037 1.020 1.052 0.978
K = 2 1.022 1.000 1.044 1.006 1.022 1.004 1.055
K = 3 0.972 0.960 0.942 0.960 0.986 0.903 0.946
K = 4 0.919 1.016 1.052 1.077 1.051 1.090 1.051
K = 5 1.000 1.047 0.960 0.945 0.971 1.014 1.037
K = 6 1.192 1.149 1.052 1.062 1.034 1.038 1.030
K = 7 1.032 0.981 1.057 1.028 1.030 1.034 1.032
K = 8 0.931 0.852 0.914 0.927 0.993 0.871 0.954
K = 9 0.935 0.943 0.975 0.999 0.988 0.963 0.970
K = 10 1.303 1.353 2.247 1.332 1.621 1.876 1.691
AVG. 1.038 1.034 1.123 1.037 1.072 1.084 1.074
CART
Fold u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
K=1 1.054 0.678 0.887 1.818 1.094 0.566 1.185
K=2 1.180 0.835 0.529 0.766 0.752 0.769 0.717
K=3 0.638 1.898 1.544 2.420 2.459 1.724 2.371
K=4 1.119 1.678 0.978 1.561 1.267 1.717 0.996
K=5 1.661 1.531 1.825 1.833 1.551 1.139 1.117
K=6 1.116 1.171 1.721 1.677 1.541 1.694 1.830
K=7 0.907 1.332 0.781 1.105 1.070 1.291 0.645
K=8 1.270 1.172 0.884 1.593 1.219 1.359 0.720
K=9 1.511 8.758 1.068 3.444 1.009 2.384 1.089
K=10 1.513 1.533 1.255 2.091 2.764 2.027 1.138
AVG. 1.197 2.058 1.147 1.831 1.473 1.467 1.181
Table 5.5: SMSE K-fold cross-validation for SPEC CFP95 + VECTORD
Benchmark u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
101.tomcatv 5.917 0.869 0.985 0.573 2.691 0.612 1.838
102.swim 0.279 0.625 0.636 0.739 0.717 0.756 0.705
103.su2cor 1.832 1.079 1.793 1.352 1.573 1.479 1.602
104.hydro2d 8.086 38.60 19.28 42.64 29.03 23.60 26.76
107.mgrid 0.889 1.681 2.096 2.997 4.354 2.505 2.775
125.turb3d 2.085 9.685 20.75 17.63 6.313 10.96 9.547
141.apsi 1.031 1.025 0.913 0.911 0.900 0.898 0.987
146.wave5 1.343 1.616 1.856 2.715 2.365 2.464 2.381
Vectord1 0.976 0.996 0.987 1.010 1.018 1.003 1.003
Vectord2 1.064 1.102 1.042 1.057 1.048 1.068 1.037
Vectord3 0.958 0.976 0.943 0.727 0.802 0.838 0.895
Vectord4 1.487 1.485 4.748 0.527 1.869 2.274 0.030
AVG. 2.162 4.978 4.670 6.073 4.390 4.038 4.130
Table 5.6: SMSE LOBO cross-validation for Multiple Linear Regression
5.7.4 Realising speed-ups
Considering that the ultimate task in loop unrolling is to determine the unroll factor to be
used for each loop, an additional procedure has been followed in order to measure the impact
of the predictions described above. Given that the actual improvement in performance of the
loops is available for each unroll factor, it is possible to establish the best improvement in
performance that a loop can achieve. Similarly, the predictions provided by the algorithms
(obtained with LOBO cross-validation) can be used to ascertain the best predicted unroll
factor for each loop. Therefore, the expected improvement in performance a loop may
experience under this unroll factor can be found. Let us call this magnitude the predicted
improvement, although it should be clear that it is the actual improvement of each loop based
on the unroll factor suggested by the predictions of the algorithms.
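The procedure just described can be sketched as follows, with hypothetical per-loop tables of predicted and measured improvements indexed by unroll factor:

```python
def predicted_improvement(predicted, actual):
    """Pick the unroll factor with the highest *predicted* improvement,
    then report the *actual* measured improvement for that factor."""
    best_u = max(predicted, key=predicted.get)
    return best_u, actual[best_u]

# Hypothetical loop: the model's predictions favour u = 4, but u = 2 was truly best.
pred = {2: 1.0, 3: 2.5, 4: 6.0}   # predicted improvement (%) per unroll factor
meas = {2: 8.0, 3: 0.5, 4: 5.0}   # measured improvement (%) per unroll factor
u, gain = predicted_improvement(pred, meas)
```

Because `gain` is a measured value, it can never exceed the best measured improvement, which is why no point in Figure 5.1 can lie above the diagonal.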
Benchmark u = 2 u = 3 u = 4 u = 5 u = 6 u = 7 u = 8
101.tomcatv 4.861 6.862 6.244 4.185 13.68 8.461 3.907
102.swim 22.76 0.739 0.575 1.499 1.284 0.646 0.795
103.su2cor 6.600 4.158 1.753 1.754 1.560 1.317 0.949
104.hydro2d 1.953 2.274 1.218 3.075 3.181 2.042 1.896
107.mgrid 1.656 2.401 0.674 2.903 2.187 2.069 2.731
125.turb3d 1.468 2.425 0.658 1.392 0.787 1.252 0.514
141.apsi 1.253 2.020 1.455 1.635 3.770 2.328 2.539
146.wave5 0.672 1.579 0.611 5.272 3.133 1.262 0.789
Vectord1 1.169 1.122 1.239 1.002 1.242 1.145 1.296
Vectord2 0.882 4.690 1.009 2.469 2.678 3.126 1.309
Vectord3 5.459 1.891 6.883 4.882 4.799 5.201 3.859
Vectord4 386.5 328.3 389.4 210.8 95.33 344.6 481.3
AVG. 36.27 29.87 34.31 20.07 11.13 31.12 41.82
Table 5.7: SMSE LOBO cross-validation for CART algorithm
Figure 5.1 shows plots of the predicted improvement vs. the best improvement for
Multiple Linear Regression and the CART algorithm. Clearly, the ideal place for each
point in the plots is on the diagonal of the first quadrant, i.e. where the predicted improvement
equals the best improvement. However, it may also be acceptable to be below the
diagonal (note that being above the diagonal is impossible) but above the x-axis.
This indicates that, although the exact improvement is not predicted, much can still be
gained by using the unroll factor suggested by the algorithms. Both Linear Regression and
CART suggest unroll factors for which a notable improvement in performance is achieved.
However, they also make suggestions that degrade performance for loops that could
potentially be improved, as can be seen from the points below the x-axis.
Negative values lying on the y-axis indicate that the best improvement is zero but the
predicted unrolling hurts performance.
Because overprinting in Figure 5.1 may make it difficult to analyse the actual effect of the
predictions, the best improvement and the predicted improvement have been divided into
ten classes, and a histogram counting the number of loops that belong to each pair of classes
has been computed. The results are shown in Figure 5.2. Given that there are
only minor differences between Linear Regression and CART, the subsequent analysis refers to
the results of both algorithms (unless explicitly stated otherwise). As before, it is desirable to be
on the main diagonal of the plot, because this indicates that the effect of the predictions is the
best possible improvement. Approximately 57% of the loops are correctly affected by
the predictions. Furthermore, a large number of loops remain unaffected by the predictions
because their maximum improvement is nearly zero: the bar corresponding to the fifth class
of both the predicted improvement and the best improvement represents roughly 28% of the
loops. It is also interesting to calculate the percentage of loops for which unrolling cannot
produce an improvement in performance but which are negatively affected by the predictions.
These loops are represented by the bars for which the predicted improvement class is less
than five and the best improvement class is five; roughly 17% of the loops belong to this
group. The bars for which the best improvement class is greater than five and the predicted
improvement class is five correspond to loops that could potentially be improved by
unrolling but for which the predictions do not produce a significant effect; 13% (for Linear
Regression) and 17% (for CART) of the loops fall into this group.
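The ten improvement classes used in this histogram (their bin edges are listed alongside Figure 5.2) can be reproduced with a small helper; a loop whose improvement is near zero falls into class 5, the (-5, 5] band:

```python
# Upper edges of the ten half-open improvement bins (%), classes 1..10:
# (-150,-100], (-100,-50], (-50,-20], (-20,-5], (-5,5],
# (5,10], (10,20], (20,40], (40,60], (60,80]
UPPER_EDGES = [-100, -50, -20, -5, 5, 10, 20, 40, 60, 80]

def improvement_class(pct):
    """Map an improvement percentage to its class number (1-10)."""
    for i, upper in enumerate(UPPER_EDGES, start=1):
        if pct <= upper:
            return i
    raise ValueError("improvement out of range")
```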
Besides analysing the effect of the predictions on the complete set of loops, the unroll factors
suggested by the algorithms have been used to ascertain the reduction (or increase) in the
execution time of each loop, using the data that was initially collected. These
reductions in execution time have been used to determine the impact of the predictions on
the total execution time of the benchmarks. Let us denote this improvement in performance
as the hypothetical or re-substitution improvement, because it has not been obtained by
additional executions of the programs. It relies on the assumptions that there is no interaction
between loops and that the effect of the instrumentation is negligible. The results for all the
benchmarks used are shown in Figure 5.3. The results obtained by the Linear Regression
model led to an improvement in performance on seven benchmarks, whilst one
benchmark, namely 125.turb3d, remained nearly unaffected, and four programs experienced
an increase in their execution times. The improvements obtained by using the results of the
CART algorithm are similar, except that in this case 125.turb3d was negatively affected and
its performance worsened. It can be seen that the SPEC CFP95 benchmarks are more
difficult to optimise than the VECTORD benchmarks: the maximum improvement reached
for the former was roughly 6%, while the latter were improved by a maximum of
approximately 18%. The mean improvement in performance across all the benchmarks
achieved by the Linear Regression model was about 2.5%, while the CART algorithm gave a
2.3% improvement on average. Finally, the benchmark most negatively affected by
erroneous predictions of the algorithms was vectord3, which experienced a slowdown of
approximately 7%.
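The re-substitution improvement can be sketched as follows (with hypothetical per-loop times and improvements; the no-interaction assumption is what allows the per-loop savings simply to be summed):

```python
def resubstitution_improvement(total_time, loop_times, loop_gain_pct):
    """Estimated benchmark speed-up (%) if each loop's execution time
    shrinks by its predicted improvement, assuming loops do not interact
    and instrumentation overhead is negligible."""
    saved = sum(loop_times[l] * loop_gain_pct[l] / 100.0 for l in loop_times)
    new_total = total_time - saved
    return 100.0 * (total_time - new_total) / total_time

# Hypothetical benchmark: 100 s total, two instrumented loops.
gain = resubstitution_improvement(
    total_time=100.0,
    loop_times={"loop1": 40.0, "loop2": 10.0},
    loop_gain_pct={"loop1": 10.0, "loop2": -5.0},  # loop2 is hurt by unrolling
)
```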
If Figure 5.3 is compared with the maximum possible improvement achievable by unrolling,
shown in Figure 5.4 for all the benchmarks, the negative effect of the predictions on some
benchmarks can be understood given the insignificant maximum improvement that can be
reached on those benchmarks. Furthermore, the good effect obtained for vectord1 and
vectord2 can also be explained on the basis of this maximum improvement.
5.7.5 Feature construction
As explained in section 4.6.2, some features may be irrelevant to the learning techniques;
others can be transformed into a more suitable form, for example binary features; and new
features can be constructed in order to improve the accuracy of the predictions, for example
the ratio of memory references to floating-point operations. Binary features were created for
sparse variables, such as the number of proper functions of the language or the number of
branches within the loop, but no improvement was found. Similarly, the ratio of memory
references to floating-point operations was created as a feature, but unfortunately this new
variable did not affect the performance of the algorithms.
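The kinds of constructed features mentioned above can be sketched like this (the feature names are hypothetical; shown are a binary indicator derived from a sparse count and the memory-reference-to-FLOP ratio):

```python
def add_constructed_features(loop):
    """Augment a loop's feature dictionary with a binary indicator for a
    sparse count and the ratio of memory references to FP operations."""
    out = dict(loop)
    out["has_branches"] = 1 if loop["n_branches"] > 0 else 0
    fp = loop["n_fp_ops"]
    out["mem_to_fp_ratio"] = loop["n_mem_refs"] / fp if fp else 0.0
    return out

features = add_constructed_features(
    {"n_branches": 0, "n_mem_refs": 12, "n_fp_ops": 4})
```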
5.7.6 Comparison to related work
In principle, there is no direct point of comparison with the work in [Monsifrot et al., 2002]
or that in [Stephenson and Amarasinghe, 2004], owing to the differences in the benchmarks
used and in the formulation of the problem. Certainly, their goal was to build a classifier,
while the approach adopted in this project is the construction of a regression model.
Although in [Monsifrot et al., 2002] unrolling was also implemented at the source-code level,
a classification task was formulated and, consequently, the results were presented in a
different way. Additionally, the decision on the specific unroll factor to be used was
left to the compiler and was not determined by the classifier. This considerably
simplifies the problem, given that a loop may be improved by one unroll factor but negatively
affected by another.
For illustrative purposes only: if a classifier were constructed from the results obtained with
Linear Regression or the CART algorithm, it would have an accuracy of approximately 75%,
which is comparable to the total accuracy obtained in [Monsifrot et al., 2002] without boosting
(79.4%). Their speed-ups were on average 6.2% and 4% on two different machines.
However, there is no reference to the maximum speed-up reachable in their data.
In the case of the work by [Stephenson and Amarasinghe, 2004], it is even more
difficult to establish comparisons because they implemented unrolling in the back-end of the
compiler. They also adopted a classification approach and obtained speed-ups of 6% on the
SPEC benchmarks. However, they included programs from SPEC INT95 that were not
considered in this project. Furthermore, their dataset included programs for which an
improvement of 30% to 50% was possible, which is greater than the maximum obtainable
speed-ups for the programs used in this project.
5.8 Summary and Discussion
This chapter has presented the regression approach for predicting the expected improvement
in performance of loops under unrolling. This approach has been argued to be more robust
and more general than a classification approach: more robust because it deals more easily
with noisy measurements, and more general because both the binary decision of whether
unrolling should be applied and the multi-class prediction of the best unroll factor can easily
be derived from the results of the regression.
Two different modelling techniques have been used to tackle this regression problem:
Multiple Linear Regression and Classification and Regression Trees (CART). Multiple
Linear Regression is a straightforward technique that assumes the output variable is a
linear combination of the input variables; despite its simplicity, it has proved successful
for many types of problems. The CART algorithm is a decision-tree learning method used
for classification and regression; it attempts to model non-linear dependencies between the
output variable and the predictor variables without making assumptions about the
distribution of the data. To measure the predictive power of the techniques, the Standardised
Mean Square Error (SMSE) has been used, which compares the accuracy of the predictions
against the very simple baseline of always predicting the mean. Three different types of
experiments have been designed in order to test the models constructed for the regression
approach: using the complete dataset, using K-fold cross-validation, and using Leave One
Benchmark Out (LOBO) cross-validation. When working with the complete dataset, the
algorithms were trained and tested on the same data in order to obtain an explanatory model
that gave a better insight into the problem and identified the variables relevant to loop
unrolling. The regression coefficients indicated that the most influential variables were the
number of array stores, the number of statements within the loop, the number of array
element reuses, the nesting level of the loop and the number of floating-point operations.
Apart from the nesting level of the loop, which was unexpectedly ranked fourth among the
most important variables, all the other features found have a demonstrated impact on loop
unrolling. The analysis of the trees built by the CART algorithm determined that the trip
count and the number of floating-point operations were the most relevant features.
Additionally, interesting rules involving the size of the loop body, the number of memory
references, the number of floating-point operations and the trip count emerged from the
trees, providing an explanation of how the regression task was carried out on the data.
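A minimal pure-Python sketch of the Multiple Linear Regression fit (ordinary least squares via the normal equations; purely illustrative, not the statistical package used in this project):

```python
def fit_linear_regression(X, y):
    """Ordinary least squares: solve (X'X)b = X'y for the coefficients b.
    An intercept column of ones is prepended to X; the linear system is
    solved by Gaussian elimination with partial pivoting."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c]
                              for c in range(r + 1, k))) / A[r][r]
    return coef  # [intercept, weight_1, ..., weight_n]
```

For example, on data generated exactly by y = 1 + 2*x1 + 3*x2, the fit recovers those coefficients.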
Considering that the predictive power of the techniques is what determines the effectiveness
of the regression approach, the results obtained with the cross-validation methodology and
the Leave One Benchmark Out strategy are, in general, disappointing. Although for some
cases the SMSE values indicated good performance, they were strongly overshadowed by
other cases where the performance of the algorithms was poor.
Without being discouraged by the low accuracy of the predictions, the models constructed
by the Linear Regression approach and the CART algorithm were used to suggest an
unroll factor for each loop. These unroll factors were then used to ascertain the possible
improvement in performance that the loops could obtain. It was found that in most
cases the unroll factors suggested by the algorithms produced an improvement in
performance that, to some extent, agreed with the best obtainable improvements. In other
words, many of the unroll factors suggested by the predictions of the algorithms led to an
improvement in the performance of the loops. However, some loops were negatively affected
by incorrect predictions.
Additionally, given the time constraints placed on the present project, the benchmarks were
not executed with the predicted unrolled versions of the loops. Instead, an alternative
procedure based on the collected data was performed in order to realise speed-ups on each
benchmark. This procedure reused the execution times in the data and worked under the
assumptions of no interaction between loops and a negligible effect of the instrumentation.
Hence, the new execution times for the benchmarks were calculated from the predictions
suggested by the algorithms with the LOBO cross-validation results. It was found that
seven of the twelve benchmarks might experience an improvement in performance, with a
maximum speed-up of approximately 18%. However, other benchmarks might
experience an increase in their execution times if the predictions were adopted; the worst
degradation in performance was roughly 7%. The mean improvement in
performance across all the benchmarks was about 2.5% for the linear model and 2.3%
for the results obtained by CART.
It may seem rather contradictory that the accuracy of the regression techniques was found to
be poor, yet the unroll factors suggested by their results led to a notable improvement in
performance, considering the maximum improvement obtainable by unrolling
on this set of benchmarks. Two reasons may explain this fact. Firstly, some
loops may always be improved by unrolling regardless of the unroll factor used; that is, the
execution time of some loops may be effectively reduced by any unroll factor greater than
one. However, there is still an indication of a good effect of the predictions given that, as
shown in Chapter Four, most of the loops do not experience an improvement in performance
greater than 5%. Indeed, and this is the second reason for the speed-ups obtained despite the
poor performance of the algorithms, the models constructed may somehow have learnt when
loops benefit from unrolling, so that, although the accuracy of their predictions is low, the
best unroll factor suggested by their results still yields a good improvement for the loops.
The explanation given above does not attempt to soften the poor predictive power, in terms
of SMSE, achieved by the techniques used, but rather to raise a question regarding the
appropriateness of the regression approach for the problem tackled in this project. I still
believe it is a more general and suitable approach than the construction of a classifier that
directly predicts the best unroll factor; the classification approach is more susceptible to
erroneous decisions because it does not smoothly model the behaviour of the execution
times of the loops under unrolling. However, it may also be true that the regression approach
needs more training data, different features to capture the variations in the execution
times of the loops under unrolling, and even different types of techniques to solve the
problem. These considerations can be adopted by future work in order to improve the results
obtained in this project.
[Two scatter plots: Predicted improvement (%) on the y-axis, ranging from -160 to 80, against Best improvement (%) on the x-axis, ranging from 0 to 70, one per model.]
Figure 5.1: Predicted Improvement (%) vs. Best Improvement in performance
for Linear Regression model (top) and CART (bottom)
Class Improvement (%)
1 (-150,-100]
2 (-100,-50]
3 (-50,-20]
4 (-20,-5]
5 (-5,5]
6 (5,10]
7 (10,20]
8 (20,40]
9 (40,60]
10 (60,80]
Figure 5.2: Histogram of Best Improvement vs. Predicted Improvement for Multiple
Linear Regression (top) and CART (bottom)
Figure 5.3: Re-substitution Improvement found by Multiple Linear Regression
(top) and CART (bottom)
Figure 5.4: Maximum possible improvement obtainable by unrolling
Conclusions
This dissertation has addressed the problem of compiler optimisation with machine learning.
It has been shown that, even when only one transformation such as loop unrolling is
considered, the problem of determining when and how the transformation should be applied
remains a challenge for compiler writers and researchers. This is evident because loop
unrolling may be beneficial for some loops but detrimental for others; furthermore, the
transformation may have a negative effect on a loop under one unroll factor and a positive
effect under another. One of the most important reasons to apply loop unrolling is that it
exposes Instruction Level Parallelism. Unrolling can also eliminate copy operations and
improve memory locality. However, the transformation may degrade the instruction cache
depending on factors such as the size of the loop body, the size of the cache and the unroll
factor used. Additionally, it may increase the number of address calculations and make the
instruction-scheduling problem more complex. Finally, due to the effect of interactions,
loop unrolling may enable or prevent other transformations. This is probably the most
complicated issue when studying loop unrolling, given that analysing the interaction with
another transformation in isolation may be unrealistic, while evaluating its effect in the
presence of a great number of program transformations becomes infeasible.
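As a purely illustrative sketch (in Python rather than the Fortran sources used in this project, and with the real benefits arising at the machine-code level), unrolling a summation loop by a factor of four replaces four loop-control steps with one:

```python
def sum_rolled(a):
    total = 0
    for i in range(len(a)):          # one loop-control step per element
        total += a[i]
    return total

def sum_unrolled4(a):
    """Unrolled by a factor of 4; assumes len(a) is a multiple of 4."""
    total = 0
    for i in range(0, len(a), 4):    # one loop-control step per four elements
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
    return total
```

Both functions compute the same result; the unrolled body simply exposes more independent work per iteration, which a compiler or processor can exploit.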
These factors motivate the use of machine learning techniques to predict when unrolling will
be detrimental or beneficial for a particular loop. This approach assumes that program
transformations can be learnt from past examples; more specifically, that the parameters
involved in loop unrolling can be learnt from the behaviour of the execution times of loops
described by a set of features.
Although previous work has formulated the optimisation problem for loop unrolling as a
classification task, throughout this project the belief has been held that a regression
approach can more appropriately model the behaviour of a loop under unrolling. This is
because, for the problem this project has dealt with, classifiers could be severely affected by
noisy measurements and by the limited number of examples in the training set. Furthermore,
the results of a classification approach, regarding both the decision of whether to apply
unrolling to a specific loop and the determination of the best unroll factor, are easily
obtained from the results of the regression approach.
The belief has also been held that, as in most data-mining applications, pre-processing the
data to make it suitable for learning is the phase to which most time must be dedicated,
given that without reliable data the learning process is meaningless. Thus, much effort was
focused on the process of collecting, cleaning, analysing and preparing the data to be used
by the machine learning techniques. Following this process judiciously, the features and the
execution times of the loops were extracted and transformed for use by the regression
approach.
Two different algorithms were used for regression in order to predict the improvement in
performance a loop can experience when it is unrolled a specific number of times: Multiple
Linear Regression and Classification and Regression Trees (CART). The former was
selected not only for its simplicity but also for its successful performance across different
types of applications. The latter was chosen because it considers non-linear relationships
and makes no assumptions about the underlying distribution of the data. The results, in
terms of accuracy measured by the Standardised Mean Square Error (SMSE), were
disappointing, as in most cases the predictions obtained with the algorithms could not
outperform the very simple mean predictor used as the SMSE baseline. Nonetheless, these
predictions were used to ascertain the best unroll factor for each loop and to determine the
expected improvement in performance that the benchmarks considered could experience.
Due to the time constraints placed on this project, the programs could not be executed with
the predictions suggested by the algorithms; the initially collected data was used instead to
compute the final results. Thus, the realisation of speed-ups was carried out under the
assumptions that there is no interaction between loops and that the effect of the
instrumentation is negligible.
It was found that most of the loops could experience an improvement in performance by
using the unroll factor suggested by the predictions of the algorithms. Certainly, seven out
of twelve benchmarks could be improved with the predictions, with a maximum speed-up of
approximately 18%. Four programs in the case of the linear model, and five in the case of
the CART algorithm, were negatively affected by unrolling with the predicted factors. On
average, an overall improvement in performance of 2.5% (for Linear Regression) and 2.3%
(for CART) could be achieved.
Two possible explanations were considered after analysing the results: firstly, that some
loops are positively influenced by unrolling regardless of the unroll factor used; secondly,
that although the accuracy of the predictions was poor, the models enabled the discovery of
circumstances in which unrolling could effectively improve the execution time of loops. In
conclusion, both factors have influenced the improvements obtained. However, even though
improvements in performance were achieved, it would be naïve to state that all the results
in the project are exciting: other approaches, albeit using different data, have obtained
superior speed-ups. It would also be extreme to assume that the improvement in
performance of loops under unrolling is unpredictable. It is difficult to identify a single
reason for the poor performance of the algorithms used. It should be noted, however, that
the number of loops utilised in this project is considerably smaller than in similar
approaches: whilst only 248 loops were included in the dataset of this project, previous
work accumulated data for more than 1000 loops. Additionally, the set of mostly static loop
features utilised may not be sufficient to accurately predict the improvement in performance
of loops under unrolling. Finally, regression methods other than Linear Regression and the
CART algorithm could perform better on the dataset that has been created.
Thus, immediate future work following on from the present project should focus on adding
further examples to the existing dataset, taking care that any new data added is suitable for
use by machine learning techniques. The dataset may be complemented not only with more
examples but also with a greater number of features; dynamic features can be obtained from
frameworks specialised in program analysis and program transformation. Furthermore, it is
worth trying regression methods other than Multiple Linear Regression and the CART
algorithm.
Going beyond learning for loop unrolling, there is a great number of applications in compiler
optimisation that could be addressed: for example, deriving better heuristics for other
program transformations, or tackling more complex problems such as instruction scheduling
or register allocation. However, much work remains to be done on the problem of
interactions among different transformations. Given that evaluating all the possible
optimisation options with which a program could be compiled is infeasible, one could record
these optimisations for a set of training examples, and it would be worthwhile to predict the
optimisation levels at which a program may achieve maximum speed. Nonetheless, it should
be recalled that the creation of a good
dataset usable by machine learning techniques is a time-consuming activity. Therefore, the
use or creation of software tools that could automate this process would provide a great
benefit for research in this area.
Bibliography
[Bacon et al., 1994] Bacon, D., Graham, S. and Sharp, O. (1994). Compiler Transformations
for High-Performance Computing. In ACM Computing Surveys, Vol. 26, Issue 4, Pages 345-420.
[Bodin et al., 1998] Bodin, F., Mével Y. and Quiniou, R. (1998). A User Level Program
Transformation Tool. In Proceedings of the International Conference on Supercomputing.
[Breiman et al., 1993] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1993).
Classification and Regression Trees. Chapman and Hall, Boca Raton.
[Calder et al., 1997] Calder, B., Grunwald, D., Jones, M., Lindsay, D., Martin, J., Mozer, M.
and Zorn, B. (1997). Evidence-Based Static Branch Prediction Using Machine Learning. In
ACM Transactions on Programming Languages and Systems, Vol. 19, No. 1, Pages 188-222.
[Cooper and Torczon, 2004] Cooper, K. and Torczon, L. (2004). Engineering a Compiler.
Morgan Kaufmann.
[Davidson and Jinturkar, 2001] Davidson, J. and Jinturkar, S. (2001). An Aggressive
Approach to Loop Unrolling. Technical Report: CS-95-26. University of Virginia, USA.
[Dongarra and Hinds, 1979] Dongarra, J. and Hinds, A. (1979). Unrolling Loops in
FORTRAN. Software-Practice and Experience, Vol. 9, Pages 219-226.
[Fursin, 2004] Fursin, G. (2004). Iterative Compilation and Performance Prediction for
Numerical Applications. PhD thesis, University of Edinburgh, Edinburgh, UK.
[G77] GNU Fortran Compiler. http://gcc.gnu.org/.
[Han and Kamber, 2001] Han, J. and Kamber, M. (2001). Data Mining: Concepts and
Techniques. Morgan Kaufmann.
[Hand et al., 2001] Hand, D., Mannila, H. and Smyth, P. (2001). Principles of Data Mining.
MIT Press.
[Hays and Winkler, 1971] Hays, W. and Winkler, R. (1971). Statistics: Probability, Inference
and Decision. Volume I. Holt, Rinehart and Winston, Inc.
[Kolodner, 1993] Kolodner, J. (1993). Case-Based Reasoning. Morgan Kaufmann.
[Koza, 1992] Koza J. (1992). Genetic Programming: On the Programming of Computers by
Means of Natural Selection. MIT Press.
[Levine et al., 1991] Levine, D., Callahan, D. and Dongarra, J. (1991). A Test Suite for
Vectorizing Fortran Compilers. Double precision.
[Long, 2004] Long, S. (2004). Adaptive Java Optimisation Using Machine Learning
Techniques. PhD thesis, University of Edinburgh, Edinburgh, UK.
[Mitchell, 1997] Mitchell, Tom. (1997). Machine Learning. McGraw Hill.
[Monsifrot and Bodin, 2001] Monsifrot, A. and Bodin, F. (2001). Computer Aided Hand
Tuning (CAHT): "Applying Case-Based Reasoning to Performance Tuning". In Proceedings
of the 15th ACM International Conference on Supercomputing (ICS-01), Pages 196-203.
ACM Press, Sorrento, Italy.
[Monsifrot et al., 2002] Monsifrot, A., Bodin, F. and Quiniou R. (2002). A Machine
Learning Approach to Automatic Production of Compiler Heuristics. In Artificial
Intelligence: Methodology, Systems, Applications, pages 41-50.
[Neter et al., 1996] Neter, J., Kutner, M., Nachtsheim, C. and Wasserman, W. (1996).
Applied Linear Statistical Models, 4th Edition. Irwin.
[Nielsen, 2004] Nielsen, P. (2004). Lecture notes on Advanced Computer Systems. The
University of Auckland, New Zealand. Downloadable from:
http://www.esc.auckland.ac.nz/teaching/Engsci453SC/.
[ORC] Open Research Compiler for Itanium™ Processor Family.
http://ipf-orc.sourceforge.net/.
[Shen and Smaill, 2003] Shen, Q. and Smaill, A. (2003). Lecture Notes on Knowledge
Representation. The University of Edinburgh, UK. Downloadable from:
http://www.inf.ed.ac.uk/teaching/modules/kr/notes/index.html.
[SPEC95] The Standard Performance Evaluation Corporation.
http://www.specbench.org/.
[Stephenson and Amarasinghe, 2004] Stephenson, M. and Amarasinghe, S. (2004).
Predicting Unroll Factors Using Nearest Neighbors. MIT-TM-938.
[Stephenson et al., 2003] Stephenson M., Martin M., O'Reilly U.-M. and Amarasinghe, S.
(2003). Meta Optimization: Improving Compiler Heuristics with Machine Learning. In
Proceedings of the SIGPLAN '03 Conference on Programming Language Design and
Implementation, San Diego, CA.
[Trimaran] Trimaran, an Infrastructure for Research in Instruction-Level Parallelism.
http://www.trimaran.org.