Power Optimizations for the MLCA
Using Dynamic Voltage Scaling
by
Ivan Matosevic
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2006 by Ivan Matosevic
Abstract
Power Optimizations for the MLCA
Using Dynamic Voltage Scaling
Ivan Matosevic
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2006
The Multi-Level Computing Architecture (MLCA) is a novel architecture for parallel
systems-on-a-chip. We propose and evaluate a profile-driven compiler technique for power
optimizations of MLCA applications using dynamic voltage scaling (DVS). Our technique
combines dependence analysis of loops with profiling in order to identify the slack in par-
allel execution of coarse-grain tasks. DVS is applied to slow down processors executing
tasks outside the critical path, saving power with little or no impact on execution time.
Evaluation of our technique using an MLCA simulator and three realistic MLCA mul-
timedia applications shows that up to 10% savings in processor power consumption can
be achieved with no more than 1.5% increase in execution time. The achieved power
savings are significantly greater than those that could be achieved by uniformly slowing
down all computations with only a similar increase in overall execution time.
Acknowledgements
First and foremost, I would like to thank my supervisor, Prof. Tarek S. Abdelrahman,
for his guidance and support throughout the course of this work.
I would also like to thank Faraydon Karim and Alain Mellan from STMicroelectron-
ics for their support and help, without which this work would not have been possible.
Particular thanks to Alain for providing support for the MLCA simulator.
Many thanks to Utku Aydonat, who helped me in the initial stage of my work by
introducing me to the details of the MLCA simulator. Furthermore, I wish to thank
the participants of the Compiler and Architecture Reading Group for providing valuable
feedback that led to significant improvements in this work.
I am grateful to Prof. Sinisa Srbljic for encouraging me to pursue graduate studies
in Canada. I would also like to thank all of my family, friends, and colleagues, too
numerous to name individually, who have provided support and encouragement during
the past years. Special thanks to Anto Anusic, Vesna Anusic, and Franjo Plavec, who
were of great help when I was moving to Toronto and learning my way around here.
This work has been supported by research grants from STMicroelectronics and Com-
munications and Information Technology Ontario (CITO). I am grateful for their support.
Contents
1 Introduction 1
1.1 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 The MLCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Register Renaming and Parallel Execution of Tasks . . . . . . . . 8
2.1.2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Dynamic Voltage Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 System and Application Properties 16
3.1 Hardware Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Architectural Properties . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Power-Related Properties of the Processors . . . . . . . . . . . . . 17
3.2 Application Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Task Graph Execution Model . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Problem Formulation 24
4.1 Slack in Task Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Voltage Selection and Task Scheduling 29
5.1 The Loop Dependence Graph . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 The Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 The Available Slack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4 Selection of Task Instructions for Slack Distribution . . . . . . . . . . . 38
5.5 The ILP Model for Slack Distribution . . . . . . . . . . . . . . . . . . . . 42
5.5.1 Variables and Constraints . . . . . . . . . . . . . . . . . . . . . . 42
5.5.2 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5.3 Estimate of the Number of Variables . . . . . . . . . . . . . . . . 46
5.6 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.6.1 Task Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.6.2 Processor Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.6.3 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . 49
5.7 Target Loop Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6 Evaluation 52
6.1 Benchmark Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.1.1 JPEG Image Encoder . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.2 GSM Voice Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.3 MPEG Sound Decoder . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Experimental Platform and Processor Properties . . . . . . . . . . . . . 55
6.3 Energy Savings and Execution Slowdown . . . . . . . . . . . . . . . . . . 60
6.3.1 JPEG Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3.2 GSM Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.3 MPEG Sound Decoder . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 Breakdown of the Energy Consumption . . . . . . . . . . . . . . . . . . . 70
6.5 Algorithm Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.6 Evaluation of the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.6.1 Comparison with Practical Upper Bound . . . . . . . . . . . . . . 73
6.6.2 Comparison with Task Graph Partitioning . . . . . . . . . . . . . 75
7 Related Work 80
7.1 Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2 Operating Systems and Optimizing Compilers . . . . . . . . . . . . . . . 82
7.3 System Modeling in DVS Research . . . . . . . . . . . . . . . . . . . . . 83
8 Conclusions and Future Work 85
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A Interval-Based Voltage Selection and Task Scheduling 89
A.1 Task Scheduling Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.2 ILP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.2.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.2.2 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2.3 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.2.4 Estimate of the Number of Variables . . . . . . . . . . . . . . . . 94
Bibliography 95
Chapter 1
Introduction
The Multi-Level Computing Architecture (MLCA) [22] is a novel architecture for parallel
systems-on-a-chip (SOCs). It features multiple processing units and a top-level controller
that automatically exploits parallelism among coarse-grain units of computation, called
tasks. The parallel execution of tasks is based on techniques similar to those used by
superscalar processors for the extraction of instruction-level parallelism.
The MLCA supports a programming model that is close to sequential programming.
An MLCA application consists of a set of task functions and a control program. Task
functions are executed as tasks by the processing units at run-time. The control pro-
gram specifies the control and data flow between tasks, and is executed by the top-level
controller. This model reduces programming effort, making the MLCA an attractive
architecture for multimedia and streaming applications.
Power consumption remains one of the critical design constraints in today’s embedded
systems. These systems often run on batteries, but improvements in battery technology
have failed to keep pace with increased system power consumption, particularly the
power consumption of processors [26]. Even in computer systems that are not battery-
dependent, excessive power consumption results in heat dissipation that increases the cost
of packaging and negatively impacts reliability. Consequently, techniques for reducing
power consumption have attracted considerable interest in recent years.
Dynamic voltage scaling (DVS) is a technique for reducing the power consumption
of processors. With this technique, the supply voltage and operating frequency of a pro-
cessor can be varied during run-time in order to achieve a trade-off between processor
performance and power dissipation. Today, a number of DVS-enabled processors exist, including the Intel XScale [7], the IBM PowerPC 405LP [35], and the Transmeta
Crusoe processors [42].
In this thesis, we investigate the use of DVS for the MLCA. More specifically, we
consider the use of DVS-enabled processors as processing units in the MLCA and propose:
(1) a novel profile-driven compiler technique for assigning processor voltage levels to tasks,
a procedure we refer to as voltage selection, and (2) a complementary task scheduling
algorithm for MLCA applications. We assume that the processor frequency in the MLCA
system is chosen so as to meet the real-time requirements of the application while taking
full advantage of the maximum achievable parallel speedup. Thus, power savings cannot
be achieved by permanently reducing the voltage and frequency of the processors, because
this would violate the real-time requirements of the application. Instead, our technique
attempts to achieve power savings in a manner that either does not slow down the
application or incurs a minor slowdown while achieving power savings significantly greater
than those that could be achieved by uniformly slowing down all computations with only
a similar increase in execution time.
Our proposed technique targets long-running loops in the control programs of MLCA
applications. It combines analysis of the loop dependence graph with profiling informa-
tion in order to deduce properties of the dynamic task graph that represents the run-time
execution of tasks in the loop. Based on these deduced properties, our algorithm attempts
to determine the optimal voltage level for the execution of each task. The task schedul-
ing algorithm, which is also based on the deduced properties of the dynamic task graph,
complements the voltage selection algorithm to ensure that power savings are achieved
with little or no impact on the application execution time.
We implement and evaluate our technique using three realistic multimedia applica-
tions and a simulator of the MLCA [22]. The results indicate that our technique is
successful at reducing processor power consumption by 5–10%, with minimal increase
in execution time (no more than 1.5%). Although our technique specifically targets the
MLCA, we believe that it is applicable in the more general context of task-level paral-
lelism in multimedia applications based on pipelining the processing of individual units
in the input media stream.
1.1 Thesis Contributions
Previous work in the area of DVS-based power optimizations for systems featuring task-
level parallelism, which we survey in Chapter 7, has focused on real-time applications in the form of small sets of dependent tasks (containing up to several tens of tasks), possibly executed periodically. Such applications can be compactly represented using a data structure called a task graph: a directed acyclic graph whose nodes represent individual tasks and whose edges represent precedence constraints between tasks. For
these applications, the problem of power optimization is reduced to a single instance of
the periodic task graph, which is small enough to be analyzed using computationally
demanding algorithms. In contrast, our work targets realistic MLCA applications, in
which task-level parallelism is extracted from programs at run-time, by means akin to
the parallel execution of machine instructions in a superscalar processor. The task graph
of such an application is fully defined only at run-time, due to the unpredictability of
control-flow. Even if we introduce simplifying assumptions about the application control-
flow that enable an approximation of the run-time task graph to be known at compile
time, for example by considering long-running loops, the task graph is normally too large
to be analyzable using previously proposed techniques. Furthermore, the run-time task
graph of a loop in an MLCA program cannot be efficiently partitioned into small sub-
graphs that could each be analyzed separately. This is because most of the parallelism
in this task graph stems from pipelining loop iterations, whose execution can be arbi-
trarily overlapped, since the MLCA supports out-of-order execution of tasks. Finally,
the efficiency of a DVS technique for parallel applications critically depends on the task
scheduling algorithm. The task scheduling algorithms proposed in previous work in the
area of DVS-based power optimizations are not suitable for implementation as a part of
the hardware controller that handles the task scheduling in the MLCA.
In light of the above discussion, our work makes the following contributions:
1. It identifies common regularities in the control-flow of MLCA applications and
exploits these regularities, together with profiling information, to infer at compile-
time the properties of the run-time task graph of the application. We achieve this
goal using compact data structures and computationally lightweight procedures,
rather than the explicit construction and analysis of the run-time task graph, which
would be prohibitively expensive computationally.
2. Based on the profile-based compiler analysis, our work proposes, implements, and
evaluates novel voltage selection and task scheduling algorithms for applications
running on DVS-enabled MLCA systems.
1.2 Thesis Organization
The remainder of the thesis is divided into the following chapters. Chapter 2 presents
an overview of the MLCA and dynamic voltage scaling. In Chapter 3, we state the
assumptions about the MLCA system hardware and the MLCA applications on which
our technique is based and introduce the application execution model based on the task
graph. In Chapter 4, we formulate the problem statement and present a brief overview
of our solution. A detailed description of our technique is presented in Chapter 5. We
present the experimental evaluation of our technique in Chapter 6. Chapter 7 surveys the
related work in the area of DVS-based power optimizations. Finally, Chapter 8 presents
concluding remarks and directions for future work.
Chapter 2
Background
In this chapter, we review background material relevant to our work. Section 2.1 provides
an overview of the target architecture—the MLCA. Section 2.2 provides an overview of
the dynamic voltage scaling technique.
2.1 The MLCA
The Multi-Level Computing Architecture (MLCA) [22] is a novel two-level hierarchical architecture aimed at parallel systems-on-a-chip (SOCs). The lower level of the MLCA
consists of multiple processing units (PUs), and the upper level of a controller that au-
tomatically exploits parallelism among coarse-grain units of computation, called tasks.
A PU can be a full-fledged processor core, a digital signal processor (DSP), a block of
FPGA, or any other type of programmable hardware. The set of PUs can be hetero-
geneous, and tasks in an MLCA application may have different PU preferences. The
top-level controller consists of a control processor (CP), a task dispatcher (TD), and a
universal register file (URF). A dedicated interconnection network links the PUs to the
URF and memory, as shown in Figure 2.1(a).
The novelty of the MLCA stems from the fact that the upper level of the hierarchy
supports parallel execution of tasks, using the same techniques used in superscalar pro-
[Figure 2.1 block diagrams omitted: panel (a) shows the MLCA; panel (b) shows a superscalar processor, with memory, general-purpose registers, execution units (XU), fetch & decode logic, and an instruction queue.]
Figure 2.1: Comparison between the MLCA and a superscalar processor
cessors, such as register renaming and out-of-order execution [14]. This leverages existing
superscalar technology to exploit task-level parallelism across PUs, in addition to pos-
sible instruction-level parallelism within each task. The similarity of the MLCA to the
microarchitecture of a superscalar processor can be seen in Figure 2.1. The MLCA is a
template architecture: it does not specify the number and types of PUs, the form of the interconnection network, the memory configuration, or the type of memory access.
The MLCA supports a programming model that, similar to sequential programming,
does not require programmers to specify task synchronization and inter-task commu-
nication. An MLCA application consists of two levels: a set of task functions, each a
Chapter 2. Background 8
sequential function with a specified number of input and output URF registers, and a
sequential control program, which is executed by the control processor and contains task
instructions.
In the remainder of this section, we present a detailed overview of the MLCA func-
tionality and its programming model.
2.1.1 Register Renaming and Parallel Execution of Tasks
Each task function is a unit of computation that can be modeled as a black-box with a
given number of inputs and outputs, which are mapped to a set of registers in the URF.
This property of task functions enables the control processor to detect data dependences
between tasks at run-time.
The control processor executes a control program, fetching and decoding task instruc-
tions, each of which specifies a task function to be executed on a PU, together with the
inputs and outputs of the task as registers in the URF. Data dependences among task
instructions are detected by identifying the source and sink registers in the URF, in the
same way that dependences among instructions are detected in a superscalar processor.
The control processor renames URF registers as necessary to break false dependences
among task instructions. The number of renaming registers impacts the performance of
MLCA programs; a larger number of renaming registers allows more false dependences
to be eliminated, which enhances the parallelism in the application execution [22].
Decoded task instructions are issued to the TD unit, where they are enqueued in the
task queue. When the inputs of a task become ready, it can begin execution as soon
as a free PU of an adequate type is available. Based on the data dependences detected
at run-time, tasks can be issued out-of-order, and may also complete and commit their
outputs out-of-order. Therefore, execution of a task by the MLCA is analogous to the
execution of a machine instruction by the superscalar processor, and parallel execution
of several tasks on multiple PUs is analogous to the parallel execution of several machine
instructions on multiple execution units. However, unlike machine instructions, task
functions can have arbitrarily large numbers of input and output registers, and a task
can write to its output URF registers at any time during its execution. The outputs
written to the URF in the midst of a task are made available to the tasks awaiting their
inputs in the task queue. Furthermore, the execution time of a task function is not
constant and may vary depending on the inputs and on non-deterministic factors such as contention for memory access.
The ordering of the task instructions in the task queue and the selection of the PU
for each task is determined by the task scheduling algorithm implemented by the task
dispatcher. The default MLCA task dispatcher orders the task instructions according to
their sequential execution order, using a FIFO queue, and selects the PUs in a round-
robin fashion. Our work addresses the problem of task scheduling in the MLCA in the
context of DVS-based processor power optimizations.
Besides the task instructions, the control program also contains control-flow instruc-
tions. Conditional branches are implemented by means of a set of control registers in the
control processor. Each task can optionally write to a control register. Unlike outputs
to the URF registers, output to the control register can be written only at the end of
the task execution. The existence of conditional branches results in control dependences
between tasks. Control dependences can be eliminated at run-time using branch pre-
diction and speculative execution of tasks. The current MLCA system model does not
support speculative execution, but research into the support for speculative execution
for the MLCA is in progress.
2.1.2 Programming Model
The control program is written in an assembler-like language called HyperAssembly. An
example of a control program is shown in Figure 2.2(a). It contains five task instructions,
S1–S4 and S6, which invoke task functions T1–T5, respectively. The type of access
S1: task T1, R1:r,R2:w,R3:w
S2: task T2, R2:r,R3:r,R4:w
S3: task T3, R5:r,R2:w,R3:w
S4: task T4, CR1, R2:r,R3:r,R6:w
S5: if false (CR1 & 0x01) jmpa S7
S6: task T5, R6:r
S7: stop
(a) Original code
S1: task T1, R1:r,R2:w,R3:w
S2: task T2, R2:r,R3:r,R4:w
S3: task T3, R5:r,R101:w,R102:w
S4: task T4, CR1,R101:r,R102:r,R6:w
S5: if false (CR1 & 0x01) jmpa S7
S6: task T5, R6:r
S7: stop
(b) After register renaming
Figure 2.2: An example of HyperAssembly code and register renaming
for each URF register is indicated as read (r) or write (w) next to the register symbol.
Furthermore, the control program contains the conditional branch instruction S5, whose
direction depends on the contents of the control register CR1, which is written by the
task instruction S4. If the condition evaluates to false, the jump to S7 is taken. The
instruction stop waits for the pending tasks to finish and then terminates the execution
of the program.
All tasks in the example from Figure 2.2(a) must be executed sequentially, because
of data and control dependences. We use the symbols δt, δa, and δo to represent true
dependences, write-after-read false dependences, and write-after-write false dependences
between task instructions, respectively. Since the registers R2 and R3 are written by task
instruction S1 and read by task instruction S2, there exists a true dependence S1δtS2.
Similarly, there exists a true dependence S3δtS4, due to the read-after-write access to
registers R2 and R3 by these two task instructions. Furthermore, there is a write-after-
write false dependence S1δoS3, since both task instructions write to R2 and R3, as well
as a write-after-read false dependence S2δaS3, because S3 overwrites the values of R2
and R3 that are read by S2. Finally, there is both a true data dependence and a control
dependence between S4 and S6. The true dependence S4δtS6 stems from the fact that S6
reads the value written into R6 by S4, and the control dependence stems from the value
written by S4 into the control register CR1, which determines whether S6 is executed or
skipped.
The control processor renames registers at run-time to break false dependences and
thus allow some parallel execution. The control program after register renaming is shown
in Figure 2.2(b). With both false dependences eliminated, S3 can be executed in parallel
with S1, and after the task executed by S3 writes its outputs, S4 can proceed regardless of
the status of the tasks executed by S1 and S2. Once the task executed by S4 has completed
execution and written the output to CR1, the direction of branch S5 is computed by the
control processor. If the branch is not taken, S6 is executed, again regardless of the status
of S1 and S2 at that point in time.
Instead of writing the control program directly in HyperAssembly, it is possible to
express it in a higher-level language called Sarek [22], which is compiled into Hyper-
Assembly. Sarek is a C-like language, which supports high-level control-flow constructs,
such as if-statements and while-loops. The execution of tasks is specified using state-
ments similar to function calls. Instead of explicitly specifying access to the URF and
control registers, Sarek supports data variables and control variables, which are similar
to variables in high-level programming languages. These variables are mapped to URF
registers and control registers, respectively, during the compilation of Sarek into Hyper-
Assembly. They are the only data types supported in Sarek. Figure 2.3 shows a Sarek
program equivalent to the HyperAssembly program shown in Figure 2.2. The only con-
trol variable in this example is flag. The remaining variables in the example are data
variables. For simplicity, declarations of variables are omitted.
T1(in count_1, out length, out height);
T2(in length, in height, out depth);
T3(in count_2, out length, out height);
flag = T4(in length, in height, out sum);
if (flag & 0x01) {
T5(in sum);
}
stop;
Figure 2.3: Sarek code of the control program from Figure 2.2
During the compilation of Sarek into HyperAssembly, it is possible to perform various
optimizing transformations [4]. However, in this thesis we assume that the translation is
performed in a straightforward manner, translating each Sarek task statement into a sin-
gle HyperAssembly task instruction and using a one-to-one correspondence between data
variables and URF registers, as well as between control variables and control registers.
This enables us to present examples in the form of Sarek code, which is more easily readable,
although our technique takes the HyperAssembly code of the application as input.
If the PUs are general-purpose processors, task functions can be written in a high-
level programming language and compiled into executable code. For example, the code
of the task function T4 from Figure 2.2 could be written as the C function shown in
Figure 2.4. The function has no formal arguments or return value. Instead, arguments
are read from and written to the URF registers and the control register using specific
API functions readArg, writeArg, and writeCtrl. These API functions implement the
communication between the PU and other components of the MLCA system, in particular
the URF and the control registers.
2.2 Dynamic Voltage Scaling
Dynamic voltage scaling (DVS) is a technique for reducing the power consumption of
processors. It allows programs to change at run-time the supply voltage and frequency of
void T4() {
int arg1, arg2;
int out1, out2, flag;
arg1 = readArg(0);
arg2 = readArg(1);
// Perform computations with arg1 and arg2
// ...
writeArg(0, out1);
// Perform some more computations
// ...
writeArg(1, out2);
writeCtrl(flag);
}
Figure 2.4: C code of an example task function
a processor in order to trade performance for lower power consumption. In the remainder
of this section, we present some theoretical background and practical aspects of DVS.
Power consumption of a CMOS circuit can be approximated by the following for-
mula [33]:
P = ACV^2 f + τAV·I_S·f + V·I_L, (2.1)
where V is the supply voltage, f is the operating frequency, A is the activity of the gates in the system (i.e., the average fraction of the gates that switch in a given clock cycle), C is the total capacitance, I_L is the leakage current, and I_S is the short-circuit current that flows for the brief time τ whenever a gate switches. The first term, ACV^2 f, determines the dynamic power consumption, and accounts for the majority of power dissipated by
today’s CMOS circuits [33]. Therefore, lowering the supply voltage results in reduction
of the processor power consumption. However, the maximum operating frequency of the
processor depends on the supply voltage according to the following formula [33]:
f_max ∝ (V − V_T)^2 / V, (2.2)
where V_T is the threshold voltage of the CMOS device, i.e., the voltage that, when applied to the transistor gate, causes the transistor to switch [44]. Consequently, lowering the
supply voltage of a given circuit must be accompanied by lowering its operating frequency,
which results in degraded performance.
DVS-enabled processors are capable of operating at different supply voltages. They
are equipped with circuits that regulate the supply voltage and operating frequency at
run-time under the control of software. At each voltage level, the operating frequency
is set to the maximum allowed for the corresponding supply voltage. The energy con-
sumed by the processor for a given computation is equal to the product of power and
time. Therefore, lowering the supply voltage and operating frequency reduces the en-
ergy consumed for a given computation if the factor by which the power is reduced is
greater than the factor by which the computation is slowed down. Since, according to
Equations (2.1) and (2.2), this condition is satisfied, DVS introduces the possibility
of an effective dynamic trade-off between processor power consumption and performance.
In computer systems containing DVS-enabled processors, power can be saved if DVS is
applied in situations where computation can be slowed down with an acceptable loss of
performance.
The first DVS-enabled processors were implemented relatively recently [6, 24, 36], but
a number of DVS-enabled processor designs have since appeared on the market. Examples are the Intel XScale [7, 18], IBM's PowerPC 405LP [35],
and Transmeta’s Crusoe [42]. These processors support several discrete voltage levels
with different frequencies and rates of power consumption. The transition between levels
can be performed at run-time by executing specific machine instructions. We base our
approach on the assumption that the DVS capabilities of the processors in the MLCA
system are implemented similarly.
To implement DVS capabilities in a SOC multiprocessor such as the MLCA, it is
necessary to partition the chip into multiple power domains, whose voltage and frequency
can be varied independently at run-time, and place each processor into a separate power
domain. Implementation of chips with multiple power domains is an active area of
research [9, 41]. Flautner et al. [10] have implemented IEM926, a DVS-enabled single-
processor SOC in which the voltage and frequency of the processor core can be varied at
run-time independently of the rest of the chip.
Switching between voltage levels at run-time incurs certain overheads in time and
energy, which generally depend on the levels between which the processor is transitioning.
For modern DVS-enabled processors, the time overhead of a single transition is on the
order of tens of microseconds [7, 15], which translates into several thousand or tens of
thousands of processor cycles. Such overheads are not negligible. Therefore, a practical
strategy for applying DVS to processors must take into account not only the performance
penalty due to reduction in frequency, but also the impact of transition overheads.
With each new generation of CMOS technology, leakage power increases as a consequence of further miniaturization. In forthcoming generations, leakage power is likely to become comparable to dynamic power as a component of total circuit power dissipation [23]. DVS is significantly less effective at reducing leakage power than dynamic power, and novel techniques will be necessary to deal with it.
One such technique is the adaptive body biasing (ABB), which achieves a dynamic trade-
off between leakage power and operating frequency by dynamically scaling the threshold
voltage [3, 23, 30]. Effective combined use of DVS and ABB in presence of substantial
leakage power dissipation is an active area of research [3, 30]. Although we consider DVS-
based power optimizations in our work, our technique is based only on the assumption
that the processors in the MLCA system support some mechanism for dynamic trade-off
between power and performance, regardless of the underlying technology.
Chapter 3
System and Application Properties
In this chapter, we describe the properties we assume of the MLCA system and of the
applications. Section 3.1 states our assumptions about the MLCA system hardware.
Section 3.2 states our assumptions about the MLCA application characteristics. Section 3.3
introduces our execution model of MLCA applications, which is based on the as-
sumptions stated in the previous two sections.
3.1 Hardware Properties
3.1.1 Architectural Properties
As described in Chapter 2, the MLCA is a template architecture with many variable de-
sign parameters, whose lower level consists of a possibly heterogeneous set of processing
units, with an arbitrary type of interconnection network and memory architecture. How-
ever, in this thesis, we assume that the MLCA system features a homogeneous set of PUs
with uniform access to a shared memory. These assumptions simplify the system model.
The assumption of homogeneity renders the system symmetrical to the task scheduling
algorithm employed by the task dispatcher, eliminating the need to consider PU prefer-
ences of tasks. The assumption of uniformly accessed shared memory eliminates the issue
of inter-task communication outside the access to the URF, since the data written into
the shared memory by a task are made available to all processors with identical access
time.
We assume that the number of processors in the MLCA system is the maximum
allowed by the scalability of the application, i.e. adding more processors to the system
would result in little or no improvement in application performance. Furthermore, we
assume that the number of renaming registers is large enough to eliminate all false data
dependences at run-time. These assumptions are reasonable, since the MLCA system
will be custom-designed to meet application requirements, and designers will customize
the system parameters to enable the maximum parallelism in the application execution.
Moreover, applications running on a system with fewer processors than the
maximum tend to have high levels of processor utilization and therefore do not offer
significant opportunities for power optimizations using DVS.
3.1.2 Power-Related Properties of the Processors
We assume that the processors in the MLCA system support several discrete voltage
levels with different operating frequencies and rates of power consumption, with the
possibility of software-controlled switching between levels at run-time. As described in
Section 2.2, this assumption reflects an increasing number of modern high-end embedded
processors. Furthermore, we assume that the voltage and frequency of each processor can
be varied independently. SOCs partitioned into multiple power domains whose voltage
and frequency can be varied separately at run-time are an active area of research [9, 41].
Therefore, it is reasonable to expect that DVS will be supported by the processors in
future MLCA systems.
On a DVS-enabled processor, the relation between the operating frequency and the
execution time of a given sequence of instructions is non-trivial, because the number of
instructions executed per cycle may vary with the processor frequency. However, as a
simplifying assumption in our model, we assume that the number of processor cycles
necessary for the execution of a given sequence of instructions is constant, which implies
that the execution time of a task is inversely proportional to the processor frequency. This
assumption is conservative because, regardless of the processor frequency, each operation
within the processor takes the same number of processor cycles, while the latency of an
access to a component outside the processor, such as a memory access in the case of a cache
miss, is independent of the processor frequency. Therefore, when the operating frequency
is lowered, the total number of processor cycles necessary to execute a given sequence of
instructions either remains constant or becomes lower.
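Under this constant-cycle-count assumption, frequency scaling of a task's execution time reduces to simple arithmetic. The sketch below uses an assumed task that takes 1500 µs at 400 MHz; the cycle count and the frequency levels are illustrative.

```python
def scaled_time_us(cycles: int, freq_mhz: float) -> float:
    """Task execution time under the constant-cycle-count assumption:
    the time is inversely proportional to the operating frequency
    (1 MHz = 1 cycle per microsecond)."""
    return cycles / freq_mhz

# A task taking 1500 us at 400 MHz is assumed to need a fixed
# 600,000 cycles regardless of the frequency it runs at.
cycles = 1500 * 400
for f in (400, 300, 240):
    print(f, scaled_time_us(cycles, f))   # 1500.0, 2000.0, 2500.0 us
```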
Switching between voltage levels at run-time incurs certain overheads in time and
energy, which generally depend on the levels between which the processor is transitioning.
Unlike most of the previous work in the area of DVS-based power optimizations (which
we survey in Chapter 7), we take the effect of transition overheads into consideration.
We assume that the time and energy overheads are constant and independent of voltage
levels. We believe that this simplification does not significantly affect the accuracy of
our results. Mochocki et al. [31] suggest that the assumption of constant time overheads
is reasonable. Furthermore, the results of our experimental evaluation, presented in
Chapter 6, show that the transitions account for a very small percentage of the overall
energy consumption after our technique is applied.
Besides DVS, another power management feature of modern embedded processors is
the possibility of entering a lightweight idle mode implemented by stopping the internal
processor clock until the processor activity is resumed by an external interrupt [18].
In this idle mode, the processor power is not reduced to a negligible level, but the
architectural state of the processor is preserved, so the overheads of entering and exiting
the mode are negligible. For example, the Intel 80200 processor—based on the XScale core—
takes several tens of clock cycles to exit the idle mode, depending on the operating
frequency [18]. We assume that each processor in the MLCA system remains in this
idle mode whenever it is not executing a task. The interrupt necessary to resume the
processor activity can be generated by the task dispatcher whenever a task is scheduled
onto the processor.
3.2 Application Properties
Our approach exploits certain regularities often exhibited by control programs of MLCA
multimedia applications. The control program of a typical multimedia application con-
sists of some initialization and clean-up code, each executed only once, and one or more
long-running loops. These loops handle the processing of the input media stream, and
account for the vast majority of the application execution time. A common property
of these loops is that their bodies contain few control-flow instructions, and often
none at all. Furthermore, the control-flow instructions that do occur are usually tests
for exceptional conditions and are thus severely biased in one direction. In such cases,
the control-flow instructions can be ignored, and the loop can be treated for practical
purposes as a loop without control-flow instructions in its body, containing only the
most frequent flow of execution. We refer to each loop with
these properties as a target loop. Our approach aims for power optimizations of the ex-
ecution of target loops in MLCA multimedia applications. Section 5.7 describes several
more general classes of loops that can be handled as target loops by our technique and
discusses the possible methods for selection of target loops using the results of compiler
analysis and profiling.
Example Sarek code of a target loop is shown in Figure 3.1. The body of this
loop consists of calls to five task functions, Task_1–Task_5, and contains no control-flow
statements. Task function Task_5 updates the control variable finished, which serves
as the loop exit condition. Each of these statements is translated into a single HyperAssembly
task instruction, and the while statement is translated into a HyperAssembly conditional
branch instruction.
do {
Task_1(in x, out x, out y);
Task_2(in y, in z, out y);
Task_3(in y, out y, out z);
Task_4(in y, in w, out w);
finished = Task_5(in index, out index);
} while (!finished);
Figure 3.1: Sarek code of an example target loop
Another assumption that we place on the target loop is that the parallelism between
its iterations is not constrained by control dependences. Control dependences in target
loops can be effectively eliminated at run-time using branch prediction and speculative
execution, since the branch prediction will be almost perfectly accurate. However, even
without speculative execution, control dependences can be ignored in practice for many
target loops in MLCA multimedia applications. These loops are typically running as long
as the input media stream is incoming. Therefore, the task that updates the iteration
counter often does not depend on any other tasks in the loop, but instead only detects
if the end of the input has been reached. Furthermore, the execution time of these tasks
is usually short, since they do not perform any other computations. Therefore, these
tasks can be executed in advance and out-of-order, making it safe to ignore the control
dependences in the loop analysis. For example, in the target loop shown in Figure 3.1, the
branch direction in each iteration depends on the output of the task executing Task_5,
and thus there is a control dependence between this task and all tasks executed by
the subsequent loop iterations. However, as soon as the task executing Task_5 in one
iteration has finished, the equivalent task from the following iteration can be executed
immediately, since it does not depend on the output of any other tasks. Assuming that
the execution time of the task function Task_5 is relatively short, tasks executing this
task function from multiple iterations will be executed in advance and out-of-order, thus
effectively eliminating the influence of the control dependences on the parallel execution
of tasks.
3.3 Task Graph Execution Model
Under the assumptions outlined in the previous two sections, it is possible to represent
the execution of an MLCA application using a data structure called the task graph. The
task graph is a directed acyclic graph whose nodes represent individual tasks (i.e. instances
of task instructions) and whose edges represent dependences between pairs of tasks. Each
node of the task graph is labeled with the execution time of the corresponding task. Since we
assume that control dependences can be safely ignored, and the false dependences are
eliminated at run-time by the control processor, the task graph need contain only edges
representing the true data dependences between tasks. We define the task graph of a
target loop as the subgraph of the task graph of the whole application that contains only
the tasks executed by the target loop.
Figure 3.2 shows the task graph of the loop from Figure 3.1, assuming that three
iterations of the loop are executed. In the figure, the symbol TNi denotes the task that
executes task function Task_N in loop iteration i. According to the labels in Figure 3.2,
we assume that the execution times of task functions Task_1–Task_5 are 1500, 1000,
1500, 2000, and 200 time units, respectively, and that the execution time of each task
function does not vary between loop iterations. We add two additional nodes TIN and
TOUT to the task graph, corresponding to the loop entry and exit.
Assuming that each task writes its outputs immediately prior to the end of its exe-
cution, the information contained in the task graph is sufficient to fully characterize the
application execution at run-time. Without this assumption, it would be necessary to
additionally specify, for each dependence Tim δᵗ Tjn, the time at which Tim writes the last
output read by Tjn (this could be achieved, for example, by labeling the edges in the task
Figure 3.2: Task graph of the loop from Figure 3.1
graph). In MLCA applications for which this assumption does not hold, it is possible to
apply a compiler transformation that splits the task functions across the instructions that
perform writing to the universal register file. For example, the task function shown in
Figure 2.4 can be split into two task functions, each of which writes one output parameter
to the URF. We assume that the performance impact of the task splitting transformation
is negligible. This assumption is reasonable, because task splitting does not increase the
total computational load on the processors, while the load on the universal register file
is increased only slightly.
The execution of an application on an MLCA system is equivalent to the scheduling
of its task graph on the set of processors featured by the MLCA system, using the task
scheduling algorithm employed by the task dispatcher. The minimum execution time of
the application is equal to the length of the longest path in the task graph, which we refer
to as the critical path. For example, the critical path in the task graph shown in Figure 3.2
is (TIN, T11, T21, T31, T22, T32, T23, T33, T43, TOUT), whose length is eleven thousand time
units. The critical path is highlighted by thick lines in the figure. Since there exists a
chain of dependences from the first task in the critical path to the last one, tasks on
the critical path must be executed sequentially. Therefore, the total execution time of
the application cannot be shorter than the length of the critical path, regardless of the
available number of processors. The total execution time of the application is equal to this
minimal value only if the tasks from the critical path are executed as an uninterrupted
sequence.
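The critical-path computation on this task graph amounts to a longest-path search over the unrolled DAG. The sketch below rebuilds the three-iteration graph of Figure 3.2 from the true dependences implied by the register usage in Figure 3.1; the edge set is our reconstruction of that figure, not code from the thesis.

```python
def task_graph(num_iters):
    """Task graph of the Figure 3.1 loop unrolled for num_iters iterations
    (the structure sketched in Figure 3.2), with the assumed execution
    times 1500, 1000, 1500, 2000, and 200 units for Task_1-Task_5."""
    t = {1: 1500, 2: 1000, 3: 1500, 4: 2000, 5: 200}
    nodes = {"IN": 0, "OUT": 0}
    edges = []
    for i in range(1, num_iters + 1):
        for n in t:
            nodes[f"T{n}_{i}"] = t[n]
        # intra-iteration (distance-0) true dependences through register y
        edges += [(f"T1_{i}", f"T2_{i}"), (f"T2_{i}", f"T3_{i}"),
                  (f"T3_{i}", f"T4_{i}")]
        if i < num_iters:   # loop-carried (distance-1) dependences
            edges += [(f"T1_{i}", f"T1_{i+1}"), (f"T3_{i}", f"T2_{i+1}"),
                      (f"T4_{i}", f"T4_{i+1}"), (f"T5_{i}", f"T5_{i+1}")]
    edges += [("IN", "T1_1"), ("IN", "T5_1"),
              (f"T4_{num_iters}", "OUT"), (f"T5_{num_iters}", "OUT")]
    return nodes, edges

def critical_path_length(nodes, edges):
    """Length of the longest path in the DAG, by dynamic programming over
    predecessors; node weights are the task execution times."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    memo = {}
    def longest(v):
        if v not in memo:
            memo[v] = nodes[v] + max((longest(u) for u in preds[v]), default=0)
        return memo[v]
    return longest("OUT")

print(critical_path_length(*task_graph(3)))   # 11000, as stated in the text
```

For three iterations the longest path is the one highlighted in Figure 3.2, of length eleven thousand time units.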
Applying DVS to a task on the critical path increases the length of the critical path
and thus negatively impacts the application performance. Therefore, in order to achieve
power savings using DVS without affecting the application performance, it is necessary to
identify a set of tasks that do not belong to the critical path and restrict the application
of DVS to these tasks.
Our algorithms for voltage selection and task scheduling, which we describe in Chap-
ter 5, use profiling information and dependence analysis to deduce the properties of the
task graph of the target loop, in particular the set of tasks that lie on the critical path.
The assignment of processor voltage levels and the task scheduling algorithm are based
on these deduced properties.
Chapter 4
Problem Formulation
This chapter presents the formal statement of the problem addressed by our approach, in
the context of the assumptions outlined in Chapter 3. Section 4.1 introduces the notion
of slack in task graphs. Section 4.2 presents the formal problem statement.
4.1 Slack in Task Graphs
The goal of our technique is to identify the critical path in the execution task graph of
a given application and prolong the execution time of tasks outside of the critical path
using DVS in a manner that does not introduce a new, longer critical path, thus achieving
power savings with little or no impact on the application performance.
The slack of a task is defined as the maximum time by which the execution time of the
task can be prolonged without affecting the overall application execution time. Similarly,
the slack of a set of tasks in the application is the maximum time by which the total
execution time of these tasks can be prolonged without affecting the overall application
execution time. This slack can be distributed across the tasks in multiple ways, which may
result in different levels of power savings and may be subject to additional application-
dependent constraints.
The task graph shown in Figure 4.1 is used to illustrate the distribution of slack using
Figure 4.1: An example task graph
DVS. Nodes in the task graph are labeled with task execution times in microseconds. The
critical path in this task graph is (TIN, T1, T4, T6, TOUT), which is highlighted by thick
lines in the figure. The length of the critical path is 6000µs. Assuming that the task graph
is scheduled on two processors, there exists a certain slack, since tasks T2, T3, and T5 are
not on the critical path and their execution time can be prolonged up to the limit where
a new, longer critical path is introduced. In particular, the slack of task T2 is 500µs, since
its execution time can be prolonged up to 2000µs. Prolonging the execution
time of T2 beyond 2000µs introduces a new, longer critical path (TIN, T2, T4, T6, TOUT).
Similarly, a slack of 1000µs is available for distribution across tasks T3 and T5; if the sum
of the extra execution times added to these tasks is more than 1000µs, a new, longer
critical path (TIN, T1, T3, T5, TOUT) is introduced. If the execution times of the tasks T2,
T3, and T5 are prolonged by 500µs each, the total execution time of the task graph on
two processors is still 6000µs. The increase in execution time can be realized using DVS,
thus achieving a reduction in power consumption without increasing the overall execution
time.
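The per-task slack in this example can be reproduced with a small longest-path calculation: the slack of a task is the critical-path length minus the longest path passing through that task. Since Figure 4.1 is not reproduced here, the node times below are assumptions consistent with the numbers quoted in the text; in particular, only the sum of T4 and T6 (4000µs) is determined by the text, and the 2500/1500 split is assumed.

```python
# Assumed node times (us) and edges for the Figure 4.1 task graph.
time = {"IN": 0, "T1": 2000, "T2": 1500, "T3": 1500,
        "T4": 2500, "T5": 1500, "T6": 1500, "OUT": 0}
edges = [("IN", "T1"), ("IN", "T2"), ("T1", "T3"), ("T1", "T4"),
         ("T2", "T4"), ("T3", "T5"), ("T4", "T6"),
         ("T5", "OUT"), ("T6", "OUT")]

succs = {v: [] for v in time}
preds = {v: [] for v in time}
for u, v in edges:
    succs[u].append(v)
    preds[v].append(u)

def longest_to(v, memo={}):
    """Longest path ending at v, inclusive of v's own time."""
    if v not in memo:
        memo[v] = time[v] + max((longest_to(u) for u in preds[v]), default=0)
    return memo[v]

def longest_from(v, memo={}):
    """Longest path starting at v, inclusive of v's own time."""
    if v not in memo:
        memo[v] = time[v] + max((longest_from(u) for u in succs[v]), default=0)
    return memo[v]

cp = longest_to("OUT")   # critical-path length: 6000 us
slack = {v: cp - (longest_to(v) + longest_from(v) - time[v]) for v in time}
print(cp, slack["T2"], slack["T3"], slack["T5"])   # 6000 500 1000 1000
```

Note that the 1000µs slack of T3 and T5 is shared: it bounds the sum of their extensions, not each one individually.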
The slack of 1000µs can be distributed across tasks T3 and T5 in multiple ways. For
simplicity, we assume that the processor executing tasks T3 and T5 supports three voltage
levels with the operating frequencies of 400MHz, 300MHz, and 240MHz, and ignore the
transition overheads and the idle power. Since we assume that the execution times of
tasks are inversely proportional to the processor frequency, the execution times of T3
and T5 at these levels are 1500µs, 2000µs, and 2500µs, respectively. The total execution
time of T3 and T5 can be increased by 1000µs, for example, by running both tasks at
300MHz, or by running one of them at 240MHz and the other one at 400MHz. The
optimal distribution of slack is determined by the power characteristics of the processor
voltage levels. For example, if the power relative to the highest voltage level is reduced
by 45% by switching to 300MHz, and by 65% by switching to 240MHz, energy savings
are greater if both tasks are run at 300MHz than if one of them is run at 240MHz and
the other one at 400MHz (26.7% vs. 20.8%). Finding the optimal distribution of slack
across a set of tasks is a non-trivial problem, which is further complicated if the effects
of transition overheads are taken into account. As a part of our technique, we use an
integer linear programming model to compute an efficient slack distribution across the
tasks selected for the application of DVS.
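The 26.7% and 20.8% figures above can be checked by enumerating the frequency assignments for T3 and T5 that respect the 1000µs slack. The power factors below encode the assumed 45% and 65% reductions relative to the 400MHz level.

```python
from itertools import product

# Assumed power factors relative to the 400 MHz level.
levels = {400: 1.00, 300: 0.55, 240: 0.35}
cycles = 1500 * 400      # T3 and T5 each take 1500 us at 400 MHz
slack_us = 1000

def saving(f3, f5):
    """Energy saving of a (T3, T5) frequency pair versus running both tasks
    at 400 MHz, or None if the pair exceeds the available slack."""
    t3, t5 = cycles / f3, cycles / f5
    if (t3 + t5) - 2 * 1500 > slack_us:
        return None
    energy = levels[f3] * t3 + levels[f5] * t5
    return 1 - energy / (2 * 1500)   # baseline energy: 1.0 * 1500 us per task

print(round(saving(300, 300), 3))    # 0.267 -> the 26.7% quoted in the text
print(round(saving(240, 400), 3))    # 0.208 -> the 20.8% quoted in the text
best = max((p for p in product(levels, repeat=2) if saving(*p) is not None),
           key=lambda p: saving(*p))
print(best)                          # (300, 300): run both tasks at 300 MHz
```

With discrete levels and many tasks this enumeration becomes infeasible, which is why the thesis resorts to an integer linear programming model for slack distribution.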
4.2 Problem Formulation
Similar to other authors [43, 46], we divide the problem of DVS-based power optimization
of MLCA applications into two sub-problems:
• Determining the voltage level at which each part of the application is executed.
We refer to this problem as voltage selection. Although DVS-enabled processors
are capable of switching between voltage levels at any point during the application
execution, we simplify our approach by assigning a unique voltage level to each task
executed at run time, allowing transitions only between tasks. We approach the
problem of voltage selection by identifying the tasks that fall outside the critical
path, computing the available slack, and finding a distribution of slack across these
tasks so as to minimize their energy consumption.
• Finding the appropriate task scheduling algorithm. A task scheduling algorithm
is defined by a scheme for ordering tasks in the task queue and the processor
mapping of tasks, which determines the choice of processor for the execution of
each dispatched task. The task scheduling algorithm must be implementable with
a reasonable overhead in hardware complexity of the MLCA task dispatcher.
These problems are interrelated, since for a given voltage selection, an inadequate
task scheduling algorithm can result in unacceptable degradation of application perfor-
mance. Similarly, with a given task scheduling algorithm, different voltage selections that
save comparable amounts of energy can have significantly different performance impact.
Therefore, the algorithms used to solve these two problems must be designed so as to
complement each other. Our approach is to first design a voltage selection algorithm and
then formulate a complementary task scheduling algorithm.
The voltage selection and task scheduling problems are both hard. The design space of
task scheduling algorithms is vast, and the problem of optimal scheduling of a task graph
with regard to minimizing the application execution time is NP-complete [25]. Regardless
of the simplifying assumption that a unique voltage level is assigned to each task, optimal
voltage selection with a given task scheduling algorithm is also an NP-hard problem if
the processor voltage levels are discrete [3]. Since the task graphs of MLCA applications
are large and complex in practice, these problems require heuristic solutions, which often
result in some increase in application execution time for realistic applications. Therefore,
in our problem formulation, we do not impose a strict requirement that the execution
time of the application must not be increased. However, in our experimental evaluation,
we show that the increase in execution time incurred by applying our technique is not
excessive.
Assuming that the target loop has the properties outlined in Chapter 3, the body of
the loop contains N task instructions S1, ..., SN , each of which invokes a task function
fn (it is possible that fi = fj for i ≠ j). We further simplify the voltage selection
problem by assigning a unique voltage level to each task instruction in the target loop.
All tasks executed by a particular task instruction in different iterations of the loop are
thus executed at the same voltage level. Therefore, we can state the following problem
formulation:
Voltage Selection and Task Scheduling Problem: For a given target loop in an
MLCA application with properties outlined in Chapter 3, find the voltage level Ln for
each task instruction Sn in the loop body, the ordering of the executed tasks, and the
processor mapping for each executed task, such that:
1. The task schedule does not violate the precedence constraints imposed by the data
dependences.
2. The energy consumed by the processors is minimized.
3. The execution time of the application is not significantly increased in comparison
to the execution time on the default MLCA system configuration.
We describe our proposed solution to the formulated problem in the following chapter.
Chapter 5
Voltage Selection and Task
Scheduling
In this chapter, we describe our algorithms for voltage selection and task scheduling.
Section 5.1 introduces the data structure on which the analysis of the target loop is
based—the loop dependence graph. Section 5.2 presents the procedure used to determine
the critical path of the dynamic task graph of the target loop. Section 5.3 describes the
procedure used to determine the available slack. Section 5.4 presents the algorithm for
selection of tasks to which DVS is applied. The algorithm for the distribution of slack
among the selected tasks, based on an integer linear programming model, is described
in Section 5.5. The task scheduling algorithm is described in Section 5.6. Section 5.7
discusses the selection of the target loops by the compiler.
5.1 The Loop Dependence Graph
The dependence graph [1] of the target loop is a directed graph whose nodes represent
task instructions from the target loop, and whose edges represent the data dependences
between pairs of task instructions. We label each node by the average execution time
of tasks executed by its corresponding task instruction, as determined by profiling. We
label each dependence edge with the distance of the dependence, which is defined as the
number of loop iterations between the execution of the sink and the source of the depen-
dence [1]. Since we assume that the MLCA control unit eliminates false dependences at
run-time using register renaming, we take only true data dependences (i.e. read-after-
write dependences) into account. We use the symbol Tni to denote the task executed by
the task instruction Sn in the loop iteration i. The distance of a true dependence Tmi δᵗ Tnj
is therefore equal to d if j = i + d. In the remainder of this section, we show that under
the stated assumptions, all dependences Tmi δᵗ Tnj between pairs of tasks executed by a
particular pair of task instructions (Sm, Sn) have the same distance, and unique distances
can thus be assigned to the dependences between pairs of task instructions.
Let {S1, . . . , SN} be the set of task instructions in the target loop. Since the body
of the target loop is assumed to contain no control-flow instructions, as described in
Section 3.2, each task instruction Sn is executed in every loop iteration. Assume that
there exists a dependence Tmi δᵗ Tnj that arises due to the existence of one or more registers
that are written by task Tmi and read by task Tnj. Obviously, i ≤ j must hold, because
the dependence source must precede the sink in the sequential execution order. However,
all registers written by task Tmi are also written by task Tm(i+1).¹ Therefore, in the
sequential execution order, task Tnj is preceded by task Tmi, but not by task Tm(i+1).
This is possible only if either m ≥ n and i = j − 1, or m < n and i = j. In the
former case, the dependence is loop-carried and has the dependence distance of one. In
the latter case, the dependence distance is zero and the dependence is loop-independent.
Therefore, we can divide the dependence edges in the loop dependence graph into these
two categories. Figure 5.1 shows an example target loop and its loop dependence graph.
The algorithm for constructing the loop dependence graph is shown in Figure 5.2.
¹In this chapter, we refer only to the logical registers in HyperAssembly code, which are in one-to-one correspondence with the Sarek data variables. In different loop iterations, operations with the same logical register can be renamed to different physical registers by the control processor, but the knowledge of the logical registers being read and written is sufficient for determining the true data dependences between tasks.
do {
Task_1(in index, out x);
Task_2(in x, in y, out x);
Task_3(in x, out y);
end = Task_4(in index, out index);
} while (!end);
(a) Sarek code (b) Dependence graph
Figure 5.1: Dependence graph of an example loop
For each register variable r read by task instruction Sn, the algorithm searches for the
task instruction Sm that outputs the last value written to r prior to each execution of Sn
and adds the appropriately labeled dependence edge to the graph.
In the remainder of this chapter, we assume that the execution times of tasks ex-
ecuted by each individual task instruction in the target loop are constant across loop
iterations. In other words, we assume that each task instruction is always executed with
the execution time assigned to its corresponding node in the loop dependence graph.
This assumption does not hold for real MLCA applications, since the execution times of
task functions normally depend on their inputs, which vary between iterations, as well
as between different input media streams. However, the positive results of the experi-
mental evaluation presented in Chapter 6 confirm that, despite this unrealistic
assumption in its theoretical derivation, our algorithm reasonably approximates
the behavior of real applications. Thus, in the remainder of this chapter, we refer to the
execution time of a task instruction as a uniquely defined value.
start with empty loop dependence graph;
for (each task instruction Sn in the target loop)
    add node Sn;
    mark node Sn with the average execution time of fn;
    for (each register variable r read by Sn)
        if (there exists m < n such that Sm writes to r)
            find maximum such m;
            add edge Sm → Sn;
            mark edge Sm → Sn with distance 0;
        else if (there exists m ≥ n such that Sm writes to r)
            find maximum such m;
            add edge Sm → Sn;
            mark edge Sm → Sn with distance 1;
Figure 5.2: Algorithm for constructing the dependence graph of the target loop
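A direct transcription of the Figure 5.2 algorithm into Python might look as follows. The register read/write sets are taken from the Figure 5.1(a) code; representing each instruction as a (name, reads, writes) triple is our own convention, not part of the thesis's toolchain.

```python
def build_dependence_graph(instrs):
    """Construct the loop dependence graph from per-instruction register
    read/write sets (the algorithm of Figure 5.2). `instrs` is a list of
    (name, reads, writes) triples in sequential program order."""
    edges = set()   # elements are (source, sink, distance) triples
    n_count = len(instrs)
    for n, (sn, reads, _) in enumerate(instrs):
        for r in reads:
            # Last writer of r before this execution of Sn: first look at
            # earlier instructions in the same iteration (distance 0) ...
            intra = [m for m in range(n) if r in instrs[m][2]]
            if intra:
                edges.add((instrs[max(intra)][0], sn, 0))
                continue
            # ... otherwise at the same or later instructions, whose write
            # comes from the previous iteration (distance 1).
            inter = [m for m in range(n, n_count) if r in instrs[m][2]]
            if inter:
                edges.add((instrs[max(inter)][0], sn, 1))
    return edges

# The loop of Figure 5.1(a):
loop = [("S1", {"index"}, {"x"}),
        ("S2", {"x", "y"}, {"x"}),
        ("S3", {"x"}, {"y"}),
        ("S4", {"index"}, {"index"})]
print(sorted(build_dependence_graph(loop)))
```

On this input the algorithm produces the loop-independent edges S1→S2 and S2→S3 and the loop-carried edges S3→S2, S4→S1, and S4→S4, matching the dependence graph of Figure 5.1(b).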
5.2 The Critical Path
In this section, we show how to deduce the critical path of the task graph of a target
loop using the loop dependence graph.
Let I be the number of iterations of the target loop whose body contains N task
instructions S1, . . . , SN . Since the body of the target loop is assumed to contain no
control-flow instructions, all N task instructions are executed in every iteration. There-
fore, besides the logical entry and exit nodes, the task graph of the target loop consists of
I subgraphs, each of which contains N nodes, corresponding to the tasks executed within
a single loop iteration. The total number of nodes in the task graph is therefore I ·N +2.
A part of the task graph of the target loop from Figure 5.1 is shown in Figure 5.3.
If the loop dependence graph of the target loop does not contain any cross-iteration
dependences, the loop is fully parallel, since a new iteration can be started at any time,
regardless of the status of tasks from previous iterations. A fully parallel loop scales
up to an arbitrary number of processors, provided that the number of loop iterations
is large enough. In some cases, the target loop can be fully parallel even in presence
of cross-iteration dependences. For example, if the loop shown in Figure 5.1 lacked the
dependences S3 δᵗ S2 and S4 δᵗ S4, it would be fully parallel, despite the presence of the
Figure 5.3: Task graph of the loop from Figure 5.1
cross-iteration dependence S4 δᵗ S1. This is evident from the fact that without these
dependences, shown with thick lines in Figure 5.3, the task graph would consist of a
series of disconnected subgraphs, each of which could be started independently. Since
the processor utilization during the execution of fully parallel loops is close to 100%, task
graphs of these loops do not contain any slack. Therefore, fully parallel loops do not offer
any opportunities for power optimizations using DVS.
Since the sequential execution order of task instructions is an order relation, the
subgraph of the loop dependence graph that contains only the edges with dependence
distance of zero is acyclic. Therefore, if the loop dependence graph contains any cycles,
each cycle contains at least one edge with dependence distance of one. We define a 1-cycle
in the loop dependence graph as a cycle that contains exactly one edge with dependence
distance of one. The key insight on which our voltage selection algorithm is based is that
each 1-cycle in the loop dependence graph translates into a path in the task graph that
stretches across all loop iterations. This path encompasses the tasks executed in each
iteration by the task instructions that form its corresponding 1-cycle. For example, both
cycles (S2, S3) and (S4) in Figure 5.1 are 1-cycles. They translate into the two paths
along the edges marked by thick lines in Figure 5.3.
If the number of loop iterations is large, the longest 1-cycle in the loop dependence
graph translates into the critical path in the task graph of the loop. The longest 1-cycle
(S2, S3) in the dependence graph from Figure 5.1 translates into the critical path in the
task graph from Figure 5.3, which starts with tasks TIN and T11 and then stretches across
pairs of tasks (T2i, T3i), for all i = 1, . . . , I. We call the task instructions that form the
longest 1-cycle in the loop dependence graph critical instructions, and the tasks executed
by these instructions critical tasks.
The first and last several tasks in the critical path may consist of non-critical tasks (i.e.
tasks executed by task instructions outside of the longest 1-cycle in the loop dependence
graph). For example, in the task graph from Figure 5.3, the critical path begins with
non-critical tasks TIN and T11. However, assuming that the number of iterations is large,
the total execution time of these non-critical tasks is negligible in comparison with the
contribution of the repeated execution of critical tasks. Therefore, the length of the
critical path can be approximated by I · τC , where τC is the length of the longest 1-cycle
in the loop dependence graph.
Finding the longest cycle in a graph is an NP-complete problem in the general
case [11]. However, the longest 1-cycle in the loop dependence graph can be found
by identifying each edge (Si, Sj) with dependence distance of one and searching for the
longest path from Sj to Si that contains only edges with dependence distance of zero.
The longest such path, together with the corresponding edge (Si, Sj), forms the longest
1-cycle. Since these searches for longest paths are limited to an acyclic subgraph of the
loop dependence graph, i.e. its subgraph containing only edges with dependence dis-
tance of zero, they can be performed in polynomial time [11]. Since the loop dependence
graphs in practice contain up to several tens of tasks, this procedure takes negligible
computational time.
5.3 The Available Slack
As explained in Section 3.3, the minimum execution time of a task graph is equal to the
length of its critical path, and a necessary condition for the execution time to be reduced
to this value is that the tasks from the critical path are executed in an uninterrupted
sequence. Assuming that the critical path in the task graph of the target loop can be
approximated by the path connecting the critical tasks, as defined in Section 5.2, the
minimum execution time of the target loop can be closely approximated by I · τC , where
I is the number of loop iterations and τC is the length of the longest 1-cycle in the loop
dependence graph. Therefore, the maximum number of processors for which the loop
scales is the one for which the execution time of the loop is reduced to approximately
I · τC . Increasing the number of processors beyond this value cannot result in additional
speedup of the loop execution, since the length of the critical path of a task graph is a
strict lower bound on its execution time.
Let NP be the number of processors in the MLCA system, and τ the sum of the
execution times of all non-critical task instructions in the target loop. Assuming that the
target loop is running on the maximum number of processors allowed by its scalability,
its total execution time is I · τC , and the sum of processor times spent in its execution is
thus NP · I · τC . Total processor time available per loop iteration is thus NP · τC . This
amount of processor time must be sufficient for the execution of all tasks from a single
iteration, both critical and non-critical ones, and must therefore be greater than the sum
of the execution times of all task instructions in the target loop, which is equal to τC + τ .
Thus, we can formulate the following necessary condition for the execution time of the
loop to be minimal:
NP · τC > τC + τ,
or equivalently:
(NP − 1) · τC > τ. (5.1)
Condition (5.1) effectively states that while the critical tasks from one iteration, whose to-
tal execution time is τC , continuously occupy one processor at a time, the total processor
time available on the remaining NP − 1 processors must be large enough to accommo-
date the execution of non-critical tasks whose total execution time is τ . Otherwise, the
available processor time per iteration would be insufficient for the continuous execution
of non-critical tasks, which would result in slowdown due to delays in the execution of
critical tasks. This conclusion holds regardless of the possible out-of-order execution of
non-critical tasks.
Although necessary, Condition (5.1) is not always sufficient to guarantee that the
critical tasks can be executed in an uninterrupted sequence. For example, in the loop
dependence graph shown in Figure 5.4, the longest 1-cycle is (S2, S5, S6) (for simplicity,
labels on the edges with distance zero are omitted). If the loop is executed on two
processors, Condition (5.1) is satisfied, since NP = 2, τC = 3000, and τ = 2400. However,
with two processors it is not possible to schedule the critical tasks in an uninterrupted
sequence, since that would require S3 and S4 to be executed in parallel with S5 in each
iteration, which is impossible with only two processors. The maximum speedup of this
loop is thus achieved with three processors.
From the derivation of Condition (5.1), it follows that the difference
tS = (NP − 1) · τC − τ (5.2)
is equal to the idle processor time per loop iteration. Since the number of processors NP
is discrete, tS is likely to be greater than zero even if the limit to the loop scaling is equal
to the smallest NP that satisfies Condition (5.1). From Formula (5.2), it follows that
tS is a strict upper bound on the available slack per loop iteration. Prolonging the
total execution time of non-critical tasks in each iteration by more than tS is guaranteed
to result in an increase in the total execution time of the loop, since it would lead to
a violation of Condition (5.1) and cause a delay in the execution of the critical tasks.
Therefore, the available slack per iteration is equal to a certain fraction of tS.
Figure 5.4: An example loop dependence graph
Ideally, if the slack per iteration is equal to tS, the total execution time of the non-
critical task instructions can be increased from τ to τ + tS without causing delay in
the execution of the critical tasks, thus reducing the idle processor time to zero without
affecting the overall loop execution time. The example from Figure 5.4 illustrates that
this is not always the case. If this loop is executed on three processors, Formula (5.2)
yields tS = 3600. However, if the total execution time of non-critical tasks is increased by
that amount, task instructions S3 and S4 cannot be executed in parallel with S5 in each
iteration without delaying the execution of S6, regardless of the way in which the slack
is distributed across S1, S3, S4, and S7. Therefore, Formula (5.2) clearly overestimates
the available slack in this case.
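Both observations can be reproduced numerically. The following one-line helper (the function name is ours) evaluates Formula (5.2) with the values quoted above for the loop in Figure 5.4:

```python
def slack_per_iteration(num_procs, tau_c, tau):
    """Formula (5.2): t_S = (N_P - 1) * tau_C - tau.
    Condition (5.1) holds exactly when the result is positive."""
    return (num_procs - 1) * tau_c - tau

# Two processors: Condition (5.1) is satisfied (600 > 0) ...
print(slack_per_iteration(2, 3000, 2400))  # 600
# ... three processors: Formula (5.2) yields 3600, which, as argued above,
# overestimates the slack that is actually usable for this particular loop.
print(slack_per_iteration(3, 3000, 2400))  # 3600
```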
In our voltage selection algorithm, we use the value tS computed according to
Formula (5.2) as an approximation of the available slack per loop iteration. However, in
order to allow for the possibility that this value overestimates the available slack, we use
only a fraction r of the computed slack tS. Choosing a smaller value of r also enables us
to compensate for the effects of relaxing the unrealistic assumption that the execution
time of each task instruction is invariable. We leave r as a tunable parameter of our
voltage selection algorithm, with a value between zero and one. For r = 0, no slack is
used and no tasks are scaled (although the application execution time and energy
consumption can still be affected by our task scheduling algorithm). For r = 1, the entire
slack tS computed according to Formula (5.2) is distributed across the non-critical
tasks in each iteration. Since the evaluation presented in Chapter 6 shows that our
algorithms do not require excessive execution time, it is possible to determine a good
value of r for the target application during the profiling phase by repeatedly applying
our technique with different values of r and comparing the results.
5.4 Selection of Task Instructions for Slack Distribution
In this section, we present our procedure for selecting the non-critical tasks among which
the slack is distributed by applying DVS. We refer to these tasks as scaled tasks, and the
task instructions executing these tasks as scaled task instructions.
Since we assume that the overheads of transitions between voltage levels are not
negligible, we select the scaled tasks so as to minimize the number of transitions. For
this purpose, we group the scaled tasks into sets containing multiple tasks and schedule
each set of scaled tasks as an uninterrupted sequence on a single processor, thus making
it possible to amortize the impact of each transition overhead over multiple tasks.
Our task scheduling algorithm, which we describe in Section 5.6, is formulated so as to
schedule each designated set of scaled tasks in this manner.
We analyze the loop dependence graph in order to identify sets of non-critical task
instructions that can be executed as uninterrupted sequences in each loop iteration and
are therefore suitable candidates for scaling. In the example from Figure 5.5,
the longest 1-cycle is (S5, S7, S8), and the remaining task instructions are thus non-
critical. For an arbitrarily selected set K of non-critical task instructions, it may or may
not be possible to execute the task instructions from K as an uninterrupted sequence.
For example, if K = {S1, S2, S4}, task instructions from K cannot be executed as an
uninterrupted sequence, since S4 cannot start before both S2 and S3 have finished, and
S3 cannot be executed in parallel with S2 without delaying the execution of S4. Therefore,
K would not be a suitable choice for a set of scaled task instructions. In contrast, the
choice of set {S1, S3, S4} would be suitable, since it is possible to schedule these task
instructions as an uninterrupted sequence, assuming that S2 is executed in parallel with
S3.
Furthermore, we wish to exclude from consideration those non-critical tasks whose
prolonging may interrupt the continuous execution of critical tasks. For example,
although task instruction S6 is characterized by a slack of 2000 time units, prolonging
the tasks executed by S6 increases the probability that the execution of the critical task
instruction S8 will be delayed in iterations in which S6 cannot be scheduled in parallel
with S7 promptly after S5 finishes. In order to reduce the probability of such
occurrences, task instructions such as S6, which must be executed in parallel with the
critical task instructions, are not considered as candidates for scaling.
We formulate the criteria for the selection of scaled task instructions with the
considerations outlined above in mind. We define two sets of candidates for scaling, which we
denote C1 and C2, using criteria designed so as to exclude tasks whose prolonging can
directly interrupt the continuous execution of critical tasks, such as those executed by
the task instruction S6 in the example in Figure 5.5:
• C1 is the set of task instructions that can be executed before any of the critical
task instructions from the same loop iteration have started execution.
• C2 is the set of task instructions that do not belong to C1 and can be executed after
all of the critical instructions from the same loop iteration have finished execution.
Figure 5.5: An example loop dependence graph
For example, in the loop dependence graph shown in Figure 5.5, C1 = {S1, S2, S3, S4, S11},
and C2 = {S9, S10}.
To ensure that each set of scaled tasks can be executed as an uninterrupted sequence,
we compute the longest paths in the subgraphs of the loop dependence graph comprised
of the nodes from C1 and C2 and the edges with dependence distance zero between
these nodes. These two paths define the two sets of scaled task instructions, which we
denote using symbols K1 and K2. For the example from Figure 5.5, K1 = {S1, S3, S4},
and K2 = {S9, S10}. Tasks from each of these paths can be executed as uninterrupted
sequences if the number of processors is large enough to accommodate the execution of
G = loop dependence graph;
C = set of critical task instructions;
C1 = C2 = empty set;
G1 = G2 = empty graph;
for (each task instruction S in the target loop)
    if (S does not depend on any task instructions from C)
        add S to C1;
    else if (no task instruction from C depends on S)
        add S to C2;
add nodes from G corresponding to task instructions from C1 to G1;
add nodes from G corresponding to task instructions from C2 to G2;
label nodes from G1 and G2 with execution times of task instructions;
for (each edge Si → Sj labeled with distance 0)
    if (Si ∈ C1 and Sj ∈ C1)
        add edge Si → Sj to G1;
    else if (Si ∈ C2 and Sj ∈ C2)
        add edge Si → Sj to G2;
P1 = longest path in G1;
P2 = longest path in G2;
K1 = set of task instructions from P1;
K2 = set of task instructions from P2;
Figure 5.6: Algorithm for selecting the sets of scaled task instructions
all tasks that must be executed in parallel with each path. We assume that this is the
case, which is reasonable since we assume that the loop is executed on the maximum
number of processors allowed by its scalability.
The pseudocode of the algorithm for selection of scaled tasks is shown in Figure 5.6.
One step in the algorithm includes the computation of the longest paths in graphs G1 and
G2. Although the general problem of finding the longest path in a graph is NP-complete,
searching for the longest paths in G1 and G2 can be done in polynomial time, since these
graphs are acyclic [11]. In practice, these computations require negligible time because
of the small size of loop dependence graphs.
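The algorithm of Figure 5.6 can be rendered as executable Python, as sketched below. The graph encoding (adjacency dict over zero-distance edges, node-weight dict) is our own assumption, and the toy graph at the bottom is invented for illustration; it is not the graph of Figure 5.5.

```python
def select_scaled_sets(nodes, zero_edges, weights, critical):
    """Sketch of Figure 5.6: build candidate sets C1/C2 from zero-distance
    dependences, then take the longest path inside each as K1/K2."""
    # Transpose the zero-distance edges to get predecessor lists.
    preds = {v: set() for v in nodes}
    for u, succs in zero_edges.items():
        for v in succs:
            preds[v].add(u)

    def reachable(start, nbrs):
        """All nodes transitively reachable from start via nbrs."""
        seen, stack = set(), list(nbrs.get(start, []))
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(nbrs.get(u, []))
        return seen

    succ_map = {u: zero_edges.get(u, []) for u in nodes}
    # C1: does not (transitively) depend on any critical instruction.
    c1 = {s for s in nodes if s not in critical
          and not (reachable(s, preds) & critical)}
    # C2: not in C1, and no critical instruction depends on it.
    c2 = {s for s in nodes if s not in critical and s not in c1
          and not (reachable(s, succ_map) & critical)}

    def longest_path_set(members):
        """Longest weighted path within the induced acyclic subgraph."""
        memo = {}
        def dfs(u):
            if u in memo:
                return memo[u]
            best, path = weights[u], [u]
            for v in zero_edges.get(u, []):
                if v in members:
                    l, p = dfs(v)
                    if weights[u] + l > best:
                        best, path = weights[u] + l, [u] + p
            memo[u] = (best, path)
            return memo[u]
        if not members:
            return set()
        return set(max((dfs(u) for u in members), key=lambda t: t[0])[1])

    return longest_path_set(c1), longest_path_set(c2)

# Invented toy graph: X is the only critical instruction; A -> B -> X -> C -> D.
nodes = ['A', 'B', 'C', 'D', 'X']
edges = {'A': ['B'], 'B': ['X'], 'X': ['C'], 'C': ['D']}
w = {'A': 100, 'B': 200, 'C': 150, 'D': 50, 'X': 500}
print(select_scaled_sets(nodes, edges, w, {'X'}))
```

In this toy instance, A and B precede the critical instruction and form K1, while C and D follow it and form K2.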
The amount of slack available per loop iteration, computed as described in Section 5.3,
is distributed across the tasks executed by the task instructions from sets K1 and K2 in
each iteration.
5.5 The ILP Model for Slack Distribution
Once the amount of slack per loop iteration tS is computed and the sets of scaled task
instructions are selected, it is necessary to determine the distribution of slack across the
scaled task instructions. We formulate the problem of optimal slack distribution as an
integer linear programming (ILP) model similar to the one used by Saputra et al. [38].
As described in Section 5.4, we select two disjoint sets of scaled task instructions
K1 and K2 from the target loop, which translate into two sets of scaled tasks in each
iteration. The ILP formulation of the optimal slack distribution problem is based on the
assumption that each set of scaled tasks is executed as an uninterrupted sequence on a
single processor, which is ensured by the task scheduling algorithm. The main constraints
in the ILP model stem from the requirement that the total execution time of all scaled
tasks in each iteration must not exceed their execution time at the highest voltage level
plus the available slack per iteration. The model takes the effects of transition overheads
into account, and the optimal solution will be the one that most efficiently amortizes
these overheads across each set of scaled tasks.
5.5.1 Variables and Constraints
Let m and n be the number of task instructions from sets K1 and K2, respectively. We
denote the task instructions from K1 using symbols S1, . . . , Sm, and the task instructions
from K2 using symbols S′1, . . . , S′n (in order to simplify the notation, we use Si here to
denote an arbitrary task instruction, not the i-th task instruction in the sequential
execution order as in the previous examples of loop dependence graphs).
Let L be the number of voltage levels supported by the processors, which are enumerated
so that L is the level with the highest voltage and operating frequency. We encode
the voltage level assigned to each task instruction Si from the first set using L binary
variables αi1, . . . , αiL. The value of variable αij is one if Si is executed at the voltage level
j, and zero otherwise. We similarly encode the voltage level of each task instruction S′k
from the second set using L binary variables βk1, . . . , βkL. Obviously, out of each set of
L variables αi1, . . . , αiL and βk1, . . . , βkL, the value of exactly one variable must be one.
Hence the constraints:
αi1 + · · ·+ αiL = 1 (5.3)
βk1 + · · ·+ βkL = 1, (5.4)
for all i = 1, . . . , m and k = 1, . . . , n.
In order to encode the occurrences of voltage level transitions, we introduce m + 1
binary variables c0, . . . , cm and n + 1 binary variables d0, . . . , dn.
The value of variable c0 is one if a voltage transition takes place immediately prior to
the execution of the task instruction S1, and zero otherwise. Since the non-scaled tasks
execute at level L, this condition is equivalent to α1L = 0. The value of variable cm is
one if a transition takes place immediately after the execution of the task instruction
Sm, i.e. if αmL = 0, and zero otherwise. Similarly, the value of d0 is one if and only if a
transition takes place immediately prior to the execution of S′1, and the value of dn is one
if and only if a transition takes place immediately after the execution of S′n. Therefore,
the constraints on the values of these variables are:
the constraints on the values of these variables are:
c0 = 1 − α1L (5.5)
cm = 1 − αmL (5.6)
d0 = 1 − β1L (5.7)
dn = 1 − βnL (5.8)
For i = 1, . . . , m − 1, the value of variable ci is one if a transition takes place between
the execution of task instructions Si and Si+1, and zero otherwise. Since a transition
occurs between two tasks if and only if the voltage levels at which these tasks are executed
are different, ci = 1 if and only if αij ≠ α(i+1)j for exactly two values of j = 1, . . . , L. Since
the variables are binary, the definition of variables ci can be expressed by the following
set of m − 1 non-linear constraints:

ci = max_{j=1..L} (αij − α(i+1)j),

for all i = 1, . . . , m − 1. This set of m − 1 non-linear constraints is equivalent to the
following (m − 1) · L linear constraints:

ci ≥ αij − α(i+1)j, (5.9)

for all i = 1, . . . , m − 1 and j = 1, . . . , L.
For k = 1, . . . , n − 1, the value of variable dk is one if a transition takes place between
the execution of task instructions S′k and S′k+1, and zero otherwise. By reasoning identical
to that applied to variables ci, we arrive at the following (n − 1) · L constraints for these
variables:

dk ≥ βkj − β(k+1)j, (5.10)

for all k = 1, . . . , n − 1 and j = 1, . . . , L.
We represent the execution times of tasks using real numbers, which are constants in
the model and can therefore be used as coefficients of integer variables in the constraints
and the objective function. We denote the execution times of the task instructions Si
and S′k when run on a processor operating at the voltage level j using the symbols tij
and t′kj, respectively. The execution times of the tasks executed by task instructions Si
and S′k can thus be represented by the following linear expressions:

execution time(Si) = Σ_{j=1..L} αij tij

execution time(S′k) = Σ_{j=1..L} βkj t′kj. (5.11)
The total execution time of all task instructions in both groups is bounded by their
total execution time at the highest voltage level plus the available slack time. The total
overhead in execution time incurred by the transitions must be added to this time.
Using Formula (5.11) for the execution times of task instructions, we formulate the
following constraint:

Σ_{i=1..m} Σ_{j=1..L} αij tij + Σ_{k=1..n} Σ_{j=1..L} βkj t′kj + Σ_{i=0..m} ci tTR + Σ_{k=0..n} dk tTR ≤ tmin + r · tS, (5.12)

where tTR is the duration of a voltage transition, tmin is the total execution time of
the scaled task instructions at the highest voltage level, tS is the slack time computed
according to Formula (5.2), and r is the tunable parameter of the algorithm, as defined
in Section 5.3.
When the execution time of a set of task instructions is prolonged, some of these
task instructions may belong to certain 1-cycles in the loop dependence graph. These
task instructions must not be prolonged beyond the limit that would make one of these
1-cycles longer than the current longest 1-cycle, which determines the length of the
critical path. Therefore, for each 1-cycle c that contains one or more scaled task
instructions, a constraint must be added to the model to ensure this requirement:

Σ_{Si∈c} (ci tTR + Σ_{j=1..L} αij tij) + Σ_{S′k∈c} (dk tTR + Σ_{j=1..L} βkj t′kj) < τC − τnc, (5.13)

where τC is the length of the longest 1-cycle, τnc is the total execution time of all
non-scaled task instructions in 1-cycle c (which is a constant in the model), and the
outer sums are over all scaled task instructions from the first and second set,
respectively, that are encompassed by 1-cycle c.
5.5.2 Objective Function
The goal of the ILP model is to minimize the sum of energy consumed by the execution of
the scaled task instructions, including the energy overhead of the transitions, and taking
into account the influence of scaling on the energy consumed by the processors in the
idle mode.
We use the symbol Eij to denote the energy consumed by the task instruction Si
executed at the voltage level j. We similarly define E′kj for each task instruction S′k.
These quantities are constants in the model. We use the symbol ETR to denote the
energy consumed by a single transition. According to the definitions of variables αij, βkj,
ci, and dk, we can represent the energy consumption during the execution of scaled tasks
and transition overheads in each iteration as:
energy(tasks) = Σ_{i=1..m} Σ_{j=1..L} αij Eij + Σ_{k=1..n} Σ_{j=1..L} βkj E′kj

energy(transitions) = Σ_{i=0..m} ci ETR + Σ_{k=0..n} dk ETR.
We can account for the idle power by noting that each increase in task execution time
and the execution of each transition overhead reduces the total processor time spent in
the idle mode. Thus, if we define the constants ηij = Eij − Pidle tij, η′kj = E′kj − Pidle t′kj,
and ε = ETR − Pidle tTR, where Pidle is the processor power in the idle mode, we can
formulate the following objective function:

Σ_{i=1..m} Σ_{j=1..L} αij ηij + Σ_{k=1..n} Σ_{j=1..L} βkj η′kj + Σ_{i=0..m} ci ε + Σ_{k=0..n} dk ε. (5.14)

Solving the ILP model with this objective function and the constraints given by Equations
(5.3)–(5.13) yields the optimal distribution of the available slack among the scaled task
instructions under the stated assumptions.
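Because the model is small, its behavior can be checked by brute force on a toy instance. The sketch below enumerates every voltage assignment for m = 2 scaled task instructions (K2 empty, L = 2 levels), derives the transition indicators c0, . . . , cm directly from the assignment, discards assignments that violate Constraint (5.12), and minimizes Objective (5.14). All numbers are invented for illustration; in the actual implementation the model is handed to an ILP solver, as discussed in Section 5.5.3.

```python
from itertools import product

# Toy instance: m = 2 scaled task instructions in K1, K2 empty, L = 2
# voltage levels (level 2 = highest). All constants below are made up.
t = {1: [200, 100], 2: [300, 150]}   # t[i][j-1]: time of S_i at level j
E = {1: [8, 20], 2: [12, 30]}        # E[i][j-1]: energy of S_i at level j
L_LEVELS, T_TR, E_TR, P_IDLE = 2, 10, 60, 5
t_min = t[1][L_LEVELS - 1] + t[2][L_LEVELS - 1]  # total time at highest level
slack = 200                          # r * t_S, assumed precomputed

best = None
for levels in product(range(1, L_LEVELS + 1), repeat=2):
    # Transition indicators c_0, c_1, c_2, derived from the assignment
    # (non-scaled tasks run at level L, hence the boundary terms).
    c = [int(levels[0] != L_LEVELS),
         int(levels[0] != levels[1]),
         int(levels[1] != L_LEVELS)]
    total_time = sum(t[i + 1][levels[i] - 1] for i in range(2)) + sum(c) * T_TR
    if total_time > t_min + slack:   # Constraint (5.12) violated
        continue
    # Objective (5.14): eta_ij = E_ij - P_idle*t_ij, eps = E_TR - P_idle*t_TR
    obj = (sum(E[i + 1][levels[i] - 1] - P_IDLE * t[i + 1][levels[i] - 1]
               for i in range(2))
           + sum(c) * (E_TR - P_IDLE * T_TR))
    if best is None or obj < best[0]:
        best = (obj, levels)

print(best)  # (-1948, (2, 1)): run S1 at the highest level, scale S2 down
```

For these numbers, slowing both tasks down would exceed the time bound, so the optimum scales only the longer task S2; the negative objective values simply reflect the subtraction of the idle-power terms.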
5.5.3 Estimate of the Number of Variables
Since solving an ILP model is an NP-complete problem [11], the number of variables in
the model is critical for the performance of our algorithm.
If the number of task instructions from sets K1 and K2 is m and n, respectively,
the ILP model contains the following binary variables, using the previously introduced
notation:
• m · L variables αij
• n · L variables βkj
• m + 1 variables ci
• n + 1 variables dk.
The model does not contain any continuous or non-binary integer variables. Therefore,
the total number of variables in the model is:
Nv = m · L + n · L + m + 1 + n + 1 = (m + n) · (L + 1) + 2, (5.15)
and all of the variables are binary.
The total number of scaled task instructions m+n is smaller than the number of task
instructions in the target loop. Furthermore, the size of the target loop in practice is up
to several tens of tasks, and the number of voltage levels L supported by the processors
is small (typically smaller than ten). Therefore, the size of the ILP model is well within
the capabilities of standard ILP solvers. We implement the slack distribution algorithm
using the freely available lp_solve optimization library [5].
5.6 Task Scheduling
In this section, we present the task scheduling algorithm, which is designed so as to
complement the voltage selection algorithm presented in the previous sections.
5.6.1 Task Ordering
For the purposes of task scheduling, we divide tasks into three classes, based on the
definitions from Sections 5.2 and 5.4:
1. Critical tasks;
2. Scaled non-critical tasks;
3. All remaining non-critical tasks.
Tasks from each of the listed classes have priority over tasks from the following classes.
Within each class, the priority is determined according to the sequential execution order
of task instructions. Therefore, if tasks Tmi and Tnj belong to the same class, Tmi has
greater priority either if m < n or if m = n and i < j.
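The three-class ordering with the sequential-order tiebreak can be expressed as a single sort key. The tuple encoding of a task below is a hypothetical representation chosen for illustration: m is the position of the issuing task instruction in sequential order, i the iteration number, and cls the class (0 for critical, 1 for scaled non-critical, 2 for the remaining non-critical tasks).

```python
def priority_key(task):
    """Sort key implementing the ordering of Section 5.6.1:
    class first, then task-instruction position m, then iteration i."""
    m, i, cls = task
    return (cls, m, i)  # smaller tuple = dispatched first

tasks = [(3, 1, 2), (1, 2, 0), (2, 1, 1), (1, 1, 0)]
print(sorted(tasks, key=priority_key))
# [(1, 1, 0), (1, 2, 0), (2, 1, 1), (3, 1, 2)]
```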
5.6.2 Processor Mapping
The scheme for the mapping of tasks onto N processors P0, . . . , PN−1 is shown in Fig-
ure 5.7. The symbol Kxi is used in the figure to denote the set of scaled tasks executed in
iteration i by the task instructions from set Kx, where x is either 1 or 2, as defined in Sec-
tion 5.4. The mapping is performed according to the following rules. A single processor is
reserved for the execution of the critical tasks exclusively. All critical tasks are executed
as an uninterrupted sequence on this processor. Non-critical tasks are scheduled on the
remaining processors according to the round-robin scheme, with the following exception:
once the first task in a set of scaled tasks Kxi from the iteration i has been scheduled
onto a processor, all remaining tasks from the same set of scaled tasks are scheduled onto
the same processor, and no other task may be scheduled onto that processor until the
execution of the whole set Kxi is finished.
The stated rules for task ordering and processor mapping imply that each set of scaled
tasks is executed as an uninterrupted sequence, thus ensuring that the voltage selection
by the algorithm described in Section 5.5 is indeed optimal for the given amount of slack
per loop iteration.
n = 0;
while (target loop runs)
    Tmi = next task issued by the task dispatcher;
    if (Sm is a critical task instruction)
        schedule Tmi onto PN−1;
    else if (Tmi ∈ Kxi)
        if (Kxi has not started)
            n = next ready processor(n);
            if (n != -1)
                schedule Tmi onto Pn;
                mark Pn as executing Kxi;
        else
            k = processor marked as executing Kxi;
            schedule Tmi onto Pk;
            if (Tmi is the last task from Kxi)
                remove mark Kxi from Pk;
    else
        n = next ready processor(n);
        if (n != -1)
            schedule Tmi onto Pn;

next ready processor(n)
    for (i = 0; i < N-1; i++)
        k = (n+i) mod (N-1);
        if (Pk is not busy and not marked as executing a set of scaled tasks)
            return k;
    return -1;
Figure 5.7: Algorithm for the processor mapping of tasks
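The round-robin helper at the bottom of Figure 5.7 can be written out as executable Python. The processor-state encoding (a list of busy flags plus a per-processor mark naming the scaled set it is committed to, if any) is our own illustrative assumption.

```python
def next_ready_processor(n, busy, mark, N):
    """Round-robin scan over P_0 .. P_{N-2}, starting at P_n; P_{N-1} is
    reserved for critical tasks. Returns a ready processor index, or -1."""
    for i in range(N - 1):
        k = (n + i) % (N - 1)
        if not busy[k] and mark[k] is None:
            return k
    return -1

# Four processors: P0 is busy, P1 is committed to a scaled set, P2 is free.
busy = [True, False, False, False]
mark = [None, "K1i", None, None]
print(next_ready_processor(0, busy, mark, 4))  # 2
```

Note that even when all of P0 .. PN−2 are occupied, the helper returns -1 rather than falling back to PN−1, preserving the reservation of that processor for the critical tasks.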
5.6.3 Hardware Implementation
We believe that the proposed task scheduling algorithm can be implemented in the MLCA
with a small overhead in hardware complexity. The format of the task instructions
in the control program can be enhanced so that each instruction is marked according
to the class to which it belongs (critical, scaled non-critical, or ordinary non-critical).
These marks can be generated by the compiler. Task ordering can be implemented
by replacing the FIFO queue in the task dispatcher by a priority queue that orders
the task instructions according to their class and sequential execution order. Processor
mapping can be implemented by assigning to each processor a register that stores the
information whether it is currently marked as executing a particular sequence of scaled
tasks. We believe that adding these mechanisms would not substantially increase the
existing complexity of the MLCA control unit hardware.
5.7 Target Loop Selection
Target loops in MLCA control programs are a subset of the natural loops, which can be
detected using well-known compiler algorithms [32]. Using the profiling information, long-
running loops can be selected as those whose execution time accounts for a percentage
of the total application execution time above a predefined threshold. In practice, MLCA
applications tend to have a single long-running loop that handles the processing of the
input media stream, which will be selected as the target loop regardless of the value of
this threshold.
Other conditions outlined in Chapter 3 can be determined from the profiling and
control-flow information. If control-flow instructions are present in the loop, the profiling
information can be used to determine whether the branches are severely biased in a single
direction, and the loop can be selected as a target loop if all branches are biased above
a certain threshold.
Our technique can also be applied to perfect loop nests, even if the execution time of
the innermost loop is not long enough for it to be considered a target loop by itself. The
derivation of our algorithms is applicable to such loops, since each time the innermost loop
is executed, the values written into the URF registers in its last iteration are read by the
tasks executed in its first iteration when it is executed again in the subsequent iteration
of the outer loop. This conclusion can be generalized to imperfect loop nests in which
the total execution time of the tasks outside the innermost loop accounts for a negligible
percentage of the total loop execution time, and there are no data dependences between
these tasks and the tasks within the innermost loop. This generalization is justified
because with these assumptions, the overall effect of the tasks outside the innermost
loop is negligibly small.
Another class of loops to which our approach could be applied are loops in which
there exist branches that are not severely biased, but each non-biased branch has the
property that both of its directions result in the same sets of URF registers being read
and written. In this case, the dependence distances are still guaranteed to be either zero
or one, since all values written into the registers in one iteration are overwritten in the
subsequent one. For such loops, the loop dependence graph could be approximated by
introducing nodes that represent the average of the outcomes over the branch directions,
computed using some general rule. For example, assume that one direction of branch B
executes task instructions Si and Sj whose execution times are ti and tj, respectively,
and the other direction executes a single task instruction Sk with execution time tk.
Furthermore, assume that the sets of registers read and written by Si and Sj are identical
to the sets of registers read and written by Sk. This control-flow construct could be
approximated by a single node in the loop dependence graph labeled with execution time
p(ti + tj) + (1 − p)tk, where p is the probability of B taking the first direction measured
by profiling. Unfortunately, we were not able to test this approach in practice, since our
benchmark applications lack such control-flow structures in their loops.
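The averaging rule itself is simple arithmetic; the following sketch computes the label of the merged node (the helper name and the numbers are our own, invented for illustration).

```python
def merged_node_time(p, first_dir_times, second_dir_times):
    """Execution-time label of the merged node:
    p * (t_i + t_j + ...) + (1 - p) * (t_k + ...)."""
    return p * sum(first_dir_times) + (1 - p) * sum(second_dir_times)

# Invented numbers: the first direction runs S_i (100) and S_j (200) with
# profiled probability p = 0.75; the second direction runs S_k (150).
print(merged_node_time(0.75, [100, 200], [150]))  # 262.5
```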
Chapter 6
Evaluation
In this chapter, we present the experimental evaluation of our solution. In Section 6.1, we
describe the benchmark applications used in the evaluation. In Section 6.2, we describe
our experimental platform and the assumed power-related properties of the processors
in the MLCA system. In Section 6.3, we present the energy savings achieved by the
application of our technique and the impact of our technique on the application execution
time. Section 6.4 investigates the contribution of different processor modes to the overall
power consumption. In Section 6.5, we evaluate the computational intensity of our
technique. Finally, in Section 6.6 we evaluate the quality of the achieved results by
comparing them with the results of a study aimed at finding a practical upper bound on
the energy savings, as well as with the results of an alternative technique based on the
partitioning of the task graph.
6.1 Benchmark Applications
We evaluate our solution using three realistic multimedia applications: JPEG image
encoder, GSM voice encoder, and MPEG sound decoder.
6.1.1 JPEG Image Encoder
The JPEG (Joint Photographic Experts Group) standard specifies a method for lossy com-
pression of digital images. It is specifically designed for encoding realistic images charac-
terized by gradual color transitions, such as digital photographs or scans of paintings.
For these types of digital images, the JPEG algorithm achieves high compression ratios
(typically 5-10) without perceptible degradation of image quality.
We have ported the encoder part of the open source implementation of the JPEG
standard developed by the Independent JPEG Group [12] to the MLCA. This implemen-
tation of JPEG is included in the Mediabench multimedia benchmark suite [27]. The
JPEG encoder converts the input 24-bit bitmap image, which consists of a series of raw
color samples, to a JPEG output image.
Besides the initialization and clean-up code, the control program for the JPEG en-
coder contains a single long-running perfect loop nest. The body of the innermost loop
does not contain any control-flow instructions. The execution times of task functions
in JPEG vary significantly, and even their average execution times are dependent on
the characteristics of the input image. The parallel speedup1 of the JPEG encoder on
the MLCA is shown in Figure 6.1. Since the speedup varies somewhat with inputs, the
figure shows the average over a set of twelve different images, weighted by the execution
times of the application runs with individual inputs. The application scales up to six
processors.
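The weighted averaging used for the curves in Figure 6.1 can be sketched as follows; the helper function and the per-image speedups and weights are purely illustrative, not measured values:

```python
# Weighted averaging over a set of inputs: each input's speedup is weighted by
# the execution time (total work) of the application run with that input.
# All numeric values below are made up for illustration.

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

speedups  = [5.2, 4.8, 5.0]   # illustrative per-image parallel speedups
run_times = [2.0, 1.0, 1.0]   # illustrative per-image execution times (weights)
avg_speedup = weighted_average(speedups, run_times)
```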
6.1.2 GSM Voice Encoder
GSM 06.10 is a standard for digital sound compression used in digital mobile telephone
networks. The GSM compression algorithm is optimized for encoding telephone-quality
1 Since the MLCA applications are inherently parallel and there is no equivalent of a sequential program version for them, we define the parallel speedup as the ratio of the application execution time on a 1-processor MLCA to that on an N-processor MLCA.
[Figure: speedup vs. number of processors (0-12) for JPEG, GSM, and MPEG, together with the ideal speedup line]

Figure 6.1: Speedup of the MLCA benchmark applications
human voice. It takes an input consisting of raw PCM sound samples and produces
output at a constant bit rate of 13.2 kilobits per second. We use the open-source GSM
06.10 codec implementation developed at the Technical University of Berlin [8], which is
also included in the Mediabench suite [27], and have ported its encoder part to the
MLCA.
The control program for the GSM encoder satisfies the assumptions from Chapter 3,
except that it contains task functions that write their output parameters in the midst
of their execution. We solve this problem by applying the task splitting transformation.
The execution times of task functions in GSM show very little variation. The parallel
speedup of the GSM encoder on the MLCA is shown in Figure 6.1. The application
scales up to four processors.
6.1.3 MPEG Sound Decoder
As the third benchmark application, we have used the MLCA version of the MAD MPEG
sound decoder [22]. The MLCA version of MAD is capable of decoding MP3 sound files
of only one bit rate (40 kbps, 22 kHz). All of the inputs on which our algorithm was
tested have this bit rate; we therefore ignore the effects of variable bit rates on task
execution times, which can be considerable. However, if our DVS algorithm were applied
to a general MP3 decoder, it would be possible to generate a set of different solutions
for playing files of different bit rates.
Similar to the GSM encoder, the control program for the MPEG decoder satisfies the
assumptions from Chapter 3 except for a number of task functions that write their output
parameters in the midst of their execution. We apply the task splitting transformation
to solve this problem. The execution times of task functions in the MPEG decoder
vary significantly, and there is also some variance in their average execution times across
different inputs. Parallel speedup of the MPEG decoder on the MLCA is shown in
Figure 6.1. For MPEG, the figure shows the average over a set of seven inputs with
different kinds of sound content. The application scales up to eleven processors.
6.2 Experimental Platform and Processor
Properties
We profile the execution of the MLCA benchmark applications using a simulator of the
MLCA [22]. In the JPEG encoder and MPEG decoder, the execution times of task
functions vary significantly between loop iterations and depend on the input content.
For each of these two applications, we perform the profiling run over a set of several
different inputs, which we refer to as the training set. For the profile parameters of our
algorithms, we use the average of the profiling results measured over individual inputs
from the training set. Using these parameters, we apply our technique to the original
training set, as well as another set of different inputs, expecting to achieve similar positive
results for both sets.
Our experimental platform is shown in Figure 6.2. The MLCA application is run using
[Figure: (a) Collection of application information; (b) Evaluation of our technique]

Figure 6.2: Experimental Platform
the MLCA simulator, which produces a trace file containing the details of the application
execution. The trace file is read by a utility that constructs the run-time task graph of
the application execution and collects the profiling information necessary for the voltage
selection algorithm. The task splitting transformation is applied to the run-time task
graph to ensure that each task writes its outputs at the end of its execution, as described
in Section 3.3. The profiling information is then fed to the implementation of the voltage
selection algorithm, along with the program code of the target loop, from which the loop
dependence graph is constructed using the algorithm described in Section 5.1. Once
the solution of the voltage selection problem is computed, it is fed to the program that
implements our task scheduling algorithm described in Section 5.6, as well as the default
MLCA FIFO/round-robin task scheduling algorithm. The application task graph is first
scheduled using the default MLCA task scheduling algorithm, without the application
of DVS. The task graph is then scheduled using our task scheduling algorithm with the
[Figure: power [mW] vs. frequency [MHz] for (a) the XScale core (300-800 MHz) and (b) the IEM926 core (100-325 MHz)]

Figure 6.3: Power vs. frequency of XScale [7] and IEM926 [10] processor cores
application of DVS. The execution time of each scaled task is prolonged according to its
assigned voltage level, and the transition overheads are also taken into account. Finally,
the differences in execution time and energy consumption between the two generated
task schedules are computed as the results of our evaluation. The procedure shown in
Figure 6.2(b) is repeated with different values of the parameter r, until a suitable value
is found.
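The comparison performed in this flow can be illustrated with a deliberately simplified two-task model; the two voltage levels, the level-selection rule, and all numbers below are toy assumptions for illustration, not the ILP-based voltage selection of Chapter 5:

```python
# A toy sketch of the comparison in Figure 6.2(b): two tasks run in parallel,
# the shorter one has slack, and a fraction r of that slack is used to run it
# at a lower voltage level. Levels and numbers are illustrative assumptions.

LEVELS = [(0.417, 0.167), (1.000, 1.000)]  # (relative frequency, relative power)

def evaluate(t_critical, t_other, r):
    """Schedule length and energy when fraction r of the slack is used."""
    slack = t_critical - t_other
    target = t_other + r * slack              # allowed duration of the short task
    feasible = [(f, p) for f, p in LEVELS if t_other / f <= target]
    freq, power = min(feasible)               # slowest level that still fits
    exec_time = max(t_critical, t_other / freq)
    energy = 1.0 * t_critical + power * (t_other / freq)
    return exec_time, energy

# r = 0 keeps full speed; r = 1 uses all the slack and saves energy at no cost
baseline = evaluate(100.0, 40.0, 0.0)
scaled = evaluate(100.0, 40.0, 1.0)
```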
We assume that the processors in the MLCA system support eight discrete voltage
levels, with the relation between supply voltage, operating frequency, and power con-
sumption characteristic of the Intel XScale processor core, reported by Clark et al. [7].
The relative frequency and power consumption for each level are shown in Table 6.1.
Flautner et al. [10] measure a similar ratio between frequency slowdown and reduction
in power consumption for the processor core in the IEM926 SOC, although it operates
in a different range of frequencies. Thus, we believe that our model of processor power
is realistic. A comparison between the power vs. frequency characteristics of XScale and
IEM926 can be seen in Figure 6.3.
As explained in Section 3.1.2, we simplify our approach by assuming constant overheads
of transitions between voltage levels.

Level      1      2      3      4      5      6      7      8
Power      0.167  0.272  0.371  0.470  0.557  0.654  0.768  1.000
Slowdown   0.417  0.500  0.583  0.667  0.750  0.833  0.917  1.000

Table 6.1: Properties of the processor voltage levels

Application   JPEG    GSM     MPEG
t_median      19000   21306   27817
t_average     29476   21365   57305

Table 6.2: Median and average execution times of tasks in cycles

We evaluate our technique varying the duration of the overheads from one thousand to
three thousand cycles. Comparison of these
overheads with the median and average execution times of tasks in the benchmark ap-
plications determined by the profiling, which are shown in Table 6.2, shows that these
overheads are not negligible compared to the typical task execution times. Transition
overheads in modern DVS-enabled processors are as low as 20µs [7, 15], which trans-
lates into several thousand or tens of thousands of cycles, since these processors operate
at frequencies on the order of hundreds of megahertz. However, we choose somewhat
lower values of the transition overheads because our benchmark applications are less
computationally demanding than the applications that can be expected to run on future
MLCA systems with high-end processors. In the course of our project, we have so far
been limited to applications whose source code is publicly available, which is not
the case with proprietary high-end applications that the MLCA will be targeting in the
future. We expect that our results will generalize to the MLCA systems executing more
computationally demanding applications in the presence of larger transition overheads,
since the task granularity for these applications will also be larger and the ratios between
the overheads and task execution times will thus be similar or even lower.
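Under the assumption that the "Slowdown" row of Table 6.1 gives the relative operating frequency at each level, the relative energy a task consumes at each level can be computed directly: time scales by one over the relative frequency, and power by the listed factor.

```python
# Relative energy per task at each of the eight voltage levels of Table 6.1,
# assuming (our interpretation) that the Slowdown row is the relative operating
# frequency: relative energy = relative power / relative frequency.

POWER    = [0.167, 0.272, 0.371, 0.470, 0.557, 0.654, 0.768, 1.000]
SLOWDOWN = [0.417, 0.500, 0.583, 0.667, 0.750, 0.833, 0.917, 1.000]

energy = [p / s for p, s in zip(POWER, SLOWDOWN)]
# the lowest level consumes roughly 40% of the full-speed energy for the same task
```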
Similarly, we assume constant transition overheads in energy. We approximate the
power consumption during each voltage transition as 80% of the power at the highest
voltage level. This approximation is derived from the assumptions that most transitions
take place between the highest voltage levels and that during each transition, the proces-
sor power is somewhere between the power at the start and the end level of the transition.
Although this is a very rough approximation, it does not affect the accuracy of our results
significantly, since after the application of our technique, the contribution of transition
overheads to the overall energy consumption turns out to be very small (under 2.1%),
according to the results that we present in Section 6.4.
When a processor is not executing a task, we assume that it enters the idle mode with
negligible overheads, as explained in Section 3.1.2. Both before and after the application
of our technique, the total energy consumption is computed under this assumption. We
assume that the power consumption in the idle mode is 20% of the power consumption
at the highest voltage level in the active mode, roughly approximating the power in the
idle mode measured for processors based on the XScale core [19, 20]. Again, the results
that we present in Section 6.4 indicate that the contribution of the idle power to the
overall energy consumption is small (under 7%), so that the accuracy of our results is not
affected significantly by imprecisions in accounting for the idle power. Furthermore, the
idle power assumed in our experiments is at the lower end of the range of values measured
for realistic embedded processors [19, 20], which is a conservative assumption, since our
technique reduces not only the active energy, but also the energy consumed in the idle
mode by exploiting the available slack and reducing the total idle processor time. Thus,
for a processor with higher power consumption in the idle mode, the achieved energy
savings would be higher.
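The energy accounting described in this section can be sketched as follows, with the stated assumptions of idle power at 20% of peak and transition power at 80% of peak; the function and its argument format are illustrative, not the simulator's actual interface:

```python
# A sketch of the evaluation's energy model: active energy at each task's
# assigned level, idle energy at 20% of peak power, and constant transition
# overheads drawn at 80% of peak power (both assumptions from Section 6.2).

P_MAX = 1.0
P_IDLE = 0.20 * P_MAX    # idle mode: 20% of the power at the highest level
P_TRANS = 0.80 * P_MAX   # approximate power drawn during a voltage transition

def total_energy(active, idle_time, n_transitions, t_trans):
    """active: list of (relative power, duration) pairs for executed tasks."""
    e_active = sum(p * t for p, t in active)
    e_idle = P_IDLE * idle_time
    e_trans = n_transitions * P_TRANS * t_trans
    return e_active + e_idle + e_trans
```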
6.3 Energy Savings and Execution Slowdown
In this section, we present the main results of our measurements—the processor power
savings achieved and the execution slowdown incurred by the application of our technique
for the MLCA benchmark applications.
6.3.1 JPEG Encoder
We evaluate the performance of our technique on the JPEG encoder using two sets of
twelve photographic images. The first set is used as the training set, and our technique
is then applied to both sets with parameters computed from the profiling data for the
training set. Since the task execution times for JPEG depend on the chromatic properties
of the input image, we used sets of photographs taken on two different occasions. This
ensures that the results achieved for the images from the second set do not depend on
their similarity in chromatic properties with the images from the training set.
The reduction in the processor energy consumption for both input sets is shown in
Figure 6.4, and the increase in application execution time resulting from the application
of our technique is shown in Figure 6.5. In both figures, the energy savings and the
execution slowdown are shown as functions of r, the fraction of computed slack used
during voltage selection, as defined in Section 5.3. The value of r was varied from zero
(no slack used and thus no voltage scaling) to one (all slack computed according to
Formula (5.2) is used). For each value of r, we show the processor energy savings
achieved with three different values of the transition overhead. All results are computed as
weighted averages over each set of images, where the weights assigned to individual images
are proportional to the overall computational work necessary for encoding each image.
Weighted averages are used because the computational work varies significantly between
images. The negative slowdown at certain points means that despite the increase in the
execution time of the scaled tasks, the overall application execution time is decreased by
that percentage by the application of our task scheduling algorithm.
The evaluation results show that the results for the training set are successfully re-
produced on the second set. Increase in the transition overhead leads to a decrease in
power savings, but does not lead to excessive execution slowdown. For r = 0.5, with
transition overhead of 1000 cycles, our technique achieves processor power savings of
9.5% for the second set, slowing down the application execution by 0.5% (slowdown for
the training set is somewhat higher). For r = 0, no voltage scaling is performed and we
observe only the effect of the task scheduling algorithm relative to the default MLCA
FIFO/round-robin scheduling. The application of our task scheduling algorithm results
in a small reduction in the application execution time (1.5% for the training set and 2.9%
for the second input set).
Figure 6.6 shows the results for each individual image from the second input set with
r = 0.5 and transition overhead of 1000 cycles. For certain images, the slowdown is
relatively high, but the average slowdown is small. Significant processor power savings
are achieved consistently for all images.
6.3.2 GSM Encoder
Since the GSM encoder application is characterized by very small variations in the ex-
ecution times of task functions, we use the profiling information for a single input and
demonstrate that the results achieved by our technique using the same profiling informa-
tion are almost identical for different inputs. This is shown in Figures 6.7 and 6.8. The
figures show results only for values of r below 0.35, because beyond this point the
execution slowdown becomes excessive.
The results for this application are generally more sensitive to the value of the pa-
rameter r than for the JPEG encoder. This especially holds for the execution slowdown,
which is also more sensitive to the transition overheads. However, the results for the
training input are closely replicated on the second input. The optimal value of r thus
depends on the transition overhead. With transition overhead of 1000 cycles, the best
result is achieved for the parameter value r = 0.28, which results in processor power sav-
ings of over 5.5% and a small negative slowdown (i.e. a small speedup) of the application
execution. With transition overheads of 2000 and 3000 cycles, a good value of r is 0.20,
which results in processor power savings of 4.1% and 3.3%, respectively, with a small
negative slowdown.
6.3.3 MPEG Sound Decoder
We profile the application using a training set of seven MP3 files encoding different kinds
of sound content. Using this profiling input, we apply our DVS technique to the training
set and to the second set of seven input files. The results achieved by our technique,
averaged over each input set, are shown in Figures 6.9 and 6.10.
The processor power savings are almost identical for both sets, while the execution
slowdown is slightly greater for the second set. Due to the somewhat larger average
task granularity, the results are less sensitive to the increase in transition overheads in
comparison with JPEG and GSM. As in the case of JPEG, the results achieved for the
training input are reproducible with a different input set. With a transition overhead of
1000 cycles, a suitable value of the parameter r is 0.6, which yields processor power
savings of 8.4% with an execution slowdown of approximately 1.5% for both sets of
inputs. For larger transition overheads, a better choice is r = 0.5, due to the somewhat
larger execution slowdown.
Figure 6.11 shows the breakdown of the results achieved with r = 0.6 and transition
overhead of 1000 cycles across the inputs from the second set. The power savings are
consistent, with some variation in the execution slowdown.
[Figure: energy savings [%] vs. r (fraction of slack used), JPEG on 6 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.4: Processor energy savings for the JPEG encoder with different transition
overheads
[Figure: execution slowdown [%] vs. r (fraction of slack used), JPEG on 6 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.5: Execution slowdown for the JPEG encoder with different transition overheads
[Figure: per-image results for JPEG on 6 processors, images 1-12. (a) Energy savings [%]; (b) Execution slowdown [%]]

Figure 6.6: Breakdown of results for the second set of JPEG encoder inputs, r = 0.5,
transition overhead = 1000 cycles
[Figure: energy savings [%] vs. r (fraction of slack used), GSM on 4 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.7: Processor energy savings for the GSM encoder with different transition over-
heads
[Figure: execution slowdown [%] vs. r (fraction of slack used), GSM on 4 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.8: Execution slowdown for the GSM encoder with different transition overheads
[Figure: energy savings [%] vs. r (fraction of slack used), MPEG on 11 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.9: Processor energy savings for the MPEG decoder with different transition
overheads
[Figure: execution slowdown [%] vs. r (fraction of slack used), MPEG on 11 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.10: Execution slowdown for the MPEG decoder with different transition over-
heads
[Figure: per-input results for MPEG on 11 processors, inputs 1-7. (a) Energy savings [%]; (b) Execution slowdown [%]]

Figure 6.11: Breakdown of results for the second set of MPEG decoder inputs, r = 0.6,
transition overhead = 1000 cycles
6.4 Breakdown of the Energy Consumption
In this section, we investigate the contribution to the overall processor energy consump-
tion of three types of power: active execution of tasks, processor idle periods, and energy
consumed by voltage transitions. Since in our model we account for the latter two types
of power only approximately, the accuracy of our evaluation could be degraded if they
account for a large part of the total energy consumption.
Table 6.3 shows the breakdown of the total energy consumption in the MLCA bench-
mark applications after the application of our technique. The table lists the percentage
of total energy consumed during active execution of tasks, idle periods, and voltage tran-
sitions. For each application, we show the contributions of each power component in
the total power consumption of the second set of inputs for the best found value of r
and transition overhead of 3000 cycles. For lower transition overheads, the figures are
similar, except that the contribution of the transitions is even smaller. Since the active
power accounts for an overwhelming percentage of the total power, we conclude that the
imprecisions in our modeling of the power in the idle mode and the transition overheads
in energy do not influence the accuracy of our results significantly. The reason for the
dominance of the active power lies in the high level of processor utilization of the ap-
plications, as well as the fact that our ILP model for slack distribution minimizes the
number of transition overheads.

Application   JPEG    GSM     MPEG
Active        95.8%   92.3%   96.5%
Idle           2.1%    7.0%    2.9%
Transitions    2.1%    0.7%    0.6%

Table 6.3: Breakdown of the total energy consumption
6.5 Algorithm Performance
In this section, we discuss the computational intensity of our voltage selection algorithm,
which is the part of our technique that is executed at compile-time. Solving the ILP model
for voltage selection is an NP-complete problem, while the rest of the algorithm consists
of computations that can be performed in polynomial time. Therefore, the number of
variables in the ILP model is crucial for the performance of the algorithm, since the time
necessary to find the solution can grow exponentially with the number of variables.
According to the results derived in Section 5.5.3, the number of variables in the ILP
problem for slack distribution is Nv = Ns · (L + 1) + 2, where Ns is the number of
scaled task instructions in the target loop, and L is the number of discrete voltage levels
supported by the processors. The theoretical upper bound on the number of variables is
thus Nmax = Nt · (L + 1) + 2, where Nt is the total number of task instructions in the
target loop. For each MLCA benchmark application, Table 6.4 shows the total number
of task instructions in the target loop Nt, the number of scaled task instructions Ns,
the theoretical upper bound Nmax on the number of variables in the ILP model, the
actual number of variables Nv encountered by the voltage selection algorithm, and the
average time tavrg in seconds for solving the ILP model across all measurements.

Application    JPEG    GSM     MPEG
Nt                8      26      27
Ns                6       6      23
Nmax             74     236     245
Nv               56      56     209
tavrg (sec.)   0.11    0.07   46.80

Table 6.4: Number of variables and solving time of the ILP problem for voltage selection

The measurements were performed on a PC workstation with two AMD Athlon
MP 2600+ processors and 512 megabytes of RAM (all computations are sequential, and
the second processor did not have any useful role). The only application for which the
average solving time is not negligible is the MPEG decoder. However, the solving times
of individual runs of the algorithm vary widely for this application, ranging from several
tens of milliseconds to several minutes, despite the fact that the number of variables is
the same. We believe that this variation results from the idiosyncrasies of the LP SOLVE
library, and the solving time could be significantly reduced by using a more powerful ILP
solver.
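The variable counts reported in Table 6.4 can be checked directly against the formula Nv = Ns · (L + 1) + 2 from Section 5.5.3, with L = 8 voltage levels (substituting Nt for Ns gives the upper bound Nmax):

```python
# Checking the ILP variable-count formula Nv = Ns*(L + 1) + 2 against the
# figures of Table 6.4, with L = 8 discrete voltage levels.

L = 8

def num_variables(n_scaled):
    return n_scaled * (L + 1) + 2

# (Nt, Ns) per application, as reported in Table 6.4
apps = {"JPEG": (8, 6), "GSM": (26, 6), "MPEG": (27, 23)}
counts = {name: (num_variables(nt), num_variables(ns))
          for name, (nt, ns) in apps.items()}   # {name: (Nmax, Nv)}
```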
6.6 Evaluation of the Results
In this section, we evaluate the quality of the results achieved by our technique. We
compare our results with the results of a study aimed at finding a practical upper bound
on the achievable energy savings in the MLCA benchmark applications. Furthermore,
we compare our results with the results achieved using an alternative technique based on
partitioning of the run-time task graph.
6.6.1 Comparison with Practical Upper Bound
In order to evaluate the quality of the results achieved by our technique, we conduct
experiments whose purpose is to determine a practical upper bound on the achievable
power savings for the MLCA benchmark applications. We apply several different task
scheduling algorithms to the run-time task graphs of the applications and attempt to find
the optimal voltage selection for the resulting task schedules. For the voltage selection,
we use an integer linear programming model similar to those proposed by Zhang et al. [46]
and Andrei et al. [3]. The attempted task scheduling algorithms are various combinations
of the default MLCA FIFO/round-robin task scheduling and the task ordering and pro-
cessor mapping schemes proposed by Zhang et al. [46]. The results achieved using this
method can be expected to be superior to those of any practically applicable heuristic
approach, since this method makes use of perfect knowledge of the run-time application
task graph.
This approach cannot be applied directly to the task graphs of the application runs,
since the number of tasks executed for even a moderately long run of an application is
several thousand, yielding an excessively large number of variables in the ILP problem
formulation. On the other hand, the task graphs of short application runs are skewed due
to the influence of the initialization and clean-up code. We solve this problem by slicing
the task schedule into short time intervals, and computing the optimal voltage selection
for three randomly selected intervals, taking the average value as the final result. This
method involves a trade-off between accuracy, which is degraded for short time intervals,
and the time necessary for finding the solution of the ILP problem, because the number
of variables increases with the length of the intervals. The details of the ILP model for
voltage selection and the task scheduling algorithms used are described in Appendix A.
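The interval-slicing procedure can be sketched as follows; `solve_interval` is a placeholder for the Appendix A ILP model (here it would map an interval's start and length to the energy savings computed for that interval), and the parameter names are illustrative:

```python
# A sketch of the interval-slicing workaround: cut the task schedule into
# fixed-length intervals, solve the voltage-selection ILP for k randomly
# chosen intervals, and average the results. solve_interval is a stand-in
# for the Appendix A model, not an actual interface.

import random

def sliced_estimate(schedule_length, interval_length, solve_interval, k=3, seed=0):
    rng = random.Random(seed)
    n_intervals = schedule_length // interval_length
    picks = rng.sample(range(n_intervals), k)   # k randomly selected intervals
    savings = [solve_interval(i * interval_length, interval_length) for i in picks]
    return sum(savings) / len(savings)
```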
For solving the ILP models, we use the ILOG CPLEX optimization system [17].
CPLEX is a commercial product featuring an ILP solver significantly more powerful than
the freely available LP SOLVE, which was used in the implementation of our technique.
However, in practice, the number of variables in the ILP models turns out to be too large
to be solved in reasonable times, even using CPLEX. In order to reduce the number of
variables to a manageable level, we reduce the number of discrete voltage levels from
eight to five (the expression for the number of integer variables in the model is derived
in Section A.2.4 of Appendix A). The frequency and power of the five levels are selected
according to the characteristics of the Intel XScale processor core, so that comparison
with the results achieved by the application of our technique is meaningful. However,
even with this simplification, for the MPEG decoder, the maximum interval length for
which the ILP problem can be solved in practice is smaller than the shortest interval
that encompasses all tasks from a single iteration of the target loop, and meaningful
results are not possible with shorter intervals. Thus, we perform the
experiments only for JPEG and GSM.
The results are shown in Table 6.5. For each application, we list the best result
achieved across the range of attempted scheduling algorithms. These results are not
guaranteed to be strict upper bounds on the attainable energy savings, because of the
inaccuracies introduced by interval slicing. In addition, somewhat better results might
be obtained using a model with a greater number of voltage levels, which would however
result in an excessively large number of variables. Furthermore, in some cases the ILP
solver was interrupted before reaching the optimal solution, and the set of the used
task scheduling algorithms is not comprehensive. Nevertheless, since these results are
derived using an approach that makes use of perfect knowledge of the run-time application
behavior, it is reasonable to assume that these results are comparable to a practical
upper bound for results achievable by realistic techniques such as ours. This suggests
that our technique succeeds in capturing a significant part of the potential for power
optimizations in JPEG and GSM. The results are shown for two different values of the
transition overhead (1000 and 3000 cycles).

Application          JPEG, 6 proc.      GSM, 4 proc.
Overhead (cycles)    1000     3000      1000     3000
Upper bound
  Savings            12.1%    10.6%     9.2%     7.5%
  Slowdown           -3.2%    -3.2%     0.6%     0.6%
Our technique
  Savings             9.5%     5.3%     5.6%     3.3%
  Slowdown            0.5%     0.6%    -0.3%    -0.5%

Table 6.5: Comparison of the evaluation results with the computed upper bounds

The greatest difference between our results and the computed upper bound is in the
execution slowdown for JPEG. However, in this case, the upper bound is computed using
a task scheduling algorithm that takes advantage of exact knowledge of the critical path
in the run-time task graph of the application, which is unavailable to any practically
applicable algorithm.
6.6.2 Comparison with Task Graph Partitioning
The theoretical derivation of our voltage selection algorithm, presented in Chapter 5,
assumes that the control-flow instructions in the body of the target loop can be safely
ignored, as explained in Section 3.2. The derivation also includes the assumption that the
execution times of individual task instructions do not vary between loop iterations. Under
the same assumptions, it is possible to attempt the application of previously proposed
methods for voltage selection aimed at applications that can be represented by small
task graphs or periodic series of such task graphs, since the execution of the idealized
form of the target loop is periodic. In particular, it is possible to analyze a time interval
during the execution of the target loop that is long enough to contain all tasks from a
single iteration and compute the voltage selection for this time interval. The voltage
levels assigned to the tasks from this iteration can be assigned to the corresponding task
instructions, thus solving the voltage selection problem as formulated in Section 4.2. We
implement this approach in order to demonstrate that our technique generates superior
solutions while being significantly less computationally demanding.
We construct the idealized run-time task graph of the application, in which the exe-
cution time of each task is equal to the average execution time of the corresponding task
instruction determined by profiling. This task graph is then scheduled using the default
MLCA FIFO/round-robin task scheduling algorithm. In the generated task schedule, we
select an interval whose boundaries are determined by the beginning of the execution of
the first executed task from iteration i and the end of the execution of the last executed
task from iteration i. Since the task schedule is periodic, the choice of iteration i is arbi-
trary, as long as it is not one of the few first or last iterations. For the selected interval,
we compute the optimal voltage selection using the ILP model described in Appendix A.
The computed voltage levels of tasks from iteration i are then assigned to their corre-
sponding task instruction. We refer to this approach as the interval selection. With these
voltage levels assigned to the task instructions, we schedule the real task graph of the
application and measure the resulting energy savings and execution slowdown.
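The interval-boundary step above can be sketched in a few lines (a simplified sketch; the (task, iteration, start, end) schedule representation is hypothetical, not the simulator's actual format):

```python
def select_interval(schedule, i):
    """Given a periodic task schedule as (task, iteration, start, end)
    tuples, return the interval spanning iteration i: from the start of
    its first executed task to the end of its last, together with all
    tasks (from any iteration) that overlap the interval and therefore
    enter the ILP model."""
    spans = [(s, e) for (_, it, s, e) in schedule if it == i]
    lo, hi = min(s for s, _ in spans), max(e for _, e in spans)
    overlap = [t for (t, it, s, e) in schedule if s < hi and e > lo]
    return lo, hi, overlap
```

Because the idealized schedule is periodic, any mid-execution choice of i yields an interval of the same shape.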
The results achieved by the application of the interval selection for the JPEG encoder
are shown in Figure 6.12 and compared with the results achieved by our technique with
r = 0.5. We assume a transition overhead of 1000 cycles. Other assumptions
about the system are identical to those described in Section 6.2. The evaluation is per-
formed using the same sets of images as in Section 6.3. The weighted average of the
power savings with interval selection is 7.9%, as opposed to 9.5% achieved with our
approach. However, the weighted average of the execution slowdown with interval selection
is 3.8%, as opposed to 0.5% with our approach. Therefore, although the power savings
are comparable, the execution slowdown is excessive. Experiments with the GSM encoder
yield even worse results, with power savings of under 4% and an execution slowdown of
over 3%. For the MPEG decoder, the number of variables in an interval large enough
to contain a whole loop iteration is too large for solving the ILP problem to be
feasible.
[Two bar charts, omitted in this extraction: for images 1–12, (a) the energy savings [%]
and (b) the execution slowdown [%] achieved by our technique and by interval selection
on JPEG with 6 processors.]

Figure 6.12: Comparison of the results for JPEG
The reason for the excessive slowdown is the lack of an appropriate task scheduling
algorithm in the interval selection approach. Once the execution time of certain tasks
has been prolonged, it is necessary to introduce a task scheduling algorithm that gives
priority to the tasks from the critical path, in order to avoid performance degradation.
However, in order to perform such scheduling in practice, it is necessary to introduce a
heuristic for identifying the tasks from the critical path similar to the one we are using in
our technique, and once such a heuristic is introduced, it naturally leads to our approach.
Another important advantage of our technique over the interval selection approach
is its significantly lower computational intensity. The number of integer variables in the
ILP model described in Appendix A is equal to Nt · (L + 1) − NP , where Nt is the
number of tasks in the selected interval, L is the number of voltage levels supported by
the processor, and NP is the number of processors. Since the interval is selected so as to
encompass all tasks executed within a single iteration, the number of tasks in the interval
is equal to the number of task instructions in the target loop N plus the number of tasks
from other iterations that fall within the same interval.
We can derive an approximate formula for the number of variables in the ILP for-
mulation by assuming that there is no parallelism within each loop iteration, i.e. that
the only source of parallelism in the target loop is the overlapping of the execution of
individual iterations. From this assumption, it follows that in parallel with each loop
iteration, s− 1 additional iterations must be executed on average, where s is the parallel
speedup of the application. Indeed, if the total execution time of a single iteration is
not decreased by parallelization, then s iterations must be run in parallel on average to
achieve the parallel speedup of s. Therefore, the number of tasks in the interval is
approximately s · N, and the number of integer variables in the ILP formulation is
approximately:
s · N · (L + 1) − NP . (6.1)
If some parallelism exists within each loop iteration, this number is somewhat
lower. However, in most MLCA applications in practice, the bulk of parallelism stems
Application   JPEG   GSM    MPEG
Nt            8      26     27
Speedup       4.95   2.97   9.47
Nvars         350    691    2290

Table 6.6: Number of variables in the ILP model for voltage selection
from overlapping the execution of loop iterations, and the approximation is reasonably
accurate. Furthermore, this approximation is derived under the assumption that the
tasks from the selected iteration are executed in an uninterrupted sequence. However, in
the actual execution of the application, some tasks might be executed much earlier than
the rest of the tasks from the same iteration, due to the out-of-order execution capability
of the MLCA, thus significantly increasing the total number of tasks encompassed by the
selected interval.
For each benchmark application, the number of variables encountered in practice is close
to the value computed according to Formula (6.1). The exact number depends on the
depth of the out-of-order execution capability of the MLCA. For the JPEG encoder, GSM
encoder, and MPEG decoder, Formula (6.1) yields values listed in Table 6.6. The number
of tasks in the target loop Nt and the parallel speedup used to compute the number of
integer variables for each application are also listed in the table. Comparison with the
data from Table 6.4 shows that our technique involves a number of variables smaller by
an order of magnitude for each application. For MPEG, the number of variables is too
large for the problem to be solvable using LP SOLVE. Since solving an ILP problem is
an NP-complete problem whose computation time generally grows exponentially with
the number of integer variables, our technique incurs a much smaller computational
load, even though it requires multiple executions in order to find a suitable value of the
parameter r.
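As a sanity check, Formula (6.1) can be evaluated against Table 6.6. The sketch below assumes L = 8 voltage levels, which is consistent with the reported variable counts to within rounding; the function name is ours:

```python
def approx_nvars(speedup, n_tasks, levels, n_proc):
    """Approximate number of integer variables per Formula (6.1):
    s * N * (L + 1) - NP, rounded to the nearest integer."""
    return round(speedup * n_tasks * (levels + 1) - n_proc)

# JPEG: s = 4.95, N = 8, 6 processors; GSM: s = 2.97, N = 26, 4 processors.
print(approx_nvars(4.95, 8, 8, 6))   # -> 350, matching Table 6.6
print(approx_nvars(2.97, 26, 8, 4))  # -> 691, matching Table 6.6
```

The MPEG count depends on its processor configuration, which is not restated here, so it is left out of the check.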
Chapter 7
Related Work
In this chapter, we survey the related work. Methods for DVS-based processor power
optimization have been an active area of research since DVS was first proposed [34]. This
problem has been approached from several different perspectives by researchers in the
real-time systems, operating systems, and compiler systems communities.
Section 7.1 surveys the DVS-related research in the context of real-time systems.
Section 7.2 presents an overview of the DVS-related research in the context of operating
systems and optimizing compilers. Section 7.3 relates various aspects of the system model
used in our work to the models used by the cited authors.
7.1 Real-Time Systems
Jha [21] presents a comprehensive survey of the early research work in the area of DVS
techniques for real-time systems, which was mostly aimed at single-processor systems
executing sets of independent tasks. Subsequent research has addressed the more gen-
eral problems of DVS techniques for a multiprocessor real-time system that periodically
executes a set of dependent tasks. Two basic approaches to this problem have emerged.
The first approach [29, 47] is aimed at exploiting the slack that arises at run-time when
the execution time of certain tasks happens to be shorter than the worst case for which
the system is designed. The second approach [3, 13, 39, 46] assumes that the execution
time of each task is known at design-time and attempts to exploit the slack in the task
graph using static algorithms for task scheduling and voltage selection. Some authors
attempt to combine these two approaches [37].
Most of the heuristic algorithms for static task scheduling and voltage selection either
use guided random search techniques [39] or combine ILP formulations of the voltage
selection problem with task scheduling algorithms proposed in previous literature [46].
Novel heuristic algorithms based on the properties of the task graphs have also been
proposed [13, 37]. Andrei et al. [3] ignore the issue of task scheduling and focus on the
problem of finding the optimal voltage selection for a given task schedule.
The majority of the cited authors approach the problem by selecting a task scheduling
algorithm and then focusing on the problem of finding the optimal voltage selection for a
given task schedule. Our approach takes a different path, first formulating an algorithm
for voltage selection, and then deriving an appropriate task scheduling scheme based on
the notions defined in the context of our voltage selection algorithm.
The DVS algorithms for real-time systems executing periodic task graphs are some-
what similar to our work, since the repeated execution of a set of dependent tasks with a
fixed deadline can be compared with the repeated execution of the set of task functions
from the target loop in an MLCA application. However, the crucial difference is that the
parallelism in the former case is limited to a single instance of the periodic task graph,
which is assumed to be small enough to be analyzed using computationally demanding
means such as ILP models or genetic algorithms, as explained in Section 1.1. Attempts
to partition the run-time task graph of an MLCA application by selecting an interval
encompassing a single iteration of the target loop lead to unsatisfactory outcomes both
in the quality of results and in the computational complexity, as presented in Section 6.6.
Srinivasan and Chatha [40] propose a combined ILP formulation for optimal task
scheduling and voltage selection in periodic task graphs executed on multiple DVS-
enabled processors. Their approach allows for pipelined scheduling of individual periods
and cross-iteration dependences, similar to the execution of a target loop in an MLCA
application. However, this approach would result in an excessively large number of vari-
ables for our benchmark applications. The authors report short solving times in their
experiments [40], but they do not report the size of the task graphs used. Furthermore,
they assume a two-processor architecture, while a larger number of processors would
significantly increase the number of variables in the model.
Several authors have proposed more general approaches to the power optimizations
for real-time systems. Varatkar and Marculescu [43] and Andrei et al. [2] propose DVS
techniques that account for the energy overheads of inter-task communication. Since
the inter-task communication in an MLCA system with shared memory is limited to the
access to the universal register file, we believe that the time and energy overheads of
this communication are negligible. Wu et al. [45] introduce a DVS algorithm for real-
time systems executing conditional task graphs, which capture both data and control
dependences in a set of tasks. Unfortunately, it is not possible to apply their approach in
order to generalize our work to loops that contain unpredictable control-flow statements.
7.2 Operating Systems and Optimizing Compilers
In operating systems featuring preemptive multitasking, DVS can be used to reduce
the power consumption during time intervals when the average processor workload is low.
These intervals can be identified by monitoring the processor activity at run-time and
adjusting the voltage level based on the recent average processor utilization. Another
approach is to enhance the operating system scheduler to predict the processor workload
based on the currently scheduled jobs. Lorch and Smith [28] present an overview of DVS
techniques in the context of multitasking operating systems.
In the context of optimizing compilers for general-purpose applications, there have
been attempts at compile-time identification of program regions characterized by an
excessive number of processor stall cycles, where frequency can be reduced using DVS
without significant performance impact. Hsu and Kremer [16] implement a profile-driven
compiler system that achieves this purpose.
Hsu [15] presents a comprehensive overview of research in the area of DVS-based power
optimizations in the context of operating systems and optimizing compilers. Unlike our
work, the research in this area is primarily aimed at single-processor, general-purpose
computer systems.
7.3 System Modeling in DVS Research
The models used to determine the relations between the supply voltage, operating fre-
quency, power consumption, and program performance vary among cited works. Most au-
thors determine these relations using analytical models. The analytical models for power-
related processor characteristics range from simple formulas for the dynamic power con-
sumption [37, 46] to more sophisticated models that take into account effects such as the
leakage power [2, 3]. Similar to our work, authors who use analytical models typically as-
sume that each task takes a fixed number of processor cycles, i.e. that the execution times
of tasks are inversely proportional to the operating frequency [3, 13, 37, 43, 46]. Other
authors avoid using analytical models and determine the necessary information by profil-
ing [16] or simulating [38] the application execution. Instead of analytical modeling, we
use the figures characteristic of the Intel XScale architecture reported by Clark et al. [7].
Most of the authors in the area of DVS-based power optimizations ignore the perfor-
mance impact of the transitions between voltage levels in their models [13, 37, 39, 43, 46,
47]. Notable exceptions are [3, 31, 38, 40]. Saputra et al. [38] and Mochocki et al. [31]
assume constant transition overheads independent of voltage levels. Andrei et al. [3]
use an analytical model for the relation between the time and energy overheads and the
voltage levels between which the processor is transitioning. Srinivasan and Chatha [40]
formulate an ILP model for voltage selection that includes the overhead of transition
between each pair of voltage levels as a separate constant. Our system model assumes
constant transition overheads.
The majority of the cited authors evaluate their proposed solutions using randomly
generated artificial task graphs [13, 29, 39, 46], or the combination of a set of artificial task
graphs and only one realistic application [2, 3, 31, 37, 45]. In contrast, we evaluate our
technique using exclusively task graphs pertaining to realistic multimedia applications.
Chapter 8
Conclusions and Future Work
In this chapter, we present the concluding remarks and directions for future work.
8.1 Conclusions
In this thesis, we present and evaluate a novel dynamic voltage scaling technique for
power optimizations of multimedia applications running on the MLCA architecture. Our
technique consists of profile-based heuristic compiler algorithms for voltage selection and
task scheduling. The algorithms take advantage of control-flow regularities in the control
programs that emerge across a wide range of MLCA multimedia applications. These
applications spend the bulk of their execution time in long-running loops that often do
not contain any control-flow instructions in their bodies, or contain only severely biased
control-flow instructions, so that the loop body can be approximated by its most frequent
execution path. Our technique specifically targets such loops, analyzing them with the
additional simplifying assumption that the execution time of each task instruction is
constant. This assumption does not hold strictly for real MLCA applications, but we
assume that it can serve as a basis for heuristic analysis of the target loop if we assign
to each task instruction its average execution time measured in the profiling run. The
positive results of the experimental evaluation confirm the validity of this approach.
Our voltage selection algorithm performs the dependence analysis of the target loop
and constructs its dependence graph. A heuristic approach based on the properties of
the loop dependence graph and the profiling information is used to deduce the properties
of the run-time task graph of the target loop and identify the tasks that form its critical
path. This heuristic approach is based on the insight that under certain assumptions,
it is possible to identify a cycle in the loop dependence graph that translates into the
critical path in the run-time task graph of the target loop. Further heuristics are used to
determine the available slack in the task graph and select a set of tasks outside the critical
path over which the slack is distributed. An integer linear programming formulation is
used to compute an efficient distribution of slack over the selected non-critical tasks. The
voltage selection algorithm involves a tunable parameter. A good value of this parameter
for a given application can be determined by repeated runs, since the algorithm is not
computationally demanding.
Based on the properties of the run-time task graph of the target loop, we formulate
a task scheduling algorithm, which is designed so as to complement the algorithm for
voltage selection. The task scheduling algorithm gives priority to the tasks from the
critical path and ensures that groups of tasks to which DVS is applied execute as unin-
terrupted sequences, so that the impact of voltage transition overheads can be amortized
over multiple tasks.
We evaluate the proposed technique on three realistic MLCA multimedia applications:
a JPEG image encoder, a GSM voice encoder, and an MPEG sound decoder. An MLCA
simulator is used to obtain the profiling data across a training set of inputs. Although the
behavior of the benchmark applications is not consistent with the assumption that the
execution times of task instructions are invariable, using the execution times averaged
across the training sets has proven to be a sufficiently good approximation. Using the
profiling information obtained from the training sets, we achieve the processor energy
savings of 9.5%, 5.5%, and 8.4%, with the execution slowdown of 0.5%, -0.2%, and 1.5%
for JPEG, GSM, and MPEG, respectively, on different, randomly selected sets of inputs.
We conduct a study aimed at finding a practical upper bound on the achievable energy
savings in the benchmark applications, based on a model that assumes perfect knowledge
of the run-time application properties. The results suggest that our approach captures
most of the potential for energy savings.
Previous work in the area of DVS optimizations for parallel systems has focused on
applications in the form of task graphs small enough to be analyzed in their entirety and on
periodic task graphs whose analysis can be reduced to the analysis of a single period. In
contrast, our technique handles arbitrarily large task graphs, generated by the execution
of loops that feature cross-iteration dependences and parallelism achieved by pipelining
loop iterations. We demonstrate that our technique achieves superior results in com-
parison with attempts to partition the run-time task graph in order to apply previously
proposed techniques for voltage selection. Our technique is also significantly less compu-
tationally demanding, despite the need for multiple executions in order to find a suitable
value of the tunable parameter.
8.2 Future Work
In the future, we hope to evaluate our technique on an extended set of benchmark MLCA
applications. The currently available MLCA applications were ported to the MLCA
manually, but research on a compiler infrastructure that will largely automate the
porting of sequential applications to the MLCA is in progress [4]. Once it is available,
we hope to test the performance of our technique on compiler-generated MLCA
applications. We also hope to generalize the technique to more complex control-flow
structures in the control programs, such as loops whose bodies contain unpredictable
control-flow, although the currently available MLCA applications do not feature such
loops. Another goal for future work is to generalize our technique to heterogeneous
MLCA systems, which impose additional constraints on the processor mapping of tasks.
Once the physical implementation of a DVS-enabled MLCA system is available in the
future, we intend to test our technique on real hardware.
We hope to enhance our technique to exploit further opportunities for power savings
that arise in applications with variable run-time characteristics, such as bursty behavior.
For such applications, run-time intervals of low activity offer considerable potential
for DVS-based power savings, since the underlying hardware must be designed for the
worst-case computational load. In contrast, our current approach targets applications
that are characterized by relatively small changes in computational load at run-time,
taking advantage of the regularities in their behavior.
Although our technique was developed in the context of the MLCA architecture, we
believe that it is also applicable in the more general context of task-level parallelism
in multimedia applications. The principal source of task-level parallelism in a typical
multimedia application is the pipelining of the computations performed by the tasks in
the main loop. Each iteration of this loop processes a single element of the input media
stream (for example, a 20ms voice sample for GSM). Such applications are parallelized
by decomposing the processing of each input element into a number of tasks with well-
defined inputs and outputs, which determine the data dependences between tasks. The
processing of individual elements can then be pipelined as far as the data dependences
allow. Regardless of the parallel execution model and the underlying architecture, our
analysis of the task graph could be applied to the multimedia applications where such
parallelism exists. We thus hope to generalize our technique to multimedia applications
running on parallel systems other than the MLCA.
Appendix A
Interval-Based Voltage Selection and
Task Scheduling
In this appendix, we present the task scheduling algorithms and the ILP model for voltage
selection used in the experiments described in Section 6.6. In the DVS techniques used
in these experiments, task scheduling precedes the voltage selection. Once the task
graph has been scheduled, a time interval is selected within the task schedule and the
optimal voltage selection is computed for the tasks executed within that interval. In
the experiments described in Section 6.6.1, aimed at finding the upper bound for energy
savings, we use the task graph of the application constructed from the simulator trace of
the application execution, thus assuming perfect knowledge of the run-time application
characteristics. In the experiments described in Section 6.6.2, in which we attempt an
approach aimed at partitioning the task graph, we use an idealized task graph, in
which the execution time of each task is equal to the average execution time of the
corresponding task instruction determined by profiling.
A.1 Task Scheduling Algorithms
We formulate five task scheduling algorithms as various combinations of three task or-
dering schemes and two processor mapping schemes. We use the following task ordering
schemes:
• FIFO ordering: ordering according to the sequential execution order, which is the
default for the MLCA.
• Earliest start time (EST): the task with the smallest earliest start time has the highest
priority. The earliest start time of a task is computed as the length of the longest path
from the entry node to the task in the run-time task graph.
• Longest critical path (LCP): the task with the longest critical path has the highest
priority. The critical path of a task is computed as the length of the longest path from
the task to the exit node in the run-time task graph.
The latter two task ordering schemes cannot be used for task scheduling on the MLCA
in practice, since they depend on information about the run-time task graph that cannot
be known by the MLCA task dispatcher. However, since the experiments described in
Section 6.6.1 are performed in order to find an upper bound on the achievable energy
savings with the assumption of perfect knowledge of the run-time application behavior,
it is possible to use them. These two strategies for assigning priorities to tasks have been
traditionally used in a variety of task scheduling algorithms [25].
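Both priority schemes reduce to longest-path computations over the run-time task graph, which can be sketched as follows (a minimal sketch; the adjacency representation is hypothetical, and the task list is assumed to be in topological order):

```python
def est_and_lcp(tasks, succs, exec_time):
    """Earliest start time (longest path from the entry to each task)
    and critical-path length (longest path from each task to the exit)
    for a DAG whose tasks are listed in topological order."""
    est = {t: 0 for t in tasks}
    for t in tasks:                       # forward pass
        for s in succs.get(t, []):
            est[s] = max(est[s], est[t] + exec_time[t])
    lcp = dict(exec_time)                 # backward pass
    for t in reversed(tasks):
        for s in succs.get(t, []):
            lcp[t] = max(lcp[t], exec_time[t] + lcp[s])
    return est, lcp
```

EST ordering then dequeues the task with the smallest est value; LCP ordering dequeues the task with the largest lcp value.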
The two employed processor mapping schemes are:
• Round robin (RR): circular processor selection, which is the default for the MLCA.
• Best-fit (BF) selection proposed by Zhang et al. [46]: among the available pro-
cessors, selects the one that has finished the execution of the previous task most
recently.
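Best-fit selection can be sketched as below (a sketch under the definition above; `free_at`, mapping each processor to the finish time of its previous task, is a hypothetical representation):

```python
def best_fit(free_at, now):
    """Among processors free at time `now`, pick the one that finished
    its previous task most recently, per Zhang et al.'s best-fit rule;
    round robin would instead cycle through processors in fixed order."""
    free = [p for p, t in free_at.items() if t <= now]
    return max(free, key=lambda p: free_at[p]) if free else None
```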
Using the listed task ordering and processor selection schemes, we define five task
scheduling algorithms: FIFO-RR, LCP-RR, EST-RR, LCP-BF, and EST-BF. These
task scheduling algorithms are used in the experiments whose results are presented in
Section 6.6.1.
A.2 ILP Model
In this section, we present an ILP model for voltage selection in a time interval of appli-
cation execution. The model is derived under the assumption that tasks are scheduled
according to a well-defined task ordering and processor mapping algorithm, which re-
spects the dependences between tasks, and a time interval is selected from the generated
task schedule. The computed voltage selection is optimal under the assumption that a
single voltage level is assigned to each task.
A.2.1 Variables
Let Nt be the total number of tasks in the interval, and L the number of voltage levels
supported by the processors, enumerated so that L is the highest voltage level.
For each task Ti, we introduce a continuous variable Di, which denotes the start
time of the task relative to the start of the interval. Furthermore, for each task Ti, we
introduce L binary variables αi1, . . . , αiL, which encode the voltage level assigned to the
task. The value of αij is one if and only if task Ti is executed at the level j.
For each pair of tasks (Ti, Tk) such that task Ti is executed immediately before task
Tk on the same processor P , we introduce a binary variable cik, which encodes whether
a transition between voltage levels occurs on the processor P between the execution of
tasks Ti and Tk. The value of the variable cik is one if the transition takes place, and
zero otherwise.
The execution time of each task depends on the voltage level at which the task is run. Let
tij denote the execution time of task Ti when run at the voltage level j. The execution
time of task Ti can be represented in the model as:
exec time(Ti) = αi1ti1 + αi2ti2 + · · ·+ αiLtiL. (A.1)
The finishing time of task Ti is equal to the sum of its start time and execution time
and can thus be represented in the model as:
finish time(Ti) = Di + αi1ti1 + αi2ti2 + · · ·+ αiLtiL. (A.2)
A.2.2 Constraints
Each task executes at a specific voltage level. Therefore, for each task Ti, the value
of exactly one of the binary variables αi1, . . . , αiL must be one. Hence the following
constraint for each task Ti:
αi1 + · · ·+ αiL = 1. (A.3)
The value of each variable cik will be one if the tasks Ti and Tk execute at different
voltage levels, and zero otherwise. Since the variables cik are binary, this rule is equivalent
to the following constraint on each variable cik:
cik ≥ max_{j=1,...,L} (αij − αkj). (A.4)
Although these constraints are non-linear, each of them can be represented by a set of L
linear constraints:
cik ≥ αij − αkj, (A.5)
for each j = 1, . . . , L.
If task Ti is scheduled immediately before task Tk on the same processor, task Tk may
start only after task Ti has finished. If a voltage level transition takes place between Ti
and Tk, the start of Tk is delayed by the number of processor cycles equal to the tran-
sition overhead. Therefore, for each pair of tasks (Ti, Tk) such that task Tk is scheduled
immediately after task Ti on the same processor, we formulate the following constraint
based on the expression (A.2):
Dk − (Di + αi1ti1 + · · ·+ αiLtiL + ciktTR) ≥ 0, (A.6)
where tTR is the transition overhead.
If task Tk reads one or more inputs produced by task Ti, task Tk may start only after
its last input produced by Ti is written, because of the data dependence. Let βik be
the ratio between the time at which Ti writes its last output read by Tk and the total
execution time of Ti. Since we assume that all computations are uniformly slowed down
by frequency scaling, βik is a constant for each such pair of tasks (Ti, Tk). Therefore,
for each such pair of tasks (Ti, Tk), we formulate the following constraint based on the
expression (A.2):
Dk − Di − βik(αi1ti1 + · · ·+ αiLtiL) ≥ 0. (A.7)
For each processor P , we fix the start time of the first task scheduled onto P to the
beginning of the interval. For each such task Ti, we introduce a constraint of the following
form:
Di = 0. (A.8)
Finally, let T be the total length of the interval. All tasks must finish before the end
of the interval, which leads to the following constraint for each task Ti:
Di + αi1ti1 + · · · + αiLtiL ≤ T. (A.9)
A.2.3 Objective Function
The purpose of the model is to find the voltage selection that minimizes the total energy
consumption in the selected interval. The objective function is formulated in a way
similar to the objective function in the ILP model for slack distribution described in
Section 5.5.2.
We use the symbol Eij to denote the energy consumed by the task Ti executed at
the voltage level j, and the symbol ETR to denote the energy consumed by a single
transition. According to the definitions of variables αij and cik, we can represent the
energy consumption during the execution of tasks and transition overheads as:
Σ_{i=1}^{Nt} Σ_{j=1}^{L} αij Eij + Σ_{(i,k)} cik ETR,
where the second sum is over all pairs (i, k) such that task Tk is scheduled immediately
after Ti on the same processor.
We can account for the idle power by noting that each increase in task execution time
and each transition overhead reduces the total processor time spent in the idle mode.
Therefore, if we define the constants ηij = Eij − Pidle tij and ε = ETR − Pidle tTR, we can
formulate the following objective function:
∑_{i=1}^{Nt} ∑_{j=1}^{L} αij ηij + ∑_{(i,k)} cik ε. (A.10)
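For a fixed voltage assignment, the value of objective (A.10) is straightforward to evaluate. The following sketch uses hypothetical names and illustrative data; it is not part of the thesis implementation.

```python
# Hedged sketch: evaluate objective (A.10) for a fixed voltage selection.
# eta maps (task, level) to eta_ij = E_ij - P_idle * t_ij; epsilon is the
# per-transition constant ETR - P_idle * t_TR. All data are illustrative.

def objective(level, order, eta, epsilon):
    # First sum: the alpha_ij variables select exactly one level per task,
    # so each task contributes eta at its chosen level.
    total = sum(eta[(task, level[task])] for task in level)
    # Second sum: c_ik = 1 for each adjacent pair with differing levels.
    for tasks in order.values():
        total += sum(epsilon for i, k in zip(tasks, tasks[1:])
                     if level[i] != level[k])
    return total
```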
The value of this objective function differs from the total energy consumption in the
selected interval by a constant term, which does not influence the location of its minimum.
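The constant term can be made explicit. Assuming each processor is either executing a task, performing a transition, or idle throughout the interval of length T, the total energy in the interval is:

```latex
\begin{align*}
E &= \sum_{i=1}^{N_t}\sum_{j=1}^{L}\alpha_{ij}E_{ij}
   + \sum_{(i,k)} c_{ik}E_{TR}
   + P_{idle}\Bigl(N_P T
   - \sum_{i=1}^{N_t}\sum_{j=1}^{L}\alpha_{ij}t_{ij}
   - \sum_{(i,k)} c_{ik}t_{TR}\Bigr) \\
  &= \sum_{i=1}^{N_t}\sum_{j=1}^{L}\alpha_{ij}\eta_{ij}
   + \sum_{(i,k)} c_{ik}\varepsilon
   + P_{idle}N_P T,
\end{align*}
```

so the objective (A.10) differs from the total energy only by the constant Pidle NP T, which is independent of the voltage selection.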
Minimizing this objective function with constraints (A.3)–(A.9) yields the optimal
voltage selection for the given interval of the task schedule under the stated assumptions.
A.2.4 Estimate of the Number of Variables
For each task in the interval, we introduce L binary variables that encode its voltage
level. Furthermore, for each pair of adjacent tasks scheduled on the same processor, we
introduce a binary variable that encodes the voltage transition between the execution of
these tasks. The number of such pairs is Nt − NP, where NP is the number of processors: every task in the interval is followed by another task on the same processor, except the last task scheduled on each processor.
Therefore, the total number of integer variables in the ILP model is:
Nt · L + Nt − NP = Nt · (L + 1) − NP. (A.11)
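Expression (A.11) can be sanity-checked with a one-line helper (the function name and arguments are hypothetical, introduced only for illustration):

```python
def num_binary_variables(n_tasks, n_levels, n_procs):
    # Nt * L level-selection variables plus (Nt - NP) transition
    # variables, i.e., Nt * (L + 1) - NP in total, as in (A.11).
    return n_tasks * (n_levels + 1) - n_procs
```

For instance, an interval with 10 tasks, 4 voltage levels, and 3 processors requires 10 · 5 − 3 = 47 binary variables.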