Power Optimizations for the MLCA
Using Dynamic Voltage Scaling
by
Ivan Matosevic
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2006 by Ivan Matosevic
Abstract
Power Optimizations for the MLCA
Using Dynamic Voltage Scaling
Ivan Matosevic
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2006
The Multi-Level Computing Architecture (MLCA) is a novel architecture for parallel
systems-on-a-chip. We propose and evaluate a profile-driven compiler technique for power
optimizations of MLCA applications using dynamic voltage scaling (DVS). Our technique
combines dependence analysis of loops with profiling in order to identify the slack in par-
allel execution of coarse-grain tasks. DVS is applied to slow down processors executing
tasks outside the critical path, saving power with little or no impact on execution time.
Evaluation of our technique using an MLCA simulator and three realistic MLCA mul-
timedia applications shows that up to 10% savings in processor power consumption can
be achieved with no more than 1.5% increase in execution time. The achieved power
savings are significantly greater than those that could be achieved by uniformly slowing
down all computations with only a similar increase in overall execution time.
Acknowledgements
First and foremost, I would like to thank my supervisor, Prof. Tarek S. Abdelrahman,
for his guidance and support throughout the course of this work.
I would also like to thank Faraydon Karim and Alain Mellan from STMicroelectron-
ics for their support and help, without which this work would not have been possible.
Particular thanks to Alain for providing support for the MLCA simulator.
Many thanks to Utku Aydonat, who helped me in the initial stage of my work by
introducing me to the details of the MLCA simulator. Furthermore, I wish to thank
the participants of the Compiler and Architecture Reading Group for providing valuable
feedback that led to significant improvements in this work.
I am grateful to Prof. Sinisa Srbljic for encouraging me to pursue graduate studies
in Canada. I would also like to thank all of my family, friends, and colleagues, too
numerous to name individually, who have provided support and encouragement during
the past years. Special thanks to Anto Anusic, Vesna Anusic, and Franjo Plavec, who
were of great help when I was moving to Toronto and learning my way around here.
This work has been supported by research grants from STMicroelectronics and Com-
munications and Information Technology Ontario (CITO). I am grateful for their support.
Contents
1 Introduction 1
1.1 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 The MLCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Register Renaming and Parallel Execution of Tasks . . . . . . . . 8
2.1.2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Dynamic Voltage Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 System and Application Properties 16
3.1 Hardware Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Architectural Properties . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Power-Related Properties of the Processors . . . . . . . . . . . . . 17
3.2 Application Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Task Graph Execution Model . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Problem Formulation 24
4.1 Slack in Task Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Voltage Selection and Task Scheduling 29
5.1 The Loop Dependence Graph . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 The Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 The Available Slack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4 Selection of Task Instructions for Slack Distribution . . . . . . . . . . . 38
5.5 The ILP Model for Slack Distribution . . . . . . . . . . . . . . . . . . . . 42
5.5.1 Variables and Constraints . . . . . . . . . . . . . . . . . . . . . . 42
5.5.2 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5.3 Estimate of the Number of Variables . . . . . . . . . . . . . . . . 46
5.6 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.6.1 Task Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.6.2 Processor Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.6.3 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . 49
5.7 Target Loop Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6 Evaluation 52
6.1 Benchmark Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.1.1 JPEG Image Encoder . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.2 GSM Voice Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.3 MPEG Sound Decoder . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Experimental Platform and Processor Properties . . . . . . . . . . . . . 55
6.3 Energy Savings and Execution Slowdown . . . . . . . . . . . . . . . . . . 60
6.3.1 JPEG Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3.2 GSM Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.3 MPEG Sound Decoder . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 Breakdown of the Energy Consumption . . . . . . . . . . . . . . . . . . . 70
6.5 Algorithm Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.6 Evaluation of the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.6.1 Comparison with Practical Upper Bound . . . . . . . . . . . . . . 73
6.6.2 Comparison with Task Graph Partitioning . . . . . . . . . . . . . 75
7 Related Work 80
7.1 Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2 Operating Systems and Optimizing Compilers . . . . . . . . . . . . . . . 82
7.3 System Modeling in DVS Research . . . . . . . . . . . . . . . . . . . . . 83
8 Conclusions and Future Work 85
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A Interval-Based Voltage Selection and Task Scheduling 89
A.1 Task Scheduling Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.2 ILP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.2.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.2.2 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2.3 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.2.4 Estimate of the Number of Variables . . . . . . . . . . . . . . . . 94
Bibliography 95
Chapter 1
Introduction
The Multi-Level Computing Architecture (MLCA) [22] is a novel architecture for parallel
systems-on-a-chip (SOCs). It features multiple processing units and a top-level controller
that automatically exploits parallelism among coarse-grain units of computation, called
tasks. The parallel execution of tasks is based on techniques similar to those used by
superscalar processors for the extraction of instruction-level parallelism.
The MLCA supports a programming model that is close to sequential programming.
An MLCA application consists of a set of task functions and a control program. Task
functions are executed as tasks by the processing units at run-time. The control pro-
gram specifies the control and data flow between tasks, and is executed by the top-level
controller. This model reduces programming effort, making the MLCA an attractive
architecture for multimedia and streaming applications.
Power consumption remains one of the critical design constraints in today’s embedded
systems. These systems often run on batteries, but improvements in battery technology
have failed to keep pace with increased system power consumption, particularly the
power consumption of processors [26]. Even in computer systems that are not battery-
dependent, excessive power consumption results in heat dissipation that increases the cost
of packaging and negatively impacts reliability. Consequently, techniques for reducing
power consumption have attracted considerable interest in recent years.
Dynamic voltage scaling (DVS) is a technique for reducing the power consumption
of processors. With this technique, the supply voltage and operating frequency of a pro-
cessor can be varied during run-time in order to achieve a trade-off between processor
performance and power dissipation. Today, a number of DVS-enabled processors exist, including the Intel XScale [7], the IBM PowerPC 405LP [35], and the Transmeta
Crusoe processors [42].
In this thesis, we investigate the use of DVS for the MLCA. More specifically, we
consider the use of DVS-enabled processors as processing units in the MLCA and propose:
(1) a novel profile-driven compiler technique for assigning processor voltage levels to tasks,
a procedure we refer to as voltage selection, and (2) a complementary task scheduling
algorithm for MLCA applications. We assume that the processor frequency in the MLCA
system is chosen so as to meet the real-time requirements of the application while taking
full advantage of the maximum achievable parallel speedup. Thus, power savings cannot
be achieved by permanently reducing the voltage and frequency of the processors, because
this would violate the real-time requirements of the application. Instead, our technique
attempts to achieve power savings in a manner that either does not slow down the
application or incurs a minor slowdown while achieving power savings significantly greater
than those that could be achieved by uniformly slowing down all computations with only
a similar increase in execution time.
Our proposed technique targets long-running loops in the control programs of MLCA
applications. It combines analysis of the loop dependence graph with profiling informa-
tion in order to deduce properties of the dynamic task graph that represents the run-time
execution of tasks in the loop. Based on these deduced properties, our algorithm attempts
to determine the optimal voltage level for the execution of each task. The task schedul-
ing algorithm, which is also based on the deduced properties of the dynamic task graph,
complements the voltage selection algorithm to ensure that power savings are achieved
with little or no impact on the application execution time.
We implement and evaluate our technique using three realistic multimedia applica-
tions and a simulator of the MLCA [22]. The results indicate that our technique is
successful at reducing processor power consumption by 5–10%, with minimal increase
in execution time (no more than 1.5%). Although our technique specifically targets the
MLCA, we believe that it is applicable in the more general context of task-level paral-
lelism in multimedia applications based on pipelining the processing of individual units
in the input media stream.
1.1 Thesis Contributions
Previous work in the area of DVS-based power optimizations for systems featuring task-
level parallelism, which we survey in Chapter 7, has focused on real-time applications in the form of small sets of dependent tasks (containing up to several tens of tasks), possibly executed periodically. Such applications can be compactly represented using a data structure called a task graph: a directed acyclic graph whose nodes represent individual tasks and whose edges represent precedence constraints between tasks. For
these applications, the problem of power optimization is reduced to a single instance of
the periodic task graph, which is small enough to be analyzed using computationally
demanding algorithms. In contrast, our work targets realistic MLCA applications, in
which task-level parallelism is extracted from programs at run-time, by means akin to
the parallel execution of machine instructions in a superscalar processor. The task graph
of such an application is fully defined only at run-time, due to the unpredictability of
control-flow. Even if we introduce simplifying assumptions about the application control-
flow that enable an approximation of the run-time task graph to be known at compile
time, for example by considering long-running loops, the task graph is normally too large
to be analyzable using previously proposed techniques. Furthermore, the run-time task
graph of a loop in an MLCA program cannot be efficiently partitioned into small sub-
graphs that could each be analyzed separately. This is because most of the parallelism
in this task graph stems from pipelining loop iterations, whose execution can be arbi-
trarily overlapped, since the MLCA supports out-of-order execution of tasks. Finally,
the efficiency of a DVS technique for parallel applications critically depends on the task
scheduling algorithm. The task scheduling algorithms proposed in previous work in the
area of DVS-based power optimizations are not suitable for implementation as a part of
the hardware controller that handles the task scheduling in the MLCA.
In light of the above discussion, our work makes the following contributions:
1. It identifies common regularities in the control-flow of MLCA applications and
exploits these regularities, together with profiling information, to infer at compile-
time the properties of the run-time task graph of the application. We achieve this
goal using compact data structures and computationally lightweight procedures,
rather than the explicit construction and analysis of the run-time task graph, which
would be prohibitively expensive computationally.
2. Based on the profile-based compiler analysis, our work proposes, implements, and
evaluates novel voltage selection and task scheduling algorithms for applications
running on DVS-enabled MLCA systems.
1.2 Thesis Organization
The remainder of the thesis is divided into the following chapters. Chapter 2 presents
an overview of the MLCA and dynamic voltage scaling. In Chapter 3, we state the
assumptions about the MLCA system hardware and the MLCA applications on which
our technique is based and introduce the application execution model based on the task
graph. In Chapter 4, we formulate the problem statement and present a brief overview
of our solution. A detailed description of our technique is presented in Chapter 5. We
present the experimental evaluation of our technique in Chapter 6. Chapter 7 surveys the
related work in the area of DVS-based power optimizations. Finally, Chapter 8 presents
concluding remarks and directions for future work.
Chapter 2
Background
In this chapter, we review background material relevant to our work. Section 2.1 provides
an overview of the target architecture—the MLCA. Section 2.2 provides an overview of
the dynamic voltage scaling technique.
2.1 The MLCA
The Multi-Level Computing Architecture (MLCA) [22] is a novel two-level hierarchical architecture aimed at parallel systems-on-a-chip (SOCs). The lower level of the MLCA
consists of multiple processing units (PUs), and the upper level of a controller that au-
tomatically exploits parallelism among coarse-grain units of computation, called tasks.
A PU can be a full-fledged processor core, a digital signal processor (DSP), a block of
FPGA, or any other type of programmable hardware. The set of PUs can be hetero-
geneous, and tasks in an MLCA application may have different PU preferences. The
top-level controller consists of a control processor (CP), a task dispatcher (TD), and a
universal register file (URF). A dedicated interconnection network links the PUs to the
URF and memory, as shown in Figure 2.1(a).
The novelty of the MLCA stems from the fact that the upper level of the hierarchy
supports parallel execution of tasks, using the same techniques used in superscalar pro-
[Figure 2.1 block diagrams omitted: panel (a) shows the MLCA; panel (b) shows a superscalar processor, with memory, general-purpose registers, execution units (XU), fetch & decode logic, and an instruction queue.]
Figure 2.1: Comparison between the MLCA and a superscalar processor
cessors, such as register renaming and out-of-order execution [14]. This leverages existing
superscalar technology to exploit task-level parallelism across PUs, in addition to pos-
sible instruction-level parallelism within each task. The similarity of the MLCA to the
microarchitecture of a superscalar processor can be seen in Figure 2.1. The MLCA is a
template architecture: it does not specify the number and types of PUs, the form of the interconnection network, the memory configuration, or the type of memory access.
The MLCA supports a programming model that, similar to sequential programming,
does not require programmers to specify task synchronization and inter-task commu-
nication. An MLCA application consists of two levels: a set of task functions, each a
Chapter 2. Background 8
sequential function with a specified number of input and output URF registers, and a
sequential control program, which is executed by the control processor and contains task
instructions.
In the remainder of this section, we present a detailed overview of the MLCA func-
tionality and its programming model.
2.1.1 Register Renaming and Parallel Execution of Tasks
Each task function is a unit of computation that can be modeled as a black-box with a
given number of inputs and outputs, which are mapped to a set of registers in the URF.
This property of task functions enables the control processor to detect data dependences
between tasks at run-time.
The control processor executes a control program, fetching and decoding task instruc-
tions, each of which specifies a task function to be executed on a PU, together with the
inputs and outputs of the task as registers in the URF. Data dependences among task
instructions are detected by identifying the source and sink registers in the URF, in the
same way that dependences among instructions are detected in a superscalar processor.
The control processor renames URF registers as necessary to break false dependences
among task instructions. The number of renaming registers impacts the performance of
MLCA programs; a larger number of renaming registers allows more false dependences
to be eliminated, which enhances the parallelism in the application execution [22].
Decoded task instructions are issued to the TD unit, where they are enqueued in the
task queue. When the inputs of a task become ready, it can begin execution as soon
as a free PU of an adequate type is available. Based on the data dependences detected
at run-time, tasks can be issued out-of-order, and may also complete and commit their
outputs out-of-order. Therefore, execution of a task by the MLCA is analogous to the
execution of a machine instruction by the superscalar processor, and parallel execution
of several tasks on multiple PUs is analogous to the parallel execution of several machine
instructions on multiple execution units. However, unlike machine instructions, task
functions can have arbitrarily large numbers of input and output registers, and a task
can write to its output URF registers at any time during its execution. The outputs
written to the URF in the midst of a task are made available to the tasks awaiting their
inputs in the task queue. Furthermore, the execution time of a task function is not
constant and may vary depending on the inputs and on non-deterministic factors such as contention for memory access.
The ordering of the task instructions in the task queue and the selection of the PU
for each task is determined by the task scheduling algorithm implemented by the task
dispatcher. The default MLCA task dispatcher orders the task instructions according to
their sequential execution order, using a FIFO queue, and selects the PUs in a round-
robin fashion. Our work addresses the problem of task scheduling in the MLCA in the
context of DVS-based processor power optimizations.
Besides the task instructions, the control program also contains control-flow instruc-
tions. Conditional branches are implemented by means of a set of control registers in the
control processor. Each task can optionally write to a control register. Unlike outputs
to the URF registers, output to the control register can be written only at the end of
the task execution. The existence of conditional branches results in control dependences
between tasks. Control dependences can be eliminated at run-time using branch pre-
diction and speculative execution of tasks. The current MLCA system model does not
support speculative execution, but research into the support for speculative execution
for the MLCA is in progress.
2.1.2 Programming Model
The control program is written in an assembler-like language called HyperAssembly. An
example of a control program is shown in Figure 2.2(a). It contains five task instructions,
S1–S4 and S6, which invoke task functions T1–T5, respectively. The type of access
S1: task T1, R1:r,R2:w,R3:w
S2: task T2, R2:r,R3:r,R4:w
S3: task T3, R5:r,R2:w,R3:w
S4: task T4, CR1, R2:r,R3:r,R6:w
S5: if false (CR1 & 0x01) jmpa S7
S6: task T5, R6:r
S7: stop
(a) Original code
S1: task T1, R1:r,R2:w,R3:w
S2: task T2, R2:r,R3:r,R4:w
S3: task T3, R5:r,R101:w,R102:w
S4: task T4, CR1,R101:r,R102:r,R6:w
S5: if false (CR1 & 0x01) jmpa S7
S6: task T5, R6:r
S7: stop
(b) After register renaming
Figure 2.2: An example of HyperAssembly code and register renaming
for each URF register is indicated as read (r) or write (w) next to the register symbol.
Furthermore, the control program contains the conditional branch instruction S5, whose
direction depends on the contents of the control register CR1, which is written by the
task instruction S4. If the condition evaluates to false, the jump to S7 is taken. The
instruction stop waits for the pending tasks to finish and then terminates the execution
of the program.
All tasks in the example from Figure 2.2(a) must be executed sequentially, because
of data and control dependences. We use the symbols δt, δa, and δo to represent true
dependences, write-after-read false dependences, and write-after-write false dependences
between task instructions, respectively. Since the registers R2 and R3 are written by task
instruction S1 and read by task instruction S2, there exists a true dependence S1δtS2.
Similarly, there exists a true dependence S3δtS4, due to the read-after-write access to
registers R2 and R3 by these two task instructions. Furthermore, there is a write-after-
write false dependence S1δoS3, since both task instructions write to R2 and R3, as well
as a write-after-read false dependence S2δaS3, because S3 overwrites the values of R2
and R3 that are read by S2. Finally, there is both a true data dependence and a control
dependence between S4 and S6. The true dependence S4δtS6 stems from the fact that S6
reads the value written into R6 by S4, and the control dependence stems from the value
written by S4 into the control register CR1, which determines whether S6 is executed or
skipped.
The control processor renames registers at run-time to break false dependences and
thus allow some parallel execution. The control program after register renaming is shown
in Figure 2.2(b). With both false dependences eliminated, S3 can be executed in parallel
with S1, and after the task executed by S3 writes its outputs, S4 can proceed regardless of
the status of the tasks executed by S1 and S2. Once the task executed by S4 has completed
execution and written the output to CR1, the direction of branch S5 is computed by the
control processor. If the branch is not taken, S6 is executed, again regardless of the status
of S1 and S2 at that point in time.
Instead of writing the control program directly in HyperAssembly, it is possible to
express it in a higher-level language called Sarek [22], which is compiled into Hyper-
Assembly. Sarek is a C-like language, which supports high-level control-flow constructs,
such as if-statements and while-loops. The execution of tasks is specified using state-
ments similar to function calls. Instead of explicitly specifying access to the URF and
control registers, Sarek supports data variables and control variables, which are similar
to variables in high-level programming languages. These variables are mapped to URF
registers and control registers, respectively, during the compilation of Sarek into Hyper-
Assembly. They are the only data types supported in Sarek. Figure 2.3 shows a Sarek
program equivalent to the HyperAssembly program shown in Figure 2.2. The only con-
trol variable in this example is flag. The remaining variables in the example are data
variables. For simplicity, declarations of variables are omitted.
T1(in count_1, out length, out height);
T2(in length, in height, out depth);
T3(in count_2, out length, out height);
flag = T4(in length, in height, out sum);
if (flag & 0x01) {
T5(in sum);
}
stop;
Figure 2.3: Sarek code of the control program from Figure 2.2
During the compilation of Sarek into HyperAssembly, it is possible to perform various
optimizing transformations [4]. However, in this thesis we assume that the translation is
performed in a straightforward manner, translating each Sarek task statement into a sin-
gle HyperAssembly task instruction and using a one-to-one correspondence between data
variables and URF registers, as well as between control variables and control registers.
This enables us to present examples in the form of Sarek code, which is more easily readable,
although our technique takes the HyperAssembly code of the application as input.
If the PUs are general-purpose processors, task functions can be written in a high-
level programming language and compiled into executable code. For example, the code
of the task function T4 from Figure 2.2 could be written as the C function shown in
Figure 2.4. The function has no formal arguments or return value. Instead, arguments
are read from and written to the URF registers and the control register using specific
API functions readArg, writeArg, and writeCtrl. These API functions implement the
communication between the PU and other components of the MLCA system, in particular
the URF and the control registers.
2.2 Dynamic Voltage Scaling
Dynamic voltage scaling (DVS) is a technique for reducing the power consumption of
processors. It allows programs to change at run-time the supply voltage and frequency of
void T4() {
int arg1, arg2;
int out1, out2, flag;
arg1 = readArg(0);
arg2 = readArg(1);
// Perform computations with arg1 and arg2
// ...
writeArg(0, out1);
// Perform some more computations
// ...
writeArg(1, out2);
writeCtrl(flag);
}
Figure 2.4: C code of an example task function
a processor in order to trade performance for lower power consumption. In the remainder
of this section, we present some theoretical background and practical aspects of DVS.
Power consumption of a CMOS circuit can be approximated by the following for-
mula [33]:
P = ACV^2 f + τAV·I_S·f + V·I_L, (2.1)
where V is the supply voltage, f is the operating frequency, A is the activity of the gates in the system (i.e., the average fraction of the gates that switch in a given clock cycle), C is the total capacitance, I_L is the leakage current, and I_S is the short-circuit current that flows for the brief time τ whenever a gate switches. The first term, ACV^2 f, determines the dynamic power consumption, and accounts for the majority of power dissipated by
today’s CMOS circuits [33]. Therefore, lowering the supply voltage results in reduction
of the processor power consumption. However, the maximum operating frequency of the
processor depends on the supply voltage according to the following formula [33]:
f_max ∝ (V − V_T)^2 / V, (2.2)
where V_T is the threshold voltage of the CMOS device, i.e., the voltage that, when applied to the transistor gate, causes the transistor to switch [44]. Consequently, lowering the
supply voltage of a given circuit must be accompanied by lowering its operating frequency,
which results in degraded performance.
DVS-enabled processors are capable of operating at different supply voltages. They
are equipped with circuits that regulate the supply voltage and operating frequency at
run-time under the control of software. At each voltage level, the operating frequency
is set to the maximum allowed for the corresponding supply voltage. The energy con-
sumed by the processor for a given computation is equal to the product of power and
time. Therefore, lowering the supply voltage and operating frequency reduces the en-
ergy consumed for a given computation if the factor by which the power is reduced is
greater than the factor by which the computation is slowed down. Since, according to
Equations (2.1) and (2.2), this condition is satisfied, DVS introduces the possibility
of an effective dynamic trade-off between processor power consumption and performance.
In computer systems containing DVS-enabled processors, power can be saved if DVS is
applied in situations where computation can be slowed down with an acceptable loss of
performance.
The first DVS-enabled processors were implemented relatively recently [6, 24, 36], but
a number of DVS-enabled processor designs have since appeared on the market. Examples are the Intel XScale [7, 18], IBM's PowerPC 405LP [35],
and Transmeta’s Crusoe [42]. These processors support several discrete voltage levels
with different frequencies and rates of power consumption. The transition between levels
can be performed at run-time by executing specific machine instructions. We base our
approach on the assumption that the DVS capabilities of the processors in the MLCA
system are implemented similarly.
To implement DVS capabilities in a SOC multiprocessor such as the MLCA, it is
necessary to partition the chip into multiple power domains, whose voltage and frequency
can be varied independently at run-time, and place each processor into a separate power
domain. Implementation of chips with multiple power domains is an active area of
research [9, 41]. Flautner et al. [10] have implemented IEM926, a DVS-enabled single-
processor SOC in which the voltage and frequency of the processor core can be varied at
run-time independently of the rest of the chip.
Switching between voltage levels at run-time incurs certain overheads in time and
energy, which generally depend on the levels between which the processor is transitioning.
For modern DVS-enabled processors, the time overhead of a single transition is on the
order of tens of microseconds [7, 15], which translates into several thousand or tens of
thousands of processor cycles. Such overheads are not negligible. Therefore, a practical
strategy for applying DVS to processors must take into account not only the performance
penalty due to reduction in frequency, but also the impact of transition overheads.
With each new generation of CMOS technology, leakage power increases as a consequence of further miniaturization. In forthcoming generations, leakage power is likely to become comparable to dynamic power as a component of total circuit power dissipation [23]. DVS is significantly less effective at reducing leakage power than dynamic power, and novel techniques will be necessary to deal with it.
One such technique is the adaptive body biasing (ABB), which achieves a dynamic trade-
off between leakage power and operating frequency by dynamically scaling the threshold
voltage [3, 23, 30]. Effective combined use of DVS and ABB in presence of substantial
leakage power dissipation is an active area of research [3, 30]. Although we consider DVS-
based power optimizations in our work, our technique is based only on the assumption
that the processors in the MLCA system support some mechanism for dynamic trade-off
between power and performance, regardless of the underlying technology.
Chapter 3
System and Application Properties
In this chapter, we describe the properties we assume of the MLCA system and of the
applications. Section 3.1 states our assumptions about the MLCA system hardware.
Section 3.2 states our assumptions about the MLCA application characteristics. Section 3.3
introduces our execution model of MLCA applications, which is based on the as-
sumptions stated in the previous two sections.
3.1 Hardware Properties
3.1.1 Architectural Properties
As described in Chapter 2, the MLCA is a template architecture with many variable de-
sign parameters, whose lower level consists of a possibly heterogeneous set of processing
units, with an arbitrary type of interconnection network and memory architecture. How-
ever, in this thesis, we assume that the MLCA system features a homogeneous set of PUs
with uniform access to a shared memory. These assumptions simplify the system model.
The assumption of homogeneity renders the system symmetrical to the task scheduling
algorithm employed by the task dispatcher, eliminating the need to consider PU prefer-
ences of tasks. The assumption of uniformly accessed shared memory eliminates the issue
of inter-task communication outside the access to the URF, since the data written into
the shared memory by a task are made available to all processors with identical access
time.
We assume that the number of processors in the MLCA system is the maximum
allowed by the scalability of the application, i.e. adding more processors to the system
would result in little or no improvement in application performance. Furthermore, we
assume that the number of renaming registers is large enough to eliminate all false data
dependences at run-time. These assumptions are reasonable, since the MLCA system
will be custom-designed to meet application requirements, and designers will customize
the system parameters to enable the maximum parallelism in the application execution.
Moreover, applications running on a system with fewer processors than the
maximum tend to have high levels of processor utilization and therefore do not offer
significant opportunities for power optimizations using DVS.
3.1.2 Power-Related Properties of the Processors
We assume that the processors in the MLCA system support several discrete voltage
levels with different operating frequencies and rates of power consumption, with the
possibility of software-controlled switching between levels at run-time. As described in
Section 2.2, this assumption reflects an increasing number of modern high-end embedded
processors. Furthermore, we assume that the voltage and frequency of each processor can
be varied independently. SOCs partitioned into multiple power domains whose voltage
and frequency can be varied separately at run-time are an active area of research [9, 41].
Therefore, it is reasonable to expect that DVS will be supported by the processors in
future MLCA systems.
On a DVS-enabled processor, the relation between the operating frequency and the
execution time of a given sequence of instructions is non-trivial, because the number of
instructions executed per cycle may vary with the processor frequency. However, as a
simplifying assumption in our model, we assume that the number of processor cycles
necessary for the execution of a given sequence of instructions is constant, which implies
that the execution time of a task is inversely proportional to the processor frequency. This
assumption is conservative because, regardless of the processor frequency, each operation
within the processor takes the same number of processor cycles, while the latency of an
access to a component outside the processor, such as a memory access in the case of a cache
miss, is independent of the processor frequency. Therefore, when the operating frequency
is lowered, the total number of processor cycles necessary to execute a given sequence of
instructions either remains constant or becomes lower.
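Under this constant-cycle-count assumption, frequency scaling of a task's execution time reduces to simple arithmetic. The sketch below uses an assumed task that takes 1500 µs at 400 MHz; the cycle count and the frequency levels are illustrative.

```python
def scaled_time_us(cycles: int, freq_mhz: float) -> float:
    """Task execution time under the constant-cycle-count assumption:
    the time is inversely proportional to the operating frequency
    (1 MHz = 1 cycle per microsecond)."""
    return cycles / freq_mhz

# A task taking 1500 us at 400 MHz is assumed to need a fixed
# 600,000 cycles regardless of the frequency it runs at.
cycles = 1500 * 400
for f in (400, 300, 240):
    print(f, scaled_time_us(cycles, f))   # 1500.0, 2000.0, 2500.0 us
```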
Switching between voltage levels at run-time incurs certain overheads in time and
energy, which generally depend on the levels between which the processor is transitioning.
Unlike most of the previous work in the area of DVS-based power optimizations (which
we survey in Chapter 7), we take the effect of transition overheads into consideration.
We assume that the time and energy overheads are constant and independent of voltage
levels. We believe that this simplification does not significantly affect the accuracy of
our results. Mochocki et al. [31] suggest that the assumption of constant time overheads
is reasonable. Furthermore, the results of our experimental evaluation, presented in
Chapter 6, show that the transitions account for a very small percentage of the overall
energy consumption after our technique is applied.
Besides DVS, another power management feature of modern embedded processors is
the possibility of entering a lightweight idle mode implemented by stopping the internal
processor clock until the processor activity is resumed by an external interrupt [18].
In this idle mode, the processor power is not reduced to a negligible level, but the
architectural state of the processor is preserved, so the overheads of entering and exiting
the mode are negligible. For example, the Intel 80200 processor—based on the XScale core—
takes several tens of clock cycles to exit the idle mode, depending on the operating
frequency [18]. We assume that each processor in the MLCA system remains in this
idle mode whenever it is not executing a task. The interrupt necessary to resume the
processor activity can be generated by the task dispatcher whenever a task is scheduled
onto the processor.
3.2 Application Properties
Our approach exploits certain regularities often exhibited by control programs of MLCA
multimedia applications. The control program of a typical multimedia application con-
sists of some initialization and clean-up code, each executed only once, and one or more
long-running loops. These loops handle the processing of the input media stream, and
account for the vast majority of the application execution time. A common property
of these loops is that their bodies contain few control-flow instructions, and often
none at all. Furthermore, the control-flow instructions that do occur are usually tests
for exceptional conditions and are thus severely biased in one direction. In such cases,
the control-flow instructions can be ignored, and the loop can be treated for practical
purposes as a loop without control-flow instructions in its body, containing only the
most frequent flow of execution. We refer to each loop with
these properties as a target loop. Our approach aims for power optimizations of the ex-
ecution of target loops in MLCA multimedia applications. Section 5.7 describes several
more general classes of loops that can be handled as target loops by our technique and
discusses the possible methods for selection of target loops using the results of compiler
analysis and profiling.
Example Sarek code of a target loop is shown in Figure 3.1. The body of this
loop consists of calls to five task functions, Task_1–Task_5, and contains no control-flow
statements. Task function Task_5 updates the control variable finished, which serves
as the loop exit condition. Each of these statements is translated into a single HyperAssembly
task instruction, and the while statement is translated into a HyperAssembly conditional
branch instruction.
do {
Task_1(in x, out x, out y);
Task_2(in y, in z, out y);
Task_3(in y, out y, out z);
Task_4(in y, in w, out w);
finished = Task_5(in index, out index);
} while (!finished);
Figure 3.1: Sarek code of an example target loop
Another assumption that we place on the target loop is that the parallelism between
its iterations is not constrained by control dependences. Control dependences in target
loops can be effectively eliminated at run-time using branch prediction and speculative
execution, since the branch prediction will be almost perfectly accurate. However, even
without speculative execution, control dependences can be ignored in practice for many
target loops in MLCA multimedia applications. These loops are typically running as long
as the input media stream is incoming. Therefore, the task that updates the iteration
counter often does not depend on any other tasks in the loop, but instead only detects
if the end of the input has been reached. Furthermore, the execution time of these tasks
is usually short, since they do not perform any other computations. Therefore, these
tasks can be executed in advance and out-of-order, making it safe to ignore the control
dependences in the loop analysis. For example, in the target loop shown in Figure 3.1, the
branch direction in each iteration depends on the output of the task executing Task_5,
and thus there is a control dependence between this task and all tasks executed by
the subsequent loop iterations. However, as soon as the task executing Task_5 in one
iteration has finished, the equivalent task from the following iteration can be executed
immediately, since it does not depend on the output of any other tasks. Assuming that
the execution time of the task function Task_5 is relatively short, tasks executing this
task function from multiple iterations will be executed in advance and out-of-order, thus
effectively eliminating the influence of the control dependences on the parallel execution
of tasks.
3.3 Task Graph Execution Model
Under the assumptions outlined in the previous two sections, it is possible to represent
the execution of an MLCA application using a data structure called the task graph. The
task graph is a directed acyclic graph whose nodes represent individual tasks (i.e. instances
of task instructions) and whose edges represent dependences between pairs of tasks. Each
node of the task graph is labeled with the execution time of the corresponding task. Since we
assume that control dependences can be safely ignored, and the false dependences are
eliminated at run-time by the control processor, the task graph need contain only edges
representing the true data dependences between tasks. We define the task graph of a
target loop as the subgraph of the task graph of the whole application that contains only
the tasks executed by the target loop.
Figure 3.2 shows the task graph of the loop from Figure 3.1, assuming that three
iterations of the loop are executed. In the figure, the symbol TNi denotes the task that
executes task function Task_N in loop iteration i. According to the labels in Figure 3.2,
we assume that the execution times of task functions Task_1–Task_5 are 1500, 1000,
1500, 2000, and 200 time units, respectively, and that the execution time of each task
function does not vary between loop iterations. We add two additional nodes TIN and
TOUT to the task graph, corresponding to the loop entry and exit.
Assuming that each task writes its outputs immediately prior to the end of its exe-
cution, the information contained in the task graph is sufficient to fully characterize the
application execution at run-time. Without this assumption, it would be necessary to
additionally specify, for each dependence Tim δᵗ Tjn, the time at which Tim writes the last
output read by Tjn (this could be achieved, for example, by labeling the edges in the task
Figure 3.2: Task graph of the loop from Figure 3.1
graph). In MLCA applications for which this assumption does not hold, it is possible to
apply a compiler transformation that splits the task functions across the instructions that
perform writing to the universal register file. For example, the task function shown in
Figure 2.4 can be split into two task functions, each of which writes one output parameter
to the URF. We assume that the performance impact of the task splitting transformation
is negligible. This assumption is reasonable, because task splitting does not increase the
total computational load on the processors, while the load on the universal register file
is increased only slightly.
The execution of an application on an MLCA system is equivalent to the scheduling
of its task graph on the set of processors featured by the MLCA system, using the task
scheduling algorithm employed by the task dispatcher. The minimum execution time of
the application is equal to the length of the longest path in the task graph, which we refer
to as the critical path. For example, the critical path in the task graph shown in Figure 3.2
is (TIN, T11, T21, T31, T22, T32, T23, T33, T43, TOUT), whose length is eleven thousand time
units. The critical path is highlighted by thick lines in the figure. Since there exists a
chain of dependences from the first task in the critical path to the last one, tasks on
the critical path must be executed sequentially. Therefore, the total execution time of
the application cannot be shorter than the length of the critical path, regardless of the
available number of processors. The total execution time of the application is equal to this
minimal value only if the tasks from the critical path are executed as an uninterrupted
sequence.
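The critical-path computation on this task graph amounts to a longest-path search over the unrolled DAG. The sketch below rebuilds the three-iteration graph of Figure 3.2 from the true dependences implied by the register usage in Figure 3.1; the edge set is our reconstruction of that figure, not code from the thesis.

```python
def task_graph(num_iters):
    """Task graph of the Figure 3.1 loop unrolled for num_iters iterations
    (the structure sketched in Figure 3.2), with the assumed execution
    times 1500, 1000, 1500, 2000, and 200 units for Task_1-Task_5."""
    t = {1: 1500, 2: 1000, 3: 1500, 4: 2000, 5: 200}
    nodes = {"IN": 0, "OUT": 0}
    edges = []
    for i in range(1, num_iters + 1):
        for n in t:
            nodes[f"T{n}_{i}"] = t[n]
        # intra-iteration (distance-0) true dependences through register y
        edges += [(f"T1_{i}", f"T2_{i}"), (f"T2_{i}", f"T3_{i}"),
                  (f"T3_{i}", f"T4_{i}")]
        if i < num_iters:   # loop-carried (distance-1) dependences
            edges += [(f"T1_{i}", f"T1_{i+1}"), (f"T3_{i}", f"T2_{i+1}"),
                      (f"T4_{i}", f"T4_{i+1}"), (f"T5_{i}", f"T5_{i+1}")]
    edges += [("IN", "T1_1"), ("IN", "T5_1"),
              (f"T4_{num_iters}", "OUT"), (f"T5_{num_iters}", "OUT")]
    return nodes, edges

def critical_path_length(nodes, edges):
    """Length of the longest path in the DAG, by dynamic programming over
    predecessors; node weights are the task execution times."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    memo = {}
    def longest(v):
        if v not in memo:
            memo[v] = nodes[v] + max((longest(u) for u in preds[v]), default=0)
        return memo[v]
    return longest("OUT")

print(critical_path_length(*task_graph(3)))   # 11000, as stated in the text
```

For three iterations the longest path is the one highlighted in Figure 3.2, of length eleven thousand time units.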
Applying DVS to a task on the critical path increases the length of the critical path
and thus negatively impacts the application performance. Therefore, in order to achieve
power savings using DVS without affecting the application performance, it is necessary to
identify a set of tasks that do not belong to the critical path and restrict the application
of DVS to these tasks.
Our algorithms for voltage selection and task scheduling, which we describe in Chap-
ter 5, use profiling information and dependence analysis to deduce the properties of the
task graph of the target loop, in particular the set of tasks that lie on the critical path.
The assignment of processor voltage levels and the task scheduling algorithm are based
on these deduced properties.
Chapter 4
Problem Formulation
This chapter presents the formal statement of the problem addressed by our approach, in
the context of the assumptions outlined in Chapter 3. Section 4.1 introduces the notion
of slack in task graphs. Section 4.2 presents the formal problem statement.
4.1 Slack in Task Graphs
The goal of our technique is to identify the critical path in the execution task graph of
a given application and prolong the execution time of tasks outside of the critical path
using DVS in a manner that does not introduce a new, longer critical path, thus achieving
power savings with little or no impact on the application performance.
The slack of a task is defined as the maximum time by which the execution time of the
task can be prolonged without affecting the overall application execution time. Similarly,
the slack of a set of tasks in the application is the maximum time by which the total
execution time of these tasks can be prolonged without affecting the overall application
execution time. This slack can be distributed across the tasks in multiple ways, which may
result in different levels of power savings and may be subject to additional application-
dependent constraints.
The task graph shown in Figure 4.1 is used to illustrate the distribution of slack using
Figure 4.1: An example task graph
DVS. Nodes in the task graph are labeled with task execution times in microseconds. The
critical path in this task graph is (TIN, T1, T4, T6, TOUT), which is highlighted by thick
lines in the figure. The length of the critical path is 6000µs. Assuming that the task graph
is scheduled on two processors, there exists a certain slack, since tasks T2, T3, and T5 are
not on the critical path and their execution time can be prolonged up to the limit where
a new, longer critical path is introduced. In particular, the slack of task T2 is 500µs, since
its execution time can be prolonged up to 2000µs. Prolonging the execution
time of T2 beyond 2000µs introduces a new, longer critical path (TIN, T2, T4, T6, TOUT).
Similarly, a slack of 1000µs is available for distribution across tasks T3 and T5; if the sum
of the extra execution times added to these tasks is more than 1000µs, a new, longer
critical path (TIN, T1, T3, T5, TOUT) is introduced. If the execution times of the tasks T2,
T3, and T5 are prolonged by 500µs each, the total execution time of the task graph on
two processors is still 6000µs. The increase in execution time can be realized using DVS,
thus achieving a reduction in power consumption without increasing the overall execution
time.
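The per-task slack in this example can be reproduced with a small longest-path calculation: the slack of a task is the critical-path length minus the longest path passing through that task. Since Figure 4.1 is not reproduced here, the node times below are assumptions consistent with the numbers quoted in the text; in particular, only the sum of T4 and T6 (4000µs) is determined by the text, and the 2500/1500 split is assumed.

```python
# Assumed node times (us) and edges for the Figure 4.1 task graph.
time = {"IN": 0, "T1": 2000, "T2": 1500, "T3": 1500,
        "T4": 2500, "T5": 1500, "T6": 1500, "OUT": 0}
edges = [("IN", "T1"), ("IN", "T2"), ("T1", "T3"), ("T1", "T4"),
         ("T2", "T4"), ("T3", "T5"), ("T4", "T6"),
         ("T5", "OUT"), ("T6", "OUT")]

succs = {v: [] for v in time}
preds = {v: [] for v in time}
for u, v in edges:
    succs[u].append(v)
    preds[v].append(u)

def longest_to(v, memo={}):
    """Longest path ending at v, inclusive of v's own time."""
    if v not in memo:
        memo[v] = time[v] + max((longest_to(u) for u in preds[v]), default=0)
    return memo[v]

def longest_from(v, memo={}):
    """Longest path starting at v, inclusive of v's own time."""
    if v not in memo:
        memo[v] = time[v] + max((longest_from(u) for u in succs[v]), default=0)
    return memo[v]

cp = longest_to("OUT")   # critical-path length: 6000 us
slack = {v: cp - (longest_to(v) + longest_from(v) - time[v]) for v in time}
print(cp, slack["T2"], slack["T3"], slack["T5"])   # 6000 500 1000 1000
```

Note that the 1000µs slack of T3 and T5 is shared: it bounds the sum of their extensions, not each one individually.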
The slack of 1000µs can be distributed across tasks T3 and T5 in multiple ways. For
simplicity, we assume that the processor executing tasks T3 and T5 supports three voltage
levels with the operating frequencies of 400MHz, 300MHz, and 240MHz, and ignore the
transition overheads and the idle power. Since we assume that the execution times of
tasks are inversely proportional to the processor frequency, the execution times of T3
and T5 at these levels are 1500µs, 2000µs, and 2500µs, respectively. The total execution
time of T3 and T5 can be increased by 1000µs, for example, by running both tasks at
300MHz, or by running one of them at 240MHz and the other one at 400MHz. The
optimal distribution of slack is determined by the power characteristics of the processor
voltage levels. For example, if the power relative to the highest voltage level is reduced
by 45% by switching to 300MHz, and by 65% by switching to 240MHz, energy savings
are greater if both tasks are run at 300MHz than if one of them is run at 240MHz and
the other one at 400MHz (26.7% vs. 20.8%). Finding the optimal distribution of slack
across a set of tasks is a non-trivial problem, which is further complicated if the effects
of transition overheads are taken into account. As a part of our technique, we use an
integer linear programming model to compute an efficient slack distribution across the
tasks selected for the application of DVS.
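The 26.7% and 20.8% figures above can be checked by enumerating the frequency assignments for T3 and T5 that respect the 1000µs slack. The power factors below encode the assumed 45% and 65% reductions relative to the 400MHz level.

```python
from itertools import product

# Assumed power factors relative to the 400 MHz level.
levels = {400: 1.00, 300: 0.55, 240: 0.35}
cycles = 1500 * 400      # T3 and T5 each take 1500 us at 400 MHz
slack_us = 1000

def saving(f3, f5):
    """Energy saving of a (T3, T5) frequency pair versus running both tasks
    at 400 MHz, or None if the pair exceeds the available slack."""
    t3, t5 = cycles / f3, cycles / f5
    if (t3 + t5) - 2 * 1500 > slack_us:
        return None
    energy = levels[f3] * t3 + levels[f5] * t5
    return 1 - energy / (2 * 1500)   # baseline energy: 1.0 * 1500 us per task

print(round(saving(300, 300), 3))    # 0.267 -> the 26.7% quoted in the text
print(round(saving(240, 400), 3))    # 0.208 -> the 20.8% quoted in the text
best = max((p for p in product(levels, repeat=2) if saving(*p) is not None),
           key=lambda p: saving(*p))
print(best)                          # (300, 300): run both tasks at 300 MHz
```

With discrete levels and many tasks this enumeration becomes infeasible, which is why the thesis resorts to an integer linear programming model for slack distribution.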
4.2 Problem Formulation
Similar to other authors [43, 46], we divide the problem of DVS-based power optimization
of MLCA applications into two sub-problems:
• Determining the voltage level at which each part of the application is executed.
We refer to this problem as voltage selection. Although DVS-enabled processors
are capable of switching between voltage levels at any point during the application
execution, we simplify our approach by assigning a unique voltage level to each task
executed at run time, allowing transitions only between tasks. We approach the
problem of voltage selection by identifying the tasks that fall outside the critical
path, computing the available slack, and finding a distribution of slack across these
tasks so as to minimize their energy consumption.
• Finding the appropriate task scheduling algorithm. A task scheduling algorithm
is defined by a scheme for ordering tasks in the task queue and the processor
mapping of tasks, which determines the choice of processor for the execution of
each dispatched task. The task scheduling algorithm must be implementable with
a reasonable overhead in hardware complexity of the MLCA task dispatcher.
These problems are interrelated, since for a given voltage selection, an inadequate
task scheduling algorithm can result in unacceptable degradation of application perfor-
mance. Similarly, with a given task scheduling algorithm, different voltage selections that
save comparable amounts of energy can have significantly different performance impact.
Therefore, the algorithms used to solve these two problems must be designed so as to
complement each other. Our approach is to first design a voltage selection algorithm and
then formulate a complementary task scheduling algorithm.
The voltage selection and task scheduling problems are both hard. The design space of
task scheduling algorithms is vast, and the problem of optimal scheduling of a task graph
with regard to minimizing the application execution time is NP-complete [25]. Regardless
of the simplifying assumption that a unique voltage level is assigned to each task, optimal
voltage selection with a given task scheduling algorithm is also an NP-hard problem if
the processor voltage levels are discrete [3]. Since the task graphs of MLCA applications
are large and complex in practice, these problems require heuristic solutions, which often
result in some increase in application execution time for realistic applications. Therefore,
in our problem formulation, we do not impose a strict requirement that the execution
time of the application must not be increased. However, in our experimental evaluation,
we show that the increase in execution time incurred by applying our technique is not
excessive.
Assuming that the target loop has the properties outlined in Chapter 3, the body of
the loop contains N task instructions S1, ..., SN , each of which invokes a task function
fn (it is possible that fi = fj for i ≠ j). We further simplify the voltage selection
problem by assigning a unique voltage level to each task instruction in the target loop.
All tasks executed by a particular task instruction in different iterations of the loop are
thus executed at the same voltage level. Therefore, we can state the following problem
formulation:
Voltage Selection and Task Scheduling Problem: For a given target loop in an
MLCA application with properties outlined in Chapter 3, find the voltage level Ln for
each task instruction Sn in the loop body, the ordering of the executed tasks, and the
processor mapping for each executed task, such that:
1. The task schedule does not violate the precedence constraints imposed by the data
dependences.
2. The energy consumed by the processors is minimized.
3. The execution time of the application is not significantly increased in comparison
to the execution time on the default MLCA system configuration.
We describe our proposed solution to the formulated problem in the following chapter.
Chapter 5
Voltage Selection and Task
Scheduling
In this chapter, we describe our algorithms for voltage selection and task scheduling.
Section 5.1 introduces the data structure on which the analysis of the target loop is
based—the loop dependence graph. Section 5.2 presents the procedure used to determine
the critical path of the dynamic task graph of the target loop. Section 5.3 describes the
procedure used to determine the available slack. Section 5.4 presents the algorithm for
selection of tasks to which DVS is applied. The algorithm for the distribution of slack
among the selected tasks, based on an integer linear programming model, is described
in Section 5.5. The task scheduling algorithm is described in Section 5.6. Section 5.7
discusses the selection of the target loops by the compiler.
5.1 The Loop Dependence Graph
The dependence graph [1] of the target loop is a directed graph whose nodes represent
task instructions from the target loop, and whose edges represent the data dependences
between pairs of task instructions. We label each node by the average execution time
of tasks executed by its corresponding task instruction, as determined by profiling. We
label each dependence edge with the distance of the dependence, which is defined as the
number of loop iterations between the execution of the sink and the source of the depen-
dence [1]. Since we assume that the MLCA control unit eliminates false dependences at
run-time using register renaming, we take only true data dependences (i.e. read-after-
write dependences) into account. We use the symbol Tni to denote the task executed by
the task instruction Sn in the loop iteration i. The distance of a true dependence Tmi δᵗ Tnj
is therefore equal to d if j = i + d. In the remainder of this section, we show that under
the stated assumptions, all dependences Tmi δᵗ Tnj between pairs of tasks executed by a
particular pair of task instructions (Sm, Sn) have the same distance, and unique distances
can thus be assigned to the dependences between pairs of task instructions.
Let {S1, . . . , SN} be the set of task instructions in the target loop. Since the body
of the target loop is assumed to contain no control-flow instructions, as described in
Section 3.2, each task instruction Sn is executed in every loop iteration. Assume that
there exists a dependence Tmi δᵗ Tnj that arises due to the existence of one or more registers
that are written by task Tmi and read by task Tnj. Obviously, i ≤ j must hold, because
the dependence source must precede the sink in the sequential execution order. However,
all registers written by task Tmi are also written by task Tm(i+1).¹ Therefore, in the
sequential execution order, task Tnj is preceded by task Tmi, but not by task Tm(i+1).
This is possible only if either m ≥ n and i = j − 1, or m < n and i = j. In the
former case, the dependence is loop-carried and has the dependence distance of one. In
the latter case, the dependence distance is zero and the dependence is loop-independent.
Therefore, we can divide the dependence edges in the loop dependence graph into these
two categories. Figure 5.1 shows an example target loop and its loop dependence graph.
The algorithm for constructing the loop dependence graph is shown in Figure 5.2.
¹In this chapter, we refer only to the logical registers in HyperAssembly code, which are in one-to-one correspondence with the Sarek data variables. In different loop iterations, operations with the same logical register can be renamed to different physical registers by the control processor, but the knowledge of the logical registers being read and written is sufficient for determining the true data dependences between tasks.
do {
Task_1(in index, out x);
Task_2(in x, in y, out x);
Task_3(in x, out y);
end = Task_4(in index, out index);
} while (!end);
(a) Sarek code (b) Dependence graph
Figure 5.1: Dependence graph of an example loop
For each register variable r read by task instruction Sn, the algorithm searches for the
task instruction Sm that outputs the last value written to r prior to each execution of Sn
and adds the appropriately labeled dependence edge to the graph.
In the remainder of this chapter, we assume that the execution times of tasks ex-
ecuted by each individual task instruction in the target loop are constant across loop
iterations. In other words, we assume that each task instruction is always executed with
the execution time assigned to its corresponding node in the loop dependence graph.
This assumption does not hold for real MLCA applications, since the execution times of
task functions normally depend on their inputs, which vary between iterations, as well
as between different input media streams. However, the positive results of the experi-
mental evaluation presented in Chapter 6 confirm that, despite this unrealistic
assumption in its theoretical derivation, our algorithm reasonably approximates
the behavior of real applications. Thus, in the remainder of this chapter, we refer to the
execution time of a task instruction as a uniquely defined value.
start with empty loop dependence graph;
for (each task instruction Sn in the target loop)
    add node Sn;
    mark node Sn with the average execution time of fn;
    for (each register variable r read by Sn)
        if (there exists m < n such that Sm writes to r)
            find maximum such m;
            add edge Sm → Sn;
            mark edge Sm → Sn with distance 0;
        else if (there exists m ≥ n such that Sm writes to r)
            find maximum such m;
            add edge Sm → Sn;
            mark edge Sm → Sn with distance 1;
Figure 5.2: Algorithm for constructing the dependence graph of the target loop
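A direct transcription of the Figure 5.2 algorithm into Python might look as follows. The register read/write sets are taken from the Figure 5.1(a) code; representing each instruction as a (name, reads, writes) triple is our own convention, not part of the thesis's toolchain.

```python
def build_dependence_graph(instrs):
    """Construct the loop dependence graph from per-instruction register
    read/write sets (the algorithm of Figure 5.2). `instrs` is a list of
    (name, reads, writes) triples in sequential program order."""
    edges = set()   # elements are (source, sink, distance) triples
    n_count = len(instrs)
    for n, (sn, reads, _) in enumerate(instrs):
        for r in reads:
            # Last writer of r before this execution of Sn: first look at
            # earlier instructions in the same iteration (distance 0) ...
            intra = [m for m in range(n) if r in instrs[m][2]]
            if intra:
                edges.add((instrs[max(intra)][0], sn, 0))
                continue
            # ... otherwise at the same or later instructions, whose write
            # comes from the previous iteration (distance 1).
            inter = [m for m in range(n, n_count) if r in instrs[m][2]]
            if inter:
                edges.add((instrs[max(inter)][0], sn, 1))
    return edges

# The loop of Figure 5.1(a):
loop = [("S1", {"index"}, {"x"}),
        ("S2", {"x", "y"}, {"x"}),
        ("S3", {"x"}, {"y"}),
        ("S4", {"index"}, {"index"})]
print(sorted(build_dependence_graph(loop)))
```

On this input the algorithm produces the loop-independent edges S1→S2 and S2→S3 and the loop-carried edges S3→S2, S4→S1, and S4→S4, matching the dependence graph of Figure 5.1(b).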
5.2 The Critical Path
In this section, we show how to deduce the critical path of the task graph of a target
loop using the loop dependence graph.
Let I be the number of iterations of the target loop whose body contains N task
instructions S1, . . . , SN . Since the body of the target loop is assumed to contain no
control-flow instructions, all N task instructions are executed in every iteration. There-
fore, besides the logical entry and exit nodes, the task graph of the target loop consists of
I subgraphs, each of which contains N nodes, corresponding to the tasks executed within
a single loop iteration. The total number of nodes in the task graph is therefore I ·N +2.
A part of the task graph of the target loop from Figure 5.1 is shown in Figure 5.3.
If the loop dependence graph of the target loop does not contain any cross-iteration
dependences, the loop is fully parallel, since a new iteration can be started at any time,
regardless of the status of tasks from previous iterations. A fully parallel loop scales
up to an arbitrary number of processors, provided that the number of loop iterations
is large enough. In some cases, the target loop can be fully parallel even in presence
of cross-iteration dependences. For example, if the loop shown in Figure 5.1 lacked the
dependences S3 δᵗ S2 and S4 δᵗ S4, it would be fully parallel, despite the presence of the
Figure 5.3: Task graph of the loop from Figure 5.1
cross-iteration dependence S4 δᵗ S1. This is evident from the fact that without these
dependences, shown with thick lines in Figure 5.3, the task graph would consist of a
series of disconnected subgraphs, each of which could be started independently. Since
the processor utilization during the execution of fully parallel loops is close to 100%, task
graphs of these loops do not contain any slack. Therefore, fully parallel loops do not offer
any opportunities for power optimizations using DVS.
Since the sequential execution order of task instructions is an order relation, the
subgraph of the loop dependence graph that contains only the edges with dependence
distance of zero is acyclic. Therefore, if the loop dependence graph contains any cycles,
each cycle contains at least one edge with dependence distance of one. We define a 1-cycle
in the loop dependence graph as a cycle that contains exactly one edge with dependence
distance of one. The key insight on which our voltage selection algorithm is based is that
each 1-cycle in the loop dependence graph translates into a path in the task graph that
stretches across all loop iterations. This path encompasses the tasks executed in each
iteration by the task instructions that form its corresponding 1-cycle. For example, both
cycles (S2, S3) and (S4) in Figure 5.1 are 1-cycles. They translate into the two paths
along the edges marked by thick lines in Figure 5.3.
If the number of loop iterations is large, the longest 1-cycle in the loop dependence
graph translates into the critical path in the task graph of the loop. The longest 1-cycle
(S2, S3) in the dependence graph from Figure 5.1 translates into the critical path in the
task graph from Figure 5.3, which starts with tasks TIN and T11 and then stretches across
pairs of tasks (T2i, T3i), for all i = 1, . . . , I. We call the task instructions that form the
longest 1-cycle in the loop dependence graph critical instructions, and the tasks executed
by these instructions critical tasks.
The first and last several tasks in the critical path may consist of non-critical tasks (i.e.
tasks executed by task instructions outside of the longest 1-cycle in the loop dependence
graph). For example, in the task graph from Figure 5.3, the critical path begins with
non-critical tasks TIN and T11. However, assuming that the number of iterations is large,
the total execution time of these non-critical tasks is negligible in comparison with the
contribution of the repeated execution of critical tasks. Therefore, the length of the
critical path can be approximated by I · τC , where τC is the length of the longest 1-cycle
in the loop dependence graph.
Finding the longest cycle in a graph is an NP-complete problem in the general
case [11]. However, the longest 1-cycle in the loop dependence graph can be found
by identifying each edge (Si, Sj) with dependence distance of one and searching for the
longest path from Sj to Si that contains only edges with dependence distance of zero.
The longest such path, together with the corresponding edge (Si, Sj), forms the longest
1-cycle. Since these searches for longest paths are limited to an acyclic subgraph of the
loop dependence graph, i.e. its subgraph containing only edges with dependence dis-
tance of zero, they can be performed in polynomial time [11]. Since the loop dependence
graphs in practice contain up to several tens of tasks, this procedure takes negligible
computational time.
5.3 The Available Slack
As explained in Section 3.3, the minimum execution time of a task graph is equal to the
length of its critical path, and a necessary condition for the execution time to be reduced
to this value is that the tasks from the critical path are executed in an uninterrupted
sequence. Assuming that the critical path in the task graph of the target loop can be
approximated by the path connecting the critical tasks, as defined in Section 5.2, the
minimum execution time of the target loop can be closely approximated by I · τC , where
I is the number of loop iterations and τC is the length of the longest 1-cycle in the loop
dependence graph. Therefore, the maximum number of processors for which the loop
scales is the one for which the execution time of the loop is reduced to approximately
I · τC . Increasing the number of processors beyond this value cannot result in additional
speedup of the loop execution, since the length of the critical path of a task graph is a
strict lower bound on its execution time.
Let NP be the number of processors in the MLCA system, and τ the sum of the
execution times of all non-critical task instructions in the target loop. Assuming that the
target loop is running on the maximum number of processors allowed by its scalability,
its total execution time is I · τC , and the sum of processor times spent in its execution is
thus NP · I · τC . Total processor time available per loop iteration is thus NP · τC . This
amount of processor time must be sufficient for the execution of all tasks from a single
iteration, both critical and non-critical ones, and must therefore be greater than the sum
of the execution times of all task instructions in the target loop, which is equal to τC + τ .
Thus, we can formulate the following necessary condition for the execution time of the
loop to be minimal:
NP · τC > τC + τ,
or equivalently:
(NP − 1) · τC > τ. (5.1)
Condition (5.1) effectively states that while the critical tasks from one iteration, whose to-
tal execution time is τC , continuously occupy one processor at a time, the total processor
time available on the remaining NP − 1 processors must be large enough to accommo-
date the execution of non-critical tasks whose total execution time is τ . Otherwise, the
available processor time per iteration would be insufficient for the continuous execution
of non-critical tasks, which would result in slowdown due to delays in the execution of
critical tasks. This conclusion holds regardless of the possible out-of-order execution of
non-critical tasks.
Although necessary, Condition (5.1) is not always sufficient to guarantee that the
critical tasks can be executed in an uninterrupted sequence. For example, in the loop
dependence graph shown in Figure 5.4, the longest 1-cycle is (S2, S5, S6) (for simplicity,
labels on the edges with distance zero are omitted). If the loop is executed on two
processors, Condition (5.1) is satisfied, since NP = 2, τC = 3000, and τ = 2400. However,
with two processors it is not possible to schedule the critical tasks in an uninterrupted
sequence, since that would require S3 and S4 to be executed in parallel with S5 in each
iteration, which is impossible with only two processors. The maximum speedup of this
loop is thus achieved with three processors.
From the derivation of Condition (5.1), it follows that the difference
tS = (NP − 1) · τC − τ (5.2)
is equal to the idle processor time per loop iteration. Since the number of processors NP
is discrete, tS is likely to be greater than zero even if the limit to the loop scaling is equal
to the smallest NP that satisfies Condition (5.1). From Formula (5.2), it follows that
tS is a strict upper bound on the available slack per loop iteration. Prolonging the
total execution time of non-critical tasks in each iteration by more than tS is guaranteed
to result in an increase in the total execution time of the loop, since it would lead to
a violation of Condition (5.1) and cause a delay in the execution of the critical tasks.
Therefore, the available slack per iteration is equal to a certain fraction of tS.
Figure 5.4: An example loop dependence graph
Ideally, if the slack per iteration is equal to tS, the total execution time of the non-
critical task instructions can be increased from τ to τ + tS without causing delay in
the execution of the critical tasks, thus reducing the idle processor time to zero without
affecting the overall loop execution time. The example from Figure 5.4 illustrates that
this is not always the case. If this loop is executed on three processors, Formula (5.2)
yields tS = 3600. However, if the total execution time of non-critical tasks is increased by
that amount, task instructions S3 and S4 cannot be executed in parallel with S5 in each
iteration without delaying the execution of S6, regardless of the way in which the slack
is distributed across S1, S3, S4, and S7. Therefore, Formula (5.2) clearly overestimates
the available slack in this case.
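Both observations can be reproduced numerically. The following one-line helper (the function name is ours) evaluates Formula (5.2) with the values quoted above for the loop in Figure 5.4:

```python
def slack_per_iteration(num_procs, tau_c, tau):
    """Formula (5.2): t_S = (N_P - 1) * tau_C - tau.
    Condition (5.1) holds exactly when the result is positive."""
    return (num_procs - 1) * tau_c - tau

# Two processors: Condition (5.1) is satisfied (600 > 0) ...
print(slack_per_iteration(2, 3000, 2400))  # 600
# ... three processors: Formula (5.2) yields 3600, which, as argued above,
# overestimates the slack that is actually usable for this particular loop.
print(slack_per_iteration(3, 3000, 2400))  # 3600
```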
In our voltage selection algorithm, we use the value tS computed according to
Formula (5.2) as an approximation of the available slack per loop iteration. However, in
order to allow for the possibility that this value overestimates the available slack, we use
only a fraction r of the computed slack tS. Choosing a smaller value of r also enables us
to compensate for the effects of relaxing the unrealistic assumption that the execution
time of each task instruction is invariable. We leave r as a tunable parameter of our
voltage selection algorithm, with a value between zero and one. For r = 0, no slack is
used and no tasks are scaled (although the application execution time and energy
consumption can still be affected by our task scheduling algorithm). For r = 1, the entire
slack tS computed according to Formula (5.2) is distributed across the non-critical
tasks in each iteration. Since the evaluation presented in Chapter 6 shows that our
algorithms do not require excessive execution time, it is possible to determine a good
value of r for the target application during the profiling phase by repeatedly applying
our technique with different values of r and comparing the results.
5.4 Selection of Task Instructions for Slack Distribution
In this section, we present our procedure for selecting the non-critical tasks among which
the slack is distributed by applying DVS. We refer to these tasks as scaled tasks, and the
task instructions executing these tasks as scaled task instructions.
Since we assume that the overheads of transitions between voltage levels are not
negligible, we select the scaled tasks so as to minimize the number of transitions. For
this purpose, we group the scaled tasks into sets containing multiple tasks and schedule
each set of scaled tasks as an uninterrupted sequence on a single processor, thus making
it possible to amortize the impact of each transition overhead over multiple tasks.
Our task scheduling algorithm, which we describe in Section 5.6, is formulated so as to
schedule each designated set of scaled tasks in this manner.
We analyze the loop dependence graph in order to identify sets of non-critical task
instructions that can be executed as uninterrupted sequences in each loop iteration and
are therefore suitable candidates for scaling. In the example from Figure 5.5,
the longest 1-cycle is (S5, S7, S8), and the remaining task instructions are thus non-
critical. For an arbitrarily selected set K of non-critical task instructions, it may or may
not be possible to execute the task instructions from K as an uninterrupted sequence.
For example, if K = {S1, S2, S4}, task instructions from K cannot be executed as an
uninterrupted sequence, since S4 cannot start before both S2 and S3 have finished, and
S3 cannot be executed in parallel with S2 without delaying the execution of S4. Therefore,
K would not be a suitable choice for a set of scaled task instructions. In contrast, the
choice of set {S1, S3, S4} would be suitable, since it is possible to schedule these task
instructions as an uninterrupted sequence, assuming that S2 is executed in parallel with
S3.
Furthermore, we wish to exclude from consideration those non-critical tasks whose
prolonging may interrupt the continuous execution of critical tasks. For example,
although task instruction S6 is characterized by a slack of 2000 time units, prolonging
the tasks executed by S6 increases the probability that the execution of the critical task
instruction S8 will be delayed in iterations in which S6 cannot be scheduled in parallel
with S7 promptly after S5 finishes. In order to reduce the probability of such
occurrences, task instructions such as S6, which must be executed in parallel with the
critical task instructions, are not considered as candidates for scaling.
We formulate the criteria for the selection of scaled task instructions with the
considerations outlined above in mind. We define two sets of candidates for scaling, which we
denote C1 and C2, using criteria designed so as to exclude tasks whose prolonging can
directly interrupt the continuous execution of critical tasks, such as those executed by
the task instruction S6 in the example in Figure 5.5:
• C1 is the set of task instructions that can be executed before any of the critical
task instructions from the same loop iteration have started execution.
• C2 is the set of task instructions that do not belong to C1 and can be executed after
all of the critical instructions from the same loop iteration have finished execution.
Figure 5.5: An example loop dependence graph
For example, in the loop dependence graph shown in Figure 5.5, C1 = {S1, S2, S3, S4, S11},
and C2 = {S9, S10}.
To ensure that each set of scaled tasks can be executed as an uninterrupted sequence,
we compute the longest paths in the subgraphs of the loop dependence graph comprised
of the nodes from C1 and C2 and the edges with dependence distance zero between
these nodes. These two paths define the two sets of scaled task instructions, which we
denote using symbols K1 and K2. For the example from Figure 5.5, K1 = {S1, S3, S4},
and K2 = {S9, S10}. Tasks from each of these paths can be executed as uninterrupted
sequences if the number of processors is large enough to accommodate the execution of
G = loop dependence graph;
C = set of critical task instructions;
C1 = C2 = empty set;
G1 = G2 = empty graph;
for (each task instruction S in the target loop)
    if (S does not depend on any task instructions from C)
        add S to C1;
    else if (no task instruction from C depends on S)
        add S to C2;
add nodes from G corresponding to task instructions from C1 to G1;
add nodes from G corresponding to task instructions from C2 to G2;
label nodes from G1 and G2 with execution times of task instructions;
for (each edge Si → Sj labeled with distance 0)
    if (Si ∈ C1 and Sj ∈ C1)
        add edge Si → Sj to G1;
    else if (Si ∈ C2 and Sj ∈ C2)
        add edge Si → Sj to G2;
P1 = longest path in G1;
P2 = longest path in G2;
K1 = set of task instructions from P1;
K2 = set of task instructions from P2;
Figure 5.6: Algorithm for selecting the sets of scaled task instructions
all tasks that must be executed in parallel with each path. We assume that this is the
case, which is reasonable since we assume that the loop is executed on the maximum
number of processors allowed by its scalability.
The pseudocode of the algorithm for selection of scaled tasks is shown in Figure 5.6.
One step in the algorithm includes the computation of the longest paths in graphs G1 and
G2. Although the general problem of finding the longest path in a graph is NP-complete,
searching for the longest paths in G1 and G2 can be done in polynomial time, since these
graphs are acyclic [11]. In practice, these computations require negligible time because
of the small size of loop dependence graphs.
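The algorithm of Figure 5.6 can be rendered as executable Python, as sketched below. The graph encoding (adjacency dict over zero-distance edges, node-weight dict) is our own assumption, and the toy graph at the bottom is invented for illustration; it is not the graph of Figure 5.5.

```python
def select_scaled_sets(nodes, zero_edges, weights, critical):
    """Sketch of Figure 5.6: build candidate sets C1/C2 from zero-distance
    dependences, then take the longest path inside each as K1/K2."""
    # Transpose the zero-distance edges to get predecessor lists.
    preds = {v: set() for v in nodes}
    for u, succs in zero_edges.items():
        for v in succs:
            preds[v].add(u)

    def reachable(start, nbrs):
        """All nodes transitively reachable from start via nbrs."""
        seen, stack = set(), list(nbrs.get(start, []))
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(nbrs.get(u, []))
        return seen

    succ_map = {u: zero_edges.get(u, []) for u in nodes}
    # C1: does not (transitively) depend on any critical instruction.
    c1 = {s for s in nodes if s not in critical
          and not (reachable(s, preds) & critical)}
    # C2: not in C1, and no critical instruction depends on it.
    c2 = {s for s in nodes if s not in critical and s not in c1
          and not (reachable(s, succ_map) & critical)}

    def longest_path_set(members):
        """Longest weighted path within the induced acyclic subgraph."""
        memo = {}
        def dfs(u):
            if u in memo:
                return memo[u]
            best, path = weights[u], [u]
            for v in zero_edges.get(u, []):
                if v in members:
                    l, p = dfs(v)
                    if weights[u] + l > best:
                        best, path = weights[u] + l, [u] + p
            memo[u] = (best, path)
            return memo[u]
        if not members:
            return set()
        return set(max((dfs(u) for u in members), key=lambda t: t[0])[1])

    return longest_path_set(c1), longest_path_set(c2)

# Invented toy graph: X is the only critical instruction; A -> B -> X -> C -> D.
nodes = ['A', 'B', 'C', 'D', 'X']
edges = {'A': ['B'], 'B': ['X'], 'X': ['C'], 'C': ['D']}
w = {'A': 100, 'B': 200, 'C': 150, 'D': 50, 'X': 500}
print(select_scaled_sets(nodes, edges, w, {'X'}))
```

In this toy instance, A and B precede the critical instruction and form K1, while C and D follow it and form K2.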
The amount of slack available per loop iteration, computed as described in Section 5.3,
is distributed across the tasks executed by the task instructions from sets K1 and K2 in
each iteration.
5.5 The ILP Model for Slack Distribution
Once the amount of slack per loop iteration tS is computed and the sets of scaled task
instructions are selected, it is necessary to determine the distribution of slack across the
scaled task instructions. We formulate the problem of optimal slack distribution as an
integer linear programming (ILP) model similar to the one used by Saputra et al. [38].
As described in Section 5.4, we select two disjoint sets of scaled task instructions
K1 and K2 from the target loop, which translate into two sets of scaled tasks in each
iteration. The ILP formulation of the optimal slack distribution problem is based on the
assumption that each set of scaled tasks is executed as an uninterrupted sequence on a
single processor, which is ensured by the task scheduling algorithm. The main constraints
in the ILP model stem from the requirement that the total execution time of all scaled
tasks in each iteration must not exceed their execution time at the highest voltage level
plus the available slack per iteration. The model takes the effects of transition overheads
into account, and the optimal solution will be the one that most efficiently amortizes
these overheads across each set of scaled tasks.
5.5.1 Variables and Constraints
Let m and n be the number of task instructions from sets K1 and K2, respectively. We
denote the task instructions from K1 using symbols S1, . . . , Sm, and the task instructions
from K2 using symbols S′1, . . . , S′n (in order to simplify the notation, we use Si here to
denote an arbitrary task instruction, not the i-th task instruction in the sequential
execution order as in the previous examples of loop dependence graphs).
Let L be the number of voltage levels supported by the processors, which are enumerated
so that L is the level with the highest voltage and operating frequency. We encode
the voltage level assigned to each task instruction Si from the first set using L binary
variables αi1, . . . , αiL. The value of variable αij is one if Si is executed at the voltage level
j, and zero otherwise. We similarly encode the voltage level of each task instruction S′k
from the second set using L binary variables βk1, . . . , βkL. Obviously, out of each set of
L variables αi1, . . . , αiL and βk1, . . . , βkL, the value of exactly one variable must be one.
Hence the constraints:
αi1 + · · ·+ αiL = 1 (5.3)
βk1 + · · ·+ βkL = 1, (5.4)
for all i = 1, . . . , m and k = 1, . . . , n.
In order to encode the occurrences of voltage level transitions, we introduce m + 1
binary variables c0, . . . , cm and n + 1 binary variables d0, . . . , dn.
The value of variable c0 is one if a voltage transition takes place immediately prior to
the execution of the task instruction S1, and zero otherwise. Since the non-scaled tasks
execute at level L, this condition is equivalent to α1L = 0. The value of variable cm is
one if a transition takes place immediately after the execution of the task instruction
Sm, i.e. if αmL = 0, and zero otherwise. Similarly, the value of d0 is one if and only if a
transition takes place immediately prior to the execution of S′1, and the value of dn is one
if and only if a transition takes place immediately after the execution of S′n. Therefore,
the constraints on the values of these variables are:
the constraints on the values of these variables are:
c0 = 1 − α1L (5.5)
cm = 1 − αmL (5.6)
d0 = 1 − β1L (5.7)
dn = 1 − βnL (5.8)
For i = 1, . . . , m − 1, the value of variable ci is one if a transition takes place between
the execution of task instructions Si and Si+1, and zero otherwise. Since a transition
occurs between two tasks if and only if the voltage levels at which these tasks are executed
are different, ci = 1 if and only if αij ≠ α(i+1)j for exactly two values of j = 1, . . . , L. Since
the variables are binary, the definition of variables ci can be expressed by the following
set of m − 1 non-linear constraints:

ci = max_{j=1..L} (αij − α(i+1)j),

for all i = 1, . . . , m − 1. This set of m − 1 non-linear constraints is equivalent to the
following (m − 1) · L linear constraints:

ci ≥ αij − α(i+1)j, (5.9)

for all i = 1, . . . , m − 1 and j = 1, . . . , L.
For k = 1, . . . , n − 1, the value of variable dk is one if a transition takes place between
the execution of task instructions S′k and S′k+1, and zero otherwise. By reasoning identical
to that applied to variables ci, we arrive at the following (n − 1) · L constraints for these
variables:

dk ≥ βkj − β(k+1)j, (5.10)

for all k = 1, . . . , n − 1 and j = 1, . . . , L.
We represent the execution times of tasks using real numbers, which are constants in
the model and can therefore be used as coefficients of integer variables in the constraints
and the objective function. We denote the execution times of the task instructions Si
and S′k when run on a processor operating at the voltage level j using the symbols tij
and t′kj, respectively. The execution times of the tasks executed by task instructions Si
and S′k can thus be represented by the following linear expressions:

execution time(Si) = Σ_{j=1..L} αij tij

execution time(S′k) = Σ_{j=1..L} βkj t′kj. (5.11)
The total execution time of all task instructions in both groups is bounded by their
total execution time at the highest voltage level plus the available slack time. The total
overhead in execution time incurred by the transitions must be added to this time.
Using Formula (5.11) for the execution times of task instructions, we formulate the
following constraint:

Σ_{i=1..m} Σ_{j=1..L} αij tij + Σ_{k=1..n} Σ_{j=1..L} βkj t′kj + Σ_{i=0..m} ci tTR + Σ_{k=0..n} dk tTR ≤ tmin + r · tS, (5.12)

where tTR is the duration of a voltage transition, tmin is the total execution time of
the scaled task instructions at the highest voltage level, tS is the slack time computed
according to Formula (5.2), and r is the tunable parameter of the algorithm, as defined
in Section 5.3.
When the execution time of a set of task instructions is prolonged, some of these
task instructions may belong to certain 1-cycles in the loop dependence graph. These
task instructions must not be prolonged beyond the limit that would make one of these
1-cycles longer than the current longest 1-cycle, which determines the length of the
critical path. Therefore, for each 1-cycle c that contains one or more scaled task
instructions, a constraint must be added to the model to ensure this requirement:

Σ_{Si∈c} (ci tTR + Σ_{j=1..L} αij tij) + Σ_{S′k∈c} (dk tTR + Σ_{j=1..L} βkj t′kj) < τC − τnc, (5.13)

where τC is the length of the longest 1-cycle, τnc is the total execution time of all
non-scaled task instructions in 1-cycle c (which is a constant in the model), and the
outer sums are over all scaled task instructions from the first and second set,
respectively, that are encompassed by 1-cycle c.
5.5.2 Objective Function
The goal of the ILP model is to minimize the sum of energy consumed by the execution of
the scaled task instructions, including the energy overhead of the transitions, and taking
into account the influence of scaling on the energy consumed by the processors in the
idle mode.
We use the symbol Eij to denote the energy consumed by the task instruction Si
executed at the voltage level j. We similarly define E′kj for each task instruction S′k.
These quantities are constants in the model. We use the symbol ETR to denote the
energy consumed by a single transition. According to the definitions of variables αij, βkj,
ci, and dk, we can represent the energy consumption during the execution of scaled tasks
and transition overheads in each iteration as:
energy(tasks) = Σ_{i=1..m} Σ_{j=1..L} αij Eij + Σ_{k=1..n} Σ_{j=1..L} βkj E′kj

energy(transitions) = Σ_{i=0..m} ci ETR + Σ_{k=0..n} dk ETR.
We can account for the idle power by noting that each increase in task execution time
and the execution of each transition overhead reduces the total processor time spent in
the idle mode. Thus, if we define the constants ηij = Eij − Pidle tij, η′kj = E′kj − Pidle t′kj,
and ε = ETR − Pidle tTR, where Pidle is the processor power in the idle mode, we can
formulate the following objective function:

Σ_{i=1..m} Σ_{j=1..L} αij ηij + Σ_{k=1..n} Σ_{j=1..L} βkj η′kj + Σ_{i=0..m} ci ε + Σ_{k=0..n} dk ε. (5.14)

Solving the ILP model with this objective function and the constraints given by Equations
(5.3)–(5.13) yields the optimal distribution of the available slack among the scaled task
instructions under the stated assumptions.
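Because the model is small, its behavior can be checked by brute force on a toy instance. The sketch below enumerates every voltage assignment for m = 2 scaled task instructions (K2 empty, L = 2 levels), derives the transition indicators c0, . . . , cm directly from the assignment, discards assignments that violate Constraint (5.12), and minimizes Objective (5.14). All numbers are invented for illustration; in the actual implementation the model is handed to an ILP solver, as discussed in Section 5.5.3.

```python
from itertools import product

# Toy instance: m = 2 scaled task instructions in K1, K2 empty, L = 2
# voltage levels (level 2 = highest). All constants below are made up.
t = {1: [200, 100], 2: [300, 150]}   # t[i][j-1]: time of S_i at level j
E = {1: [8, 20], 2: [12, 30]}        # E[i][j-1]: energy of S_i at level j
L_LEVELS, T_TR, E_TR, P_IDLE = 2, 10, 60, 5
t_min = t[1][L_LEVELS - 1] + t[2][L_LEVELS - 1]  # total time at highest level
slack = 200                          # r * t_S, assumed precomputed

best = None
for levels in product(range(1, L_LEVELS + 1), repeat=2):
    # Transition indicators c_0, c_1, c_2, derived from the assignment
    # (non-scaled tasks run at level L, hence the boundary terms).
    c = [int(levels[0] != L_LEVELS),
         int(levels[0] != levels[1]),
         int(levels[1] != L_LEVELS)]
    total_time = sum(t[i + 1][levels[i] - 1] for i in range(2)) + sum(c) * T_TR
    if total_time > t_min + slack:   # Constraint (5.12) violated
        continue
    # Objective (5.14): eta_ij = E_ij - P_idle*t_ij, eps = E_TR - P_idle*t_TR
    obj = (sum(E[i + 1][levels[i] - 1] - P_IDLE * t[i + 1][levels[i] - 1]
               for i in range(2))
           + sum(c) * (E_TR - P_IDLE * T_TR))
    if best is None or obj < best[0]:
        best = (obj, levels)

print(best)  # (-1948, (2, 1)): run S1 at the highest level, scale S2 down
```

For these numbers, slowing both tasks down would exceed the time bound, so the optimum scales only the longer task S2; the negative objective values simply reflect the subtraction of the idle-power terms.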
5.5.3 Estimate of the Number of Variables
Since solving an ILP model is an NP-complete problem [11], the number of variables in
the model is critical for the performance of our algorithm.
If the number of task instructions from sets K1 and K2 is m and n, respectively,
the ILP model contains the following binary variables, using the previously introduced
notation:
• m · L variables αij
• n · L variables βkj
• m + 1 variables ci
• n + 1 variables dk.
The model does not contain any continuous or non-binary integer variables. Therefore,
the total number of variables in the model is:
Nv = m · L + n · L + m + 1 + n + 1 = (m + n) · (L + 1) + 2, (5.15)
and all of the variables are binary.
The total number of scaled task instructions m+n is smaller than the number of task
instructions in the target loop. Furthermore, the size of the target loop in practice is up
to several tens of tasks, and the number of voltage levels L supported by the processors
is small (typically smaller than ten). Therefore, the size of the ILP model is well within
the capabilities of standard ILP solvers. We implement the slack distribution algorithm
using the freely available lp_solve optimization library [5].
5.6 Task Scheduling
In this section, we present the task scheduling algorithm, which is designed so as to
complement the voltage selection algorithm presented in the previous sections.
5.6.1 Task Ordering
For the purposes of task scheduling, we divide tasks into three classes, based on the
definitions from Sections 5.2 and 5.4:
1. Critical tasks;
2. Scaled non-critical tasks;
3. All remaining non-critical tasks.
Tasks from each of the listed classes have priority over tasks from the following classes.
Within each class, the priority is determined according to the sequential execution order
of task instructions. Therefore, if tasks Tmi and Tnj belong to the same class, Tmi has
greater priority either if m < n or if m = n and i < j.
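The three-class ordering with the sequential-order tiebreak can be expressed as a single sort key. The tuple encoding of a task below is a hypothetical representation chosen for illustration: m is the position of the issuing task instruction in sequential order, i the iteration number, and cls the class (0 for critical, 1 for scaled non-critical, 2 for the remaining non-critical tasks).

```python
def priority_key(task):
    """Sort key implementing the ordering of Section 5.6.1:
    class first, then task-instruction position m, then iteration i."""
    m, i, cls = task
    return (cls, m, i)  # smaller tuple = dispatched first

tasks = [(3, 1, 2), (1, 2, 0), (2, 1, 1), (1, 1, 0)]
print(sorted(tasks, key=priority_key))
# [(1, 1, 0), (1, 2, 0), (2, 1, 1), (3, 1, 2)]
```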
5.6.2 Processor Mapping
The scheme for the mapping of tasks onto N processors P0, . . . , PN−1 is shown in Fig-
ure 5.7. The symbol Kxi is used in the figure to denote the set of scaled tasks executed in
iteration i by the task instructions from set Kx, where x is either 1 or 2, as defined in Sec-
tion 5.4. The mapping is performed according to the following rules. A single processor is
reserved for the execution of the critical tasks exclusively. All critical tasks are executed
as an uninterrupted sequence on this processor. Non-critical tasks are scheduled on the
remaining processors according to the round-robin scheme, with the following exception:
once the first task in a set of scaled tasks Kxi from the iteration i has been scheduled
onto a processor, all remaining tasks from the same set of scaled tasks are scheduled onto
the same processor, and no other task may be scheduled onto that processor until the
execution of the whole set Kxi is finished.
The stated rules for task ordering and processor mapping imply that each set of scaled
tasks is executed as an uninterrupted sequence, thus ensuring that the voltage selection
by the algorithm described in Section 5.5 is indeed optimal for the given amount of slack
per loop iteration.
n = 0;
while (target loop runs)
    Tmi = next task issued by the task dispatcher;
    if (Sm is a critical task instruction)
        schedule Tmi onto PN−1;
    else if (Tmi ∈ Kxi)
        if (Kxi has not started)
            n = next ready processor(n);
            if (n != -1)
                schedule Tmi onto Pn;
                mark Pn as executing Kxi;
        else
            k = processor marked as executing Kxi;
            schedule Tmi onto Pk;
            if (Tmi is the last task from Kxi)
                remove mark Kxi from Pk;
    else
        n = next ready processor(n);
        if (n != -1)
            schedule Tmi onto Pn;

next ready processor(n)
    for (i = 0; i < N-1; i++)
        k = (n+i) mod (N-1);
        if (Pk is not busy and not marked as executing a set of scaled tasks)
            return k;
    return -1;
Figure 5.7: Algorithm for the processor mapping of tasks
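The round-robin helper at the bottom of Figure 5.7 can be written out as executable Python. The processor-state encoding (a list of busy flags plus a per-processor mark naming the scaled set it is committed to, if any) is our own illustrative assumption.

```python
def next_ready_processor(n, busy, mark, N):
    """Round-robin scan over P_0 .. P_{N-2}, starting at P_n; P_{N-1} is
    reserved for critical tasks. Returns a ready processor index, or -1."""
    for i in range(N - 1):
        k = (n + i) % (N - 1)
        if not busy[k] and mark[k] is None:
            return k
    return -1

# Four processors: P0 is busy, P1 is committed to a scaled set, P2 is free.
busy = [True, False, False, False]
mark = [None, "K1i", None, None]
print(next_ready_processor(0, busy, mark, 4))  # 2
```

Note that even when all of P0 .. PN−2 are occupied, the helper returns -1 rather than falling back to PN−1, preserving the reservation of that processor for the critical tasks.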
5.6.3 Hardware Implementation
We believe that the proposed task scheduling algorithm can be implemented in the MLCA
with a small overhead in hardware complexity. The format of the task instructions
in the control program can be enhanced so that each instruction is marked according
to the class to which it belongs (critical, scaled non-critical, or ordinary non-critical).
These marks can be generated by the compiler. Task ordering can be implemented
by replacing the FIFO queue in the task dispatcher by a priority queue that orders
the task instructions according to their class and sequential execution order. Processor
mapping can be implemented by assigning to each processor a register that stores the
information whether it is currently marked as executing a particular sequence of scaled
tasks. We believe that adding these mechanisms would not substantially increase the
existing complexity of the MLCA control unit hardware.
5.7 Target Loop Selection
Target loops in MLCA control programs are a subset of the natural loops, which can be
detected using well-known compiler algorithms [32]. Using the profiling information, long-
running loops can be selected as those whose execution time accounts for a percentage
of the total application execution time above a predefined threshold. In practice, MLCA
applications tend to have a single long-running loop that handles the processing of the
input media stream, which will be selected as the target loop regardless of the value of
this threshold.
Other conditions outlined in Chapter 3 can be determined from the profiling and
control-flow information. If control-flow instructions are present in the loop, the profiling
information can be used to determine whether the branches are severely biased in a single
direction, and the loop can be selected as a target loop if all branches are biased above
a certain threshold.
Our technique can also be applied to perfect loop nests, even if the execution time of
the innermost loop is not long enough for it to be considered a target loop by itself. The
derivation of our algorithms is applicable to such loops, since each time the innermost loop
is executed, the values written into the URF registers in its last iteration are read by the
tasks executed in its first iteration when it is executed again in the subsequent iteration
of the outer loop. This conclusion can be generalized to imperfect loop nests in which
the total execution time of the tasks outside the innermost loop accounts for a negligible
percentage of the total loop execution time, and there are no data dependences between
these tasks and the tasks within the innermost loop. This generalization is justified
because with these assumptions, the overall effect of the tasks outside the innermost
loop is negligibly small.
Another class of loops to which our approach could be applied are loops in which
there exist branches that are not severely biased, but each non-biased branch has the
property that both of its directions result in the same sets of URF registers being read
and written. In this case, the dependence distances are still guaranteed to be either zero
or one, since all values written into the registers in one iteration are overwritten in the
subsequent one. For such loops, the loop dependence graph could be approximated by
introducing nodes that represent the average of the outcomes over the branch directions,
computed using some general rule. For example, assume that one direction of branch B
executes task instructions Si and Sj whose execution times are ti and tj, respectively,
and the other direction executes a single task instruction Sk with execution time tk.
Furthermore, assume that the sets of registers read and written by Si and Sj are identical
to the sets of registers read and written by Sk. This control-flow construct could be
approximated by a single node in the loop dependence graph labeled with execution time
p(ti + tj) + (1 − p)tk, where p is the probability of B taking the first direction measured
by profiling. Unfortunately, we were not able to test this approach in practice, since our
benchmark applications lack such control-flow structures in their loops.
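The averaging rule itself is simple arithmetic; the following sketch computes the label of the merged node (the helper name and the numbers are our own, invented for illustration).

```python
def merged_node_time(p, first_dir_times, second_dir_times):
    """Execution-time label of the merged node:
    p * (t_i + t_j + ...) + (1 - p) * (t_k + ...)."""
    return p * sum(first_dir_times) + (1 - p) * sum(second_dir_times)

# Invented numbers: the first direction runs S_i (100) and S_j (200) with
# profiled probability p = 0.75; the second direction runs S_k (150).
print(merged_node_time(0.75, [100, 200], [150]))  # 262.5
```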
Chapter 6
Evaluation
In this chapter, we present the experimental evaluation of our solution. In Section 6.1, we
describe the benchmark applications used in the evaluation. In Section 6.2, we describe
our experimental platform and the assumed power-related properties of the processors
in the MLCA system. In Section 6.3, we present the energy savings achieved by the
application of our technique and the impact of our technique on the application execution
time. Section 6.4 investigates the contribution of different processor modes to the overall
power consumption. In Section 6.5, we evaluate the computational intensity of our
technique. Finally, in Section 6.6 we evaluate the quality of the achieved results by
comparing them with the results of a study aimed at finding a practical upper bound on
the energy savings, as well as with the results of an alternative technique based on the
partitioning of the task graph.
6.1 Benchmark Applications
We evaluate our solution using three realistic multimedia applications: JPEG image
encoder, GSM voice encoder, and MPEG sound decoder.
6.1.1 JPEG Image Encoder
The JPEG (Joint Photographic Experts Group) standard specifies a method for lossy com-
pression of digital images. It is specifically designed for encoding realistic images charac-
terized by gradual color transitions, such as digital photographs or scans of paintings.
For these types of digital images, the JPEG algorithm achieves high compression ratios
(typically 5-10) without perceptible degradation of image quality.
We have ported the encoder part of the open source implementation of the JPEG
standard developed by the Independent JPEG Group [12] to the MLCA. This implemen-
tation of JPEG is included in the Mediabench multimedia benchmark suite [27]. The
JPEG encoder converts the input 24-bit bitmap image, which consists of a series of raw
color samples, to a JPEG output image.
Besides the initialization and clean-up code, the control program for the JPEG en-
coder contains a single long-running perfect loop nest. The body of the innermost loop
does not contain any control-flow instructions. The execution times of task functions
in JPEG vary significantly, and even their average execution times are dependent on
the characteristics of the input image. The parallel speedup1 of the JPEG encoder on
the MLCA is shown in Figure 6.1. Since the speedup varies somewhat with inputs, the
figure shows the average over a set of twelve different images, weighted by the execution
times of the application runs with individual inputs. The application scales up to six
processors.
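The weighted averaging used for the curves in Figure 6.1 can be sketched as follows; the helper function and the per-image speedups and weights are purely illustrative, not measured values:

```python
# Weighted averaging over a set of inputs: each input's speedup is weighted by
# the execution time (total work) of the application run with that input.
# All numeric values below are made up for illustration.

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

speedups  = [5.2, 4.8, 5.0]   # illustrative per-image parallel speedups
run_times = [2.0, 1.0, 1.0]   # illustrative per-image execution times (weights)
avg_speedup = weighted_average(speedups, run_times)
```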
6.1.2 GSM Voice Encoder
GSM 06.10 is a standard for digital sound compression used in digital mobile telephone
networks. The GSM compression algorithm is optimized for encoding telephone-quality
1 Since the MLCA applications are inherently parallel and there is no equivalent of a sequential program version for them, we define the parallel speedup as the ratio of the application execution time on a 1-processor MLCA to that on an N-processor MLCA.
[Figure: speedup vs. number of processors (0-12) for JPEG, GSM, and MPEG, together with the ideal speedup line]

Figure 6.1: Speedup of the MLCA benchmark applications
human voice. It takes an input consisting of raw PCM sound samples and produces
output at a constant bit rate of 13.2 kilobits per second. We use the open-source GSM
06.10 codec implementation developed at the Technical University of Berlin [8], which is
also included in the Mediabench suite [27], and have ported its encoder part to the
MLCA.
The control program for the GSM encoder satisfies the assumptions from Chapter 3,
except that it contains task functions that write their output parameters in the midst
of their execution. We solve this problem by applying the task splitting transformation.
The execution times of task functions in GSM show very little variation. The parallel
speedup of the GSM encoder on the MLCA is shown in Figure 6.1. The application
scales up to four processors.
6.1.3 MPEG Sound Decoder
As the third benchmark application, we have used the MLCA version of the MAD MPEG
sound decoder [22]. The MLCA version of MAD is capable of decoding MP3 sound files
of only one bit rate (40 kbps, 22 kHz). All of the inputs on which our algorithm was
tested have this bit rate; we therefore ignore the effects of variable bit rates on task
execution times, which can be considerable. However, if our DVS algorithm were applied
to a general MP3 decoder, it would be possible to generate a set of different solutions
for playing files of different bit rates.
Similar to the GSM encoder, the control program for the MPEG decoder satisfies the
assumptions from Chapter 3 except for a number of task functions that write their output
parameters in the midst of their execution. We apply the task splitting transformation
to solve this problem. The execution times of task functions in the MPEG decoder
vary significantly, and there is also some variance in their average execution times across
different inputs. Parallel speedup of the MPEG decoder on the MLCA is shown in
Figure 6.1. For MPEG, the figure shows the average over a set of seven inputs with
different kinds of sound content. The application scales up to eleven processors.
6.2 Experimental Platform and Processor
Properties
We profile the execution of the MLCA benchmark applications using a simulator of the
MLCA [22]. In the JPEG encoder and MPEG decoder, the execution times of task
functions vary significantly between loop iterations and depend on the input content.
For each of these two applications, we perform the profiling run over a set of several
different inputs, which we refer to as the training set. For the profile parameters of our
algorithms, we use the average of the profiling results measured over individual inputs
from the training set. Using these parameters, we apply our technique to the original
training set, as well as another set of different inputs, expecting to achieve similar positive
results for both sets.
Our experimental platform is shown in Figure 6.2. The MLCA application is run using
[Figure: (a) Collection of application information; (b) Evaluation of our technique]

Figure 6.2: Experimental Platform
the MLCA simulator, which produces a trace file containing the details of the application
execution. The trace file is read by a utility that constructs the run-time task graph of
the application execution and collects the profiling information necessary for the voltage
selection algorithm. The task splitting transformation is applied to the run-time task
graph to ensure that each task writes its outputs at the end of its execution, as described
in Section 3.3. The profiling information is then fed to the implementation of the voltage
selection algorithm, along with the program code of the target loop, from which the loop
dependence graph is constructed using the algorithm described in Section 5.1. Once
the solution of the voltage selection problem is computed, it is fed to the program that
implements our task scheduling algorithm described in Section 5.6, as well as the default
MLCA FIFO/round-robin task scheduling algorithm. The application task graph is first
scheduled using the default MLCA task scheduling algorithm, without the application
of DVS. The task graph is then scheduled using our task scheduling algorithm with the
[Figure: power [mW] vs. frequency [MHz] for (a) the XScale core (300-800 MHz) and (b) the IEM926 core (100-325 MHz)]

Figure 6.3: Power vs. frequency of XScale [7] and IEM926 [10] processor cores
application of DVS. The execution time of each scaled task is prolonged according to its
assigned voltage level, and the transition overheads are also taken into account. Finally,
the differences in execution time and energy consumption between the two generated
task schedules are computed as the results of our evaluation. The procedure shown in
Figure 6.2(b) is repeated with different values of the parameter r, until a suitable value
is found.
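The comparison performed in this flow can be illustrated with a deliberately simplified two-task model; the two voltage levels, the level-selection rule, and all numbers below are toy assumptions for illustration, not the ILP-based voltage selection of Chapter 5:

```python
# A toy sketch of the comparison in Figure 6.2(b): two tasks run in parallel,
# the shorter one has slack, and a fraction r of that slack is used to run it
# at a lower voltage level. Levels and numbers are illustrative assumptions.

LEVELS = [(0.417, 0.167), (1.000, 1.000)]  # (relative frequency, relative power)

def evaluate(t_critical, t_other, r):
    """Schedule length and energy when fraction r of the slack is used."""
    slack = t_critical - t_other
    target = t_other + r * slack              # allowed duration of the short task
    feasible = [(f, p) for f, p in LEVELS if t_other / f <= target]
    freq, power = min(feasible)               # slowest level that still fits
    exec_time = max(t_critical, t_other / freq)
    energy = 1.0 * t_critical + power * (t_other / freq)
    return exec_time, energy

# r = 0 keeps full speed; r = 1 uses all the slack and saves energy at no cost
baseline = evaluate(100.0, 40.0, 0.0)
scaled = evaluate(100.0, 40.0, 1.0)
```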
We assume that the processors in the MLCA system support eight discrete voltage
levels, with the relation between supply voltage, operating frequency, and power con-
sumption characteristic of the Intel XScale processor core, reported by Clark et al. [7].
The relative frequency and power consumption for each level are shown in Table 6.1.
Flautner et al. [10] measure a similar ratio between frequency slowdown and reduction
in power consumption for the processor core in the IEM926 SOC, although it operates
in a different range of frequencies. Thus, we believe that our model of processor power
is realistic. A comparison between the power vs. frequency characteristics of XScale and
IEM926 can be seen in Figure 6.3.
As explained in Section 3.1.2, we simplify our approach by assuming constant overheads
of transitions between voltage levels.

Level      1      2      3      4      5      6      7      8
Power      0.167  0.272  0.371  0.470  0.557  0.654  0.768  1.000
Slowdown   0.417  0.500  0.583  0.667  0.750  0.833  0.917  1.000

Table 6.1: Properties of the processor voltage levels

Application   JPEG    GSM     MPEG
t_median      19000   21306   27817
t_average     29476   21365   57305

Table 6.2: Median and average execution times of tasks in cycles

We evaluate our technique varying the duration of the overheads from one thousand to
three thousand cycles. Comparison of these
overheads with the median and average execution times of tasks in the benchmark ap-
plications determined by the profiling, which are shown in Table 6.2, shows that these
overheads are not negligible compared to the typical task execution times. Transition
overheads in modern DVS-enabled processors are as low as 20µs [7, 15], which trans-
lates into several thousand or tens of thousands of cycles, since these processors operate
at frequencies on the order of hundreds of megahertz. However, we choose somewhat
lower values of the transition overheads because our benchmark applications are less
computationally demanding than the applications that can be expected to run on future
MLCA systems with high-end processors. In the course of our project, we have so far
been limited to applications whose source code is publicly available, which is not
the case with proprietary high-end applications that the MLCA will be targeting in the
future. We expect that our results will generalize to the MLCA systems executing more
computationally demanding applications in the presence of larger transition overheads,
since the task granularity for these applications will also be larger and the ratios between
the overheads and task execution times will thus be similar or even lower.
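Under the assumption that the "Slowdown" row of Table 6.1 gives the relative operating frequency at each level, the relative energy a task consumes at each level can be computed directly: time scales by one over the relative frequency, and power by the listed factor.

```python
# Relative energy per task at each of the eight voltage levels of Table 6.1,
# assuming (our interpretation) that the Slowdown row is the relative operating
# frequency: relative energy = relative power / relative frequency.

POWER    = [0.167, 0.272, 0.371, 0.470, 0.557, 0.654, 0.768, 1.000]
SLOWDOWN = [0.417, 0.500, 0.583, 0.667, 0.750, 0.833, 0.917, 1.000]

energy = [p / s for p, s in zip(POWER, SLOWDOWN)]
# the lowest level consumes roughly 40% of the full-speed energy for the same task
```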
Similarly, we assume constant transition overheads in energy. We approximate the
power consumption during each voltage transition as 80% of the power at the highest
voltage level. This approximation is derived from the assumptions that most transitions
take place between the highest voltage levels and that during each transition, the proces-
sor power is somewhere between the power at the start and the end level of the transition.
Although this is a very rough approximation, it does not affect the accuracy of our results
significantly, since after the application of our technique, the contribution of transition
overheads to the overall energy consumption turns out to be very small (under 2.1%),
according to the results that we present in Section 6.4.
When a processor is not executing a task, we assume that it enters the idle mode with
negligible overheads, as explained in Section 3.1.2. Both before and after the application
of our technique, the total energy consumption is computed under this assumption. We
assume that the power consumption in the idle mode is 20% of the power consumption
at the highest voltage level in the active mode, roughly approximating the power in the
idle mode measured for processors based on the XScale core [19, 20]. Again, the results
that we present in Section 6.4 indicate that the contribution of the idle power to the
overall energy consumption is small (under 7%), so that the accuracy of our results is not
affected significantly by imprecisions in accounting for the idle power. Furthermore, the
idle power assumed in our experiments is at the lower end of the range of values measured
for realistic embedded processors [19, 20], which is a conservative assumption, since our
technique reduces not only the active energy, but also the energy consumed in the idle
mode by exploiting the available slack and reducing the total idle processor time. Thus,
for a processor with higher power consumption in the idle mode, the achieved energy
savings would be higher.
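The energy accounting described in this section can be sketched as follows, with the stated assumptions of idle power at 20% of peak and transition power at 80% of peak; the function and its argument format are illustrative, not the simulator's actual interface:

```python
# A sketch of the evaluation's energy model: active energy at each task's
# assigned level, idle energy at 20% of peak power, and constant transition
# overheads drawn at 80% of peak power (both assumptions from Section 6.2).

P_MAX = 1.0
P_IDLE = 0.20 * P_MAX    # idle mode: 20% of the power at the highest level
P_TRANS = 0.80 * P_MAX   # approximate power drawn during a voltage transition

def total_energy(active, idle_time, n_transitions, t_trans):
    """active: list of (relative power, duration) pairs for executed tasks."""
    e_active = sum(p * t for p, t in active)
    e_idle = P_IDLE * idle_time
    e_trans = n_transitions * P_TRANS * t_trans
    return e_active + e_idle + e_trans
```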
6.3 Energy Savings and Execution Slowdown
In this section, we present the main results of our measurements—the processor power
savings achieved and the execution slowdown incurred by the application of our technique
for the MLCA benchmark applications.
6.3.1 JPEG Encoder
We evaluate the performance of our technique on the JPEG encoder using two sets of
twelve photographic images. The first set is used as the training set, and our technique
is then applied to both sets with parameters computed from the profiling data for the
training set. Since the task execution times for JPEG depend on the chromatic properties
of the input image, we used sets of photographs taken on two different occasions. This
ensures that the results achieved for the images from the second set do not depend on
their similarity in chromatic properties with the images from the training set.
The reduction in the processor energy consumption for both input sets is shown in
Figure 6.4, and the increase in application execution time resulting from the application
of our technique is shown in Figure 6.5. In both figures, the energy savings and the
execution slowdown are shown as functions of r, the fraction of computed slack used
during voltage selection, as defined in Section 5.3. The value of r was varied from zero
(no slack used and thus no voltage scaling) to one (all slack computed according to
Formula (5.2) is used). For each value of r, we show the processor energy savings
achieved with three different values of the transition overhead. All results are computed as
weighted averages over each set of images, where the weights assigned to individual images
are proportional to the overall computational work necessary for encoding each image.
Weighted averages are used because the computational work varies significantly between
images. The negative slowdown at certain points means that despite the increase in the
execution time of the scaled tasks, the overall application execution time is decreased by
that percentage by the application of our task scheduling algorithm.
The evaluation results show that the results for the training set are successfully re-
produced on the second set. Increase in the transition overhead leads to a decrease in
power savings, but does not lead to excessive execution slowdown. For r = 0.5, with
transition overhead of 1000 cycles, our technique achieves processor power savings of
9.5% for the second set, slowing down the application execution by 0.5% (slowdown for
the training set is somewhat higher). For r = 0, no voltage scaling is performed and we
observe only the effect of the task scheduling algorithm relative to the default MLCA
FIFO/round-robin scheduling. The application of our task scheduling algorithm results
in a small reduction in the application execution time (1.5% for the training set and 2.9%
for the second input set).
Figure 6.6 shows the results for each individual image from the second input set with
r = 0.5 and transition overhead of 1000 cycles. For certain images, the slowdown is
relatively high, but the average slowdown is small. Significant processor power savings
are achieved consistently for all images.
6.3.2 GSM Encoder
Since the GSM encoder application is characterized by very small variations in the ex-
ecution times of task functions, we use the profiling information for a single input and
demonstrate that the results achieved by our technique using the same profiling informa-
tion are almost identical for different inputs. This is shown in Figures 6.7 and 6.8. The
figures show results only for values of r below 0.35, because beyond this point the
execution slowdown becomes excessive.
The results for this application are generally more sensitive to the value of the pa-
rameter r than for the JPEG encoder. This especially holds for the execution slowdown,
which is also more sensitive to the transition overheads. However, the results for the
training input are closely replicated on the second input. The optimal value of r thus
depends on the transition overhead. With transition overhead of 1000 cycles, the best
result is achieved for the parameter value r = 0.28, which results in processor power sav-
ings of over 5.5% and a small negative slowdown (i.e. a small speedup) of the application
execution. With transition overheads of 2000 and 3000 cycles, a good value of r is 0.20,
which results in processor power savings of 4.1% and 3.3%, respectively, with a small
negative slowdown.
6.3.3 MPEG Sound Decoder
We profile the application using a training set of seven MP3 files encoding different kinds
of sound content. Using this profiling input, we apply our DVS technique to the training
set and to the second set of seven input files. The results achieved by our technique,
averaged over each input set, are shown in Figures 6.9 and 6.10.
The processor power savings are almost identical for both sets, while the execution
slowdown is slightly greater for the second set. Due to the somewhat larger average
task granularity, the results are less sensitive to the increase in transition overheads in
comparison with JPEG and GSM. As in the case of JPEG, the results achieved for the
training input are reproducible with a different input set. With a transition overhead of
1000 cycles, a suitable value of the parameter r is 0.6, which yields processor power
savings of 8.4% with an execution slowdown of approximately 1.5% for both sets of
inputs. For larger transition overheads, a better choice is r = 0.5, due to the somewhat
larger execution slowdown.
Figure 6.11 shows the breakdown of the results achieved with r = 0.6 and transition
overhead of 1000 cycles across the inputs from the second set. The power savings are
consistent, with some variation in the execution slowdown.
[Figure: energy savings [%] vs. r (fraction of slack used), JPEG on 6 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.4: Processor energy savings for the JPEG encoder with different transition
overheads
[Figure: execution slowdown [%] vs. r (fraction of slack used), JPEG on 6 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.5: Execution slowdown for the JPEG encoder with different transition overheads
[Figure: per-image results for JPEG on 6 processors, images 1-12. (a) Energy savings [%]; (b) Execution slowdown [%]]

Figure 6.6: Breakdown of results for the second set of JPEG encoder inputs, r = 0.5,
transition overhead = 1000 cycles
[Figure: energy savings [%] vs. r (fraction of slack used), GSM on 4 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.7: Processor energy savings for the GSM encoder with different transition over-
heads
[Figure: execution slowdown [%] vs. r (fraction of slack used), GSM on 4 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.8: Execution slowdown for the GSM encoder with different transition overheads
[Figure: energy savings [%] vs. r (fraction of slack used), MPEG on 11 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.9: Processor energy savings for the MPEG decoder with different transition
overheads
[Figure: execution slowdown [%] vs. r (fraction of slack used), MPEG on 11 processors, with transition overheads of 1000, 2000, and 3000 cycles. (a) Training set; (b) Second set]

Figure 6.10: Execution slowdown for the MPEG decoder with different transition over-
heads
[Figure: per-input results for MPEG on 11 processors, inputs 1-7. (a) Energy savings [%]; (b) Execution slowdown [%]]

Figure 6.11: Breakdown of results for the second set of MPEG decoder inputs, r = 0.6,
transition overhead = 1000 cycles
6.4 Breakdown of the Energy Consumption
In this section, we investigate the contribution to the overall processor energy consump-
tion of three types of power: active execution of tasks, processor idle periods, and energy
consumed by voltage transitions. Since in our model we account for the latter two types
of power only approximately, the accuracy of our evaluation could be degraded if they
account for a large part of the total energy consumption.
Table 6.3 shows the breakdown of the total energy consumption in the MLCA bench-
mark applications after the application of our technique. The table lists the percentage
of total energy consumed during active execution of tasks, idle periods, and voltage tran-
sitions. For each application, we show the contributions of each power component in
the total power consumption of the second set of inputs for the best found value of r
and transition overhead of 3000 cycles. For lower transition overheads, the figures are
similar, except that the contribution of the transitions is even smaller. Since the active
power accounts for an overwhelming percentage of the total power, we conclude that the
imprecisions in our modeling of the power in the idle mode and the transition overheads
in energy do not influence the accuracy of our results significantly. The reason for the
dominance of the active power lies in the high level of processor utilization of the ap-
plications, as well as the fact that our ILP model for slack distribution minimizes the
number of transition overheads.

Application   JPEG    GSM     MPEG
Active        95.8%   92.3%   96.5%
Idle           2.1%    7.0%    2.9%
Transitions    2.1%    0.7%    0.6%

Table 6.3: Breakdown of the total energy consumption
6.5 Algorithm Performance
In this section, we discuss the computational intensity of our voltage selection algorithm,
which is the part of our technique that is executed at compile-time. Solving the ILP model
for voltage selection is an NP-complete problem, while the rest of the algorithm consists
of computations that can be performed in polynomial time. Therefore, the number of
variables in the ILP model is crucial for the performance of the algorithm, since the time
necessary to find the solution can grow exponentially with the number of variables.
According to the results derived in Section 5.5.3, the number of variables in the ILP
problem for slack distribution is Nv = Ns · (L + 1) + 2, where Ns is the number of
scaled task instructions in the target loop, and L is the number of discrete voltage levels
supported by the processors. The theoretical upper bound on the number of variables is
thus Nmax = Nt · (L + 1) + 2, where Nt is the total number of task instructions in the
target loop. For each MLCA benchmark application, Table 6.4 shows the total number
of task instructions in the target loop Nt, the number of scaled task instructions Ns,
the theoretical upper bound Nmax on the number of variables in the ILP model, the
actual number of variables Nv encountered by the voltage selection algorithm, and the
average time tavrg in seconds for solving the ILP model across all measurements.

Application    JPEG    GSM     MPEG
Nt                8      26      27
Ns                6       6      23
Nmax             74     236     245
Nv               56      56     209
tavrg (sec.)   0.11    0.07   46.80

Table 6.4: Number of variables and solving time of the ILP problem for voltage selection

The measurements were performed on a PC workstation with two AMD Athlon
MP 2600+ processors and 512 megabytes of RAM (all computations are sequential, and
the second processor did not have any useful role). The only application for which the
average solving time is not negligible is the MPEG decoder. However, the solving times
of individual runs of the algorithm vary widely for this application, ranging from several
tens of milliseconds to several minutes, despite the fact that the number of variables is
the same. We believe that this variation results from the idiosyncrasies of the LP SOLVE
library, and the solving time could be significantly reduced by using a more powerful ILP
solver.
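The variable counts reported in Table 6.4 can be checked directly against the formula Nv = Ns · (L + 1) + 2 from Section 5.5.3, with L = 8 voltage levels (substituting Nt for Ns gives the upper bound Nmax):

```python
# Checking the ILP variable-count formula Nv = Ns*(L + 1) + 2 against the
# figures of Table 6.4, with L = 8 discrete voltage levels.

L = 8

def num_variables(n_scaled):
    return n_scaled * (L + 1) + 2

# (Nt, Ns) per application, as reported in Table 6.4
apps = {"JPEG": (8, 6), "GSM": (26, 6), "MPEG": (27, 23)}
counts = {name: (num_variables(nt), num_variables(ns))
          for name, (nt, ns) in apps.items()}   # {name: (Nmax, Nv)}
```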
6.6 Evaluation of the Results
In this section, we evaluate the quality of the results achieved by our technique. We
compare our results with the results of a study aimed at finding a practical upper bound
on the achievable energy savings in the MLCA benchmark applications. Furthermore,
we compare our results with the results achieved using an alternative technique based on
partitioning of the run-time task graph.
6.6.1 Comparison with Practical Upper Bound
In order to evaluate the quality of the results achieved by our technique, we conduct
experiments whose purpose is to determine a practical upper bound on the achievable
power savings for the MLCA benchmark applications. We apply several different task
scheduling algorithms to the run-time task graphs of the applications and attempt to find
the optimal voltage selection for the resulting task schedules. For the voltage selection,
we use an integer linear programming model similar to those proposed by Zhang et al. [46]
and Andrei et al. [3]. The attempted task scheduling algorithms are various combinations
of the default MLCA FIFO/round-robin task scheduling and the task ordering and pro-
cessor mapping schemes proposed by Zhang et al. [46]. The results achieved using this
method can be expected to be superior to those of any practically applicable heuristic
approach, since this method makes use of perfect knowledge of the run-time application
task graph.
This approach cannot be applied directly to the task graphs of the application runs,
since the number of tasks executed for even a moderately long run of an application is
several thousand, yielding an excessively large number of variables in the ILP problem
formulation. On the other hand, the task graphs of short application runs are skewed due
to the influence of the initialization and clean-up code. We solve this problem by slicing
the task schedule into short time intervals, and computing the optimal voltage selection
for three randomly selected intervals, taking the average value as the final result. This
method involves a trade-off between accuracy, which is degraded for short time intervals,
and the time necessary for finding the solution of the ILP problem, because the number
of variables increases with the length of the intervals. The details of the ILP model for
voltage selection and the task scheduling algorithms used are described in Appendix A.
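The interval-slicing procedure can be sketched as follows; `solve_interval` is a placeholder for the Appendix A ILP model (here it would map an interval's start and length to the energy savings computed for that interval), and the parameter names are illustrative:

```python
# A sketch of the interval-slicing workaround: cut the task schedule into
# fixed-length intervals, solve the voltage-selection ILP for k randomly
# chosen intervals, and average the results. solve_interval is a stand-in
# for the Appendix A model, not an actual interface.

import random

def sliced_estimate(schedule_length, interval_length, solve_interval, k=3, seed=0):
    rng = random.Random(seed)
    n_intervals = schedule_length // interval_length
    picks = rng.sample(range(n_intervals), k)   # k randomly selected intervals
    savings = [solve_interval(i * interval_length, interval_length) for i in picks]
    return sum(savings) / len(savings)
```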
For solving the ILP models, we use the ILOG CPLEX optimization system [17].
CPLEX is a commercial product featuring an ILP solver significantly more powerful than
the freely available LP SOLVE, which was used in the implementation of our technique.
However, in practice, the number of variables in the ILP models turns out to be too large
to be solved in reasonable times, even using CPLEX. In order to reduce the number of
variables to a manageable level, we reduce the number of discrete voltage levels from
eight to five (the expression for the number of integer variables in the model is derived
in Section A.2.4 of Appendix A). The frequency and power of the five levels are selected
according to the characteristics of the Intel XScale processor core, so that comparison
with the results achieved by the application of our technique is meaningful. However,
even with this simplification, for the MPEG decoder, the maximum interval length for
which the ILP problem can be solved in practice is smaller than the shortest interval
that encompasses all tasks from a single iteration of the target loop, and meaningful
results are not possible with shorter intervals. Thus, we perform the
experiments only for JPEG and GSM.
The results are shown in Table 6.5. For each application, we list the best result
achieved across the range of attempted scheduling algorithms. These results are not
guaranteed to be strict upper bounds on the attainable energy savings, because of the
inaccuracies introduced by interval slicing. In addition, somewhat better results might
be obtained using a model with a greater number of voltage levels, which would however
result in an excessively large number of variables. Furthermore, in some cases the ILP
solver was interrupted before reaching the optimal solution, and the set of the used
task scheduling algorithms is not comprehensive. Nevertheless, since these results are
derived using an approach that makes use of perfect knowledge of the run-time application
behavior, it is reasonable to assume that these results are comparable to a practical
upper bound for results achievable by realistic techniques such as ours. This suggests
that our technique succeeds in capturing a significant part of the potential for power
optimizations in JPEG and GSM. The results are shown for two different values of the
transition overhead (1000 and 3000 cycles).

Application          JPEG, 6 proc.      GSM, 4 proc.
Overhead (cycles)    1000     3000      1000     3000
Upper bound
  Savings            12.1%    10.6%     9.2%     7.5%
  Slowdown           -3.2%    -3.2%     0.6%     0.6%
Our technique
  Savings             9.5%     5.3%     5.6%     3.3%
  Slowdown            0.5%     0.6%    -0.3%    -0.5%

Table 6.5: Comparison of the evaluation results with the computed upper bounds

The greatest difference between our results and the computed upper bound is in the
execution slowdown for JPEG. However, in this case, the upper bound is computed using
a task scheduling algorithm that takes advantage of exact knowledge of the critical path
in the run-time task graph of the application, which is unavailable to any practically
applicable algorithm.
6.6.2 Comparison with Task Graph Partitioning
The theoretical derivation of our voltage selection algorithm, presented in Chapter 5,
assumes that the control-flow instructions in the body of the target loop can be safely
ignored, as explained in Section 3.2. The derivation also includes the assumption that the
execution times of individual task instructions do not vary between loop iterations. Under
the same assumptions, it is possible to attempt the application of previously proposed
methods for voltage selection aimed at applications that can be represented by small
task graphs or periodic series of such task graphs, since the execution of the idealized
form of the target loop is periodic. In particular, it is possible to analyze a time interval
during the execution of the target loop that is long enough to contain all tasks from a
single iteration and compute the voltage selection for this time interval. The voltage
levels assigned to the tasks from this iteration can be assigned to the corresponding task
instructions, thus solving the voltage selection problem as formulated in Section 4.2. We
implement this approach in order to demonstrate that our technique generates superior
solutions while being significantly less computationally demanding.
We construct the idealized run-time task graph of the application, in which the exe-
cution time of each task is equal to the average execution time of the corresponding task
instruction determined by profiling. This task graph is then scheduled using the default
MLCA FIFO/round-robin task scheduling algorithm. In the generated task schedule, we
select an interval whose boundaries are determined by the beginning of the execution of
the first executed task from iteration i and the end of the execution of the last executed
task from iteration i. Since the task schedule is periodic, the choice of iteration i is arbi-
trary, as long as it is not one of the few first or last iterations. For the selected interval,
we compute the optimal voltage selection using the ILP model described in Appendix A.
The computed voltage levels of tasks from iteration i are then assigned to their corre-
sponding task instruction. We refer to this approach as the interval selection. With these
voltage levels assigned to the task instructions, we schedule the real task graph of the
application and measure the resulting energy savings and execution slowdown.
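The interval-boundary step above can be sketched in a few lines (a simplified sketch; the (task, iteration, start, end) schedule representation is hypothetical, not the simulator's actual format):

```python
def select_interval(schedule, i):
    """Given a periodic task schedule as (task, iteration, start, end)
    tuples, return the interval spanning iteration i: from the start of
    its first executed task to the end of its last, together with all
    tasks (from any iteration) that overlap the interval and therefore
    enter the ILP model."""
    spans = [(s, e) for (_, it, s, e) in schedule if it == i]
    lo, hi = min(s for s, _ in spans), max(e for _, e in spans)
    overlap = [t for (t, it, s, e) in schedule if s < hi and e > lo]
    return lo, hi, overlap
```

Because the idealized schedule is periodic, any mid-execution choice of i yields an interval of the same shape.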
The results achieved by the application of the interval selection for the JPEG encoder
are shown in Figure 6.12 and compared with the results achieved by our technique with
r = 0.5. We assume a transition overhead of 1000 cycles. Other assumptions
about the system are identical to those described in Section 6.2. The evaluation is per-
formed using the same sets of images as in Section 6.3. The weighted average of the
power savings with interval selection is 7.9%, as opposed to 9.5% achieved with our
approach. However, the weighted average of the execution slowdown with interval selection
is 3.8%, as opposed to 0.5% with our approach. Therefore, although the power savings
are comparable, the execution slowdown is excessive. Experiments with the GSM encoder
yield even worse results, with power savings of under 4% and an execution slowdown of
over 3%. For the MPEG decoder, the number of variables in an interval large enough
to contain a whole loop iteration is too large for solving the ILP problem to be
feasible.
[Two bar charts, omitted in this extraction: for images 1–12, (a) the energy savings [%]
and (b) the execution slowdown [%] achieved by our technique and by interval selection
on JPEG with 6 processors.]

Figure 6.12: Comparison of the results for JPEG
The reason for the excessive slowdown is the lack of an appropriate task scheduling
algorithm in the interval selection approach. Once the execution time of certain tasks
has been prolonged, it is necessary to introduce a task scheduling algorithm that gives
priority to the tasks from the critical path, in order to avoid performance degradation.
However, in order to perform such scheduling in practice, it is necessary to introduce a
heuristic for identifying the tasks from the critical path similar to the one we are using in
our technique, and once such a heuristic is introduced, it naturally leads to our approach.
Another important advantage of our technique over the interval selection approach
is its significantly lower computational intensity. The number of integer variables in the
ILP model described in Appendix A is equal to Nt · (L + 1) − NP , where Nt is the
number of tasks in the selected interval, L is the number of voltage levels supported by
the processor, and NP is the number of processors. Since the interval is selected so as to
encompass all tasks executed within a single iteration, the number of tasks in the interval
is equal to the number of task instructions in the target loop N plus the number of tasks
from other iterations that fall within the same interval.
We can derive an approximate formula for the number of variables in the ILP for-
mulation by assuming that there is no parallelism within each loop iteration, i.e. that
the only source of parallelism in the target loop is the overlapping of the execution of
individual iterations. From this assumption, it follows that in parallel with each loop
iteration, s− 1 additional iterations must be executed on average, where s is the parallel
speedup of the application. Indeed, if the total execution time of a single iteration is
not decreased by parallelization, then s iterations must be run in parallel on average to
achieve the parallel speedup of s. Therefore, the number of tasks in the interval is
approximately s · N, and the number of integer variables in the ILP formulation is
approximately:
s · N · (L + 1) − NP . (6.1)
If some parallelism exists within each loop iteration, this number is somewhat
lower. However, in most MLCA applications in practice, the bulk of parallelism stems
Application   JPEG   GSM    MPEG
Nt            8      26     27
Speedup       4.95   2.97   9.47
Nvars         350    691    2290

Table 6.6: Number of variables in the ILP model for voltage selection
from overlapping the execution of loop iterations, and the approximation is reasonably
accurate. Furthermore, this approximation is derived under the assumption that the
tasks from the selected iteration are executed in an uninterrupted sequence. However, in
the actual execution of the application, some tasks might be executed much earlier than
the rest of the tasks from the same iteration, due to the out-of-order execution capability
of the MLCA, thus significantly increasing the total number of tasks encompassed by the
selected interval.
For each benchmark application, the number of variables encountered in practice is close
to the value computed according to Formula (6.1). The exact number depends on the
depth of the out-of-order execution capability of the MLCA. For the JPEG encoder, GSM
encoder, and MPEG decoder, Formula (6.1) yields values listed in Table 6.6. The number
of tasks in the target loop Nt and the parallel speedup used to compute the number of
integer variables for each application are also listed in the table. Comparison with the
data from Table 6.4 shows that our technique involves a number of variables smaller by
an order of magnitude for each application. For MPEG, the number of variables is too
large for the problem to be solvable using LP SOLVE. Since solving an ILP problem is
an NP-complete problem whose computation time generally grows exponentially with
the number of integer variables, our technique incurs a much smaller computational
load, even though it requires multiple executions in order to find a suitable value of the
parameter r.
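As a sanity check, Formula (6.1) can be evaluated against Table 6.6. The sketch below assumes L = 8 voltage levels, which is consistent with the reported variable counts to within rounding; the function name is ours:

```python
def approx_nvars(speedup, n_tasks, levels, n_proc):
    """Approximate number of integer variables per Formula (6.1):
    s * N * (L + 1) - NP, rounded to the nearest integer."""
    return round(speedup * n_tasks * (levels + 1) - n_proc)

# JPEG: s = 4.95, N = 8, 6 processors; GSM: s = 2.97, N = 26, 4 processors.
print(approx_nvars(4.95, 8, 8, 6))   # -> 350, matching Table 6.6
print(approx_nvars(2.97, 26, 8, 4))  # -> 691, matching Table 6.6
```

The MPEG count depends on its processor configuration, which is not restated here, so it is left out of the check.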
Chapter 7
Related Work
In this chapter, we survey the related work. Methods for DVS-based processor power
optimization have been an active area of research since DVS was first proposed [34]. This
problem has been approached from several different perspectives by researchers in the
real-time systems, operating systems, and compiler systems communities.
Section 7.1 surveys the DVS-related research in the context of real-time systems.
Section 7.2 presents an overview of the DVS-related research in the context of operating
systems and optimizing compilers. Section 7.3 relates various aspects of the system model
used in our work to the models used by the cited authors.
7.1 Real-Time Systems
Jha [21] presents a comprehensive survey of the early research work in the area of DVS
techniques for real-time systems, which was mostly aimed at single-processor systems
executing sets of independent tasks. Subsequent research has addressed the more gen-
eral problems of DVS techniques for a multiprocessor real-time system that periodically
executes a set of dependent tasks. Two basic approaches to this problem have emerged.
The first approach [29, 47] is aimed at exploiting the slack that arises at run-time when
the execution time of certain tasks happens to be shorter than the worst case for which
the system is designed. The second approach [3, 13, 39, 46] assumes that the execution
time of each task is known at design-time and attempts to exploit the slack in the task
graph using static algorithms for task scheduling and voltage selection. Some authors
attempt to combine these two approaches [37].
Most of the heuristic algorithms for static task scheduling and voltage selection either
use guided random search techniques [39] or combine ILP formulations of the voltage
selection problem with task scheduling algorithms proposed in previous literature [46].
Novel heuristic algorithms based on the properties of the task graphs have also been
proposed [13, 37]. Andrei et al. [3] ignore the issue of task scheduling and focus on the
problem of finding the optimal voltage selection for a given task schedule.
The majority of the cited authors approach the problem by selecting a task scheduling
algorithm and then focusing on the problem of finding the optimal voltage selection for a
given task schedule. Our approach takes a different path, first formulating an algorithm
for voltage selection, and then deriving an appropriate task scheduling scheme based on
the notions defined in the context of our voltage selection algorithm.
The DVS algorithms for real-time systems executing periodic task graphs are some-
what similar to our work, since the repeated execution of a set of dependent tasks with a
fixed deadline can be compared with the repeated execution of the set of task functions
from the target loop in an MLCA application. However, the crucial difference is that the
parallelism in the former case is limited to a single instance of the periodic task graph,
which is assumed to be small enough to be analyzed using computationally demanding
means such as ILP models or genetic algorithms, as explained in Section 1.1. Attempts
to partition the run-time task graph of an MLCA application by selecting an interval
encompassing a single iteration of the target loop lead to unsatisfactory outcomes both
in the quality of results and in the computational complexity, as presented in Section 6.6.
Srinivasan and Chatha [40] propose a combined ILP formulation for optimal task
scheduling and voltage selection in periodic task graphs executed on multiple DVS-
enabled processors. Their approach allows for pipelined scheduling of individual periods
and cross-iteration dependences, similar to the execution of a target loop in an MLCA
application. However, this approach would result in an excessively large number of vari-
ables for our benchmark applications. The authors report short solving times in their
experiments [40], but they do not report the size of the task graphs used. Furthermore,
they assume a two-processor architecture, while a larger number of processors would
significantly increase the number of variables in the model.
Several authors have proposed more general approaches to the power optimizations
for real-time systems. Varatkar and Marculescu [43] and Andrei et al. [2] propose DVS
techniques that account for the energy overheads of inter-task communication. Since
the inter-task communication in an MLCA system with shared memory is limited to the
access to the universal register file, we believe that the time and energy overheads of
this communication are negligible. Wu et al. [45] introduce a DVS algorithm for real-
time systems executing conditional task graphs, which capture both data and control
dependences in a set of tasks. Unfortunately, it is not possible to apply their approach in
order to generalize our work to loops that contain unpredictable control-flow statements.
7.2 Operating Systems and Optimizing Compilers
In operating systems featuring preemptive multitasking, DVS can be used to reduce
the power consumption during time intervals when the average processor workload is low.
These intervals can be identified by monitoring the processor activity at run-time and
adjusting the voltage level based on the recent average processor utilization. Another
approach is to enhance the operating system scheduler to predict the processor workload
based on the currently scheduled jobs. Lorch and Smith [28] present an overview of DVS
techniques in the context of multitasking operating systems.
In the context of optimizing compilers for general-purpose applications, there have
been attempts at compile-time identification of program regions characterized by an
excessive number of processor stall cycles, where frequency can be reduced using DVS
without significant performance impact. Hsu and Kremer [16] implement a profile-driven
compiler system that achieves this purpose.
Hsu [15] presents a comprehensive overview of research in the area of DVS-based power
optimizations in the context of operating systems and optimizing compilers. Unlike our
work, the research in this area is primarily aimed at single-processor, general-purpose
computer systems.
7.3 System Modeling in DVS Research
The models used to determine the relations between the supply voltage, operating fre-
quency, power consumption, and program performance vary among cited works. Most au-
thors determine these relations using analytical models. The analytical models for power-
related processor characteristics range from simple formulas for the dynamic power con-
sumption [37, 46] to more sophisticated models that take into account effects such as the
leakage power [2, 3]. Similar to our work, authors who use analytical models typically as-
sume that each task takes a fixed number of processor cycles, i.e. that the execution times
of tasks are inversely proportional to the operating frequency [3, 13, 37, 43, 46]. Other
authors avoid using analytical models and determine the necessary information by profil-
ing [16] or simulating [38] the application execution. Instead of analytical modeling, we
use the figures characteristic of the Intel XScale architecture reported by Clark et al. [7].
Most of the authors in the area of DVS-based power optimizations ignore the perfor-
mance impact of the transitions between voltage levels in their models [13, 37, 39, 43, 46,
47]. Notable exceptions are [3, 31, 38, 40]. Saputra et al. [38] and Mochocki et al. [31]
assume constant transition overheads independent of voltage levels. Andrei et al. [3]
use an analytical model for the relation between the time and energy overheads and the
voltage levels between which the processor is transitioning. Srinivasan and Chatha [40]
formulate an ILP model for voltage selection that includes the overhead of transition
between each pair of voltage levels as a separate constant. Our system model assumes
constant transition overheads.
The majority of the cited authors evaluate their proposed solutions using randomly
generated artificial task graphs [13, 29, 39, 46], or the combination of a set of artificial task
graphs and only one realistic application [2, 3, 31, 37, 45]. In contrast, we evaluate our
technique using exclusively task graphs pertaining to realistic multimedia applications.
Chapter 8
Conclusions and Future Work
In this chapter, we present the concluding remarks and directions for future work.
8.1 Conclusions
In this thesis, we present and evaluate a novel dynamic voltage scaling technique for
power optimizations of multimedia applications running on the MLCA architecture. Our
technique consists of profile-based heuristic compiler algorithms for voltage selection and
task scheduling. The algorithms take advantage of control-flow regularities in the control
programs that emerge across a wide range of MLCA multimedia applications. These
applications spend the bulk of their execution time in long-running loops that often do
not contain any control-flow instructions in their bodies, or contain only severely biased
control-flow instructions, so that the loop body can be approximated by its most frequent
execution path. Our technique specifically targets such loops, analyzing them with the
additional simplifying assumption that the execution time of each task instruction is
constant. This assumption does not hold strictly for real MLCA applications, but we
assume that it can serve as a basis for heuristic analysis of the target loop if we assign
to each task instruction its average execution time measured in the profiling run. The
positive results of the experimental evaluation confirm the validity of this approach.
Our voltage selection algorithm performs the dependence analysis of the target loop
and constructs its dependence graph. A heuristic approach based on the properties of
the loop dependence graph and the profiling information is used to deduce the properties
of the run-time task graph of the target loop and identify the tasks that form its critical
path. This heuristic approach is based on the insight that under certain assumptions,
it is possible to identify a cycle in the loop dependence graph that translates into the
critical path in the run-time task graph of the target loop. Further heuristics are used to
determine the available slack in the task graph and select a set of tasks outside the critical
path over which the slack is distributed. An integer linear programming formulation is
used to compute an efficient distribution of slack over the selected non-critical tasks. The
voltage selection algorithm involves a tunable parameter. A good value of this parameter
for a given application can be determined by repeated runs, since the algorithm is not
computationally demanding.
Based on the properties of the run-time task graph of the target loop, we formulate
a task scheduling algorithm, which is designed so as to complement the algorithm for
voltage selection. The task scheduling algorithm gives priority to the tasks from the
critical path and ensures that groups of tasks to which DVS is applied execute as unin-
terrupted sequences, so that the impact of voltage transition overheads can be amortized
over multiple tasks.
We evaluate the proposed technique on three realistic MLCA multimedia applications:
a JPEG image encoder, a GSM voice encoder, and an MPEG sound decoder. An MLCA
simulator is used to obtain the profiling data across a training set of inputs. Although the
behavior of the benchmark applications is not consistent with the assumption that the
execution times of task instructions are invariable, using the execution times averaged
across the training sets has proven to be a sufficiently good approximation. Using the
profiling information obtained from the training sets, we achieve the processor energy
savings of 9.5%, 5.5%, and 8.4%, with the execution slowdown of 0.5%, -0.2%, and 1.5%
for JPEG, GSM, and MPEG, respectively, on different, randomly selected sets of inputs.
We conduct a study aimed at finding a practical upper bound on the achievable energy
savings in the benchmark applications, based on a model that assumes perfect knowledge
of the run-time application properties. The results suggest that our approach captures
most of the potential for energy savings.
Previous work in the area of DVS optimizations for parallel systems has focused on
applications in the form of task graphs small enough to be analyzed in their entirety and on
periodic task graphs whose analysis can be reduced to the analysis of a single period. In
contrast, our technique handles arbitrarily large task graphs, generated by the execution
of loops that feature cross-iteration dependences and parallelism achieved by pipelining
loop iterations. We demonstrate that our technique achieves superior results in com-
parison with attempts to partition the run-time task graph in order to apply previously
proposed techniques for voltage selection. Our technique is also significantly less compu-
tationally demanding, despite the need for multiple executions in order to find a suitable
value of the tunable parameter.
8.2 Future Work
In the future, we hope to evaluate our technique on an extended set of benchmark MLCA
applications. The currently available MLCA applications were ported to the MLCA
manually, but research on a compiler infrastructure that will largely automate the
porting of sequential applications to the MLCA is in progress [4]. Once it is available,
we hope to test the performance of our technique on compiler-generated MLCA
applications. We also hope to generalize the technique to more complex control-flow
structures in the control programs, such as loops whose bodies contain unpredictable
control-flow, although the currently available MLCA applications do not feature such
loops. Another goal for future work is to generalize our technique to heterogeneous
MLCA systems, which impose additional constraints on the processor mapping of tasks.
Once the physical implementation of a DVS-enabled MLCA system is available in the
future, we intend to test our technique on real hardware.
We hope to enhance our technique to exploit further opportunities for power savings
that arise in applications with variable run-time characteristics, such as bursty behavior.
For such applications, run-time intervals of low activity offer considerable potential
for DVS-based power savings, since the underlying hardware must be designed for the
worst-case computational load. In contrast, our current approach targets applications
that are characterized by relatively small changes in computational load at run-time,
taking advantage of the regularities in their behavior.
Although our technique was developed in the context of the MLCA architecture, we
believe that it is also applicable in the more general context of task-level parallelism
in multimedia applications. The principal source of task-level parallelism in a typical
multimedia application is the pipelining of the computations performed by the tasks in
the main loop. Each iteration of this loop processes a single element of the input media
stream (for example, a 20ms voice sample for GSM). Such applications are parallelized
by decomposing the processing of each input element into a number of tasks with well-
defined inputs and outputs, which determine the data dependences between tasks. The
processing of individual elements can then be pipelined as far as the data dependences
allow. Regardless of the parallel execution model and the underlying architecture, our
analysis of the task graph could be applied to the multimedia applications where such
parallelism exists. We thus hope to generalize our technique to multimedia applications
running on parallel systems other than the MLCA.
Appendix A
Interval-Based Voltage Selection and
Task Scheduling
In this appendix, we present the task scheduling algorithms and the ILP model for voltage
selection used in the experiments described in Section 6.6. In the DVS techniques used
in these experiments, task scheduling precedes the voltage selection. Once the task
graph has been scheduled, a time interval is selected within the task schedule and the
optimal voltage selection is computed for the tasks executed within that interval. In
the experiments described in Section 6.6.1, aimed at finding the upper bound for energy
savings, we use the task graph of the application constructed from the simulator trace of
the application execution, thus assuming perfect knowledge of the run-time application
characteristics. In the experiments described in Section 6.6.2, in which we attempt an
approach aimed at partitioning the task graph, we use an idealized task graph, in
which the execution time of each task is equal to the average execution time of the
corresponding task instruction determined by profiling.
A.1 Task Scheduling Algorithms
We formulate five task scheduling algorithms as various combinations of three task or-
dering schemes and two processor mapping schemes. We use the following task ordering
schemes:
• FIFO ordering: ordering according to the sequential execution order, which is the
default for the MLCA.
• Earliest start time (EST): the task with the smallest earliest start time has the highest
priority. The earliest start time of a task is computed as the length of the longest path
from the entry node to the task in the run-time task graph.
• Longest critical path (LCP): the task with the longest critical path has the highest
priority. The critical path of a task is computed as the length of the longest path from
the task to the exit node in the run-time task graph.
The latter two task ordering schemes cannot be used for task scheduling on the MLCA
in practice, since they depend on information about the run-time task graph that cannot
be known by the MLCA task dispatcher. However, since the experiments described in
Section 6.6.1 are performed in order to find an upper bound on the achievable energy
savings with the assumption of perfect knowledge of the run-time application behavior,
it is possible to use them. These two strategies for assigning priorities to tasks have been
traditionally used in a variety of task scheduling algorithms [25].
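Both priority schemes reduce to longest-path computations over the run-time task graph, which can be sketched as follows (a minimal sketch; the adjacency representation is hypothetical, and the task list is assumed to be in topological order):

```python
def est_and_lcp(tasks, succs, exec_time):
    """Earliest start time (longest path from the entry to each task)
    and critical-path length (longest path from each task to the exit)
    for a DAG whose tasks are listed in topological order."""
    est = {t: 0 for t in tasks}
    for t in tasks:                       # forward pass
        for s in succs.get(t, []):
            est[s] = max(est[s], est[t] + exec_time[t])
    lcp = dict(exec_time)                 # backward pass
    for t in reversed(tasks):
        for s in succs.get(t, []):
            lcp[t] = max(lcp[t], exec_time[t] + lcp[s])
    return est, lcp
```

EST ordering then dequeues the task with the smallest est value; LCP ordering dequeues the task with the largest lcp value.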
The two employed processor mapping schemes are:
• Round robin (RR): circular processor selection, which is the default for the MLCA.
• Best-fit (BF) selection proposed by Zhang et al. [46]: among the available pro-
cessors, selects the one that has finished the execution of the previous task most
recently.
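Best-fit selection can be sketched as below (a sketch under the definition above; `free_at`, mapping each processor to the finish time of its previous task, is a hypothetical representation):

```python
def best_fit(free_at, now):
    """Among processors free at time `now`, pick the one that finished
    its previous task most recently, per Zhang et al.'s best-fit rule;
    round robin would instead cycle through processors in fixed order."""
    free = [p for p, t in free_at.items() if t <= now]
    return max(free, key=lambda p: free_at[p]) if free else None
```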
Using the listed task ordering and processor selection schemes, we define five task
scheduling algorithms: FIFO-RR, LCP-RR, EST-RR, LCP-BF, and EST-BF. These
task scheduling algorithms are used in the experiments whose results are presented in
Section 6.6.1.
A.2 ILP Model
In this section, we present an ILP model for voltage selection in a time interval of appli-
cation execution. The model is derived under the assumption that tasks are scheduled
according to a well-defined task ordering and processor mapping algorithm, which re-
spects the dependences between tasks, and a time interval is selected from the generated
task schedule. The computed voltage selection is optimal under the assumption that a
single voltage level is assigned to each task.
A.2.1 Variables
Let Nt be the total number of tasks in the interval, and L the number of voltage levels
supported by the processors, enumerated so that L is the highest voltage level.
For each task Ti, we introduce a continuous variable Di, which denotes the start
time of the task relative to the start of the interval. Furthermore, for each task Ti, we
introduce L binary variables αi1, . . . , αiL, which encode the voltage level assigned to the
task. The value of αij is one if and only if task Ti is executed at the level j.
For each pair of tasks (Ti, Tk) such that task Ti is executed immediately before task
Tk on the same processor P , we introduce a binary variable cik, which encodes whether
a transition between voltage levels occurs on the processor P between the execution of
tasks Ti and Tk. The value of the variable cik is one if the transition takes place, and
zero otherwise.
The execution time of each task depends on the voltage level at which the task is run. Let
tij denote the execution time of task Ti when run at the voltage level j. The execution
time of task Ti can be represented in the model as:
exec time(Ti) = αi1ti1 + αi2ti2 + · · ·+ αiLtiL. (A.1)
The finishing time of task Ti is equal to the sum of its start time and execution time
and can thus be represented in the model as:
finish time(Ti) = Di + αi1ti1 + αi2ti2 + · · ·+ αiLtiL. (A.2)
A.2.2 Constraints
Each task executes at a specific voltage level. Therefore, for each task Ti, the value
of exactly one of the binary variables αi1, . . . , αiL must be one. Hence the following
constraint for each task Ti:
αi1 + · · ·+ αiL = 1. (A.3)
The value of each variable cik will be one if the tasks Ti and Tk execute at different
voltage levels, and zero otherwise. Since the variables cik are binary, this rule is equivalent
to the following constraint on each variable cik:
cik ≥ max_{j=1,...,L} (αij − αkj). (A.4)
Although these constraints are non-linear, each of them can be represented by a set of L
linear constraints:
cik ≥ αij − αkj, (A.5)
for each j = 1, . . . , L.
If task Ti is scheduled immediately before task Tk on the same processor, task Tk may
start only after task Ti has finished. If a voltage level transition takes place between Ti
and Tk, the start of Tk is delayed by the number of processor cycles equal to the tran-
sition overhead. Therefore, for each pair of tasks (Ti, Tk) such that task Tk is scheduled
immediately after task Ti on the same processor, we formulate the following constraint
based on the expression (A.2):
Dk − (Di + αi1ti1 + · · ·+ αiLtiL + ciktTR) ≥ 0, (A.6)
where tTR is the transition overhead.
If task Tk reads one or more inputs produced by task Ti, task Tk may start only after
its last input produced by Ti is written, because of the data dependence. Let βik be
the ratio between the time at which Ti writes its last output read by Tk and the total
execution time of Ti. Since we assume that all computations are uniformly slowed down
by frequency scaling, βik is a constant for each such pair of tasks (Ti, Tk). Therefore,
for each such pair of tasks (Ti, Tk), we formulate the following constraint based on the
expression (A.2):
Dk − Di − βik(αi1ti1 + · · ·+ αiLtiL) ≥ 0. (A.7)
For each processor P , we fix the start time of the first task scheduled onto P to the
beginning of the interval. For each such task Ti, we introduce a constraint of the following
form:
Di = 0. (A.8)
Finally, let T be the total length of the interval. All tasks must finish before the end
of the interval, which leads to the following constraint for each task Ti:
Di + αi1ti1 + · · · + αiLtiL ≤ T. (A.9)
A.2.3 Objective Function
The purpose of the model is to find the voltage selection that minimizes the total energy
consumption in the selected interval. The objective function is formulated in a way
similar to the objective function in the ILP model for slack distribution described in
Section 5.5.2.
We use the symbol Eij to denote the energy consumed by the task Ti executed at
the voltage level j, and the symbol ETR to denote the energy consumed by a single
transition. According to the definitions of variables αij and cik, we can represent the
energy consumption during the execution of tasks and transition overheads as:
Σ_{i=1}^{Nt} Σ_{j=1}^{L} αij Eij + Σ_{(i,k)} cik ETR,
where the second sum is over all pairs (i, k) such that task Tk is scheduled immediately
after Ti on the same processor.
We can account for the idle power by noting that each increase in task execution time
and each transition overhead reduces the total processor time spent in the idle mode.
Therefore, if we define the constants ηij = Eij − Pidle tij and ε = ETR − Pidle tTR, we can
formulate the following objective function:
∑_{i=1}^{Nt} ∑_{j=1}^{L} αij ηij + ∑_{(i,k)} cik ε. (A.10)
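For a fixed voltage assignment, the value of objective (A.10) is straightforward to evaluate. The following sketch uses hypothetical names and illustrative data; it is not part of the thesis implementation.

```python
# Hedged sketch: evaluate objective (A.10) for a fixed voltage selection.
# eta maps (task, level) to eta_ij = E_ij - P_idle * t_ij; epsilon is the
# per-transition constant ETR - P_idle * t_TR. All data are illustrative.

def objective(level, order, eta, epsilon):
    # First sum: the alpha_ij variables select exactly one level per task,
    # so each task contributes eta at its chosen level.
    total = sum(eta[(task, level[task])] for task in level)
    # Second sum: c_ik = 1 for each adjacent pair with differing levels.
    for tasks in order.values():
        total += sum(epsilon for i, k in zip(tasks, tasks[1:])
                     if level[i] != level[k])
    return total
```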
The value of this objective function differs from the total energy consumption in the
selected interval by a constant term, which does not influence the location of its minimum.
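The constant term can be made explicit. Assuming each processor is either executing a task, performing a transition, or idle throughout the interval of length T, the total energy in the interval is:

```latex
\begin{align*}
E &= \sum_{i=1}^{N_t}\sum_{j=1}^{L}\alpha_{ij}E_{ij}
   + \sum_{(i,k)} c_{ik}E_{TR}
   + P_{idle}\Bigl(N_P T
   - \sum_{i=1}^{N_t}\sum_{j=1}^{L}\alpha_{ij}t_{ij}
   - \sum_{(i,k)} c_{ik}t_{TR}\Bigr) \\
  &= \sum_{i=1}^{N_t}\sum_{j=1}^{L}\alpha_{ij}\eta_{ij}
   + \sum_{(i,k)} c_{ik}\varepsilon
   + P_{idle}N_P T,
\end{align*}
```

so the objective (A.10) differs from the total energy only by the constant Pidle NP T, which is independent of the voltage selection.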
Minimizing this objective function with constraints (A.3)–(A.9) yields the optimal
voltage selection for the given interval of the task schedule under the stated assumptions.
A.2.4 Estimate of the Number of Variables
For each task in the interval, we introduce L binary variables that encode its voltage
level. Furthermore, for each pair of adjacent tasks scheduled on the same processor, we
introduce a binary variable that encodes the voltage transition between the execution of
these tasks. The number of such pairs is Nt − NP, where NP is the number of processors: every task in the interval is followed by another task on the same processor, except the last task scheduled on each processor.
Therefore, the total number of integer variables in the ILP model is:
Nt · L + Nt − NP = Nt · (L + 1) − NP. (A.11)
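Expression (A.11) can be sanity-checked with a one-line helper (the function name and arguments are hypothetical, introduced only for illustration):

```python
def num_binary_variables(n_tasks, n_levels, n_procs):
    # Nt * L level-selection variables plus (Nt - NP) transition
    # variables, i.e., Nt * (L + 1) - NP in total, as in (A.11).
    return n_tasks * (n_levels + 1) - n_procs
```

For instance, an interval with 10 tasks, 4 voltage levels, and 3 processors requires 10 · 5 − 3 = 47 binary variables.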