[IEEE 2009 IEEE International Symposium on Circuits and Systems - ISCAS 2009 - Taipei, Taiwan...

Exploring Compiler Optimizations for Enhancing Power Gating

Soumyaroop Roy, Nagarajan Ranganathan, and Srinivas Katkoori

Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620

{sroy, ranganat, katkoori}@cse.usf.edu

Abstract—Power gating is a circuit level technique for reducing standby leakage in a circuit block by cutting off paths in it between the supply and the ground. A processor architecture that supports power gating of its resources may provide instructions that activate and deactivate those resources as part of the instruction set architecture. Adequate compiler support is then required so that the power gating instructions can be inserted into the code to deactivate the resources that remain idle for long periods of time during program execution. However, the resource usage in a program depends on the code generated by the compiler. Thus, the code transformations performed by the compiler have an influence on the power gating opportunities of the processor resources. In this work, we explore target independent compiler optimizations that modify the functional unit usage in the loops of a procedure to enhance the opportunities to deactivate functional units in an embedded processor architecture. The optimizations performed on the code are sparse conditional constant propagation, lazy code motion, weak strength reduction, and operator strength reduction. Insertion of power gating instructions is performed by inspecting the idleness of the units in the regions enclosed within loops. We model the processor architecture with power gating support around an ARM core and use the SUIF framework for compiler support. Finally, we use the SimpleScalar-ARM distribution to perform power and performance evaluation with a set of benchmarks from the MiBench and MediaBench suites. Experimental results indicate that the integer multiplier in the processor core can be power gated for up to 99% of its idle cycles for integer benchmarks, and up to 93% for floating point benchmarks, when all the optimizations are performed. Moreover, the leakage energy in the functional units for the code with all the optimizations performed can be up to 51% lower for integer benchmarks, and up to 21% lower for floating point benchmarks, than that for the unoptimized code.

I. INTRODUCTION AND MOTIVATION

The major components of power consumption in VLSI circuits are dynamic power, short circuit power, and leakage power. In recent years, due to scaling of threshold voltage and gate-oxide thickness, the contribution of leakage currents to the total power of a circuit has increased significantly [1]. Therefore, reducing leakage power has become a vital aspect of the design of low-power VLSI circuits. One of the common techniques used to reduce standby leakage current in a circuit is power gating [2]. In this technique, the path between the supply and ground is cut off by inserting a sleep transistor between the supply and the circuit or between the circuit and the ground. Since the deactivation (sleep) and activation (wakeup) of a circuit block incur a dynamic energy overhead, ensuring that the block remains idle for a sufficiently long time is critical in achieving overall energy savings. Power gating is also applied at the architecture level to reduce leakage in the components of a microprocessor during their idle periods. The components are equipped with sleep transistors at the circuit level, and the controls for the sleep transistors may be provided as special instructions, called power gating instructions. The compiler is then extended with adequate support to analyze the program behavior and insert power gating instructions into the code where those components are idle, so that energy savings can be obtained during program execution.

Several works have investigated, at the compiler level, the problem of power gating functional units in a microprocessor to reduce leakage during their idle periods. Rele et al. propose compiler level support in [3] for power gating in superscalar processors. In [4], Zhang et al. investigate power gating and input vector control in VLIW architectures. You et al. apply dataflow analysis techniques in [5] to find regions in the program where functional units are idle. In [6], [7], Roy et al. present a compiler level framework with architectural support in which the units are first designed based on the specifications of those of the ARM processor and then characterized for latency and power. The characterization of the units is an important task because, due to the dynamic energy overhead involved in activating and deactivating a circuit, the energy savings through power gating depend not only on the period for which a unit remains deactivated but also on the number of times that unit is activated. Furthermore, the architecture proposed in [6] eliminates the need for instructions to activate the functional units by automatically waking them up at the decode stage of the processor pipeline.

The main role of the compiler support in all the works discussed above has been twofold. First, it identifies regions in the program during which functional units are idle. Then, power gating instructions are inserted at the boundaries of such regions to deactivate and activate the units. However, the usage of the functional units in a program region depends not only on the source code description of the program but also on the code generated by the compiler. Consequently, the power gating opportunities of such units also depend on the code transformations performed by the compiler. This aspect has not been addressed in any of the prior works, and it forms the main motivation for this work. In this paper, we apply a set of compiler optimizations to the source program that perform code transformations, thereby modifying the functional unit usage in the generated code. These code transformations are performed to enhance the opportunities for power gating of the functional units.

The rest of the paper is organized as follows. Section II describes the framework for power gating with compiler optimization techniques. The experimental results are discussed in Section III, followed by conclusions in Section IV.

II. POWER GATING AT COMPILER LEVEL WITH CODE OPTIMIZATIONS

In this work, the SUIF research compiler [8] is used for compiler support. Figure 1 describes the compiler level approach for power gating functional units. The source files of an application are translated into SUIF intermediate representation (IR) by the SUIF frontend. The SUIF IR of the code is then converted into the SUIF Virtual Machine (SUIFVM) representation, which is the IR of the MachineSUIF framework [9]. All the code optimizations are performed on the SUIFVM representation of the program. The ARM code generation library [10] lowers the SUIFVM representation into equivalent ARM assembly code. Power gating instructions are then inserted into the assembly code to deactivate the units at the entry of the regions where they are determined to be idle. This code is then assembled and linked to generate an ARM executable whose runtime execution is simulated on a cycle accurate simulator for performance and energy evaluation. The architecture support details are used by the assembler and the processor simulator to generate object code and evaluate performance and energy statistics, respectively. Section II-A describes the details of the architecture and assembler support for the power gating instructions. Section II-B describes the compiler optimizations developed and used in this work and, finally, Section II-C describes the power gating technique used to insert the power gating instructions into the assembly code.

[Figure 1 stages: source files; SUIF IR translation; SUIFVM translation; code optimizations; ARM code generation; insertion of power gating instructions; assembly, linking, and simulation. Side labels: MachineSUIF framework; architecture support.]

Fig. 1. Framework for power gating in an optimizing compiler

978-1-4244-3828-0/09/$25.00 ©2009 IEEE

A. Architecture and Assembler Support for Power Gating

We use the architectural support for power gating described in [6]. The instruction set architecture (ISA) provides an explicit sleep instruction, whose argument is the list of functional units that need to be deactivated. The ISA, however, does not provide any explicit wakeup instructions. When an instruction is decoded, the units needed by the instruction are activated by the decode stage of the pipeline. The library of functional units with power gating support is characterized for a 1-cycle wakeup latency at a clock period of 10 ns (100 MHz clock). The only modification made in this work is that the barrel shifter is not equipped with any power gating support because of the frequent usage of shift instructions in the code. Shift instructions are generated during strength reduction of multiplication operations with constant operands [11] and during generation of load and store instructions with complex addressing modes [10].

A sleep control register (SCR), which regulates the sleep controls for the functional units, is added to the decode logic, as shown in Figure 2. A ‘0’ in the sleep control register indicates that the functional unit driven by that register bit is active (awake mode), while a ‘1’ indicates that it is inactive (sleep mode). In Figure 2, the contents of the SCR indicate that the integer multiplier and the FP division and square root unit are active, while the FP adder and the FP multiplier are inactive. When a sleep instruction is decoded, the SCR is modified to deactivate the functional units passed to the sleep instruction as arguments. When an instruction requiring a certain functional unit is decoded, the SCR is modified to activate that functional unit so that it can be used by the time the instruction enters the execution stage.

[Figure 2 labels: ARM decode logic; SCR with example contents 1, 0, 1, 0 driving FP-Adder, FP-Dsqt, FP-Mult, and Int-Mult, respectively.]

Fig. 2. Architecture support for power gating

Assembly format of sleep instruction:
slp <Arg> /* 4 bit argument */

Machine code format of sleep instruction (4-bit fields, bit boundaries 31 27 23 19 15 11 7 3 0):
0 | 7 | F | X | X | Arg | F | 0

Arg bit 0 : Int-Mult
Arg bit 1 : FP-Add
Arg bit 2 : FP-Mult
Arg bit 3 : FP-DSQT

Fig. 3. Assembly and machine code formats of the sleep instruction

The assembler support for translating the sleep instructions into machine code is added to the GNU ARM assembler, which is part of the binutils package. The format of the machine code for the sleep instruction is chosen from the domain of exceptional opcodes described in the ARM reference manual and is shown in Figure 3. The assembly opcode for the sleep instruction is slp. The functional units that need to be deactivated are encoded into a 4-bit integer, which is passed to the slp instruction as an argument. Bits 0-3 of the argument deactivate the integer multiplier, floating point adder, floating point multiplier, and floating point division and square root unit, respectively. The machine code for the slp instruction has bits 7-0 as ‘F0’ and bits 31-20 as ‘07F’. Bits 11-8 are used for encoding the 4-bit argument passed to the sleep instruction.
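As an illustration of the encoding described above, the following minimal sketch packs a set of functional units into a slp machine word. It assumes that the 4-bit argument occupies bits 11-8, which is the only placement consistent with a 4-bit field sitting directly above the ‘F0’ byte, and it sets the don't-care bits 19-12 to zero. The unit names and the function names are hypothetical and are not taken from the assembler implementation.

```python
# Bit position of each functional unit inside the 4-bit slp argument.
# The string keys are illustrative names, not identifiers from the paper.
UNIT_BITS = {"int_mult": 0, "fp_add": 1, "fp_mult": 2, "fp_dsqt": 3}


def slp_argument(units):
    """Build the 4-bit argument that selects the units to deactivate."""
    arg = 0
    for unit in units:
        arg |= 1 << UNIT_BITS[unit]
    return arg


def encode_slp(units):
    """Return a 32-bit machine word for `slp <Arg>`.

    Bits 31-20 hold 0x07F and bits 7-0 hold 0xF0, as stated in the text.
    Assumption: the argument sits in bits 11-8 and bits 19-12 are zero.
    """
    return (0x07F << 20) | (slp_argument(units) << 8) | 0xF0


def decode_slp_argument(word):
    """Extract the 4-bit argument from a slp machine word (same assumption)."""
    return (word >> 8) & 0xF


if __name__ == "__main__":
    word = encode_slp(["fp_add", "fp_mult"])
    print(hex(word), bin(decode_slp_argument(word)))
```

Keeping the unit-to-bit mapping in a single table means the same table can drive both the assembler side and the decode side of a model like this one.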

The decode logic in the SimpleScalar-ARM distribution is also extended to include the definition of the slp instruction. After the slp instruction is decoded by the decode logic, the 4-bit argument is extracted and a logical OR operation is performed with the contents of the SCR before the result is stored back in the SCR. When any other instruction is decoded, the SCR entry corresponding to the functional unit needed by the instruction is overwritten with a ‘0’.
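The SCR update rules described above can be sketched as a small behavioral model. This is only an illustrative sketch and not the SimpleScalar-ARM code: the class and method names are hypothetical, and the initial state (all units awake) is an assumption.

```python
class SleepControlRegister:
    """Toy model of the 4-bit SCR: a 1 bit means asleep, a 0 bit means awake."""

    def __init__(self):
        # Assumption: every unit starts in the awake state.
        self.bits = 0

    def decode_slp(self, arg):
        """A slp instruction ORs its 4-bit argument into the SCR."""
        self.bits |= arg & 0xF

    def decode_use(self, unit_bit):
        """Decoding an instruction that needs a unit clears that unit's bit."""
        self.bits &= ~(1 << unit_bit) & 0xF

    def is_asleep(self, unit_bit):
        return bool((self.bits >> unit_bit) & 1)


if __name__ == "__main__":
    scr = SleepControlRegister()
    scr.decode_slp(0b0110)   # put FP-Add (bit 1) and FP-Mult (bit 2) to sleep
    scr.decode_use(1)        # an FP add is decoded, so FP-Add wakes up
    print(format(scr.bits, "04b"))
```

Because the slp update is an OR, a sleep instruction can never wake a unit; wakeup happens only through the decode of an instruction that needs the unit.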

B. Compiler Optimizations

We select four compiler optimizations that either modify arithmetic instructions in the code or move them across basic blocks, thereby changing the functional unit usage of the basic blocks. All the optimizations described below are implemented as MachineSUIF passes that perform code transformations on the SUIFVM representation of the source program.


1) Sparse Conditional Constant Propagation: Sparse conditional constant propagation (SCCP) [12] is a global constant propagation technique in which propagation of constant temporaries is performed across basic blocks in the presence of conditional branches. This optimization is performed on the static single assignment (SSA) representation of the control flow graph (CFG) of the code. The SSA form of the CFG is a representation in which a target temporary can be the destination of only one instruction [13]. A constant folding library is also implemented at the SUIFVM level that computes the target temporary of an instruction whose operands are identified as constants during this pass.
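The role of constant folding can be illustrated with a minimal sketch on straight-line, SSA-like code. This is not SCCP itself, since it handles neither branches nor a control flow graph, and the tuple-based instruction format and opcode names (ldc, add, sub, mul) are hypothetical stand-ins for SUIFVM instructions.

```python
import operator

# Opcodes that this sketch knows how to fold.
FOLDABLE = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def fold_constants(instructions):
    """Propagate and fold constants through straight-line SSA-like code.

    Each instruction is a tuple (dest, op, a, b). Operands are either
    integer constants or names of temporaries; "ldc" loads constant a.
    """
    known = {}   # temporaries whose value is a known constant
    out = []
    for dest, op, a, b in instructions:
        if op == "ldc":
            known[dest] = a
            out.append((dest, op, a, b))
            continue
        # Substitute operands that are known constants.
        a = known.get(a, a)
        b = known.get(b, b)
        if op in FOLDABLE and isinstance(a, int) and isinstance(b, int):
            value = FOLDABLE[op](a, b)
            known[dest] = value
            out.append((dest, "ldc", value, None))
        else:
            out.append((dest, op, a, b))
    return out


if __name__ == "__main__":
    program = [
        ("t1", "ldc", 4, None),
        ("t2", "mul", "t1", 8),    # both operands constant: folded away
        ("t3", "add", "t2", "x"),  # x is unknown: kept, with t2 replaced
    ]
    for instruction in fold_constants(program):
        print(instruction)
```

In this example the multiplication disappears from the code entirely, which is the kind of change in functional unit usage that matters for power gating the integer multiplier.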

2) Lazy Code Motion: Code motion optimizations perform dataflow analyses to identify the instructions that compute the same value and move such instructions to locations in the code where they are executed less frequently. Lazy code motion (LCM) [14] is a global code motion technique that eliminates redundant instructions in a procedure of a program. A descendant of partial redundancy elimination, LCM performs common subexpression elimination along with loop invariant code motion. For this work, a publicly available LCM implementation [15] in MachineSUIF is used.

3) Weak Strength Reduction: Strength reduction refers to techniques that replace expensive operations with inexpensive ones. Weak strength reduction (WSR) refers to replacing an expression like x × 2 with the expression x + x. In this example, one of the operands (the constant 2) in the multiplication expression has been identified as a constant by the compiler. As part of this work, the technique described in [11] is implemented for replacing integer multiplication operations with a series of addition, subtraction, and shift operations.
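The idea behind weak strength reduction can be sketched with a plain binary decomposition of the constant, which uses one shift and one addition per set bit. This is deliberately simpler than the technique of [11], which also exploits subtraction to shorten the sequence; the function names below are hypothetical.

```python
def multiply_by_constant(x, c):
    """Compute x * c for a non-negative integer constant c with shifts and adds."""
    result = 0
    shift = 0
    while c:
        if c & 1:
            # This bit of the constant contributes x * 2**shift.
            result += x << shift
        c >>= 1
        shift += 1
    return result


def times_15(x):
    """x * 15 written as one shift and one subtraction: (x << 4) - x."""
    return (x << 4) - x


if __name__ == "__main__":
    print(multiply_by_constant(7, 10), times_15(7))
```

The second function shows why subtraction is useful: the binary decomposition of 15 needs four shifted terms, whereas 16x - x needs one shift and one subtraction. Neither form uses the integer multiplier.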

4) Operator Strength Reduction: A more powerful form of strength reduction replaces repeated multiplications inside a loop with repeated additions or subtractions. This is performed by identifying induction variables, which are temporaries that get incremented or decremented by a constant value during the execution of a loop, and replacing multiplication operations involving such variables with equivalent addition and subtraction operations. For this work, we implement operator strength reduction (OSR) [16], which is performed on the SSA form of the code. We also perform linear function test replacement (LFTR), which replaces the uses of the original induction variables in comparison operations (branches) to render series of computations useless. These useless computations are subsequently removed by dead code elimination.
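The effect of OSR and LFTR on a loop can be illustrated at the source level with the following sketch. It shows only the before and after forms of one address computation, not the SSA-based algorithm of [16]; the function names are hypothetical, and the LFTR variant assumes a positive stride.

```python
def addresses_with_multiply(base, stride, n):
    """Original loop: one multiplication per iteration (i * stride)."""
    return [base + i * stride for i in range(n)]


def addresses_strength_reduced(base, stride, n):
    """After OSR: a new induction variable replaces base + i * stride."""
    out = []
    addr = base
    for _ in range(n):
        out.append(addr)
        addr += stride   # repeated addition instead of multiplication
    return out


def addresses_after_lftr(base, stride, n):
    """After LFTR: the loop test uses addr, so the counter i is dead code.

    Assumption: stride > 0, so the comparison direction is unchanged.
    """
    out = []
    addr = base
    limit = base + n * stride   # computed once, outside the loop
    while addr < limit:
        out.append(addr)
        addr += stride
    return out


if __name__ == "__main__":
    print(addresses_with_multiply(100, 4, 5))
    print(addresses_strength_reduced(100, 4, 5))
    print(addresses_after_lftr(100, 4, 5))
```

After both transformations the loop body contains no multiplication, so the integer multiplier is idle for the whole loop and becomes a candidate for deactivation at the loop entry.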

C. Insertion of Power Gating Instructions

Since the task of power gating at the compiler level requires the details of the target architecture support, the insertion of power gating instructions is not done on the SUIFVM representation of the code. Instead, it is performed on the ARM assembly code as another optimization pass in MachineSUIF. However, since the MachineSUIF framework provides the capability to write optimization passes that load target specific details during runtime, it is possible to write an abstract optimization pass that attains a concrete structure only during runtime. This avoids writing the same target dependent optimization task for each target platform. The dead code elimination pass supplied with the MachineSUIF distribution illustrates this feature. This approach is adopted in implementing the power gating pass. The details of the functional units and the instruction set, including the format of the sleep instruction, are obtained from the ARM code generation library during runtime.

Due to the unavailability of a code instrumentation library for the ARM backend on MachineSUIF, we do not use dynamic profiling information for inserting sleep instructions into the code. Instead, we use a static technique based on the control flow information of the procedure. A loop tree [13], which is a data structure that maintains information about all the loops in a function and the basic blocks contained in those loops, is constructed. For all the functional units that are not needed in a loop, a sleep instruction deactivating those units is inserted at the entry to the loop. If the loop entry block has only one external predecessor, the sleep instruction is inserted at the end of the predecessor block. An external predecessor of a loop entry block is a predecessor block that is not part of that loop. If the entry block has more than one external predecessor, a new basic block containing the sleep instruction is inserted and set as the predecessor of the entry block, and the original external predecessors of the entry block are set as predecessors of the new block. This step is performed for the parent loop before any of its child loops to ensure that no redundant sleep instructions are inserted.
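The loop tree traversal described above can be sketched as follows. This is a simplified model under stated assumptions: the Loop class, the unit names, and plan_sleeps are hypothetical, each loop records only the units used by its own blocks, and the placement of the instruction in a predecessor block or a new block is not modeled.

```python
from dataclasses import dataclass, field

ALL_UNITS = frozenset({"int_mult", "fp_add", "fp_mult", "fp_dsqt"})


@dataclass
class Loop:
    name: str
    own_units: set                 # units used by blocks directly in this loop
    children: list = field(default_factory=list)

    def units_needed(self):
        """Units used anywhere in the loop, including nested loops."""
        needed = set(self.own_units)
        for child in self.children:
            needed |= child.units_needed()
        return needed


def plan_sleeps(loop, asleep=frozenset(), plan=None):
    """Map each loop to the units to deactivate at its entry.

    Parents are visited before children, and units already put to sleep
    by an enclosing loop are skipped, so no redundant slp is planned.
    """
    if plan is None:
        plan = {}
    idle = ALL_UNITS - loop.units_needed()
    newly_idle = idle - asleep
    if newly_idle:
        plan[loop.name] = sorted(newly_idle)
    for child in loop.children:
        plan_sleeps(child, asleep | idle, plan)
    return plan


if __name__ == "__main__":
    inner = Loop("inner", {"fp_add"})
    outer = Loop("outer", {"int_mult", "fp_add"}, [inner])
    print(plan_sleeps(outer))
```

Passing the set of already deactivated units down the tree is safe because a unit that is idle in a parent loop is, by construction, unused in every loop nested inside it.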

III. EXPERIMENTAL RESULTS

The simulations of the ARM executables with the sleep instructions are performed with the SimpleScalar-ARM toolset [17] for a set of benchmarks from the embedded benchmark suites MiBench [18] and MediaBench [19]. The benchmarks range in size from 1 source file and 174 lines of source code (Dijkstra) to more than 15 source files and 8000-9000 lines of source code (Mpeg2E and Mpeg2D). Only Dijkstra and Sha are integer benchmarks, while the rest are floating point benchmarks. For leakage energy calculations, the leakage power characterization of the library of functional units developed in [6] is used.

TABLE I
OPTIMIZATIONS PERFORMED ON THE BENCHMARKS

Legend   Description
unopt    No optimizations
sccp     SCCP
lcm      SCCP + LCM
wsr      SCCP + LCM + WSR
osr      SCCP + LCM + OSR + WSR

Fig. 4. Percentage of idle cycles for which the integer multiplier is kept deactivated.

The compiler optimizations discussed in Section II-B are performed incrementally, generating four optimization levels as enumerated in Table I. The results of power gating with the optimizations are compared to those with the unoptimized code generated by MachineSUIF. Since two of the optimizations explored in this work remove integer multiplication instructions from the code, the power gating opportunities are improved significantly for the integer multiplier. This can be seen in Figure 4, which plots the fraction of idle cycles for which the integer multiplier is kept deactivated. Except for SusanS, the opportunity for power gating this unit improves significantly in osr in all the benchmarks. SusanS performs integer multiplications on array members and, since these are stored in memory, the optimizations are not able to remove the multiplication operations. Although SCCP and LCM hardly improve the power gating period for the integer multiplier (except for SusanE), they are important prerequisites for the strength reduction optimizations to be effective. For the Sha benchmark, the integer multiplier is power gated for 99% of its idle cycles with the code that is optimized with osr. Among the floating point benchmarks, the integer multiplier for Mpeg2D is power gated for 93% of its idle cycles in osr.

Figure 5 shows the average number of cycles for which the integer multiplier remains power gated before it is activated at the pipeline decode stage upon decoding an integer multiply instruction. Comparing the results in unopt and osr, the integer multiplier is power gated for a longer period of time in the latter before it is woken up. This translates to fewer activations of the multiplier unit, thereby lowering the dynamic energy overhead in activating the multiplier. Finally, Figure 6 plots the percentage of leakage energy saved with each optimization level over that in unopt. For the integer benchmarks, the floating point units are not used in the energy calculations. The performance overhead of the additional sleep instructions for all the benchmarks, except for Mpeg2D, is lower than 0.1%. For Mpeg2D, it ranges from 0.57% to 0.69%.

Fig. 5. Average number of cycles for which the integer multiplier is kept turned off before it is woken up.

Fig. 6. Percentage of leakage energy saved with each optimization over that for the unoptimized code

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we have explored a few compiler transformations on applications for enhancing opportunities to power gate functional units in an embedded processor. The optimizations discussed in this work, particularly the strength reduction optimizations, focus on integer operations. Therefore, the opportunities for power gating the integer multiplier increase significantly when the optimizations are performed. The library of compiler optimizations and power gating developed for this work will be released publicly after we perform sufficient testing of these passes with even bigger benchmarks. In the future, compiler transformations to improve power gating for FP operations will be explored, so that power gating of FP units can be used to reduce leakage in applications that perform extensive FP arithmetic operations.

REFERENCES

[1] R.K. Krishnamurthy et al. High-performance and low-voltage challenges for sub-45nm microprocessor circuits. Intl. Conf. ASIC, pages 283–286, 2005.

[2] K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. Proc. ICECS, pages 167–173, 1998.

[3] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing Static Power Dissipation by Functional Units in Superscalar Processors. Proc. 11th Intl. Conf. on Compiler Construction, pages 261–274, 2002.

[4] W. Zhang et al. Compiler Support for Reducing Leakage Energy Consumption. DATE, pages 1146–1147, 2003.

[5] Y. You, C. Lee, and J.K. Lee. Compiler Analysis and Supports for Leakage Power Reduction on Microprocessors. ACM TODAES, pages 147–164, 2006.

[6] S. Roy, S. Katkoori, and N. Ranganathan. A Compiler Based Leakage Reduction Technique by Power-Gating Functional Units in Embedded Microprocessors. Proc. 20th Intl. Conf. VLSI Design, pages 215–220, 2007.

[7] S. Roy, N. Ranganathan, and S. Katkoori. A Framework for Power Gating Functional Units in Embedded Microprocessors. Accepted to Trans. VLSI, 2008.

[8] R. Wilson. The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler. Technical report, Stanford University, 1994.

[9] M.D. Smith and G. Holloway. An Introduction to Machine SUIF and Its Portable Libraries for Analysis and Optimization. http://www.eecs.harvard.edu/hube/software/, 2002.

[10] G. Theoduloz and D.S. Garcia. Machine SUIF Back-end for the ARM Architecture. http://lap2.epfl.ch/dev/machsuif/arm backend, 2005.

[11] P. Briggs and T.J. Harvey. Multiplication by Integer Constants. Technical report, Rice University, 1994.

[12] M.N. Wegman and F.K. Zadeck. Constant Propagation with Conditional Branches. ACM TOPLAS, pages 231–236, 1991.

[13] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.

[14] J. Knoop, O. Ruthing, and B. Steffen. Optimal Code Motion: Theory and Practice. ACM TOPLAS, pages 1117–1155, 1994.

[15] L. Rolaz. An Implementation of Lazy Code Motion for Machine SUIF. Technical report, Swiss Federal Institute of Technology, 2003.

[16] K.D. Cooper, L.T. Simpson, and C.A. Vick. Operator Strength Reduction. ACM TOPLAS, pages 603–625, 2001.

[17] D. Burger and T. Austin. The Simplescalar Tool Set, version 2.0. Technical report, TR-97-1342, University of Wisconsin-Madison, 1997.

[18] M.R. Guthaus et al. MiBench: A free, commercially representative embedded benchmark suite. IEEE 4th Annual Workshop on Workload Characterization, pages 3–14, 2001.

[19] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. IEEE/ACM MICRO, page 330, 1997.
